Monitoring System
Data visualization: Grafana
Data storage: InfluxDB/Prometheus
Data collection: Telegraf/NodeExporter
Grafana
Grafana's official site offers many ready-made dashboards for displaying the runtime status of operating systems, databases, and applications.
I chose the following dashboards:
System dashboard: https://grafana.com/grafana/dashboards/928
Database dashboard: https://grafana.com/grafana/dashboards/1177
Java application dashboard: https://grafana.com/grafana/dashboards/4701
The system and database dashboards chosen here use InfluxDB as their data source, and InfluxDB is usually fed by Telegraf.
The Java application dashboard uses Prometheus as its data source. Prometheus usually collects host metrics through NodeExporter; for Java applications, Micrometer can be used to expose the metrics.
References:
Grafana installation:
https://grafana.com/docs/grafana/latest/installation/rpm/#install-manually-with-yum
Grafana basics, including creating data sources and dashboards:
https://grafana.com/tutorials/grafana-fundamentals/#1
InfluxDB
InfluxDB Concepts
| Concept | Database | Table | Record | Retention (how long data is kept, and in how many copies) | Indexed field | Regular field | Record timestamp |
|---|---|---|---|---|---|---|---|
| InfluxDB | database | measurement | point | retention policy | tag | field | timestamp |
| MySQL | database | table | row | - | indexed column | column | - |
References:
https://docs.influxdata.com/influxdb/v1.8/concepts/key_concepts/
Sample Data
- Create the database
CREATE DATABASE NOAA_water_database
- Download and write the data
curl https://s3.amazonaws.com/noaa.water-database/NOAA_data.txt -o NOAA_data.txt
influx -import -path=NOAA_data.txt -precision=s -database=NOAA_water_database
- Test queries
> SHOW measurements
name: measurements
------------------
name
average_temperature
h2o_feet
h2o_pH
h2o_quality
h2o_temperature

> SELECT COUNT("water_level") FROM h2o_feet
name: h2o_feet
--------------
time                   count
1970-01-01T00:00:00Z   15258

> SELECT * FROM h2o_feet LIMIT 2
name: h2o_feet
--------------
time                   level description      location       water_level
2015-08-18T00:00:00Z   below 3 feet           santa_monica   2.064
2015-08-18T00:00:00Z   between 6 and 9 feet   coyote_creek   8.12
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/sample-data/
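To connect the query output above with the concepts of measurement, tag, field, and timestamp, the following Java sketch renders the first h2o_feet row as a line protocol string, the text format InfluxDB uses for writing points. It assumes, as in the NOAA sample data, that location is a tag and that water_level and level description are fields. The class LineProtocolSketch and its helper names are made up for illustration, and only string and floating-point field values are handled (integer fields would need an i suffix).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class LineProtocolSketch {

    // Tag keys, tag values and field keys must escape commas, equals signs and spaces.
    static String escapeKey(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    // Renders one point as: measurement,tag=value field=value timestamp
    static String toLine(String measurement, Map<String, String> tags,
                         Map<String, Object> fields, long timestamp) {
        // Measurement names escape commas and spaces.
        StringBuilder sb = new StringBuilder(
                measurement.replace(",", "\\,").replace(" ", "\\ "));
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            sb.append(',').append(escapeKey(tag.getKey()))
              .append('=').append(escapeKey(tag.getValue()));
        }
        StringJoiner fieldSet = new StringJoiner(",");
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            Object value = field.getValue();
            // String field values are double-quoted; numbers are written as-is.
            String rendered = (value instanceof String)
                    ? "\"" + ((String) value).replace("\\", "\\\\").replace("\"", "\\\"") + "\""
                    : String.valueOf(value);
            fieldSet.add(escapeKey(field.getKey()) + "=" + rendered);
        }
        return sb.append(' ').append(fieldSet).append(' ').append(timestamp).toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("location", "santa_monica");            // tag: indexed
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("water_level", 2.064);                // field: not indexed
        fields.put("level description", "below 3 feet"); // field: not indexed
        // 1439856000 is 2015-08-18T00:00:00Z in seconds (matches -precision=s).
        System.out.println(toLine("h2o_feet", tags, fields, 1439856000L));
    }
}
```

Running main prints `h2o_feet,location=santa_monica water_level=2.064,level\ description="below 3 feet" 1439856000`: the measurement and tag set come first, then the field set, then the timestamp.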
Explore Schema
SHOW DATABASES
SHOW MEASUREMENTS
SHOW TAG KEYS
SHOW FIELD KEYS
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-schema/
Explore Data
- The SELECT statement
SELECT <field_key>[,<field_key>,<tag_key>] FROM <measurement_name>[,<measurement_name>]
- The WHERE clause
SELECT_clause FROM_clause WHERE <conditional_expression> [(AND|OR) <conditional_expression> [...]]
- The GROUP BY clause
SELECT_clause FROM_clause [WHERE_clause] GROUP BY [* | <tag_key>[,<tag_key>]]
- ORDER BY time DESC
- The LIMIT and SLIMIT clauses
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/
Functions
Aggregations
Selectors
Transformations
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/
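The three categories differ mainly in what they return. The Java sketch below imitates one function from each category locally, without talking to InfluxDB: mean stands in for MEAN(), max for MAX(), and difference for DIFFERENCE(). The Point class and the sample values are hypothetical, and the real functions support more options (such as grouping by time intervals) than this shows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FunctionKinds {

    // A point reduced to a timestamp (in seconds) and a single field value.
    static final class Point {
        final long time;
        final double value;
        Point(long time, double value) { this.time = time; this.value = value; }
    }

    // Aggregation, like MEAN(): many points in, one computed value out.
    static double mean(List<Point> points) {
        double sum = 0;
        for (Point p : points) sum += p.value;
        return sum / points.size();
    }

    // Selector, like MAX(): returns one of the original points, timestamp included.
    static Point max(List<Point> points) {
        Point best = points.get(0);
        for (Point p : points) if (p.value > best.value) best = p;
        return best;
    }

    // Transformation, like DIFFERENCE(): returns a new series of derived values.
    static List<Point> difference(List<Point> points) {
        List<Point> out = new ArrayList<>();
        for (int i = 1; i < points.size(); i++) {
            out.add(new Point(points.get(i).time,
                    points.get(i).value - points.get(i - 1).value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Point> series = Arrays.asList(
                new Point(0, 2.0), new Point(360, 4.0), new Point(720, 3.0));
        System.out.println("mean = " + mean(series));
        Point m = max(series);
        System.out.println("max = " + m.value + " at t=" + m.time);
        for (Point d : difference(series)) {
            System.out.println("difference at t=" + d.time + " = " + d.value);
        }
    }
}
```

For the three sample points, the aggregation yields a single number (3.0), the selector yields the point at t=360 with value 4.0, and the transformation yields two derived points (2.0 and -1.0).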
Telegraf
Telegraf collects metrics and writes them to InfluxDB.
Telegraf can collect both system and database metrics; all it needs is some simple configuration in /etc/telegraf/telegraf.conf.
When writing data, Telegraf adds a host tag to every point so that you can tell which source reported it. The value of host can be set in telegraf.conf, or you can change the Linux hostname instead.
### OUTPUT

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://localhost:8089"]
  database = "telegraf_metrics"

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## Comment this line if you want the raw CPU time metrics
  fielddrop = ["time_*"]


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gather stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]

  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  # collect data only about specific interfaces
  # interfaces = ["eth0"]

[[inputs.netstat]]
  # no configuration

# Read metrics from one or many mysql servers
[[inputs.mysql]]
  servers = ["root:root@tcp(127.0.0.1:3306)/"]
Prometheus
Architecture

Concepts
| Concept | Database | Table | Record | Retention (how long data is kept, and in how many copies) | Indexed field | Regular field | Record timestamp |
|---|---|---|---|---|---|---|---|
| Prometheus | - | metric | time series | - | label | value | timestamp |
| InfluxDB | database | measurement | point | retention policy | tag | field | timestamp |
| MySQL | database | table | row | - | indexed column | column | - |
Differences between Prometheus and InfluxDB:
Each record of a Prometheus metric consists of several labels plus one value, and metrics come in four types: Counter, Gauge, Histogram, and Summary. InfluxDB measurements make no such distinction.
Prometheus pulls data from its targets, whereas data is pushed to InfluxDB.
A Prometheus record generally carries a single value. Taking CPU metrics as an example, an InfluxDB measurement would contain 3 fields [usage_idle, usage_system, usage_user] and 1 record [97, 2, 1], whereas the Prometheus metric would contain 1 label [mode] and 3 records: ['idle', 97], ['system', 2], ['user', 1].
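The CPU comparison in the last point can be made concrete with a small Java sketch that renders the same three readings in both shapes: one InfluxDB line protocol point with three fields, and three Prometheus text-format samples that differ only in the mode label. The metric name cpu_usage, the host server01, and the class name are hypothetical, chosen only for illustration; real exporters use their own metric names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CpuModelComparison {

    // InfluxDB style: one point whose field set holds all three values.
    static String asInfluxPoint(String host, Map<String, Integer> usage) {
        String fields = usage.entrySet().stream()
                .map(e -> "usage_" + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return "cpu,host=" + host + " " + fields;
    }

    // Prometheus style: one sample per mode, each with a single value.
    static List<String> asPrometheusSamples(Map<String, Integer> usage) {
        return usage.entrySet().stream()
                .map(e -> "cpu_usage{mode=\"" + e.getKey() + "\"} " + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> usage = new LinkedHashMap<>();
        usage.put("idle", 97);
        usage.put("system", 2);
        usage.put("user", 1);

        System.out.println(asInfluxPoint("server01", usage));
        asPrometheusSamples(usage).forEach(System.out::println);
    }
}
```

The first printed line is the single InfluxDB point `cpu,host=server01 usage_idle=97,usage_system=2,usage_user=1`; the remaining three lines are the Prometheus samples, starting with `cpu_usage{mode="idle"} 97`.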
References:
https://prometheus.io/docs/concepts/metric_types/
Querying Data
Prometheus data can be queried through its web UI, whose default address is http://your_host:9090.
Instances to pull data from are added in ${Prometheus_home}/prometheus.yml, and the up metric shows the working status of every instance (1 when the last scrape succeeded, 0 when it failed).
References:
https://prometheus.io/docs/prometheus/latest/querying/examples/
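Besides the web UI, Prometheus also serves queries over its HTTP API at /api/v1/query. As a minimal sketch, the Java code below builds the request for the up metric and fetches the JSON response. It assumes Java 11 or later and a server reachable at http://localhost:9090; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PrometheusUpQuery {

    // Builds the instant-query URI; the PromQL expression is form-encoded.
    static URI instantQueryUri(String baseUrl, String promQl) {
        String encoded = URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/query?query=" + encoded);
    }

    // Sends the GET request and returns the raw JSON body.
    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Assumed address; replace with your own Prometheus host.
        URI uri = instantQueryUri("http://localhost:9090", "up");
        System.out.println(uri);
        System.out.println(fetch(uri));
    }
}
```

Calling fetch needs a running Prometheus server. The same helper accepts any expression, for example "up == 0" to list only the instances that are currently down.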
Micrometer
Micrometer collects metrics from Java applications and can adapt to most mainstream monitoring systems, such as Prometheus and InfluxDB. It is a bit like SLF4J, which fronts many logging systems, except that Micrometer targets application metrics.
Exposing metrics to Prometheus with Spring:
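// Imports assume Micrometer 1.x, the javax.* servlet API, and Lombok on the classpath.
import java.io.IOException;
import javax.annotation.PostConstruct;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.binder.system.ProcessorMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import lombok.Getter;

@Controller
@RequestMapping(value = "/prometheus")
public class PrometheusController {

    @Getter
    private PrometheusMeterRegistry registry;

    @PostConstruct
    private void init() {
        // Returning null for every key makes the registry fall back to its defaults.
        PrometheusConfig config = key -> null;
        this.registry = new PrometheusMeterRegistry(config);
        // Tag every metric with the application name.
        this.registry.config().commonTags("application", "myAppName");
        // Bind the built-in JVM and system metric sets.
        new ClassLoaderMetrics().bindTo(this.registry);
        new JvmMemoryMetrics().bindTo(this.registry);
        new JvmGcMetrics().bindTo(this.registry);
        new ProcessorMetrics().bindTo(this.registry);
        new JvmThreadMetrics().bindTo(this.registry);
    }

    // Serve the current metrics in the Prometheus text format.
    @RequestMapping(method = { RequestMethod.GET, RequestMethod.POST })
    public void index(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.getWriter().write(registry.scrape());
        resp.getWriter().flush();
    }
}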
References: