
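1) 【One-sentence conclusion】 For 360's web services, the monitoring and alerting system combines Prometheus (scraping business and resource metrics), Grafana (visualization), and Alertmanager (alert handling and notification). It covers core metrics such as QPS, response time, error rate, and CPU/memory usage, and adjusts thresholds dynamically according to the business traffic model, so that service performance is monitored in real time, alerts are precise, and the system stays stable.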
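2) 【Principles and concepts】 Explanation: a monitoring and alerting system follows a layered architecture of collection, storage, processing, display, and alerting. Prometheus uses a pull model: the service exposes a /metrics HTTP endpoint through a client library (client_golang for Go), and Prometheus scrapes that endpoint at a fixed interval. Three metric types matter here: Counter (monotonically increasing, e.g. total requests), Gauge (an instantaneous value that can go up or down, e.g. memory usage or in-flight requests), and Histogram (observations counted into buckets, e.g. the response-time distribution); the client libraries also provide a fourth type, Summary. QPS is normally not stored as a Gauge but derived by applying rate() to a request Counter. Grafana connects to Prometheus as a data source and builds dashboards that show metric trends (e.g. a QPS curve, a response-time box plot). Alerting rules are evaluated by Prometheus itself; when a rule's expression stays beyond its threshold, Prometheus sends the alert to Alertmanager, which deduplicates, groups, and routes it and sends the notification (email, SMS). Analogy: Prometheus is like a supermarket cashier (actively pulling data), Grafana is the electronic shelf display (showing sales trends), and Alertmanager is the PA system (notifying staff when sales of an item cross a threshold). Note: resource metrics (CPU, memory) are collected through node_exporter or custom metrics and reflect the health of the infrastructure.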
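3) 【Comparison and use cases】
| Metric type | Definition | Characteristics | Use cases | Notes |
|---|---|---|---|---|
| Counter | Monotonically increasing counter | Only increases (restarts from zero when the process restarts); cumulative | Total requests, total errors (e.g. requests_total) | Not suitable for values that can go down, such as response time |
| Gauge | Instantaneous measurement | Can go up or down; reflects the current state | Memory usage, in-flight requests (QPS is usually derived instead, e.g. rate(http_requests_total{method="GET"}[1m])) | Must be updated whenever the state changes; reflects a momentary value |
| Histogram | Bucketed histogram | Records the distribution of observations in buckets | Response-time distribution (e.g. request_latency_seconds) | Used to analyze the performance distribution and spot outliers |
| CPU usage | Percentage of node CPU in use | Instantaneous measurement | Server resource health | Collected through node_exporter |
| Memory usage | Percentage of node memory in use | Instantaneous measurement | Server resource health | Collected through node_exporter |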
4) 【Example】
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Business and resource metrics. Request handlers (not shown) must update
// them, e.g. requestCount.WithLabelValues(method, path, code).Inc().
var (
	// Counter of requests; the code label carries the HTTP status code so
	// that the error rate can be computed in PromQL.
	requestCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "web_service_requests_total",
			Help: "Total number of requests received",
		},
		[]string{"method", "path", "code"},
	)
	// Histogram of request latency; buckets run from 0.1 s, doubling 10 times.
	requestLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "web_service_request_latency_seconds",
			Help:    "Request latency distribution",
			Buckets: prometheus.ExponentialBuckets(0.1, 2, 10),
		},
		[]string{"method", "path"},
	)
	// Error rate as a Gauge; it can also be derived in PromQL from requestCount.
	errorRate = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "web_service_error_rate",
			Help: "Error rate of requests",
		},
		[]string{"method", "path"},
	)
	// Custom resource gauges; node_exporter provides equivalents out of the box.
	cpuUsage = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "node_cpu_usage_percent",
			Help: "CPU usage percentage of the node",
		},
	)
	memUsage = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "node_memory_usage_bytes",
			Help: "Memory usage of the node",
		},
	)
)

func init() {
	// Register every collector with the default registry.
	prometheus.MustRegister(requestCount, requestLatency, errorRate, cpuUsage, memUsage)
}

func main() {
	// Expose the registered metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
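# QPS per path
sum by (path) (rate(web_service_requests_total{method="GET", path="/api/v1/search"}[1m]))
# P90 latency, estimated from the histogram buckets
histogram_quantile(0.9, sum by (le) (rate(web_service_request_latency_seconds_bucket{method="GET", path="/api/v1/search"}[1m])))
# Error rate in percent, assuming a code label that holds the HTTP status code
100 * sum(rate(web_service_requests_total{method="GET", path="/api/v1/search", code=~"5.."}[1m])) / sum(rate(web_service_requests_total{method="GET", path="/api/v1/search"}[1m]))
# CPU usage in percent per instance, from node_exporter: 100 minus the idle share
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
# Alerting rules (Prometheus rule file)
groups: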
  - name: web_service_alerts
    rules:
      - alert: HighRequestLatency
        # Dynamic threshold: 0.5 s base plus 0.1 s for every 1000 QPS of traffic
        expr: histogram_quantile(0.9, sum by (le) (rate(web_service_request_latency_seconds_bucket{method="GET", path="/api/v1/search"}[1m]))) > 0.5 + scalar(sum(rate(web_service_requests_total{method="GET", path="/api/v1/search"}[5m]))) / 1000 * 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High latency for GET /api/v1/search"
          description: "P90 latency has stayed above the traffic-adjusted threshold for 5 minutes"
      - alert: HighResourceUsage
        # CPU usage above 80% (idle share below 20%) or memory usage above 80%
        expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80 or (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU or memory usage on node"
          description: "CPU usage or memory usage has stayed above 80% for 10 minutes"
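5) 【Spoken interview answer】 In 360's web service projects, the monitoring and alerting system typically uses a Prometheus + Grafana + Alertmanager architecture. First, the service exposes its key business metrics (QPS, response time, error rate) through the Prometheus client library (client_golang for Go), while resource metrics (CPU and memory usage) come from node_exporter or custom metrics. For example, QPS is computed with rate() over a request Counter, and response time is recorded as a Histogram so that its distribution can be analyzed. Then Grafana dashboards show the trends of these metrics, such as a real-time QPS curve and a response-time box plot. Alerting rules are configured in Prometheus: for example, when QPS exceeds its threshold (adjusted dynamically with business traffic, e.g. raised as traffic grows) or when response time exceeds its threshold (e.g. P90 latency above 0.5 s), the alert goes to Alertmanager, which sends the notification (email, SMS). The core idea is that the metrics cover both the business and the resource dimension, and that thresholds are tuned to the business scenario (e.g. for 360's high-concurrency search service, a QPS threshold of 2000/s and a response-time threshold of 0.8 s), so that monitoring is comprehensive and alerts are effective.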
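6) 【Follow-up questions】
[...] the expression in alerting_rules computes the threshold dynamically from the current QPS (e.g. threshold = base threshold + traffic fluctuation coefficient), or the threshold configuration is updated dynamically through an external configuration center (e.g. Nacos).
7) 【Common pitfalls】
[...] > 0 when it should actually be > 0.5), causing false alarms or missed alerts. [...] interval is set too large, which delays the metrics and hurts the timeliness of alerts.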