SRE Daily Topic: ELK Log Collection and Analysis
Date: 2026-03-11
Topic number: 11 (11 % 12 = 11)
Difficulty: ⭐⭐⭐⭐
Scope: centralized log management for production environments
1. Architecture Overview
1.1 ELK Stack Components
┌─────────────┐     ┌─────────────┐     ┌───────────────┐
│  Filebeat   │───▶│  Logstash   │───▶│ Elasticsearch │
│  (shipper)  │     │ (processor) │     │   (storage)   │
└─────────────┘     └─────────────┘     └───────────────┘
                                                │
                                                ▼
                                        ┌─────────────┐
                                        │   Kibana    │
                                        │ (frontend)  │
                                        └─────────────┘
1.2 Recommended Deployment Architecture (production)
Application cluster (multiple hosts)
│
├── Filebeat ──┐
├── Filebeat ──┼──▶ Kafka (buffer) ──▶ Logstash cluster ──▶ Elasticsearch cluster
├── Filebeat ──┘
│
└───▶ Kibana (behind a load balancer)
2. Elasticsearch Production Configuration
2.1 Kernel Parameter Tuning
# /etc/sysctl.conf
vm.max_map_count=262144
fs.file-max=655360
vm.swappiness=1
net.ipv4.tcp_retries2=5
net.core.somaxconn=65535
# /etc/security/limits.conf
elasticsearch - nofile 65536
elasticsearch - nproc 65536
elasticsearch - memlock unlimited
2.2 JVM Configuration
# /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
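The 16g heap above follows the usual sizing rule: give the JVM about half of physical RAM, but stay below ~32GB so it keeps compressed object pointers. A minimal sketch of that rule (the `heap_for_ram` helper is illustrative, not an Elastic tool):

```shell
#!/bin/sh
# Heap sizing rule of thumb: half of physical RAM, capped below the ~32GB
# compressed-oops threshold. Argument: host RAM in GB.
heap_for_ram() {
  ram_gb=$1
  heap=$(( ram_gb / 2 ))
  # 31g is a safe cap that keeps compressed object pointers enabled
  [ "$heap" -gt 31 ] && heap=31
  echo "-Xms${heap}g -Xmx${heap}g"
}

heap_for_ram 32    # prints -Xms16g -Xmx16g (the value used above)
heap_for_ram 128   # prints -Xms31g -Xmx31g (capped)
```

The remaining RAM is not wasted: Elasticsearch relies heavily on the OS page cache for Lucene segment files.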
2.3 Elasticsearch Main Configuration
# /etc/elasticsearch/elasticsearch.yml
cluster.name: production-elk
node.name: es-node-01
node.roles: [ master, data, ingest ]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["es-node-01", "es-node-02", "es-node-03"]
cluster.initial_master_nodes: ["es-node-01", "es-node-02", "es-node-03"]
# Security (production)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12
# Performance tuning
indices.memory.index_buffer_size: 20%
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000
# Slow log thresholds are index-level settings and (since ES 5.x) cannot be
# placed in elasticsearch.yml; set them per index or in an index template:
# PUT logs-*/_settings
# {
#   "index.search.slowlog.threshold.query.warn": "10s",
#   "index.search.slowlog.threshold.query.info": "5s",
#   "index.search.slowlog.threshold.fetch.warn": "1s",
#   "index.indexing.slowlog.threshold.index.warn": "10s"
# }
2.4 Index Template
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs",
      "codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "host": {
          "type": "object",
          "properties": {
            "name": { "type": "keyword" },
            "ip": { "type": "ip" }
          }
        },
        "service": { "type": "keyword" },
        "level": { "type": "keyword" },
        "message": {
          "type": "text",
          "analyzer": "standard"
        },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" },
        "duration_ms": { "type": "long" },
        "status_code": { "type": "integer" }
      }
    }
  },
  "priority": 100
}
2.5 ILM Lifecycle Policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "set_priority": { "priority": 50 },
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 0 },
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
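The `min_age` values count from rollover, so the policy above produces a fixed timeline per index generation. A throwaway sketch of that timeline (requires GNU `date`; the 2026-03-11 rollover date is just an example):

```shell
#!/bin/sh
# Illustrate the ILM phase timeline: min_age is measured from rollover.
# Requires GNU date for the "-d" relative-date syntax.
rollover=2026-03-11
for phase_age in "warm 2" "cold 7" "delete 30"; do
  set -- $phase_age                                # split "phase days"
  echo "$1: $(date -u -d "$rollover + $2 days" +%F)"
done
```

So an index rolled over on 2026-03-11 moves to warm on 2026-03-13, cold on 2026-03-18, and is deleted on 2026-04-10.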
3. Logstash Configuration
3.1 Main Configuration
# /etc/logstash/logstash.yml
http.host: "0.0.0.0"
http.port: 9600
log.level: info
path.logs: /var/log/logstash
pipeline.workers: 4
pipeline.batch.size: 125
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 4gb
queue.checkpoint.writes: 1024
dead_letter_queue.enable: true
3.2 Pipeline Configuration (split across multiple files)
# /etc/logstash/conf.d/01-inputs.conf
input {
  kafka {
    bootstrap_servers => "kafka-01:9092,kafka-02:9092,kafka-03:9092"
    topics => ["app-logs", "system-logs", "nginx-logs"]
    group_id => "logstash-consumer"
    consumer_threads => 4
    decorate_events => true
    auto_offset_reset => "latest"
    codec => json
  }
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.key"
  }
}
# /etc/logstash/conf.d/02-filters.conf
filter {
  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss", "UNIX" ]
    target => "@timestamp"
    timezone => "Asia/Shanghai"
  }

  # Parse JSON-formatted messages
  if [message] =~ /^\{.*\}$/ {
    json {
      source => "message"
      target => "json_data"
    }
  }

  # Grok patterns for common log formats
  grok {
    match => {
      "message" => [
        "%{COMBINEDAPACHELOG}",
        "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %{DATA:program}: %{GREEDYDATA:log_message}",
        "%{TIMESTAMP_ISO8601:log_timestamp} %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
      ]
    }
    tag_on_failure => ["_grokparsefailure"]
  }

  # Tag events with environment metadata
  mutate {
    add_field => {
      "environment" => "production"
      "datacenter" => "cn-east-1"
    }
  }

  # Drop fields we do not need
  mutate {
    remove_field => ["host", "agent", "ecs", "input_type"]
  }

  # Route by log level
  if [level] in ["ERROR", "FATAL", "CRITICAL"] {
    mutate {
      add_tag => ["high_priority"]
    }
  }
}
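To see what the third grok pattern (`TIMESTAMP_ISO8601 LOGLEVEL logger - message`) pulls out of a typical application log line, here is a plain-awk sketch of the same field split. This is not Logstash, and the sample line is invented for illustration:

```shell
#!/bin/sh
# Illustrate the field split performed by the third grok pattern above.
line='2026-03-11T10:15:42,123 ERROR com.example.OrderService - payment declined'
echo "$line" | awk '{
  printf "log_timestamp=%s\n", $1
  printf "level=%s\n", $2
  printf "logger=%s\n", $3
  # everything after the " - " separator becomes the message
  sub(/^.* - /, ""); printf "log_message=%s\n", $0
}'
```

Running it prints `level=ERROR` and `log_message=payment declined`, matching the `level` and `log_message` fields the grok filter would emit.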
# /etc/logstash/conf.d/03-outputs.conf
output {
  elasticsearch {
    hosts => ["https://es-node-01:9200", "https://es-node-02:9200", "https://es-node-03:9200"]
    # Index naming is handled by ILM via ilm_rollover_alias; do not also set index here
    user => "logstash_writer"
    password => "${ES_PASSWORD}"
    ssl => true
    ssl_certificate_verification => true
    cacert => "/etc/logstash/certs/ca.crt"
    manage_template => false
    ilm_enabled => true
    ilm_rollover_alias => "logs"
    ilm_pattern => "000001"
    ilm_policy => "logs-policy"
    action => "create"
  }

  # Ship high-priority logs to a separate index
  if "high_priority" in [tags] {
    elasticsearch {
      hosts => ["https://es-node-01:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
      user => "logstash_writer"
      password => "${ES_PASSWORD}"
      ssl => true
      ssl_certificate_verification => true
      cacert => "/etc/logstash/certs/ca.crt"
    }
  }

  # Debug output (keep disabled in production)
  # stdout { codec => rubydebug }
}
4. Filebeat Configuration
4.1 Application Log Collection
# /etc/filebeat/filebeat.yml
filebeat.inputs:
  # Nginx access log (plain text, one event per line; multiline settings
  # fold any continuation lines into the preceding event)
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    tags: ["nginx", "access"]
    fields:
      service: nginx
      log_type: access
    multiline.pattern: '^\d+\.\d+\.\d+\.\d+'
    multiline.negate: true
    multiline.match: after

  # Application logs (JSON format)
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    tags: ["application"]
    fields:
      service: myapp
      log_type: application
    json.keys_under_root: true
    json.overwrite_keys: true
    json.add_error_key: true
    include_lines: ['ERROR', 'WARN', 'INFO', 'DEBUG']

  # System logs (syslog over UDP)
  - type: syslog
    enabled: true
    protocol.udp:
      host: "localhost:514"
    fields:
      service: syslog
      log_type: system

  # Kubernetes container logs
  - type: container
    enabled: true
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

# Output to Logstash
output.logstash:
  hosts: ["logstash-01:5044", "logstash-02:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]
  ssl.certificate: "/etc/filebeat/certs/filebeat.crt"
  ssl.key: "/etc/filebeat/certs/filebeat.key"
  loadbalance: true
  worker: 2

# Monitoring
monitoring.enabled: true
monitoring.elasticsearch:
  hosts: ["https://es-node-01:9200"]
  username: "filebeat_internal"
  password: "${ES_PASSWORD}"
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]

# Filebeat's own log files
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat.log
  keepfiles: 7
  permissions: 0644
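The multiline settings above (`negate: true`, `match: after`) mean: a line that does NOT match the pattern is appended to the previous event, which is how stack traces stay attached to the request line that produced them. An awk sketch of that semantics, using invented sample lines (this is an illustration, not Filebeat):

```shell
#!/bin/sh
# Sketch of Filebeat multiline folding: lines not starting with an IP
# address are appended to the previous event (shown joined with " | ").
printf '%s\n' \
  '10.0.0.1 GET /api/v1/orders 500' \
  'java.lang.NullPointerException' \
  '    at com.example.OrderService.pay(OrderService.java:42)' \
  '10.0.0.2 GET /healthz 200' |
awk '
  /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ { if (e != "") print e; e = $0; next }
  { e = e " | " $0 }                     # continuation line: append
  END { if (e != "") print e }
'
```

Four input lines become two events: the 500 request carries its stack trace, and the health check stands alone.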
4.2 Using Filebeat Modules
# Enable common modules
filebeat modules enable nginx
filebeat modules enable mysql
filebeat modules enable redis
filebeat modules enable kafka
filebeat modules enable system
# Configure a module
cat > /etc/filebeat/modules.d/nginx.yml << EOF
- module: nginx
  access:
    enabled: true
    var.paths: ["/var/log/nginx/access.log"]
  error:
    enabled: true
    var.paths: ["/var/log/nginx/error.log"]
  ingress_controller:
    enabled: false
EOF
5. Kibana Configuration
5.1 Basic Configuration
# /etc/kibana/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
server.name: "kibana"
elasticsearch.hosts: ["https://es-node-01:9200", "https://es-node-02:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "${ES_PASSWORD}"
elasticsearch.ssl.verificationMode: certificate
elasticsearch.ssl.certificateAuthorities: ["/etc/kibana/certs/ca.crt"]
# Security is controlled on the Elasticsearch side; note that
# xpack.security.enabled is no longer a valid Kibana 8.x setting.
xpack.encryptedSavedObjects.encryptionKey: "a-random-key-of-at-least-32-bytes"
xpack.reporting.encryptionKey: "a-random-key-of-at-least-32-bytes"
xpack.security.encryptionKey: "a-random-key-of-at-least-32-bytes"
# Performance tuning
elasticsearch.requestTimeout: 300000
elasticsearch.pingTimeout: 1500
elasticsearch.shardTimeout: 30000
# UI locale (Chinese)
i18n.locale: "zh-CN"
5.2 Common Query Examples
# Errors in the last hour
level: ERROR AND @timestamp > now-1h
# Errors from a specific service
service: "payment-service" AND level: ERROR
# Slow requests
duration_ms > 1000
# Look up a specific trace ID
trace_id: "abc123def456"
# Aggregation (error count per service)
POST /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "level": "ERROR" } }
      ]
    }
  },
  "aggs": {
    "services": {
      "terms": { "field": "service" }
    }
  }
}
6. Monitoring and Alerting
6.1 Elasticsearch Health Check Commands
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Node overview
curl -X GET "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role"
# Index overview, largest first
curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"
# Shard distribution
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,docs,store,node"
# Currently running tasks (useful for spotting long-running queries)
curl -X GET "localhost:9200/_tasks?detailed&pretty"
# Cluster-wide statistics
curl -X GET "localhost:9200/_cluster/stats?pretty"
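The cluster health response is the natural basis for a scripted health gate. A minimal sketch, shown against a canned response so it runs without a live cluster (in practice, pipe in `curl -s localhost:9200/_cluster/health`); the extraction uses sed rather than assuming jq is installed:

```shell
#!/bin/sh
# Minimal health gate: extract "status" from a _cluster/health response
# and exit non-zero on red. Canned sample response for illustration.
health='{"cluster_name":"production-elk","status":"yellow","number_of_nodes":3}'
status=$(printf '%s' "$health" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
case "$status" in
  green)  echo "OK: cluster is green" ;;
  yellow) echo "WARN: replicas unassigned" ;;
  red)    echo "CRIT: primary shards unassigned"; exit 2 ;;
esac
```

With the sample above it prints the yellow branch; wired to a real cluster, the exit code makes it usable directly in cron or a deploy pipeline.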
6.2 Key Metrics
| Metric | Threshold | Notes |
|---|---|---|
| Cluster status | red/yellow | red = primary shards unassigned, yellow = replicas unassigned |
| CPU usage | > 80% | sustained high load calls for scaling out |
| JVM heap usage | > 75% | triggers frequent GC |
| Disk usage | > 85% | hits the disk watermark |
| Query latency P95 | > 1s | degraded user experience |
| Indexing latency P95 | > 500ms | write bottleneck |
| Write rejections | > 0 | write queue is full |
6.3 Prometheus Scrape Configuration
# prometheus.yml scrape_configs
# Note: ES, Logstash and Filebeat do not expose Prometheus metrics natively;
# the paths below assume exporters or plugins (e.g. a prometheus-exporter
# plugin for ES, or elasticsearch_exporter / beat-exporter sidecars).
- job_name: 'elasticsearch'
  static_configs:
    - targets: ['es-node-01:9200', 'es-node-02:9200']
  metrics_path: /_prometheus/metrics
  scheme: https
  basic_auth:
    username: prometheus
    password: ${ES_PASSWORD}
  tls_config:
    ca_file: /etc/prometheus/ca.crt
- job_name: 'logstash'
  static_configs:
    - targets: ['logstash-01:9600', 'logstash-02:9600']
  metrics_path: /_node/stats/prometheus
- job_name: 'filebeat'
  static_configs:
    - targets: ['filebeat-exporter:5066']
6.4 Alerting Rules (Prometheus Alertmanager)
# alerting-rules.yml
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ES cluster status is RED"
      - alert: ElasticsearchHighHeapUsage
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ES heap usage above 85%"
      - alert: ElasticsearchDiskWatermarkHigh
        expr: elasticsearch_indices_store_size_bytes / elasticsearch_filesystem_data_size_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ES disk usage above 85%"
      - alert: LogstashPipelineBackpressure
        expr: rate(logstash_events_out[5m]) < rate(logstash_events_in[5m]) * 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Logstash pipeline is backing up"
7. Troubleshooting Guide
7.1 Common Problems and Fixes
Problem 1: write rejections
Symptom: logs contain "rejected execution of processing"
Diagnosis:
# Check the write thread pool queues
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"
# Find hot shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,docs,store,node" | sort -k4 -rn | head -20
Fixes:
- Increase the write thread pool queue: thread_pool.write.queue_size: 2000
- Add primary shards or reduce replicas
- Tune the bulk request size (5-15MB per request is a good starting point)
- Check for unusually large individual documents
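Converting the 5-15MB bulk guideline into a document count is simple arithmetic once you know your average event size. A sketch (the `docs_per_bulk` helper and the 2KB average are illustrative; measure your own logs):

```shell
#!/bin/sh
# Rough bulk sizing: how many documents land a bulk request near 10MB,
# the middle of the 5-15MB sweet spot. Argument: average document bytes.
docs_per_bulk() {
  avg_doc_bytes=$1
  target_bytes=$(( 10 * 1024 * 1024 ))   # aim for ~10MB per bulk request
  echo $(( target_bytes / avg_doc_bytes ))
}

docs_per_bulk 2048   # prints 5120: ~2KB docs -> ~5k docs per bulk
```

This also shows why a fixed document count is a fragile setting: doubling the average event size silently doubles the request size.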
Problem 2: query timeouts
Symptom: Kibana queries time out or return partial results
Diagnosis:
# List running search tasks
curl -X GET "localhost:9200/_tasks?detailed&actions=*search"
# Check whether shards are evenly distributed
curl -X GET "localhost:9200/_cat/shards/{index}?v&h=index,shard,prirep,docs,store,node"
Fixes:
- Rewrite expensive queries; avoid patterns that scan every shard
- Use filter context instead of query context (filters are cacheable)
- Raise the query timeout
- Make sure frequently queried fields have appropriate mappings (e.g. keyword for exact matches)
Problem 3: disk watermark alerts
Symptom: cluster turns yellow or red; writes are blocked
Diagnosis:
# Check disk allocation per node
curl -X GET "localhost:9200/_cat/allocation?v"
# List indices by size
curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"
Fixes:
- Temporarily raise the watermarks (emergency measure only):
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'
- Delete old indices or shorten ILM retention
- Add data nodes
- Adopt a hot-warm architecture
Problem 4: Filebeat dropping logs
Symptom: some logs never reach Elasticsearch
Diagnosis:
# Check the Filebeat service and output connectivity
systemctl status filebeat
filebeat test output
# Inspect Filebeat's own logs
tail -f /var/log/filebeat/filebeat.log | grep -i error
# Inspect the registry (tracks per-file read offsets)
cat /var/lib/filebeat/registry/filebeat/log.json
Fixes:
- Check log file permissions
- Tune scan_frequency and harvester_limit
- Verify the multiline configuration
- Increase queue.mem.events and output.logstash.worker
7.2 Performance Diagnostics
# Hot threads (CPU hotspots)
curl -X GET "localhost:9200/_nodes/hot_threads"
# JVM memory stats
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"
# Per-node indexing stats
curl -X GET "localhost:9200/_nodes/stats/indices?pretty"
# Active shard recoveries
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"
# Segment overview (merge pressure shows up as many small segments)
curl -X GET "localhost:9200/_cat/segments?v"
8. Best Practices
8.1 Index Design
- Time-based indices: names like logs-2026.03.11 make retention and deletion easy
- Right-size shards: aim for 20-50GB per shard; too many shards hurts performance
- Use aliases: switch underlying indices seamlessly behind an alias
- Disable unneeded fields: enabled: false in the mapping reduces storage
8.2 Write Optimization
- Use the Bulk API: batch writes to cut network overhead
- Tune batch size: 5-15MB or 1,000-5,000 documents per request
- Write asynchronously: do not block on acknowledgement; raises throughput
- Buffer through Kafka: absorbs traffic spikes and protects Elasticsearch
8.3 Query Optimization
- Use filter context: cacheable, so repeated queries are much cheaper
- Avoid wildcard queries: patterns like *abc* perform extremely poorly
- Use routing: target a specific shard instead of fanning out to all of them
- Pre-compute aggregations: materialize expensive aggregations ahead of time
8.4 Security Hardening
- Enable X-Pack Security: enforce authentication and encryption
- Least privilege: grant each role only the permissions it needs
- Network isolation: never expose the ES cluster to the public internet
- Rotate passwords regularly: automate credential management
- Audit logging: record who accessed what
8.5 Capacity Planning
| Log volume/day | ES node spec | Storage (30 days) |
|---|---|---|
| 10GB | 3 nodes (8C16G) | 900GB (incl. replicas) |
| 50GB | 5 nodes (16C32G) | 4.5TB |
| 100GB | 9 nodes (16C64G) | 9TB |
| 500GB+ | 20+ nodes (32C128G) | 45TB+ |
Formula:
Total storage = daily volume × retention days × (1 + replica count) × overhead factor
(the table above uses an overhead factor of 1.5, covering index structures plus disk watermark headroom; 1.2 is the usual bare minimum)
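The formula can be checked with a small shell helper; the 1.5 overhead factor below is the one implied by the table's numbers (`storage_gb` is an illustrative name, not a standard tool):

```shell
#!/bin/sh
# Capacity formula as a helper. Arguments: daily volume (GB),
# retention days, replica count. Overhead factor 1.5 matches the table.
storage_gb() {
  daily=$1; days=$2; replicas=$3
  # awk handles the fractional overhead factor
  awk -v d="$daily" -v n="$days" -v r="$replicas" \
      'BEGIN { printf "%.0f\n", d * n * (1 + r) * 1.5 }'
}

storage_gb 50 30 1   # prints 4500 (GB) = 4.5TB, matching the table row
storage_gb 10 30 1   # prints 900 (GB), matching the first row
```

Re-running it with your own replica count and retention is a quick sanity check before ordering hardware.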
9. Quick Deployment
9.1 Docker Compose Quick Start (test environments only)
# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "9600:9600"
    depends_on:
      - elasticsearch
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - logstash
volumes:
  es_data:
9.2 Production Deployment Checklist
- Kernel parameters tuned
- JVM heap configured (no more than 32GB)
- SSL/TLS certificates generated and installed
- Users and roles created (least privilege)
- ILM policy configured
- Index templates created
- Monitoring and alerting configured
- Backup strategy in place (snapshots)
- Firewall rules configured
- Log rotation configured
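The first checklist item lends itself to an automated preflight. A sketch of a comparison helper (`check` is an invented name); feed it live values with e.g. `check vm.max_map_count "$(sysctl -n vm.max_map_count)" 262144`:

```shell
#!/bin/sh
# Preflight sketch: compare a current value against the required minimum.
# Arguments: parameter name, current value, required minimum.
check() {
  name=$1; current=$2; required=$3
  if [ "$current" -ge "$required" ]; then
    echo "PASS $name ($current >= $required)"
  else
    echo "FAIL $name ($current < $required)"
  fi
}

# Demonstration with fixed values (section 2.1 lists the real targets):
check vm.max_map_count 262144 262144   # prints PASS ...
check fs.file-max 65536 655360         # prints FAIL ...
```

Looping such checks over the sysctl and ulimit values from section 2.1 turns the checklist's first line into a script that can gate a deployment.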
Document version: v1.0
Last updated: 2026-03-11
Maintained by: SRE Team