SRE Daily Topic: Elasticsearch Production Tuning and Monitoring
Date: 2026-03-06
Topic number: 6 (March 6 → 6 % 12 = 6)
Applies to: Elasticsearch 8.x
1. Production Deployment Architecture
1.1 Recommended Cluster Sizes
| Cluster size | Nodes | Use case |
|---|---|---|
| Small | 3 | Log volume < 100 GB/day |
| Medium | 5-7 | Log volume 100-500 GB/day |
| Large | 9+ | Log volume > 500 GB/day |
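The sizing table can be read as a simple lookup. The sketch below is illustrative only: the function name `cluster_tier` is hypothetical, and treating exactly 100 and 500 GB/day as "medium" is an assumption, since the table does not say which tier owns the boundaries.

```python
def cluster_tier(daily_log_gb: float) -> tuple[str, str]:
    """Map a daily log volume (GB/day) to (tier, node-count guidance)."""
    if daily_log_gb < 0:
        raise ValueError("daily_log_gb must be non-negative")
    if daily_log_gb < 100:
        return ("small", "3")
    if daily_log_gb <= 500:  # boundary values assumed to fall in the medium tier
        return ("medium", "5-7")
    return ("large", "9+")


if __name__ == "__main__":
    for volume in (40, 250, 800):
        print(volume, cluster_tier(volume))
```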
1.2 Node Role Separation
# elasticsearch.yml - Master node
cluster.name: production-es
node.name: es-master-01
node.roles: ["master"]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
# elasticsearch.yml - Data node
cluster.name: production-es
node.name: es-data-01
node.roles: ["data", "data_content"]
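# Multiple data paths are deprecated since 7.13; for new deployments prefer a single path (for example on a RAID or LVM volume)
path.data: /mnt/ssd1/elasticsearch,/mnt/ssd2/elasticsearch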
path.logs: /var/log/elasticsearch
# elasticsearch.yml - Coordinating node (this one also runs ingest pipelines; a coordinating-only node sets node.roles: [])
cluster.name: production-es
node.name: es-coord-01
node.roles: ["ingest", "remote_cluster_client"]
1.3 JVM Tuning
# jvm.options (on 8.x, prefer a custom file under config/jvm.options.d/ over editing jvm.options directly)
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
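Key parameter notes:
- -Xms and -Xmx must be equal, which avoids heap-resize overhead at runtime
- Keep the heap at or below 31 GB so the JVM can keep using compressed object pointers
- Keep the heap at or below 50% of physical memory, leaving the rest for the filesystem cache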
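The three rules above reduce to a small calculation. This is a minimal sketch, assuming whole-gigabyte inputs; the helper names `recommended_heap_gb` and `jvm_heap_flags` are hypothetical and not part of any Elasticsearch tooling.

```python
def recommended_heap_gb(physical_ram_gb: int, cap_gb: int = 31) -> int:
    """Half of physical RAM, capped at 31 GB (compressed-oops limit from the notes above)."""
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    return max(1, min(physical_ram_gb // 2, cap_gb))


def jvm_heap_flags(physical_ram_gb: int) -> list[str]:
    """Return matching -Xms/-Xmx flags so the heap is never resized at runtime."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (32, 64, 128):
        print(ram, jvm_heap_flags(ram))
```

On a 32 GB host this yields the 16g values used in the jvm.options listing above.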
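2. Core Parameter Tuning
2.1 System-Level Tuning
# /etc/security/limits.conf (ignored by systemd-managed services; set LimitNOFILE / LimitMEMLOCK in a unit override there)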
elasticsearch - nofile 65535
elasticsearch - nproc 4096
elasticsearch - memlock unlimited
# /etc/sysctl.conf
vm.max_map_count = 262144
vm.swappiness = 1
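# fs.file-max is the system-wide limit; check the current value first (sysctl fs.file-max) and never lower it
fs.file-max = 65535
# Apply the settings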
sysctl -p
ulimit -n 65535
2.2 Index Template
PUT _index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "30s",
"index.translog.durability": "async",
"index.translog.sync_interval": "5s",
"index.codec": "best_compression",
"index.routing.allocation.total_shards_per_node": 3,
"index.merge.policy.segments_per_tier": 10
},
"mappings": {
"dynamic": "strict",
"properties": {
"@timestamp": { "type": "date" },
"message": {
"type": "text",
"analyzer": "standard",
"fielddata": false
},
"level": { "type": "keyword" },
"service": { "type": "keyword" },
"trace_id": { "type": "keyword" }
}
}
}
}
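The template combines 5 primaries, 1 replica, and `total_shards_per_node: 3`, so it is worth checking that the cluster has enough data nodes to place every shard copy. The sketch below is a back-of-the-envelope check under two assumptions: the per-node limit applies per index, and copies of the same shard never share a node. The function name `min_data_nodes` is hypothetical.

```python
def min_data_nodes(primaries: int, replicas: int, per_node_limit: int) -> int:
    """Smallest data-node count that can hold every shard copy of one index."""
    if primaries < 1 or replicas < 0 or per_node_limit < 1:
        raise ValueError("invalid shard settings")
    total_copies = primaries * (1 + replicas)
    # Ceiling division without floats: nodes needed to respect the per-node cap.
    by_limit = -(-total_copies // per_node_limit)
    # A primary and its replicas must all live on different nodes.
    by_copies = replicas + 1
    return max(by_limit, by_copies)


if __name__ == "__main__":
    # Values from the template above: 5 primaries, 1 replica, at most 3 shards per node.
    print(min_data_nodes(5, 1, 3))
```

With the template's values the result is 4, which suggests that a 3-data-node cluster would leave one shard copy unassigned unless the limit is raised.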
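2.3 Key Parameter Notes
| Parameter | Recommended value | Notes |
|---|---|---|
| number_of_shards | Sized from data volume | 20-50 GB per shard, with 20% headroom for growth |
| number_of_replicas | 1 (production) | Needed for high availability; read-heavy, write-light workloads can use 2 |
| refresh_interval | 30s | Default is 1s; raising it improves throughput for bulk-heavy indexing |
| translog.durability | async | Asynchronous fsync improves write throughput, but acknowledged writes since the last sync can be lost if a node crashes |
| translog.sync_interval | 5s | Trades throughput against durability: up to 5s of writes are at risk |
| codec | best_compression | Saves disk space at a small CPU cost |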
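The number_of_shards guidance (20-50 GB per shard plus 20% growth headroom) can be turned into a quick estimate. This is a sketch, not an official formula: the function name `primary_shards` and the 40 GB default target are assumptions chosen from inside the recommended range.

```python
def primary_shards(expected_data_gb: int, target_shard_gb: int = 40,
                   growth_pct: int = 20) -> int:
    """Primary shard count so each shard stays near target_shard_gb after growth."""
    if expected_data_gb < 0 or growth_pct < 0:
        raise ValueError("inputs must be non-negative")
    if not 20 <= target_shard_gb <= 50:
        raise ValueError("target_shard_gb should stay within the 20-50 GB guidance")
    # Integer ceiling division keeps the arithmetic exact.
    numerator = expected_data_gb * (100 + growth_pct)
    denominator = 100 * target_shard_gb
    return max(1, -(-numerator // denominator))


if __name__ == "__main__":
    print(primary_shards(200))       # 200 GB of primary data at 40 GB per shard
    print(primary_shards(100, 50))   # 100 GB of primary data at 50 GB per shard
```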
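3. Monitoring
3.1 Core Monitoring Metrics
# Cluster health
GET _cluster/health
# Node statistics
GET _nodes/stats
# Index statistics, largest first
GET _cat/indices?v&s=store.size:desc
# Shard distribution
GET _cat/shards?v
# Hot threads (the busiest threads on each node, not individual queries)
GET _nodes/hot_threads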
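3.2 Prometheus Monitoring Configuration
# prometheus.yml
# Elasticsearch has no built-in Prometheus endpoint. The path below assumes an exporter
# plugin is installed on the node. The metric names in this section follow the standalone
# elasticsearch_exporter, which by default serves /metrics on port 9114.
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es-master-01:9200']
    metrics_path: '/_prometheus/metrics'
    scrape_interval: 30s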
# grafana-dashboard.json (key panels)
{
"panels": [
{"title": "Cluster Health", "targets": [{"expr": "elasticsearch_cluster_health_status"}]},
{"title": "JVM Memory", "targets": [{"expr": "elasticsearch_jvm_memory_used_bytes"}]},
{"title": "Index Rate", "targets": [{"expr": "rate(elasticsearch_indices_indexing_index_total[5m])"}]},
{"title": "Search Rate", "targets": [{"expr": "rate(elasticsearch_indices_search_query_total[5m])"}]},
{"title": "Thread Pool Queue", "targets": [{"expr": "elasticsearch_thread_pool_queue_count"}]}
]
}
3.3 Alerting Rules
# Prometheus alerting rules file (loaded through rule_files in prometheus.yml; alertmanager.yml only handles routing)
groups:
  - name: elasticsearch
    rules:
      - alert: ESClusterRed
        # elasticsearch_exporter exposes one series per color label, set to 1 for the current status
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        annotations:
          summary: "ES cluster status is RED"
      - alert: ESNodeDown
        expr: elasticsearch_node_stats_up == 0
        for: 2m
        annotations:
          summary: "ES node is down"
      - alert: ESHighJVM
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 10m
        annotations:
          summary: "ES JVM heap usage above 85%"
      - alert: ESUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 15m
        annotations:
          summary: "Unassigned shards present"
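When Prometheus is unavailable, the heap-usage condition can be approximated by hand from a `GET _nodes/stats/jvm` response. The sketch below assumes the response has already been parsed into a dict; the function name `nodes_over_heap` and the sample payload are hypothetical, and the check is a point-in-time snapshot that does not model the 10-minute `for:` window.

```python
def nodes_over_heap(stats: dict, threshold: float = 0.85) -> list[str]:
    """Names of nodes whose heap usage ratio exceeds the threshold."""
    hot = []
    for node in stats.get("nodes", {}).values():
        mem = node["jvm"]["mem"]
        ratio = mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        if ratio > threshold:
            hot.append(node["name"])
    return sorted(hot)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical, trimmed-down sample of a _nodes/stats/jvm response.
    sample = {
        "nodes": {
            "id-1": {"name": "es-data-01",
                     "jvm": {"mem": {"heap_used_in_bytes": 9 * GIB,
                                     "heap_max_in_bytes": 10 * GIB}}},
            "id-2": {"name": "es-data-02",
                     "jvm": {"mem": {"heap_used_in_bytes": 4 * GIB,
                                     "heap_max_in_bytes": 10 * GIB}}},
        }
    }
    print(nodes_over_heap(sample))
```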
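4. Troubleshooting
4.1 Common Diagnostic Commands
# Explain why a shard is unassigned
GET _cluster/allocation/explain
# Set slow log thresholds (these are index-level settings; entries go to the slow log files under path.logs)
PUT <index>/_settings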
{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.indexing.slowlog.threshold.index.warn": "10s"
}
# Check shard balance and disk usage per node
GET _cat/allocation?v
# Node resource usage
GET _nodes/stats/jvm,os,process
# Check cluster and shard health
GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,node,docs,store
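4.2 Typical Failure Scenarios
Scenario 1: Cluster status is YELLOW
# Cause: replica shards cannot be allocated
# Troubleshooting steps:
1. GET _cluster/health - confirm the number of unassigned shards
2. GET _cat/shards?v - list the unassigned shards
3. GET _cluster/allocation/explain - get the detailed reason
4. Check disk space: GET _cat/allocation?v
5. Check node status: GET _cat/nodes?v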
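# Temporary workaround (test environments only): drop replicas so the index can go GREEN.
# Restricting cluster.routing.allocation.enable to "primaries" would keep replicas unassigned instead.
PUT <index>/_settings
{
"index": {
"number_of_replicas": 0
}
}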
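Step 2 of the YELLOW checklist is easier to act on when unassigned shards are grouped by reason. The sketch below assumes the output of `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason` has been parsed into a list of dicts; the function name `unassigned_by_reason` and the sample rows are hypothetical.

```python
from collections import defaultdict


def unassigned_by_reason(cat_shards: list[dict]) -> dict[str, list[str]]:
    """Group UNASSIGNED shard copies by their unassigned.reason value."""
    grouped = defaultdict(list)
    for row in cat_shards:
        if row.get("state") != "UNASSIGNED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        reason = row.get("unassigned.reason") or "UNKNOWN"
        grouped[reason].append(f'{row["index"]}[{row["shard"]}] {kind}')
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical sample rows; _cat JSON output returns every value as a string.
    rows = [
        {"index": "logs-app", "shard": "0", "prirep": "p", "state": "STARTED",
         "unassigned.reason": None},
        {"index": "logs-app", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
         "unassigned.reason": "NODE_LEFT"},
        {"index": "logs-app", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
         "unassigned.reason": "NODE_LEFT"},
    ]
    print(unassigned_by_reason(rows))
```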
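Scenario 2: Indexing throughput drops
# Troubleshooting steps:
1. Check JVM GC: GET _nodes/stats/jvm
2. Check thread pools: GET _cat/thread_pool?v
3. Check disk I/O: iostat -x 1
4. Check slow logs: review the indexing slowlog
5. Check shard sizes: GET _cat/indices?v
# Remediation:
- Increase refresh_interval
- Write with the bulk API
- Check whether shards have grown too large (> 50 GB)
- Consider adding data nodes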
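For the bulk-write remediation, the `_bulk` endpoint expects newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end of the body. The sketch below only builds that body and sends nothing; the function name `bulk_body` and the index name `logs-app` are hypothetical.

```python
import json


def bulk_body(index: str, docs: list[dict]) -> str:
    """Build an NDJSON body for POST _bulk that indexes every doc into `index`."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc, ensure_ascii=False))       # source line
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        {"@timestamp": "2026-03-06T10:00:00Z", "level": "INFO",
         "service": "api", "message": "started"},
        {"@timestamp": "2026-03-06T10:00:01Z", "level": "WARN",
         "service": "api", "message": "slow request"},
    ]
    print(bulk_body("logs-app", docs), end="")
```

The body would be sent with the `Content-Type: application/x-ndjson` header. Batch size is a tuning knob to measure rather than guess.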
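Scenario 3: Query timeouts
# Troubleshooting steps:
1. Check hot threads: GET _nodes/hot_threads
2. Analyze slow queries: review the search slowlog
3. Check field mappings: GET <index>/_mapping
4. Check the query cache: GET _nodes/stats/indices/query_cache
# Remediation:
- Avoid deep pagination (use search_after)
- Optimize queries (avoid leading wildcards)
- Put non-scoring clauses in filter context instead of query context
- Increase the query timeout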
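Two of these remediations, search_after pagination and filter context, can be combined in one request body. The sketch below only builds the body as a Python dict; the function name `next_page_query`, the 5s timeout, and the use of `trace_id` as a sort tiebreaker are assumptions. A tiebreaker should be unique per document, so a point in time with its implicit `_shard_doc` tiebreaker is likely the more robust choice in practice.

```python
from typing import Optional


def next_page_query(service: str, since: str, page_size: int = 100,
                    search_after: Optional[list] = None,
                    tiebreaker: str = "trace_id") -> dict:
    """Search body using filter context and search_after instead of from/size paging."""
    body = {
        "size": page_size,
        "timeout": "5s",  # assumed per-request timeout
        "query": {
            "bool": {
                # Filter clauses skip scoring and are eligible for caching.
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}, {tiebreaker: "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(search_after)
    return body


if __name__ == "__main__":
    first = next_page_query("api", "now-1h")
    print(first)
    # Hypothetical sort values copied from the last hit of the first page.
    print(next_page_query("api", "now-1h", search_after=[1772791200000, "abc123"]))
```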
5. Best Practices
5.1 Index Lifecycle Management (ILM)
PUT _ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "2d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "7d",
"actions": {
"set_priority": { "priority": 0 }
}
},
"delete": {
"min_age": "30d",
"actions": { "delete": {} }
}
}
}
}
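The phase thresholds in this policy can be summarized as a lookup from index age to phase. The sketch below mirrors the min_age values above (0, 2d, 7d, 30d); the function name `ilm_phase` is hypothetical. For rolled-over indices ILM measures min_age from the rollover time, and actual transitions happen on ILM's periodic check rather than instantly, so treat the result as the target phase.

```python
# (phase, min_age in days), in the order ILM moves through them.
PHASES = [("hot", 0), ("warm", 2), ("cold", 7), ("delete", 30)]


def ilm_phase(days_since_rollover: float) -> str:
    """Return the phase an index of the given age should be in under logs_policy."""
    if days_since_rollover < 0:
        raise ValueError("age must be non-negative")
    current = PHASES[0][0]
    for name, min_age_days in PHASES:
        if days_since_rollover >= min_age_days:
            current = name
    return current


if __name__ == "__main__":
    for age in (0, 1.5, 2, 7, 29, 30):
        print(age, ilm_phase(age))
```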
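5.2 Backup and Restore
# Register a snapshot repository (the location must be listed under path.repo in elasticsearch.yml on every master and data node)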
PUT _snapshot/backup_repo
{
"type": "fs",
"settings": {
"location": "/mnt/backup/elasticsearch",
"compress": true
}
}
# Create a snapshot
PUT _snapshot/backup_repo/snapshot_20260306
{
"indices": "logs-*",
"ignore_unavailable": true,
"include_global_state": false
}
# Restore a snapshot
POST _snapshot/backup_repo/snapshot_20260306/_restore
{
"indices": "logs-*",
"rename_pattern": "logs-(.+)",
"rename_replacement": "restored-logs-$1"
}
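The date-stamped snapshot name and the effect of rename_pattern / rename_replacement can be previewed offline before running a restore. The sketch below is an approximation in Python: Elasticsearch applies Java regex rules and writes the capture group as `$1`, whereas Python's `re` writes it as `\1`. The function names `snapshot_name` and `restored_name` are hypothetical.

```python
import re
from datetime import date


def snapshot_name(day: date) -> str:
    """Build a snapshot name in the snapshot_YYYYMMDD form used above."""
    return f"snapshot_{day:%Y%m%d}"


def restored_name(index: str, pattern: str = r"logs-(.+)",
                  replacement: str = r"restored-logs-\1") -> str:
    """Preview the index name a restore would produce for the given rename rule."""
    return re.sub(pattern, replacement, index)


if __name__ == "__main__":
    print(snapshot_name(date(2026, 3, 6)))
    print(restored_name("logs-2026.03.06"))
```

Indices that do not match the pattern keep their original name, so restoring them into a cluster where they already exist and are open would fail.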
5.3 Security Configuration
# elasticsearch.yml
xpack.security.enabled: true
xpack.security.enrollment.enabled: false
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/transport.p12
xpack.security.transport.ssl.truststore.path: certs/transport.p12
# Create a user (file realm; the user exists only on the node where the command runs)
bin/elasticsearch-users useradd admin -p <password> -r superuser
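5.4 Performance Checklist
- Use SSD storage
- Keep the JVM out of swap (vm.swappiness=1 minimizes swapping but does not disable it; swapoff -a or bootstrap.memory_lock: true does)
- Size the heap appropriately (16-31 GB, and no more than 50% of RAM)
- Use the G1GC garbage collector
- Keep shard sizes within 20-50 GB
- Set replica counts appropriately
- Enable index compression (best_compression)
- Configure slow query logging
- Apply ILM policies
- Periodically force-merge historical indices that no longer receive writes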
6. Operations Command Cheat Sheet
# Cluster information
GET _cluster/health
GET _cluster/state
GET _cat/nodes?v
GET _cat/indices?v
# Index management
POST <index>/_forcemerge?max_num_segments=1
POST <index>/_refresh
POST <index>/_flush
DELETE <index>
# Shard operations
POST _cluster/reroute?retry_failed=true
POST <index>/_shrink/<target_index>
# Cache management
POST _cache/clear
POST <index>/_cache/clear
# Task management
GET _tasks
POST _tasks/<task_id>/_cancel
Document generated: 2026-03-06 10:00