SRE Daily Topic: Elasticsearch Production Tuning and Monitoring

Date: 2026-03-06
Topic number: 6 (March 6 → 6 % 12 = 6)
Applies to: Elasticsearch 8.x

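1. Production Deployment Architecture

1.1 Recommended Cluster Sizes

Cluster size	Nodes	Use case
Small	3 nodes	Log volume < 100 GB/day
Medium	5-7 nodes	Log volume 100-500 GB/day
Large	9+ nodes	Log volume > 500 GB/day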
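As a quick illustration, the sizing table can be expressed as a lookup. This is only a sketch: the function name `recommend_cluster_size` is hypothetical, and treating exactly 100 and 500 GB/day as "medium" is an assumption about how the table's boundaries should be read.

```python
def recommend_cluster_size(daily_log_gb: float) -> str:
    """Map a daily log volume (GB/day) to the sizing tiers in the table above."""
    if daily_log_gb < 0:
        raise ValueError("daily_log_gb must be non-negative")
    if daily_log_gb < 100:
        return "small (3 nodes)"
    if daily_log_gb <= 500:
        return "medium (5-7 nodes)"
    return "large (9+ nodes)"
```

Real sizing also depends on retention, replica count, and query load, so treat the result as a starting point rather than a capacity plan.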

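1.2 Node Role Separation

# elasticsearch.yml - dedicated master node
cluster.name: production-es
node.name: es-master-01
node.roles: ["master"]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# elasticsearch.yml - data node
# The generic "data" role already covers every data tier, so "data_content" is redundant.
cluster.name: production-es
node.name: es-data-01
node.roles: ["data"]
# Multiple data paths are deprecated since 7.13; prefer a single path per node.
path.data: /mnt/ssd1/elasticsearch,/mnt/ssd2/elasticsearch
path.logs: /var/log/elasticsearch

# elasticsearch.yml - coordinating node that also runs ingest pipelines
# (a coordinating-only node would use node.roles: [])
cluster.name: production-es
node.name: es-coord-01
node.roles: ["ingest", "remote_cluster_client"]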

1.3 JVM Tuning

# jvm.options (on 8.x, prefer placing overrides in a file under config/jvm.options.d/)
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

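Key parameters:

Set -Xms and -Xmx to the same value so the heap is never resized at runtime.
Keep the heap at or below about 31 GB so the JVM can keep using compressed object pointers (compressed oops).
Keep the heap at or below 50% of physical RAM; the remainder is needed for the filesystem cache.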
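The two heap limits above combine into a simple calculation. The following is a minimal sketch; the helper names `recommended_heap_gb` and `jvm_heap_flags` are hypothetical, and rounding down to whole gigabytes is an assumption made for readability.

```python
def recommended_heap_gb(physical_ram_gb: float, oops_limit_gb: int = 31) -> int:
    """Return a heap size in whole GB: at most half of RAM, capped at the compressed-oops limit."""
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    half_ram = int(physical_ram_gb // 2)
    return max(1, min(half_ram, oops_limit_gb))


def jvm_heap_flags(physical_ram_gb: float) -> list:
    """Build matching -Xms/-Xmx flags so the heap is never resized at runtime."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]
```

For a 32 GB host this yields the 16 GB heap used in the jvm.options example; a 128 GB host is capped at 31 GB.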

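2. Core Parameter Tuning

2.1 OS-Level Tuning

# /etc/security/limits.conf
elasticsearch - nofile 65535
elasticsearch - nproc 4096
# memlock only matters when bootstrap.memory_lock: true is set in elasticsearch.yml
elasticsearch - memlock unlimited

# /etc/sysctl.conf
vm.max_map_count = 262144
vm.swappiness = 1
# System-wide limit; many kernels default to a far larger value, so check `sysctl fs.file-max` before lowering it
fs.file-max = 65535

# Apply the settings
sysctl -p
# Affects the current shell only; services started by systemd take LimitNOFILE from the unit file, not limits.conf
ulimit -n 65535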
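A small check script can confirm that the limits actually took effect. This is a sketch under a few assumptions: it runs on Linux, the function names are hypothetical, and the thresholds are the values from the configuration above. It reports the limits of the process that runs it, so it is only meaningful when run as the elasticsearch user.

```python
import resource
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144
REQUIRED_NOFILE = 65535


def find_limit_problems(max_map_count: int, nofile_soft: int) -> list:
    """Compare observed limits with the values recommended above and list any shortfalls."""
    problems = []
    if max_map_count < REQUIRED_MAX_MAP_COUNT:
        problems.append(
            f"vm.max_map_count is {max_map_count}, expected >= {REQUIRED_MAX_MAP_COUNT}"
        )
    # RLIM_INFINITY means "unlimited", which always satisfies the requirement.
    if nofile_soft != resource.RLIM_INFINITY and nofile_soft < REQUIRED_NOFILE:
        problems.append(f"nofile is {nofile_soft}, expected >= {REQUIRED_NOFILE}")
    return problems


def read_current_limits() -> tuple:
    """Read vm.max_map_count from /proc and the soft nofile limit of this process (Linux only)."""
    max_map_count = int(Path("/proc/sys/vm/max_map_count").read_text().strip())
    nofile_soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max_map_count, nofile_soft
```

Calling `find_limit_problems(*read_current_limits())` on a node returns an empty list when both limits are sufficient.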

2.2 Index Template

# Note: 5 primaries + 1 replica = 10 shards per index; with total_shards_per_node = 3
# the cluster needs at least 4 data nodes, otherwise some shards stay unassigned.
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s",
      "index.codec": "best_compression",
      "index.routing.allocation.total_shards_per_node": 3,
      "index.merge.policy.segments_per_tier": 10
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { 
          "type": "text",
          "analyzer": "standard",
          "fielddata": false
        },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

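2.3 Key Parameter Notes

Parameter	Recommended value	Notes
`number_of_shards`	Derived from data volume	Target 20-50 GB per shard and leave 20% headroom for growth
`number_of_replicas`	1 (production)	Required for high availability; read-heavy workloads can use 2
`refresh_interval`	30s	Default is 1s; a longer interval improves bulk-indexing throughput but delays search visibility
`translog.durability`	async	Asynchronous fsync improves write throughput; writes acknowledged since the last sync can be lost if a node crashes
`translog.sync_interval`	5s	Trade-off between throughput and durability (upper bound on the data-loss window)
`codec`	best_compression	Saves disk space at a small CPU cost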
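The shard-count guidance in the first row can be worked through as arithmetic. The sketch below assumes a 40 GB target shard size (a value inside the 20-50 GB range, chosen here for illustration) and applies the 20% growth headroom; the function name `primary_shard_count` is hypothetical.

```python
import math


def primary_shard_count(
    expected_data_gb: float,
    target_shard_gb: float = 40.0,
    growth_headroom: float = 0.20,
) -> int:
    """Estimate primary shards: (data + growth headroom) / target shard size, rounded up."""
    if expected_data_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    sized_gb = expected_data_gb * (1 + growth_headroom)
    return max(1, math.ceil(sized_gb / target_shard_gb))
```

For example, 150 GB of primary data becomes 180 GB with headroom, which needs 5 shards of at most 40 GB each.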

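3. Monitoring

3.1 Core Metrics

# Cluster health
GET _cluster/health

# Node statistics
GET _nodes/stats

# Index statistics, largest first
GET _cat/indices?v&s=store.size:desc

# Shard distribution
GET _cat/shards?v

# Hot threads (the busiest threads on each node)
GET _nodes/hot_threads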

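3.2 Prometheus Scrape Configuration

# prometheus.yml
# Elasticsearch has no built-in Prometheus endpoint: /_prometheus/metrics assumes a community
# exporter plugin is installed. The elasticsearch_* metric names used in this document follow
# prometheus-community/elasticsearch_exporter, which runs as a separate process and serves /metrics.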
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es-master-01:9200']
    metrics_path: '/_prometheus/metrics'
    scrape_interval: 30s

# grafana-dashboard.json (key panels)
{
  "panels": [
    {"title": "Cluster Health", "targets": [{"expr": "elasticsearch_cluster_health_status"}]},
    {"title": "JVM Memory", "targets": [{"expr": "elasticsearch_jvm_memory_used_bytes"}]},
    {"title": "Index Rate", "targets": [{"expr": "rate(elasticsearch_indices_indexing_index_total[5m])"}]},
    {"title": "Search Rate", "targets": [{"expr": "rate(elasticsearch_indices_search_query_total[5m])"}]},
    {"title": "Thread Pool Queue", "targets": [{"expr": "elasticsearch_thread_pool_queue_count"}]}
  ]
}

3.3 Alert Rules

# Prometheus alerting rules file (listed under rule_files in prometheus.yml; not alertmanager.yml)
# Expressions assume the metric and label names of prometheus-community/elasticsearch_exporter.
groups:
  - name: elasticsearch
    rules:
      - alert: ESClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        annotations:
          summary: "ES cluster status is RED"

      - alert: ESNodeDown
        expr: elasticsearch_node_stats_up == 0
        for: 2m
        annotations:
          summary: "ES node is down"

      - alert: ESHighJVM
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 10m
        annotations:
          summary: "ES JVM heap usage above 85%"

      - alert: ESUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 15m
        annotations:
          summary: "Unassigned shards present"
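The 85% heap threshold can also be checked directly against the JSON returned by `GET _nodes/stats/jvm`, without going through Prometheus. This is a sketch: the function names are hypothetical, and it assumes the response has already been fetched and parsed into a dict. It is a point-in-time check and does not reproduce the 10-minute `for` window of the alert rule.

```python
def heap_usage_by_node(nodes_stats: dict) -> dict:
    """Return {node name: heap used / heap max} from a parsed _nodes/stats/jvm response."""
    usage = {}
    for node in nodes_stats.get("nodes", {}).values():
        mem = node["jvm"]["mem"]
        usage[node["name"]] = mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    return usage


def nodes_over_threshold(nodes_stats: dict, threshold: float = 0.85) -> list:
    """List the nodes whose heap usage ratio exceeds the threshold, sorted by name."""
    usage = heap_usage_by_node(nodes_stats)
    return sorted(name for name, ratio in usage.items() if ratio > threshold)
```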

4. Troubleshooting

4.1 Common Diagnostic Commands

# Explain why a shard is unassigned
GET _cluster/allocation/explain

# Enable slow logs (index-level settings; entries are written to the slow log files on each node)
PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}

# Check how evenly shards and disk usage are spread across nodes
GET _cat/allocation?v

# Node resource usage
GET _nodes/stats/jvm,os,process

# Cluster and shard health
GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,node,docs,store

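4.2 Typical Failure Scenarios

Scenario 1: Cluster status is YELLOW

# Cause: one or more replica shards cannot be allocated
# Investigation steps:
1. GET _cluster/health - confirm the number of unassigned shards
2. GET _cat/shards?v - list the unassigned shards
3. GET _cluster/allocation/explain - get the detailed reason
4. Check disk space: GET _cat/allocation?v
5. Check node status: GET _cat/nodes?v

# Temporary workaround (test environments only): drop the replicas so the index turns green
PUT <index>/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}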
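Step 2 of the investigation is easier to read when the shard list is summarized. The sketch below assumes the output of `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason` has been parsed into a list of dicts; the function names are hypothetical.

```python
from collections import Counter


def unassigned_shards(cat_shards: list) -> list:
    """Keep only the rows whose state is UNASSIGNED."""
    return [row for row in cat_shards if row.get("state") == "UNASSIGNED"]


def summarize_unassigned(cat_shards: list) -> dict:
    """Count unassigned shards by (primary or replica, unassigned reason)."""
    counts = Counter()
    for row in unassigned_shards(cat_shards):
        kind = "primary" if row.get("prirep") == "p" else "replica"
        counts[(kind, row.get("unassigned.reason", "unknown"))] += 1
    return dict(counts)
```

Only replicas showing up in the summary matches a YELLOW cluster; any unassigned primary means the cluster is RED.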

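Scenario 2: Indexing throughput drops

# Investigation steps:
1. Check JVM GC: GET _nodes/stats/jvm
2. Check thread pools: GET _cat/thread_pool?v
3. Check disk I/O: iostat -x 1
4. Check the slow logs: review the indexing slow log
5. Check index sizes: GET _cat/indices?v (per-shard sizes: GET _cat/shards?v)

# Remediation:
- Increase refresh_interval
- Use the bulk API for batched writes
- Check whether shards are oversized (> 50 GB)
- Consider adding data nodes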
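Batched writes go through the `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end of the body. A minimal sketch of building such a body follows; the function name `build_bulk_body` is hypothetical, and documents are assumed to already match the strict mapping from section 2.2.

```python
import json


def build_bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON body for POST /_bulk using one 'index' action per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The body is sent with the header `Content-Type: application/x-ndjson`. Batch size is workload-dependent, so it is worth measuring rather than assuming a fixed number of documents per request.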

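Scenario 3: Query timeouts

# Investigation steps:
1. Check hot threads: GET _nodes/hot_threads
2. Analyze slow queries: review the search slow log
3. Check field mappings: GET <index>/_mapping
4. Check caches: GET _nodes/stats/indices/query_cache

# Remediation:
- Avoid deep pagination (use search_after)
- Optimize queries (avoid leading wildcards)
- Put non-scoring clauses in filter context rather than query context
- Increase the query timeout (a stopgap; it does not make the query faster)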
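Two of the measures above, filter context and `search_after`, can be combined in one request body. The sketch below is illustrative only: the function names are hypothetical, and using `trace_id` as the sort tiebreaker is an assumption that holds only if the combination of timestamp and trace ID is unique per document.

```python
def build_page_query(service: str, level: str, page_size: int = 100, search_after=None) -> dict:
    """Build a search body that filters without scoring and pages with search_after."""
    body = {
        "size": page_size,
        "query": {
            "bool": {
                # Filter clauses skip relevance scoring and can be cached.
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                ]
            }
        },
        # A deterministic sort is required for search_after to work.
        "sort": [{"@timestamp": "desc"}, {"trace_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = list(search_after)
    return body


def next_search_after(hits: list):
    """Return the sort values of the last hit, or None when the page is empty."""
    return hits[-1]["sort"] if hits else None
```

Each page passes the `sort` values of the previous page's last hit (`response["hits"]["hits"]`) as `search_after`, so the cost per page stays flat instead of growing with `from`.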

5. Best Practices

5.1 Index Lifecycle Management (ILM)

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
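To reason about where an index should be at a given age, the phase thresholds of this policy can be written as a small lookup. This is a sketch: the names are hypothetical, and it models `min_age` only. For indices that have rolled over, ILM measures `min_age` from the rollover time, and transitions happen on ILM's periodic poll rather than at the exact threshold.

```python
# (phase, min_age in days) taken from the policy above, in ascending order.
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 2), ("cold", 7), ("delete", 30)]


def expected_phase(days_since_rollover: float) -> str:
    """Return the latest phase whose min_age has been reached."""
    if days_since_rollover < 0:
        raise ValueError("days_since_rollover must be non-negative")
    current = "hot"
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if days_since_rollover >= min_age:
            current = phase
    return current
```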

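5.2 Backup and Restore

# Register a snapshot repository (an fs repository needs its location listed under path.repo in elasticsearch.yml and a shared filesystem mounted on every node)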
PUT _snapshot/backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup/elasticsearch",
    "compress": true
  }
}

# Create a snapshot
PUT _snapshot/backup_repo/snapshot_20260306
{
  "indices": "logs-*",
  "ignore_unavailable": true,
  "include_global_state": false
}

# Restore a snapshot, renaming the restored indices so they do not collide with live ones
POST _snapshot/backup_repo/snapshot_20260306/_restore
{
  "indices": "logs-*",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored-logs-$1"
}
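The date-based snapshot name and the effect of `rename_pattern` / `rename_replacement` can be previewed offline. This is a sketch with hypothetical helper names; it uses Python's `re` module, where the capture group is written `\1` instead of the `$1` used in the restore request.

```python
import re
from datetime import date


def snapshot_name(day: date) -> str:
    """Build a name in the snapshot_YYYYMMDD style used above."""
    return f"snapshot_{day:%Y%m%d}"


def restored_index_name(index: str) -> str:
    """Preview the rename applied on restore: logs-(.+) becomes restored-logs-<group 1>."""
    return re.sub(r"logs-(.+)", r"restored-logs-\1", index)
```

Indices that do not match the pattern keep their original name, which is why restoring them would collide with existing indices of the same name.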

5.3 Security Configuration

# elasticsearch.yml
xpack.security.enabled: true
xpack.security.enrollment.enabled: false
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/transport.p12
xpack.security.transport.ssl.truststore.path: certs/transport.p12

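# Create a file-realm user (a password given with -p ends up in shell history; omit -p to be prompted instead)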
bin/elasticsearch-users useradd admin -p <password> -r superuser

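5.4 Performance Checklist

Use SSD storage
Minimize swapping (vm.swappiness=1), or disable swap entirely
Size the heap appropriately (16-31 GB, and never more than 50% of RAM)
Use the G1GC garbage collector
Keep shard size within 20-50 GB
Set replica counts deliberately
Enable index compression (best_compression)
Configure slow logs
Apply an ILM policy
Regularly force-merge older indices that no longer receive writes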

6. Operations Command Quick Reference

# Cluster information
GET _cluster/health
GET _cluster/state
GET _cat/nodes?v
GET _cat/indices?v

# Index management
POST <index>/_forcemerge?max_num_segments=1
POST <index>/_refresh
POST <index>/_flush
DELETE <index>

# Shard operations
POST _cluster/reroute?retry_failed=true
POST <index>/_shrink/<target-index>

# Cache management
POST _cache/clear
POST <index>/_cache/clear

# Task management
GET _tasks
POST _tasks/<task_id>/_cancel

7. References

Document generated: 2026-03-06 10:00
