SRE Daily Topic: Elasticsearch Production Optimization and Monitoring

Date: 2026-03-06
Topic number: 6 (March 6 → 6 % 12 = 6)
Applies to: Elasticsearch 8.x


1. Production Deployment Architecture

1.1 Recommended Cluster Size

Cluster size   Node count   Use case
Small          3            Log volume < 100 GB/day
Medium         5-7          Log volume 100-500 GB/day
Large          9+           Log volume > 500 GB/day

1.2 Dedicated Node Role Configuration

# elasticsearch.yml - Master node
cluster.name: production-es
node.name: es-master-01
node.roles: ["master"]
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

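# elasticsearch.yml - Data node
cluster.name: production-es
node.name: es-data-01
# The generic "data" role already covers every data tier (content, hot, warm, cold),
# so "data_content" does not need to be listed next to it.
node.roles: ["data"]
# Multiple data paths still work in 8.x but have been deprecated since 7.13;
# prefer a single path for new clusters.
path.data: /mnt/ssd1/elasticsearch,/mnt/ssd2/elasticsearch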
path.logs: /var/log/elasticsearch

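# elasticsearch.yml - Coordinating node (also runs ingest pipelines)
cluster.name: production-es
node.name: es-coord-01
# A coordinating-only node would use node.roles: []; the roles below add
# ingest processing and cross-cluster client capability on top of coordination.
node.roles: ["ingest", "remote_cluster_client"]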

1.3 JVM Tuning

# jvm.options (in 8.x, custom overrides belong in a file under config/jvm.options.d/)
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

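Key parameters:

  • -Xms and -Xmx must be equal so the heap is never resized at runtime
  • Keep the heap at or below 31 GB so the JVM can keep using compressed object pointers
  • Keep the heap at or below 50% of physical memory; the remainder serves the OS filesystem cache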
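
The heap rules above can be expressed as a small calculation. This is a minimal sketch rather than part of the original configuration: the function names are hypothetical, and rounding down to whole gigabytes is an assumption.

```python
from typing import List


def recommended_heap_gb(physical_ram_gb: float, max_heap_gb: int = 31) -> int:
    """Return a heap size in whole GB: at most half of RAM, capped at max_heap_gb."""
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    half_ram = int(physical_ram_gb // 2)  # rounding down is an assumption
    return max(1, min(half_ram, max_heap_gb))


def jvm_heap_flags(physical_ram_gb: float) -> List[str]:
    """Build matching -Xms/-Xmx flags so the heap is never resized at runtime."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, jvm_heap_flags(ram))
```

A 32 GB host yields the 16 GB heap used in the jvm.options example, while hosts with 64 GB or more hit the 31 GB cap.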

2. Core Parameter Tuning

2.1 System-Level Tuning

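# /etc/security/limits.conf
# Note: limits.conf is not applied to systemd services; for systemd-managed
# installs, set LimitNOFILE / LimitMEMLOCK in a systemd override instead.
elasticsearch - nofile 65535
elasticsearch - nproc 4096
elasticsearch - memlock unlimited

# /etc/sysctl.conf
vm.max_map_count = 262144
vm.swappiness = 1
# fs.file-max is the system-wide limit and must stay well above the per-process
# nofile limit. Many kernels already default to a larger value, so check
# `sysctl fs.file-max` before applying this line.
fs.file-max = 65535

# Apply the settings
sysctl -p
# ulimit only changes the limit for the current shell session
ulimit -n 65535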
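
To confirm that a host's sysctl file carries the values listed above, a check along these lines can help. It is a sketch under stated assumptions: the function names and the message format are hypothetical, and it only parses file content passed in as text rather than reading live kernel values.

```python
from typing import Dict, List

# Thresholds taken from the sysctl.conf example above.
MINIMUMS = {"vm.max_map_count": 262144}
MAXIMUMS = {"vm.swappiness": 1}


def parse_sysctl(text: str) -> Dict[str, int]:
    """Parse 'key = value' lines, ignoring comments and non-numeric values."""
    values: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value.isdigit():
            values[key] = int(value)
    return values


def check_sysctl(text: str) -> List[str]:
    """Return one message per setting that is missing or outside its bound."""
    values = parse_sysctl(text)
    problems: List[str] = []
    for key, minimum in MINIMUMS.items():
        actual = values.get(key)
        if actual is None or actual < minimum:
            problems.append(f"{key}: want >= {minimum}, found {actual}")
    for key, maximum in MAXIMUMS.items():
        actual = values.get(key)
        if actual is None or actual > maximum:
            problems.append(f"{key}: want <= {maximum}, found {actual}")
    return problems


if __name__ == "__main__":
    sample = "vm.max_map_count = 65530\nvm.swappiness = 60\n"
    for problem in check_sysctl(sample):
        print(problem)
```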

2.2 Index Template

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s",
      "index.codec": "best_compression",
      "index.routing.allocation.total_shards_per_node": 3,
      "index.merge.policy.merge_factor": 10,
      "index.merge.policy.segments_per_tier": 10
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { 
          "type": "text",
          "analyzer": "standard",
          "fielddata": false
        },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

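2.3 Key Parameter Notes

Parameter                Recommended value      Notes
number_of_shards         Based on data volume   Target 20-50 GB per shard and reserve 20% for growth
number_of_replicas       1 (production)         Required for high availability; read-heavy workloads can use 2
refresh_interval         30s                    Default is 1s; a larger value improves bulk indexing throughput
translog.durability      async                  Syncs the translog in the background for faster writes; a crash can lose writes acknowledged since the last sync
translog.sync_interval   5s                     Bounds that loss window; trades performance against durability
codec                    best_compression       Saves disk space at a small CPU cost

Note: with 5 primaries and 1 replica each index has 10 shard copies, so total_shards_per_node: 3 needs at least 4 data nodes. On a 3-node cluster one copy stays unassigned and the index turns YELLOW.

Note: Elasticsearch 8.x ships built-in templates for the logs-*-* pattern with priority 100. An index name that matches both patterns gets the built-in template unless this template is given a higher priority or a non-overlapping pattern.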
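
The shard sizing rule in the table can be worked through as a short calculation. This is a sketch, not a sizing tool: the function names are hypothetical, the 40 GB default target is an assumed midpoint of the 20-50 GB range, and integer arithmetic is used to avoid floating-point rounding at exact boundaries.

```python
def primary_shard_count(index_size_gb: int,
                        target_shard_gb: int = 40,
                        growth_pct: int = 20) -> int:
    """Primaries needed so each shard stays near target_shard_gb after growth."""
    if index_size_gb < 0 or target_shard_gb <= 0 or growth_pct < 0:
        raise ValueError("sizes must be non-negative and the target positive")
    projected_x100 = index_size_gb * (100 + growth_pct)  # size incl. growth, x100
    per_shard_x100 = target_shard_gb * 100
    # Ceiling division on integers; always return at least one primary.
    return max(1, -(-projected_x100 // per_shard_x100))


def min_data_nodes(primaries: int, replicas: int, shards_per_node_limit: int) -> int:
    """Data nodes needed to place every copy under total_shards_per_node."""
    copies = primaries * (1 + replicas)
    by_limit = -(-copies // shards_per_node_limit)
    # A primary and its replicas can never share a node.
    return max(by_limit, replicas + 1)


if __name__ == "__main__":
    print(primary_shard_count(150))   # 150 GB index with 20% headroom
    print(min_data_nodes(5, 1, 3))    # settings from the template above
```

With the template's values (5 primaries, 1 replica, at most 3 shards of the index per node), `min_data_nodes` returns 4.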

3. Monitoring

3.1 Core Monitoring Metrics

# Cluster health
GET _cluster/health

# Node statistics
GET _nodes/stats

# Index statistics, largest first
GET _cat/indices?v&s=store.size:desc

# Shard distribution
GET _cat/shards?v

# Hot threads
GET _nodes/hot_threads
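
The response of GET _cluster/health can be turned into a short list of findings for a daily check. The sketch below assumes the response has already been fetched and decoded into a dict; the function name, the `expected_nodes` parameter, and the sample values are hypothetical, while the field names are those returned by the cluster health API.

```python
from typing import Any, Dict, List


def summarize_health(health: Dict[str, Any], expected_nodes: int) -> List[str]:
    """Return human-readable findings; an empty list means nothing to report."""
    findings: List[str] = []
    status = health.get("status", "unknown")
    if status != "green":
        findings.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        findings.append(f"{expected_nodes - nodes} node(s) missing ({nodes}/{expected_nodes})")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shard(s)")
    pending = health.get("number_of_pending_tasks", 0)
    if pending > 0:
        findings.append(f"{pending} pending cluster task(s)")
    return findings


if __name__ == "__main__":
    # Hypothetical response for a 3-node cluster that lost one node.
    sample = {
        "status": "yellow",
        "number_of_nodes": 2,
        "unassigned_shards": 5,
        "number_of_pending_tasks": 0,
    }
    for finding in summarize_health(sample, expected_nodes=3):
        print(finding)
```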

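3.2 Prometheus Monitoring Configuration

# prometheus.yml
# The elasticsearch_* metric names used in this section are those of the
# prometheus-community elasticsearch_exporter, which serves /metrics on port
# 9114 by default. "es-exporter" is a placeholder host name.
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es-exporter:9114']
    metrics_path: '/metrics'
    scrape_interval: 30s

# grafana-dashboard.json (key panels)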
{
  "panels": [
    {"title": "Cluster Health", "targets": [{"expr": "elasticsearch_cluster_health_status"}]},
    {"title": "JVM Memory", "targets": [{"expr": "elasticsearch_jvm_memory_used_bytes"}]},
    {"title": "Index Rate", "targets": [{"expr": "rate(elasticsearch_indices_indexing_index_total[5m])"}]},
    {"title": "Search Rate", "targets": [{"expr": "rate(elasticsearch_indices_search_query_total[5m])"}]},
    {"title": "Thread Pool Queue", "targets": [{"expr": "elasticsearch_thread_pool_queue_count"}]}
  ]
}

3.3 Alert Rules

# Prometheus alerting rules (a rule file loaded via rule_files in prometheus.yml, not alertmanager.yml)
groups:
  - name: elasticsearch
    rules:
      - alert: ESClusterRed
        # The exporter reports one series per color, set to 1 for the current status
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        annotations:
          summary: "ES cluster status is RED"

      - alert: ESNodeDown
        expr: elasticsearch_node_stats_up == 0
        for: 2m
        annotations:
          summary: "ES node stats scrape is failing (node may be down)"

      - alert: ESHighJVM
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 10m
        annotations:
          summary: "ES JVM heap usage above 85%"

      - alert: ESUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 15m
        annotations:
          summary: "Unassigned shards present"

4. Troubleshooting

4.1 Common Diagnostic Commands

# Explain why a shard is unassigned
GET _cluster/allocation/explain

# Set slow log thresholds (index-level settings, applied per index or pattern)
PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}

# Check shard balance and disk usage per node
GET _cat/allocation?v

# View node resource usage
GET _nodes/stats/jvm,os,process

# Check cluster and shard health
GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,node,docs,store

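4.2 Typical Failure Scenarios

Scenario 1: Cluster status YELLOW

# Cause: replica shards cannot be allocated
# Diagnostic steps:
1. GET _cluster/health - confirm the number of unassigned shards
2. GET _cat/shards?v - list the unassigned shards
3. GET _cluster/allocation/explain - get the detailed reason
4. Check disk space: GET _cat/allocation?v
5. Check node status: GET _cat/nodes?v

# Temporary workaround (test environments only): drop the replicas
PUT <index>/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}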
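
When many shards are unassigned, grouping them by reason shortens step 2 of the diagnosis. The sketch below assumes the rows come from `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`; the function name and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def unassigned_by_reason(rows: Iterable[Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count UNASSIGNED shards per (primary|replica, unassigned.reason)."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("state") != "UNASSIGNED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        reason = row.get("unassigned.reason") or "UNKNOWN"
        counts[(kind, reason)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical rows in the shape returned by _cat/shards?format=json.
    rows = [
        {"index": "logs-app", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
        {"index": "logs-app", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
        {"index": "logs-app", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    ]
    for (kind, reason), count in unassigned_by_reason(rows).items():
        print(kind, reason, count)
```

Only replica entries in the result match the YELLOW scenario; any primary entry means the cluster is RED for that index.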

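Scenario 2: Write performance degradation

# Diagnostic steps:
1. Check JVM GC: GET _nodes/stats/jvm
2. Check thread pools: GET _cat/thread_pool?v
3. Check disk I/O: iostat -x 1
4. Check the slow log: review the indexing slowlog
5. Check shard sizes: GET _cat/indices?v

# Mitigations:
- Increase refresh_interval
- Write in batches with the bulk API
- Check for oversized shards (> 50 GB)
- Consider adding data nodes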
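
Batching writes means sending newline-delimited JSON to the _bulk endpoint, one action line followed by one document line per record. The sketch below only builds the request body and leaves the HTTP call out; the function names, the batch size, and the sample documents are hypothetical, and it assumes a regular index (data streams require the `create` action instead of `index`).

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `size` documents."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[Dict[str, Any]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_bulk_body(index: str, docs: Iterable[Dict[str, Any]]) -> str:
    """Build an NDJSON _bulk body; every line, including the last, ends in a newline."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    docs = [{"level": "INFO", "service": "api", "message": f"event {i}"} for i in range(5)]
    for batch in chunked(docs, size=2):
        print(build_bulk_body("logs-app", batch), end="")
```

Each body would be sent as `POST _bulk` with the `Content-Type: application/x-ndjson` header.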

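Scenario 3: Query timeouts

# Diagnostic steps:
1. Check hot threads: GET _nodes/hot_threads
2. Analyze slow queries: review the search slowlog
3. Check field mappings: GET <index>/_mapping
4. Check the cache: GET _nodes/stats/indices/query_cache

# Mitigations:
- Avoid deep pagination (use search_after)
- Optimize queries (avoid leading wildcards)
- Put non-scoring clauses in filter context instead of query context
- Raise the query timeout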
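
Two of the mitigations above, search_after pagination and filter context, can be combined in one request body. This sketch builds the body as a plain dict for `POST logs-*/_search`; the function names and sample hit are hypothetical, the field names come from the index template, and using trace_id as the sort tiebreaker is an assumption that holds only if it is unique enough for the result set.

```python
from typing import Any, Dict, List, Optional


def build_page_query(service: str, level: str, size: int = 100,
                     search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build a search body that filters without scoring and pages by sort values."""
    body: Dict[str, Any] = {
        "size": size,
        "timeout": "5s",
        # term clauses in filter context skip scoring and can be cached
        "query": {"bool": {"filter": [
            {"term": {"service": service}},
            {"term": {"level": level}},
        ]}},
        # a stable sort is required for search_after; trace_id is the assumed tiebreaker
        "sort": [{"@timestamp": "desc"}, {"trace_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = list(search_after)
    return body


def next_cursor(hits: List[Dict[str, Any]]) -> Optional[List[Any]]:
    """Return the sort values of the last hit, or None when the page is empty."""
    return hits[-1]["sort"] if hits else None


if __name__ == "__main__":
    first = build_page_query("checkout", "ERROR")
    # Hypothetical hits as they would appear under hits.hits in the response.
    hits = [{"_id": "1", "sort": [1772791200000, "abc123"]}]
    second = build_page_query("checkout", "ERROR", search_after=next_cursor(hits))
    print(second["search_after"])
```

Paging stops when `next_cursor` returns None, so no `from` offset is ever sent.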

5. Best Practices

5.1 Index Lifecycle Management (ILM)

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
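
The min_age thresholds in the policy determine which phase an index should be in at a given age. The sketch below is a simplified model with hypothetical names: it ignores the ILM poll interval and the time actions take to run, and it assumes age is measured from rollover, which is how ILM counts min_age for rolled-over indices.

```python
from typing import List, Tuple

# (phase, min_age in days) from the policy above, checked from oldest to youngest.
PHASE_MIN_AGE_DAYS: List[Tuple[str, float]] = [
    ("delete", 30),
    ("cold", 7),
    ("warm", 2),
    ("hot", 0),
]


def phase_for_age(age_days: float) -> str:
    """Return the latest phase whose min_age the index has reached."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            return phase
    return "hot"


if __name__ == "__main__":
    for age in (0, 1.5, 2, 7, 29, 30):
        print(age, phase_for_age(age))
```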

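5.2 Backup and Restore

# Create a snapshot repository
# (the location must be listed under path.repo in elasticsearch.yml on every
# master and data node, and must point at a shared filesystem)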
PUT _snapshot/backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup/elasticsearch",
    "compress": true
  }
}

# Create a snapshot
PUT _snapshot/backup_repo/snapshot_20260306
{
  "indices": "logs-*",
  "ignore_unavailable": true,
  "include_global_state": false
}

# Restore a snapshot
POST _snapshot/backup_repo/snapshot_20260306/_restore
{
  "indices": "logs-*",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored-logs-$1"
}
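
The date-based snapshot name and the effect of rename_pattern / rename_replacement can be previewed before running a restore. This is a sketch with hypothetical function names; it translates the Java-style `$1` group reference into Python's `\1` and uses Python's `re` module, whose behaviour matches Java's for simple patterns like the one above but is not guaranteed to match for every regular expression.

```python
import re
from datetime import date


def snapshot_name(day: date) -> str:
    """Build a name in the snapshot_YYYYMMDD form used above."""
    return f"snapshot_{day:%Y%m%d}"


def restored_name(index: str, rename_pattern: str, rename_replacement: str) -> str:
    """Preview the index name a restore would produce for the given rename rule."""
    # Java writes group references as $1; Python's re.sub expects \1.
    python_replacement = re.sub(r"\$(\d+)", r"\\\1", rename_replacement)
    return re.sub(rename_pattern, python_replacement, index)


if __name__ == "__main__":
    print(snapshot_name(date(2026, 3, 6)))
    print(restored_name("logs-2026.03.06", "logs-(.+)", "restored-logs-$1"))
```

Indices that do not match the pattern keep their original name, so a restore into a cluster that still holds them would conflict.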

5.3 Security Configuration

# elasticsearch.yml
xpack.security.enabled: true
xpack.security.enrollment.enabled: false
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/transport.p12
xpack.security.transport.ssl.truststore.path: certs/transport.p12

# Create a user (file realm, local to this node)
bin/elasticsearch-users useradd admin -p <password> -r superuser

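5.4 Performance Checklist

  • Use SSD storage
  • Minimize swapping (vm.swappiness=1), or disable swap entirely
  • Size the heap appropriately (16-31 GB)
  • Use the G1GC garbage collector
  • Keep shard sizes between 20 and 50 GB
  • Set the replica count to match availability and read load
  • Enable index compression
  • Configure slow logs
  • Apply an ILM policy
  • Periodically force-merge historical indices that are no longer written to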

6. Common Operations Quick Reference

# Cluster information
GET _cluster/health
GET _cluster/state
GET _cat/nodes?v
GET _cat/indices?v

# Index management
POST <index>/_forcemerge?max_num_segments=1
POST <index>/_refresh
POST <index>/_flush
DELETE <index>

# Shard operations
POST _cluster/reroute?retry_failed=true
POST <source-index>/_shrink/<target-index>

# Cache management
POST _cache/clear
POST <index>/_cache/clear

# Task management
GET _tasks
POST _tasks/<task_id>/_cancel


Document generated: 2026-03-06 10:00
