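SRE Daily Topic: Higress Cloud-Native Gateway Deployment and Production Practices

Date: 2026-03-13
Topic number: 1 (13 % 12 = 1)
Difficulty: ⭐⭐⭐⭐
Use case: deploying a cloud-native gateway in production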


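1. Higress Overview

Higress is a cloud-native gateway open-sourced by Alibaba and built on Envoy and Istio. It serves as:

  • Traffic gateway: north-south traffic entry point
  • Microservice gateway: east-west service governance
  • Security gateway: WAF, authentication, rate limiting
  • AI gateway: unified access to large language model APIs

Key advantages

Feature            Description
High performance   Built on Envoy; 100,000+ QPS per instance
Hot reload         Configuration changes take effect without a restart
Multi-protocol     HTTP/HTTPS/gRPC/Dubbo
Observability      Built-in Prometheus metrics
Pluggable          Extensible through WASM plugins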

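2. Production Deployment

2.1 Prerequisites

# Kubernetes version
# (the --short flag was deprecated and later removed in newer kubectl releases; plain `kubectl version` works everywhere)
kubectl version
# Required: v1.20+

# Helm version
helm version
# Required: v3.0+

# Node resources (minimum for production)
# CPU: 4 cores × 3 nodes
# Memory: 8Gi × 3 nodes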
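
The version requirements above can be checked in a script rather than by eye. The following is a minimal sketch, assuming the version string has already been captured from `kubectl version` or `helm version`; the helper names `parse_version` and `meets_minimum` are hypothetical and not part of any Higress tooling.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from text such as 'Server Version: v1.28.3'."""
    match = re.search(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(found, minimum):
    """Compare numerically, so that v1.9 is correctly older than v1.20."""
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("Server Version: v1.28.3", "v1.20"))  # Kubernetes check
    print(meets_minimum("v3.14.0", "v3.0"))                   # Helm check
```

Comparing integer tuples avoids the classic mistake of comparing version strings lexically, where "1.9" would sort after "1.20".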

2.2 Add the Helm chart repository

helm repo add higress https://higress.io/helm-charts
helm repo update

2.3 Create the namespace

kubectl create namespace higress-system

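2.4 Production values.yaml

# higress-production-values.yaml
# Key names can differ between chart versions; compare with `helm show values higress/higress` before applying.

# ========== Global settings ==========
global:
  # Image registry (Alibaba Cloud mirror, for use inside mainland China)
  imageRepository: registry.cn-hangzhou.aliyuncs.com/higress
  # Image pull policy
  imagePullPolicy: IfNotPresent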

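# ========== Gateway settings ==========
gateway:
  # Replica count (at least 3 in production)
  replicas: 3

  # Resource requests and limits (critical: prevents OOM kills)
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

  # Pod anti-affinity (spread replicas across nodes)
  antiAffinity:
    enabled: true
    type: "preferred"

  # Tolerations (allow scheduling onto control-plane nodes, if needed)
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
    # Newer Kubernetes releases (kubeadm v1.24+) taint control-plane nodes with this key
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"

  # Node selector
  nodeSelector:
    gateway-node: "true"

  # Health checks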
  livenessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3

  readinessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3

  # Service (type LoadBalancer)
  service:
    type: LoadBalancer
    # Alibaba Cloud SLB annotations
    annotations:
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-type: "nlb"
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-spec: "slb.s3.small"
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-charge-type: "paybytraffic"
    # External IP (if a fixed IP is used)
    # loadBalancerIP: "192.168.1.100"
    ports:
      - name: http2
        port: 80
        targetPort: 80
        protocol: TCP
      - name: https
        port: 443
        targetPort: 443
        protocol: TCP

  # Logging
  logging:
    level: "warning"  # production: warning; troubleshooting: debug
    format: "json"    # JSON is easier for log collectors to parse

# ========== Controller settings ==========
controller:
  replicas: 2

  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"

  # Leader election
  leaderElection:
    enabled: true
    leaseDuration: 30s
    renewDeadline: 20s
    retryPeriod: 5s

# ========== Monitoring ==========
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: higress-system
    interval: 30s
    scrapeTimeout: 10s

# ========== TLS/SSL ==========
tls:
  # Enable automatic certificates (Let's Encrypt)
  autoCert:
    enabled: true
    email: "admin@example.com"
    server: "https://acme-v02.api.letsencrypt.org/directory"
  # Or reference a certificate Secret manually
  # secretName: "higress-tls"

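# ========== Rate limiting ==========
rateLimit:
  enabled: true
  redis:
    # Redis address (use a dedicated Redis instance in production)
    host: "redis-master.redis.svc.cluster.local"
    port: 6379
    # Placeholder: supply the real password from a Secret rather than committing it
    password: "your-redis-password"
    db: 0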

# ========== WAF ==========
waf:
  enabled: true
  # Custom rules
  customRules:
    - name: "block-sql-injection"
      action: "block"
      conditions:
        - field: "uri_query"
          operator: "contains"
          value: "union select"
        - field: "uri_query"
          operator: "contains"
          value: "or 1=1"

# ========== Authentication ==========
auth:
  enabled: true
  # JWT authentication
  jwt:
    enabled: true
    issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
    audiences:
      - "higress-gateway"
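
A values file like the one above is easy to get subtly wrong, so a small pre-flight check can help before running `helm install`. The sketch below is illustrative only: `check_values` and `seconds` are hypothetical helpers, the values are passed as a plain dict (loading the YAML is left out), and the timing rule assumes the controller uses Kubernetes client-go leader election, which requires leaseDuration > renewDeadline > retryPeriod × 1.2.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def seconds(duration):
    """Convert a simple duration such as '30s' or '2m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_values(values):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []

    gateway = values["gateway"]
    hpa = gateway["autoscaling"]
    if hpa["enabled"]:
        if hpa["minReplicas"] > hpa["maxReplicas"]:
            problems.append("autoscaling.minReplicas exceeds maxReplicas")
        if gateway["replicas"] < hpa["minReplicas"]:
            problems.append("gateway.replicas is below autoscaling.minReplicas")

    election = values["controller"]["leaderElection"]
    lease = seconds(election["leaseDuration"])
    renew = seconds(election["renewDeadline"])
    retry = seconds(election["retryPeriod"])
    if lease <= renew:
        problems.append("leaseDuration must be greater than renewDeadline")
    # 1.2 is the jitter factor client-go applies to retryPeriod (assumed to apply here).
    if renew <= retry * 1.2:
        problems.append("renewDeadline must be greater than retryPeriod * 1.2")
    return problems


# The relevant subset of the values shown in this section.
EXAMPLE = {
    "gateway": {
        "replicas": 3,
        "autoscaling": {"enabled": True, "minReplicas": 3, "maxReplicas": 10},
    },
    "controller": {
        "leaderElection": {
            "enabled": True,
            "leaseDuration": "30s",
            "renewDeadline": "20s",
            "retryPeriod": "5s",
        }
    },
}

if __name__ == "__main__":
    print(check_values(EXAMPLE))
```

Returning a list of findings instead of raising on the first one lets a CI job report every problem in a single run.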

2.5 Deployment commands

# Install Higress (production)
helm install higress higress/higress \
  -n higress-system \
  -f higress-production-values.yaml \
  --wait \
  --timeout 10m

# Verify the deployment
kubectl get pods -n higress-system
kubectl get svc -n higress-system

# Show release details
helm status higress -n higress-system

2.6 Upgrade commands

# Rolling upgrade (zero downtime)
helm upgrade higress higress/higress \
  -n higress-system \
  -f higress-production-values.yaml \
  --reuse-values \
  --wait

# Roll back to the previous revision
helm rollback higress -n higress-system

# Show release history
helm history higress -n higress-system

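3. Routing Examples

3.1 Basic HTTP routing

# http-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    # pathType (under spec.rules) accepts Exact, Prefix, or ImplementationSpecific
    # rewrite-target rewrites the request path before forwarding; drop it if the backends expect the original path
    nginx.ingress.kubernetes.io/rewrite-target: /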
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

3.2 Canary release

# canary-release.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-canary
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    # Mark this Ingress as the canary and send requests carrying "X-Canary: true" to it
    higress.io/canary: "true"
    higress.io/canary-by-header: "X-Canary"
    higress.io/canary-by-header-value: "true"
    # Or split by weight instead (10% of traffic)
    # higress.io/canary-weight: "10"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service-v2
            port:
              number: 80

3.3 gRPC routing

# grpc-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-service
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  tls:
  - hosts:
    - grpc.example.com
    secretName: grpc-tls
  rules:
  - host: grpc.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grpc-backend
            port:
              number: 50051

3.4 WebSocket support

# websocket-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-app
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  rules:
  - host: ws.example.com
    http:
      paths:
      - path: /ws
        pathType: Prefix
        backend:
          service:
            name: websocket-service
            port:
              number: 8080

4. Key Parameter Tuning

4.1 Envoy connection parameters

# Add under gateway.extraEnvoyConfig in values.yaml
gateway:
  extraEnvoyConfig: |
    # Connect timeout
    connect_timeout: 5s

    # Connection pool
    max_connections: 1024
    max_pending_requests: 1024
    max_requests: 1024
    max_retries: 3

    # HTTP/2
    http2_protocol_options:
      max_concurrent_streams: 100
      initial_stream_window_size: 65536
      initial_connection_window_size: 1048576

    # Keepalive
    keepalive:
      time: 30s
      interval: 10s
      timeout: 5s

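4.2 Timeouts (recommended production values)

Parameter             Recommended   Description
connect_timeout       5s            Timeout for establishing a connection
request_timeout       60s           Total request timeout
idle_timeout          300s          Idle connection timeout
stream_idle_timeout   30s           Idle stream timeout
max_stream_duration   3600s         Maximum stream duration (WebSocket)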

4.3 Rate limiting

# Global rate limit
apiVersion: networking.higress.io/v1
kind: HigressRateLimit
metadata:
  name: global-rate-limit
  namespace: higress-system
spec:
  # Rate-limit scope: global, route, cluster
  domain: higress
  descriptors:
  - key: remote_address
    rate_limit:
      unit: second
      requests_per_unit: 100  # 100 requests per second per IP
  - key: header_match
    value: "api-key"
    rate_limit:
      unit: minute
      requests_per_unit: 1000  # 1000 requests per minute per API key

4.4 Circuit breaking

# Circuit breaker settings (HigressRoute)
apiVersion: networking.higress.io/v1
kind: HigressRoute
metadata:
  name: api-route
  namespace: default
spec:
  hosts:
  - "api.example.com"
  routes:
  - match:
      uri:
        prefix: /api
    route:
    - destination:
        host: api-service
        port: 8080
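    # Circuit breaking
    outlierDetection:
      consecutive5xxErrors: 5      # eject after 5 consecutive 5xx responses
      interval: 30s                 # detection sweep interval
      baseEjectionTime: 30s         # base ejection duration
      maxEjectionPercent: 50        # eject at most 50% of hosts
      minHealthPercent: 30          # minimum share of healthy hosts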
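
To make the interaction between consecutive5xxErrors, baseEjectionTime, and maxEjectionPercent concrete, here is a deliberately simplified model of consecutive-5xx outlier detection. It is a sketch, not Envoy's implementation: the class names are hypothetical, time is passed in explicitly, and the ejection duration is modeled as baseEjectionTime multiplied by the number of times the host has been ejected, ignoring Envoy's upper bound on ejection time and its periodic sweep interval.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0  # simulated clock time at which the host returns


@dataclass
class OutlierDetector:
    consecutive_5xx_errors: int = 5
    base_ejection_time: float = 30.0  # seconds
    max_ejection_percent: int = 50
    hosts: dict = field(default_factory=dict)

    def add_host(self, name):
        self.hosts[name] = HostState()

    def is_ejected(self, name, now):
        return now < self.hosts[name].ejected_until

    def record(self, name, status, now):
        """Record one response; return True if this response ejected the host."""
        host = self.hosts[name]
        if status < 500:
            host.consecutive_5xx = 0  # any non-5xx response resets the streak
            return False
        host.consecutive_5xx += 1
        if host.consecutive_5xx < self.consecutive_5xx_errors:
            return False
        host.consecutive_5xx = 0
        ejected = sum(1 for other in self.hosts if self.is_ejected(other, now))
        # Refuse the ejection if it would push the ejected share above the cap.
        if (ejected + 1) * 100 > self.max_ejection_percent * len(self.hosts):
            return False
        host.times_ejected += 1
        host.ejected_until = now + self.base_ejection_time * host.times_ejected
        return True


if __name__ == "__main__":
    detector = OutlierDetector()
    for name in ("a", "b", "c", "d"):
        detector.add_host(name)
    print([detector.record("a", 503, now=0.0) for _ in range(5)])
    print(detector.is_ejected("a", now=10.0))
```

With four hosts and a 50% cap, the model ejects at most two hosts at a time, which is why a cluster-wide backend failure surfaces as errors from the remaining hosts rather than as an empty pool.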

5. Monitoring and Alerting

5.1 Prometheus metrics

# Key metrics
# Request volume
higress_gateway_requests_total{route, status_code}

# Latency
higress_gateway_request_duration_seconds{route, quantile}

# Connections
higress_gateway_connections_active
higress_gateway_connections_total

# Rate limiting
higress_gateway_rate_limited_requests_total

# Circuit breaking
higress_gateway_circuit_breaker_open

# Certificates
higress_gateway_ssl_cert_expiry_timestamp_seconds

5.2 Grafana dashboard

{
  "dashboard": {
    "title": "Higress Gateway Monitoring",
    "panels": [
      {
        "title": "QPS",
        "targets": [{
          "expr": "sum(rate(higress_gateway_requests_total[1m]))"
        }]
      },
      {
        "title": "P99 latency",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum by (le) (rate(higress_gateway_request_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Error rate",
        "targets": [{
          "expr": "sum(rate(higress_gateway_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(higress_gateway_requests_total[5m]))"
        }]
      },
      {
        "title": "Active connections",
        "targets": [{
          "expr": "higress_gateway_connections_active"
        }]
      }
    ]
  }
}

5.3 Alerting rules (Prometheus)

# higress-alerts.yaml
groups:
- name: higress-alerts
  rules:
  # High error rate
  - alert: HigressHighErrorRate
    expr: |
      sum(rate(higress_gateway_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(higress_gateway_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Higress error rate above 5%"
      description: "Current error rate: {{ $value | humanizePercentage }}"

  # High latency
  - alert: HigressHighLatency
    expr: |
      histogram_quantile(0.99, sum by (le) (rate(higress_gateway_request_duration_seconds_bucket[5m]))) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Higress P99 latency above 1 second"
      description: "Current P99 latency: {{ $value }}s"

  # Pod restarts
  - alert: HigressPodRestarting
    expr: |
      increase(kube_pod_container_status_restarts_total{namespace="higress-system"}[1h]) > 3
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Higress Pod restarting frequently"
      description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"

  # Certificate expiring soon
  - alert: HigressCertExpiring
    expr: |
      (higress_gateway_ssl_cert_expiry_timestamp_seconds - time()) < 86400 * 7
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expires within 7 days"
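
The P99 expressions in this section rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that idea in plain Python so the alert threshold is easier to reason about. The function name matches the PromQL function for readability only; the bucket boundaries in the example are made up, and edge cases that Prometheus handles (such as non-monotonic buckets) are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs with at least one
    finite bucket, ending with (math.inf, total_count).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0)
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: (le, cumulative count).
    sample = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
    p99 = histogram_quantile(0.99, sample)
    print(f"P99 = {p99:.3f}s, alert fires: {p99 > 1}")
```

Because the estimate can never exceed the highest finite bucket bound, a `> 1` threshold only works if the histogram has buckets above 1 second.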

5.4 Monitoring commands

# Show Gateway Pod status
kubectl get pods -n higress-system -o wide

# Show resource usage
kubectl top pods -n higress-system

# Stream live logs
kubectl logs -n higress-system -l app=higress-gateway -f --tail=100

# Dump the Envoy configuration
kubectl exec -n higress-system $(kubectl get pod -n higress-system -l app=higress-gateway -o jsonpath='{.items[0].metadata.name}') -- pilot-agent request GET /config_dump

# Show connection statistics
kubectl exec -n higress-system $(kubectl get pod -n higress-system -l app=higress-gateway -o jsonpath='{.items[0].metadata.name}') -- pilot-agent request GET /stats | grep connection

# Measure latency
for i in {1..100}; do curl -s -o /dev/null -w "%{time_total}\n" https://app.example.com; done | awk '{sum+=$1} END {print "avg:", sum/NR}'

# Load test (ab)
ab -n 10000 -c 100 https://app.example.com/

# Load test (wrk)
wrk -t12 -c400 -d30s https://app.example.com/
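
The curl loop above reports only an average, which hides tail latency. As a small complement, the sketch below summarizes the same `%{time_total}` samples with a nearest-rank percentile. It assumes the samples were saved one per line, for example by redirecting the loop's curl output to a file; the function name `summarize_latencies` is hypothetical.

```python
def summarize_latencies(lines, percentile=99):
    """Summarize latency samples (seconds, one per line) from curl's %{time_total}."""
    samples = sorted(float(line) for line in lines if line.strip())
    if not samples:
        raise ValueError("no latency samples")
    # Nearest-rank percentile: ceil(percentile / 100 * n), in integer arithmetic.
    rank = max(1, (percentile * len(samples) + 99) // 100)
    return {
        "count": len(samples),
        "avg": sum(samples) / len(samples),
        "max": samples[-1],
        f"p{percentile}": samples[rank - 1],
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < latencies.txt
    print(summarize_latencies(sys.stdin))
```

Comparing the reported percentile with the average shows quickly whether a few slow requests are dominating user experience.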

6. Troubleshooting

6.1 General troubleshooting workflow

1. Check Pod status
   kubectl get pods -n higress-system

2. Review Pod events
   kubectl describe pod <pod-name> -n higress-system

3. Read the logs
   kubectl logs <pod-name> -n higress-system

4. Check the Service and Endpoints
   kubectl get svc,ep -n higress-system

5. Check the Ingress configuration
   kubectl get ingress -A
   kubectl describe ingress <ingress-name>

6. Check the route configuration
   kubectl get higressroute -A

7. Verify DNS resolution
   nslookup app.example.com
   dig app.example.com

8. Test connectivity
   curl -v https://app.example.com

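6.2 Typical failure scenarios

Scenario 1: 502 Bad Gateway

# Cause: the backend service is unavailable
# Steps:

# 1. Check backend Pod status
kubectl get pods -n default -l app=web-service

# 2. Check Endpoints
kubectl get endpoints web-service -n default

# 3. Look for upstream errors in the Gateway logs
#    (with a label selector, kubectl logs shows only the last 10 lines per Pod unless --tail is set)
kubectl logs -n higress-system -l app=higress-gateway --tail=1000 | grep "upstream"

# 4. Test the backend directly
kubectl exec -n default <backend-pod> -- curl localhost:8080/health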

Scenario 2: 503 Service Unavailable

# Cause: no healthy backend instances, or the circuit breaker has tripped
# Steps:

# 1. Check circuit breaker state
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep circuit_breaker

# 2. Check rate-limit state
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep rate_limit

# 3. Look for failed health checks
kubectl logs -n higress-system -l app=higress-gateway --tail=1000 | grep "health_check"

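Scenario 3: SSL/TLS certificate problems

# Cause: the certificate has expired or is misconfigured
# Steps:

# 1. Check the validity period of the served certificate
echo | openssl s_client -connect app.example.com:443 2>/dev/null | openssl x509 -noout -dates

# 2. Check the certificate stored in the Secret
kubectl get secret higress-tls -n higress-system -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# 3. Verify the certificate chain (stdin is closed so the command exits after the handshake)
openssl s_client -connect app.example.com:443 -showcerts </dev/null

# 4. Check automatic certificate status (if using Let's Encrypt)
kubectl get certificaterequest -n higress-system
kubectl describe certificaterequest <request-name> -n higress-system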
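
The `openssl x509 -noout -dates` output used in steps 1 and 2 can be turned into a "days remaining" figure for scripts or checklists. The following is a minimal sketch, assuming the command output has been captured as text; `days_until_expiry` is a hypothetical helper, and it relies on the standard library's `ssl.cert_time_to_seconds` to parse the OpenSSL date format.

```python
import ssl
import time


def days_until_expiry(openssl_dates_output, now=None):
    """Return the days left before the notAfter date printed by `openssl x509 -noout -dates`."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            not_after = line.split("=", 1)[1].strip()
            # Parses dates such as "Jan  1 00:00:00 2030 GMT" into epoch seconds.
            expires = ssl.cert_time_to_seconds(not_after)
            current = time.time() if now is None else now
            return (expires - current) / 86400
    raise ValueError("no notAfter= line found")


if __name__ == "__main__":
    example = "notBefore=Jan  1 00:00:00 2029 GMT\nnotAfter=Jan  1 00:00:00 2030 GMT\n"
    remaining = days_until_expiry(example)
    print(f"{remaining:.1f} days remaining; renew soon: {remaining < 30}")
```

The optional `now` argument exists so the function can be tested against a fixed clock.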

Scenario 4: Route does not match

# Cause: Ingress misconfiguration or a path mismatch
# Steps:

# 1. Inspect the Ingress
kubectl get ingress <name> -o yaml

# 2. Inspect Higress route resources
kubectl get higressroute -A -o yaml

# 3. Inspect the Envoy route table (routes live in the RoutesConfigDump section of the dump)
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /config_dump | jq '.configs[] | select(.["@type"] | endswith("RoutesConfigDump"))'

# 4. Test different paths
curl -v -H "Host: app.example.com" http://<gateway-ip>/api
curl -v -H "Host: app.example.com" http://<gateway-ip>/static

Scenario 5: Performance degradation

# Cause: insufficient resources or poor configuration
# Steps:

# 1. Check resource usage
kubectl top pods -n higress-system

# 2. Check connection counts
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep connection

# 3. Check request queues
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep queue

# 4. Find slow requests (100 ms or more) in the logs
kubectl logs -n higress-system -l app=higress-gateway --tail=1000 | grep -E "duration.*[1-9][0-9]{2,}ms"

# 5. Check for OOM kills
kubectl describe pod -n higress-system | grep -A5 "OOM"

6.3 Debugging tools

# Enable debug logging
# (kubectl set env adds or updates one variable; a JSON-patch "replace" on the env path would overwrite the whole env list)
kubectl set env deploy/higress-gateway -n higress-system LOG_LEVEL=debug

# Capture an Envoy configuration snapshot
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /config_dump > envoy-config.json

# Check admin readiness and export stats in Prometheus format
kubectl exec -n higress-system <gateway-pod> -- curl -s localhost:15000/ready
kubectl exec -n higress-system <gateway-pod> -- curl -s localhost:15000/stats/prometheus > metrics.prom

# Packet capture (uses an ephemeral debug container)
kubectl debug -n higress-system <gateway-pod> -it --image=nicolaka/netshoot -- tcpdump -i any 'port 80 or port 443'

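7. Best Practices

7.1 Deployment best practices

Practice                    Purpose                                  Recommended setting
Multiple replicas           Avoid a single point of failure          At least 3 replicas
Multi-AZ deployment         Improve disaster tolerance               Pod anti-affinity + multiple AZs
Resource limits             Prevent resource exhaustion              Set requests/limits
PodDisruptionBudget (PDB)   Keep the gateway available in upgrades   minAvailable: 2
Health checks               Detect failures quickly                  5s interval, 3 failures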

7.2 Security best practices

# 1. Enable mTLS
apiVersion: security.higress.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: higress-system
spec:
  mtls:
    mode: STRICT

---
# 2. Configure WAF rules
apiVersion: security.higress.io/v1
kind: WafPolicy
metadata:
  name: default-waf
  namespace: higress-system
spec:
  rules:
  - name: sql-injection
    action: BLOCK
    conditions:
    - field: ARGS
      operator: CONTAINS
      value: "(?i)(union.*select|select.*from)"

  - name: xss-protection
    action: BLOCK
    conditions:
    - field: ARGS
      operator: CONTAINS
      value: "(?i)(<script|javascript:)"

---
# 3. IP allowlist
apiVersion: networking.higress.io/v1
kind: HigressGateway
metadata:
  name: internal-gateway
spec:
  accessLog:
  - filter:
      remoteIp:
        cidr: "10.0.0.0/8"

7.3 Performance best practices

# 1. Enable HTTP/2
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/http2: "true"

# 2. Enable gzip compression
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-gzip: "true"
    nginx.ingress.kubernetes.io/gzip-types: "text/plain,text/css,application/json,application/javascript"
    nginx.ingress.kubernetes.io/gzip-min-length: "256"

# 3. Configure the connection pool
# Inside a HigressRoute
spec:
  routes:
  - route:
    - destination:
        host: backend-service
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000

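7.4 Operations best practices

# 1. Back up the configuration regularly
kubectl get ingress,higressroute,virtualservice -A -o yaml > higress-config-backup-$(date +%Y%m%d).yaml

# 2. Monitor certificates (alert 30 days before expiry)
# Use cert-manager + Prometheus

# 3. Audit configuration changes
# Enable the Kubernetes audit log

# 4. Load-test regularly
# Run a full end-to-end load test once a month

# 5. Run disaster-recovery drills
# Run a failover drill once a quarter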

8. Configuration Template Quick Reference

8.1 Complete Ingress template

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-app
  namespace: production
  annotations:
    kubernetes.io/ingress.class: higress
    # TLS
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    # Redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://*.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization"
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: backend
            port:
              number: 8080
      - path: /health
        pathType: Exact
        backend:
          service:
            name: frontend
            port:
              number: 80

8.2 HigressRoute template

apiVersion: networking.higress.io/v1
kind: HigressRoute
metadata:
  name: api-route
  namespace: production
spec:
  hosts:
  - "api.example.com"
  http:
  - name: "api-v1"
    match:
    - uri:
        prefix: "/api/v1"
    route:
    - destination:
        host: api-v1-service
        port:
          number: 8080
      weight: 90
    - destination:
        host: api-v2-service
        port:
          number: 8080
      weight: 10
    timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: "5xx,reset,connect-failure"
    fault:
      delay:
        percentage:
          value: 0.1
        fixedDelay: 100ms
    corsPolicy:
      allowOrigins:
      - exact: "https://app.example.com"
      allowMethods:
      - GET
      - POST
      allowHeaders:
      - Authorization
      - Content-Type
      exposeHeaders:
      - X-Request-Id
      maxAge: 24h
      allowCredentials: true
    rateLimit:
      type: Local
      qps: 100
      burst: 200
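
The `qps` and `burst` fields in the rateLimit block above read naturally as a token bucket: tokens refill at `qps` per second, the bucket holds at most `burst` tokens, and each request spends one. Whether Higress implements local rate limiting exactly this way is an assumption; the sketch below only illustrates that interpretation, with a hypothetical `TokenBucket` class and an explicit clock so the behavior is deterministic.

```python
class TokenBucket:
    """Token-bucket limiter: sustained rate `qps`, short bursts up to `burst`."""

    def __init__(self, qps, burst, now=0.0):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(qps=100, burst=200)
    admitted_at_start = sum(bucket.allow(0.0) for _ in range(250))
    admitted_later = sum(bucket.allow(0.5) for _ in range(100))
    print(admitted_at_start, admitted_later)
```

Under this model a client can send up to `burst` requests at once after an idle period, but its long-run rate is still bounded by `qps`.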

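9. References


10. Today's Checklist

  • Check Gateway Pod health
  • Verify SSL certificate validity (> 30 days remaining)
  • Check the error rate (< 1%)
  • Check P99 latency (< 500ms)
  • Review how often rate limiting triggered
  • Check circuit breaker status
  • Back up the current configuration
  • Review recently changed Ingress configurations

Document generated: 2026-03-13 10:00 CST
Next topic: 2026-03-14 - Redis Production Configuration and Performance Tuning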
