SRE Daily Topic #12: CI/CD Pipeline Design and Practice

Date: 2026-03-12
Topic index: 12 (12 % 12 = 0 → topic 12)
Author: SRE Team


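Contents

  1. Overview
  2. CI/CD Architecture Design
  3. Complete GitLab CI/CD Configuration
  4. GitHub Actions Configuration Template
  5. Jenkins Pipeline Best Practices
  6. Docker Image Build Optimization
  7. Kubernetes Deployment Strategies
  8. Environment Management and Configuration
  9. Quality Gates and Testing Strategy
  10. Monitoring and Observability
  11. Troubleshooting Handbook
  12. Security Best Practices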

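Overview

Core Value of CI/CD

  • Continuous Integration (CI): merge code into the mainline frequently, with an automated build and test run on every change
  • Continuous Delivery (CD): automate the release process so that any build can be deployed to production at any time
  • Continuous Deployment: deploy to production automatically once all tests pass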

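Key Metrics

Metric                  Target                Description
Build time              < 10 minutes          Duration of a single CI pipeline run
Deployment frequency    Multiple times a day  Number of production deployments
Change failure rate     < 5%                  Share of deployments that cause a failure
Recovery time (MTTR)    < 1 hour              Mean time to recover from a failure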
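
As a rough illustration of how two of these metrics could be derived from deployment records, here is a minimal Python sketch. The `Deployment` record shape and the function names are assumptions made for this example; they do not come from any CI tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    deployed_at: datetime
    caused_incident: bool = False
    # Only meaningful when caused_incident is True.
    recovered_at: Optional[datetime] = None


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused an incident (0.0 for an empty list)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average time from a failing deployment to recovery; None if nothing failed."""
    durations = [
        d.recovered_at - d.deployed_at
        for d in deployments
        if d.caused_incident and d.recovered_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def meets_targets(deployments: list[Deployment]) -> bool:
    """Check the two table targets: failure rate < 5% and MTTR < 1 hour."""
    mttr = mean_time_to_recovery(deployments)
    return change_failure_rate(deployments) < 0.05 and (
        mttr is None or mttr < timedelta(hours=1)
    )
```

The sketch measures recovery from the deployment timestamp; a team that tracks incident start times separately would probably measure from those instead.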

CI/CD Architecture Design

Reference Architecture Diagram

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Developer │───▶│  Git Repo   │───▶│  CI Runner  │
└─────────────┘    └─────────────┘    └─────────────┘
                                              │
                                              ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Production │◀───│   Staging   │◀───│  Registry   │
└─────────────┘    └─────────────┘    └─────────────┘

Component Selection

Component                 Recommended                  Alternatives
Source hosting            GitLab / GitHub              Bitbucket
CI engine                 GitLab CI / GitHub Actions   Jenkins / CircleCI
Image registry            Harbor / Docker Hub          ECR / GCR
Deployment target         Kubernetes                   ECS / VM
Configuration management  Helm / Kustomize             Ansible

Complete GitLab CI/CD Configuration

Full .gitlab-ci.yml Example

# .gitlab-ci.yml
stages:
  - validate
  - build
  - test
  - security
  - deploy-staging
  - deploy-production

variables:
  DOCKER_REGISTRY: registry.example.com
  DOCKER_IMAGE: $CI_PROJECT_PATH
  KUBE_NAMESPACE: $CI_PROJECT_NAME-$CI_COMMIT_REF_SLUG
  HELM_RELEASE: $CI_PROJECT_NAME
  MAVEN_OPTS: "-Dmaven.repo.local=.m2/repository"
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.pip-cache"

# Cache configuration
cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - .m2/repository/
    - .pip-cache/
    - node_modules/
  policy: pull-push

# Default configuration
default:
  image: docker:24.0
  services:
    - docker:24.0-dind
  tags:
    - kubernetes
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure

# Stage 1: Code validation
validate:
  stage: validate
  image: node:20-alpine
  script:
    - npm ci
    - npm run lint
    - npm run format:check
  rules:
    - changes:
        - "**/*.js"
        - "**/*.ts"
        - "**/*.jsx"
        - "**/*.tsx"
        - package*.json

# Stage 2: Build
build:
  stage: build
  image: docker:24.0
  script:
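    # Read the password from stdin so it stays out of the process list. The login targets
    # $CI_REGISTRY while the pushes target $DOCKER_REGISTRY, so both must name the same registry.
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"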
    - docker build
      -t $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
      -t $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
      --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
      --build-arg VCS_REF=$CI_COMMIT_SHA
      --cache-from $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
      .
    - docker push $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
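    - docker push $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
    # Create the dotenv file that artifacts:reports:dotenv publishes (IMAGE_TAG is an illustrative name)
    - echo "IMAGE_TAG=$CI_COMMIT_SHA" > build.env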
  only:
    - main
    - develop
    - /^release\/.*$/
  artifacts:
    reports:
      dotenv: build.env

# Stage 3: Tests
unit-test:
  stage: test
  image: node:20-alpine
  script:
    - npm ci
    - npm run test:unit -- --coverage
  coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
  artifacts:
    reports:
      junit: test-results/junit.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml

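integration-test:
  stage: test
  # Run the tests inside the image built earlier. A job-level services: list
  # replaces the default dind service, and only the job container can reach
  # the service containers by hostname (postgres, redis).
  image: $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
  services:
    - postgres:15-alpine
    - redis:7-alpine
  variables:
    GIT_STRATEGY: none
    POSTGRES_DB: test_db
    POSTGRES_USER: test_user
    POSTGRES_PASSWORD: test_pass
    DATABASE_URL: postgresql://test_user:test_pass@postgres:5432/test_db
    REDIS_URL: redis://redis:6379
  script:
    # The application lives in the image's WORKDIR, not in the job's build directory
    - cd /app
    - npm run test:integration
  needs:
    - build
  # Same branches as the build job; a needs: entry must exist in the pipeline
  only:
    - main
    - develop
    - /^release\/.*$/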

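# Stage 4: Security scanning
security-scan:
  stage: security
  image:
    name: aquasec/trivy:latest
    # The image's entrypoint is the trivy binary; clear it so the runner can start a shell
    entrypoint: [""]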
  script:
    - trivy image --exit-code 0 --severity UNKNOWN,LOW,MEDIUM
      $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
    - trivy image --exit-code 1 --severity HIGH,CRITICAL
      $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
  allow_failure: true
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

sast:
  stage: security
  image: semgrep/semgrep:latest
  script:
    - semgrep --config auto --gitlab-sast --output semgrep.json .
  artifacts:
    reports:
      sast: semgrep.json
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

dependency-check:
  stage: security
  image: node:20-alpine
  script:
    - npm audit --audit-level=high
  allow_failure: true
  rules:
    - changes:
        - package*.json

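# Stage 5: Deploy to staging
deploy-staging:
  stage: deploy-staging
  # The script calls both kubectl and helm; bitnami/kubectl ships only kubectl,
  # so use an image that bundles both (dtzar/helm-kubectl is one example)
  image: dtzar/helm-kubectl:latest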
  script:
    - kubectl config use-context staging
    - kubectl create namespace $KUBE_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
    - helm upgrade --install $HELM_RELEASE ./charts/$CI_PROJECT_NAME
      --namespace $KUBE_NAMESPACE
      --set image.tag=$CI_COMMIT_SHA
      --set environment=staging
      --wait --timeout 5m
  environment:
    name: staging
    url: https://staging.$CI_PROJECT_NAME.example.com
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"
    - if: $CI_COMMIT_BRANCH == "main"

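# Stage 6: Deploy to production
deploy-production:
  stage: deploy-production
  # The script calls both kubectl and helm; bitnami/kubectl ships only kubectl,
  # so use an image that bundles both (dtzar/helm-kubectl is one example)
  image: dtzar/helm-kubectl:latest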
  script:
    - kubectl config use-context production
    - kubectl create namespace $KUBE_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
    - helm upgrade --install $HELM_RELEASE ./charts/$CI_PROJECT_NAME
      --namespace $KUBE_NAMESPACE
      --set image.tag=$CI_COMMIT_SHA
      --set environment=production
      --wait --timeout 10m
      --atomic
  environment:
    name: production
    url: https://$CI_PROJECT_NAME.example.com
  when: manual
  only:
    - main
  before_script:
    - echo "⚠️ Production deployment requires approval"
  after_script:
    - kubectl rollout status deployment/$HELM_RELEASE --namespace $KUBE_NAMESPACE --timeout=5m

# Rollback job
rollback:
  stage: deploy-production
  # helm rollback needs the helm binary, which bitnami/kubectl does not include
  image: dtzar/helm-kubectl:latest
  script:
    - kubectl config use-context production
    - helm rollback $HELM_RELEASE --namespace $KUBE_NAMESPACE
  when: manual
  variables:
    GIT_STRATEGY: none

GitLab CI Variable Management

# .gitlab/variables.yml (example)
variables:
  # General variables
  CI_DEBUG_TRACE: "false"
  GIT_DEPTH: "50"
  GIT_STRATEGY: clone

  # Docker optimization
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_BUILDKIT: "1"
  DOCKER_DRIVER: overlay2

  # Build optimization
  NODE_OPTIONS: "--max-old-space-size=4096"
  NPM_CONFIG_LOGLEVEL: warn

GitHub Actions Configuration Template

Full Workflow Example

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
    tags: ['v*']
  pull_request:
    branches: [main, develop]
  workflow_dispatch:
    inputs:
      environment:
        description: 'Deploy environment'
        required: true
        default: 'staging'
        type: choice
        options:
          - staging
          - production

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  HELM_VERSION: 3.12.0

jobs:
  # Job 1: Code validation
  validate:
    name: Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Check formatting
        run: npm run format:check

  # Job 2: Build
  build:
    name: Build
    runs-on: ubuntu-latest
    needs: validate
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,format=long
            type=ref,event=branch
            type=semver,pattern={{version}}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            VCS_REF=${{ github.sha }}

  # Job 3: Tests
  test:
    name: Test
    runs-on: ubuntu-latest
    needs: build
    services:
      postgres:
        image: postgres:15-alpine
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test_user
          POSTGRES_PASSWORD: test_pass
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      redis:
        image: redis:7-alpine
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379

    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npm run test:unit -- --coverage
        env:
          CI: true

      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/lcov.info
          fail_ci_if_error: false

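  # Job 4: Security scanning
  security:
    name: Security Scan
    runs-on: ubuntu-latest
    needs: build
    permissions:
      contents: read
      packages: read
      security-events: write # needed to upload SARIF results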
    steps:
      - uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

      - name: Run npm audit
        run: npm audit --audit-level=high
        continue-on-error: true

  # Job 5: Deploy to staging
  deploy-staging:
    name: Deploy Staging
    runs-on: ubuntu-latest
    needs: [build, test, security]
    if: github.ref == 'refs/heads/develop' || github.ref == 'refs/heads/main'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}

      - name: Deploy to Staging
        run: |
          helm upgrade --install ${{ github.event.repository.name }} ./charts/${{ github.event.repository.name }} \
            --namespace ${{ github.event.repository.name }}-staging \
            --set image.tag=${{ github.sha }} \
            --set environment=staging \
            --wait --timeout 5m

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/${{ github.event.repository.name }} \
            --namespace ${{ github.event.repository.name }}-staging \
            --timeout=5m

  # Job 6: Deploy to production
  deploy-production:
    name: Deploy Production
    runs-on: ubuntu-latest
    needs: deploy-staging
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://example.com
    steps:
      - uses: actions/checkout@v4

      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}

      - name: Deploy to Production
        run: |
          helm upgrade --install ${{ github.event.repository.name }} ./charts/${{ github.event.repository.name }} \
            --namespace ${{ github.event.repository.name }}-prod \
            --set image.tag=${{ github.sha }} \
            --set environment=production \
            --wait --timeout 10m \
            --atomic

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/${{ github.event.repository.name }} \
            --namespace ${{ github.event.repository.name }}-prod \
            --timeout=5m

      - name: Notify success
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "✅ Production deployment successful: ${{ github.event.repository.name }}@${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

  # Job 7: Rollback
  rollback:
    name: Rollback Production
    runs-on: ubuntu-latest
    needs: deploy-production
    if: failure()
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}

      - name: Rollback deployment
        run: |
          helm rollback ${{ github.event.repository.name }} \
            --namespace ${{ github.event.repository.name }}-prod

Jenkins Pipeline Best Practices

Jenkinsfile (Declarative Pipeline)

// Jenkinsfile
pipeline {
    agent {
        kubernetes {
            yaml '''
                apiVersion: v1
                kind: Pod
                spec:
                  containers:
                  - name: docker
                    image: docker:24.0
                    command:
                    - cat
                    tty: true
                    volumeMounts:
                    - name: docker-sock
                      mountPath: /var/run/docker.sock
                  - name: kubectl
                    image: bitnami/kubectl:latest
                    command:
                    - cat
                    tty: true
                  volumes:
                  - name: docker-sock
                    hostPath:
                      path: /var/run/docker.sock
            '''
        }
    }

    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        DOCKER_IMAGE = "${env.JOB_NAME}/${env.BRANCH_NAME}".toLowerCase()
        KUBE_NAMESPACE = "${env.JOB_NAME}-${env.BRANCH_NAME}".toLowerCase()
    }

    options {
        timeout(time: 60, unit: 'MINUTES')
        retry(2)
        disableConcurrentBuilds()
        buildDiscarder(logRotator(numToKeepStr: '20'))
    }

    triggers {
        pollSCM('H/5 * * * *')
        cron('H 2 * * 1-5') // weekdays, at a hashed minute within the 02:00 hour
    }

    parameters {
        string(name: 'DEPLOY_ENV', defaultValue: 'staging', description: 'Deployment environment')
        booleanParam(name: 'SKIP_TESTS', defaultValue: false, description: 'Skip tests')
        choice(name: 'DEPLOY_TYPE', choices: ['blue-green', 'rolling', 'canary'], description: 'Deployment strategy')
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
                script {
                    env.GIT_COMMIT_SHORT = sh(script: 'git rev-parse --short HEAD', returnStdout: true).trim()
                }
            }
        }

        stage('Validate') {
            steps {
                script {
                    if (fileExists('package.json')) {
                        sh 'npm ci'
                        sh 'npm run lint'
                    }
                }
            }
        }

        stage('Build') {
            steps {
                script {
                    withCredentials([usernamePassword(credentialsId: 'docker-registry', usernameVariable: 'DOCKER_USER', passwordVariable: 'DOCKER_PASS')]) {
                        // Single quotes: the shell, not Groovy, expands the variables, so the secret is never interpolated into the script text
                        sh '''
                            echo "$DOCKER_PASS" | docker login -u "$DOCKER_USER" --password-stdin "$DOCKER_REGISTRY"
                            docker build -t "$DOCKER_REGISTRY/$DOCKER_IMAGE:$GIT_COMMIT_SHORT" .
                            docker push "$DOCKER_REGISTRY/$DOCKER_IMAGE:$GIT_COMMIT_SHORT"
                        '''
                    }
                }
            }
            post {
                success {
                    archiveArtifacts artifacts: '**/target/*.jar', allowEmptyArchive: true
                }
            }
        }

        stage('Test') {
            when {
                expression { return !params.SKIP_TESTS }
            }
            parallel {
                stage('Unit Tests') {
                    steps {
                        script {
                            if (fileExists('package.json')) {
                                sh 'npm run test:unit -- --coverage'
                            }
                        }
                    }
                    post {
                        always {
                            junit allowEmptyResults: true, testResults: '**/test-results/*.xml'
                            publishHTML(target: [
                                allowMissing: true,
                                alwaysLinkToLastBuild: true,
                                keepAllHistory: true,
                                reportDir: 'coverage',
                                reportFiles: 'index.html',
                                reportName: 'Code Coverage Report'
                            ])
                        }
                    }
                }

                stage('Integration Tests') {
                    steps {
                        script {
                            withCredentials([string(credentialsId: 'db-url', variable: 'DATABASE_URL')]) {
                                sh 'npm run test:integration'
                            }
                        }
                    }
                }
            }
        }

        stage('Security Scan') {
            steps {
                script {
                    sh """
                        docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \\
                            aquasec/trivy:latest image \\
                            --exit-code 0 --severity UNKNOWN,LOW,MEDIUM \\
                            ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT}

                        docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \\
                            aquasec/trivy:latest image \\
                            --exit-code 1 --severity HIGH,CRITICAL \\
                            ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT}
                    """
                }
            }
            post {
                always {
                    recordIssues enabledForFailure: true, tool: trivy(pattern: '**/trivy-report.json')
                }
            }
        }

        stage('Deploy Staging') {
            when {
                branch 'develop'
            }
            steps {
                script {
                    withKubeConfig([credentialsId: 'kube-staging', serverUrl: '']) {
                        sh """
                            kubectl create namespace ${KUBE_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
                            helm upgrade --install ${JOB_NAME} ./charts/${JOB_NAME} \\
                                --namespace ${KUBE_NAMESPACE} \\
                                --set image.tag=${GIT_COMMIT_SHORT} \\
                                --set environment=staging \\
                                --wait --timeout 5m
                        """
                    }
                }
            }
        }

        stage('Deploy Production') {
            when {
                allOf {
                    branch 'main'
                    expression { return params.DEPLOY_ENV == 'production' }
                }
            }
            steps {
                input message: 'Deploy to production?', ok: 'Deploy'
                script {
                    withKubeConfig([credentialsId: 'kube-production', serverUrl: '']) {
                        sh """
                            kubectl create namespace ${KUBE_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
                            helm upgrade --install ${JOB_NAME} ./charts/${JOB_NAME} \\
                                --namespace ${KUBE_NAMESPACE} \\
                                --set image.tag=${GIT_COMMIT_SHORT} \\
                                --set environment=production \\
                                --wait --timeout 10m \\
                                --atomic
                        """
                    }
                }
            }
            post {
                success {
                    script {
                        slackSend(channel: '#deployments', color: 'good', message: "✅ Production deployment successful: ${JOB_NAME}@${GIT_COMMIT_SHORT}")
                    }
                }
                failure {
                    script {
                        slackSend(channel: '#deployments', color: 'danger', message: "❌ Production deployment failed: ${JOB_NAME}@${GIT_COMMIT_SHORT}")
                    }
                }
            }
        }
    }

    post {
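        always {
            script {
                // currentResult is always set, whereas result can still be null at this point
                currentBuild.description = "Build ${GIT_COMMIT_SHORT} - ${currentBuild.currentResult}"
            }
        }
        cleanup {
            // cleanup runs after every other post condition, so the tagging step in success still has a workspace
            cleanWs()
        }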
        success {
            script {
                if (env.BRANCH_NAME == 'main') {
                    sh 'git tag -a v${BUILD_NUMBER} -m "Release ${BUILD_NUMBER}" || true'
                    sh 'git push origin v${BUILD_NUMBER} || true'
                }
            }
        }
        failure {
            script {
                emailext(
                    subject: "Build failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                    body: """Build failed. Check console output at ${BUILD_URL}""",
                    to: 'team@example.com',
                    recipientProviders: [[$class: 'DevelopersRecipientProvider']]
                )
            }
        }
    }
}

Docker Image Build Optimization

Multi-stage Dockerfile

# Dockerfile - optimized multi-stage build
ARG NODE_VERSION=20-alpine
ARG ALPINE_VERSION=3.19

# ========== Stage 1: Dependencies ==========
FROM node:${NODE_VERSION} AS deps
WORKDIR /app

# Copy the package manifests
COPY package*.json ./

# Install all dependencies, including the devDependencies the build stage needs (cached as its own layer)
RUN npm ci && \
    npm cache clean --force

# ========== Stage 2: Build ==========
FROM node:${NODE_VERSION} AS builder
WORKDIR /app

# Copy dependencies
COPY --from=deps /app/node_modules ./node_modules
COPY package*.json ./
COPY tsconfig*.json ./
COPY src/ ./src/

# Build the application, then drop devDependencies
RUN npm run build && \
    npm prune --omit=dev

# ========== Stage 3: Production ==========
FROM node:${NODE_VERSION} AS production
WORKDIR /app

# Create a non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Copy build artifacts
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./

# Run as the non-root user
USER nodejs

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"

# Expose the port
EXPOSE 3000

# Start command
CMD ["node", "dist/index.js"]

# ========== Labels (build arguments) ==========
ARG BUILD_DATE
ARG VCS_REF
LABEL org.label-schema.build-date="${BUILD_DATE}" \
      org.label-schema.vcs-ref="${VCS_REF}" \
      org.label-schema.schema-version="1.0"

Docker Build Optimization Tips

#!/bin/bash
# build-optimized.sh
set -euo pipefail

# Enable BuildKit
export DOCKER_BUILDKIT=1

# Build parameters
IMAGE_NAME="myapp"
IMAGE_TAG="${GIT_COMMIT:-latest}"
REGISTRY="registry.example.com"

# Build command (with cache optimization).
# BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the pushed image so that
# --cache-from can reuse its layers on the next build.
docker build \
  --progress=plain \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --build-arg BUILD_DATE="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  --build-arg VCS_REF="${GIT_COMMIT:-}" \
  --cache-from "${REGISTRY}/${IMAGE_NAME}:latest" \
  --tag "${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}" \
  --tag "${REGISTRY}/${IMAGE_NAME}:latest" \
  --file Dockerfile \
  .

# Push the images
docker push "${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}"
docker push "${REGISTRY}/${IMAGE_NAME}:latest"

# Remove dangling images
docker image prune -f

.dockerignore Optimization

# .dockerignore
# Git
.git
.gitignore
.gitattributes

# Documentation
*.md
!README.md
docs/

# Tests
test/
tests/
__tests__/
*.test.js
*.spec.js
coverage/

# Development
.env
.env.*
!.env.example
.vscode/
.idea/
*.log
npm-debug.log*

# Dependencies (will be installed in container)
node_modules/

# Build artifacts
dist/
build/
*.tgz

# Docker
Dockerfile*
docker-compose*.yml
.docker/

# CI/CD
.github/
.gitlab-ci.yml
Jenkinsfile
.travis.yml
.circleci/

# OS files
.DS_Store
Thumbs.db

Kubernetes Deployment Strategies

Helm Chart Template

# charts/myapp/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  strategy:
    type: {{ .Values.deploymentStrategy.type }}
    {{- if eq .Values.deploymentStrategy.type "RollingUpdate" }}
    rollingUpdate:
      maxSurge: {{ .Values.deploymentStrategy.rollingUpdate.maxSurge }}
      maxUnavailable: {{ .Values.deploymentStrategy.rollingUpdate.maxUnavailable }}
    {{- end }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "myapp.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
            timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
            failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
            timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }}
            failureThreshold: {{ .Values.probes.readiness.failureThreshold }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          env:
            - name: NODE_ENV
              value: {{ .Values.environment | quote }}
            - name: PORT
              value: {{ .Values.service.port | quote }}
          envFrom:
            - configMapRef:
                name: {{ include "myapp.fullname" . }}-config
            - secretRef:
                name: {{ include "myapp.fullname" . }}-secret
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}

values-production.yaml

# charts/myapp/values-production.yaml
replicaCount: 3

image:
  repository: registry.example.com/myapp
  pullPolicy: IfNotPresent
  tag: "" # empty: falls back to the chart's appVersion

deploymentStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

environment: production

probes:
  liveness:
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readiness:
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 3
    failureThreshold: 3

resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

podDisruptionBudget:
  enabled: true
  minAvailable: 2

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  className: nginx
  annotations:
    # ingress-nginx rate limiting: at most 100 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "100"
  hosts:
    - host: myapp.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: myapp-tls
      hosts:
        - myapp.example.com

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1001
  fsGroup: 1001

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

nodeSelector: {}

tolerations: []

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: myapp
          topologyKey: kubernetes.io/hostname

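Deployment Strategy Comparison

Strategy       Use case               Pros                                Cons
RollingUpdate  Recommended default    Zero downtime, resource-efficient   Versions coexist during the rollout
Blue-Green     Critical applications  Fast rollback, full isolation       Doubles resource usage
Canary         High-risk changes      Controlled risk, gradual rollout    Complex configuration
Recreate       Development            Simple and direct                   Service interruption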

Canary Deployment Example

# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 3000
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        type: pre-rollout
        url: http://flagger-loadtester.testing/
        timeout: 60s
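
To make the analysis settings above easier to reason about, the following Python sketch works out the traffic steps and rough timing they imply. It assumes that the canary weight grows by `stepWeight` once per `interval` until it reaches `maxWeight`, that `threshold` counts failed checks before a rollback, and that `maxWeight` is a multiple of `stepWeight`. The function names are illustrative and not part of Flagger.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Traffic percentages the canary passes through before promotion."""
    if step_weight <= 0 or max_weight <= 0:
        raise ValueError("step_weight and max_weight must be positive")
    return list(range(step_weight, max_weight + 1, step_weight))


def minimum_rollout_minutes(interval_min: int, step_weight: int, max_weight: int) -> int:
    """Lower bound on rollout time: one analysis interval per weight step."""
    return interval_min * len(canary_weights(step_weight, max_weight))


def rollback_minutes(interval_min: int, threshold: int) -> int:
    """Time until rollback if every check fails: `threshold` failed intervals."""
    return interval_min * threshold


# Values taken from the Canary resource above.
print(canary_weights(10, 50))              # [10, 20, 30, 40, 50]
print(minimum_rollout_minutes(1, 10, 50))  # 5
print(rollback_minutes(1, 10))             # 10
```

With these settings a healthy release takes at least about five minutes to reach 50% traffic, and a consistently failing one is rolled back after about ten.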

Environment Management and Configuration

Environment Variable Management

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  LOG_LEVEL: "info"
  LOG_FORMAT: "json"
  ENABLE_METRICS: "true"
  CACHE_TTL: "3600"
  MAX_CONNECTIONS: "100"
  TIMEOUT_MS: "30000"
---
# secret.yaml (manage with SealedSecrets or ExternalSecrets; never commit real values)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secret
type: Opaque
stringData:
  DATABASE_URL: "postgresql://user:pass@db:5432/myapp"
  REDIS_URL: "redis://redis:6379"
  JWT_SECRET: "change-me-in-production"
  API_KEY: "sk-xxx"

Configuration Management Best Practices

#!/bin/bash
# manage-config.sh

# Manage multi-environment configuration with Kustomize
ENV="${1:-staging}"

kubectl apply -k "overlays/${ENV}/"

# Or manage it with Helm
helm upgrade --install myapp ./charts/myapp \
  -f values.yaml \
  -f "values-${ENV}.yaml" \
  --namespace "myapp-${ENV}"

# Verify the configuration
kubectl get configmap,secret -l app=myapp -n "myapp-${ENV}"

Quality Gates and Testing Strategy

Quality Gate Criteria

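Check                               Threshold      Stage
Unit test coverage                  ≥ 80%          CI
Code duplication rate               ≤ 5%           CI
Critical security vulnerabilities   0              CI
Build time                          ≤ 10 minutes   CI
E2E test pass rate                  100%           CD-Staging
Performance regression              ≤ 5%           CD-Staging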
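
The CI-stage gates in the table can be enforced by a small script at the end of the pipeline. The following is a minimal sketch: it assumes earlier steps have already produced the four numbers, and the function name and argument order are hypothetical.

```shell
# Sketch: evaluate the CI-stage gates from the table above.
# Arguments: coverage %, duplication %, critical vulnerability count, build seconds.
check_quality_gates() {
  local coverage="$1" duplication="$2" critical_vulns="$3" build_seconds="$4"
  awk -v c="$coverage" -v d="$duplication" -v v="$critical_vulns" -v b="$build_seconds" '
    BEGIN {
      fail = 0
      if (c < 80)  { print "FAIL: unit test coverage " c "% is below 80%"; fail = 1 }
      if (d > 5)   { print "FAIL: code duplication " d "% is above 5%"; fail = 1 }
      if (v > 0)   { print "FAIL: " v " critical vulnerabilities found"; fail = 1 }
      if (b > 600) { print "FAIL: build took " b "s, limit is 600s"; fail = 1 }
      if (!fail) print "PASS"
      exit fail
    }'
}

# Usage: check_quality_gates 85.5 3 0 420 || exit 1
```

awk handles the comparison so that fractional percentages work without extra tooling; the non-zero exit status lets the CI job fail directly.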

Test Pyramid Configuration

# test-strategy.yml (a planning reference, not a runnable GitHub Actions workflow)
test-strategy:
  unit-tests:
    count: 500+
    execution-time: "< 2 min"
    run-on: "every commit"

  integration-tests:
    count: 50+
    execution-time: "< 5 min"
    run-on: "PR to main/develop"

  e2e-tests:
    count: 20+
    execution-time: "< 10 min"
    run-on: "before production deploy"

  performance-tests:
    count: 10+
    execution-time: "< 15 min"
    run-on: "weekly / before major release"
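
One way to apply this plan in a pipeline is to map each trigger to the suites it should run. The sketch below assumes the layers are cumulative (a later trigger also runs the cheaper suites), which the plan does not state explicitly. The trigger labels and the function name are hypothetical, not GitHub Actions event names.

```shell
# Sketch: map a pipeline trigger to the suites listed in the plan above.
# Assumption: each trigger also runs all cheaper suites below it.
suites_for_trigger() {
  case "$1" in
    commit)         echo "unit" ;;
    pull_request)   echo "unit integration" ;;
    pre_deploy)     echo "unit integration e2e" ;;
    weekly|release) echo "unit integration e2e performance" ;;
    *)              echo "unknown trigger: $1" >&2; return 1 ;;
  esac
}

# Usage:
# for suite in $(suites_for_trigger pull_request); do
#   npm run "test:${suite}"   # hypothetical npm script names
# done
```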

Monitoring and Observability

Pipeline Monitoring Metrics

# prometheus-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cicd-monitor
spec:
  selector:
    matchLabels:
      app: gitlab-runner
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Key Monitoring Alerts

# alertmanager-rules.yaml (Prometheus alerting rules; metric names depend on your exporter)
groups:
  - name: cicd-alerts
    rules:
      - alert: HighBuildFailureRate
        expr: |
          sum(rate(gitlab_ci_builds_failed_total[5m])) 
          / sum(rate(gitlab_ci_builds_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CI build failure rate above 20%"

      - alert: LongBuildDuration
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(gitlab_ci_build_duration_seconds_bucket[5m]))) > 600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 build duration above 10 minutes"

      - alert: DeploymentFailure
        expr: |
          sum(rate(deployment_failures_total[5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Production deployment failed"
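
To sanity-check the HighBuildFailureRate threshold against real numbers, the same ratio can be computed from raw counts. This is a minimal sketch with a hypothetical function name; like the PromQL expression, it does not fire when there were no builds at all.

```shell
# Sketch: the ratio check performed by HighBuildFailureRate, on raw counts.
# failure_rate_exceeds <failed> <total> <threshold>
failure_rate_exceeds() {
  local failed="$1" total="$2" threshold="$3"
  awk -v f="$failed" -v t="$total" -v th="$threshold" 'BEGIN {
    if (t == 0) exit 1            # no builds, nothing to alert on
    exit !((f / t) > th)          # exit 0 only when the ratio is above the threshold
  }'
}

# Usage:
# failure_rate_exceeds 3 10 0.2 && echo "would alert"
```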

Monitoring Commands

# View pipeline execution history
gitlab-ci-stats --project myapp --days 30

# Inspect the container state of the runner pods
kubectl get pods -n gitlab -l app=gitlab-runner -o json | \
  jq '.items[].status.containerStatuses[].state'

# Check rollout status
kubectl rollout status deployment/myapp --namespace production

# List deployments sorted by creation time
kubectl get deployments -A --sort-by='.metadata.creationTimestamp'

# View warning events
kubectl get events --namespace production --field-selector type=Warning

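Troubleshooting Runbook

Common Problems and Fixes

1. Build failure: dependency installation times out

# Check network connectivity
curl -I https://registry.npmjs.org

# Use a mirror registry
npm config set registry https://registry.npmmirror.com

# Clear the cache and reinstall from the lockfile
npm cache clean --force
rm -rf node_modules
npm ci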
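
Transient registry timeouts are often cleared by simply retrying. A minimal sketch of a retry wrapper with exponential backoff follows; the function name and argument order are hypothetical, and the attempt count and delay should be tuned to the pipeline's time budget.

```shell
# Sketch: retry a flaky step with exponential backoff.
# retry <max_attempts> <initial_delay_seconds> <command...>
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage in a CI job:
# retry 3 5 npm ci
```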

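2. Docker build runs out of memory

# Raise the Docker memory limit
# Docker Desktop: Preferences → Resources → Memory: 8GB

# In CI, enable BuildKit and raise its step log limit so the failing step stays visible
export DOCKER_BUILDKIT=1
export BUILDKIT_STEP_LOG_MAX_SIZE=10485760

# Reduce peak memory by using a multi-stage build in the Dockerfile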

3. Kubernetes deployment stuck

# View deployment status
kubectl describe deployment myapp -n production

# View events sorted by time
kubectl get events -n production --sort-by='.lastTimestamp'

# Follow Pod logs
kubectl logs -f deployment/myapp -n production

# Roll back to the previous revision
kubectl rollout undo deployment/myapp -n production

# Check resource quotas and limits
kubectl describe quota -n production
kubectl describe limitrange -n production
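
In a pipeline, the status check and the rollback can be combined so that a stuck rollout is undone automatically. This is a minimal sketch, not a full deployment tool: the function name and the 300s default timeout are assumptions, and it relies only on `kubectl rollout status --timeout` and `kubectl rollout undo`.

```shell
# Sketch: wait for a rollout and undo it when it does not finish in time.
# deploy_or_rollback <deployment> <namespace> [timeout]
deploy_or_rollback() {
  local deploy="$1" ns="$2" timeout="${3:-300s}"
  if kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout="$timeout"; then
    echo "rollout of ${deploy} succeeded"
    return 0
  fi
  echo "rollout of ${deploy} did not complete within ${timeout}; rolling back" >&2
  kubectl rollout undo "deployment/${deploy}" -n "$ns"
  return 1
}

# Usage in a CI job:
# deploy_or_rollback myapp production 600s || exit 1
```

The function returns non-zero after a rollback so the pipeline still reports the deployment as failed.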

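4. Helm deployment failure

# View the release history
helm history myapp -n production

# View the failure reason
helm status myapp -n production

# Roll back to the previous revision (omit the revision number; "1" would target revision 1)
helm rollback myapp -n production

# Clean up a release stuck in pending-upgrade (scope the selector to this release only)
kubectl delete secret -n production -l owner=helm,name=myapp,status=pending-upgrade

# Redeploy
helm upgrade --install myapp ./charts/myapp -n production --atomic --timeout 10m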

5. CI Runner unresponsive

# GitLab Runner: re-verify registered runners (--delete removes those that fail verification)
gitlab-runner verify --delete
gitlab-runner restart

# View Runner logs
journalctl -u gitlab-runner -f

# Check Runner status
gitlab-runner list

# Re-register the Runner
gitlab-runner register --url https://gitlab.example.com/

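Troubleshooting Flow

Build failure
    │
    ├─→ Check the build log → locate the failing line
    │
    ├─→ Reproduce locally → docker build --no-cache
    │
    ├─→ Check dependencies → npm install / mvn dependency:tree
    │
    ├─→ Check resources → memory/disk/CPU
    │
    └─→ Contact the platform team → Runner configuration issue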
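
The first pass through this flow can be partly automated by matching well-known error strings in the build log. The sketch below is illustrative only: the function name is hypothetical and the patterns are common examples, not an exhaustive list.

```shell
# Sketch: first-pass triage of a build log, following the flow above.
# classify_build_failure <logfile>
classify_build_failure() {
  local log="$1"
  if grep -Eqi 'ETIMEDOUT|ECONNRESET|could not resolve' "$log"; then
    echo "dependency"   # network or registry problem during install
  elif grep -Eqi 'no space left on device|out of memory|exit code 137' "$log"; then
    echo "resources"    # disk or memory exhausted on the runner
  elif grep -Eqi 'runner system failure|failed to pull image' "$log"; then
    echo "runner"       # escalate to the platform team
  else
    echo "code"         # reproduce locally, e.g. docker build --no-cache
  fi
}

# Usage:
# classify_build_failure build.log
```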

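Security Best Practices

CI/CD Security Checklist

  • Configure Service Accounts following the principle of least privilege
  • Manage sensitive data with Secrets; never hardcode it
  • Enable image signing and verification (Cosign/Notary)
  • Scan images for vulnerabilities regularly (Trivy/Grype)
  • Restrict Runner permissions and run jobs in isolated environments
  • Enable branch protection and require PR review
  • Retain audit logs for at least 90 days
  • Rotate CI/CD credentials regularly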
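
The rotation item on this checklist can be backed by a scheduled check that flags old credentials. The sketch below assumes the creation time is available as a Unix timestamp. The 90-day default is an illustrative assumption (the checklist does not specify a rotation window), and the function name is hypothetical.

```shell
# Sketch: flag credentials older than a rotation window.
# needs_rotation <created_epoch> <now_epoch> [max_age_days]
needs_rotation() {
  local created="$1" now="$2" max_days="${3:-90}"   # 90 days is an assumed default
  local age_days="$(( (now - created) / 86400 ))"
  [ "$age_days" -ge "$max_days" ]
}

# Usage:
# if needs_rotation "$TOKEN_CREATED_AT" "$(date +%s)"; then
#   echo "CI token is due for rotation" >&2
# fi
```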

Secret Management

# Use the External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: myapp-secret
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: myapp/database
        property: url

Image Security

# Sign the image
cosign sign --key cosign.key registry.example.com/myapp:latest

# Verify the signature
cosign verify --key cosign.pub registry.example.com/myapp:latest

# Scan for vulnerabilities
trivy image --severity HIGH,CRITICAL registry.example.com/myapp:latest

# Generate an SBOM
syft registry.example.com/myapp:latest -o spdx-json > sbom.json
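
A tag such as latest is mutable, so a signature or scan made against it applies only to whatever digest the tag pointed to at that moment. A pipeline can guard against this by requiring a digest reference before signing or deploying. The sketch below is a plain string check with a hypothetical function name; it does not contact the registry.

```shell
# Sketch: require an immutable digest reference before signing or deploying.
is_digest_ref() {
  # True when the reference ends in @sha256:<64 lowercase hex characters>.
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Usage ($IMAGE_REF is a hypothetical variable set by the build step):
# is_digest_ref "$IMAGE_REF" || { echo "refusing to sign a mutable tag" >&2; exit 1; }
# cosign sign --key cosign.key "$IMAGE_REF"
```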

Appendix: Quick Reference

Command Cheat Sheet

# GitLab CI
gitlab-runner register
gitlab-runner verify
gitlab-ci-local  # test CI locally

# GitHub Actions
act  # run Actions locally

# Jenkins
java -jar jenkins-cli.jar -s http://localhost:8080/

# Docker
docker build -t myapp:latest .
docker push myapp:latest
docker scout cves myapp:latest  # replaces the removed "docker scan" command

# Kubernetes
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp
kubectl describe pod myapp-xxx

# Helm
helm lint ./charts/myapp
helm template myapp ./charts/myapp
helm upgrade --install myapp ./charts/myapp
helm rollback myapp 1  # 1 is the target revision; omit it to roll back to the previous release

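Recommended Toolchain

Category               Tools                             Purpose
Local CI testing       gitlab-ci-local, act              Validate pipelines locally
Image scanning         Trivy, Grype                      Vulnerability scanning
Image signing          Cosign, Notary                    Signing and verification
Secret management      External Secrets, SealedSecrets   Secret management
Configuration          Helm, Kustomize                   Deployment configuration
Progressive delivery   Flagger, Argo Rollouts            Canary and blue-green deployments
Workflow engines       Argo Workflows, Tekton            Complex pipelines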

Document version: 1.0
Last updated: 2026-03-12
Maintained by: SRE Team
Feedback channel: #sre-feedback
