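Terraform IaC: From Beginner to Expert Documentation Series
A hands-on Infrastructure as Code (IaC) guide for DevOps engineers, SREs, and cloud architects. Based on Terraform v1.5+, it covers multi-cloud scenarios across AWS, Azure, GCP, and Alibaba Cloud.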
01-terraform-intro
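What Is Infrastructure as Code (IaC)?
IaC (Infrastructure as Code) is an approach to defining and managing infrastructure through code. Servers, networks, storage, and other resources are declared in configuration files, which enables:
- ✅ Version control (changes tracked in Git)
- ✅ Repeatable deployments (no more "snowflake servers")
- ✅ Automation (CI/CD integration)
- ✅ Auditing and compliance
Why Terraform?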
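| Tool | Strengths | Weaknesses |
|---|---|---|
| Terraform | Multi-cloud support, declarative, state management, modular | Moderate learning curve |
| CloudFormation | AWS-native, deep integration | AWS only |
| Pulumi | General-purpose languages (Python/Go) | More complex runtime dependencies |
| Ansible | Agentless, well suited to configuration management | Procedural rather than declarative; state is hard to manage |
💡 Terraform's core advantage: the provider ecosystem (3,000+ supported clouds and services)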
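Core Concepts
- Provider: a plugin for a cloud vendor or service (e.g. aws, azurerm)
- Resource: a unit of infrastructure (e.g. aws_instance)
- State: metadata that records the resources Terraform has created (.tfstate)
- Plan/Apply: preview the changes, then execute them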
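To make these terms concrete, here is a minimal sketch that puts a provider and a resource side by side. The version constraints and the bucket name are assumptions chosen for illustration, not values from this series.

```hcl
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    # Provider: the plugin that talks to the AWS API
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Resource: one unit of infrastructure, tracked in state once it is created
resource "aws_s3_bucket" "example" {
  bucket = "example-iac-demo-bucket" # hypothetical name; bucket names must be globally unique
}
```

Running terraform plan against this file shows the bucket as a pending creation; terraform apply creates it and records it in terraform.tfstate.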
02-install-and-cli
Installing Terraform
Linux (APT)
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
Verify
terraform version
# Terraform v1.6.0
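Core CLI Commands
| Command | Purpose |
|---|---|
| terraform init | Initialize the working directory (downloads providers) |
| terraform fmt | Format code |
| terraform validate | Validate syntax and internal consistency |
| terraform plan | Preview changes (without applying them) |
| terraform apply | Apply changes |
| terraform destroy | Destroy all resources managed by the configuration |
| terraform state list | List the resources in the current state |
💡 Best practice: always run plan before apply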
03-hello-world-aws
Prerequisites
- An AWS account and an IAM user with EC2 permissions
- The AWS CLI configured with credentials:
aws configure
Step 1: Create main.tf
provider "aws" {
  region = "us-east-1"
}
resource "aws_instance" "web" {
  ami           = "ami-0c02fb55956c7d316" # Amazon Linux 2; AMI IDs are region-specific
  instance_type = "t3.micro"
  tags = {
    Name = "HelloWorld"
  }
}
Step 2: Initialize and deploy
terraform init
terraform plan
terraform apply
Step 3: Verify
aws ec2 describe-instances --filters "Name=tag:Name,Values=HelloWorld"
Step 4: Clean up
terraform destroy
⚠️ Warning: do not run this directly in a production account. Use a sandbox account.
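A hardcoded AMI ID only works in one region and can stop working once the image is deprecated. One possible alternative is to look the image up at plan time with a data source. The sketch below assumes the Amazon Linux 2 name pattern shown in the filter; adjust it to the image family you actually use.

```hcl
# Look up the most recent Amazon Linux 2 AMI in the provider's region
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed name pattern
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.micro"

  tags = {
    Name = "HelloWorld"
  }
}
```

The trade-off is that a newer image may be picked up on a later plan, which would replace the instance, so pinning is still reasonable where stability matters more than freshness.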
04-configuration-syntax
HCL (HashiCorp Configuration Language) Basics
Variable Definitions
variable "region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}
Resource References
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id # reference the VPC ID
  cidr_block = "10.0.1.0/24"
}
Expressions and Functions
# Conditional expression
instance_type = var.env == "prod" ? "m5.large" : "t3.micro"
# Function call
tags = merge(var.common_tags, { Name = "web-${var.env}" })
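The two expressions above rely on var.env and var.common_tags, which are not declared in this chapter. A minimal sketch of how they might be declared follows; the allowed environment names and the defaults are assumptions.

```hcl
variable "env" {
  description = "Deployment environment"
  type        = string
  default     = "dev"

  validation {
    # Reject typos early instead of silently falling back to the non-prod size
    condition     = contains(["dev", "staging", "prod"], var.env)
    error_message = "env must be one of: dev, staging, prod."
  }
}

variable "common_tags" {
  description = "Tags applied to every resource"
  type        = map(string)
  default     = {}
}

locals {
  # Same expressions as above, evaluated once and reused
  instance_type = var.env == "prod" ? "m5.large" : "t3.micro"
  tags          = merge(var.common_tags, { Name = "web-${var.env}" })
}
```

Resources can then refer to local.instance_type and local.tags instead of repeating the expressions.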
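for_each Loops
locals {
  subnets = ["10.0.1.0/24", "10.0.2.0/24"]
}
resource "aws_subnet" "public" {
  for_each   = toset(local.subnets)
  vpc_id     = aws_vpc.main.id
  cidr_block = each.value
}
💡 Tip: prefer for_each over count for collections of similar resources. for_each tracks each instance by key, so removing one element does not shift the addresses of the others.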
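The stability difference is easier to see with a map, where each key becomes part of the resource address. This is a sketch; the keys "a" and "b" and the tag format are hypothetical.

```hcl
locals {
  # Hypothetical map: the key becomes part of each resource address
  public_subnets = {
    a = "10.0.1.0/24"
    b = "10.0.2.0/24"
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  for_each   = local.public_subnets
  vpc_id     = aws_vpc.main.id
  cidr_block = each.value

  tags = {
    Name = "public-${each.key}"
  }
}
```

The instances are addressed as aws_subnet.public["a"] and aws_subnet.public["b"]. Deleting the "a" entry leaves "b" untouched, whereas with count, removing the first list element would shift every later index and force those subnets to be replaced.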
05-state-management
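What Is State?
- It records the mapping between your configuration and the real resources that were created
- By default it is stored in terraform.tfstate (JSON format)
Remote Backends
A remote backend protects against losing local state and supports team collaboration.
AWS S3 + DynamoDB (recommended)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/web/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
🔒 The DynamoDB table provides state locking, which prevents concurrent runs from conflicting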
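State Operation Tips
# Import an existing resource
terraform import aws_instance.web i-1234567890abcdef0
# Move a resource (when refactoring)
terraform state mv aws_instance.web module.web.aws_instance.main
# Remove a resource from state without destroying it (use with caution!)
terraform state rm aws_instance.broken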
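Since this series targets Terraform v1.5+, imports and moves can also be written declaratively, so they go through plan and code review like any other change. The sketch below is illustrative: the address aws_instance.imported is hypothetical, and module "web" is assumed to be declared elsewhere in the configuration.

```hcl
# Declarative import (v1.5+): adopt an existing instance on the next apply.
import {
  to = aws_instance.imported
  id = "i-1234567890abcdef0"
}

# The target resource block must exist; its arguments should match the real
# instance, otherwise the plan will show changes after the import.
resource "aws_instance" "imported" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = "t3.micro"
}

# Declarative move (v1.1+): the resource formerly at aws_instance.web now
# lives inside module.web, so Terraform updates state instead of recreating it.
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.main
}
```

Both blocks can be deleted once every state that needed them has been applied.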
06-modules
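Why Modules?
- Reuse: write once, call from many places
- Abstraction: hide complexity (such as the details of creating a VPC)
- Standardization: a team-wide set of best practices
Module Directory Structure
modules/
└── vpc/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
Module Example (modules/vpc/main.tf)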
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
}
output "vpc_id" {
  value = aws_vpc.this.id
}
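The module reads var.cidr_block, which would be declared in variables.tf. A minimal sketch of that file, with an input validation added as an assumption (the original module may not validate its input):

```hcl
# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR range for the VPC"
  type        = string

  validation {
    # cidrhost() fails on malformed CIDR strings; can() turns that failure into false
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "cidr_block must be a valid CIDR range, for example 10.0.0.0/16."
  }
}
```

With this in place, a typo such as "10.0.0.0/33" is rejected during terraform plan rather than by the AWS API during apply.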
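Calling the Module
module "prod_vpc" {
  source     = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
}
# Reference the module output
resource "aws_subnet" "public" {
  vpc_id     = module.prod_vpc.vpc_id
  cidr_block = "10.0.1.0/24" # required argument; must fall inside the VPC range
}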
Publishing Modules
- Local path: ./modules/vpc
- Git: github.com/org/terraform-aws-vpc?ref=v1.0.0
- Terraform Registry: terraform-aws-modules/vpc/aws
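The three source types above could be consumed roughly as follows. This is a sketch: the module labels, the version constraint, and the assumption that the Git module exposes a cidr_block input are all illustrative.

```hcl
# Local path
module "vpc_local" {
  source     = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
}

# Git repository pinned to a tag through ?ref=
module "vpc_git" {
  source     = "github.com/org/terraform-aws-vpc?ref=v1.0.0"
  cidr_block = "10.0.0.0/16" # assumes the module exposes this input
}

# Terraform Registry; the version argument is only supported for registry sources
module "vpc_registry" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # assumed constraint

  name = "prod"
  cidr = "10.0.0.0/16"
}
```

Pinning a tag or a version keeps a module upgrade from arriving unannounced on the next terraform init.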
👉 Next: managing variables and outputs
07-variables-and-outputs
Variable Types
variable "instance_count" {
  type = number
}
variable "tags" {
  type    = map(string)
  default = {}
}
variable "db_password" {
  type      = string
  sensitive = true # redacted in plan/apply output; still stored in state
}
Output Definitions
output "public_ip" {
  value       = aws_instance.web.public_ip
  description = "Web server public IP"
}
Protecting Sensitive Data
- Never hardcode passwords!
- Mark secrets with sensitive = true
- Combine with an external secrets manager (see 10-secrets-management.md)
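Sensitivity propagates: an output that is derived from a sensitive variable has to be marked sensitive as well, or Terraform rejects the configuration. A small sketch, in which the connection string format and the host name are hypothetical:

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

# Derived from a sensitive value, so it must be marked sensitive too
output "db_connection_string" {
  value     = "postgres://app:${var.db_password}@db.example.internal:5432/app"
  sensitive = true
}
```

The value can be supplied without writing it to disk by exporting the TF_VAR_db_password environment variable before running Terraform. Note that sensitive only redacts CLI output; the value is still written to the state file in plain text, which is one more reason to use an encrypted remote backend.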
tfvars Files
# prod.tfvars
region        = "us-west-2"
instance_type = "m5.large"
Usage:
terraform apply -var-file="prod.tfvars"
08-multi-cloud-strategy
Unified Multi-Cloud Management
# providers.tf
provider "aws" {
  region = "us-east-1"
  alias  = "us_east"
}
provider "azurerm" {
  features {}
  alias = "eastus"
}
# resources.tf
module "aws_web" {
  source    = "./modules/web"
  providers = { aws = aws.us_east }
}
module "azure_web" {
  source    = "./modules/azure-web"
  providers = { azurerm = azurerm.eastus }
}
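Abstraction Layer Design (recommended)
- Create modules with a unified interface that hide the differences between clouds
- Example: module "database" chooses RDS or Azure SQL internally based on a cloud parameter
💡 Challenge: networking, security groups, IAM, and similar concerns cannot be fully abstracted, so design with care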
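One way such a database module could be laid out is sketched below. Everything here is hypothetical: the submodule paths, the assumption that each submodule exports an endpoint output, and the two supported cloud values.

```hcl
# modules/database/main.tf (hypothetical layout)
variable "cloud" {
  description = "Target cloud for the database"
  type        = string

  validation {
    condition     = contains(["aws", "azure"], var.cloud)
    error_message = "cloud must be \"aws\" or \"azure\"."
  }
}

# Exactly one of the two child modules is instantiated.
module "rds" {
  source = "./aws-rds" # hypothetical submodule wrapping RDS
  count  = var.cloud == "aws" ? 1 : 0
}

module "azure_sql" {
  source = "./azure-sql" # hypothetical submodule wrapping Azure SQL
  count  = var.cloud == "azure" ? 1 : 0
}

# Both submodules are assumed to export an "endpoint" output.
output "endpoint" {
  value = one(concat(module.rds[*].endpoint, module.azure_sql[*].endpoint))
}
```

Callers only see the cloud input and the endpoint output. The one() function returns the single element of the combined list, so the output has the same shape whichever branch is active.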
09-on-prem-with-terraform
Supported On-Premises Platforms
- VMware vSphere (vsphere provider)
- Proxmox (proxmox provider)
- Libvirt (libvirt provider)
- Custom scripts (null_resource + local-exec)
Example: Proxmox VM
provider "proxmox" {
  pm_api_url      = "https://proxmox.example.com:8006/api2/json"
  pm_user         = "terraform@pve"
  pm_password     = var.proxmox_password
  pm_tls_insecure = true # skips TLS verification; only for self-signed lab certificates
}
resource "proxmox_vm_qemu" "web" {
  name        = "web-server"
  target_node = "pve-node1"
  clone       = "ubuntu-template"
  cores       = 2
  memory      = 2048
}
⚠️ Note: for on-premises environments, make sure the API is reachable and the credentials are stored securely
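Community providers are not in the hashicorp namespace, so terraform init needs to be told where to find them. The sketch below assumes the Telmate/proxmox provider, which matches the pm_* arguments used above, and declares the password variable the example refers to.

```hcl
terraform {
  required_providers {
    proxmox = {
      source = "Telmate/proxmox" # assumed community provider; pin a version you have tested
    }
  }
}

variable "proxmox_password" {
  description = "Password for the terraform@pve user"
  type        = string
  sensitive   = true
}
```

The password can then be passed through the TF_VAR_proxmox_password environment variable rather than a tfvars file that might end up in Git.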
10-secrets-management
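Security Principles
- Never commit secrets to Git
- Apply the principle of least privilege
- Rotate secrets automatically
Integration Options
AWS SSM Parameter Store
data "aws_ssm_parameter" "db_password" {
  name = "/prod/db/password"
}
resource "aws_rds_cluster" "main" {
  # other required arguments omitted
  master_password = data.aws_ssm_parameter.db_password.value
}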
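The data source above assumes the parameter already exists. One possible way to create it without anyone typing the password is to generate it with the random provider, typically in a separate bootstrap configuration. This is a sketch; the length and character settings are assumptions.

```hcl
resource "random_password" "db" {
  length  = 24
  special = false # avoids characters that some database engines reject
}

resource "aws_ssm_parameter" "db_password" {
  name  = "/prod/db/password"
  type  = "SecureString" # encrypted at rest with KMS
  value = random_password.db.result
}
```

The generated value still lands in the state of whichever configuration creates it, so that state needs the same protection as the secret itself.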
HashiCorp Vault
provider "vault" {
  address = "https://vault.example.com"
}
data "vault_generic_secret" "db" {
  path = "secret/data/prod/db"
}
resource "aws_rds_cluster" "main" {
  # other required arguments omitted
  master_password = data.vault_generic_secret.db.data["password"]
}
Terraform Cloud/Enterprise
Store secrets as workspace variables and mark them as Sensitive
11-remote-backend-best-practices
Complete S3 + DynamoDB Backend Setup
1. Create the S3 bucket (with versioning and encryption enabled)
aws s3api create-bucket --bucket my-terraform-state --region us-east-1
aws s3api put-bucket-versioning --bucket my-terraform-state --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket my-terraform-state --server-side-encryption-configuration '{
  "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
}'
2. Create the DynamoDB table
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
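The same two resources can also be described in Terraform itself, in a small bootstrap configuration that keeps its own state locally. The sketch below mirrors the CLI commands above and assumes AWS provider v4 or later, where versioning and encryption are separate resources; the resource labels are arbitrary.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Use either the CLI commands or this configuration, not both, since the bucket and table names would collide.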
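3. Configure the backend
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
🔐 Least-privilege IAM: grant only the permissions the backend needs, such as s3:ListBucket, s3:GetObject, s3:PutObject, dynamodb:GetItem, dynamodb:PutItem, and dynamodb:DeleteItem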
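A least-privilege policy for this backend might look like the sketch below. The policy name and the account ID in the DynamoDB ARN are placeholders, and setups that use workspaces or delete state objects may need further actions such as s3:DeleteObject.

```hcl
data "aws_iam_policy_document" "terraform_backend" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-terraform-state"]
  }

  statement {
    sid       = "ReadWriteStateObject"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-terraform-state/global/s3/terraform.tfstate"]
  }

  statement {
    sid       = "StateLocking"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/terraform-locks"] # placeholder account ID
  }
}

resource "aws_iam_policy" "terraform_backend" {
  name   = "terraform-backend-access" # hypothetical name
  policy = data.aws_iam_policy_document.terraform_backend.json
}
```

Scoping the object statement to a single state key means a role for one stack cannot read or overwrite the state of another.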
12-ci-cd-integration
GitHub Actions Example
name: Terraform
on:
  push:
    branches: [ main ]
jobs:
  terraform:
    runs-on: ubuntu-latest
    # Set at job level because init, plan, and apply all need credentials.
    # The secret names are assumed; use the ones defined in your repository.
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        run: terraform plan -out=tfplan
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
Approval Workflow (enterprise)
- Use Terraform Cloud run triggers with manual approval
- Or pause the CI pipeline and wait for a human to confirm
13-testing-terraform
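Testing Pyramid
- Unit tests: validate HCL logic (terraform validate + checkov)
- Integration tests: deploy to a temporary environment and verify (Terratest)
- Compliance tests: scan against security policies (OPA/Conftest)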
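For the lower layers of the pyramid, Terraform v1.6 and later also ship a native test framework, which is not covered elsewhere in this series. A minimal sketch, assuming the hello-world configuration from chapter 03 and a hypothetical file name:

```hcl
# tests/hello_world.tftest.hcl (hypothetical file name)
run "instance_type_is_small" {
  # Plan only, so no resources are created
  command = plan

  assert {
    condition     = aws_instance.web.instance_type == "t3.micro"
    error_message = "The web instance should use t3.micro."
  }
}
```

It runs with terraform test from the configuration directory. The provider still needs valid credentials to produce a plan, even though nothing is created.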
Terratest Example (Go)
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestTerraformHelloWorld(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../examples/hello-world",
	}
	// Destroy the resources when the test finishes, even if it fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	// Verify the output
	output := terraform.Output(t, terraformOptions, "public_ip")
	assert.NotEmpty(t, output)
}
🧪 Run:
go test -v .
14-workspaces-and-environments
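Workspaces vs. Directory Separation
| Approach | Strengths | Weaknesses |
|---|---|---|
| Workspaces | Single codebase | Hard to vary configuration between environments |
| Directory separation | Environments are fully independent | Duplicated code |
Recommended: directory separation + module reuse
environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
└── prod/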
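With this layout, each environment directory can stay thin: its own backend key plus calls into shared modules. The sketch below assumes a modules/ directory next to environments/, and the state key layout and variable name are illustrative.

```hcl
# environments/dev/main.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "dev/network/terraform.tfstate" # assumed layout: one state file per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

variable "vpc_cidr_block" {
  description = "Set in terraform.tfvars for each environment"
  type        = string
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = var.vpc_cidr_block
}
```

The staging and prod directories would repeat this file with a different key, while the values that differ live in each terraform.tfvars. That keeps the duplicated code down to a few lines.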
When Workspaces Fit
- Temporary test environments (terraform workspace new test-123)
- Multi-tenant SaaS (one workspace per customer)
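Inside a configuration, the current workspace name is available as terraform.workspace, which allows a limited amount of per-workspace variation. A sketch in which the workspace names and instance sizes are assumptions:

```hcl
locals {
  # Hypothetical per-workspace settings
  instance_types = {
    default = "t3.micro"
    prod    = "m5.large"
  }

  # Fall back to the default entry for workspaces that are not listed,
  # such as a short-lived test-123 workspace
  instance_type = lookup(local.instance_types, terraform.workspace, local.instance_types["default"])
}

resource "aws_instance" "web" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = local.instance_type

  tags = {
    Name = "web-${terraform.workspace}"
  }
}
```

This works for small differences, but once the lookup maps start to grow, the directory-separated layout is usually easier to reason about.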
15-policy-as-code
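Using Open Policy Agent (OPA)
1. Define the policy (policy.rego)
package terraform
# Rego v0 syntax; with OPA 1.0+ write the rule head as: deny contains msg if { ... }
deny[msg] {
  # Bind one resource change so both conditions apply to the same resource
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket"
  rc.change.actions[_] == "delete"
  msg := sprintf("Deleting S3 bucket %s is not allowed", [rc.address])
}
2. Scan the plan
terraform show -json tfplan > plan.json
conftest test plan.json -p policy.rego --namespace terraform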
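As a complement to the external policy check, Terraform has a built-in guard for the same concern. The sketch below uses a hypothetical bucket name.

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket" # hypothetical name

  lifecycle {
    # Any plan that would destroy this bucket fails with an error
    prevent_destroy = true
  }
}
```

This guard is weaker than the policy: if the whole resource block is removed from the configuration, the setting disappears with it and Terraform will plan the deletion. The Conftest check still catches that case because it inspects the plan itself.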
Terraform Sentinel (Enterprise)
- Enforces policies inside Terraform Cloud
- Supports more complex logic (such as cost control)
A1-cheat-sheet
Common Commands
# Initialize
terraform init
# Format
terraform fmt -recursive
# Validate
terraform validate
# Preview
terraform plan -var="env=prod"
# Apply (skips the confirmation prompt)
terraform apply -auto-approve
# Destroy (skips the confirmation prompt)
terraform destroy -auto-approve
# Inspect state
terraform state list
terraform state show aws_instance.web
Quick Debugging
TF_LOG=DEBUG terraform apply # print verbose logs
A2-troubleshooting
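Common Errors
1. Error: Invalid for_each argument
- Cause: the for_each keys depend on values that are not known until apply (for example, the ID of a resource that has not been created yet)
- Fix: build the keys from values known at plan time, or create the dependency first with terraform apply -target=<address>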
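The difference between unknown keys and unknown values is the crux of this error. The sketch below illustrates it with hypothetical subnets and a route table.

```hcl
locals {
  subnet_cidrs = {
    a = "10.0.1.0/24"
    b = "10.0.2.0/24"
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  for_each   = local.subnet_cidrs
  vpc_id     = aws_vpc.main.id
  cidr_block = each.value
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
}

# Fails on a fresh plan, because subnet IDs are not known until apply:
#   for_each = toset([for s in aws_subnet.public : s.id])

# Works: the keys ("a", "b") are static; only the values are unknown
resource "aws_route_table_association" "public" {
  for_each       = aws_subnet.public
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}
```

Passing the resource map straight to for_each keeps the keys from the configuration, so Terraform can compute the instance addresses at plan time even though the IDs are filled in later.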
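2. BucketRegionError: incorrect region
- Cause: the S3 bucket is in a different region from the one configured
- Fix: set the correct region in the backend configuration
3. ResourceInUse: resource is in use
- Cause: the resource is still depended on by another service (for example, an EIP attached to an instance)
- Fix: detach it first, or use lifecycle { ignore_changes = [...] } if the conflicting attribute is managed outside Terraform
4. State lock conflict
- Fix: terraform force-unlock <LOCK_ID> (only after confirming that no other run is still in progress)
🆘 Last resort: edit the state by hand (terraform state pull → edit → terraform state push). Extremely dangerous!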