Infrastructure as Code Explained: Managing Servers Like Software

Before GitHub stored its source code on GitHub, the company faced a paradox familiar to many growing startups: it had excellent practices for managing application code, but its infrastructure was managed the old-fashioned way, by hand through web consoles, with institutional knowledge living in the heads of a few senior engineers rather than in any document.

When GitHub needed to provision new servers for a new feature, an engineer would SSH into a reference machine, look at what was installed, try to replicate it, and inevitably produce a slightly different server. When a machine failed, recovering it required someone to remember (or guess) exactly what configuration it had. When they needed to create a staging environment that matched production for testing, the exercise revealed that nobody could fully describe what production looked like---the documentation was the servers themselves, and the servers were not talking.

GitHub solved this by adopting Infrastructure as Code. Within a year, they could provision a complete, production-accurate environment in minutes instead of days. Every infrastructure change was reviewed by engineers before application. Servers were identical because they were produced from identical code. The documentation was the code, and the code was in Git.

Infrastructure as Code is one of the defining practices of modern software operations. Organizations that have adopted it describe the transition in remarkably similar terms: the period before IaC feels like "the dark ages" compared to what came after.


What Infrastructure as Code Means

Infrastructure as Code (IaC) is the practice of managing and provisioning computing infrastructure---servers, networks, databases, load balancers, firewalls, DNS records---through machine-readable definition files rather than through manual processes or interactive configuration tools.

Instead of clicking through a cloud console to create a server (select region, choose instance type, configure networking, attach storage, set security groups), you write a configuration file that specifies all of these parameters. A tool reads the file and creates the server. Running the same file again either confirms the server already exists in the correct configuration or updates it to match.

The word "code" matters. IaC treats infrastructure definitions with the same engineering rigor as application code:

  • Version controlled: Every change tracked in Git, with author, timestamp, and description
  • Reviewed: Pull request workflow means changes are seen by multiple engineers before application
  • Tested: Automated validation runs before any change reaches production
  • Documented: The code is the documentation, so documentation and reality cannot drift apart as long as every change goes through the code

This disciplined approach transforms infrastructure from a manual art practiced by a small number of specialists into a documented, repeatable engineering process that any team member can participate in.


The Problem IaC Solves: Snowflake Servers

The term snowflake server describes a server (or other infrastructure component) that has been individually hand-configured over time, making it unique and irreplaceable. Like an actual snowflake, no two are exactly alike, and each is delicate.

Snowflake servers accumulate through entirely reasonable processes: an engineer made a configuration change to fix an urgent problem and never documented it. A security patch required a configuration flag that wasn't in the standard setup. A performance optimization was applied manually to one server but not others. Over months and years, a server diverges from any documented baseline until its configuration is known only to the engineers who worked on it---and only imperfectly to them.

The consequences are predictable:

  • Disaster recovery failures: When a snowflake server fails, recreating it requires reverse-engineering its configuration from logs, memory, and other servers
  • Environment inconsistency: Staging servers configured differently from production cause bugs that only appear in production
  • Knowledge bottlenecks: Infrastructure knowledge concentrated in individual engineers creates organizational risk when those engineers leave
  • Slow provisioning: Creating new servers requires manual work, taking days instead of minutes
  • Security gaps: Manual configuration is inconsistent, creating servers with different security postures

IaC replaces snowflakes with what practitioners call cattle: interchangeable, replaceable servers produced from the same code. If a server fails, delete it and create a new one from the code. If you need 50 more servers for a traffic spike, raise the instance count in the code and apply it. The code is the source of truth, not the servers themselves.


Declarative vs. Imperative Approaches

IaC tools take one of two fundamental philosophical approaches, and understanding the distinction affects tool selection and how you think about infrastructure management.

Declarative IaC

Declarative IaC describes the desired end state: "I want three EC2 instances of type t3.medium, running Amazon Linux 2, in the us-east-1 region, in subnet subnet-abc123, with security group sg-xyz789."

The tool handles the transition from current state to desired state. If no instances exist, it creates three. If one is missing, it creates one. If one has the wrong instance type, it changes it. If all three already match the description, it does nothing. The operator specifies what they want; the tool figures out how to get there.
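
The convergence behaviour described above can be sketched in a few lines of Python. This is an illustration only, not how Terraform or any other tool is implemented: the Instance class, the plan and apply functions, and the web-0 to web-2 names are all hypothetical, and the "infrastructure" is just an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    instance_type: str


def plan(desired: dict, actual: dict) -> list:
    """Return the (action, name) steps needed to turn `actual` into `desired`."""
    steps = []
    for name, want in desired.items():
        have = actual.get(name)
        if have is None:
            steps.append(("create", name))
        elif have != want:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: dict, actual: dict) -> dict:
    """Apply the plan to a copy of the (in-memory) infrastructure."""
    result = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del result[name]
        else:  # create and update both end with the desired definition
            result[name] = desired[name]
    return result


# Desired state: three t3.medium instances.
desired = {f"web-{i}": Instance(f"web-{i}", "t3.medium") for i in range(3)}

# Current state: one correct, one with the wrong type, one missing.
actual = {
    "web-0": Instance("web-0", "t3.medium"),
    "web-1": Instance("web-1", "t3.small"),
}

first_plan = plan(desired, actual)
converged = apply(desired, actual)
second_plan = plan(desired, converged)

print(first_plan)   # [('update', 'web-1'), ('create', 'web-2')]
print(second_plan)  # []
```

The second plan is empty because the first apply already converged on the desired state, which is what makes repeated runs safe.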

Key advantages:

  • Idempotency by design: Running declarative IaC repeatedly is safe---it converges to the desired state without creating duplicates or causing errors
  • State management: The tool tracks what exists and what needs to change
  • Safety: The tool can show a plan of changes before applying them, allowing review before any action is taken

Terraform, CloudFormation (AWS), Azure Resource Manager (ARM), and Pulumi (when used declaratively) are the major declarative tools.

Imperative IaC

Imperative IaC describes the steps to reach the desired state: "Connect to the server. Run apt-get update. Install nginx. Copy the configuration file to /etc/nginx/nginx.conf. Enable the service."

The tool executes each step in sequence. The operator specifies how to reach the desired state; the tool executes the instructions.

Key advantage: More flexibility for complex configuration logic, conditional steps, and one-time tasks.

Key disadvantage: Running the same imperative script twice can cause problems (creating a resource that already exists, installing a package that is already installed). Idempotency must be explicitly designed in.
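
A minimal sketch of the difference, assuming a configuration file modelled as a list of lines. The function names and the nginx-style settings are hypothetical and chosen only for illustration.

```python
def append_line(lines: list, line: str) -> list:
    """Imperative step with no guard: every run adds the line again."""
    return lines + [line]


def ensure_line(lines: list, line: str) -> list:
    """Idempotent variant: inspect the current state, act only if needed."""
    if line in lines:
        return lines
    return lines + [line]


config = ["worker_processes 4;"]
setting = "keepalive_timeout 65;"

# Simulate running the same script twice.
naive = append_line(append_line(config, setting), setting)
guarded = ensure_line(ensure_line(config, setting), setting)

print(naive.count(setting))    # 2: the setting was written twice
print(guarded.count(setting))  # 1: the second run changed nothing
```

The guard in ensure_line is the "explicit design" that imperative scripts need: each step checks the current state before acting.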

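Shell scripts are imperative, and Ansible and Chef are usually grouped with them because playbooks and recipes run their steps in order. Puppet is the exception among configuration management tools: its language is declarative. Most Ansible modules and Chef resources are idempotent on their own, but a playbook or recipe as a whole is only idempotent if it is written carefully, particularly around raw shell commands.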

Characteristic    Declarative                   Imperative
----------------  ----------------------------  ----------------------------------------
You specify       Desired end state             Steps to reach state
Idempotency       Built-in                      Must be explicitly designed
State tracking    Tool manages                  Manual or absent
Best for          Infrastructure provisioning   Configuration management, one-time tasks
Examples          Terraform, CloudFormation     Ansible, Chef, shell scripts

In practice, many teams use both: Terraform (declarative) to provision the infrastructure, and Ansible (configuration management) to configure what runs on it.


The Major Tools in Depth

Terraform

HashiCorp Terraform (launched 2014) is the most widely adopted IaC tool for cloud infrastructure provisioning. Its dominance comes from several distinctive characteristics.

Provider ecosystem: Terraform works with AWS, Azure, Google Cloud, and hundreds of other providers through a plugin system called providers. Over 3,000 providers exist, covering everything from major cloud platforms to SaaS APIs, DNS registrars, and monitoring services. A single Terraform codebase can provision across multiple cloud providers.

HCL (HashiCorp Configuration Language): Terraform's configuration language is designed for human readability and machine parsing. Resources, variables, outputs, and data sources are expressed in a structured but readable format.

State management: Terraform maintains a state file that maps the desired configuration to real-world infrastructure. Before any change, Terraform generates a plan---a diff showing exactly what will be created, modified, or destroyed. Operators review the plan before applying. This "plan before apply" workflow is a critical safety mechanism.

Remote state: The state file can be stored remotely (S3, Terraform Cloud, Azure Blob) with locking to prevent concurrent modifications. This enables team collaboration without state conflicts.

Example: Airbnb manages the majority of their infrastructure using Terraform across AWS. Their IaC repository contains thousands of Terraform modules defining EC2 instances, RDS databases, ElastiCache clusters, VPCs, and hundreds of other resource types. New infrastructure provisioning that previously took days of manual work completes in minutes through automated Terraform runs.

Terraform Cloud and Enterprise: HashiCorp's managed platforms for Terraform add CI/CD integration, policy enforcement (Sentinel policies), cost estimation, and team access controls.

AWS CloudFormation

CloudFormation is AWS's native IaC service (launched 2011). It predates Terraform and remains widely used in AWS-focused organizations.

Deep AWS integration: CloudFormation is built into AWS and covers most AWS services. Support for newly launched features varies: some are available in CloudFormation at launch while others arrive later, and the same is true of Terraform's AWS provider, so check coverage for the specific resources you need. Because CloudFormation runs inside AWS, it can act through an IAM service role, so no access keys have to be issued to an external tool.

Managed service: Unlike Terraform, CloudFormation requires no tool installation, no state file management, and no backend configuration. AWS manages all of this. You submit a template to CloudFormation, and AWS creates, updates, or deletes resources according to the template.

Stack model: CloudFormation organizes resources into stacks. Creating a stack creates the resources; deleting the stack deletes all resources. This clean lifecycle management is valuable for ephemeral environments.

Limitations: CloudFormation only works with AWS. Organizations using multiple cloud providers cannot use it for non-AWS infrastructure.

Example: Amazon uses CloudFormation internally for provisioning infrastructure across their retail and AWS operations. The team that built AWS Lambda originally deployed and managed the Lambda infrastructure using CloudFormation, with the template defining EC2 instances, load balancers, VPCs, and IAM roles for the service.

AWS CDK and Pulumi: Code Over Configuration

A newer category of IaC tools uses general-purpose programming languages instead of domain-specific configuration languages.

AWS CDK (Cloud Development Kit): Allows defining AWS infrastructure in TypeScript, JavaScript, Python, Java, C#, or Go. CDK synthesizes to CloudFormation templates, combining CDK's programming abstractions with CloudFormation's managed execution.

Pulumi: Similar to CDK but multi-cloud. Define infrastructure in TypeScript, JavaScript, Python, Go, C#, or Java. Rather than synthesizing CloudFormation templates, a Pulumi program runs against Pulumi's own deployment engine, which supports multiple providers and stores state in Pulumi Cloud or a self-managed backend.

The advantage: Full programming language features---loops, conditionals, functions, classes, testing frameworks, package managers. Complex infrastructure that would require hundreds of repetitive declarative blocks can be expressed in a loop.
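
To make the loop argument concrete, here is a sketch in plain Python. It deliberately uses neither the CDK nor the Pulumi API; it only builds a dictionary in the shape of Terraform's JSON configuration syntax. The service names, the placeholder AMI ID, and the platform-team owner are assumptions made for the example.

```python
import json

# Hypothetical services and AMI ID, used only for illustration.
SERVICES = {"api": "t3.medium", "worker": "t3.large", "admin": "t3.small"}
AMI_ID = "ami-00000000000000000"


def build_config(services: dict) -> dict:
    """Build a Terraform JSON-syntax document with one aws_instance per service."""
    instances = {}
    for name, instance_type in services.items():
        instances[name] = {
            "ami": AMI_ID,
            "instance_type": instance_type,
            "tags": {"Name": name, "Owner": "platform-team"},
        }
    return {"resource": {"aws_instance": instances}}


config = build_config(SERVICES)

# The output could be written to a file such as main.tf.json.
print(json.dumps(config, indent=2))
```

Adding a fourth service means adding one entry to SERVICES rather than copying another resource block, and a change to the tagging rule is made in exactly one place.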

The trade-off: The power of a full programming language can create complexity. Infrastructure that "just does what the code says" but is difficult to reason about at a glance is harder to review and audit than explicit declarative configuration.

Configuration Management Tools

Ansible: Agentless (connects via SSH), uses YAML playbooks to describe configuration steps. Widely used for configuring servers after they are provisioned by Terraform or CloudFormation: installing packages, managing files, enabling services, and deploying application code. Strong idempotency support through Ansible modules.

Chef: Uses a Ruby DSL for configuration "recipes." Typically deployed with a Chef server and a Chef client installed on each managed node. Harder to learn than Ansible, but well suited to intricate configuration management at scale.

Puppet: Declarative configuration management language. Mature, robust, used in large enterprise environments. Typically deployed with a Puppet server and an agent on each managed node.

The general trend: Ansible's simplicity and agentless architecture have made it dominant for new configuration management implementations, while Chef and Puppet persist in organizations with existing investments.


The IaC Workflow in Practice

The Development Workflow

  1. Branch: Create a Git branch for the infrastructure change
  2. Write: Modify or add IaC configuration
  3. Validate locally: Run terraform validate or equivalent to catch syntax errors
  4. Plan: Run terraform plan to see what changes will occur
  5. Open pull request: Submit the branch for review; automated CI runs validation
  6. Code review: One or more engineers review the plan and code changes
  7. Approve: Reviewer approves the pull request
  8. Apply: Automated system or reviewer runs terraform apply (or deploys via CI/CD)
  9. Verify: Confirm infrastructure is in the expected state

This workflow mirrors software development practices. Infrastructure changes have the same rigor as application code changes.

CI/CD Integration for IaC

IaC should integrate with CI/CD pipelines just as application code does. A typical IaC CI pipeline:

  • On pull request: Run terraform validate, terraform plan, security scanning (Checkov, Terrascan), cost estimation (Infracost)
  • On merge to main: Run terraform apply to apply changes to non-production environments
  • On production deployment: Require additional approval before applying to production; run plan again to catch any changes since the PR

Automated policy checking enforces organizational standards. Tools like Open Policy Agent (OPA) and HashiCorp Sentinel evaluate IaC configurations against rules: "all EC2 instances must have an Owner tag," "no security groups should allow inbound traffic from 0.0.0.0/0 on port 22," "all S3 buckets must have encryption enabled."

These policy-as-code checks enforce security and compliance standards automatically: a violation fails the pipeline before it can be deployed, instead of depending on a human reviewer to catch it.
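
The idea behind such rules can be sketched in Python. This is not OPA or Sentinel syntax. It assumes input shaped like a simplified version of the resource_changes list in Terraform's JSON plan output, and it implements only the tagging rule and the SSH rule from the examples above. The resource addresses and tag values are hypothetical.

```python
def check_policies(resource_changes: list) -> list:
    """Return one message per policy violation found in the planned changes."""
    violations = []
    for rc in resource_changes:
        after = rc.get("change", {}).get("after") or {}
        address, rtype = rc["address"], rc["type"]

        # Rule 1: every EC2 instance must carry an Owner tag.
        if rtype == "aws_instance" and "Owner" not in (after.get("tags") or {}):
            violations.append(f"{address}: missing Owner tag")

        # Rule 2: no security group may open port 22 to the whole internet.
        if rtype == "aws_security_group":
            for rule in after.get("ingress") or []:
                open_to_world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
                covers_ssh = rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
                if open_to_world and covers_ssh:
                    violations.append(f"{address}: SSH open to 0.0.0.0/0")
    return violations


sample_plan = [
    {"address": "aws_instance.web", "type": "aws_instance",
     "change": {"after": {"instance_type": "t3.medium", "tags": {"Name": "web"}}}},
    {"address": "aws_security_group.admin", "type": "aws_security_group",
     "change": {"after": {"ingress": [
         {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]}}},
    {"address": "aws_instance.api", "type": "aws_instance",
     "change": {"after": {"tags": {"Owner": "payments"}}}},
]

violations = check_policies(sample_plan)
for message in violations:
    print(message)
```

In a pipeline, a non-empty violations list would fail the job, so the change never reaches apply.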

Module Design and Reuse

IaC code should be organized into modules: reusable components that encapsulate common infrastructure patterns.

A web-application module might provision an Application Load Balancer, an Auto Scaling Group, a security group, IAM roles, and CloudWatch alarms---everything needed to run a web application. Individual services instantiate this module with their specific parameters rather than duplicating all this configuration.

Benefits of modularization:

  • Encode best practices once; enforce them everywhere
  • Changes to the module (adding security controls, updating instance types) propagate to all users
  • New services spin up faster by using existing modules
  • Consistency across all infrastructure that uses the module

Module versioning is critical. Modules should be versioned (using Git tags or the Terraform Registry), and module consumers should pin to specific versions. That way, updating a module is a deliberate decision that can be tested before wide rollout.
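
A pinning rule like this can be checked mechanically. The sketch below assumes module blocks have already been parsed into a dictionary and that each carries a registry-style version argument; modules sourced from Git pin through a ref in the source URL instead and are not handled here. The module names and sources are hypothetical.

```python
import re

# Accepts "2.3.1" or "= 2.3.1"; rejects ranges such as "~> 1.4" or ">= 1.0".
EXACT_VERSION = re.compile(r"^(=\s*)?\d+\.\d+\.\d+$")


def unpinned_modules(modules: dict) -> list:
    """Return the names of modules whose version is missing or not an exact pin."""
    return sorted(
        name
        for name, block in modules.items()
        if not EXACT_VERSION.match(block.get("version", "").strip())
    )


modules = {
    "web_app": {"source": "example-org/web-application/aws", "version": "2.3.1"},
    "database": {"source": "example-org/database/aws", "version": "~> 1.4"},
    "queue": {"source": "example-org/queue/aws"},
}

flagged = unpinned_modules(modules)
print(flagged)  # ['database', 'queue']
```

Whether range constraints such as "~> 1.4" count as acceptable is a team decision; this sketch takes the strict reading of "pin to specific versions."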


Managing State

Terraform's state file is a foundational concept that deserves careful attention. The state file maps the Terraform configuration to real-world resources---it is how Terraform knows that aws_instance.web_server corresponds to EC2 instance i-0abc123456789.

State File Risks

Losing the state file: If Terraform loses track of existing resources, it thinks they do not exist and will try to create new ones---while the old ones continue running (and costing money). Recovering from lost state is possible but painful.

Corrupted state: Corrupted state can cause Terraform to attempt destructive operations it would otherwise not perform.

Concurrent modification: Two engineers running terraform apply simultaneously can corrupt state by both attempting to update the same resources.

State Best Practices

Remote state with locking: Store state in S3 with locking (traditionally a DynamoDB table; recent Terraform releases can also keep a lock file in the bucket itself), Terraform Cloud, or Azure Blob Storage with lease locking. Remote state allows team collaboration; locking prevents concurrent modifications.
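
The locking idea itself is simple, and a local lock file is enough to illustrate it. The sketch below is a stand-in for what a remote backend does, not a description of any backend's actual mechanism; the state_lock helper, the StateLockedError class, and the alice and bob owners are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class StateLockedError(RuntimeError):
    """Raised when another run already holds the lock."""


@contextmanager
def state_lock(lock_path: str, owner: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise StateLockedError(f"state is locked: {lock_path}") from None
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(owner)  # record who holds the lock
    try:
        yield
    finally:
        os.remove(lock_path)  # release the lock even if the run fails


lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")
second_run_blocked = False

with state_lock(lock_path, "alice"):
    try:
        with state_lock(lock_path, "bob"):  # a concurrent run
            pass
    except StateLockedError:
        second_run_blocked = True

print(second_run_blocked)         # True
print(os.path.exists(lock_path))  # False: released when alice's run ended
```

The second run fails fast instead of writing to the same state, which is the behaviour you want from a locking backend.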

State encryption: State files can contain sensitive values (database passwords, access keys). Enable encryption at rest for the state backend.

Separate state per environment: Use separate state files for development, staging, and production. This prevents a Terraform operation in development from affecting production resources.

Never manually edit state: The terraform state commands provide safe mechanisms for state manipulation. Direct file editing is fragile and error-prone.


Secrets Management in IaC

Secrets---passwords, API keys, certificates, private keys---require special handling. The cardinal rule: never commit secrets to version control.

Secrets committed to Git remain in the repository's history even after the file is deleted, unless the history itself is rewritten, and any secret that was ever pushed should be treated as compromised and rotated. They may be inadvertently shared when the repository is shared, forked, or cloned. GitHub's secret scanning feature discovers thousands of accidentally committed credentials every day.

The Right Approach

Reference secrets from secret managers: IaC code should reference secrets stored in AWS Secrets Manager, Azure Key Vault, Google Secret Manager, or HashiCorp Vault. The IaC code accesses the secret at apply time through the provider integration, so the actual secret value never appears in the IaC code. With Terraform, a value read this way can still be recorded in the state file, which is one more reason to encrypt state and restrict access to it.

Environment variables for sensitive inputs: Terraform accepts variable values from environment variables (TF_VAR_*). CI/CD systems inject secrets as environment variables from their own secret management without committing them to code.
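
As a sketch of that injection step, the helper below builds the environment a CI job might hand to Terraform. The terraform_env function, the db_password variable name, and the placeholder value are assumptions for the example; only the TF_VAR_ prefix comes from Terraform itself.

```python
import os
from typing import Optional


def terraform_env(secrets: dict, base_env: Optional[dict] = None) -> dict:
    """Return an environment in which each secret is exposed as TF_VAR_<name>."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in secrets.items():
        # Terraform maps TF_VAR_db_password to the input variable "db_password".
        env[f"TF_VAR_{name}"] = value
    return env


# In CI the value would come from the platform's secret store, not a literal.
env = terraform_env({"db_password": "example-only"}, base_env={"PATH": "/usr/bin"})
print(sorted(env))  # ['PATH', 'TF_VAR_db_password']

# The environment would then be passed to the Terraform process, for example:
# subprocess.run(["terraform", "apply", "-input=false"], env=env, check=True)
```

The secret exists only in the process environment for the duration of the run and never touches the repository.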

OIDC authentication: Modern CI/CD platforms support OpenID Connect authentication to cloud providers, eliminating the need for long-lived stored credentials. The CI runner proves its identity with a signed token; the cloud provider issues temporary credentials in exchange. There are no static secrets to manage, rotate, or expose.

IaC secrets management and broader cloud security practices reinforce each other: security depends on IaC to enforce consistent configurations, and IaC depends on security practices to protect the access credentials it uses.


Common Mistakes and Anti-Patterns

Manual Changes Outside IaC (Configuration Drift)

The most destructive anti-pattern: making changes to infrastructure manually after it has been defined in IaC. This creates configuration drift---the actual infrastructure diverges from what the IaC code describes.

When Terraform runs against drifted infrastructure, it may:

  • Overwrite the manual change back to the IaC-defined state (losing the change)
  • Fail with unexpected errors due to the inconsistent state
  • Produce unpredictable behavior when the current state does not match what Terraform expects

The discipline: all infrastructure changes go through IaC. If an urgent situation requires a manual change, document it immediately and create an IaC change to reflect it. Many organizations run terraform plan on a schedule in CI to detect drift automatically.
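
One way to build such a check, sketched in Python, relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The function names and status labels are hypothetical, and check_drift assumes terraform is installed and the working directory is already initialized.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "changes-detected"}


def classify_plan_exit(code: int) -> str:
    """Translate a plan exit code into a status label."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


print(classify_plan_exit(0))  # in-sync
print(classify_plan_exit(2))  # changes-detected
```

A status of changes-detected on the main branch means either that someone changed the infrastructure by hand or that merged code has not been applied yet; both are worth an alert.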

Example: AWS Config and Terraform's drift detection features can identify when real infrastructure diverges from IaC definitions. Organizations configure alerts when drift is detected and treat it as an incident requiring immediate resolution.

Giant Monolithic States

Storing all infrastructure in a single Terraform state file creates risk and performance problems. A state file with thousands of resources is slow to plan and apply, difficult to reason about, and creates a wide blast radius when something goes wrong.

Solution: Separate state into layers and teams. A common pattern:

  • Core networking (VPCs, subnets) in one state
  • Shared services (DNS, monitoring, security tools) in another
  • Each application in its own state
  • Each environment (dev, staging, prod) in separate states

This separation limits blast radius, allows teams to work independently, and makes plans faster.

Skipping Reviews for "Small" Changes

The temptation to apply small changes directly without review is high and dangerous. A seemingly small change ("I'm just adding one security group rule") can have unexpected consequences. Security group rules can unintentionally expose services. Small network changes can disrupt routing. Modifying a widely used module can affect dozens of dependent resources.

The review process provides value proportional to risk, and risk is not always obvious from the change size. Enforce the review process for all infrastructure changes.

Not Testing IaC

IaC can be tested, and testing it prevents expensive production failures:

  • Terratest: Go library for writing tests that provision real infrastructure, validate it, then destroy it
  • kitchen-terraform: Test Kitchen integration for Terraform
  • terraform-compliance: BDD-style policy testing for Terraform plans

At minimum, run terraform validate (which checks syntax and internal consistency) and review plans in CI. For critical infrastructure modules, automated integration tests that create real resources provide much stronger guarantees.


Real-World Impact

The practical impact of IaC adoption is measurable and consistently reported across organizations.

Provisioning speed: Infrastructure that previously took days or weeks to provision through manual processes takes minutes with IaC. Twilio reported going from multi-week provisioning cycles to under an hour after IaC adoption.

Incident recovery: When infrastructure fails, IaC enables recovery in minutes rather than hours or days. The failed resources are deleted and recreated from code. Disaster recovery exercises that previously took days of manual effort become routine automated drills.

Environment consistency: "It works in staging but breaks in production" bugs drop dramatically when staging and production are defined by identical code differing only in variable values (instance sizes, domain names).

Compliance and auditing: Every infrastructure change has an author, timestamp, and description in Git. Security audits that previously required interviewing engineers about what they remembered changing become Git log reviews. SOC 2 and ISO 27001 auditors can verify change management processes from Git history.

Onboarding: New engineers can understand the entire infrastructure by reading code rather than exploring the cloud console and asking colleagues. The learning curve for infrastructure operations decreases significantly.

Example: Netflix manages thousands of AWS resources across multiple regions and accounts using IaC, enabling them to deploy new services and environments rapidly without the bottleneck of manual infrastructure provisioning. Their chaos engineering practice---intentionally introducing failures to test resilience---would be impractical without IaC, since recovering from induced failures requires rapid, reliable infrastructure recreation.


Starting with IaC

For teams new to IaC, starting with everything at once is overwhelming and likely to fail. A staged approach:

Phase 1: Start with new infrastructure only. Do not attempt to import existing manually configured resources yet. Use Terraform or CloudFormation for all new resources, leaving the existing manual infrastructure unchanged while you demonstrate the IaC approach.

Phase 2: Define basic modules for common patterns. A web server module, a database module. Reuse them as new infrastructure is added.

Phase 3: Gradually import existing resources into IaC. Terraform's import command and CloudFormation import both allow bringing existing resources under IaC management. Prioritize critical production infrastructure.

Phase 4: Enforce IaC discipline. Implement policies requiring IaC review for all changes. Detect and remediate drift automatically.

The investment in IaC compounds. The first few months feel slower than manual configuration. By the end of the first year, provisioning, reviewing, and changing infrastructure is dramatically faster, more reliable, and less stressful than the manual approach.

