Container Technology Explained

In 2013, Solomon Hykes stood on stage at PyCon and gave a five-minute lightning talk that would reshape how the entire software industry builds, ships, and runs applications. He demonstrated Docker, a tool that made Linux containers accessible to ordinary developers. Within two years, nearly every major technology company had adopted containers. Within five years, container orchestration with Kubernetes had become the default deployment model for cloud-native applications.

But containers did not appear from nothing. The ideas behind them stretch back decades--to 1979's chroot system call, to FreeBSD jails in 2000, to Solaris Zones in 2004, to Linux-VServer, OpenVZ, and finally LXC in 2008. Docker's genius was not invention but synthesis: combining existing Linux kernel primitives into a developer experience that was simple, reproducible, and fast.

Understanding container technology at a deep level means understanding how the Linux kernel isolates processes, how filesystems are layered and shared, how networking is virtualized at the kernel level, and how all of these pieces compose into the container abstraction we use daily. It means understanding why containers are not lightweight virtual machines (a common but misleading analogy), and why the distinction matters for security, performance, and architecture.

This article examines container technology from first principles: the kernel primitives that make isolation possible, the filesystem mechanics that make images efficient, the networking models that connect containers, the orchestration systems that manage them at scale, and the security boundaries that constrain them. Along the way it answers the questions most often asked about containers--how they differ from VMs, how namespaces and cgroups work, how image layers function, what happens at runtime, how networking operates, and why containers matter for deployment--in depth, throughout the discussion that follows.


A Brief History of Containerization

The chroot Era (1979-1999)

The earliest ancestor of container technology is the chroot system call, introduced in Version 7 Unix in 1979 and later added to BSD in 1982. chroot changes the apparent root directory for a running process and its children. A process running inside a chroot "jail" sees a different filesystem root than the actual system root.

# Create an isolated filesystem root
mkdir -p /var/chroot/myapp
cp -a /bin /lib /usr /var/chroot/myapp/

# Run a process with changed root
chroot /var/chroot/myapp /bin/bash

What chroot provided: Filesystem isolation. A process inside a chroot cannot (in theory) access files outside its new root directory.

What chroot lacked: Everything else. Processes inside a chroot still shared the host's process ID space, network stack, user database, and IPC mechanisms. A root user inside a chroot could escape it trivially. chroot was a filesystem trick, not a security boundary.

Despite these limitations, chroot saw widespread use in build environments, FTP servers, and DNS servers (BIND was commonly chrooted). It established the core insight that would drive containerization forward: processes do not need to see the entire system to function correctly.

FreeBSD Jails and Solaris Zones (2000-2004)

FreeBSD Jails, introduced in FreeBSD 4.0 (2000), extended the chroot concept significantly. A jail provided:

  • Filesystem isolation (like chroot, but escape-resistant)
  • Process isolation (processes in a jail could only see other processes in the same jail)
  • Network isolation (each jail could have its own IP address)
  • Restricted superuser (root inside a jail had limited capabilities)

Jails were the first technology that resembled modern containers in a meaningful way. They provided multi-dimensional isolation, not just filesystem separation.

Solaris Zones (2004) took the concept further with two types of zones:

  • Global zone: The host operating system
  • Non-global zones: Isolated environments with their own filesystems, process trees, network interfaces, and user databases

Solaris Zones introduced resource controls--the ability to limit CPU, memory, and other resources per zone--foreshadowing what Linux cgroups would provide years later.

Linux Containers: LXC and the Kernel Primitives (2006-2012)

The Linux kernel gained its containerization primitives incrementally:

  • 2002: Linux namespaces introduced (mount namespaces first)
  • 2006: Process ID namespaces added
  • 2007: Control groups (cgroups) merged into kernel 2.6.24
  • 2008: LXC (Linux Containers) project launched, combining namespaces and cgroups
  • 2009: Network namespaces added
  • 2012: User namespaces added

LXC was the first complete Linux container manager. It used namespaces for isolation and cgroups for resource control, providing a userspace toolset for managing containers. However, LXC was complex to use, required significant manual configuration, and lacked a standardized image format.

Docker and the Container Revolution (2013-Present)

Docker, released in March 2013, was initially built on top of LXC. Its innovations were not in kernel technology but in developer experience:

  1. Dockerfile: A simple text format for describing how to build an image
  2. Layered images: Efficient, shareable, cacheable filesystem layers
  3. Docker Hub: A public registry for sharing images
  4. Simple CLI: docker run, docker build, docker push--intuitive commands
  5. Portable format: Build once, run anywhere (that has Docker)

Docker did not invent containerization. Docker made containerization usable. The difference between a kernel primitive and a developer tool is the difference between having lumber and having a house.

Docker later replaced its LXC dependency with libcontainer (now runc), giving it direct control over namespace and cgroup management. This evolution from LXC wrapper to standalone container runtime was pivotal: Docker now controlled its own runtime semantics end to end.


Linux Kernel Primitives: Namespaces

What Namespaces Are

Linux namespaces are a kernel feature that partitions kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Each namespace type isolates a specific global system resource, making it appear to processes within the namespace that they have their own isolated instance of that resource.

This directly answers the question of how containers differ from virtual machines: a virtual machine runs a complete operating system with its own kernel on virtualized hardware. A container is a regular Linux process (or group of processes) whose view of the system has been restricted using namespaces. There is no second kernel, no hypervisor, no hardware emulation. The container's processes run on the host kernel, but they see an isolated slice of the system.

Feature             | Containers                      | Virtual Machines
--------------------|---------------------------------|--------------------------------------
Isolation mechanism | Kernel namespaces + cgroups     | Hardware virtualization (hypervisor)
Kernel              | Shared with host                | Separate kernel per VM
Startup time        | Milliseconds to seconds         | Seconds to minutes
Memory overhead     | Minimal (shared kernel)         | Significant (full OS per VM)
Disk footprint      | Megabytes (app + dependencies)  | Gigabytes (full OS + app)
Isolation strength  | Process-level (kernel boundary) | Hardware-level (hypervisor boundary)
Performance         | Near-native                     | Near-native (with hardware assist)
Density             | Hundreds per host               | Tens per host
OS diversity        | Linux only (on Linux host)      | Any OS on any host

There are eight namespace types in modern Linux kernels. Each plays a specific role in creating the container illusion.

PID Namespace (Process ID Isolation)

The PID namespace isolates the process ID number space. Processes in different PID namespaces can have the same PID. Each PID namespace has its own PID 1 (the init process).

How it works: When a new PID namespace is created (via clone() with CLONE_NEWPID or unshare()), the first process in that namespace becomes PID 1. This process and its children are visible within the namespace with their namespace-local PIDs.

Host PID namespace:
  PID 1 (systemd)
  PID 1234 (containerd)
  PID 5678 (container's entrypoint, seen as PID 1 inside container)
  PID 5679 (child process, seen as PID 2 inside container)

Container PID namespace:
  PID 1 (container's entrypoint)
  PID 2 (child process)

What this means for containers: A containerized process sees itself as PID 1. It cannot see or signal processes outside its namespace. The ps command inside a container shows only the container's processes. From the host's perspective, the container's processes are visible with their host-level PIDs.

PID 1 significance: In Unix, PID 1 is special. It is the init process, responsible for reaping zombie processes (exited children whose exit status has not yet been collected by a parent). When a container process runs as PID 1 and does not handle SIGCHLD properly, zombie processes can accumulate. This is why many container images use lightweight init systems like tini or dumb-init as their entrypoint.
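
PID isolation is easy to observe with the util-linux unshare tool; a minimal demonstration, run as root:

# Start a shell in a new PID namespace with a freshly mounted /proc
sudo unshare --pid --fork --mount-proc /bin/bash
ps aux    # inside: only bash (as PID 1) and ps itself are visible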

Network Namespace (NET)

The network namespace isolates network resources: network devices, IP addresses, routing tables, firewall rules, the /proc/net directory, port numbers, and Unix domain socket abstract namespaces.

How it works: Each network namespace has its own set of network interfaces. A freshly created network namespace contains only a loopback interface (lo). To communicate with the outside world, virtual ethernet pairs (veth pairs) are typically created: one end placed in the container's network namespace, the other end connected to a bridge or the host's network namespace.

Host network namespace:
  eth0: 192.168.1.100 (physical NIC)
  docker0: 172.17.0.1 (bridge)
  veth123abc: (one end of veth pair)

Container network namespace:
  lo: 127.0.0.1 (loopback)
  eth0: 172.17.0.2 (other end of veth pair, renamed)

What this means for containers: Each container has its own IP address, its own port space (multiple containers can each listen on port 80), its own routing table, and its own firewall rules. Containers communicate with each other and the outside world through virtual network devices managed by the container runtime.
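
This plumbing can be reproduced by hand with iproute2; the sketch below mirrors what a container runtime does, using illustrative names and addresses:

# Create a network namespace and wire it up with a veth pair
sudo ip netns add demo
sudo ip link add veth-host type veth peer name veth-demo
sudo ip link set veth-demo netns demo
sudo ip netns exec demo ip addr add 172.17.0.2/16 dev veth-demo
sudo ip netns exec demo ip link set veth-demo up
sudo ip netns exec demo ip link set lo up
sudo ip link set veth-host up   # attach this end to a bridge to complete the path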

Mount Namespace (MNT)

The mount namespace isolates the set of filesystem mount points. Processes in different mount namespaces see different filesystem hierarchies.

How it works: When a new mount namespace is created, it receives a copy of the parent namespace's mount table. Subsequent mount and unmount operations within the namespace affect only that namespace. The container runtime uses this to give each container its own root filesystem (from the container image) while selectively exposing host directories via bind mounts.

What this means for containers: This is how a container gets its own filesystem. The container runtime mounts the image's layered filesystem as the container's root, mounts /proc, /sys, and /dev appropriately, and optionally bind-mounts host directories or volumes into the container.
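
Mount isolation can likewise be demonstrated with unshare (run as root); mounts created inside the namespace are invisible outside it:

sudo unshare --mount /bin/bash
mount -t tmpfs tmpfs /mnt   # visible only within this mount namespace
# In another terminal, the host's /mnt is unchanged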

UTS Namespace (Hostname)

The UTS namespace (Unix Timesharing System) isolates the hostname and NIS domain name. Each UTS namespace has its own hostname and domain name.

How it works: sethostname() and setdomainname() system calls within a UTS namespace affect only that namespace.

What this means for containers: Each container can have its own hostname (typically set to the container ID or a user-specified name), independent of the host's hostname. This is important for applications that use the hostname for identification, logging, or configuration.

IPC Namespace (Inter-Process Communication)

The IPC namespace isolates System V IPC objects (message queues, semaphore sets, shared memory segments) and POSIX message queues.

How it works: Processes in the same IPC namespace can communicate through these IPC mechanisms. Processes in different IPC namespaces cannot see or interact with each other's IPC objects.

What this means for containers: IPC objects created inside a container are invisible to processes in other containers and on the host. This prevents one container from interfering with another's inter-process communication, which is important for applications that rely heavily on shared memory or message queues (such as PostgreSQL).

User Namespace (USER)

The user namespace isolates user and group IDs. A process can have a different set of privileges inside and outside a user namespace.

How it works: User namespaces allow mapping UIDs and GIDs inside the namespace to different UIDs and GIDs outside. The critical implication: a process can be UID 0 (root) inside its user namespace but map to an unprivileged user (say, UID 100000) on the host.

Inside container: root (UID 0)
Mapping: container UID 0 -> host UID 100000
Host sees: unprivileged user UID 100000

What this means for containers: User namespaces are the foundation of rootless containers--containers that run entirely without host-level root privileges. Even if a process inside the container is running as root and manages to escape the container's other isolation boundaries, it lands on the host as an unprivileged user. This is a significant security improvement over traditional containers, which often require the container runtime to run as root on the host.
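
On hosts with unprivileged user namespaces enabled, the mapping can be observed without any privileges at all:

unshare --user --map-root-user /bin/bash
id                       # uid=0(root) inside the namespace
cat /proc/self/uid_map   # shows the mapping back to your unprivileged host UID
touch /etc/probe         # still denied: this "root" has no host privileges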

Cgroup Namespace

The cgroup namespace virtualizes the view of a process's cgroups. A process inside a cgroup namespace sees its own cgroup root, not the host's cgroup hierarchy.

How it works: When a cgroup namespace is created, the process's current cgroup becomes the root of its view. Reading /proc/self/cgroup inside the namespace returns paths relative to this virtual root.

What this means for containers: Containerized processes see a clean cgroup hierarchy. Without this namespace, a container could inspect /proc/self/cgroup and learn about the host's cgroup structure, potentially leaking information about the host's configuration.

Time Namespace

The time namespace (Linux 5.6+) isolates the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks. Processes in different time namespaces can see different monotonic and boot times.

What this means for containers: Primarily useful for checkpoint/restore scenarios (such as with CRIU), where a container migrated to a new host needs to maintain its view of elapsed time. This is the newest namespace type and is not yet widely used in standard container deployments.


Linux Kernel Primitives: Control Groups (cgroups)

What cgroups Are

Control groups (cgroups) are a Linux kernel feature that organizes processes into hierarchical groups whose resource usage can be limited, prioritized, accounted for, and controlled. While namespaces answer the question "what can a process see?", cgroups answer the question "what can a process use?"

Together, namespaces and cgroups form the two pillars of Linux container technology. Namespaces provide isolation; cgroups provide resource management. Put simply: namespaces partition the view of the system (each container sees its own isolated PID space, network stack, filesystem, hostname, IPC mechanisms, and user database), while cgroups partition the resources of the system (each container is limited to a defined share of CPU, memory, I/O bandwidth, and other resources, preventing one container from starving the others).

cgroup v1 vs. cgroup v2

Linux has two versions of cgroups:

cgroup v1 (original): Each resource controller (CPU, memory, I/O, etc.) has its own independent hierarchy. A process can be in different cgroups for different controllers. This flexibility created complexity--managing multiple hierarchies was cumbersome, and interactions between controllers were poorly defined.

cgroup v2 (unified hierarchy): All controllers are attached to a single hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that cgroup. This simplification improves manageability and makes controller interactions well-defined.

Most modern container runtimes (Docker 20.10+, Podman, containerd) support cgroup v2, and major Linux distributions have switched to cgroup v2 as the default.
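
The cgroup v2 filesystem interface can be driven by hand, which is essentially what container runtimes do on your behalf. A minimal sketch, assuming cgroup v2 is mounted at the standard path, the relevant controllers are enabled at the root, and the commands run as root:

# Create a cgroup, apply limits, and move the current shell into it
mkdir /sys/fs/cgroup/demo
echo "256M" > /sys/fs/cgroup/demo/memory.max        # hard memory cap
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max   # 50% of one CPU (quota period)
echo $$ > /sys/fs/cgroup/demo/cgroup.procs          # everything this shell starts is now limited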

Key Resource Controllers

CPU Controller

The CPU controller manages processor time allocation.

CPU shares (cpu.shares in v1, cpu.weight in v2): Proportional CPU allocation. If container A has weight 100 and container B has weight 200, B gets twice the CPU time of A--but only when both are competing. If A is idle, B can use all available CPU.

CPU quota (cpu.cfs_quota_us / cpu.cfs_period_us in v1, cpu.max in v2): Hard limits on CPU time. Setting a quota of 50ms per 100ms period means the container can use at most 50% of one CPU core, regardless of whether other containers are idle.

CPU pinning (cpuset.cpus): Restricts a container to specific CPU cores. Useful for performance-sensitive workloads that benefit from cache locality.

# Docker example: limit container to 1.5 CPU cores with pinning
docker run --cpus=1.5 --cpuset-cpus=0,1 myimage

Memory Controller

The memory controller limits and tracks memory usage.

Memory limit (memory.limit_in_bytes in v1, memory.max in v2): Hard cap on memory usage. When a container exceeds this limit, the kernel's OOM (Out of Memory) killer terminates processes in the cgroup.

Memory reservation (memory.soft_limit_in_bytes in v1, memory.low in v2): Soft limit. The kernel tries to keep the cgroup's memory usage at or below this level through reclamation, but allows bursting above it when memory is available.

Swap limit (memory.memsw.limit_in_bytes in v1, memory.swap.max in v2): Controls how much swap space the cgroup can use.

OOM behavior: When a cgroup hits its memory limit and cannot reclaim any pages, the OOM killer activates. In the container context, this typically kills the container's main process, effectively crashing the container. This is intentional--a misbehaving container should be killed rather than starving other containers or the host of memory.

# Docker example: limit container to 512MB RAM and 1GB swap
docker run --memory=512m --memory-swap=1g myimage

Block I/O Controller

The block I/O controller (blkio in v1, io in v2) manages disk I/O bandwidth.

I/O weight: Proportional allocation of I/O bandwidth, similar to CPU shares.

I/O limits: Hard caps on bytes per second or I/O operations per second (IOPS) for specific block devices.

# Docker example: limit write bandwidth to 10MB/s
docker run --device-write-bps /dev/sda:10mb myimage

PIDs Controller

The PIDs controller limits the number of processes in a cgroup. This prevents fork bombs--malicious or buggy processes that recursively create child processes until the system runs out of PIDs.

# Docker example: limit container to 100 processes
docker run --pids-limit=100 myimage

The cgroup Hierarchy

cgroups are organized in a tree structure. Resource limits at a parent level constrain all children. This hierarchy maps naturally to container orchestration: the orchestrator creates a cgroup tree where each container occupies a leaf, and the parent cgroup represents the overall allocation for containers on that host.

/sys/fs/cgroup/
  docker/
    container-abc123/
      cpu.max: 100000 100000
      memory.max: 536870912
      pids.max: 100
    container-def456/
      cpu.max: 200000 100000
      memory.max: 1073741824
      pids.max: 200


Container Images: Layered Filesystems

What a Container Image Is

A container image is a read-only template containing the filesystem and metadata needed to run a container. It includes the application code, runtime, libraries, environment variables, and configuration files. An image does not include a kernel--the container will use the host's kernel.

Understanding how container image layers work is essential to using containers effectively. Images are built from layers, where each layer represents a set of filesystem changes (files added, modified, or deleted). Layers are stacked on top of each other to form the complete filesystem, and a union filesystem makes the stack appear as a single coherent filesystem.

The Layer Model

Consider this Dockerfile:

FROM ubuntu:22.04                    # Layer 1: Base OS filesystem
RUN apt-get update && apt-get install -y python3  # Layer 2: Package installation
COPY requirements.txt /app/          # Layer 3: Copy requirements file
RUN pip3 install -r /app/requirements.txt        # Layer 4: Install dependencies
COPY . /app                          # Layer 5: Copy application code
CMD ["python3", "/app/main.py"]      # Metadata (no new layer)

Each instruction that modifies the filesystem creates a new layer:

Layer 5: Application code (/app/*.py, /app/*.json, etc.)
Layer 4: Python packages (/usr/lib/python3/dist-packages/...)
Layer 3: requirements.txt (/app/requirements.txt)
Layer 2: Python runtime + apt packages (/usr/bin/python3, libraries...)
Layer 1: Ubuntu 22.04 base filesystem (/bin, /lib, /usr, /etc, /var...)

Each layer is identified by the SHA-256 hash of its content. This content-addressable storage enables several important properties:

  1. Deduplication: If two images share layers (e.g., both use ubuntu:22.04 as base), the shared layers are stored only once on disk and in registries.

  2. Caching: Docker caches layers during builds. If a Dockerfile instruction has not changed and the parent layer is the same, Docker reuses the cached layer instead of rebuilding.

  3. Efficient distribution: When pushing or pulling images, only layers not already present at the destination are transferred. Pulling a new version of an image that shares 90% of its layers with the previous version only downloads the changed 10%.

  4. Immutability: Layers are read-only. Changing anything requires creating a new layer on top.

Union Filesystems

A union filesystem (also called a union mount) combines multiple directories (layers) into a single, unified view. Several implementations exist:

  • OverlayFS (overlay2): The current default for Docker on Linux. Part of the mainline kernel since Linux 3.18. Uses "upper" and "lower" directories to present a merged view.
  • AUFS (Advanced Multi-Layered Unification Filesystem): Docker's original filesystem driver. Not in mainline kernel; requires patches.
  • Btrfs and ZFS: Copy-on-write filesystems that can serve as storage backends.
  • DeviceMapper: Uses Linux's device mapper thin provisioning for layer management.

OverlayFS mechanics:

OverlayFS works with four directories:

  • lowerdir: Read-only layers (the image layers, stacked)
  • upperdir: Writable layer (the container's changes)
  • workdir: Working directory used internally by OverlayFS
  • merged: The unified view presented to the container

merged/     <- Container sees this (unified view)
  |
  ├── upperdir/   <- Container's writable layer
  └── lowerdir/   <- Image's read-only layers (stacked)

Read operations: When a file is accessed, OverlayFS searches from the top (upperdir) down through the lower layers. The first matching file is returned.

Write operations (copy-on-write): When a container modifies a file from a lower layer, OverlayFS copies the file to the upperdir before modification. The lower layer remains unchanged. Subsequent reads of that file return the modified copy from the upperdir.

Delete operations (whiteout files): When a container deletes a file from a lower layer, OverlayFS creates a whiteout file in the upperdir. This special file marks the original file as deleted without modifying the lower layer. The merged view hides the whiteout mechanism from the container.
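
These mechanics can be reproduced with the kernel's overlay driver directly; a minimal sketch using illustrative paths, run as root:

# Assemble an overlay mount by hand
mkdir -p /tmp/lower /tmp/upper /tmp/work /tmp/merged
echo "from the image layer" > /tmp/lower/file.txt
mount -t overlay overlay \
  -o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work /tmp/merged
echo "modified" > /tmp/merged/file.txt   # copy-up: the new version lands in /tmp/upper
rm /tmp/merged/file.txt                  # whiteout: a deletion marker appears in /tmp/upper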

What Happens When You Run a Container

When you execute docker run myimage, a precise sequence of operations occurs:

  1. Image resolution: Docker resolves the image name to a specific image ID (SHA-256 digest). If the image is not present locally, Docker pulls it from a registry, downloading only the layers not already cached locally.

  2. Layer assembly: Docker stacks the image's read-only layers using the configured storage driver (typically OverlayFS). Each layer is a directory containing the filesystem changes for that layer.

  3. Writable layer creation: Docker creates a thin writable layer on top of the image layers. All filesystem modifications made by the container (new files, modified files, deleted files) go into this writable layer. The image layers remain untouched.

  4. Namespace creation: Docker (via containerd and runc) creates a new set of Linux namespaces for the container:

    • PID namespace (isolated process tree)
    • Network namespace (isolated network stack)
    • Mount namespace (isolated filesystem mounts)
    • UTS namespace (isolated hostname)
    • IPC namespace (isolated IPC resources)
    • Optionally, user namespace (UID/GID mapping)

  5. cgroup configuration: Docker creates a new cgroup for the container and applies resource limits (CPU, memory, I/O, PIDs) as specified in the run command.

  6. Filesystem mounting: Within the mount namespace, Docker:

    • Mounts the union filesystem (image layers + writable layer) as the container's root
    • Mounts /proc (procfs), /sys (sysfs), and /dev (devtmpfs) with appropriate restrictions
    • Mounts any user-specified volumes or bind mounts
    • Sets up /etc/resolv.conf, /etc/hostname, and /etc/hosts

  7. Network configuration: Docker creates a veth pair, places one end in the container's network namespace (as eth0), and connects the other end to the appropriate network (typically the docker0 bridge). It assigns the container an IP address and configures routing.

  8. Security policy application: Docker applies security restrictions:

    • Drops Linux capabilities (containers run with a restricted set of capabilities by default)
    • Applies seccomp profile (syscall filtering)
    • Applies AppArmor or SELinux profile (mandatory access control)

  9. Process execution: Docker starts the container's entrypoint process (specified by CMD or ENTRYPOINT in the Dockerfile, or overridden on the command line) inside the configured namespaces and cgroup.

The entire sequence--from docker run to process execution--typically completes in under a second for a cached image. This is orders of magnitude faster than booting a virtual machine, which must initialize a kernel, run init scripts, and start system services.
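
All of these pieces are observable from the host. Assuming a container started as, say, docker run -d --name web nginx (name and image illustrative), the following commands expose the primitives underneath:

PID=$(docker inspect --format '{{.State.Pid}}' web)
sudo ls -l /proc/$PID/ns   # the container's namespace handles
cat /proc/$PID/cgroup      # the cgroup the container occupies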

Image Building Best Practices

Understanding layers informs how Dockerfiles should be structured:

Layer ordering matters: Place instructions that change infrequently (base image, system packages) before instructions that change frequently (application code). This maximizes cache reuse during builds.

Minimize layer count: Combine related operations in single RUN instructions. Each layer adds overhead (metadata, filesystem operations).

Multi-stage builds: Use one stage for building (compilers, build tools, source code) and copy only the compiled artifacts into a clean final stage. This produces smaller images without build-time dependencies.

# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp cmd/server/main.go

# Runtime stage
FROM alpine:3.18
COPY --from=builder /app/myapp /usr/local/bin/
CMD ["myapp"]

The build stage might produce a 1GB+ image (Go toolchain, source code, intermediate files). The runtime stage produces a ~20MB image (Alpine base + compiled binary).


Docker Architecture

The Component Stack

Docker's architecture has evolved significantly from its monolithic origins into a layered stack of components:

Docker CLI (docker): The command-line client. Translates user commands into API calls to the Docker daemon. The CLI communicates over a Unix socket (/var/run/docker.sock) or TCP.

Docker Daemon (dockerd): The background service that manages Docker objects (images, containers, networks, volumes). The daemon exposes a REST API that the CLI (and other tools) use. It handles image building, network management, volume management, and orchestrates the lower-level components.

containerd: A container runtime that manages the complete container lifecycle: image pull, storage, container execution, supervision, and networking. Docker delegates actual container management to containerd. Notably, containerd is a graduated CNCF project and can be used independently of Docker.

runc: The low-level container runtime that creates and runs containers. runc is the reference implementation of the OCI (Open Container Initiative) Runtime Specification. It directly interfaces with the Linux kernel to create namespaces, configure cgroups, set up the root filesystem, and execute the container process. Once the container process is running, runc exits--the running container is just a regular process managed by containerd.

The execution flow:

docker run myimage
    |
    v
Docker CLI --> Docker Daemon (dockerd)
                    |
                    v
               containerd
                    |
                    v
               containerd-shim
                    |
                    v
                  runc --> creates namespaces, cgroups, mounts
                    |
                    v
            Container process (PID 1 inside container)

containerd-shim: An intermediary process that allows runc to exit after starting the container. The shim keeps the container's stdin/stdout open, reports exit status, and enables runtime upgrades (containerd can be restarted without affecting running containers).

Container Registries

A container registry is a storage and distribution system for container images. Registries implement the OCI Distribution Specification (originally the Docker Registry HTTP API V2).

Docker Hub: The default public registry. Hosts millions of images, including official images maintained by Docker and upstream projects.

Private registries: Organizations run private registries for internal images. Options include:

  • Docker Registry (open source): Simple, self-hosted registry
  • Harbor (CNCF graduated): Enterprise registry with vulnerability scanning, RBAC, replication
  • Amazon ECR, Google Container Registry / Artifact Registry, Azure Container Registry: Cloud provider registries
  • GitHub Container Registry (ghcr.io): Integrated with GitHub

How image distribution works:

  1. Push: Docker splits the image into layers, computes digests, and uploads each layer as a blob. It then uploads a manifest that lists the layers and their digests.

  2. Pull: Docker downloads the manifest, checks which layers are already cached locally, and downloads only the missing layers. Each layer is verified against its digest after download.

  3. Content trust: Docker Content Trust (Notary) enables cryptographic signing of images, ensuring that pulled images have not been tampered with.
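
Tools such as skopeo make this flow visible; for instance, an image's manifest can be fetched without pulling any layers (image reference illustrative):

skopeo inspect --raw docker://docker.io/library/alpine:3.18
# Or with Docker itself:
docker manifest inspect alpine:3.18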


Container Networking

The Networking Model

How does container networking work? It is built on Linux network namespaces: each container gets its own isolated network stack, and virtual network devices plus software-defined networking connect containers to each other and to the outside world.

Docker provides several networking drivers, each suited to different use cases.

Bridge Networking (Default)

Bridge networking is Docker's default network mode. Docker creates a virtual bridge device (docker0) on the host and connects each container to it via a veth pair.

Container A (172.17.0.2)         Container B (172.17.0.3)
     |                                |
  veth-a                           veth-b
     |                                |
     +---------- docker0 ------------+
                  (bridge)
              172.17.0.1
                  |
                 eth0
             (host NIC)

How it works:

  1. Docker creates a Linux bridge named docker0 with an IP address (e.g., 172.17.0.1/16)
  2. For each container, Docker creates a veth pair: one end goes in the container's network namespace (as eth0), the other connects to the docker0 bridge
  3. Docker assigns the container an IP address from the bridge's subnet
  4. Containers on the same bridge can communicate directly via their IP addresses
  5. Outbound traffic from containers is NATed through the host's IP address using iptables rules
  6. Port mapping (-p 8080:80) adds iptables DNAT rules to forward host port traffic to the container

User-defined bridge networks: Docker allows creating custom bridge networks (docker network create mynet). These provide:

  • Automatic DNS resolution (containers can reach each other by name)
  • Better isolation (containers on different networks cannot communicate by default)
  • Configurable subnets, gateways, and IP ranges
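
A short demonstration of name-based resolution on a user-defined bridge (names and images illustrative):

docker network create mynet
docker run -d --name web --network mynet nginx
docker run --rm --network mynet alpine ping -c 1 web   # "web" resolves via Docker's embedded DNS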

Host Networking

Host networking removes network isolation entirely. The container shares the host's network namespace.

docker run --network host myimage

Characteristics:

  • Container uses host's IP address and port space directly
  • No NAT overhead
  • Container can bind to any port (conflicts with host services possible)
  • Best network performance (no virtual device overhead)
  • No network isolation

Use case: Performance-sensitive applications where the overhead of bridge networking is unacceptable, or applications that need to interact with the host's network stack directly (e.g., network monitoring tools).

Overlay Networking

Overlay networking creates a distributed network across multiple Docker hosts, enabling containers on different physical machines to communicate as if they were on the same local network.

How it works: Overlay networks use VXLAN (Virtual Extensible LAN) encapsulation. Container traffic destined for a container on another host is encapsulated in a VXLAN header and sent over the physical network. The destination host decapsulates the traffic and delivers it to the target container.

Host A                              Host B
Container 1 (10.0.1.2)            Container 2 (10.0.1.3)
     |                                  |
  overlay network (10.0.1.0/24, VXLAN)
     |                                  |
  eth0 (192.168.1.10) --- network --- eth0 (192.168.1.11)

Use cases: Container orchestration (Docker Swarm, Kubernetes) where services span multiple hosts. Overlay networks abstract the physical topology, allowing containers to communicate regardless of which host they run on.

Macvlan Networking

Macvlan networking assigns a MAC address to each container, making it appear as a physical device on the network. Containers get IP addresses from the physical network's DHCP server or static allocation.

How it works: The macvlan driver creates virtual network interfaces with unique MAC addresses, all attached to a parent physical interface. Each container appears as a distinct host on the physical network segment.

Use cases: Legacy applications that expect to be on a physical network, applications that need to be directly addressable by external systems without NAT, or scenarios where bridge networking's NAT is undesirable.

Network Comparison Table

Network Mode | Isolation       | Performance                | Multi-Host       | Use Case
-------------|-----------------|----------------------------|------------------|---------------------------------
Bridge       | Container-level | Good (slight NAT overhead) | No               | Default; most applications
Host         | None            | Best (native)              | N/A              | Performance-critical workloads
Overlay      | Container-level | Good (VXLAN overhead)      | Yes              | Orchestrated multi-host clusters
Macvlan      | Container-level | Good (no NAT)              | Physical network | Direct network presence needed
None         | Complete        | N/A                        | No               | Containers needing no network

DNS and Service Discovery

Docker's built-in DNS server (on user-defined networks) resolves container names to IP addresses. When container A on a user-defined bridge network tries to connect to container-b:5432, Docker's DNS resolver returns container B's IP address.

In orchestrated environments like Kubernetes, service discovery becomes more sophisticated, using DNS-based service names that resolve to virtual IPs backed by load-balanced sets of container IPs.


Container Storage

The Storage Problem

Containers are ephemeral by default. When a container is removed, its writable layer is deleted, and all data written to the container's filesystem is lost. This is by design--containers should be disposable and reproducible. But applications often need persistent data. Docker provides three mechanisms for persistent and shared storage.

Volumes

Volumes are Docker's preferred mechanism for persisting data. They are managed by Docker and stored in a Docker-controlled area of the host filesystem (/var/lib/docker/volumes/ on Linux).

# Create and use a named volume
docker volume create mydata
docker run -v mydata:/app/data myimage

# Anonymous volume (Docker generates name)
docker run -v /app/data myimage

Properties:

  • Managed by Docker (create, inspect, remove via CLI)
  • Persist beyond container lifecycle
  • Can be shared between containers
  • Support volume drivers for remote/cloud storage (NFS, AWS EBS, etc.)
  • Better performance than bind mounts on Docker Desktop (macOS/Windows)
  • Content is initialized from image if volume is empty

Bind Mounts

Bind mounts map a specific host path into the container. They predate volumes and provide direct access to host filesystem locations.

# Bind mount a host directory into the container
docker run -v /host/path/project:/app myimage

# Read-only bind mount
docker run -v /host/path/config:/etc/app/config:ro myimage

Properties:

  • Direct mapping of host path to container path
  • Host path must exist (not created automatically)
  • Changes visible immediately on both host and container
  • No Docker management (not listed by docker volume ls)
  • Commonly used in development (mount source code for live reloading)
  • Potential security risk (container can modify host files)

tmpfs Mounts

tmpfs mounts store data in the host's memory (RAM). Data is never written to the host filesystem and is lost when the container stops.

docker run --tmpfs /app/cache myimage

Properties:

  • In-memory storage only
  • Fastest I/O performance
  • Not persistent (lost on container stop)
  • Not shared between containers
  • Useful for sensitive data (secrets, temporary credentials) that should not be written to disk


Container Orchestration: Kubernetes

Why Orchestration Is Needed

Running a single container on a single host is straightforward. Running hundreds or thousands of containers across dozens of hosts requires answering questions that Docker alone does not address:

  • Scheduling: Which host should run this container? (Consider resource availability, constraints, affinity)
  • Scaling: How do we automatically add or remove container instances based on load?
  • Service discovery: How do containers find each other as instances come and go?
  • Load balancing: How do we distribute traffic across container instances?
  • Health monitoring: How do we detect and replace failed containers?
  • Rolling updates: How do we deploy new versions without downtime?
  • Configuration management: How do we manage configuration and secrets across environments?
  • Storage orchestration: How do we provision and attach persistent storage dynamically?

This is why containers are useful for deployment: containers package applications with all their dependencies into a standardized unit that runs identically across environments. Combined with orchestration, they enable treating infrastructure as code--declaring the desired state of an application (number of replicas, resource requirements, health checks, networking rules) and letting the orchestrator make it so. This eliminates the "works on my machine" problem, enables rapid scaling, facilitates microservices architecture, and makes infrastructure reproducible and auditable.

Kubernetes Architecture

Kubernetes (often abbreviated K8s) is the dominant container orchestration platform. Originally developed by Google (based on their internal Borg system), it is now a CNCF graduated project with broad industry adoption.

Control plane components:

  • kube-apiserver: The API server is the front end of the Kubernetes control plane. All interactions (CLI, UI, internal components) go through the API server. It validates and processes REST requests, updating the cluster's desired state in etcd.

  • etcd: A distributed, consistent key-value store that holds all cluster state. Every pod, service, configuration, and secret is stored in etcd. It is the single source of truth for the cluster.

  • kube-scheduler: Watches for newly created pods with no assigned node and selects a node for them based on resource requirements, constraints, affinity/anti-affinity rules, and other policies.

  • kube-controller-manager: Runs controller loops that watch cluster state and make changes to move the current state toward the desired state. Controllers include: ReplicaSet controller (ensures correct number of pod replicas), Deployment controller (manages rollouts), Node controller (responds to node failures), and others.

Node components (on every worker node):

  • kubelet: An agent that ensures containers are running in pods as specified. It receives pod specifications from the API server and instructs the container runtime to start or stop containers accordingly.

  • kube-proxy: Maintains network rules for pod-to-service communication. Implements Kubernetes Services (virtual IPs) using iptables or IPVS rules.

  • Container runtime: The software that actually runs containers. Kubernetes supports any runtime that implements the Container Runtime Interface (CRI): containerd, CRI-O, or others. Docker itself is no longer directly supported as a Kubernetes runtime (since v1.24), though Docker-built images work fine--Kubernetes simply uses containerd directly.

Key Kubernetes Objects

Pod: The smallest deployable unit. A pod contains one or more containers that share network namespace (same IP, can communicate via localhost), storage volumes, and lifecycle. Most pods contain a single application container, but sidecars (logging agents, service meshes, proxies) are common patterns.

Deployment: Declares the desired state for a set of pods--which image, how many replicas, resource limits, update strategy. The Deployment controller ensures the actual state matches the desired state.

Service: An abstraction that defines a logical set of pods and a policy for accessing them. A Service gets a stable virtual IP (ClusterIP) and DNS name, even as the underlying pods change. Types include ClusterIP (internal), NodePort (exposes on node ports), and LoadBalancer (provisions external load balancer).

Namespace: A mechanism for dividing cluster resources among multiple teams or projects. Not to be confused with Linux namespaces--Kubernetes namespaces are a higher-level organizational concept.

ConfigMap and Secret: Objects for managing configuration data and sensitive information, respectively, decoupled from container images.
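
For a flavor of the workflow, the imperative kubectl commands below create a Deployment and expose it as a Service (names illustrative; production use typically declares the same objects in YAML manifests):

kubectl create deployment web --image=nginx --replicas=3   # Deployment managing 3 pod replicas
kubectl expose deployment web --port=80                    # ClusterIP Service: stable VIP + DNS name
kubectl get pods -o wide                                   # see where the scheduler placed the pods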


Security Considerations

The Container Security Model

Containers share the host kernel. This is their fundamental security constraint. A kernel vulnerability exploitable from within a container potentially compromises the host and all other containers on it. This is the primary reason why containers provide weaker isolation than virtual machines, and why defense-in-depth is essential.

Linux Capabilities

Traditional Unix has a binary privilege model: either you are root (UID 0) with all privileges, or you are not. Linux capabilities break root's privileges into distinct units that can be independently granted or revoked.

Docker drops most capabilities by default, retaining only those needed for typical containerized applications:

Kept by default: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE

Dropped by default: SYS_ADMIN, NET_ADMIN, SYS_PTRACE, SYS_MODULE, SYS_RAWIO, SYS_TIME, and many others

# Drop all capabilities except what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage

# Add specific capability (dangerous--know what you're enabling)
docker run --cap-add=SYS_PTRACE myimage

The --privileged flag grants all capabilities and disables most security restrictions. Never use --privileged in production unless absolutely necessary--it effectively removes the container security boundary.

Seccomp (Secure Computing Mode)

Seccomp filters the system calls a process can make. Docker applies a default seccomp profile that blocks approximately 44 of the 300+ Linux system calls, including dangerous calls like reboot(), mount(), kexec_load(), ptrace() (on other processes), and clock_settime().

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", ...],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Custom seccomp profiles can be applied per container for defense-in-depth:

docker run --security-opt seccomp=/path/to/profile.json myimage

AppArmor and SELinux

AppArmor (used on Ubuntu, Debian, SUSE) and SELinux (used on Red Hat, CentOS, Fedora) are mandatory access control (MAC) systems that confine programs to a limited set of resources.

Docker generates a default AppArmor profile that:

  • Prevents writing to /proc and /sys (except allowed paths)
  • Prevents mounting filesystems
  • Prevents accessing raw device files
  • Prevents modifying the AppArmor profile itself

SELinux labels containers with specific types (container_t) that restrict file access, network access, and inter-process communication based on policy rules.

Rootless Containers

Rootless containers run the entire container stack (runtime, images, networking) without root privileges on the host. This eliminates the attack surface of a root-owned container daemon.

How rootless containers work:

  1. User namespaces map container root to an unprivileged host user
  2. Network is set up using slirp4netns or pasta (user-space network stack) instead of bridge networking (which requires root for iptables manipulation)
  3. Storage uses fuse-overlayfs (user-space OverlayFS implementation) or native overlay with user namespace support
  4. cgroups are managed via systemd user sessions (cgroup v2) or delegation

Rootless mode is available in Docker (since 19.03), Podman (rootless by default), and containerd.
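
With Podman this is the default experience; a quick check of the mapping at work, with no sudo involved:

podman run --rm alpine id               # root inside the container...
podman unshare cat /proc/self/uid_map   # ...mapped to unprivileged host UID ranges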

Running containers as root on the host is the most common container security mistake in production. Rootless containers, user namespace remapping, and dropping capabilities should be standard practice--not optional hardening.

Image Security

Image scanning: Tools like Trivy, Grype, and Snyk scan container images for known vulnerabilities in OS packages and application dependencies.

Minimal base images: Using minimal images like Alpine Linux, distroless images (from Google), or scratch (empty) base images reduces the attack surface by removing unnecessary packages, shells, and utilities.

Image signing: Docker Content Trust and Sigstore/cosign enable cryptographic verification of image integrity and provenance.


OCI Standards

The Open Container Initiative

The Open Container Initiative (OCI), established in 2015 under the Linux Foundation, defines open industry standards for container formats and runtimes. The OCI was created to prevent vendor lock-in and ensure interoperability between container tools.

OCI Runtime Specification

The Runtime Specification defines how to run a "filesystem bundle." It specifies:

  • Configuration (config.json): The container's root filesystem path, process to run (args, environment, working directory), mount points, Linux namespaces to create, cgroup settings, capabilities, seccomp profile, and other platform-specific settings.

  • Lifecycle: The states a container passes through (creating, created, running, stopped) and the operations that transition between them (create, start, kill, delete).

runc is the reference implementation. Other OCI-compliant runtimes include:

  • crun: Written in C (faster startup than runc, which is written in Go)
  • youki: Written in Rust
  • gVisor (runsc): Google's container runtime that intercepts system calls through a user-space kernel, providing stronger isolation
  • Kata Containers: Runs each container in a lightweight VM, combining container UX with VM isolation
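
To make the "filesystem bundle" concept concrete, one can be assembled and run with runc directly; a sketch assuming runc is installed and root privileges are available:

# Build a rootfs from an existing image, then generate a default config.json
mkdir -p bundle/rootfs
docker export $(docker create alpine) | tar -C bundle/rootfs -xf -
cd bundle && runc spec   # writes config.json per the Runtime Specification
sudo runc run demo       # creates namespaces/cgroups and starts the process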

OCI Image Specification

The Image Specification defines the format for container images:

  • Image manifest: Lists the layers (as content-addressable blobs) and the image configuration
  • Image index (manifest list): Points to platform-specific manifests (enabling multi-architecture images--same image tag works on amd64, arm64, etc.)
  • Image configuration: Metadata including environment variables, entrypoint, working directory, exposed ports, labels
  • Filesystem layers: tar archives of filesystem changes, compressed (typically with gzip or zstd)

OCI Distribution Specification

The Distribution Specification defines an API for distributing container images through registries. This standardizes how images are pushed to and pulled from registries, ensuring interoperability between different registry implementations.

These three specifications together ensure that images built by any OCI-compliant tool can be stored in any OCI-compliant registry and run by any OCI-compliant runtime. You can build with Docker, store in Harbor, and run with Podman--or any other combination.


Container Lifecycle Management

The Full Lifecycle

Understanding the container lifecycle is essential for managing containerized applications. A container passes through several states:

1. Created: The container has been created (namespaces configured, filesystem mounted, cgroups set up) but its process has not started. This state exists between docker create and docker start.

docker create --name mycontainer myimage   # Created state

2. Running: The container's process is executing. This is the state after docker start or docker run (which combines create and start).

docker start mycontainer   # Running state
# or
docker run myimage         # Create + Start in one command

3. Paused: The container's processes are suspended using the cgroup freezer. They consume no CPU but retain their memory state. Useful for temporarily suspending a container without stopping it.

docker pause mycontainer   # Paused state
docker unpause mycontainer # Back to Running

4. Stopped (Exited): The container's main process has exited (either normally or due to a signal). The container's writable layer still exists on disk, preserving any filesystem changes. The container can be restarted or removed.

docker stop mycontainer    # Sends SIGTERM, then SIGKILL after timeout
docker kill mycontainer    # Sends SIGKILL immediately

5. Removed: The container and its writable layer are deleted. All data in the writable layer is lost. Volumes remain unless explicitly removed.

docker rm mycontainer      # Remove stopped container
docker rm -f mycontainer   # Force remove (stops first if running)

Restart Policies

Docker supports automatic container restart on failure or system reboot:

  • --restart=no (default): Do not restart
  • --restart=on-failure[:max-retries]: Restart only if container exits with non-zero status
  • --restart=always: Always restart (even after daemon restart)
  • --restart=unless-stopped: Like always, but respects manual stops

Health Checks

Dockerfiles can define health checks that monitor container health:

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Docker periodically runs the health check command and marks the container as healthy, unhealthy, or starting. Orchestrators use health status to make routing and replacement decisions.


Practical Patterns and Considerations

Container Design Principles

One process per container (generally): Each container should run a single concern. A web application and its database should be separate containers, not a single container running both. This enables independent scaling, updating, and monitoring.

Immutable infrastructure: Never modify a running container's filesystem for configuration changes. Instead, build a new image with the changes and deploy new containers. This ensures reproducibility and eliminates configuration drift.

Stateless containers: Design containers to be disposable. Store state in external services (databases, object stores, caches) so that containers can be freely created, destroyed, and replaced.

12-factor app principles: Use environment variables for configuration, write logs to stdout/stderr (not files), treat backing services as attached resources, and design for horizontal scaling.

Common Anti-Patterns

Running as root unnecessarily: Many container images run as root by default. Always specify a non-root user:

RUN useradd -r -u 1001 appuser
USER appuser

Large images: Images containing build tools, debugging utilities, and unnecessary packages waste storage, bandwidth, and increase attack surface. Use multi-stage builds and minimal base images.

Storing secrets in images: Environment variables baked into Dockerfiles or image layers are visible to anyone with access to the image. Use runtime secret injection (Docker secrets, Kubernetes secrets, vault integrations).

Ignoring resource limits: Containers without CPU and memory limits can consume all host resources, affecting other containers and the host itself. Always set resource limits in production.

Monitoring and Observability

Container metrics: CPU usage, memory usage, network I/O, and disk I/O are available through cgroups accounting and exposed via tools like cAdvisor, Prometheus (with node-exporter), and Docker stats.

Logging: Containers typically write logs to stdout/stderr, which Docker captures through logging drivers. These logs can be routed to centralized logging systems (ELK stack, Loki, Splunk) for aggregation and analysis.

Tracing: In microservices architectures with many containers, distributed tracing (OpenTelemetry, Jaeger, Zipkin) tracks requests across service boundaries, identifying latency bottlenecks and failure points.


The Broader Ecosystem

Alternative Container Runtimes

Docker is not the only container tool. The ecosystem has diversified significantly:

Podman: A daemonless container engine that is compatible with Docker CLI commands. Podman runs rootless by default, does not require a background daemon, and can generate systemd unit files for container management. Many Docker commands work identically with Podman (podman run, podman build, podman push).

Buildah: A tool specifically for building OCI-compliant container images. Unlike Docker, Buildah does not require a daemon and can build images without a Dockerfile (using shell scripts).

Skopeo: A tool for inspecting and copying container images between registries without pulling them to local storage.

containerd: While often used as Docker's runtime, containerd is also used independently as the container runtime for Kubernetes (via CRI plugin). The ctr and nerdctl CLIs provide direct interaction with containerd.

CRI-O: A lightweight container runtime specifically designed for Kubernetes. It implements the Kubernetes CRI (Container Runtime Interface) and supports OCI-compliant images and runtimes, without the additional features of Docker or containerd that Kubernetes does not need.

WebAssembly Containers

An emerging trend is using WebAssembly (Wasm) as a container format. Wasm provides a sandboxed execution environment with near-native performance. Projects like WasmEdge and Spin enable running Wasm modules alongside traditional containers in Kubernetes, offering faster startup times (microseconds vs. milliseconds) and smaller footprints.

As Solomon Hykes (Docker's creator) noted: "If WASM+WASI existed in 2008, we wouldn't have needed to create Docker." While this overstates the case (Wasm does not yet replace all container use cases), it highlights the direction of lightweight, portable application packaging.

Service Mesh

In complex microservices deployments, a service mesh (Istio, Linkerd, Cilium) provides infrastructure-level networking features:

  • Mutual TLS: Automatic encryption between services
  • Traffic management: Canary deployments, blue-green deployments, circuit breaking
  • Observability: Metrics, traces, and logs for inter-service communication
  • Policy enforcement: Authorization rules for service-to-service communication

Service meshes typically deploy as sidecar containers (or eBPF programs) alongside application containers, intercepting and managing network traffic transparently.


Performance Characteristics

Container Overhead

Containers add minimal overhead compared to bare-metal processes:

CPU: Near-zero overhead. Container processes run directly on the host CPU. The cgroup accounting adds negligible cost. Context switching between containers is the same as between regular processes (they are regular processes).

Memory: Near-zero overhead for the container mechanism itself. The kernel's namespace and cgroup data structures consume kilobytes per container. The real memory cost is the application and its dependencies.

I/O: OverlayFS adds minimal latency for read operations (directory lookups may traverse multiple layers). Write operations incur copy-on-write cost the first time a lower-layer file is modified. Volume mounts bypass the overlay filesystem entirely, providing native I/O performance.

Networking: Bridge networking adds latency from NAT and virtual device processing (typically microseconds, not milliseconds). Host networking eliminates this overhead. For most applications, the networking overhead is negligible.

Startup time: Container startup is dominated by application initialization, not container setup. Creating namespaces and cgroups takes milliseconds. Pulling image layers (if not cached) is the most significant delay.

When Containers Add Meaningful Overhead

  • I/O-intensive workloads on overlay filesystems: Applications that perform heavy random writes to files within the container filesystem (not volumes) may see performance degradation from copy-on-write. Solution: use volumes for I/O-intensive data.

  • Network-intensive workloads on bridge networks: High-throughput, low-latency networking applications may notice bridge networking overhead. Solution: use host networking or macvlan.

  • Memory-mapped files across layers: Applications using mmap on large files stored in overlay layers may experience higher page fault rates. Solution: use volumes or copy files to a tmpfs.


References and Further Reading

  1. Merkel, D. (2014). "Docker: Lightweight Linux Containers for Consistent Development and Deployment." Linux Journal, 2014(239). Available: https://www.linuxjournal.com/content/docker-lightweight-linux-containers-consistent-development-and-deployment

  2. Bernstein, D. (2014). "Containers and Cloud: From LXC to Docker to Kubernetes." IEEE Cloud Computing, 1(3), 81-84. DOI: 10.1109/MCC.2014.51

  3. Kerrisk, M. (2013). "Namespaces in operation" (series). LWN.net. Available: https://lwn.net/Articles/531114/

  4. Rosen, R. (2014). "Resource management: Linux kernel Namespaces and cgroups." Haifux Lecture. Available: http://www.haifux.org/lectures/299/netLec7.pdf

  5. Open Container Initiative. "OCI Runtime Specification." Available: https://github.com/opencontainers/runtime-spec

  6. Open Container Initiative. "OCI Image Specification." Available: https://github.com/opencontainers/image-spec

  7. Burns, B., Beda, J., Hightower, K., & Evenson, L. (2022). Kubernetes: Up and Running (3rd ed.). O'Reilly Media. Available: https://www.oreilly.com/library/view/kubernetes-up-and/9781098110192/

  8. Docker Documentation. "Docker overview." Available: https://docs.docker.com/get-started/overview/

  9. Walsh, D. (2019). "Understanding root inside and outside a container." Red Hat Developer Blog. Available: https://developers.redhat.com/blog/2019/04/18/understanding-root-inside-and-outside-a-container

  10. Rice, L. (2020). Container Security: Fundamental Technology Concepts that Protect Containerized Applications. O'Reilly Media. Available: https://www.oreilly.com/library/view/container-security/9781492056690/

  11. Sultan, S., Ahmad, I., & Dimitriou, T. (2019). "Container Security: Issues, Challenges, and the Road Ahead." IEEE Access, 7, 52976-52996. DOI: 10.1109/ACCESS.2019.2911732

  12. Kubernetes Documentation. "Kubernetes Components." Available: https://kubernetes.io/docs/concepts/overview/components/

