Cloud Computing Explained: What It Is and How It Actually Works
In 2003, Andy Jassy stood before a room of skeptical executives at Amazon and proposed renting out the computing infrastructure the company had built for its own e-commerce operations. The pitch seemed strange: Amazon was a bookstore turned retailer, not a technology services company. Why would other businesses rent servers from the same company that competed with them for customers?
Jassy's answer was that, in solving its own scaling problems, Amazon had accidentally built world-class infrastructure expertise, and that infrastructure sat underutilized outside peak shopping seasons. The service launched in 2006 as Amazon Web Services. By 2023, AWS generated over $90 billion in annual revenue and more operating profit than the rest of Amazon combined, making it the most profitable division of one of the world's most valuable companies.
The shift from owning computers to renting computing power has been one of the most consequential transformations in the history of business technology. Netflix, Airbnb, Spotify, and NASA all run on rented computing infrastructure. Startups that in 2000 would have spent months and hundreds of thousands of dollars building server infrastructure now provision it in minutes for cents per hour. Understanding cloud computing is no longer optional for anyone making decisions about software, infrastructure, or digital products.
What Cloud Computing Actually Means
Cloud computing means using computing resources---servers, storage, databases, networking, software---over the internet, on demand, instead of owning and maintaining physical hardware. Rather than purchasing servers and installing them in your office or data center, you rent capacity from providers who maintain massive, globally distributed infrastructure.
The analogy to utilities clarifies the transformation. You do not build a power plant to light your office. You connect to the electrical grid and pay for what you consume. Cloud computing applies the same principle to computing: connect to a provider's infrastructure, use what you need, and pay based on consumption. Like electricity, the underlying infrastructure is extraordinarily complex, but the consumer experience is deliberately simple.
The Defining Characteristics
The National Institute of Standards and Technology (NIST) published a definition of cloud computing in 2011 that the industry has largely adopted. Five essential characteristics distinguish cloud computing from traditional hosting or data center services:
On-demand self-service means you can provision computing resources instantly through a web interface or API, without any human interaction with the provider's staff. Need a new server? Click a button. Need 50 servers? Click 50 times, or run a script. This contrasts with traditional infrastructure procurement, where provisioning physical hardware required purchase orders, shipping, installation, and configuration---a process measured in weeks or months.
Broad network access means capabilities are available over the network and accessible through standard mechanisms (HTTPS, standard APIs) from any device---laptops, phones, tablets, other servers.
Resource pooling means the provider's computing resources serve multiple customers simultaneously, with different physical and virtual resources dynamically assigned and reassigned based on demand. Individual customers do not know or care which physical machine is running their workload; they see only the logical resource they provisioned.
Rapid elasticity means resources can be provisioned and released rapidly---in some cases automatically---to scale with demand. From the customer's perspective, the available capacity appears unlimited and can be appropriated in any quantity at any time. A system that handles 1,000 requests per minute can, in principle, scale to handle 1,000,000 requests per minute within minutes.
Measured service means resource usage is monitored, controlled, and reported, providing transparency for both the provider and customer. You pay for what you use, measured at appropriate granularity---per hour, per second, per request, or per gigabyte.
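The measured-service idea can be sketched as a tiny cost calculator: record usage per resource, multiply by a per-unit rate, sum. The rates below are illustrative placeholders, not any provider's actual prices.

```python
# Sketch of metered billing: usage is recorded per resource type, then
# priced at the granularity the provider bills at. Rates are illustrative.

ILLUSTRATIVE_RATES = {
    "vm_hours": 0.045,           # $ per instance-hour
    "storage_gb_months": 0.023,  # $ per GB stored per month
    "requests": 0.0000002,       # $ per API request
}

def monthly_bill(usage: dict) -> float:
    """Sum metered usage against per-unit rates."""
    return sum(ILLUSTRATIVE_RATES[kind] * qty for kind, qty in usage.items())

# One small VM running all month, 100 GB of storage, 5 million requests:
bill = monthly_bill({"vm_hours": 720, "storage_gb_months": 100, "requests": 5_000_000})
print(f"${bill:.2f}")
```

The point is the granularity: nothing is billed for capacity you did not consume, which is what distinguishes measured service from a fixed monthly hosting fee.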
The Service Models: IaaS, PaaS, SaaS
Cloud services are categorized by how much the provider manages versus how much the customer manages. The three main models represent a spectrum from more control (and more responsibility) to less control (and less responsibility).
Infrastructure as a Service (IaaS)
The provider gives you virtual machines, storage, and networking. You manage everything above that: operating systems, runtime environments, applications, and data. This is like renting an empty warehouse---the building is maintained, but you equip and operate it.
What the provider manages: Physical hardware, hypervisors, networking fabric, power, cooling, physical security.
What you manage: Operating system, installed software, runtime environments, application code, data, network configuration, security groups.
Examples: AWS EC2 (Elastic Compute Cloud), Azure Virtual Machines, Google Compute Engine.
Use when: You need maximum control, you are running specialized software with specific OS requirements, or you are lifting and shifting existing applications to the cloud without redesigning them.
Example: When NASA's Jet Propulsion Laboratory processed data from the Mars Curiosity rover landing in 2012, they used AWS EC2 to burst to hundreds of instances for the computationally intensive analysis periods, then release the capacity. The flexibility to scale to hundreds of servers overnight would have been impossible with owned infrastructure.
Platform as a Service (PaaS)
The provider additionally manages runtime environments, middleware, and often databases. You deploy your application code, and the platform handles the infrastructure underneath. This is like renting a furnished office---you bring your people and start working, without worrying about the building management.
What the provider manages: Hardware, operating systems, runtime environments, middleware, often databases, security patching, scaling.
What you manage: Application code, data, some configuration.
Examples: Heroku, Google App Engine, AWS Elastic Beanstalk, Fly.io.
Use when: You want to focus entirely on application code without managing infrastructure, and your application fits the platform's supported configurations. PaaS is particularly well-suited for web applications and APIs.
Example: Heroku's PaaS offering enabled the early Twitch (then Justin.tv) team to focus on building live streaming features rather than managing servers. The platform handled auto-scaling during traffic spikes around popular streams, allowing a small engineering team to maintain a service with rapidly growing user numbers.
Software as a Service (SaaS)
The provider delivers complete, ready-to-use applications accessed through a browser or API. You use the software; the provider handles everything else: infrastructure, platform, application code, maintenance, updates.
What the provider manages: Everything.
What you manage: Your data, user configuration, and integrations with other systems.
Examples: Gmail, Salesforce, Slack, Microsoft 365, Zoom, Shopify.
Use when: A standard application meets your needs and you do not require customization beyond what the platform offers. SaaS typically carries far less operational overhead than running equivalent software yourself.
| Model | You Manage | Provider Manages | Analogy |
|---|---|---|---|
| IaaS | OS, runtime, app, data | Hardware, networking | Empty warehouse |
| PaaS | App code, data | OS, runtime, hardware | Furnished office |
| SaaS | Configuration only | Everything | Hotel room |
The Major Cloud Providers
Three providers dominate global cloud infrastructure, though the landscape includes dozens of regional and specialized providers.
Amazon Web Services (AWS)
AWS launched in 2006 and maintained a significant head start over competitors. Today it is the world's largest cloud provider by revenue and market share (consistently 30-33% of the market as of 2024). AWS has the broadest service catalog, with over 200 services covering computing, storage, databases, analytics, machine learning, IoT, and more. Its global infrastructure spans 33 geographic regions and 105 availability zones.
Strengths: Broadest service selection, largest community and ecosystem, most third-party integrations, best documentation, largest talent pool. The default choice for many technology companies.
Weaknesses: Pricing complexity, occasionally confusing service naming, and sometimes slower to adopt industry-standard tools compared to competitors.
Example: Netflix runs almost entirely on AWS, having completed its migration from owned data centers in 2016. Netflix uses AWS in multiple regions simultaneously, actively routing traffic away from any region experiencing problems. Their architecture deliberately assumes infrastructure failures and routes around them.
Microsoft Azure
Azure launched in 2010 and has grown to be the second-largest provider, with particular strength in enterprise markets. Azure's tight integration with Microsoft's existing enterprise products---Active Directory, Office 365, SQL Server, Windows Server---makes it the natural choice for organizations already invested in Microsoft technology.
Strengths: Best hybrid cloud capabilities (connecting on-premises infrastructure with cloud), strong enterprise support and compliance offerings, seamless integration with Microsoft products, strong government cloud offerings.
Weaknesses: User experience for non-Microsoft workloads can be less polished, some services lag behind AWS and GCP in features.
Example: LinkedIn (acquired by Microsoft in 2016) has moved significant workloads to Azure, using its large-scale data processing capabilities to power features like People You May Know, job recommendations, and content relevance ranking. Azure's integration with Office 365 also makes it a natural backend for enterprise-facing products like LinkedIn's Sales Navigator.
Google Cloud Platform (GCP)
Google Cloud launched in 2012, later than competitors, but brought distinctive advantages: Google's network infrastructure (one of the world's largest private networks), leadership in Kubernetes (which Google invented and open-sourced in 2014), and advanced machine learning capabilities.
Strengths: Network performance, Kubernetes and container tooling (Google invented Kubernetes), BigQuery for large-scale analytics, TensorFlow and Vertex AI for machine learning, competitive pricing with sustained use discounts applied automatically.
Weaknesses: Smaller service catalog than AWS, smaller ecosystem, and historical uncertainty about Google's long-term commitment to enterprise products (Google has deprecated services before).
Example: Spotify processes billions of events daily through Google Cloud's BigQuery and Dataflow services, using the data to generate personalized recommendations, power Discover Weekly playlists, and analyze listening patterns across its 600 million users worldwide.
Other Providers
Cloudflare has built significant cloud infrastructure focused on edge computing and security services. DigitalOcean targets developers and small businesses with simpler, more predictable pricing. Oracle Cloud competes specifically in enterprise database workloads. Alibaba Cloud dominates in China and Southeast Asia. IBM Cloud focuses on hybrid cloud and regulated industries.
Cloud Deployment Models
Beyond service models (IaaS/PaaS/SaaS), cloud deployments are classified by who can access the infrastructure.
Public Cloud
Infrastructure is owned and operated by a cloud provider and shared among multiple customers (tenants). This is the standard model described throughout this article. Resources are isolated per tenant through virtualization, but the underlying physical infrastructure is shared.
Advantages: No capital expenditure, instant scalability, access to advanced services, global reach. Disadvantages: Data sovereignty concerns, shared infrastructure (even if isolated), compliance complexity for regulated industries.
Private Cloud
Infrastructure is provisioned exclusively for a single organization, either on-premises or hosted by a provider. The organization gets cloud-like capabilities (self-service provisioning, elasticity, measured service) but with dedicated hardware.
Advantages: Maximum control, data sovereignty, can meet strict compliance requirements. Disadvantages: Requires capital investment, limited by owned capacity, requires internal operational expertise.
Example: Many financial institutions (JPMorgan Chase, Goldman Sachs) operate private clouds using technologies like VMware, OpenStack, or Red Hat OpenShift to provide cloud-like capabilities while maintaining control over physical infrastructure for regulatory compliance.
Hybrid Cloud
A combination of public and private cloud, with data and applications flowing between them. Organizations might run sensitive workloads on private cloud while bursting to public cloud for variable demand, or maintain legacy applications on-premises while developing new applications on public cloud.
Advantages: Flexibility to optimize placement of each workload. Disadvantages: Complexity of managing multiple environments and ensuring consistent security policies across them.
Example: Healthcare systems often run a hybrid cloud model: patient records and clinical systems remain on private cloud or on-premises for HIPAA compliance, while analytics workloads, machine learning model training, and patient-facing apps run on public cloud.
Multi-Cloud
Using services from multiple public cloud providers simultaneously. Different workloads might run on different providers, chosen for best-fit capabilities or to avoid vendor lock-in.
Example: A company might use AWS for its primary application infrastructure, Google Cloud for BigQuery analytics, and Cloudflare for edge security---choosing each for its specific strength.
Serverless Computing: The Next Abstraction
Serverless computing represents the extreme end of the cloud abstraction spectrum. You write functions, and the cloud provider handles absolutely everything else: provisioning servers, scaling them to zero when not in use and to hundreds of instances during spikes, patching, monitoring.
You pay only when your code executes, measured in milliseconds and memory used. A function that runs 10,000 times per day for 200ms each costs well under a cent per day on AWS Lambda at current pricing---compared to a minimum of $15-20 per month for the smallest dedicated server.
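That estimate follows directly from Lambda's pricing model: a per-invocation fee plus a charge per GB-second of compute. The rates below are approximate public list prices; check current pricing before relying on them.

```python
# Back-of-envelope serverless cost: per-request fee plus GB-seconds of
# compute. Rates are approximate AWS Lambda list prices, for illustration.

PER_REQUEST = 0.0000002       # $ per invocation
PER_GB_SECOND = 0.0000166667  # $ per GB-second of compute

def daily_lambda_cost(invocations: int, duration_ms: float, memory_mb: int) -> float:
    """Estimated daily cost for a function at a given invocation rate."""
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    return invocations * PER_REQUEST + gb_seconds * PER_GB_SECOND

# 10,000 invocations/day, 200ms each, at the smallest (128 MB) allocation:
cost = daily_lambda_cost(10_000, 200, 128)
print(f"${cost:.4f} per day")
```

At 128 MB this works out to roughly six-tenths of a cent per day; larger memory allocations scale the compute portion linearly.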
When serverless excels:
- Intermittent workloads: Functions that run occasionally, triggered by events
- Variable traffic: Systems with dramatic swings between peak and quiet periods
- Event processing: Reactions to file uploads, database changes, API calls
- Scheduled tasks: Nightly reports, hourly data processing, cleanup jobs
- API backends: REST APIs with variable request rates
Serverless limitations:
- Cold starts: Functions not recently invoked take 100-1000ms to initialize before handling requests. Keeping functions "warm" mitigates this, but adds cost and complexity.
- Maximum execution time: AWS Lambda limits function execution to 15 minutes. Long-running processes must be redesigned.
- Statelessness: Each invocation is independent. State must be stored externally in databases or caches.
- Vendor lock-in: Serverless functions use provider-specific APIs and deployment formats. Migrating between Lambda and Azure Functions requires rewriting deployment configuration and potentially application code.
- Observability challenges: Debugging distributed, ephemeral functions is harder than debugging persistent servers.
Example: Coca-Cola replaced its vending machine backend---which had to handle variable loads from millions of machines worldwide---with AWS Lambda serverless functions. The system scales automatically with demand, costs nothing when no machines are active, and eliminated the need to manage server infrastructure for a non-core capability.
Cloud Security: The Shared Responsibility Model
Security in the cloud operates under a shared responsibility model: the cloud provider secures the infrastructure; you secure what you build on it.
Provider's responsibility: Physical security of data centers, hardware, hypervisor security, network security within the provider's infrastructure, security of managed services.
Customer's responsibility: Data encryption, access management (who can access what), network configuration, application security, operating system patching (for IaaS), compliance with relevant regulations.
Most high-profile cloud security breaches result from customer misconfiguration, not provider failures. Common misconfiguration patterns:
- Public S3 buckets: Amazon S3 storage buckets set to public access when they should be private. In 2017, thousands of S3 buckets belonging to organizations including Verizon, WWE, and the Republican National Committee were found publicly accessible due to misconfiguration.
- Overly permissive IAM roles: Service accounts with administrator-level permissions when they need read access to a single bucket.
- Unencrypted databases: Database instances launched without encryption, storing sensitive data in plaintext.
- Overly permissive security groups: Network access rules that allow inbound traffic from 0.0.0.0/0 (the entire internet) to reach sensitive services.
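The security-group check in particular is easy to automate. The sketch below flags rules that expose sensitive ports to the whole internet; the rule format is a simplified stand-in for what cloud provider APIs actually return, and the port list is illustrative.

```python
import ipaddress

# Minimal audit over security-group-style rules: flag any inbound rule
# that opens a sensitive port to 0.0.0.0/0. The dict schema is a
# simplified stand-in, not a real cloud API response format.

SENSITIVE_PORTS = {22, 3306, 5432, 6379}  # SSH, MySQL, Postgres, Redis

def risky_rules(rules: list) -> list:
    flagged = []
    for rule in rules:
        net = ipaddress.ip_network(rule["cidr"])
        # prefixlen == 0 means the CIDR covers the entire address space
        if net.prefixlen == 0 and rule["port"] in SENSITIVE_PORTS:
            flagged.append(rule)
    return flagged

rules = [
    {"port": 443, "cidr": "0.0.0.0/0"},    # public HTTPS: expected
    {"port": 22, "cidr": "0.0.0.0/0"},     # SSH open to the internet: flag
    {"port": 5432, "cidr": "10.0.0.0/8"},  # Postgres, internal only: fine
]
print(risky_rules(rules))  # only the SSH rule is flagged
```

Real-world tooling (AWS Config rules, Azure Policy, open-source scanners) applies exactly this kind of check continuously rather than ad hoc.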
The relationship between cloud infrastructure and security deserves dedicated attention; the shared responsibility model means organizations cannot outsource security thinking to their cloud provider.
Cloud Cost Management: The Invisible Problem
Cloud bills surprise organizations reliably. The ease of provisioning resources is a feature---but it means resources can be created and forgotten just as easily.
The Cost Optimization Hierarchy
1. Eliminate waste first: Before optimizing what you use, stop using what you do not need.
- Delete unattached volumes and snapshots
- Terminate stopped instances that are not serving a purpose
- Remove idle load balancers, unused IP addresses, and forgotten test environments
- Delete old container images from registries
2. Right-size running resources: Most cloud resources are over-provisioned.
- CPU utilization below 10-20% suggests the instance is too large
- Memory utilization below 30-40% suggests the instance type should change
- AWS Compute Optimizer, Azure Advisor, and GCP Recommender provide automated right-sizing recommendations
3. Purchase reserved capacity for stable workloads: For infrastructure running continuously (production databases, core application servers), reserved instances offer 30-70% discounts versus on-demand pricing in exchange for 1-3 year commitments.
4. Use spot or preemptible instances for interruptible workloads: Cloud providers offer spare capacity at 60-90% discounts, with the caveat that instances can be reclaimed with short notice. Batch processing, data analysis, and CI/CD build runners are natural candidates.
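The right-sizing rules of thumb in step 2 can be expressed as a simple recommender. The utilization cutoffs below mirror the thresholds in the text; real tools like AWS Compute Optimizer use weeks of historical data rather than a single snapshot.

```python
# Right-sizing heuristic using the utilization thresholds from the text.
# A single snapshot is a simplification; real recommenders use history.

def rightsizing_advice(cpu_util: float, mem_util: float) -> str:
    """Return a rough recommendation given utilization fractions (0-1)."""
    if cpu_util < 0.15:
        return "downsize: CPU is mostly idle"
    if mem_util < 0.35:
        return "change instance family: memory is over-provisioned"
    return "keep current size"

print(rightsizing_advice(0.08, 0.60))  # low CPU -> downsize
print(rightsizing_advice(0.50, 0.20))  # low memory -> change family
print(rightsizing_advice(0.60, 0.70))  # healthy utilization -> keep
```

Even this crude version, run over a fleet inventory, surfaces the over-provisioned instances that quietly dominate most cloud bills.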
Example: Lyft reduced its AWS spend by $10 million annually by purchasing reserved instances, eliminating idle resources, and adding automated scaling policies that shut down non-production environments during off-hours.
Cost Attribution and Visibility
What gets measured gets managed. Cloud cost management requires:
- Resource tagging: Every resource tagged with project, team, environment, and cost center. Untagged resources cannot be attributed and are often wasted.
- Billing alerts: Configured at 50%, 75%, and 100% of expected monthly spend. The goal is no surprise bills.
- Cost dashboards: Regular reporting on spend by team, project, and service. Most cost overruns result from nobody reviewing the bill until it arrives.
- Reserved instance tracking: Monitoring utilization of committed capacity to ensure commitments are being used.
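The 50/75/100% billing alerts above reduce to a threshold check over month-to-date spend. The thresholds are taken from the text; wiring this to a real billing API is left as an assumption.

```python
# Billing-alert sketch: report which budget thresholds month-to-date
# spend has crossed. Thresholds match the 50/75/100% rule in the text.

THRESHOLDS = (0.50, 0.75, 1.00)

def crossed_alerts(spend: float, budget: float) -> list:
    """Return the fraction thresholds that current spend has crossed."""
    return [t for t in THRESHOLDS if spend >= t * budget]

# $820 spent against a $1,000 monthly budget:
print(crossed_alerts(820.0, 1000.0))  # 50% and 75% alerts have fired
```

Cloud providers offer this natively (AWS Budgets, Azure Cost Management alerts, GCP budget notifications); the point is that the logic is trivial and there is no excuse for not configuring it.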
For detailed strategies on controlling cloud spending, cloud cost optimization covers the full spectrum of techniques from tagging to architectural patterns.
Cloud Architecture Patterns
Moving to cloud computing is not merely moving existing systems to different infrastructure. The most successful cloud architectures leverage cloud-specific capabilities.
Stateless Services
Traditional applications often stored session state in the application server's memory. In the cloud, where instances can be added, removed, or replaced at any time, state stored in a single instance is lost when that instance dies.
Cloud-native pattern: Store session state in a distributed cache (Redis, Memcached) or database. Any instance can serve any request because no state lives in the instance itself. This enables auto-scaling: add instances during traffic spikes, remove them during quiet periods, without disrupting user sessions.
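The stateless pattern can be sketched in a few lines. Here a plain dict stands in for a shared cache like Redis; the session keys and handler are illustrative, not a real framework API.

```python
# Stateless-service sketch: session state lives in an external store, so
# any instance can serve any request. A dict stands in for Redis/Memcached.

session_store: dict = {}  # shared by all instances in the real pattern

def handle_request(instance_id: str, session_id: str, item: str) -> list:
    """Add an item to the user's cart, reading/writing the shared store."""
    cart = session_store.get(session_id, [])  # fetch state externally
    cart.append(item)
    session_store[session_id] = cart          # write it back
    return cart

# Two different instances serve the same user; the cart survives because
# neither instance holds the state itself.
handle_request("instance-a", "sess-42", "book")
cart = handle_request("instance-b", "sess-42", "lamp")
print(cart)  # the cart built across both instances
```

Because no instance holds state, the autoscaler can add or kill instances freely without logging anyone out.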
Decoupling with Message Queues
Tightly coupled systems---where Service A calls Service B directly---create fragility. If Service B is slow or unavailable, Service A is affected.
Cloud-native pattern: Introduce a message queue (AWS SQS, Azure Service Bus, Google Pub/Sub) between services. Service A places a message in the queue and continues. Service B reads from the queue when ready. Services can scale independently, fail independently, and be updated independently.
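The decoupling pattern looks like this in miniature. Python's in-process `queue.Queue` stands in for a managed service like SQS or Pub/Sub; the order payloads are illustrative.

```python
import queue

# Decoupling sketch: the producer enqueues work and returns immediately;
# the consumer drains the queue at its own pace. An in-process queue
# stands in for SQS / Service Bus / Pub/Sub.

work_queue: queue.Queue = queue.Queue()

def service_a_submit(order_id: int) -> None:
    """Service A enqueues and moves on; it never waits on Service B."""
    work_queue.put({"order_id": order_id})

def service_b_drain() -> list:
    """Service B processes whatever has accumulated, whenever it is ready."""
    processed = []
    while not work_queue.empty():
        processed.append(work_queue.get()["order_id"])
    return processed

for oid in (101, 102, 103):
    service_a_submit(oid)
print(service_b_drain())  # orders processed in arrival order
```

If Service B goes down, messages simply accumulate in the queue and are processed when it recovers---the failure never propagates back to Service A.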
Multi-Region Architecture
Single-region deployments mean a regional outage (AWS had significant outages in us-east-1 in December 2021) takes down your service. Multi-region architecture distributes load and failure risk.
Cloud-native pattern: Deploy to multiple regions with a global load balancer (AWS Global Accelerator, Cloudflare Load Balancing) routing users to the nearest healthy region. Data replication between regions is complex and represents a real cost in both money and engineering effort, but for critical services, the availability improvement is worth it.
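The routing decision at the heart of this pattern is simple: send the user to the lowest-latency region that is currently healthy. The region names and latency numbers below are made up; a real global load balancer measures both continuously.

```python
# Multi-region routing sketch: pick the nearest *healthy* region.
# Region data is illustrative; real balancers probe health and latency
# continuously per user location.

def pick_region(regions: list) -> str:
    """Return the lowest-latency region that passes its health check."""
    healthy = [r for r in regions if r["healthy"]]
    return min(healthy, key=lambda r: r["latency_ms"])["name"]

regions = [
    {"name": "us-east-1", "latency_ms": 20, "healthy": False},  # outage
    {"name": "us-west-2", "latency_ms": 70, "healthy": True},
    {"name": "eu-west-1", "latency_ms": 110, "healthy": True},
]
print(pick_region(regions))  # us-east-1 is closest but down, so us-west-2
```

This is exactly the behavior the Netflix example above describes: the nearest region is preferred, but traffic routes around any region that fails its health checks.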
Understanding Scaling Strategies
How systems grow to handle increasing load is one of the fundamental architectural decisions in cloud deployments. Scaling cloud systems requires understanding horizontal vs. vertical scaling, database scaling patterns, caching strategies, and the limits of each approach.
When Cloud Is Not the Answer
Cloud computing is not universally superior to owned infrastructure. Honest evaluation requires considering specific workloads.
High-Utilization Stable Workloads
For workloads running 24/7 at consistently high utilization, owned hardware is often cheaper over a 3-5 year period. The breakeven depends on utilization rates and the specific instance types, but organizations that have done the math---including companies like Dropbox---have moved workloads back to owned infrastructure after finding cloud too expensive at scale.
Dropbox famously "un-clouded" in 2016, migrating storage infrastructure from AWS to their own data centers, saving approximately $75 million over two years. The move made sense for Dropbox because their storage workload is predictable and high-utilization---exactly the conditions where ownership is competitive with rental.
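The rent-versus-own arithmetic behind decisions like Dropbox's is straightforward to model. All the numbers below are illustrative assumptions---not actual AWS or hardware prices---but they show why steady, high-utilization workloads favor ownership.

```python
# Back-of-envelope rent-vs-own comparison for a steady 24/7 workload.
# All figures are illustrative assumptions, not real vendor prices.

def cloud_cost(monthly_rate: float, months: int) -> float:
    """Total rental cost: pay the same rate every month."""
    return monthly_rate * months

def owned_cost(hardware: float, monthly_ops: float, months: int) -> float:
    """Total ownership cost: capital outlay up front, then power/space/ops."""
    return hardware + monthly_ops * months

months = 48  # a 4-year horizon
rent = cloud_cost(monthly_rate=400.0, months=months)
own = owned_cost(hardware=8000.0, monthly_ops=150.0, months=months)
print(f"cloud ${rent:,.0f} vs owned ${own:,.0f} over {months} months")
```

At full utilization over four years, the owned hardware comes out ahead in this sketch; at low or bursty utilization, the capital sits idle and the comparison flips back toward cloud---which is the entire argument of this section.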
Data Sovereignty and Residency Requirements
Some regulations require data to remain within specific geographic jurisdictions. While major cloud providers offer region choices and data residency guarantees, some requirements go further than providers can accommodate, requiring on-premises storage.
Specialized Hardware Requirements
Some workloads require specialized hardware---GPU clusters for machine learning, FPGAs for specific signal processing, or custom ASICs for particular applications. Cloud providers increasingly offer specialized hardware, but the selection is limited compared to building custom hardware.
The Cloud's Second Decade
Cloud computing's first decade was about migration: moving existing workloads from data centers to cloud infrastructure. The second decade is about transformation: redesigning systems to leverage cloud-native capabilities that did not exist before.
Generative AI infrastructure: The computational requirements for training and running large language models have driven enormous investment in specialized cloud infrastructure. AWS, Azure, and Google Cloud all offer managed AI services (SageMaker, Azure OpenAI Service, Vertex AI) that make previously research-only capabilities accessible to any organization.
Edge computing: Rather than centralizing all computation in a few large data centers, edge computing pushes computation closer to users and devices. Cloudflare Workers, AWS Lambda@Edge, and Fastly Compute@Edge run code at hundreds of locations globally, enabling latency-sensitive applications impossible with centralized architectures.
Sustainability pressure: Cloud providers have made significant commitments to renewable energy, often achieving better carbon efficiency than individually operated data centers. Microsoft has pledged to be carbon negative by 2030; Google aims to run its data centers on carbon-free energy around the clock by 2030. For organizations with sustainability mandates, cloud can be the more environmentally responsible choice.
Understanding how cloud infrastructure intersects with DevOps practices reveals how organizational processes must also evolve alongside technical infrastructure---the technology alone does not deliver the full benefit.
References
- Mell, Peter and Grance, Timothy. "The NIST Definition of Cloud Computing." NIST Special Publication 800-145, 2011. https://csrc.nist.gov/publications/detail/sp/800-145/final
- Armbrust, Michael et al. "A View of Cloud Computing." Communications of the ACM, 2010. https://dl.acm.org/doi/10.1145/1721654.1721672
- Amazon Web Services. "AWS Well-Architected Framework." aws.amazon.com. https://aws.amazon.com/architecture/well-architected/
- Google Cloud. "Google Cloud Architecture Framework." cloud.google.com. https://cloud.google.com/architecture/framework
- Microsoft. "Azure Well-Architected Framework." learn.microsoft.com. https://learn.microsoft.com/en-us/azure/well-architected/
- Kim, Gene et al. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. IT Revolution Press, 2013.
- Wittig, Michael and Wittig, Andreas. Amazon Web Services in Action. Manning Publications, 2019.
- Garrison, Justin and Nova, Kris. Cloud Native Infrastructure. O'Reilly Media, 2017. https://www.oreilly.com/library/view/cloud-native-infrastructure/9781491984291/
- Synergy Research Group. "Cloud Market Share Q4 2023." srgresearch.com, 2024. https://www.srgresearch.com/articles/cloud-market-share
- Dropbox. "Scaling to Exabytes." Dropbox Tech Blog, 2017. https://dropbox.tech/infrastructure/magic-pocket-infrastructure