In 1956, a small group of researchers gathered at Dartmouth College for a summer workshop and coined the term "artificial intelligence." Their proposal was optimistic to the point of hubris: they believed that "every aspect of learning or every other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." They expected significant progress within one summer.

What followed were seven decades of cycles — waves of enthusiasm, spectacular demonstrations, and then "AI winters" when progress stalled and funding evaporated. The field struggled because early approaches relied on hand-coded rules: programmers explicitly wrote the logic for every situation. This works for chess (a finite set of rules) but fails catastrophically for recognizing faces, understanding speech, or translating language — tasks humans do effortlessly but that are extraordinarily difficult to specify as explicit rules.

The breakthrough came when researchers stopped trying to program intelligence and started trying to learn it. Modern machine learning does not specify rules; it learns patterns from examples. Given enough labeled data and computational power, a machine learning system can learn to recognize cats in photographs, translate between languages, play Go at superhuman levels, and generate coherent essays — all without a programmer ever writing the rules for these tasks.

Understanding how this learning happens requires understanding neural networks, gradient descent, and the mathematics of optimization — not to write the code, but to understand the remarkable thing that is actually occurring when an AI system "learns."

"A breakthrough in machine learning would be worth ten Microsofts." — Bill Gates, Business @ the Speed of Thought (1999)


Key Definitions

Machine learning (ML) — A subset of artificial intelligence in which systems learn to perform tasks from data, rather than being explicitly programmed with rules. The system is given a set of examples and adjusts its internal parameters to produce the correct output for each example, developing a model that generalizes to new, unseen examples.

Neural network — A computational architecture loosely inspired by the brain, consisting of layers of interconnected processing units (neurons) with adjustable weights. Data flows through the network from input to output, being transformed at each layer. The connections' weights are the learned parameters.

Deep learning — Machine learning using neural networks with many layers (hence "deep"). The depth allows the network to learn hierarchical representations: early layers learn simple features, later layers combine these into increasingly abstract patterns. Deep learning has driven the most significant recent AI progress.

Parameter — An adjustable numerical value within a neural network (typically a weight or bias). Neural networks have many parameters: a small network might have thousands; large language models have hundreds of billions. Training adjusts all parameters to minimize error.

Loss function — A mathematical function measuring how wrong the model's predictions are, given a set of labeled examples. Common loss functions: mean squared error (for regression), cross-entropy (for classification). Training minimizes the loss function.

Gradient descent — The optimization algorithm used to train neural networks. Computes the gradient of the loss function with respect to each parameter — essentially, which direction and how much each parameter change would reduce the error — and adjusts parameters in the direction that reduces loss.

Backpropagation — The algorithm that efficiently computes the gradient in a neural network. It propagates the error signal backward through the network (from output to input), using the chain rule of calculus to compute each parameter's contribution to the error. Rumelhart, Hinton, and Williams popularized it in 1986.

Activation function — A non-linear function applied to each neuron's output. Without non-linearity, stacking multiple layers would be equivalent to a single linear transformation. Activation functions allow networks to learn complex, non-linear patterns. Common activations: ReLU (Rectified Linear Unit: output = max(0, input)), sigmoid, tanh.
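The three activations named above can be written in a few lines. A minimal pure-Python sketch (standard library only):

```python
import math

def relu(x):
    # ReLU: passes positive values through unchanged, zeroes out negatives
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes into (-1, 1); zero-centered, unlike sigmoid
    return math.tanh(x)

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
```

All three are non-linear, which is the only property that matters here: composing purely linear layers would collapse to a single linear map.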

Epoch — One complete pass through the training dataset. Networks are typically trained for many epochs, repeatedly presenting the training data and adjusting parameters, until performance converges.

Overfitting — When a model learns the training data too precisely — including its noise and idiosyncrasies — and loses the ability to generalize to new data. The model has memorized rather than learned. Overfitting is the most common failure mode in machine learning.

Regularization — Techniques to prevent overfitting by constraining the model's complexity. L1/L2 regularization adds a penalty for large parameter values to the loss function. Dropout randomly deactivates some neurons during training, preventing co-adaptation. Data augmentation artificially increases training set size.

Supervised learning — Learning from labeled examples: input-output pairs. The model is trained to predict the output given the input. Examples: image classification (input: image, output: label), machine translation (input: sentence in one language, output: sentence in another).

Unsupervised learning — Finding patterns in unlabeled data. No correct outputs are provided. Examples: clustering (grouping similar data points), dimensionality reduction, generative modeling.

Reinforcement learning — Training an agent through interaction with an environment. The agent takes actions, receives rewards or penalties, and adjusts its policy to maximize cumulative reward. Used in game-playing AI (AlphaGo, AlphaZero, OpenAI Five).


Types of Machine Learning: A Comparison

Type | Training Data | Feedback Signal | Typical Applications
---- | ------------- | --------------- | --------------------
Supervised learning | Labeled input-output pairs | Correct answer provided per example | Image classification, translation, spam detection
Unsupervised learning | Unlabeled data | No explicit feedback | Clustering, anomaly detection, compression
Reinforcement learning | Environment interactions | Reward/penalty signals | Game playing (AlphaGo), robotics, recommendation
Self-supervised learning | Unlabeled data, self-generated labels | Predicting masked/future inputs | Large language models (GPT, BERT)
Transfer learning | Pre-trained model plus task-specific fine-tuning | Task-specific labels | Medical imaging, low-resource languages

The Long Road to the Breakthrough: A Brief History

Understanding where modern AI came from illuminates why it works the way it does.

The perceptron, developed by Frank Rosenblatt in 1958, was the first trainable neural network — a single layer of adjustable connections that could learn to classify inputs. Early enthusiasm was extravagant. The New York Times reported in 1958 that the Navy had shown a machine "capable of learning" that it expected "will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The enthusiasm was premature. In 1969, Minsky and Papert's Perceptrons demonstrated mathematically that single-layer networks could not solve even simple problems like the XOR function. Funding collapsed; the first AI winter began.

The field revived in the mid-1980s when Rumelhart, Hinton, and Williams (1986) demonstrated that backpropagation could train multi-layer networks effectively. Networks with hidden layers could, in principle, approximate any continuous function. A second wave of enthusiasm followed, then another winter in the early 1990s as networks remained computationally expensive, training was slow, and performance on real-world problems was modest.

The decisive turning point came in 2012, when Krizhevsky, Sutskever, and Hinton submitted a deep convolutional neural network (AlexNet) to the ImageNet Large Scale Visual Recognition Challenge and won by a margin so large — reducing the error rate from ~26% to ~15% — that the entire computer vision community pivoted to deep learning almost immediately. The combination of larger datasets, greater computational power (particularly GPU acceleration), and improved training techniques had finally unlocked the potential that backpropagation theoretically offered (Krizhevsky et al., 2012).


How a Neural Network Learns

The Basic Architecture

A neural network consists of:

Input layer: Receives raw data — pixel values for an image, word embeddings for text, numerical features for tabular data.

Hidden layers: Process the input through learned transformations. Each neuron computes a weighted sum of its inputs (z = w1x1 + w2x2 + ... + wnxn + b) and passes the result through an activation function (output = f(z)). The weights w and biases b are the learned parameters.

Output layer: Produces the prediction — a class probability for classification, a continuous value for regression, a probability distribution over vocabulary tokens for language modeling.
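The architecture above can be made concrete in a few lines. A toy forward pass in pure Python (all weights here are made-up illustrative numbers, not trained values):

```python
import math

def relu(z):
    return max(0.0, z)

def softmax(zs):
    # Convert raw output scores into a probability distribution
    exps = [math.exp(z - max(zs)) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation):
    # One layer: each neuron computes a weighted sum of its inputs
    # plus a bias (z = w1*x1 + ... + wn*xn + b), then applies f(z)
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Tiny network: 3 inputs -> 2 hidden neurons -> 2 output classes
x = [0.5, -1.0, 2.0]
hidden = dense(x, [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]], [0.0, 0.1], relu)
scores = dense(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], lambda z: z)
probs = softmax(scores)
print(probs)  # two class probabilities summing to 1
```

The same structure, scaled up to 784 inputs, larger hidden layers, and 10 outputs, is the digit classifier discussed below.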

Forward Pass: Making a Prediction

Data flows from input to output in a "forward pass." At each layer, neurons compute their weighted sums and apply activation functions. The final layer produces a prediction.

For a digit classification task (recognizing handwritten digits 0-9), the input might be a 28x28 pixel image (784 numbers between 0 and 1). The network processes these through hidden layers and outputs 10 numbers representing the probability that the image is each digit. The predicted class is the digit with the highest probability.

Initially, the network's weights are random — it makes essentially random predictions. Training adjusts the weights to make correct predictions.

Computing Loss

After the forward pass, the loss function measures how wrong the prediction was. For digit classification, cross-entropy loss is common: it penalizes confident wrong predictions heavily and confident correct predictions minimally.

If the true label is "7" but the network assigned 5% probability to 7 and 60% probability to 4, the loss is high. If the network assigned 95% probability to 7, the loss is low.

The choice of loss function is not arbitrary: it defines what "learning" means. A mean squared error loss trains the network to minimize average prediction error. A cross-entropy loss trains it to calibrate confidence as well as accuracy. Different tasks require different loss functions, and misaligned loss functions are a common source of model failure in real-world applications.
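The digit example above translates directly into code. Cross-entropy for a single example is just the negative log of the probability the model assigned to the true class:

```python
import math

def cross_entropy(prob_of_true_class):
    # Low probability on the truth -> large loss; high probability -> small loss
    return -math.log(prob_of_true_class)

# Network assigns only 5% probability to the true digit "7": loss is high
loss_bad = cross_entropy(0.05)

# Network assigns 95% probability to the true digit: loss is low
loss_good = cross_entropy(0.95)

print(loss_bad, loss_good)   # ~3.00 vs ~0.05
```

The logarithm is what makes confident wrong answers so expensive: as the probability assigned to the truth approaches zero, the loss grows without bound.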

Backpropagation: Assigning Blame

To improve the prediction, the network must know which parameters (weights) contributed most to the error and in what direction they should change. This is what backpropagation computes.

Backpropagation applies the chain rule of calculus to propagate the error signal backward through the network, computing the gradient of the loss with respect to every parameter. The gradient tells you: "if this weight increased by a tiny amount, the loss would increase/decrease by this amount."

For a simple example: if the output neuron for "4" has a weight connected to a hidden neuron that is too large, backpropagation identifies this weight's contribution to the error and marks it for reduction.

Rumelhart, Hinton, and Williams (1986) showed that backpropagation was efficient — scaling linearly with the number of parameters rather than exponentially, as naive methods would — and that it worked in practice on real problems. This was the algorithmic insight that made deep learning possible; without efficient gradient computation, large networks would be untrainable.
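The chain-rule computation can be checked numerically for the smallest possible case: one sigmoid neuron with squared-error loss (illustrative numbers throughout). The analytic gradient from the chain rule should match a finite-difference estimate, which is exactly the sanity check practitioners use to verify a backpropagation implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Forward pass: prediction a = sigmoid(w*x + b), then squared error
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    # Backward pass via the chain rule:
    # dL/dw = 2*(a - y) * a*(1 - a) * x, where a = sigmoid(w*x + b)
    a = sigmoid(w * x + b)
    return 2.0 * (a - y) * a * (1.0 - a) * x

w, b, x, y = 0.5, -0.2, 1.5, 1.0
analytic = grad_w(w, b, x, y)

# Numerical check: nudge w slightly and measure how the loss changes
eps = 1e-6
numerical = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numerical)  # the two agree to many decimal places
```

Backpropagation computes exactly this quantity for every parameter at once, reusing intermediate results so the cost scales with the network size rather than exploding combinatorially.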

Gradient Descent: Updating Parameters

Once gradients are computed, gradient descent updates each parameter:

w_new = w_old - learning_rate × gradient

The learning rate controls the step size — how much to adjust each parameter per update. Too large and training is unstable (overshooting the minimum); too small and training is very slow.

This process — forward pass, compute loss, backpropagation, gradient descent update — is repeated for each batch of training examples (stochastic gradient descent), for many epochs, until the loss converges.
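The whole loop fits in a few lines for the simplest possible model: fitting a line y = w·x + b by gradient descent on toy data (a sketch of the procedure, not how real networks are trained in practice):

```python
# Toy data generated from y = 2x + 1; training should recover w ≈ 2, b ≈ 1
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

w, b = 0.0, 0.0            # initial parameters (prediction starts out wrong)
learning_rate = 0.05

for epoch in range(2000):            # one epoch = one pass over the data
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y      # forward pass and error
        grad_w += 2 * error * x      # dL/dw for squared-error loss
        grad_b += 2 * error          # dL/db
    # Gradient descent update: step each parameter against its gradient
    w -= learning_rate * grad_w / len(data)
    b -= learning_rate * grad_b / len(data)

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

A neural network training run is this same loop with millions of parameters, gradients supplied by backpropagation, and the data processed in batches.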

After training on millions of handwritten digit examples, the network's weights have been adjusted to capture the visual patterns that distinguish different digits. It generalizes to new examples it has never seen.

The Loss Landscape and Why Training Works

A crucial question is why gradient descent finds good solutions at all. The space of all possible weight configurations is enormously high-dimensional — a network with millions of parameters has a loss "landscape" in millions of dimensions. Naively, you might expect this landscape to be full of poor local minima that would trap gradient descent.

Goodfellow, Vinyals, and Saxe (2015) and subsequent theoretical work have shown that deep networks tend to have a surprisingly benign loss landscape: most local minima have loss values close to the global minimum, and saddle points (where the gradient is zero but the point is not a local minimum) are far more common than true poor local minima. This theoretical result — still an active area of research — helps explain why gradient descent reliably finds good solutions in practice despite the apparent difficulty of the optimization problem (Goodfellow, Bengio, and Courville, 2016).


How Deep Networks Learn Representations

The power of deep learning comes from hierarchical feature learning. Different layers learn different levels of abstraction.

Convolutional Neural Networks for Images

In a CNN trained on photographs:

Layer 1: Neurons learn to detect simple features (edges at various orientations, color gradients).
Layer 2: Neurons learn to detect combinations of edges (corners, curves, simple textures).
Layer 3: Neurons learn to detect more complex shapes (eyes, wheels, complex textures).
Layer 4: Neurons learn to detect object parts (faces, car fronts, animal limbs).
Layer 5: Neurons learn to detect whole objects (faces, cats, cars).

This hierarchy is not programmed — it emerges from training on labeled images. The network discovers that breaking images into hierarchical features is an effective strategy for classification. Zeiler and Fergus (2014) developed visualization techniques that allowed researchers to see what each layer of a trained CNN was detecting, confirming the hierarchical feature learning hypothesis and revealing that the learned features were often interpretable in human terms.

The same principle applies to other domains. A language model's early layers capture simple word patterns; later layers capture sentence structure, semantic meaning, and contextual inference.

Transfer Learning

A critical insight: representations learned for one task are often useful for related tasks. A network trained to classify ImageNet (1.2 million images, 1,000 categories) learns general visual features that transfer to new visual tasks — recognizing medical images, detecting objects in satellite photos, classifying plant diseases.

Transfer learning dramatically reduces the data and compute required for new tasks. Modern AI applications typically fine-tune pre-trained models rather than training from scratch. Raghu et al. (2019), in a study published in NeurIPS, demonstrated that transfer from ImageNet pre-training produced substantial improvements even in medical imaging tasks, despite the domain difference, suggesting that the visual representations learned from natural images capture genuinely general structure.
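The shape of fine-tuning can be sketched in pure Python: a "pretrained" feature extractor is frozen, and only a small task-specific head is trained. Everything here is a toy stand-in (in practice the frozen part is something like a pretrained convolutional network and training uses a framework such as PyTorch):

```python
# Frozen "pretrained" feature extractor: stands in for the early layers
# whose weights were learned on a large dataset and are NOT updated here
def features(x):
    return [x, x * x]

# New downstream task: targets happen to follow y = 3*x^2 - x
data = [(x, 3.0 * x * x - x) for x in [-1.0, 0.0, 1.0, 2.0]]

head = [0.0, 0.0]   # task-specific linear head, trained from scratch
lr = 0.02
for _ in range(3000):
    for x, y in data:
        f = features(x)
        error = sum(h * fi for h, fi in zip(head, f)) - y
        # Gradient step on the head only; the extractor stays frozen
        head = [h - lr * 2 * error * fi for h, fi in zip(head, f)]

print([round(h, 2) for h in head])  # [-1.0, 3.0]
```

Because the extractor already produces useful features for the new task, the head alone, with few parameters and little data, suffices. That is the economics of transfer learning in miniature.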


Reinforcement Learning: Learning from Consequences

Reinforcement learning (RL) is the paradigm most clearly analogous to how biological organisms learn: through trial, error, and feedback. An agent interacts with an environment, takes actions, and receives rewards (or penalties). The goal is to learn a policy — a mapping from states to actions — that maximizes cumulative reward.

The algorithmic challenge in RL is the credit assignment problem: when a reward (or punishment) arrives, which of the many prior actions was responsible? A game-playing agent that wins after 200 moves must distribute credit — and blame — across all the decisions that led to that outcome.

Deep Q-Networks (DQN), developed by DeepMind and published in Nature (Mnih et al., 2015), combined deep neural networks with Q-learning to produce an agent that learned to play 49 Atari games at or above human performance, learning solely from raw pixel inputs and the game score. The system used no game-specific knowledge — the same architecture, trained with the same algorithm and hyperparameters, learned to play Space Invaders, Breakout, and Pong, each from scratch.
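The Q-learning half of DQN can be shown at toy scale with a table instead of a network (a hedged sketch: a five-state corridor of my own invention, not an Atari game; DQN's contribution was replacing this table with a deep network):

```python
import random

random.seed(0)

# Corridor of 5 states; reaching state 4 gives reward +1 and ends the episode.
# Actions: 0 = move left, 1 = move right.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current Q-table, sometimes explore
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward reward + discounted best future value.
        # This is the credit assignment machinery: value flows backward from the goal.
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

policy = [0 if q[0] > q[1] else 1 for q in Q[:GOAL]]
print(policy)  # [1, 1, 1, 1]: move right in every non-goal state
```

Note how states far from the goal acquire value only gradually, as the reward propagates backward one update at a time — the credit assignment problem made visible.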

AlphaGo (Silver et al., 2016) combined deep learning with reinforcement learning and tree search to defeat the world's top Go players — a feat widely considered impossible for AI just a few years earlier, because Go's game tree is too large for exhaustive search and the patterns of strong play were thought to require genuine strategic intuition. AlphaGo Zero (2017) then exceeded AlphaGo's performance while training purely through self-play, with no human games in its training data — demonstrating that sufficiently powerful learning systems can surpass the accumulated expertise of the best human practitioners.


Training at Scale: Large Language Models

Modern large language models (LLMs) like GPT-4 apply the same principles at enormous scale: billions of parameters, trained on hundreds of billions of text tokens, using thousands of GPUs for months.

The Pre-Training Objective

LLMs are trained on a "next token prediction" task: given a sequence of words, predict the next word. This objective, applied to enormous text corpora (web pages, books, code), forces the model to learn language patterns, factual knowledge, reasoning structures, and world models that generalize far beyond predicting text.

Why next-token prediction is surprisingly powerful: To predict the next word accurately in context "The capital of France is ___", the model must know that Paris is the capital of France. To predict "The patient's blood pressure medication was ___", it must understand medical contexts. By optimizing next-token prediction at scale, the model incidentally learns vast factual knowledge and reasoning capabilities.
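The smallest possible version of this objective is a bigram model: count which word follows which, then predict the most frequent continuation (a toy corpus of my own construction; real LLMs operate on subword tokens with learned, contextual representations rather than raw counts):

```python
from collections import Counter, defaultdict

corpus = ("the capital of france is paris . "
          "the capital of italy is rome . "
          "the capital of france is paris .").split()

# For every word, count which word follows it
following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

def predict_next(word):
    # Next-token prediction: return the most frequent continuation
    return following[word].most_common(1)[0][0]

print(predict_next("capital"))  # 'of'
print(predict_next("is"))       # 'paris' (the majority continuation)
```

Even this trivial model has "absorbed" a fact from its corpus: asked what follows "is", it answers "paris". Scaling the same objective from counts to a deep network, and from 20 words to hundreds of billions of tokens, is what produces an LLM.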

Brown et al. (2020) introduced GPT-3 — with 175 billion parameters — and demonstrated that it could perform a remarkable range of tasks (translation, question answering, arithmetic, code generation) with only a few examples provided in the prompt, or sometimes with no examples at all. This "few-shot" and "zero-shot" learning represented a dramatic expansion of what language models could do, and set off the current wave of investment in large language models.

Scaling Laws

Kaplan et al. (2020) at OpenAI demonstrated that LLM performance (measured by loss on held-out text) follows predictable power laws as a function of model size (parameters), training data size, and compute budget. Larger models trained on more data with more compute reliably perform better — the relationship is smooth and predictable over many orders of magnitude.

This empirical observation — that scaling works — drove the investment in increasingly large models: GPT-3 (175 billion parameters), GPT-4 (estimated ~1 trillion), and ongoing scaling by major AI labs. The improvements from scaling have been dramatic: capabilities that didn't exist in smaller models emerged at larger scales.

Hoffmann et al. (2022) from DeepMind refined the scaling laws analysis with the Chinchilla study, finding that existing large models were significantly undertrained relative to their parameter count — that for a given compute budget, compute was better spent on more training data than on more parameters. Chinchilla (70 billion parameters, trained on 1.4 trillion tokens) outperformed both GPT-3 (175B parameters, 300B tokens) and the much larger Gopher (280B parameters) trained with the same compute budget, demonstrating that data and parameters need to scale together.
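The Chinchilla result is often summarized by two rough rules of thumb: training compute is approximately C ≈ 6·N·D FLOPs for N parameters and D tokens, and compute-optimal training uses roughly 20 tokens per parameter. A sketch of the arithmetic (these are approximations in common use, not exact figures from the paper):

```python
def compute_optimal(compute_budget_flops, tokens_per_param=20):
    # Rule of thumb: D ≈ 20 * N and C ≈ 6 * N * D
    # => C ≈ 120 * N^2  =>  N ≈ sqrt(C / 120)
    n_params = (compute_budget_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B params on ~1.4T tokens => C ≈ 6 * 70e9 * 1.4e12 ≈ 5.9e23
n, d = compute_optimal(5.9e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # roughly 7e10 and 1.4e12
```

Plugging Chinchilla's own compute budget back into the rule recovers roughly its actual configuration, which is why this heuristic became the default sizing guide after 2022.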

Emergent Capabilities

One striking property of large language models is the emergence of capabilities that were not explicitly trained for and were not present in smaller models: multi-step reasoning, code generation, arithmetic, translation between low-resource language pairs, and many others appeared when models reached sufficient scale.

Wei et al. (2022), in a survey published in Transactions on Machine Learning Research, documented over 100 such emergent abilities — tasks on which smaller models score near-randomly but large models score substantially above chance. These emergent capabilities — appearing suddenly as a function of scale rather than gradually — suggest that quantitative scaling can produce qualitative capability changes. This is both exciting and somewhat mysterious to researchers.


The Transformer Architecture: Why It Changed Everything

The dominant architecture underlying modern LLMs is the Transformer, introduced by Vaswani et al. (2017) in the paper "Attention Is All You Need." The key innovation was the self-attention mechanism, which allows the network to directly model dependencies between any pair of positions in a sequence, regardless of their distance.

Before Transformers, sequence modeling relied on recurrent networks (RNNs and LSTMs), which processed tokens sequentially and struggled to retain information over long sequences. Self-attention processes all tokens in parallel — dramatically reducing training time — and allows the network to attend to any relevant context anywhere in the input, regardless of position.

In self-attention, each token is represented as three vectors: a query, a key, and a value. The attention weight between two tokens is computed from the dot product of one token's query with the other's key, scaled by the square root of the key dimension and normalized by a softmax. The output representation for each token is a weighted sum of all value vectors, with the weights given by the attention scores. This mechanism allows the network to learn which other tokens are relevant to understanding each token in context.
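The query/key/value computation described above is short enough to write out in NumPy. A single-head sketch with random toy inputs (real Transformers add learned projection matrices, multiple heads, and masking):

```python
import numpy as np

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, each an 8-dimensional query
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)
print(out.shape)       # (4, 8): one output vector per token
print(w.sum(axis=-1))  # each token's attention weights sum to 1
```

Every token attends to every other token in a single matrix multiplication: this all-pairs, fully parallel structure is what the next paragraph's point about GPU hardware is about.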

The Transformer's parallelizability was as important as its expressiveness. GPU hardware is optimized for massive parallelism. Recurrent networks, processing tokens sequentially, could not fully exploit this parallelism; Transformers could. This hardware compatibility, combined with self-attention's expressiveness, made scaling to billions of parameters practical.


The Limits of Current AI

Despite remarkable capabilities, current AI systems have significant limitations:

They cannot reliably reason: LLMs perform impressively on many reasoning tasks but fail unpredictably on others, suggesting they are capturing surface statistical patterns rather than implementing genuine logical reasoning.

They hallucinate: LLMs generate plausible-sounding false information confidently. Because they are trained to produce fluent text, they produce confident text even when their training data doesn't support the claim. Ji et al. (2023), in a survey in ACM Computing Surveys, documented hallucination as a pervasive problem across all major LLMs, with rates varying by task and domain but never reaching zero even on simple factual queries.

They lack grounding: LLMs have no direct sensory experience, no embodied understanding of physics, and no real-world interaction. Their "knowledge" is entirely derived from text patterns.

They don't learn from interaction: A standard LLM does not update its parameters from interactions with users. It applies fixed learned weights to new prompts.

They have training data cutoffs: Knowledge is frozen at training time. Events after the training cutoff are unknown.

They are vulnerable to adversarial examples: inputs specifically crafted to cause misclassification. Szegedy et al. (2014) showed that small, carefully chosen perturbations to an image, invisible to human observers, could cause a network to misclassify with high confidence. This vulnerability has implications for security-sensitive applications of AI in areas like autonomous vehicles and medical diagnosis.
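The attack idea can be shown on the simplest differentiable model. A pure-Python sketch on a toy logistic-regression "classifier" with made-up fixed weights; the perturbation direction used here follows the fast-gradient-sign method, a follow-up to Szegedy et al.:

```python
import math

# Toy "model": logistic regression with fixed (untrained, illustrative) weights
w, b = [1.5, -2.0, 0.5], 0.1

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))   # probability of class 1

x, y = [1.0, 0.5, -1.0], 1   # this input is (weakly) classified as class 1

# For logistic loss, the gradient of the loss w.r.t. the INPUT is (p - y) * w
p = predict(x)
grad_x = [(p - y) * wi for wi in w]

# Adversarial perturbation: a small step in the sign of that gradient
eps = 0.1
x_adv = [xi + eps * (1 if g > 0 else -1) for xi, g in zip(x, grad_x)]

print(predict(x), predict(x_adv))  # confidence in the true class drops below 0.5
```

Each coordinate moved by only 0.1, yet the prediction flips, because the perturbation pushes every input dimension in whichever direction hurts the model most. In high-dimensional inputs like images, many such tiny per-pixel pushes add up to a large effect while remaining imperceptible.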

These limitations inform current research directions: better reasoning architectures, retrieval augmentation, grounding through multimodal training, adversarial robustness, and continuous learning methods.


AI Learning in Practice: Real-World Scale

The infrastructure required for training frontier AI models has grown to extraordinary scale. Training GPT-4 was estimated to have required approximately 10^24 to 10^25 floating point operations (FLOP), taking months of computation on thousands of high-end GPUs. Estimates of the energy consumption for training frontier models range from hundreds of thousands to millions of kilowatt-hours per run, with associated carbon emissions comparable to transcontinental flights for each training run.

This computational scale is not merely an engineering curiosity — it has policy implications. The concentration of the computational resources required to train frontier models in a small number of companies (primarily in the United States and China) has prompted regulatory discussions about access, safety, and the governance of AI development. The energy requirements have raised questions about the environmental sustainability of continued scaling, and have driven investment in more efficient training approaches including sparse models, mixture-of-experts architectures, and improved hardware.


References

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323(6088), 533-536. https://doi.org/10.1038/323533a0
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539
  • Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. https://arxiv.org/abs/2001.08361
  • Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). Advances in Neural Information Processing Systems, 33, 1877-1901. https://arxiv.org/abs/2005.14165
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25. https://doi.org/10.1145/3065386
  • Wei, J., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
  • Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
  • Silver, D., et al. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587), 484-489. https://doi.org/10.1038/nature16961
  • Mnih, V., et al. (2015). Human-level Control through Deep Reinforcement Learning. Nature, 518(7540), 529-533. https://doi.org/10.1038/nature14236
  • Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556. https://arxiv.org/abs/2203.15556
  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-319-10590-1_53
  • Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38. https://doi.org/10.1145/3571730
  • Szegedy, C., et al. (2014). Intriguing Properties of Neural Networks. International Conference on Learning Representations. https://arxiv.org/abs/1312.6199
  • Raghu, M., et al. (2019). Transfusion: Understanding Transfer Learning for Medical Imaging. Advances in Neural Information Processing Systems, 32.
  • Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386-408.
  • Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.

For related concepts, see transformer architecture explained, AI hallucinations explained, and reinforcement learning from human feedback explained.

Frequently Asked Questions

What is the difference between AI, machine learning, and deep learning?

AI is the broad field of creating intelligent machines. Machine learning is a subset where systems learn from data rather than explicit rules. Deep learning is machine learning using multi-layer neural networks, which has driven modern AI breakthroughs in image recognition, language, and game-playing.

How does a neural network learn?

By adjusting its internal parameters (weights) to minimize prediction error. Backpropagation computes the gradient of the error with respect to every parameter; gradient descent then nudges each parameter in the direction that reduces error. This is repeated millions of times across training examples.

What is a neural network and why is it called that?

A network of interconnected processing units (neurons) in layers, loosely inspired by the brain. Input features flow through hidden layers — each neuron computing a weighted sum plus nonlinear activation — to produce an output. The 'neural' analogy is approximate; real neurons are far more complex.

What is overfitting in machine learning?

When a model memorizes training data — including its noise — and fails to generalize to new examples. High training accuracy but low test accuracy is the signature. Remedies include regularization, dropout, early stopping, and more training data.

Why do large language models like GPT need so much data and compute?

Because performance follows predictable scaling laws: more parameters, more data, and more compute reliably produce better models. GPT-3 has 175 billion parameters; training required months on thousands of GPUs. Capabilities that don't exist at smaller scales emerge at sufficient scale.

What is the difference between supervised, unsupervised, and reinforcement learning?

Supervised learning uses labeled examples (correct answers provided). Unsupervised learning finds patterns in unlabeled data. Reinforcement learning trains an agent through rewards and penalties during interaction with an environment. Modern AI typically combines multiple paradigms.

Can AI become conscious or self-aware?

Current AI systems show no evidence of consciousness or self-awareness — they are sophisticated pattern-matching systems. Whether sufficiently complex information processing could produce consciousness remains a deeply unresolved philosophical and empirical question.