In 1956, a small group of researchers gathered at Dartmouth College for a summer workshop and coined the term "artificial intelligence." Their proposal was optimistic to the point of hubris: they believed that "every aspect of learning or every other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." They expected significant progress within one summer.
What followed were seven decades of cycles — waves of enthusiasm, spectacular demonstrations, and then "AI winters" when progress stalled and funding evaporated. The field struggled because early approaches relied on hand-coded rules: programmers explicitly wrote the logic for every situation. This works for chess (a finite set of rules) but fails catastrophically for recognizing faces, understanding speech, or translating language — tasks humans do effortlessly but that are extraordinarily difficult to specify as explicit rules.
The breakthrough came when researchers stopped trying to program intelligence and started trying to learn it. Modern machine learning does not specify rules; it learns patterns from examples. Given enough labeled data and computational power, a machine learning system can learn to recognize cats in photographs, translate between languages, play Go at superhuman levels, and generate coherent essays — all without a programmer ever writing the rules for these tasks.
Understanding how this learning happens requires understanding neural networks, gradient descent, and the mathematics of optimization — not to write the code, but to understand the remarkable thing that is actually occurring when an AI system "learns."
"A breakthrough in machine learning would be worth ten Microsofts." — Bill Gates, Business @ the Speed of Thought (1999)
Key Definitions
Machine learning (ML) — A subset of artificial intelligence in which systems learn to perform tasks from data, rather than being explicitly programmed with rules. The system is given a set of examples and adjusts its internal parameters to produce the correct output for each example, developing a model that generalizes to new, unseen examples.
Neural network — A computational architecture loosely inspired by the brain, consisting of layers of interconnected processing units (neurons) with adjustable weights. Data flows through the network from input to output, being transformed at each layer. The connections' weights are the learned parameters.
Deep learning — Machine learning using neural networks with many layers (hence "deep"). The depth allows the network to learn hierarchical representations: early layers learn simple features, later layers combine these into increasingly abstract patterns. Deep learning has driven the most significant recent AI progress.
Parameter — An adjustable numerical value within a neural network (typically a weight or bias). Neural networks have many parameters: a small network might have thousands; large language models have hundreds of billions or more. Training adjusts all parameters to minimize error.
Loss function — A mathematical function measuring how wrong the model's predictions are, given a set of labeled examples. Common loss functions: mean squared error (for regression), cross-entropy (for classification). Training minimizes the loss function.
Gradient descent — The optimization algorithm used to train neural networks. Computes the gradient of the loss function with respect to each parameter — essentially, which direction and how much each parameter change would reduce the error — and adjusts parameters in the direction that reduces loss.
Backpropagation — The algorithm that efficiently computes the gradient in a neural network. It propagates the error signal backward through the network (from output to input), using the chain rule of calculus to compute each parameter's contribution to the error. Rumelhart, Hinton, and Williams popularized it in 1986.
Activation function — A non-linear function applied to each neuron's output. Without non-linearity, stacking multiple layers would be equivalent to a single linear transformation. Activation functions allow networks to learn complex, non-linear patterns. Common activations: ReLU (Rectified Linear Unit: output = max(0, input)), sigmoid, tanh.
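The two activations defined above are short enough to write out directly. A minimal sketch (the specific input values are illustrative):

```python
import math

def relu(z):
    # ReLU: passes positive values through unchanged, zeroes out negatives.
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(relu(-2.0), relu(3.0))   # 0.0 3.0 -- negative inputs become 0
print(sigmoid(0.0))            # 0.5 -- sigmoid is centered at 0
```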
Epoch — One complete pass through the training dataset. Networks are typically trained for many epochs, repeatedly presenting the training data and adjusting parameters, until performance converges.
Overfitting — When a model learns the training data too precisely — including its noise and idiosyncrasies — and loses the ability to generalize to new data. The model has memorized rather than learned. Overfitting is among the most common failure modes in machine learning.
Regularization — Techniques to prevent overfitting by constraining the model's complexity. L1/L2 regularization adds a penalty for large parameter values to the loss function. Dropout randomly deactivates some neurons during training, preventing co-adaptation. Data augmentation increases the effective size of the training set by applying label-preserving transformations, such as flipping or cropping images.
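To make the L2 idea concrete, here is a sketch of the penalty term being added to a loss; the weights, the base loss, and the strength `lam` are illustrative values, not from any real model:

```python
def l2_penalty(weights, lam):
    # lam is the regularization strength; larger lam favors smaller weights.
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]
base_loss = 0.8                                  # hypothetical unregularized loss
total = base_loss + l2_penalty(weights, lam=0.01)
print(total)  # 0.8 + 0.01 * (0.25 + 1.0 + 4.0) = 0.8525
```

Because the penalty grows with the squares of the weights, minimizing the combined loss pulls parameters toward zero unless the data justifies large values.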
Supervised learning — Learning from labeled examples: input-output pairs. The model is trained to predict the output given the input. Examples: image classification (input: image, output: label), machine translation (input: sentence in one language, output: sentence in another).
Unsupervised learning — Finding patterns in unlabeled data. No correct outputs are provided. Examples: clustering (grouping similar data points), dimensionality reduction, generative modeling.
Reinforcement learning — Training an agent through interaction with an environment. The agent takes actions, receives rewards or penalties, and adjusts its policy to maximize cumulative reward. Used in game-playing AI (AlphaGo, AlphaZero, OpenAI Five).
How a Neural Network Learns
The Basic Architecture
A neural network consists of:
Input layer: Receives raw data — pixel values for an image, word embeddings for text, numerical features for tabular data.
Hidden layers: Process the input through learned transformations. Each neuron computes a weighted sum of its inputs (z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b) and passes the result through an activation function (output = f(z)). The weights w and biases b are the learned parameters.
Output layer: Produces the prediction — a class probability for classification, a continuous value for regression, a probability distribution over vocabulary tokens for language modeling.
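The single-neuron computation described above can be written directly; the inputs, weights, and bias here are illustrative:

```python
def neuron(inputs, weights, bias):
    # z = w1*x1 + w2*x2 + ... + wn*xn + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation f(z); ReLU is used here as an example.
    return max(0.0, z)

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(out)  # 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
```

A layer is just many such neurons applied to the same inputs; a network is layers of these feeding into each other.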
Forward Pass: Making a Prediction
Data flows from input to output in a "forward pass." At each layer, neurons compute their weighted sums and apply activation functions. The final layer produces a prediction.
For a digit classification task (recognizing handwritten digits 0-9), the input might be a 28×28 pixel image (784 numbers between 0 and 1). The network processes these through hidden layers and outputs 10 numbers representing the probability that the image is each digit. The predicted class is the digit with the highest probability.
Initially, the network's weights are random — it makes essentially random predictions. Training adjusts the weights to make correct predictions.
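A forward pass for the digit-classification setup can be sketched at toy scale (4 inputs, 3 hidden units, 10 output classes instead of 784/hidden/10). The weights are random, so the prediction is essentially random, which is exactly the untrained state described above:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.random(4)                        # toy "image": 4 pixel values in [0, 1]
W1 = rng.standard_normal((3, 4)); b1 = np.zeros(3)    # hidden-layer parameters
W2 = rng.standard_normal((10, 3)); b2 = np.zeros(10)  # output-layer parameters

h      = np.maximum(0.0, W1 @ x + b1)     # hidden layer: weighted sums + ReLU
logits = W2 @ h + b2                      # output layer: one score per digit
probs  = np.exp(logits) / np.exp(logits).sum()  # softmax -> probabilities

print(probs.sum())          # probabilities sum to 1
print(int(probs.argmax()))  # "predicted" digit -- meaningless before training
```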
Computing Loss
After the forward pass, the loss function measures how wrong the prediction was. For digit classification, cross-entropy loss is common: it penalizes confident wrong predictions heavily and confident correct predictions minimally.
If the true label is "7" but the network assigned 5% probability to 7 and 60% probability to 4, the loss is high. If the network assigned 95% probability to 7, the loss is low.
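For a single example, cross-entropy reduces to the negative log of the probability assigned to the true class, which makes the asymmetry in the "7 vs 4" example above easy to see:

```python
import math

def cross_entropy(prob_of_true_class):
    # Loss is -log(p): near 0 when p is close to 1, very large when p is near 0.
    return -math.log(prob_of_true_class)

print(round(cross_entropy(0.05), 2))  # confident and wrong: loss ~3.0
print(round(cross_entropy(0.95), 2))  # confident and right: loss ~0.05
```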
Backpropagation: Assigning Blame
To improve the prediction, the network must know which parameters (weights) contributed most to the error and in what direction they should change. This is what backpropagation computes.
Backpropagation applies the chain rule of calculus to propagate the error signal backward through the network, computing the gradient of the loss with respect to every parameter. The gradient tells you: "if this weight increased by a tiny amount, the loss would increase/decrease by this amount."
For a simple example: if the output neuron for "4" has a weight connected to a hidden neuron that is too large, backpropagation identifies this weight's contribution to the error and marks it for reduction.
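The chain-rule bookkeeping can be shown on the smallest possible "network": one weight, one input, squared-error loss. Backpropagation scales this same calculation to millions of parameters:

```python
def loss(w, x, y):
    pred = w * x            # forward pass
    return (pred - y) ** 2  # squared-error loss

def grad_w(w, x, y):
    # Chain rule: dL/dw = dL/dpred * dpred/dw = 2*(pred - y) * x
    return 2 * (w * x - y) * x

w, x, y = 0.5, 2.0, 3.0
g = grad_w(w, x, y)         # 2 * (1.0 - 3.0) * 2.0 = -8.0
print(g)                    # negative gradient: increasing w would reduce loss

# Numerical check: nudge w and confirm the loss moves as the gradient predicts.
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(round(numeric, 3))
```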
Gradient Descent: Updating Parameters
Once gradients are computed, gradient descent updates each parameter:
w_new = w_old - learning_rate × gradient
The learning rate (η) controls the step size — how much to adjust each parameter per update. Too large and training is unstable (overshooting the minimum); too small and training is very slow.
This process — forward pass, compute loss, backpropagation, gradient descent update — is repeated for each batch of training examples (stochastic gradient descent), for many epochs, until the loss converges.
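The whole loop, forward pass, loss, gradient, update, fits in a few lines for a one-parameter model fitting y = 2x. Gradient descent recovers w ≈ 2 from a cold start:

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs with y = 2x
w = 0.0                                        # untrained initial weight
lr = 0.01                                      # learning rate

for epoch in range(200):                       # one epoch = one pass over data
    for x, y in data:
        pred = w * x                           # forward pass
        grad = 2 * (pred - y) * x              # gradient of (pred - y)^2 w.r.t. w
        w = w - lr * grad                      # gradient descent update

print(round(w, 4))  # converges to 2.0
```

Real training loops differ only in scale: vectors of parameters instead of one, batches instead of single examples, and gradients computed by backpropagation instead of by hand.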
After training on millions of handwritten digit examples, the network's weights have been adjusted to capture the visual patterns that distinguish different digits. It generalizes to new examples it has never seen.
How Deep Networks Learn Representations
The power of deep learning comes from hierarchical feature learning. Different layers learn different levels of abstraction.
Convolutional Neural Networks for Images
In a CNN trained on photographs:
Layer 1: Neurons learn to detect simple features (edges at various orientations, color gradients).
Layer 2: Neurons learn to detect combinations of edges (corners, curves, simple textures).
Layer 3: Neurons learn to detect more complex shapes and richer textures (eyes, wheels).
Layer 4: Neurons learn to detect object parts (faces, car fronts, animal limbs).
Layer 5: Neurons learn to detect whole objects (people, cats, cars).
This hierarchy is not programmed — it emerges from training on labeled images. The network discovers that breaking images into hierarchical features is an effective strategy for classification.
The same principle applies to other domains. A language model's early layers capture simple word patterns; later layers capture sentence structure, semantic meaning, and contextual inference.
Transfer Learning
A critical insight: representations learned for one task are often useful for related tasks. A network trained to classify ImageNet (1.2 million images, 1,000 categories) learns general visual features that transfer to new visual tasks — recognizing medical images, detecting objects in satellite photos, classifying plant diseases.
Transfer learning dramatically reduces the data and compute required for new tasks. Modern AI applications typically fine-tune pre-trained models rather than training from scratch.
Training at Scale: Large Language Models
Modern large language models (LLMs) like GPT-4 apply the same principles at enormous scale: billions of parameters, trained on hundreds of billions of text tokens, using thousands of GPUs for months.
The Pre-Training Objective
LLMs are trained on a "next token prediction" task: given a sequence of words, predict the next word. This objective, applied to enormous text corpora (web pages, books, code), forces the model to learn language patterns, factual knowledge, reasoning structures, and world models that generalize far beyond predicting text.
Why next-token prediction is surprisingly powerful: To predict the next word accurately in context "The capital of France is ___", the model must know that Paris is the capital of France. To predict "The patient's blood pressure medication was ___", it must understand medical contexts. By optimizing next-token prediction at scale, the model incidentally learns vast factual knowledge and reasoning capabilities.
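Next-token prediction at its very simplest can be sketched as a count table over a tiny corpus (the sentences below are illustrative). An LLM performs the same task with a neural network instead of counts, at vastly larger scale:

```python
from collections import Counter, defaultdict

corpus = "the capital of france is paris . the capital of italy is rome .".split()

# Count which token follows which in the corpus.
successors = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    successors[cur][nxt] += 1

def predict_next(word):
    # Predict the most frequent observed successor.
    return successors[word].most_common(1)[0][0]

print(predict_next("capital"))  # 'of' -- the only successor seen, twice
print(predict_next("is"))       # 'paris' or 'rome', depending on tie-breaking
```

The count table plainly cannot generalize beyond its corpus; a neural network's learned parameters can, which is what makes the same objective so much more powerful at scale.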
Scaling Laws
Kaplan et al. (2020) at OpenAI demonstrated that LLM performance (measured by loss on held-out text) follows predictable power laws as a function of model size (parameters), training data size, and compute budget. Larger models trained on more data with more compute reliably perform better — the relationship is smooth and predictable over many orders of magnitude.
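The model-size law has the form L(N) = (N_c / N)^α. The constants below approximate the fitted values Kaplan et al. report for the parameter-count law, but the printed losses are illustrative of the shape, not predictions for any real model:

```python
def scaling_loss(n_params, n_c=8.8e13, alpha=0.076):
    # Power law: loss falls smoothly and predictably as model size N grows.
    return (n_c / n_params) ** alpha

for n in [1e6, 1e9, 1e12]:
    print(f"{n:.0e} params -> loss {scaling_loss(n):.2f}")
```

The same functional form, with different constants, describes the data-size and compute laws, which is what allows labs to forecast large training runs from small pilot experiments.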
This empirical observation — that scaling works — drove the investment in increasingly large models: GPT-3 (175 billion parameters), GPT-4 (estimated ~1 trillion), and ongoing scaling by major AI labs. The improvements from scaling have been dramatic: capabilities that didn't exist in smaller models emerged at larger scales.
Emergent Capabilities
One of the most striking findings about large language models is the emergence of capabilities that were neither explicitly trained for nor present in smaller models: multi-step reasoning, code generation, arithmetic, translation between low-resource language pairs, and many others appeared once models reached sufficient scale.
These emergent capabilities — appearing suddenly as a function of scale rather than gradually — suggest that quantitative scaling can produce qualitative capability changes. This is both exciting and somewhat mysterious to researchers.
The Limits of Current AI
Despite remarkable capabilities, current AI systems have significant limitations:
They cannot reliably reason: LLMs perform impressively on many reasoning tasks but fail unpredictably on others, suggesting they are capturing surface statistical patterns rather than implementing genuine logical reasoning.
They hallucinate: LLMs generate plausible-sounding false information confidently. Because they are trained to produce fluent text, they produce confident text even when their training data doesn't support the claim.
They lack grounding: LLMs have no direct sensory experience, no embodied understanding of physics, and no real-world interaction. Their "knowledge" is entirely derived from text patterns.
They don't learn from interaction: A standard LLM does not update its parameters from interactions with users. It applies fixed learned weights to new prompts.
They have training data cutoffs: Knowledge is frozen at training time. Events after the training cutoff are unknown.
These limitations inform current research directions: better reasoning architectures, retrieval augmentation, grounding through multimodal training, and continuous learning methods.
For related concepts, see transformer architecture explained, AI hallucinations explained, and reinforcement learning from human feedback explained.
References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. https://arxiv.org/abs/2001.08361
- Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25. https://doi.org/10.1145/3065386
- Wei, J., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
Frequently Asked Questions
What is the difference between AI, machine learning, and deep learning?
AI is the broad field of creating machines that exhibit intelligent behavior. Machine learning is a subset of AI in which systems learn from data rather than being explicitly programmed with rules. Deep learning is a subset of machine learning using neural networks with many layers ('deep' architectures), which has driven recent AI breakthroughs in image recognition, natural language, and game-playing.
How does a neural network learn?
A neural network learns by adjusting its internal parameters (weights and biases) to minimize the difference between its predictions and the correct answers. This adjustment is made through gradient descent: computing how the error would change if each parameter were slightly adjusted, then moving each parameter in the direction that reduces error. This process, called backpropagation, is repeated across millions of training examples.
What is a neural network and why is it called that?
An artificial neural network is loosely inspired by the structure of the brain — it consists of layers of interconnected nodes ('neurons'), each combining inputs with learned weights and applying a nonlinear function. Input features flow through hidden layers, which extract increasingly abstract features, to produce an output. The 'neural' analogy is approximate — real biological neurons are far more complex.
What is overfitting in machine learning?
Overfitting occurs when a model learns the training data too well — including its random noise and quirks — and loses the ability to generalize to new data. An overfit model has high accuracy on training data but poor accuracy on unseen data. It's like memorizing the exam questions rather than learning the subject. Regularization, dropout, and using more training data are common remedies.
Why do large language models like GPT need so much data and compute?
Large language models learn statistical patterns in text by predicting the next word billions of times across vast corpora. The more data and compute used, the more parameters can be trained, and the more complex and nuanced patterns can be captured. Scaling laws show that model performance improves predictably with more data, compute, and parameters — but the cost scales accordingly.
What is the difference between supervised, unsupervised, and reinforcement learning?
Supervised learning trains on labeled examples (input-output pairs). Unsupervised learning finds patterns in unlabeled data (clustering, dimensionality reduction). Reinforcement learning trains an agent through interaction with an environment — the agent receives rewards for desired behaviors and learns to maximize cumulative reward. Modern AI systems often combine multiple paradigms.
Can AI become conscious or self-aware?
Current AI systems are not conscious and show no evidence of self-awareness in any meaningful sense. They are sophisticated pattern-matching systems that process inputs and produce outputs according to learned statistical patterns. The philosophical question of whether sufficiently complex information processing could produce consciousness remains deeply unresolved.