What Is a Neural Network: How AI Learns From Data
In October 2012, a graduate student named Alex Krizhevsky submitted an entry to the ImageNet Large Scale Visual Recognition Challenge, an annual competition in which computer vision systems classified more than a million photographs across one thousand categories. The best system the year before had achieved a top-5 error rate of roughly 26 percent, meaning it failed to include the correct label among its five guesses on 26 percent of images. Krizhevsky's system, built with his supervisor Geoffrey Hinton and fellow student Ilya Sutskever at the University of Toronto, achieved an error rate of 15.3 percent, nearly cutting the error in half.
The system was called AlexNet. It was a neural network with eight layers. Before AlexNet, computer vision researchers had spent decades designing features by hand, carefully crafting mathematical descriptions of edges, textures, and shapes that would allow a computer to identify objects. AlexNet made all of that manual feature engineering obsolete. It learned the features automatically from data.
The impact on the field was immediate and total. Within a few years, every serious competitor at ImageNet used neural networks. Within a decade, the techniques AlexNet demonstrated had propagated across virtually every domain of artificial intelligence. Understanding what a neural network actually is — not the metaphor but the reality — is foundational to understanding modern AI.
The Brain Metaphor: What It Gets Right and What It Gets Wrong
Neural networks are called neural because their structure was loosely inspired by the biological neural networks in animal brains. The metaphor is useful up to a point and misleading beyond it.
"The brain is a tissue. It is a complicated, intricately woven tissue, like nothing else we know of in the universe, but it is composed of cells, as any tissue is." — David Hubel, Nobel laureate in neuroscience. The mathematical side of the metaphor dates to 1943, when Warren McCulloch and Walter Pitts first modeled neurons as logic gates, laying the foundation for artificial neural networks.
What the metaphor gets right: the brain processes information through large numbers of interconnected cells called neurons. These neurons do not work sequentially, like a traditional program, but in massively parallel fashion. Many neurons fire simultaneously, and the pattern of activity across the network encodes information and drives behavior. Artificial neural networks also consist of many interconnected processing units operating in parallel.
What the metaphor gets wrong: almost everything else. Biological neurons are extraordinarily complex cells that communicate through electrochemical signals across synaptic connections that are themselves dynamic, modifiable, and regulated by dozens of neurotransmitters and neuromodulators. Biological neurons fire in timing-dependent patterns, and the timing matters as much as which neurons fire. The brain has roughly 86 billion neurons connected by approximately 100 trillion synapses, organized in architectures that neuroscientists are still mapping.
An artificial neuron is a simple mathematical function. It receives numerical inputs, multiplies each by a weight, sums the results, adds a bias term, and passes the sum through an activation function to produce a single numerical output. That output becomes input to other artificial neurons in the next layer. The entire operation is arithmetic.
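The entire operation fits in a few lines. The sketch below is illustrative rather than from any particular library: one artificial neuron with a ReLU activation (discussed later in this article), where the specific inputs, weights, and bias are arbitrary numbers chosen for the example.

```python
def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, plus bias, through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)  # ReLU activation: pass positive sums through, clamp negatives to zero

# Three inputs, three weights, one bias -> one output number.
out = neuron([0.5, -1.0, 2.0], [0.8, 0.2, -0.1], bias=0.1)
```

That output number would then be fed, alongside the outputs of other neurons in the same layer, into the neurons of the next layer.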
The brain metaphor was useful historically because it provided intuitive scaffolding for the underlying mathematics. It becomes harmful when people mistake it for literal description, inferring that neural networks work like brains, or that having many layers makes a network somehow more mind-like. Neither is true. A neural network is a mathematical object that happens to have been inspired by biological structure and happens to be very effective at learning complex patterns from data.
Architecture: Layers, Nodes, and Connections
Every neural network shares a basic architectural structure: an input layer, one or more hidden layers, and an output layer.
The input layer receives raw data. For an image recognition network, each input node receives one pixel brightness value. A 224x224 image has about 50,000 pixels; with three color channels per pixel, that comes to roughly 150,000 values, so a network processing such images might have 150,000 input nodes, one per value. For a language model, each input represents a token, roughly a word or word fragment, encoded as a number.
Hidden layers are where the learning happens. Each node in a hidden layer receives input from every node in the previous layer, or from a local neighborhood of nodes in convolutional architectures, multiplied by connection weights. The node applies its activation function to the weighted sum and passes its output forward.
In a network with one hidden layer, this single transformation must capture whatever patterns are useful for the task. In a network with ten hidden layers, each transformation builds on the previous one, allowing the network to develop increasingly abstract representations. Early layers might detect edges and colors. Later layers might represent complex shapes and object parts. The deepest layers represent high-level concepts like "dog" or "car" or "malignant."
The output layer produces the network's final prediction. For a classification problem with ten categories, the output layer has ten nodes, one per category. The values of these nodes are converted into probabilities using a function called softmax, and the category with the highest probability is the network's prediction.
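Putting the three kinds of layer together, a complete forward pass for a toy network can be sketched in plain Python. All the weights, biases, and inputs here are made-up numbers; a real network would have learned them from data.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    # weights[j][i] is the connection weight from input node i to output node j.
    return [sum(w_i * x for w_i, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Input layer -> one hidden layer (ReLU) -> output layer (softmax over 2 categories).
x = [1.0, 0.5]
hidden = relu(dense(x, [[0.4, -0.6], [0.3, 0.9]], [0.0, 0.1]))
probs = softmax(dense(hidden, [[1.2, -0.7], [-0.3, 0.8]], [0.0, 0.0]))
prediction = probs.index(max(probs))         # category with the highest probability
```

The softmax values always sum to one, which is what lets the output layer be read as a probability distribution over categories.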
How Weights and Biases Encode Learning
Every connection between nodes in a neural network has an associated weight, a single floating-point number. A network with millions of connections has millions of weights. These weights are the parameters of the model, the numbers that actually encode what the network has learned.
Understanding what weights do makes it clearer why neural networks work. When an input passes through a connection with a high positive weight, it has a strong positive influence on the receiving node. A high negative weight has a strong inhibitory effect. A weight near zero means the connection barely matters. After training, the pattern of weights across the network encodes, in distributed numerical form, everything the network knows.
This distributed representation is fundamentally different from how information is stored in traditional software. If you wrote a program to recognize dogs, you might have a variable called has_four_legs that is explicitly checked. In a neural network, the concept "has four legs" is not stored in any single place. It is distributed across thousands of weights that collectively cause appropriate responses to leg-shaped visual features. This is powerful because it allows the network to handle noisy, incomplete, or ambiguous inputs gracefully. It is challenging because it means you cannot inspect the network and find where it stores any particular piece of knowledge.
Biases are additional learned parameters, one per node. The bias is simply a constant added to the weighted sum before the activation function is applied. Biases allow nodes to shift their activation thresholds, making them easier or harder to activate regardless of input values. They give the network more flexibility to fit the training data.
Backpropagation: How Neural Networks Learn
Backpropagation is the algorithm that makes neural network learning possible. It was known in principle for decades but was not widely applied to multi-layer networks until the 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams that showed clearly how to use it in practice.
"Learning multiple levels of representation and abstraction helps to make sense of data, whether images, sound or text. I believe that a key to intelligence is learning representations at multiple levels of abstraction." — Geoffrey Hinton
The learning process works in two passes. In the forward pass, an input example travels through the network, layer by layer, until the output layer produces a prediction. This prediction is compared to the correct answer using a loss function, which quantifies how wrong the prediction was. For classification, a common choice is cross-entropy loss, which punishes confident wrong predictions much more severely than uncertain ones.
In the backward pass, the algorithm works from the output layer back to the input layer, computing for each weight how much it contributed to the error. This computation uses the chain rule from calculus to propagate error gradients backward through the network. Each weight receives a gradient: a number indicating how much the loss would increase or decrease if that weight were changed slightly.
Once all gradients are computed, gradient descent adjusts every weight in the direction that reduces the loss. The step size is controlled by the learning rate hyperparameter. This entire process — one forward pass and one backward pass across a batch of examples — is called an iteration, and training consists of thousands to millions of iterations.
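For a single sigmoid neuron trained with cross-entropy loss, the two passes and the weight update collapse into a few lines, because the chain rule gives the gradient in closed form: dLoss/dw_i = (p - y) * x_i. The toy task, learning rate, and epoch count below are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: output 1 when the first input exceeds the second, else 0.
data = [([0.9, 0.1], 1.0), ([0.2, 0.8], 0.0),
        ([0.7, 0.3], 1.0), ([0.1, 0.5], 0.0)]

w, b = [0.0, 0.0], 0.0
lr = 1.0                                     # learning rate hyperparameter

for epoch in range(500):
    for x, y in data:
        # Forward pass: compute the prediction.
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Backward pass: for sigmoid + cross-entropy the chain rule yields
        # dLoss/dz = p - y, so dLoss/dw_i = (p - y) * x_i and dLoss/db = p - y.
        err = p - y
        # Gradient descent step: move each weight against its gradient.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

In a deep network the backward pass applies the same chain rule repeatedly, layer by layer, rather than in one closed-form step.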
What makes this remarkable is that backpropagation distributes learning credit appropriately across the entire network, including through many layers. Without it, training deep networks was effectively impossible because there was no way to know how weights in early layers should change in response to errors at the output. With it, networks can learn useful representations at every layer simultaneously.
Activation Functions: Why They Matter
A neural network without activation functions would be useless. If every node simply computed a weighted sum of its inputs, the entire network, regardless of depth, would be equivalent to a single-layer linear model. You could mathematically collapse all the layers into one. The network could only learn linear relationships, which is a severe limitation for almost any interesting problem.
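This collapse is easy to verify numerically. In the sketch below (made-up weights, biases omitted for brevity), two stacked linear layers produce exactly the same output as a single layer whose weight matrix is the product of the two:

```python
def matvec(M, v):
    # Multiply matrix M by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    # (A @ B)[i][j] = sum over k of A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[0.5, -1.0], [2.0, 0.3]]
W2 = [[1.0, 0.4], [-0.2, 0.9]]
x = [0.7, -0.2]

two_layers = matvec(W2, matvec(W1, x))   # layer 1 then layer 2, no activations
one_layer = matvec(matmul(W2, W1), x)    # the single collapsed equivalent layer
```

No matter how many linear layers you stack, the same collapse applies; only a non-linearity between layers prevents it.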
Activation functions introduce non-linearity, allowing neural networks to learn curved, complex relationships. The simplest activation function that achieves this is the sigmoid function, which squashes any input into a value between 0 and 1. For many years, sigmoid was standard. Its problem is the vanishing gradient: as networks get deeper, gradients computed during backpropagation shrink exponentially as they propagate backward, making early layers learn very slowly or not at all. This is part of why deep networks were so hard to train before roughly 2010.
The Rectified Linear Unit, or ReLU, is now the most commonly used activation function in hidden layers. ReLU is simple: output equals the input if the input is positive, zero otherwise. This single change, which sounds trivial, dramatically improved the ability to train deep networks by eliminating the vanishing gradient problem in positive-activation neurons. AlexNet used ReLU in 2012, and it was one of the reasons the network could be as deep as it was.
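The contrast can be made concrete with the derivatives themselves. Backpropagation multiplies one activation-derivative factor per layer (alongside the weights); the sigmoid's derivative never exceeds 0.25, while ReLU's derivative is exactly 1 for any positive input. A small illustrative calculation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 for positive inputs

# Even in the sigmoid's best case (z = 0 at every layer), ten layers shrink
# the gradient by a factor of 0.25 ** 10, under one millionth. Ten
# positively-activated ReLU layers contribute no shrinkage at all.
sigmoid_factor = sigmoid_grad(0.0) ** 10
relu_factor = relu_grad(1.0) ** 10
```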
For output layers, activation functions depend on the task. Softmax converts a vector of raw scores into a probability distribution for multi-class classification. Sigmoid is used for binary classification or any output where the answer is a probability between 0 and 1. For regression, where the output is a continuous number, no activation function is applied to the output layer.
Types of Neural Networks
The basic feedforward architecture, where data flows in one direction from input to output, is the foundation. But different tasks have motivated architectural variations that are substantially more effective for their respective domains.
| Architecture | Best For | Key Innovation | Famous Example |
|---|---|---|---|
| Convolutional Neural Network (CNN) | Images, video, spatial data | Convolutional filters that detect local patterns and share weights across positions | AlexNet (2012), ResNet (2015) |
| Recurrent Neural Network (RNN) / LSTM | Sequential data: text, speech, time series | Hidden state that persists across time steps, giving the network memory | Google's speech recognition (2012), neural machine translation |
| Transformer | Language, vision, multimodal tasks | Self-attention: every position directly attends to every other position simultaneously | GPT-4, Claude, BERT, DALL-E |
Convolutional Neural Networks
Convolutional neural networks, CNNs, were designed for data with spatial structure, particularly images. The key operation is convolution: instead of connecting every input node to every hidden node, convolutional layers apply learned filters that slide across the input, detecting local patterns like edges, corners, and textures. The same filter is applied at every position, which makes the network invariant to where in the image a pattern appears.
Early convolutional layers detect low-level features: edges, gradients, color blobs. Deeper layers combine these into higher-level features: curves, shapes, object parts, and eventually complete objects. This hierarchical feature detection is precisely what makes CNNs so effective for visual data.
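The convolution operation itself can be implemented directly. The sketch below slides a hand-written 3x3 vertical-edge filter over a tiny synthetic grayscale image; in a real CNN the filter values would be learned from data, not hand-chosen.

```python
# Tiny "image": a dark region on the left, a bright region on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]   # responds strongly to dark-to-bright vertical edges

def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # The same kernel weights are reused at every position (weight sharing).
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

feature_map = convolve(image, kernel)
# The feature map responds only where the window straddles the dark/bright boundary.
```

The zeros in the feature map mark flat regions; the large values mark the edge, wherever in the image it happens to fall — the position invariance the prose above describes.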
"Convolutional networks have proven spectacularly good at recognizing patterns in images. It turned out that the features they learn are not just pattern detectors, but hierarchical representations of visual concepts." — Yann LeCun
CNNs were formalized in modern form by Yann LeCun in his 1998 work on handwriting recognition, which was deployed commercially to read zip codes and checks. AlexNet in 2012 scaled the same basic approach with more layers, more data, and GPU training to achieve its landmark results. CNNs subsequently achieved superhuman performance on the ImageNet benchmark and are now used in virtually every industrial computer vision system.
Recurrent Neural Networks
Recurrent neural networks, RNNs, were designed for sequential data where order matters: text, speech, time series. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that persists across time steps, giving the network a form of memory.
When processing a sentence word by word, an RNN updates its hidden state at each word, allowing information from earlier words to influence processing of later words. This is useful for language, where understanding a pronoun depends on knowing what it refers to several words earlier.
RNNs have a long-range dependency problem: information from many steps ago tends to be diluted or forgotten by the time it is relevant. Long Short-Term Memory networks, LSTMs, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, add gating mechanisms that allow the network to explicitly control what information is retained and what is discarded. LSTMs became the standard approach for sequence modeling and powered a generation of machine translation, speech recognition, and text generation systems.
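A scalar toy version shows both the memory and its limits. The weights below are arbitrary numbers chosen for the example; the point is that the first input's influence on the hidden state persists but shrinks at every later step — the long-range dependency problem in miniature, and what LSTM gating mitigates.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    # The new hidden state mixes the current input with the previous state.
    return math.tanh(w_x * x + w_h * h + b)

# One-dimensional toy RNN processing a sequence of four inputs.
w_x, w_h, b = 1.0, 0.5, 0.0
h = 0.0
states = []
for x in [1.0, 0.0, 0.0, 0.0]:   # a signal at step 1, then silence
    h = rnn_step(x, h, w_x, w_h, b)
    states.append(h)
# The first input's trace remains in every later state, but decays each step.
```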
Transformers and the Attention Revolution
Transformers, introduced in the 2017 paper "Attention Is All You Need" by researchers at Google Brain, replaced recurrent networks as the dominant architecture for language tasks and subsequently expanded into vision and other domains.
The key insight was the attention mechanism. Instead of processing a sequence step by step and relying on a hidden state to carry information, transformers directly compute relationships between all positions in the input simultaneously. For every word in a sentence, attention asks: which other words are most relevant to understanding this word? The answer is learned from data, and it can capture relationships across arbitrary distances without the degradation that plagues RNNs.
Transformers also parallelize efficiently because they process all input positions simultaneously rather than sequentially, which made it practical to train on much larger datasets using modern GPU hardware. This scalability, combined with the expressive attention mechanism, is what enabled the large language models like GPT-4 and Claude that have defined the recent period of AI development. At a conceptual level, GPT works by predicting the next token in a sequence, using attention to consider the relationships between all previous tokens when making each prediction.
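Scaled dot-product attention, the core transformer operation, is a short computation. The sketch below is a minimal single-head version in plain Python with made-up vectors; production implementations add learned projection matrices for queries, keys, and values, multiple heads, and batched tensor operations.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every query position attends to every key position."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every position, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three positions with 2-dimensional vectors; each output blends all three values.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(q, k, v)
```

Note that every position's scores are computed independently of the others, which is exactly why the operation parallelizes so well.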
"The magic of neural nets, specifically deep learning, is that they can learn arbitrary mappings from inputs to outputs. The challenge is making them learn the right things." — Andrej Karpathy
The Black Box Problem
Neural networks are remarkably good at prediction and remarkably bad at explanation. For any given prediction, it is very difficult to trace through the millions of weights and activation values to say why the network produced that output rather than a different one.
This interpretability problem has serious consequences in high-stakes applications. When a neural network reviews a loan application, denies it, and the applicant asks why, there is no satisfying answer to give. The network might have learned to encode race or gender as implicit features in apparently neutral inputs, and there is no straightforward way to audit for this. When a medical imaging AI detects a tumor, clinicians cannot verify the diagnosis by examining the AI's reasoning the way they would examine a colleague's.
The field of explainable AI, sometimes abbreviated XAI, addresses this problem. Techniques like LIME (Local Interpretable Model-Agnostic Explanations), developed by Marco Ribeiro and colleagues at the University of Washington in 2016, and SHAP (SHapley Additive exPlanations), create local approximations of neural network behavior that can be inspected. These methods explain individual predictions but do not expose the full learned model. They are useful approximations, not complete solutions.
Gradient-based visualization methods attempt to identify which input features most influenced a prediction by computing gradients with respect to the input. For image classification, these methods can highlight which regions of an image the network attended to when making its decision. A tumor detection system that highlights the right area of a scan provides more trustworthy evidence than one that simply outputs a binary classification.
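As a rough sketch of the gradient idea, the toy code below estimates input sensitivities by finite differences against a hypothetical stand-in scoring function (the function and its coefficients are invented for illustration; real gradient-based methods differentiate through the trained network itself):

```python
def model(x):
    # Stand-in for a trained network: a fixed, made-up scoring function.
    return 3.0 * x[0] + 0.1 * x[1] - 2.0 * x[2]

def saliency(model, x, eps=1e-5):
    """Finite-difference estimate of d(output)/d(input_i) for each input feature."""
    base = model(x)
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps           # nudge one feature, hold the rest fixed
        grads.append((model(bumped) - base) / eps)
    return grads

grads = saliency(model, [1.0, 1.0, 1.0])
# The largest-magnitude gradients mark the most influential input features.
```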
The interpretability gap is a genuine scientific problem, not a temporary engineering limitation waiting to be solved. Understanding why neural networks work as well as they do is itself an active research area.
What Neural Networks Cannot Do
Neural networks excel at tasks that can be formulated as pattern matching in high-dimensional data with abundant labeled examples. Image and audio recognition, machine translation, certain medical diagnostics, game-playing, and content generation all fit this description.
Neural networks struggle reliably with tasks that require systematic reasoning, counting, or applying rules consistently to new situations. Ask a language model what 387 plus 459 is, and it may answer correctly because that specific computation or similar ones appeared in training data. Ask it a long chain of arithmetic steps in a novel format, and performance degrades. Neural networks learn statistical associations between inputs and outputs, not the underlying rules of arithmetic.
They are also brittle in ways that human perception is not. A trained image classifier can be fooled by adversarial examples: images altered by perturbations imperceptible to human eyes but mathematically crafted to cause the network to produce confident wrong predictions. Ian Goodfellow and colleagues demonstrated this in 2014, showing that adding carefully computed noise to an image of a panda caused a neural network to classify it as a gibbon with 99.3 percent confidence, even though the image remained completely panda-like to human observers.
This brittleness reflects something important: neural networks have learned statistical regularities that correlate with correct answers in their training distribution, not the causal structure of the world that makes those answers correct. Outside the training distribution, the statistical regularities break down.
From AlexNet to the Deep Learning Revolution
The trajectory from AlexNet in 2012 to today's large multimodal models represents one of the most rapid capability improvements of any modern technology. Each year from 2012 to 2017, the winning ImageNet error rate fell substantially. By 2015, Microsoft's ResNet had achieved superhuman performance on the benchmark, with a top-5 error rate of 3.57 percent against an estimated human error of roughly 5 percent.
The architectural innovations that enabled this — deeper networks, better activation functions, dropout regularization, batch normalization, and residual connections — built on each other rapidly. Each solved a specific problem that had limited training of deeper networks: vanishing gradients, internal covariate shift, overfitting. The result was that each year it became practical to train networks that were deeper and more powerful than the year before.
Language followed vision by several years. The transformer architecture from 2017 removed the bottleneck that had prevented language models from scaling. GPT-1 in 2018, GPT-2 in 2019, and GPT-3 in 2020 demonstrated that scaling transformers in parameters and training data produced consistent capability improvements. GPT-3's 175 billion parameters, at the time of its release, produced a system capable of generating coherent text, writing code, and performing many language tasks zero-shot — without task-specific training — in ways that seemed qualitatively different from all prior systems.
Neural networks did not replace all other approaches to AI. Rule-based systems, classical statistical models, and gradient-boosted trees remain the best choice for many practical problems. But for tasks involving raw, high-dimensional, unstructured data, especially text and images, neural networks have produced results that alternative approaches cannot match, which is why they have become the dominant paradigm in AI research and increasingly in production systems.
Practical Takeaways
If you are evaluating whether a neural network is appropriate for a problem, ask first whether you have enough data. Shallow neural networks can work with thousands of examples. Deep networks typically need hundreds of thousands to millions of labeled examples to reach good performance. If your dataset is small, a simpler model will likely generalize better and be much easier to maintain.
If you are building systems that will use neural networks in consequential applications, treat interpretability as a first-order concern alongside accuracy. A system that makes correct predictions 95 percent of the time but cannot explain its reasoning creates different risks than a slightly less accurate system that provides auditable justifications. For medical, legal, or financial applications, the regulatory and ethical stakes often make explainability non-negotiable.
If you are learning about neural networks, begin with the feedforward architecture and the backpropagation learning algorithm before moving to specialized architectures. The principles of weights, activation functions, forward pass, and gradient descent are common to every neural network type. Understanding them well means that learning about CNNs, transformers, and future architectures is a matter of understanding what specific problem each architectural innovation was designed to solve.
The history of neural networks is a history of solving specific, concrete bottlenecks: the vanishing gradient problem, the inability to parallelize RNNs, the need for hand-engineered features. Future advances will come from solving the next generation of concrete bottlenecks: interpretability, data efficiency, robustness to distribution shift, and reliable reasoning. Knowing what those problems are and why they matter is the starting point for understanding where the field is going.
Frequently Asked Questions
What is a neural network in simple terms?
A neural network is a computational system loosely inspired by the structure of the human brain. It consists of layers of simple processing units called nodes or neurons, connected by weighted links. When data passes through the network, each layer transforms it slightly until the final layer produces a prediction or output. The network learns by adjusting the weights of those connections based on how wrong its predictions are, gradually improving its accuracy through repeated exposure to training data.
How is a neural network similar to the human brain?
Both biological brains and artificial neural networks use large numbers of connected processing units that work in parallel. In the brain, biological neurons fire signals across synapses; in a neural network, artificial neurons pass numerical values across weighted connections. The structural inspiration ends there. Artificial neural networks do not think, feel, or understand anything. They are mathematical functions that transform input data into output predictions through layers of arithmetic operations, with no consciousness or internal experience of any kind.
What are the layers of a neural network?
A neural network has an input layer that receives raw data, one or more hidden layers that transform that data through learned representations, and an output layer that produces the final prediction or classification. Each connection between nodes has a weight that determines how much influence one node has on another. Nodes also have a bias value and apply an activation function that introduces non-linearity, allowing the network to learn complex, curved relationships rather than only straight-line patterns.
How do weights work in a neural network?
Weights are numerical values assigned to every connection between nodes in the network, and they determine how strongly one node influences the next. When the network makes a prediction, each input is multiplied by its weight before being passed forward. During training, the learning algorithm repeatedly adjusts these weights to reduce prediction errors. A weight that consistently leads to correct predictions gets reinforced while weights leading to errors are reduced. By the end of training, the pattern of weights across the entire network encodes everything the model has learned.
What is backpropagation and how does it work?
Backpropagation is the algorithm that allows a neural network to learn by distributing responsibility for errors back through the network. When a prediction is wrong, the algorithm calculates how much each weight contributed to the mistake and adjusts those weights proportionally. Starting from the output layer and working backward toward the input layer, it updates each weight by a small amount in the direction that would have reduced the error. This process repeats millions of times across the training dataset, gradually pushing all the weights toward values that produce accurate predictions.
How is a neural network different from traditional programming?
Traditional programming requires a developer to write explicit rules covering every scenario the software might encounter. Neural networks instead learn rules automatically from data by adjusting their internal weights through training. This makes neural networks capable of handling tasks like recognizing faces in photos or understanding spoken language, where the rules are far too complex and context-dependent for any human to write explicitly. The tradeoff is that neural networks are much harder to interpret and debug than rule-based code.
What are the main types of neural networks?
Feedforward networks are the simplest type, passing data in one direction from input to output. Convolutional neural networks (CNNs) are specialized for image processing and use filters to detect spatial patterns like edges and shapes. Recurrent neural networks (RNNs) handle sequential data like text and time series by maintaining a form of memory across time steps. Transformers, the architecture behind large language models like GPT, use attention mechanisms to process relationships between all parts of an input simultaneously, making them exceptionally powerful for language tasks.
What tasks are neural networks best at?
Neural networks excel at tasks involving complex, high-dimensional data where the patterns are too subtle or numerous for humans to define manually. Image and video recognition, speech-to-text conversion, machine translation, drug discovery, game-playing, and content generation are all areas where neural networks have achieved remarkable results. They tend to require large amounts of training data and significant computing power to achieve these results, which limits their applicability in domains where data is scarce or expensive to collect.
Why are neural networks considered a black box?
Neural networks are called black boxes because, despite knowing their inputs and outputs, it is very difficult to explain why they produce a specific result. A network might correctly identify a tumor in a medical image, but tracing exactly which neurons and weights led to that conclusion is enormously complex. This interpretability problem is a serious concern in high-stakes applications like medicine, finance, and criminal justice, driving a growing field of research called explainable AI that seeks to make neural network decisions more transparent and auditable.
How does deep learning relate to neural networks?
Deep learning refers specifically to neural networks with many hidden layers, typically more than two. The word deep refers to the depth of these layers. Neural networks with just one or two hidden layers are considered shallow and are capable only of learning relatively simple patterns. Deep networks can learn much more abstract and hierarchical representations of data, which is why they power breakthroughs in image recognition, speech recognition, and natural language processing. In practice, deep learning and neural networks have become nearly synonymous because most modern neural networks are deep.