In June 2017, eight researchers at Google published a paper titled "Attention Is All You Need." The title was a provocation — a claim that the dominant paradigm for processing sequential data was unnecessary. Recurrent neural networks, long short-term memory networks, convolutional sequence models: the field had spent years refining these architectures for processing text, audio, and other sequential data. The Google researchers proposed replacing all of it with a single mechanism called self-attention.
The paper introduced the transformer architecture. Within three years, virtually every major advance in artificial intelligence would be built on it. GPT, BERT, T5, PaLM, LLaMA, Claude, Gemini — all are transformers. The image models, the audio models, the protein folding models, the code generation models — nearly all are transformers or transformer-derived. The architecture changed the field more profoundly than any single development since backpropagation.
Understanding the transformer — what it does, why it works, and what it replaced — is understanding the foundation of every modern AI system.
"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." — Vaswani et al., Attention Is All You Need (2017)
Key Definitions
Transformer — A neural network architecture that processes sequential data using self-attention rather than recurrence or convolution. Introduced by Vaswani et al. (2017) at Google. The basis of virtually all large-scale AI language models, and increasingly applied to image, audio, video, and protein structure data.
Self-attention — The core mechanism of the transformer. For each position in an input sequence, self-attention computes a weighted representation that reflects the relationships between that position and every other position in the sequence. The weights (attention scores) determine how much each other position contributes to the representation at the current position.
Attention mechanism — A general technique for computing context-dependent representations by selectively focusing on relevant parts of the input. Self-attention is a specific form of attention where a sequence attends to itself. The attention mechanism pre-dates transformers but was previously used as a supplement to recurrent networks.
Recurrent neural network (RNN) — A neural network architecture that processes sequences one element at a time, maintaining a hidden state that accumulates information from previous elements. RNNs were the dominant architecture for sequence processing before transformers. Their key limitation: information from early in a long sequence degrades as it passes through many processing steps before influencing later outputs.
Long short-term memory (LSTM) — A variant of RNNs with gating mechanisms designed to learn longer-range dependencies. LSTMs significantly improved on vanilla RNNs for long sequences but still processed inputs sequentially, preventing parallelization during training.
Token — The basic unit of input for a transformer. Text is tokenized — split into words, subwords, or characters depending on the tokenizer — and each token is converted to a numerical vector (embedding) before processing. Modern language models typically use byte-pair encoding (BPE) or similar subword tokenization schemes.
Embedding — A numerical vector representation of a token. Embeddings encode semantic relationships: tokens with similar meanings have embeddings that are geometrically close. The transformer converts input tokens to embeddings, processes them, and converts output embeddings back to tokens.
Positional encoding — Information added to token embeddings to convey the position of each token in the sequence. Because transformers process all positions simultaneously, they have no inherent sense of order. Positional encodings inject this information. The original paper used fixed sinusoidal functions; modern architectures often use learned or rotary position encodings.
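The fixed sinusoidal scheme from the original paper can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the published formulas, not any particular library's implementation:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Original-paper positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=8, d_model=16)
# position 0 encodes as sin(0) = 0 in even dims and cos(0) = 1 in odd dims
```

Each position gets a distinct pattern of sines and cosines at geometrically spaced frequencies, which is simply added to the token embeddings before the first layer.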
Multi-head attention — Running multiple self-attention computations in parallel, each potentially capturing different types of relationships. The outputs are concatenated and linearly transformed. Multiple heads allow the model to represent different aspects of input relationships simultaneously.
Feed-forward layer — Each transformer layer includes a position-wise feed-forward network after the attention sublayer: a two-layer fully connected network applied independently to each position. This component introduces non-linearity and increases model capacity.
Layer normalization — A normalization technique applied at each sublayer to stabilize training by normalizing the values flowing through the network. Helps prevent the gradient problems that plagued deep networks before normalization techniques were developed.
Encoder — The transformer component that processes the full input sequence bidirectionally, producing a rich representation of the input. Encoder models (BERT and its variants) are well-suited for understanding tasks.
Decoder — The transformer component that generates output sequences autoregressively — one token at a time, with each new token attending only to previously generated tokens and the encoder output. Decoder models (GPT and its variants) are well-suited for generation tasks.
Autoregressive generation — The process by which decoder transformers generate output: producing one token at a time, with each token generated by sampling from a probability distribution conditioned on all previous tokens. The model generates the most probable continuation given everything that came before.
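The generation loop itself is simple; the model does all the work. In the sketch below, `toy_next_token_logits` is a hypothetical stand-in for a real decoder transformer, used only to make the loop runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_token_logits(tokens, vocab_size=10):
    # Stand-in for a decoder transformer: a real model would run
    # self-attention over `tokens` and return next-token logits.
    return np.arange(vocab_size, dtype=float) * 0.1 + len(tokens)

def generate(prompt, steps, vocab_size=10):
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_next_token_logits(tokens, vocab_size)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                      # softmax over the vocabulary
        next_token = int(rng.choice(vocab_size, p=probs))
        tokens.append(next_token)                 # condition on everything so far
    return tokens

out = generate(prompt=[1, 2, 3], steps=5)
```

Each new token is sampled from a distribution conditioned on the full sequence so far, then appended and fed back in — the autoregressive cycle.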
The Problem Transformers Solved
Sequential Processing Was a Bottleneck
Recurrent neural networks were the standard architecture for sequence processing through the mid-2010s. An RNN processes a sequence element by element: at each step, it takes the current input and the previous hidden state, produces a new hidden state, and optionally an output. This hidden state is supposed to carry information about everything the model has seen so far.
The problem is that the hidden state is a fixed-size vector. For long sequences, compressing all relevant information into a single vector at each step is increasingly difficult. Information from early in the sequence must survive many transformations before it can influence outputs near the end of the sequence. The further apart two elements are in the sequence, the harder it is for the network to learn the relationship between them.
LSTMs and GRUs (Gated Recurrent Units) improved on vanilla RNNs by adding gating mechanisms that helped the network selectively retain and forget information. But the fundamental bottleneck remained: sequential processing, one element at a time.
"Long Short-Term Memory can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through 'constant error carousels' within special units." — Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation (1997)
LSTMs were a genuine improvement, but they did not eliminate the sequential bottleneck. And sequential processing meant sequential training: you could not start computing step 50 until you had finished step 49. This severely limited the efficiency of training on modern parallel hardware.
The Attention Solution
The attention mechanism — in various forms — had been used as a supplement to recurrent networks since Bahdanau et al.'s 2015 paper on neural machine translation. The idea was to let the decoder at each generation step look back at all encoder hidden states, rather than relying solely on the final encoder hidden state. This gave the decoder direct access to information from anywhere in the input, significantly improving translation of long sentences.
What Vaswani et al. recognized was that attention was not a supplement — it was sufficient. If you could compute relationships between any two positions in a sequence directly, without needing recurrence to propagate information, you did not need recurrence at all. You could process all positions simultaneously, compute all pairwise relationships in parallel, and train on modern hardware without the sequential bottleneck.
This is the transformer: attention, supplemented only by feed-forward layers and normalization, with no recurrence and no convolution.
How Self-Attention Works
The transformer processes each input token by computing a weighted combination of all tokens, where the weights reflect how relevant each token is to the current one. The mechanism uses three learned projections of each token's embedding:
- Query (Q): What this token is looking for.
- Key (K): What this token offers to others.
- Value (V): What information this token contributes when attended to.
For each position, the attention weight with every other position is computed as the dot product of the query at the current position with the key at the other position, scaled and passed through a softmax function to produce a probability distribution over all positions. The output is a weighted sum of the values, weighted by these attention probabilities.
The result: each token's output representation is a mixture of all tokens' value vectors, with higher weight given to tokens that are more relevant to the current query. A token can effectively look at the entire context and incorporate information from wherever it is most relevant.
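The computation described above fits in a few lines. This is a minimal single-head NumPy sketch with no masking; variable names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise query-key dot products
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of value vectors

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out, w = scaled_dot_product_attention(Q, K, V)
# out has one d-dimensional vector per position; w[i] is position i's
# probability distribution over all positions
```

Row i of the weight matrix is exactly the "how relevant is every token to token i" distribution described above.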
Multi-head attention runs this computation multiple times in parallel with different learned projections. Each "head" may capture a different type of relationship: grammatical dependencies, semantic similarity, positional patterns, coreference. The outputs of all heads are concatenated and linearly transformed.
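The head-splitting is mostly bookkeeping. A sketch, assuming the common layout where d_model is divided evenly across heads (weight names are mine, not from the paper):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concatenate, project."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape (n, d_model) -> (n_heads, n, d_head): one subspace per head
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                          # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate head outputs
    return concat @ Wo                                      # final linear projection

rng = np.random.default_rng(1)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = rng.normal(size=(4, d_model, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Because each head works in a lower-dimensional subspace (d_head = d_model / n_heads), multi-head attention costs roughly the same as a single full-width head.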
Encoder vs. Decoder Architectures
The original transformer paper described a full encoder-decoder architecture for sequence-to-sequence tasks like translation. Subsequent work showed that the encoder and decoder components could be used independently for different purposes.
| Architecture | Attention Direction | Primary Use | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (each position attends to all positions) | Understanding tasks: classification, named entity recognition, question answering | BERT, RoBERTa, ELECTRA |
| Decoder-only | Causal (each position attends only to previous positions) | Generation tasks: text completion, dialogue, code generation | GPT-2, GPT-3, GPT-4, LLaMA, Claude |
| Encoder-decoder | Encoder: bidirectional; Decoder: causal + cross-attention to encoder | Seq-to-seq tasks: translation, summarization, question answering | T5, BART, original Transformer |
The bidirectional attention in encoder models is powerful for understanding: the model can use context from both before and after a token to build its representation. BERT (Bidirectional Encoder Representations from Transformers) was designed around this insight, achieving state-of-the-art performance on language understanding benchmarks by processing text in both directions simultaneously.
Decoder-only models use causal (unidirectional) attention to preserve the autoregressive property needed for generation: each token's representation can only use information from previous tokens, which means the model can generate output one token at a time in a consistent left-to-right manner. The GPT series demonstrated that large decoder-only models trained on enough text develop general capabilities without task-specific architectural modifications.
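The causal constraint is typically implemented as a mask over the attention scores before the softmax. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions so position i attends only to positions <= i."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future
    scores = np.where(mask, -np.inf, scores)          # -inf becomes 0 after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# row 0 attends only to itself; later rows spread weight over past positions
```

Setting masked scores to negative infinity drives their softmax probability to zero, so no information flows backward from future tokens.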
Why Transformers Scaled So Well
Parallelization During Training
Because self-attention processes all positions simultaneously, transformers can leverage modern GPU and TPU hardware far more efficiently than RNNs, which process positions sequentially. Training an RNN on a 512-token sequence requires 512 sequential steps. Training a transformer on the same sequence requires a fixed number of matrix operations that can be parallelized completely.
This parallelization advantage means transformers can be trained on far more data for a given compute budget than their predecessors could. More data + more compute = better models. The scaling advantage became a decisive factor as the field moved toward larger datasets and larger models.
The Scaling Laws
Jared Kaplan et al. at OpenAI published empirical scaling laws for transformer language models in 2020. They found that model performance — measured by language modeling loss — follows smooth power-law relationships with model size (number of parameters), training data size (number of tokens), and compute. The relationships held across many orders of magnitude.
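As an illustration, Kaplan et al. fit loss as a power law in parameter count N (with data and compute non-limiting), roughly L(N) = (N_c / N)^alpha_N. The constants below are the paper's approximate fitted values and should be treated as illustrative rather than exact:

```python
# Approximate fitted constants reported by Kaplan et al. (2020)
N_c = 8.8e13       # scale constant, in parameters (approximate)
alpha_N = 0.076    # fitted exponent (approximate)

def loss(N):
    """Language modeling loss as a power law in parameter count N."""
    return (N_c / N) ** alpha_N

# A power law means each doubling of model size shrinks loss by the
# same constant factor, 2 ** (-alpha_N), regardless of starting size.
ratio = loss(2e9) / loss(1e9)
```

The small exponent is the practical story: improvements per doubling are modest, so large capability gains required many orders of magnitude of scale.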
"We find that language modeling performance depends strongly on scale — the number of model parameters, the size of the dataset, and the amount of compute used for training — and only weakly on the model architecture within a class." — Kaplan et al., Scaling Laws for Neural Language Models (2020)
This was a profound result: it meant that making transformers better was largely a matter of making them bigger and training them on more data. The architecture was not the bottleneck — scale was. And transformers were far better positioned to scale than any previous architecture.
Modern Developments
Sparse Attention and Efficient Transformers
Standard self-attention has quadratic complexity in sequence length: computing all pairwise attention scores for a sequence of length N requires N² operations. For long sequences, this becomes prohibitively expensive. Numerous efficient attention variants have been developed to address this, including sparse attention (computing attention only between subsets of positions), linear attention (approximating attention with linear complexity), and sliding window attention (each position attends only to a local window of neighbors).
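A sliding-window mask can be built in a few lines. This is a sketch of the idea, not any specific model's implementation:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Each position attends only to itself and the `window` previous positions,
    cutting attention cost from O(n^2) to O(n * window)."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j >= i - window)   # causal + local window

mask = sliding_window_mask(n=6, window=2)
# position 5 may attend to positions 3, 4, and 5 only
```

Stacking layers restores long-range reach: information can hop window-by-window across layers even though each individual layer only sees local context.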
Rotary Position Encoding
The original paper used fixed sinusoidal positional encodings added to token embeddings. More recent architectures use Rotary Position Encoding (RoPE), which incorporates position information directly into the attention computation by rotating the query and key vectors. RoPE integrates more naturally with the attention mechanism and generalizes better to sequence lengths longer than those seen during training.
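A minimal sketch of the rotation, assuming the common pairing of consecutive dimensions (implementations vary in how they pair dimensions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles
    theta_i = pos / base^(2i/d). Applied to queries and keys, the resulting
    dot product depends only on the relative distance between positions."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos / base ** (2 * i / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]            # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin      # standard 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
q0 = rope(q, pos=0)   # position 0 rotates by angle 0: identity
```

Because rotation preserves vector norms and the query-key dot product depends only on the angle difference, two tokens attend to each other the same way whether they sit at positions (1, 3) or (101, 103).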
Mixture of Experts
Mixture-of-Experts (MoE) architectures replace the standard feed-forward layer with multiple expert networks, selecting a subset of experts for each token rather than using all of them. This allows models to have more total parameters without a proportional increase in compute, since only a fraction of parameters are active for any given input.
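A toy sketch of top-k routing, with fixed linear maps standing in for the expert feed-forward networks (all names and shapes here are illustrative):

```python
import numpy as np

def moe_layer(x, gate_W, experts, top_k=2):
    """Route token vector x to the top_k highest-scoring experts and combine
    their outputs, weighted by normalized gate scores. Only top_k of the
    len(experts) networks run, so active compute stays small."""
    logits = x @ gate_W                           # one gate score per expert
    chosen = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                      # softmax over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

rng = np.random.default_rng(2)
d, n_experts = 8, 4
gate_W = rng.normal(size=(d, n_experts))
# each "expert" is a fixed linear map standing in for a feed-forward network
expert_mats = rng.normal(size=(n_experts, d, d))
experts = [lambda x, M=M: x @ M for M in expert_mats]
x = rng.normal(size=d)
y = moe_layer(x, gate_W, experts, top_k=2)
```

With 4 experts and top_k=2, this layer holds 4 experts' worth of parameters but spends only 2 experts' worth of compute per token — the core MoE trade.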
For related concepts, see what is a neural network, large language models explained, and AI training explained.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog. https://openai.com/research/language-unsupervised
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. https://arxiv.org/abs/2001.08361
- Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2005.14165
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473. https://arxiv.org/abs/1409.0473
Frequently Asked Questions
What is a transformer in AI?
A transformer is a neural network architecture designed to process sequential data — text, audio, images — by computing relationships between all elements of the input simultaneously rather than one element at a time. Introduced in 2017 in the paper 'Attention Is All You Need' by Vaswani et al., transformers became the foundation for virtually all major AI language models including GPT, BERT, and their successors.
What is self-attention in a transformer?
Self-attention is the mechanism by which a transformer computes relationships between every element of its input and every other element simultaneously. For a sentence, self-attention allows the model to determine how much each word relates to every other word when processing any given word — capturing long-range dependencies that previous architectures handled poorly. The output of self-attention is a weighted representation of the input that encodes these relationships.
Why did transformers replace recurrent neural networks?
Recurrent neural networks (RNNs) processed sequences one element at a time, passing information forward through a hidden state. This created a bottleneck: information from early in a long sequence had to pass through many processing steps to influence later outputs, which made learning long-range dependencies difficult. Transformers process all positions simultaneously and compute relationships between any two positions directly, regardless of their distance in the sequence.
What is the difference between encoder and decoder transformers?
Encoder transformers (like BERT) process the full input bidirectionally and are designed for tasks that require understanding: classification, question answering, named entity recognition. Decoder transformers (like GPT) process input left-to-right and are designed for generation: producing text one token at a time, with each token able to attend only to previous tokens. Encoder-decoder transformers (like the original architecture and T5) combine both, with the encoder processing input and the decoder generating output.
What is positional encoding in transformers?
Because transformers process all input positions simultaneously rather than sequentially, they have no inherent sense of order. Positional encoding is added to the input embeddings to give the model information about where each token appears in the sequence. The original paper used sinusoidal functions of different frequencies to represent positions. More recent architectures use learned positional embeddings or rotary position encoding (RoPE) that integrates position information into the attention computation directly.
What is multi-head attention?
Multi-head attention runs multiple attention computations in parallel, each learning to attend to different types of relationships. One head might capture syntactic dependencies, another might capture semantic relationships, a third might capture positional patterns. The outputs of all heads are concatenated and linearly transformed. Multiple heads allow the model to simultaneously represent different types of relationships between input elements, increasing representational richness.
How large are modern transformer models?
The original transformer paper described a model with 65 million parameters. GPT-2 (2019) had 1.5 billion parameters. GPT-3 (2020) had 175 billion parameters. Modern frontier models are estimated to have hundreds of billions to over a trillion parameters, with exact counts often not disclosed publicly. Model size correlates with capability, though the relationship is not linear and architecture, training data, and fine-tuning matter as much as raw parameter count.