On a summer afternoon in 1948, the Bell System Technical Journal published a 77-page paper by a 32-year-old engineer named Claude Shannon. It was titled, with minimal fanfare, "A Mathematical Theory of Communication," and it answered questions that had seemed either unanswerable or not quite precise enough to be worth answering. How much information is in a message? How efficiently can any source be encoded? What is the maximum rate at which information can be reliably sent through a noisy channel? Shannon gave exact answers to all three questions, in the form of theorems with rigorous proofs. The paper is now routinely described as one of the most important scientific publications of the twentieth century.
The achievement came from an act of principled abstraction. Shannon stripped information down to its mathematical skeleton by ignoring meaning entirely. A message's information content, in his framework, depends solely on how surprising it is -- how much it reduces the receiver's uncertainty -- not on what it says or why it matters. This move seemed to leave out the most important thing, and critics said so. But the abstraction was exactly right for the engineering problem Shannon was solving, and it turned out to apply far beyond engineering: to biology, cryptography, statistics, neuroscience, and the theory of computation.
Shannon's paper arrived simultaneously with Norbert Wiener's Cybernetics (1948), a different but related treatment of information, communication, and control in machines and living systems, and within four years of Erwin Schrödinger's What Is Life? (1944), which had proposed that chromosomes carry hereditary information in a chemical code. The convergence was not coincidental. The mid-twentieth century was the moment when the concept of information became scientific -- not a vague intuition but a measurable, manipulable, physically real quantity with precise mathematical properties.
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point." -- Claude Shannon, 1948
Key Definitions
Bit: The fundamental unit of information, representing the answer to one binary question (yes or no, 0 or 1); a fair coin flip contains exactly 1 bit of information.
Shannon entropy: H(X) = -sum p(x) log2 p(x); the average information content (or equivalently, the average uncertainty) of a random variable; measures the minimum average bits per symbol needed to encode a source.
Channel capacity: The maximum rate at which information can be transmitted reliably through a noisy channel, measured in bits per second; for a channel with bandwidth B and signal-to-noise ratio S/N, Shannon's formula gives C = B log2(1 + S/N).
Mutual information: I(X;Y), the reduction in uncertainty about one variable given knowledge of another; channel capacity equals maximum mutual information between input and output.
Kolmogorov complexity: The length of the shortest computer program that produces a given string; a measure of the intrinsic complexity of individual objects rather than probability distributions.
Key Concepts at a Glance
| Concept | Definition | Formula | Significance |
|---|---|---|---|
| Shannon entropy | Average information content of a source | H = -sum p(x) log2 p(x) | Measures compressibility |
| Bit | Unit of information; answer to one binary question | 1 bit = log2(2) | Foundation of digital computing |
| Channel capacity | Maximum reliable transmission rate | C = B log2(1 + S/N) | Theoretical limit for all communication |
| Mutual information | Reduction in uncertainty about X given Y | I(X;Y) = H(X) - H(X\|Y) | Determines channel capacity |
| Kolmogorov complexity | Length of shortest program producing a string | K(x) | Defines intrinsic randomness |
| Source coding theorem | Compression cannot beat entropy | Average code length >= H | Limits on data compression |
Claude Shannon and the Birth of Information Theory
Shannon's Background
Claude Shannon grew up in Gaylord, Michigan, and showed early interest in building mechanical devices. He studied mathematics and electrical engineering at the University of Michigan, then pursued graduate work at MIT, where his 1937 master's thesis demonstrated that Boolean algebra could be applied to the analysis of electrical switching circuits -- an insight that became the theoretical foundation of digital circuit design.
During World War II Shannon worked at Bell Labs on cryptographic systems, experience that profoundly shaped his later work. Cryptography requires transmitting messages that are deliberately opaque to adversaries; the designer must think carefully about what makes messages informative or uninformative, predictable or random. Shannon published "Communication Theory of Secrecy Systems" in 1949, applying the same mathematical framework to cryptography and proving that perfect secrecy requires a key with at least as much entropy as the message itself -- the condition met by the one-time pad, whose truly random key is as long as the message and used only once.
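To make the mechanics concrete, here is a minimal Python sketch of a one-time pad: the ciphertext is the bitwise XOR of the message with a truly random key of the same length, and XORing with the same key recovers the message. The example message and the use of the standard secrets module are illustrative choices, not details from Shannon's paper.

```python
import secrets

def otp_xor(data: bytes, key: bytes) -> bytes:
    """XOR each byte of the data with the corresponding key byte."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be exactly as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"the bridge is safe"
key = secrets.token_bytes(len(message))   # uniformly random, used only once

ciphertext = otp_xor(message, key)
recovered = otp_xor(ciphertext, key)      # applying the same XOR inverts the encryption
assert recovered == message

# Without the key, every plaintext of this length is equally consistent with the
# ciphertext -- which is exactly Shannon's perfect-secrecy condition.
```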
The Bit and the Measure of Surprise
Shannon's conceptual breakthrough was the realization that information can be quantified by its reduction of uncertainty. If a message tells you something you already knew with certainty, it contains no information. If it tells you something completely unexpected among many equally likely alternatives, it contains a great deal.
Specifically, the information content of a message with probability p is log2(1/p) = -log2(p) bits. An event with probability 1/2 carries 1 bit; an event with probability 1/8 carries 3 bits. The expected value of this quantity over all possible messages is Shannon entropy: H = -sum p(x) log2 p(x). English text has entropy well below 8 bits per character because letters are not equally probable and sequences are highly constrained by grammar and vocabulary; Shannon estimated it at around 1 bit per character, implying that written English is highly redundant and could in principle be compressed by a factor of roughly 8 without loss.
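These definitions translate directly into a few lines of Python; the sketch below simply reproduces the figures quoted above (the function names are ours, chosen for illustration).

```python
import math

def self_information(p: float) -> float:
    """Information content of an event with probability p, in bits: -log2(p)."""
    return -math.log2(p)

def shannon_entropy(probs) -> float:
    """H(X) = -sum p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(self_information(0.5))           # 1.0 bit  (a fair coin flip)
print(self_information(1 / 8))         # 3.0 bits (one of eight equally likely outcomes)
print(shannon_entropy([0.5, 0.5]))     # 1.0 bit  (fair coin)
print(shannon_entropy([1 / 6] * 6))    # ~2.585 bits (fair six-sided die)
print(shannon_entropy([1.0]))          # 0.0 bits (a certain outcome)
```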
Entropy and Its Physical Meaning
Maximum Entropy and Thermodynamics
Shannon entropy achieves its maximum value log2 n when all n outcomes are equally probable -- the uniform distribution -- and equals zero when one outcome is certain. This behavior mirrors thermodynamic entropy: a system at maximum entropy (thermodynamic equilibrium) has all microstates equally probable; a system at minimum entropy is in a single known state.
The formal resemblance between Shannon entropy and Boltzmann's entropy S = k ln W (where W is the number of accessible microstates) is genuine. They are the same mathematical object measured in different units. But the physical connection goes beyond formal similarity: information has thermodynamic consequences.
Maxwell's Demon and Landauer's Principle
James Clerk Maxwell proposed a thought experiment in 1867: a tiny demon guards a small door between two gas chambers at equal temperature and pressure. It observes the velocities of individual molecules and opens the door selectively, allowing fast molecules to accumulate in one chamber and slow molecules in the other, apparently creating a temperature difference without doing work, in violation of the second law of thermodynamics.
Leo Szilard analyzed the demon carefully in 1929 and argued that the demon's act of acquiring information about each molecule must have a thermodynamic cost that exactly compensates for the entropy reduction it achieves. Rolf Landauer (1961) made this precise with what is now called Landauer's principle: any logically irreversible operation on information -- specifically, erasing one bit -- must dissipate at least kT ln2 of energy as heat, where k is Boltzmann's constant and T is the temperature. This is not an engineering limitation but a fundamental physical law. Charles Bennett (1982) later showed that computation itself need not dissipate energy -- reversible computing is possible -- but erasure, the operation needed to reset a computer's memory, cannot be avoided and has an unavoidable thermodynamic cost.
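The bound is easy to evaluate numerically. The short sketch below assumes a room temperature of 300 K (an illustrative choice, not a value from the text); at that temperature the minimum heat per erased bit comes out to roughly 3 x 10^-21 joules.

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant in J/K (exact since the 2019 SI redefinition)
T = 300.0            # assumed room temperature, in kelvin

landauer_bound = k_B * T * math.log(2)   # minimum heat dissipated per erased bit, in joules
print(f"{landauer_bound:.3e} J per erased bit")   # ~2.87e-21 J
```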
Landauer's principle has been confirmed experimentally. In 2012, researchers at the Ecole Normale Superieure in Lyon directly measured the heat dissipated when a single bit is erased, obtaining values consistent with the theoretical minimum. Information is a physical quantity, not merely a mathematical abstraction.
Source Coding: Compression and Kolmogorov Complexity
Shannon's Source Coding Theorem
The source coding theorem establishes the fundamental limit on lossless compression. A source with entropy H bits per symbol cannot be losslessly encoded in fewer than H bits per symbol on average, and suitable codes can approach this bound arbitrarily closely. Compression below the entropy rate is possible only with loss of information.
This theorem sets an absolute ceiling on how much lossless compression is achievable, regardless of algorithm design or computational power. Engineers cannot beat Shannon; they can only approach his bound with better codes.
Huffman Coding and Practical Compression
David Huffman discovered his optimal prefix-free coding algorithm in 1952 while a graduate student at MIT, reportedly as an alternative to taking a final exam. The algorithm constructs a binary tree by repeatedly combining the two least probable symbols into a single node, assigning shorter binary codewords to more probable symbols. In English text, where 'e' is far more common than 'q', a Huffman code assigns 'e' a short codeword and 'q' a long one, reducing the average codeword length close to the entropy.
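A compact Python sketch of the construction follows, using a heap to repeatedly merge the two least probable nodes; the sample string and the resulting symbol probabilities are illustrative only. The printed average codeword length will sit at or above the entropy, as the source coding theorem requires.

```python
import heapq
import math
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free code by repeatedly merging the two least probable nodes."""
    # Heap entries: (probability, unique tie-breaker, {symbol: codeword built so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, low = heapq.heappop(heap)     # least probable subtree
        p2, _, high = heapq.heappop(heap)    # second least probable subtree
        merged = {s: "0" + c for s, c in low.items()}      # prepend 0 to one subtree
        merged.update({s: "1" + c for s, c in high.items()})  # prepend 1 to the other
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

sample = "information theory treats messages as sequences of symbols"
freqs = {ch: n / len(sample) for ch, n in Counter(sample).items()}
code = huffman_code(freqs)

avg_bits = sum(freqs[ch] * len(code[ch]) for ch in freqs)
entropy = -sum(p * math.log2(p) for p in freqs.values())
print(f"entropy = {entropy:.3f} bits/symbol, Huffman average = {avg_bits:.3f} bits/symbol")
```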
Practical compression formats such as ZIP, PNG, and PDF use variants of DEFLATE, which combines Huffman coding with Lempel-Ziv (LZ77) dictionary compression -- a scheme that replaces repeated substrings with short references to their earlier occurrences. JPEG achieves much higher compression by first transforming image blocks into frequency components using the discrete cosine transform, quantizing the high-frequency components (discarding spatial detail below perceptual thresholds), and then applying Huffman or arithmetic coding to the result. MP3 applies psychoacoustic models that identify audio components masked by louder sounds and allocates fewer bits to perceptually unimportant components.
Kolmogorov Complexity
Shannon entropy measures the average information content of a probability distribution. Kolmogorov complexity, developed independently by Andrei Kolmogorov, Ray Solomonoff, and Gregory Chaitin in the 1960s, measures the intrinsic complexity of individual objects: the length of the shortest program, running on a universal Turing machine, that outputs that object and halts.
A highly regular string -- "aaaaaaa..." for a million characters -- has low Kolmogorov complexity because a short program (print 'a' one million times) generates it. A truly random string has high Kolmogorov complexity close to its own length because no program shorter than the string itself can generate it. This formalizes the intuition that randomness and incompressibility are the same thing.
Kolmogorov complexity is uncomputable: there is no general algorithm that, given a string, returns its Kolmogorov complexity. This is a profound limitation with implications for the philosophy of science, model selection, and the theory of induction. Minimum description length (MDL) inference, developed by Jorma Rissanen, approximates Kolmogorov complexity with computable statistics to provide a principled framework for model selection: the best model is the one that minimizes the total description length of the model plus the data given the model.
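In the same spirit, the length of a compressed file gives a crude, computable upper bound on a string's description length, which is enough to illustrate the regular-versus-random contrast above. The sketch below uses Python's zlib as the stand-in compressor; the choice of compressor and the exact sizes printed are incidental.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length of a zlib-compressed representation: a rough, computable upper bound
    on a description of the data (not Kolmogorov complexity itself)."""
    return len(zlib.compress(data, 9))

regular = b"a" * 1_000_000           # highly regular: a short program could print it
random_ = os.urandom(1_000_000)      # incompressible with overwhelming probability

print(compressed_size(regular))      # on the order of a kilobyte
print(compressed_size(random_))      # slightly larger than the input itself
```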
Channel Coding: Noisy Channels and Error Correction
The Noisy Channel Coding Theorem
Shannon's most counterintuitive result is the noisy channel coding theorem: every noisy channel has a capacity C such that reliable communication -- with error probability approaching zero -- is achievable at any rate below C, and impossible at any rate above C. The theorem seems to defy intuition: surely any noise, no matter how small, will eventually corrupt communication at any positive rate? Shannon showed that by adding carefully designed redundancy, errors can be corrected faster than they accumulate, making the net error rate approach zero.
For a channel with bandwidth B hertz and signal-to-noise ratio S/N, the Shannon-Hartley formula gives capacity C = B log2(1 + S/N) bits per second. Increasing bandwidth increases capacity linearly; increasing signal-to-noise ratio increases it logarithmically. This formula has shaped the design of every communication system from telephone modems to 5G cellular networks.
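The formula is simple enough to evaluate directly. In the Python sketch below, the bandwidth and signal-to-noise figures (20 MHz, 20 dB) are illustrative assumptions rather than values from any particular standard; the point is the linear-versus-logarithmic behavior.

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity C = B log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

snr_db = 20.0
snr_linear = 10 ** (snr_db / 10)      # 20 dB corresponds to a power ratio of 100

print(f"{shannon_capacity(20e6, snr_linear) / 1e6:.1f} Mbit/s")       # ~133 Mbit/s
# Doubling bandwidth doubles capacity; doubling SNR adds only about 1 bit/s per hertz.
print(f"{shannon_capacity(40e6, snr_linear) / 1e6:.1f} Mbit/s")       # ~266 Mbit/s
print(f"{shannon_capacity(20e6, 2 * snr_linear) / 1e6:.1f} Mbit/s")   # ~153 Mbit/s
```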
Error-Correcting Codes
Richard Hamming, working at Bell Labs in the late 1940s and frustrated by errors in the relay-based computing machinery, invented the first systematic error-correcting codes. Hamming codes add parity bits at positions that are powers of 2 in the codeword; the pattern of parity failures identifies the location of a single-bit error, which can then be corrected. The [7,4] Hamming code encodes 4 data bits in 7 bits, correcting any single-bit error.
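A minimal Python sketch of the [7,4] code: the parity bits at positions 1, 2, and 4 each cover the positions whose binary index includes that power of two, and the pattern of failed checks (the syndrome) is the binary address of the corrupted bit. The specific data word and the corrupted position below are arbitrary illustrative choices.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword; parity bits sit at positions 1, 2, 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4              # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(received):
    """Recompute the parity checks; the failure pattern gives the error position."""
    c = list(received)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # check over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]        # check over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4       # 0 means no single-bit error detected
    if syndrome:
        c[syndrome - 1] ^= 1              # flip the bit the syndrome points to
    return c

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
received = list(codeword)
received[5] ^= 1                          # corrupt one bit (position 6)
assert hamming74_correct(received) == codeword
```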
Reed-Solomon codes (Reed and Solomon, 1960) operate over finite fields and can correct bursts of errors rather than just isolated single-bit errors. Their resistance to burst errors makes them ideal for physical media: every CD and DVD uses Reed-Solomon coding, allowing scratched discs to play cleanly. Deep-space probes use Reed-Solomon codes to recover signals from billions of kilometers away.
Claude Berrou and his colleagues announced turbo codes at the 1993 IEEE International Conference on Communications, demonstrating codes that approached the Shannon limit to within 0.5 decibels. The presentation was initially met with disbelief; when the results were replicated and the algorithm dissected, it became clear that turbo codes used iterative decoding -- passing belief estimates back and forth between component decoders -- to achieve performance that had seemed unattainable. Robert Gallager had invented low-density parity-check (LDPC) codes in his MIT doctoral work, published in 1962; they were largely forgotten until David MacKay and Radford Neal rediscovered and promoted them in the 1990s. LDPC codes also achieve near-Shannon performance via iterative decoding and are now deployed in Wi-Fi (IEEE 802.11), digital video broadcasting, and 5G.
Mutual Information, Inference, and Machine Learning
Mutual Information and the Data Processing Inequality
Mutual information I(X;Y) = H(X) - H(X|Y) measures the degree to which two variables are statistically dependent, in information-theoretic units. Unlike correlation, which captures only linear dependence, mutual information captures all statistical dependence and is zero if and only if X and Y are independent.
The data processing inequality states that for any Markov chain X -> Y -> Z (where Z depends on X only through Y), I(X;Z) <= I(X;Y). No manipulation of Y can increase the information it carries about X. This principle implies that transformations cannot create information; they can only preserve or discard it, providing theoretical grounding for the intuition that feature engineering and dimensionality reduction cannot exceed the information in the raw data.
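Both ideas can be checked numerically on a toy example. The Python sketch below computes mutual information from a joint distribution and verifies the data processing inequality for a fair bit passed through two binary symmetric channels in sequence; the 10% flip probabilities are illustrative assumptions.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint pmf given as a 2D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(pxy)

def bsc(p_flip):
    """Transition matrix of a binary symmetric channel with flip probability p_flip."""
    return [[1 - p_flip, p_flip], [p_flip, 1 - p_flip]]

# Toy Markov chain X -> Y -> Z: X is a fair bit, each arrow is a 10%-noise channel.
px = [0.5, 0.5]
ch1, ch2 = bsc(0.1), bsc(0.1)

joint_xy = [[px[x] * ch1[x][y] for y in range(2)] for x in range(2)]
joint_xz = [[sum(px[x] * ch1[x][y] * ch2[y][z] for y in range(2)) for z in range(2)]
            for x in range(2)]

ixy = mutual_information(joint_xy)   # ~0.531 bits
ixz = mutual_information(joint_xz)   # ~0.320 bits
print(f"I(X;Y) = {ixy:.3f} bits, I(X;Z) = {ixz:.3f} bits")
assert ixz <= ixy + 1e-12            # the data processing inequality holds
```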
Maximum Entropy and Bayesian Inference
E.T. Jaynes (1957) proposed maximum entropy as a general principle of probabilistic inference: given constraints on a distribution (such as known mean and variance), the distribution that maximizes Shannon entropy subject to those constraints is the most honest or least biased choice, incorporating exactly the constraints and nothing else. Jaynes used this principle to rederive statistical mechanics from information-theoretic foundations, showing that thermodynamic equilibrium distributions emerge as the maximum-entropy distributions subject to macroscopic constraints such as total energy.
The principle connects information theory to Bayesian inference: a uniform prior over unknown parameters is the maximum-entropy prior given no information, and Bayes' theorem provides the procedure for updating it as evidence accumulates. KL divergence D(P||Q) = sum P(x) log(P(x)/Q(x)) measures the information lost when Q is used to approximate P; it is the asymmetric distance between distributions that appears throughout Bayesian statistics, variational inference, and machine learning.
Cross-Entropy and Machine Learning
In supervised machine learning, cross-entropy H(P,Q) = -sum P(x) log Q(x) is the most common training objective for classification tasks. When P is the empirical distribution of true labels and Q is the model's predicted distribution, minimizing cross-entropy is equivalent to maximum likelihood estimation: finding the model parameters that make the observed labels most probable. KL divergence between P and Q equals cross-entropy minus entropy of P; since entropy of the true distribution is constant, minimizing cross-entropy and minimizing KL divergence are the same optimization.
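The identity behind that equivalence -- cross-entropy equals the entropy of the true distribution plus the KL divergence -- can be verified directly. The distributions in the Python sketch below are made-up illustrative values.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log2 Q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum P(x) log2(P(x) / Q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.7, 0.2, 0.1]      # illustrative "true label" distribution
q_model = [0.6, 0.3, 0.1]     # illustrative model prediction

ce, h, kl = cross_entropy(p_true, q_model), entropy(p_true), kl_divergence(p_true, q_model)
print(f"H(P,Q) = {ce:.4f}, H(P) = {h:.4f}, D(P||Q) = {kl:.4f}")
assert abs(ce - (h + kl)) < 1e-12   # minimizing cross-entropy = minimizing KL divergence
```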
Variational autoencoders (VAEs), introduced by Kingma and Welling in 2013, use a variational lower bound on the log-likelihood that decomposes into a reconstruction term (cross-entropy or mean squared error) and a KL divergence regularization term, directly encoding the information-theoretic trade-off between fidelity and compression.
Applications Across Disciplines
DNA as Information
Schrödinger's 1944 prediction that heredity is implemented in a chemical code was confirmed when Watson and Crick determined the double-helical structure of DNA in 1953. The genetic code maps 64 nucleotide triplets (codons) to 20 amino acids plus stop signals; the redundancy of the code provides some error correction. The human genome's approximately 3.2 billion base pairs, with four possible values per position, contain approximately 6.4 billion bits, or roughly 800 megabytes, before accounting for the highly repetitive structure of the genome.
George Church, Yuan Gao, and Sriram Kosuri demonstrated in 2012 (Science) that oligonucleotide synthesis could encode a digital book -- a 53,426-word text plus illustrations -- at a density of approximately 700 terabytes per gram, exploiting the extraordinary information density of DNA. Subsequent work by Nick Goldman and colleagues at the European Bioinformatics Institute (2013) improved error correction and retrieval accuracy, bringing DNA data storage closer to practical deployment.
Cryptography
Shannon's 1949 paper "Communication Theory of Secrecy Systems" provided the first rigorous theoretical framework for cryptographic security. Shannon proved that a cipher can provide perfect secrecy -- an eavesdropper learns nothing about the plaintext from the ciphertext -- only if the entropy of the key is at least as great as the entropy of the plaintext. In effect, any perfectly secure cipher needs a one-time pad's worth of key: a truly random key as long as the message, used only once. Modern cryptographic systems, including AES and RSA, achieve computational security rather than information-theoretic security: breaking them is computationally infeasible but theoretically possible with unlimited resources.
Telecommunications
Shannon's channel capacity formula directly governs the design of wireless communication systems. The 4G LTE standard, deployed from 2010, uses OFDM (orthogonal frequency-division multiplexing), adaptive modulation, and turbo codes to approach Shannon capacity under varying channel conditions. The 5G NR (New Radio) standard goes further with massive MIMO (multiple-input multiple-output) antenna arrays, exploiting spatial degrees of freedom to multiply capacity by the number of antenna elements, and approaches the Shannon limit still more closely with LDPC codes on data channels and polar codes (Arikan, 2009) on control channels.
Legacy and Limits
Shannon himself was notoriously modest about the scope of his theory. In a 1956 editorial in the IRE Transactions on Information Theory, titled "The Bandwagon", he cautioned against applying information theory to problems in biology, psychology, and social science where the mathematical conditions for its applicability might not hold. The warning was prophetic: information-theoretic vocabulary was widely adopted in mid-century social science, sometimes productively and sometimes as a rhetorical flourish disconnected from the mathematics.
The theory is deliberately and necessarily silent on meaning. Bar-Hillel and Carnap's semantic information theory (1952) attempted to extend Shannon's framework to capture the semantic content of logical sentences, defining information as the set of possible worlds excluded by a sentence. The project illuminated some problems but did not achieve the generality or the technical payoff of Shannon's syntactic theory. The question of how meaning arises from syntactic information processing remains one of the deepest in cognitive science and the philosophy of mind.
Information overload -- the condition in which the supply of information exceeds the cognitive capacity to process it -- is a social and organizational problem that Shannon's theory identifies but cannot cure. The theory sets a bound on how much information a channel can transmit; it says nothing about how much the receiver can use. As communication technology has relentlessly approached and extended Shannon's channel capacity limits, the binding constraint on information use has shifted from transmission to attention, from bandwidth to cognition.
References
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Shannon, C. E. (1949). Communication theory of secrecy systems. Bell System Technical Journal, 28(4), 656-715.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098-1101.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147-160.
Landauer, R. (1961). Irreversibility and heat generation in the computing process. IBM Journal of Research and Development, 5(3), 183-191.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620-630.
Kolmogorov, A. N. (1968). Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1-4), 157-168.
Church, G. M., Gao, Y., & Kosuri, S. (2012). Next-generation digital information storage in DNA. Science, 337(6102), 1628.
Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo-codes. Proceedings of the IEEE International Conference on Communications, 1064-1070.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory Communication (pp. 217-234). MIT Press.
Bar-Hillel, Y., & Carnap, R. (1952). An outline of a theory of semantic information. MIT Research Laboratory of Electronics Technical Report No. 247.
Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. MIT Press.
Frequently Asked Questions
What is information theory and who founded it?
Information theory is the mathematical framework for quantifying, encoding, storing, and transmitting information. It provides rigorous answers to questions such as: how much information is contained in a message, how compactly can a source be encoded without losing data, and how fast can information be transmitted reliably over a noisy channel?
The field was founded by Claude Shannon, an engineer and mathematician at Bell Telephone Laboratories, with the publication of 'A Mathematical Theory of Communication' in the Bell System Technical Journal in 1948. The paper, later reprinted as a book with a preface by Warren Weaver, is widely regarded as one of the most important scientific papers of the twentieth century. Shannon brought unusual preparation to the problem: he had studied mathematics and electrical engineering at the University of Michigan, pursued graduate work at MIT, and worked on cryptographic systems for the US military during World War II, an experience that sharpened his thinking about information and secrecy.
Shannon's key innovation was to define information quantitatively, independently of meaning. A message's information content is determined solely by how surprising it is -- how much it reduces uncertainty -- not by what it is about. This abstraction allowed him to prove fundamental theorems about the limits of communication systems that apply universally, regardless of whether the messages are text, images, voice, or data.
What is Shannon entropy and why is it called entropy?
Shannon entropy, denoted H, measures the average uncertainty or information content of a random variable. For a discrete random variable X with possible values x1, x2, ..., xn and probabilities p(x1), p(x2), ..., p(xn), Shannon entropy is defined as H(X) = -sum of p(xi) log2 p(xi) over all i. The logarithm base 2 gives the result in bits. A fair coin has entropy of 1 bit; a fair six-sided die has entropy of approximately 2.58 bits; a completely predictable outcome has entropy of 0.
The name 'entropy' comes from a famous exchange. When Shannon asked the mathematician John von Neumann what to call his measure of uncertainty, von Neumann reportedly suggested 'entropy' because the formula resembles the entropy expression in statistical mechanics and because 'no one knows what entropy really is, so in a debate you will always have the advantage.' The mathematical resemblance is genuine: the Gibbs entropy S = -k sum p ln p of statistical mechanics and Shannon's H = -sum p log p differ only by a constant factor.
The connection to thermodynamics is deeper than formal resemblance. Maxwell's demon, a thought experiment from 1867 in which a tiny observer sorts fast and slow molecules to create a temperature gradient without doing work, seemed to violate the second law. Leo Szilard (1929) resolved the paradox by arguing that acquiring information about the molecules has a thermodynamic cost. Rolf Landauer (1961) made this precise: erasing one bit of information dissipates at least kT ln2 of energy as heat, where k is Boltzmann's constant and T is temperature. Information is not merely analogous to a physical quantity; it is a physical quantity.
What is the source coding theorem and how does data compression work?
Shannon's source coding theorem, also called the noiseless channel coding theorem, establishes a fundamental limit on lossless data compression. It states that any source of information can be compressed to its entropy rate -- the average number of bits per symbol needed to represent it -- but no further without loss. If a source emits symbols with entropy H bits per symbol, any lossless encoding requires on average at least H bits per symbol, and there exist codes that approach this bound.
Huffman coding, developed by David Huffman in 1952 while a graduate student at MIT, constructs an optimal prefix-free code for a source with known symbol probabilities by assigning shorter codewords to more probable symbols. The procedure builds a binary tree from the bottom up, repeatedly merging the two least probable symbols into a single node. Huffman coding achieves entropy within one bit per symbol and is used in many compression systems.
Lossless compression formats such as ZIP and PNG are built on the DEFLATE algorithm, which combines Huffman coding with Lempel-Ziv (LZ77) dictionary coding to exploit repeated patterns in the source. Lossy compression formats such as JPEG and MP3 go further by discarding perceptually unimportant information -- high-frequency spatial detail in images, audio components masked by louder tones -- before applying lossless coding to the residual.
Kolmogorov complexity, developed independently by Andrei Kolmogorov, Ray Solomonoff, and Gregory Chaitin in the 1960s, provides a different and deeper measure of information: the length of the shortest program that produces a given string. Unlike Shannon entropy, Kolmogorov complexity applies to individual strings rather than probability distributions, and it is uncomputable in general -- a fundamental limitation with important implications for the theory of randomness and scientific modeling.
What is channel capacity and how do error-correcting codes work?
Shannon's channel capacity theorem, the noisy channel coding theorem, is arguably the most surprising result in information theory. It states that every noisy communication channel has a capacity C -- a maximum rate of information transmission in bits per second -- such that reliable communication (with arbitrarily low error probability) is possible at any rate below C and impossible at any rate above C. The theorem is existential: it proves that good codes exist without constructing them, and its proof that reliable communication is possible at nonzero rates despite noise was counterintuitive to many of Shannon's contemporaries.
For a channel with bandwidth B and signal-to-noise ratio S/N, the Shannon-Hartley formula gives C = B log2(1 + S/N). This formula has guided wireless communications engineering for seven decades.
Error-correcting codes are the practical realization of Shannon's theorem. Richard Hamming, frustrated by parity-check failures on the Bell Labs relay computer, invented the first efficient error-correcting codes in 1950. Hamming codes can detect and correct single-bit errors in a codeword by adding strategically placed parity bits. Reed-Solomon codes (1960), which operate over finite fields, are used in compact discs, DVDs, QR codes, and deep-space communication; they can correct bursts of errors, making them ideal for physical media with scratches or dropouts. Claude Berrou's turbo codes (1993) approached the Shannon limit to within a fraction of a decibel, a breakthrough that was initially met with skepticism at the 1993 IEEE International Conference on Communications. Low-density parity-check (LDPC) codes, invented by Robert Gallager in 1962 and rediscovered in the 1990s, are now used in Wi-Fi, satellite broadcasting, and 5G cellular networks and also approach the Shannon limit.
What is mutual information and how does it connect to machine learning?
Mutual information I(X;Y) measures how much knowing one variable reduces uncertainty about another. It is defined as I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X), the reduction in entropy of X given knowledge of Y (or equivalently, of Y given knowledge of X). Mutual information is zero when X and Y are independent and equals H(X) = H(Y) when they are perfectly determined by each other. Shannon showed that the channel capacity C is equal to the maximum mutual information between channel input and output, maximized over all possible input distributions.
E.T. Jaynes developed the maximum entropy principle (1957) as a method of statistical inference: given known constraints on a probability distribution (such as the mean), the distribution that maximizes entropy subject to those constraints is the least-biased choice consistent with the constraints. This principle connects information theory to Bayesian inference and statistical mechanics.
In machine learning, information-theoretic concepts are pervasive. Cross-entropy loss -- the most common training objective for classification -- measures the average number of bits needed to encode true labels using the model's predicted distribution; minimizing cross-entropy is equivalent to maximum likelihood estimation. Kullback-Leibler (KL) divergence measures the additional cost of encoding a true distribution using an approximating distribution; it is used in variational autoencoders, reinforcement learning, and Bayesian model comparison. The data processing inequality states that no processing of data can increase mutual information, providing theoretical grounding for feature learning and dimensionality reduction.
How is information theory applied in biology and neuroscience?
Erwin Schrödinger's What Is Life? (1944) proposed that chromosomes carry a 'code-script' for the organism, and that living systems are characterized by their ability to store and use information against the tendency toward thermodynamic disorder. Francis Crick formalized this with the central dogma of molecular biology (1958): information flows from DNA to RNA to protein but not in reverse. The genome can be understood as an information storage system; the human genome contains approximately 3.2 billion base pairs, encoding roughly 800 megabytes of information (with considerable redundancy).
George Church and colleagues demonstrated in 2012 (Science) that DNA could serve as a digital storage medium, encoding a 5.27-megabit book in oligonucleotide sequences at a density of approximately 700 terabytes per gram -- orders of magnitude denser than silicon storage.
In neuroscience, the efficient coding hypothesis proposes that the nervous system encodes sensory information as efficiently as possible, minimizing redundancy. Fred Attneave (1954) and Horace Barlow (1961) independently proposed that the visual system removes statistical redundancies in the natural image environment, a prediction supported by the discovery that retinal ganglion cells and lateral geniculate neurons have center-surround receptive fields that implement approximate decorrelation. Barlow's adaptation of information theory to sensory neuroscience generated a research program that continues to shape computational neuroscience. Neural coding questions -- how information is represented in spike trains, how population codes work, how the brain reads out signals -- are directly framed in information-theoretic terms.
What are the limits of information theory and how does it relate to other sciences?
Shannon's information theory is deliberately silent on meaning: a message that says 'the bridge is safe' and one that says 'the bridge is unsafe' have the same information-theoretic structure if they are equally probable, but their practical significance is completely different. Yehoshua Bar-Hillel and Rudolf Carnap (1952) attempted to develop a theory of semantic information that would capture relevance and meaning, but the project proved technically difficult and has not achieved the influence of Shannon's theory.
Kolmogorov complexity, Fisher information, and Shannon entropy are three distinct mathematical measures that capture different aspects of information. Fisher information, central to classical statistics, measures the information a sample carries about an unknown parameter and underlies the Cramer-Rao bound on estimation accuracy. These measures are related but not equivalent, and different applications call for different tools.
Norbert Wiener's Cybernetics (1948), published the same year as Shannon's paper, developed a parallel framework for thinking about information, communication, and control in biological and mechanical systems. Wiener was interested in feedback and regulation; Shannon was interested in transmission and coding. Their work is complementary, and both influenced the early development of cognitive science and AI.
Information overload -- the condition in which the rate of incoming information exceeds the capacity to process it meaningfully -- is a social and cognitive problem that information theory identifies but cannot solve. Shannon's theorem guarantees reliable transmission over noisy channels up to capacity; it says nothing about the capacity of human attention or the organizational structures needed to make transmitted information useful.