In 2012, a Harvard Business Review article called data scientist "the sexiest job of the 21st century." A decade later, its authors, Thomas Davenport and DJ Patil, revisited the claim and found it had held up reasonably well -- while also identifying a significant problem. Organizations had hired thousands of data scientists, purchased expensive tools, and built data teams. Many had struggled to generate proportionate business value. The issue was not the talent. It was the organizational and data infrastructure that talent needed to do meaningful work.
The gap between the expectation and the reality of data science in organizations illuminates something important about the field: data science is not a technology that you acquire. It is a discipline that requires investment in foundations before sophisticated work becomes possible. Understanding what data science actually is -- what it requires, what it produces, how it differs from adjacent fields, and what the career path looks like -- is increasingly important for professionals at every level of an organization that handles data.
What Is Data Science?
Data science is an interdisciplinary field that uses statistical methods, computational tools, and domain knowledge to extract meaning from data -- particularly complex, large, or unstructured data that cannot be analyzed by simple querying or reporting.
The term was popularized in the early 2000s and gained mainstream traction around 2010 when companies began accumulating large datasets from digital behavior that existing analytical approaches could not process. The canonical activities of data science include:
- Building predictive models -- algorithms that estimate future outcomes based on patterns in historical data (which customers are likely to churn, which transactions are likely fraudulent, what price a property will sell for)
- Conducting causal analyses -- using experimental design and statistical methods to estimate whether an intervention actually caused an outcome
- Developing recommendation systems -- algorithms that suggest relevant items to users based on behavioral patterns
- Performing natural language processing -- extracting structured meaning from text data
- Creating classification systems -- models that assign observations to categories (spam vs. not spam, positive vs. negative sentiment, high-risk vs. low-risk)
What distinguishes data science from simpler data work is the ambiguity and complexity of the questions it addresses. Querying a database to find last month's revenue is data retrieval. Predicting which products a customer will buy next week, and estimating how confident you should be in that prediction, is data science.
The field has grown at a pace that few predicted. The World Economic Forum's "Future of Jobs Report 2023" listed data analysts and scientists as the second-fastest growing role category globally, with an estimated 1 million new data science positions expected to be created by 2027. The IBM Institute for Business Value estimated in 2020 that demand for data scientists and related roles would grow 28% by 2024 -- a projection that was exceeded by actual hiring trends in most major markets.
Data Science vs. Data Analytics vs. Machine Learning Engineering
These three roles overlap substantially in practice and are used inconsistently across organizations. Understanding the distinctions clarifies what different jobs actually require.
Data Analytics
Data analytics is oriented toward description and explanation: what happened, how much, why it happened, how it compares to a baseline. Data analysts query databases, build dashboards, design reports, and communicate business performance to decision-makers.
A data analyst asks: How many users signed up last month? What is our customer acquisition cost by channel? How did retention change after we launched the new onboarding flow?
Typical tools: SQL (the primary skill), business intelligence platforms (Tableau, Looker, Power BI), Excel, basic statistics. The 2023 Stack Overflow Developer Survey found SQL to be the most widely used language among data and analytics professionals, used by 56% of respondents -- a higher penetration rate than Python, JavaScript, or any other language in that segment.
Data Science
Data science is oriented toward prediction and prescription: what will happen, what should we do about it, what will the outcome be. Data scientists build models that make predictions about future or unobserved events and conduct analyses that go beyond description to estimate causal effects.
A data scientist asks: Which users are likely to churn in the next 30 days, so we can intervene? Did our price change cause the increase in conversion, or was it caused by something else that changed at the same time? What price maximizes expected revenue given demand elasticity?
Typical tools: Python (dominant), R, SQL, scikit-learn, PyTorch or TensorFlow, statistical modeling, experimental design.
Machine Learning Engineering
Machine learning engineering is oriented toward production: taking a validated model and deploying it reliably at scale in a real application. ML engineers build the infrastructure, pipelines, and APIs that serve model predictions to users in real time or batch processes.
An ML engineer asks: How do we serve this recommendation model to 10 million users with sub-100ms latency? How do we monitor the model for performance drift? How do we automate the retraining pipeline?
Typical tools: Python, MLflow, Kubernetes, Docker, cloud ML platforms (SageMaker, Vertex AI, Azure ML), software engineering practices.
| Dimension | Data Analytics | Data Science | ML Engineering |
|---|---|---|---|
| Core question | What happened? | What will happen? | How do we deploy at scale? |
| Primary output | Reports, dashboards | Models, insights, experiments | Production ML systems |
| Python required? | Sometimes | Yes | Yes |
| Statistics emphasis | Basic | Substantial | Applied |
| Software engineering | Minimal | Moderate | Strong |
| SQL proficiency | Essential | Essential | Moderate |
| Typical background | Business, economics | Stats, CS, quantitative fields | Software engineering + ML |
In practice, many organizations use these titles interchangeably or combine them into hybrid roles. A small startup might have a single "data scientist" who does all three functions. A large technology company might have hundreds of specialists in each category with minimal overlap.
The Data Science Hierarchy of Needs
Monica Rogati, a former data scientist at LinkedIn, published a widely influential blog post describing what she called the data science hierarchy of needs -- a pyramid illustrating that sophisticated machine learning sits atop a set of foundational capabilities that must be built first.
From bottom to top of the pyramid:
Data collection and storage (foundation): Does the organization reliably collect the data it needs? Does it have appropriate infrastructure to store it? Many organizations try to do data science before they have trustworthy data pipelines. The result is models trained on unreliable data that produce unreliable predictions.
Data movement and processing: Can data be moved, transformed, and joined across systems? Do data pipelines run reliably, fail gracefully, and produce consistent results? Data engineering -- the work of building these pipelines -- is unglamorous but essential.
Exploration and transformation: Can analysts actually access, query, and understand the data? Are there business intelligence tools that let decision-makers explore what is happening? Is there A/B testing infrastructure for measuring causal effects?
Aggregate / label: Are there well-defined metrics? Are there labeled datasets for supervised learning problems? Is data quality sufficient that models trained on it will generalize?
Learning / optimization (top): Machine learning, deep learning, AI. The sophisticated techniques that generate public excitement and inflated expectations.
Rogati's insight is that organizations that skip the foundational layers and jump straight to the top fail systematically. You cannot build a reliable fraud detection model on an unreliable data pipeline. You cannot run valid A/B tests without a rigorous metrics framework. The pyramid has to be built from the bottom.
A McKinsey Global Institute survey (2022) found that among organizations that described themselves as "pursuing AI and data science initiatives," only 18% had achieved what they described as widespread AI adoption generating significant value. The most common barriers cited: data quality (cited by 58% of respondents), lack of talent (47%), and absence of the foundational infrastructure the hierarchy requires.
"Data quality is boring work. It is also, for almost every organization, more impactful than any machine learning model you will ever build on top of it." -- Monica Rogati
The Origins and Evolution of the Field
Data science did not emerge from nowhere. Its intellectual roots extend through multiple disciplines, each contributing essential tools.
Statistics provides the theoretical foundation: probability theory, hypothesis testing, regression analysis, experimental design. Fisher, Neyman, and Pearson established inferential statistics in the 1920s-1940s. Tukey's "exploratory data analysis" (1977) legitimized the use of visualization and informal analysis alongside formal hypothesis testing. Both traditions are essential in modern data science practice.
Computer science contributed the algorithmic foundations: efficient data structures, database theory, information retrieval, and eventually machine learning. The database revolution of the 1970s-1980s, led by Edgar Codd's relational model, created the infrastructure that makes data accessible at scale.
Machine learning as a formal field emerged from the intersection of statistics and computer science in the 1980s-1990s. Key developments -- Breiman's random forests (2001), Friedman's gradient boosting (1999), and the deep learning renaissance triggered by the neural network work of Hinton, LeCun, and Bengio, recognized by the 2018 Turing Award -- created the toolkit that contemporary data science draws on.
The big data era, roughly 2005-2015, shifted the field's focus from algorithmic sophistication to scale and infrastructure. Google's MapReduce paper (Dean and Ghemawat, 2004) and the open-source Hadoop ecosystem that followed enabled processing of datasets that previously could not be analyzed at all. The challenge shifted from "can we compute this?" to "can we store, move, and query the data efficiently?"
The deep learning era, roughly 2012-present, brought neural network approaches to dominance in image recognition, natural language processing, and speech recognition. ImageNet competition results (Krizhevsky et al., 2012) demonstrated that deep convolutional networks dramatically outperformed all previous approaches -- a discontinuity that triggered the current wave of AI investment and hiring.
The Core Skills Stack
Data science competencies are commonly described in terms of three dimensions, originally expressed by Drew Conway's 2010 Venn diagram. The diagram has three circles -- math and statistics knowledge, hacking skills (computational ability), and substantive expertise (domain knowledge) -- with data science at the intersection of all three. The danger zone, Conway noted, was the intersection of math and hacking without domain expertise: "a person who is technically proficient, but who doesn't understand the problem they are solving."
Python
Python is the dominant language in data science. The ecosystem of libraries makes it the practical standard:
- pandas for data manipulation and analysis
- NumPy for numerical computation
- scikit-learn for classical machine learning (regression, classification, clustering, preprocessing)
- matplotlib and seaborn for visualization
- PyTorch and TensorFlow for deep learning
- statsmodels for statistical modeling with inference (p-values, confidence intervals)
- Jupyter notebooks for exploratory analysis and communication
Python proficiency at the data science level means being comfortable with data manipulation, exploratory analysis, model building and evaluation, and communicating results -- not necessarily deep software engineering. The 2023 Kaggle Machine Learning and Data Science Survey (n=23,000+ data professionals) found Python used by 86.8% of respondents, with the next most common language (SQL) at 39.2% and R at 24.6%.
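A minimal sketch of how these libraries fit together in practice -- the data, column names, and churn rule below are entirely synthetic, invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic churn-style dataset (hypothetical schema)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=500),
    "monthly_spend": rng.normal(50, 15, size=500).round(2),
})
# Make churn more likely for short-tenure customers
df["churned"] = (rng.random(500) < 1 / (1 + df["tenure_months"] / 10)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure_months", "monthly_spend"]], df["churned"],
    test_size=0.25, random_state=0)

# Fit a simple model and evaluate it on held-out data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.2f}")
```

The same pattern -- pandas for the data frame, scikit-learn for the split, model, and metric -- underlies a large share of day-to-day data science work.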
SQL
SQL (Structured Query Language) is the foundational skill for accessing data in relational databases, and it is non-negotiable across virtually all data roles. Most organizational data lives in relational databases (PostgreSQL, MySQL, SQL Server, Snowflake, BigQuery, Redshift), and SQL is the primary interface for extracting and preparing it.
Advanced SQL skills -- window functions, CTEs, subqueries, performance optimization -- are frequently tested in data science interviews and genuinely matter in day-to-day work. A common finding in data science hiring assessments is that candidates with strong Python and machine learning knowledge have surprisingly weak SQL skills -- a gap that matters in practice, where most data wrangling happens at the database level before a Python notebook is ever opened.
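A small sketch of the kind of window-function query interviews tend to probe, runnable against Python's bundled SQLite (requires SQLite 3.25 or newer for window-function support); the table and values are made up:

```python
import sqlite3

# In-memory database with a toy orders table (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 20.0), (1, '2024-02-10', 35.0),
        (2, '2024-01-20', 50.0), (2, '2024-03-02', 15.0);
""")

# Window function: each order alongside the customer's running total
rows = conn.execute("""
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()

for row in rows:
    print(row)
```

Unlike a `GROUP BY`, the window function keeps every row while computing the per-customer aggregate alongside it -- the distinction that trips up many candidates.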
Statistics
Core statistical knowledge for data science includes:
- Probability: distributions, conditional probability, Bayes' theorem
- Statistical inference: hypothesis testing, p-values, confidence intervals, Type I and II errors
- Regression: linear and logistic regression, interpretation, assumptions and diagnostics
- Experimental design: A/B testing, randomization, control for confounders, power analysis
- Causal inference: difference-in-differences, regression discontinuity, instrumental variables -- methods for estimating causal effects from observational data
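The experimental-design bullet above can be made concrete with a two-proportion z-test, the workhorse of conversion-rate A/B tests. This standard-library sketch uses invented counts; real analyses would also check power and practical significance:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 12.0% control vs 13.5% treatment conversion
z, p = two_proportion_z_test(conv_a=1200, n_a=10000, conv_b=1350, n_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A statistically significant result here still says nothing about whether a 1.5-point lift is worth shipping -- that judgment is the domain-knowledge part of the job.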
The most dangerous data scientist is one who is technically competent but statistically naive -- someone who can build a model but does not understand when its output is misleading. Andrew Gelman of Columbia University has written extensively on this problem, arguing that the "replication crisis" in social science research (Ioannidis, 2005; Open Science Collaboration, 2015) is partially an applied statistics crisis -- a failure to understand statistical power, multiple comparisons, and the conditions under which p-values are meaningful.
In applied data science contexts, this manifests as "p-hacking" or "data dredging" -- testing many hypotheses until a significant result appears by chance, without appropriate correction for multiple comparisons. Responsible data science practice requires understanding these pitfalls before building or presenting any analysis.
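The multiple-comparisons problem behind p-hacking can be demonstrated directly: run enough tests where the null is true by construction and "significant" results appear by chance. A standard-library simulation sketch:

```python
import random
from math import sqrt, erf

random.seed(42)

def p_value_two_sample(a, b):
    """Approximate two-sided z-test p-value for a difference in means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    z = (mean_a - mean_b) / sqrt(var_a / n + var_b / n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 200 "experiments" in which both groups are drawn from the
# SAME distribution -- every significant result is a false positive.
n_tests, alpha = 200, 0.05
false_positives = 0
for _ in range(n_tests):
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if p_value_two_sample(a, b) < alpha:
        false_positives += 1

print(f"{false_positives}/{n_tests} 'significant' results with no real effect")
# A Bonferroni correction would instead require p < alpha / n_tests
```

At alpha = 0.05 you should expect roughly 10 spurious hits out of 200 -- which is why uncorrected exploratory testing so reliably "finds" effects.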
Machine Learning Fundamentals
Beyond the statistical foundations, data scientists need working knowledge of the core machine learning algorithm families:
- Supervised learning: linear regression, logistic regression, decision trees, random forests, gradient boosting (XGBoost, LightGBM), support vector machines, neural networks
- Unsupervised learning: k-means clustering, hierarchical clustering, dimensionality reduction (PCA, t-SNE, UMAP)
- Evaluation and validation: train/validation/test splits, cross-validation, bias-variance tradeoff, precision/recall/AUC-ROC for classification
- Feature engineering: the often-underestimated craft of transforming raw data into representations that models can use effectively
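The evaluation bullets above can be sketched in a few lines of scikit-learn -- synthetic data, with cross-validation giving a more stable performance estimate than a single train/test split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=4, random_state=0)

# 5-fold cross-validation: each model is trained and scored 5 times
results = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[type(model).__name__] = scores.mean()
    print(f"{type(model).__name__}: mean AUC = {scores.mean():.3f}")
```

Comparing a simple baseline against a more flexible model this way -- rather than trusting one lucky split -- is the habit that separates sound evaluation from wishful thinking.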
The dominant trend in model selection has moved toward gradient boosting methods (XGBoost, LightGBM, CatBoost) for tabular data and toward transformer-based architectures (BERT, GPT variants) for text data. Understanding when to use which approach -- and why -- requires both theoretical grounding and practical experience.
A Day in the Life of a Data Scientist
Data science work does not resemble its public portrayal -- running cutting-edge deep learning algorithms against vast datasets. In most organizations, the actual work distribution looks something like this:
~30-40% data cleaning and preparation: Dealing with missing values, joining imperfect datasets, investigating data quality issues, building the clean dataset that analysis requires. This work is unglamorous and time-consuming. Most practitioners report spending more time on it than on any other activity. The 2016 CrowdFlower (now Figure Eight) Data Science Report found that 60% of a data scientist's time is spent on cleaning and organizing data -- a figure that has remained consistent in subsequent surveys.
~20-30% exploratory analysis: Understanding what is in the data, identifying patterns, generating hypotheses, visualizing distributions and relationships.
~15-20% modeling and analysis: Building and evaluating models, running statistical analyses, conducting experiments.
~15-20% communication: Writing documentation, presenting findings to stakeholders, translating results into recommendations, explaining model behavior.
~5-15% deployment and monitoring (varies enormously by organization): Some data scientists work end-to-end and deploy models to production; others hand off to ML engineers.
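The cleaning-and-preparation work that dominates the time budget above looks mundane in code but decides everything downstream. A small pandas sketch with invented tables and column names:

```python
import numpy as np
import pandas as pd

# Two imperfect sources sharing a key (hypothetical schema)
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "signup_date": ["2024-01-03", "2024-01-07", None, "2024-02-11"],
})
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 5],          # user 5 has no profile row
    "amount": [20.0, np.nan, 35.0, 10.0],
})

# Typical cleaning steps: parse types, handle missing values, join, audit
users["signup_date"] = pd.to_datetime(users["signup_date"])
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

merged = orders.merge(users, on="user_id", how="left", indicator=True)
orphans = (merged["_merge"] == "left_only").sum()
print(f"{orphans} order(s) reference a user with no profile")
```

The `indicator` audit is the important habit: every silent join mismatch left uninvestigated is a future bug in a model's training data.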
The proportion of time spent on glamorous vs. unglamorous work is a reliable indicator of data team maturity. Early-stage teams in organizations without strong data infrastructure spend most of their time cleaning data and building pipelines. Mature teams with strong infrastructure spend proportionally more time on modeling and analysis.
Responsible AI and Data Ethics
The growing deployment of data science and machine learning systems in consequential decisions -- credit scoring, hiring screening, medical diagnosis, criminal sentencing -- has elevated data ethics from a philosophical concern to a practical engineering requirement.
Algorithmic bias occurs when a model produces systematically different outcomes for identifiable groups in ways that are unfair or harmful. The most documented cases:
- Facial recognition: Buolamwini and Gebru's 2018 "Gender Shades" study found error rates 34 percentage points higher for dark-skinned women than for light-skinned men across commercial facial recognition products.
- Credit scoring: automated systems have been shown to replicate historical redlining patterns even without using race as a direct feature.
- Hiring: Amazon discontinued an AI recruitment tool in 2018 after discovering it systematically downranked resumes containing the word "women's".
Data privacy concerns arise when models are trained on personal data, when predictions reveal sensitive information, or when data is used in ways users did not consent to. Differential privacy -- a mathematical framework developed by Dwork et al. at Microsoft Research (2006) -- provides formal guarantees about what can be learned about individuals from aggregate statistics, and has been adopted by Apple, Google, and the US Census Bureau for sensitive data releases.
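The core of differential privacy is simpler than its reputation suggests: add noise calibrated to the query's sensitivity and the privacy budget epsilon. A standard-library sketch of the Laplace mechanism (the count and predicate are hypothetical):

```python
import math
import random

random.seed(0)

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1: one person's presence or
    absence changes the true result by at most 1.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

true_count = 842  # hypothetical: users matching some sensitive predicate
noisy = private_count(true_count, epsilon=0.5)
print(f"true = {true_count}, released = {noisy:.1f}")
# Smaller epsilon -> more noise -> stronger privacy guarantee
```

Production deployments (at Apple, Google, or the Census Bureau) involve far more machinery, but this calibrated-noise trade between accuracy and privacy is the idea underneath all of them.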
Transparency and explainability -- the ability to explain how a model arrives at a specific prediction -- have become both a regulatory concern and an ethical standard. The EU's GDPR includes what is commonly described as a "right to explanation" for automated decisions. The field of interpretable machine learning (Molnar, "Interpretable Machine Learning," 2019) has developed techniques including SHAP values, LIME, and attention visualization to make model behavior more understandable.
Data scientists who understand these concerns and can proactively address them are significantly more valuable -- and more employable -- than those with equivalent technical skills but no ethical grounding.
Salary Ranges and Career Paths
US Salary Data (2024)
| Role | Experience | Base Salary | Total Compensation (with equity) |
|---|---|---|---|
| Data Analyst | Entry-level | $65,000-$85,000 | $70,000-$100,000 |
| Data Scientist | Entry-level | $90,000-$120,000 | $100,000-$160,000 |
| Data Scientist | Senior | $140,000-$200,000 | $200,000-$350,000 |
| Staff Data Scientist | Senior + | $180,000-$250,000 | $300,000-$600,000 |
| ML Engineer | Senior | $150,000-$220,000 | $250,000-$500,000 |
| Director of Data Science | Leadership | $200,000-$280,000 | $350,000-$800,000+ |
*Sources: Levels.fyi, Glassdoor, LinkedIn Salary, O\*NET.*
These ranges vary substantially by company type (large technology companies pay the most), geography (San Francisco, Seattle, and New York command premiums), and industry (finance and tech pay above healthcare and retail). Outside the US, compensation is substantially lower but has grown considerably in Western Europe, Canada, and Australia. The UK's Office for National Statistics reported a median salary of approximately GBP 58,000 for data scientists in 2023.
Career Paths
There is no single path into data science. Common backgrounds include:
Quantitative academic fields -- Statistics, mathematics, economics, physics, computational biology, and related fields provide the analytical foundations. PhD holders are prevalent in ML research roles at top companies; master's degrees have become the typical credential for industry data science roles. A survey by Burtch Works (2022) found that 49% of working data scientists held a master's degree as their highest qualification, and 23% held a PhD. Bachelor's degree holders made up the remaining 28%.
Computer science or software engineering -- Strong programming background with added statistical and ML knowledge. Engineers transitioning into data science often have an easier time with the technical implementation side and a harder time with statistical inference and experimental design.
Data analytics transition -- Many data scientists start as analysts, develop Python skills and modeling knowledge alongside SQL and business domain knowledge, and grow into data science roles. The advantage of this path is accumulated domain expertise.
Bootcamps and self-directed learning -- The accessibility of open data, online courses (Coursera, fast.ai, Kaggle), and portfolio-based hiring has made self-directed paths viable, particularly into analytics and applied ML roles at smaller companies. The bar is a demonstrable portfolio of real work: cleaned datasets, trained models, documented notebooks, and ideally projects that address a real business or research question.
The most common bottleneck in hiring is not technical credentials but applied project experience -- evidence that a candidate can identify a real problem, get data, clean it, build an analysis or model, evaluate it properly, and communicate the results clearly. Interview formats that include case studies, take-home projects, or portfolio review are specifically designed to assess this.
What Organizations Get Wrong About Data Science
The enthusiasm with which organizations hired data scientists in the 2010s was not matched by equal investment in the prerequisites. Common failure modes:
Expecting models without infrastructure. Hiring data scientists before having reliable data pipelines, clear business metrics, and accessible data is like hiring chefs before building a kitchen. The McKinsey report cited above found this was the most common cause of failed data science initiatives -- teams could not build meaningful models because the data they needed was inaccessible, unreliable, or non-existent.
Treating data science as a technical function. The highest-value data science work requires deep collaboration between analysts, domain experts, and decision-makers. Organizations that silo data teams from the business problems they are supposed to solve produce technically excellent models with no business impact. DJ Patil, the former Chief Data Scientist of the United States, has described this as the "museum problem" -- beautiful models sitting in a display case that no one uses.
Overweighting ML sophistication. A logistic regression that stakeholders understand and trust and can act on often creates more business value than a gradient boosted ensemble that produces marginally better predictions no one understands or uses. Pedro Domingos of the University of Washington, author of "The Master Algorithm" (2015), has observed that in most business applications, the improvement from using a more complex algorithm over a simple baseline is smaller than the improvement from better data.
Ignoring experimentation infrastructure. Understanding whether a change caused an outcome -- as opposed to just correlating with it -- requires A/B testing or other causal inference methods. Without experimentation infrastructure, organizations accumulate data about what happened without ever learning what works. Airbnb, Booking.com, and Netflix have published extensively about the investment required to build reliable experimentation platforms -- and the business value that investment returned.
Under-investing in data quality. The most impactful investment many organizations can make in their data capabilities is not hiring more data scientists -- it is building the data engineering and data governance infrastructure that makes the data scientists they already have more effective.
The clearest signal of a data-mature organization is not the sophistication of the models it runs. It is the quality and accessibility of the data those models are trained on, and the organizational processes that turn analytical output into decisions and actions.
Data Science and Generative AI
The emergence of large language models (GPT-4, Gemini, Claude, Llama) and multimodal foundation models has created both opportunities and identity questions for the data science field.
On one hand, generative AI has dramatically lowered the barrier to some data science tasks. Code generation, data cleaning scripts, exploratory analysis, and even simple model-building can be accelerated substantially by LLM assistants. A 2023 GitHub Copilot study found that data analysis tasks were completed 30-40% faster with AI assistance.
On the other hand, the rise of pre-trained foundation models has shifted competitive advantage in applied ML away from building models from scratch (which most organizations could not do effectively anyway) toward fine-tuning, prompt engineering, and retrieval-augmented generation -- skills that require understanding of both the models' capabilities and the domain problems being solved.
The practical impact for data science careers: routine analytical and modeling work will be increasingly automated, raising the bar for what constitutes differentiated contribution. The skills that remain distinctly human -- problem formulation, experimental design, causal reasoning, stakeholder communication, and domain expertise -- are precisely the skills that the data science hierarchy of needs identifies as foundational. Data scientists who invest in depth on these dimensions are better positioned than those who specialize in model-building mechanics alone.
Frequently Asked Questions
What is data science?
Data science is an interdisciplinary field that uses statistical methods, programming, and domain knowledge to extract insights and build predictive models from data. It combines elements of statistics, computer science, and subject-matter expertise to answer questions that cannot be answered by simple querying or reporting alone. Typical outputs include predictive models, causal analyses, recommendation systems, automated decision-making pipelines, and data-driven strategies.
What is the difference between data science, data analytics, and machine learning engineering?
Data analytics focuses on describing and explaining what has happened -- querying databases, building dashboards, and summarizing trends for business decisions. Data science builds predictive and prescriptive models -- using statistical and machine learning methods to forecast what will happen and recommend what to do. Machine learning engineering focuses on productionizing ML models -- building the infrastructure, pipelines, and systems that serve model predictions at scale in real applications. There is substantial overlap, and job titles are used inconsistently across companies.
What skills does a data scientist need?
The core technical skills are: Python (the dominant language for data work, with libraries including pandas, NumPy, scikit-learn, and PyTorch), SQL (for querying and manipulating relational databases), and statistics (probability, hypothesis testing, regression, experimental design). Beyond technical skills, effective data scientists need strong communication to translate findings for non-technical stakeholders, critical thinking to avoid spurious conclusions, and enough domain knowledge to know which questions are worth asking.
What is the data science hierarchy of needs?
Monica Rogati's data science hierarchy of needs, drawn as a pyramid, illustrates that sophisticated machine learning sits at the top of a foundation that must be built first. From bottom to top: reliable data collection and storage, clean and trustworthy data pipelines, analytics and business intelligence, probabilistic modeling and A/B testing, and finally machine learning and AI. Organizations that skip the foundational layers and jump straight to 'AI' typically fail because their underlying data is unreliable or poorly structured.
What does a data scientist earn and how do you enter the field?
US data scientist salaries range widely: entry-level roles at non-tech companies typically pay $80,000-$110,000; senior data scientists at tech companies commonly earn $150,000-$220,000 in base salary, with total compensation (including equity) often higher. Entry paths include university degrees in statistics, computer science, or related quantitative fields, but many practicing data scientists transitioned from adjacent fields (economics, physics, biology, social science) and filled technical gaps through online courses, bootcamps, and self-directed projects. A portfolio of real projects demonstrating end-to-end data skills matters more than credentials at most companies.