In 2012, a Harvard Business Review article called data scientist "the sexiest job of the 21st century." A decade later, the article's authors, Thomas Davenport and DJ Patil, revisited the claim and found it had proven largely accurate — while also identifying a significant problem. Organizations had hired thousands of data scientists, purchased expensive tools, and built data teams, yet many had struggled to generate proportionate business value. The issue was not the talent. It was the organizational and data infrastructure that talent needed to do meaningful work.

The gap between the expectation and the reality of data science in organizations illuminates something important about the field: data science is not a technology that you acquire. It is a discipline that requires investment in foundations before sophisticated work becomes possible. Understanding what data science actually is — what it requires, what it produces, how it differs from adjacent fields, and what the career path looks like — is increasingly important for professionals at every level of an organization that handles data.


What Is Data Science?

Data science is an interdisciplinary field that uses statistical methods, computational tools, and domain knowledge to extract meaning from data — particularly complex, large, or unstructured data that cannot be analyzed by simple querying or reporting.

The term was popularized in the early 2000s and gained mainstream traction around 2010 when companies began accumulating large datasets from digital behavior that existing analytical approaches could not process. The canonical activities of data science include:

  • Building predictive models — algorithms that estimate future outcomes based on patterns in historical data (which customers are likely to churn, which transactions are likely fraudulent, what price a property will sell for)
  • Conducting causal analyses — using experimental design and statistical methods to estimate whether an intervention actually caused an outcome
  • Developing recommendation systems — algorithms that suggest relevant items to users based on behavioral patterns
  • Performing natural language processing — extracting structured meaning from text data
  • Creating classification systems — models that assign observations to categories (spam vs. not spam, positive vs. negative sentiment, high-risk vs. low-risk)

What distinguishes data science from simpler data work is the ambiguity and complexity of the questions it addresses. Querying a database to find last month's revenue is data retrieval. Predicting which products a customer will buy next week, and estimating how confident you should be in that prediction, is data science.
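That second task can be sketched concretely. The following is a minimal illustration, not a production recipe: it trains a scikit-learn logistic regression on synthetic data (the features, labels, and customer values are all invented), and uses predict_proba to get the "how confident should we be" part of the prediction.

```python
# Hypothetical illustration: predicting churn with a confidence estimate.
# All data here is synthetic; feature names are made up for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic history: [logins_last_30d, support_tickets] per customer.
X = rng.normal(loc=[10, 1], scale=[4, 1], size=(500, 2))
# Toy labeling rule: few logins plus many tickets means more churn.
churn_prob = 1 / (1 + np.exp(0.5 * X[:, 0] - 1.5 * X[:, 1]))
y = rng.random(500) < churn_prob

model = LogisticRegression().fit(X, y)

# predict_proba returns a probability rather than a bare yes/no label --
# the difference between a prediction and a prediction with uncertainty.
new_customer = [[2.0, 4.0]]   # rarely logs in, files many tickets
p_churn = model.predict_proba(new_customer)[0, 1]
print(f"Estimated churn probability: {p_churn:.2f}")
```

The point is the output type: a calibrated probability a business can act on (for example, only intervening above some threshold), not a row retrieved from a database.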


Data Science vs. Data Analytics vs. Machine Learning Engineering

These three roles overlap substantially in practice, and the titles are used inconsistently across organizations. Understanding the distinctions clarifies what different jobs actually require.

Data Analytics

Data analytics is oriented toward description and explanation: what happened, how much, why it happened, how it compares to a baseline. Data analysts query databases, build dashboards, design reports, and communicate business performance to decision-makers.

A data analyst asks: How many users signed up last month? What is our customer acquisition cost by channel? How did retention change after we launched the new onboarding flow?

Typical tools: SQL (the primary skill), business intelligence platforms (Tableau, Looker, Power BI), Excel, basic statistics.

Data Science

Data science is oriented toward prediction and prescription: what will happen, what should we do about it, what will the outcome be. Data scientists build models that make predictions about future or unobserved events and conduct analyses that go beyond description to estimate causal effects.

A data scientist asks: Which users are likely to churn in the next 30 days, so we can intervene? Did our price change cause the increase in conversion, or was it caused by something else that changed at the same time? What price maximizes expected revenue given demand elasticity?

Typical tools: Python (dominant), R, SQL, scikit-learn, PyTorch or TensorFlow, statistical modeling, experimental design.

Machine Learning Engineering

Machine learning engineering is oriented toward production: taking a validated model and deploying it reliably at scale in a real application. ML engineers build the infrastructure, pipelines, and APIs that serve model predictions to users in real time or batch processes.

An ML engineer asks: How do we serve this recommendation model to 10 million users with sub-100ms latency? How do we monitor the model for performance drift? How do we automate the retraining pipeline?

Typical tools: Python, MLflow, Kubernetes, Docker, cloud ML platforms (SageMaker, Vertex AI, Azure ML), software engineering practices.
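The "performance drift" question above can be made concrete with a deliberately simplified sketch. Real monitoring systems use tests like Kolmogorov-Smirnov or the Population Stability Index; the version below, with invented scores and an illustrative threshold, just flags when a feature's live mean moves far from its training baseline.

```python
# Minimal drift check: compare a production feature's distribution
# against the training baseline. Threshold and data are illustrative.
import statistics

def drifted(baseline, live, z_threshold=3.0):
    """Flag drift when the live mean sits far from the training mean,
    measured in baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(live) - mu) / sigma
    return z > z_threshold

training_scores = [0.1, 0.2, 0.15, 0.22, 0.18, 0.12, 0.2, 0.17]
production_scores = [0.6, 0.7, 0.65, 0.72, 0.68, 0.6, 0.66, 0.7]

print(drifted(training_scores, production_scores))  # prints True
```

An ML engineer's job is largely wrapping checks like this in automated pipelines: scheduled comparisons, alerting, and retraining triggers.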

Dimension             Data Analytics        Data Science                     ML Engineering
Core question         What happened?        What will happen?                How do we deploy at scale?
Primary output        Reports, dashboards   Models, insights, experiments    Production ML systems
Python required?      Sometimes             Yes                              Yes
Statistics emphasis   Basic                 Substantial                      Applied
Software engineering  Minimal               Moderate                         Strong
SQL proficiency       Essential             Essential                        Moderate
Typical background    Business, economics   Stats, CS, quantitative fields   Software engineering + ML

The Data Science Hierarchy of Needs

Monica Rogati, a former data scientist at LinkedIn, published a widely influential blog post describing what she called the data science hierarchy of needs — a pyramid illustrating that sophisticated machine learning sits atop a set of foundational capabilities that must be built first.

From bottom to top of the pyramid:

Data collection and storage (foundation): Does the organization reliably collect the data it needs? Does it have appropriate infrastructure to store it? Many organizations try to do data science before they have trustworthy data pipelines. The result is models trained on unreliable data that produce unreliable predictions.

Data movement and processing: Can data be moved, transformed, and joined across systems? Do data pipelines run reliably, fail gracefully, and produce consistent results? Data engineering — the work of building these pipelines — is unglamorous but essential.

Exploration and transformation: Can analysts actually access, query, and understand the data? Are there business intelligence tools that let decision-makers explore what is happening? A/B testing infrastructure for measuring causal effects?

Aggregate / label: Are there well-defined metrics? Are there labeled datasets for supervised learning problems? Is data quality sufficient that models trained on it will generalize?

Learning / optimization (top): Machine learning, deep learning, AI. The sophisticated techniques that generate public excitement and inflated expectations.

Rogati's insight is that organizations that skip the foundational layers and jump straight to the top fail systematically. You cannot build a reliable fraud detection model on an unreliable data pipeline. You cannot run valid A/B tests without a rigorous metrics framework. The pyramid has to be built from the bottom.

"Data quality is boring work. It is also, for almost every organization, more impactful than any machine learning model you will ever build on top of it." — Monica Rogati


The Core Skills Stack

Data science competencies are commonly described in terms of three dimensions, originally expressed by Drew Conway's 2010 Venn diagram:

Programming and computer science: The ability to manipulate, process, and model data computationally.

Statistics and mathematics: The knowledge to design valid analyses, interpret probabilistic results, and avoid common pitfalls in modeling.

Domain knowledge and communication: Understanding of the field or industry you are working in, and the ability to translate analytical results into actionable recommendations.

Python

Python is the dominant language in data science. The ecosystem of libraries makes it the practical standard:

  • pandas for data manipulation and analysis
  • NumPy for numerical computation
  • scikit-learn for classical machine learning (regression, classification, clustering, preprocessing)
  • matplotlib and seaborn for visualization
  • PyTorch and TensorFlow for deep learning
  • statsmodels for statistical modeling with inference (p-values, confidence intervals)
  • Jupyter notebooks for exploratory analysis and communication

Python proficiency at the data science level means being comfortable with data manipulation, exploratory analysis, model building and evaluation, and communicating results — not necessarily deep software engineering.

SQL

SQL (Structured Query Language) is the foundational skill for accessing data in relational databases, and it is non-negotiable across virtually all data roles. Most organizational data lives in relational databases (PostgreSQL, MySQL, SQL Server, Snowflake, BigQuery, Redshift), and SQL is the primary interface for extracting and preparing it.

Advanced SQL skills — window functions, CTEs, subqueries, performance optimization — are frequently tested in data science interviews and genuinely matter in day-to-day work.
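Here is the flavor of query those interviews test — a CTE feeding a window function — run against an in-memory SQLite database so the example is self-contained. The table and column names are invented; production work would run the same SQL against a warehouse like Snowflake or BigQuery.

```python
# Interview-style SQL: a CTE plus ROW_NUMBER() to keep each user's
# most recent order. Runs on SQLite 3.25+ (bundled with modern Python).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL, ordered_at TEXT);
    INSERT INTO orders VALUES
        (1, 50.0, '2024-01-05'),
        (1, 75.0, '2024-02-10'),
        (2, 20.0, '2024-01-20'),
        (2, 90.0, '2024-03-01'),
        (2, 30.0, '2024-03-15');
""")

# The CTE ranks each user's orders by date; the outer query keeps rank 1.
query = """
    WITH ranked AS (
        SELECT user_id, amount, ordered_at,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY ordered_at DESC
               ) AS rn
        FROM orders
    )
    SELECT user_id, amount FROM ranked WHERE rn = 1 ORDER BY user_id;
"""
latest_orders = conn.execute(query).fetchall()
print(latest_orders)  # [(1, 75.0), (2, 30.0)]
```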

Statistics

Core statistical knowledge for data science includes:

  • Probability: distributions, conditional probability, Bayes' theorem
  • Statistical inference: hypothesis testing, p-values, confidence intervals, Type I and II errors
  • Regression: linear and logistic regression, interpretation, assumptions and diagnostics
  • Experimental design: A/B testing, randomization, control for confounders, power analysis
  • Causal inference: difference-in-differences, regression discontinuity, instrumental variables — methods for estimating causal effects from observational data
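The experimental-design item above is concrete enough to work through. A standard analysis of a two-arm A/B test is the two-proportion z-test, sketched here with only the standard library; the conversion counts are invented for illustration.

```python
# Two-proportion z-test for an A/B experiment, standard library only.
# Counts below are made up for the example.
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control converted 200/4000 (5.0%); treatment converted 260/4000 (6.5%).
z, p = two_proportion_z_test(200, 4000, 260, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The value of knowing the machinery is knowing its limits: randomization must actually be valid, the sample size must be chosen via power analysis before the test, and peeking at results repeatedly inflates the false-positive rate.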

The most dangerous data scientist is one who is technically competent but statistically naive — someone who can build a model but does not understand when its output is misleading.


A Day in the Life of a Data Scientist

Data science work does not resemble its public portrayal — running cutting-edge deep learning algorithms against vast datasets. In most organizations, the actual work distribution looks something like this:

~30-40% data cleaning and preparation: Dealing with missing values, joining imperfect datasets, investigating data quality issues, building the clean dataset that analysis requires. This work is unglamorous and time-consuming. Most practitioners report spending more time on it than on any other activity.

~20-30% exploratory analysis: Understanding what is in the data, identifying patterns, generating hypotheses, visualizing distributions and relationships.

~15-20% modeling and analysis: Building and evaluating models, running statistical analyses, conducting experiments.

~15-20% communication: Writing documentation, presenting findings to stakeholders, translating results into recommendations, explaining model behavior.

~5-15% deployment and monitoring (varies enormously by organization): Some data scientists work end-to-end and deploy models to production; others hand off to ML engineers.

The proportion of time spent on glamorous vs. unglamorous work is a reliable indicator of data team maturity. Early-stage teams in organizations without strong data infrastructure spend most of their time cleaning data and building pipelines. Mature teams with strong infrastructure spend proportionally more time on modeling and analysis.


Salary Ranges and Career Paths

US Salary Data (2024)

Role                      Experience    Base Salary          Total Compensation (with equity)
Data Analyst              Entry-level   $65,000-$85,000      $70,000-$100,000
Data Scientist            Entry-level   $90,000-$120,000     $100,000-$160,000
Data Scientist            Senior        $140,000-$200,000    $200,000-$350,000
Staff Data Scientist      Senior+       $180,000-$250,000    $300,000-$600,000
ML Engineer               Senior        $150,000-$220,000    $250,000-$500,000
Director of Data Science  Leadership    $200,000-$280,000    $350,000-$800,000+

These ranges vary substantially by company type (large tech companies pay most), geography (San Francisco, Seattle, New York command premiums), and industry (finance and tech pay above healthcare and retail). Outside the US, compensation is substantially lower but has grown considerably in Western Europe, Canada, and Australia.

Career Paths

There is no single path into data science. Common backgrounds include:

Quantitative academic fields — Statistics, mathematics, economics, physics, computational biology, and related fields provide the analytical foundations. PhD holders are prevalent in ML research roles at top companies; master's degrees have become the typical credential for industry data science roles.

Computer science or software engineering — Strong programming background with added statistical and ML knowledge. Engineers transitioning into data science often have an easier time with the technical implementation side and a harder time with statistical inference and experimental design.

Data analytics transition — Many data scientists start as analysts, develop Python skills and modeling knowledge alongside SQL and business domain knowledge, and grow into data science roles. The advantage of this path is accumulated domain expertise.

Bootcamps and self-directed learning — The accessibility of open data, online courses (Coursera, fast.ai, Kaggle), and portfolio-based hiring has made self-directed paths viable, particularly into analytics and applied ML roles at smaller companies. The bar is a demonstrable portfolio of real work.

The most common bottleneck in hiring is not technical credentials but applied project experience — evidence that a candidate can identify a real problem, get data, clean it, build an analysis or model, evaluate it properly, and communicate the results clearly. Interview formats that include case studies, take-home projects, or portfolio review are specifically designed to assess this.


What Organizations Get Wrong About Data Science

The enthusiasm with which organizations hired data scientists in the 2010s was not matched by equal investment in the prerequisites. Common failure modes:

Expecting models without infrastructure. Hiring data scientists before having reliable data pipelines, clear business metrics, and accessible data is like hiring chefs before building a kitchen.

Treating data science as a technical function. The highest-value data science work requires deep collaboration between analysts, domain experts, and decision-makers. Organizations that silo data teams from the business problems they are supposed to solve produce technically excellent models with no business impact.

Overweighting ML sophistication. A logistic regression that stakeholders understand and trust and can act on often creates more business value than a gradient boosted ensemble that produces marginally better predictions no one understands or uses.

Ignoring experimentation infrastructure. Understanding whether a change caused an outcome — as opposed to just correlating with it — requires A/B testing or other causal inference methods. Without experimentation infrastructure, organizations accumulate data about what happened without ever learning what works.

Under-investing in data quality. The most impactful investment many organizations can make in their data capabilities is not hiring more data scientists — it is building the data engineering and data governance infrastructure that makes the data scientists they already have more effective.

The clearest signal of a data-mature organization is not the sophistication of the models it runs. It is the quality and accessibility of the data those models are trained on, and the organizational processes that turn analytical output into decisions and actions.

Frequently Asked Questions

What is data science?

Data science is an interdisciplinary field that uses statistical methods, programming, and domain knowledge to extract insights and build predictive models from data. It combines elements of statistics, computer science, and subject-matter expertise to answer questions that cannot be answered by simple querying or reporting alone. Typical outputs include predictive models, causal analyses, recommendation systems, automated decision-making pipelines, and data-driven strategies.

What is the difference between data science, data analytics, and machine learning engineering?

Data analytics focuses on describing and explaining what has happened — querying databases, building dashboards, and summarizing trends for business decisions. Data science builds predictive and prescriptive models — using statistical and machine learning methods to forecast what will happen and recommend what to do. Machine learning engineering focuses on productionizing ML models — building the infrastructure, pipelines, and systems that serve model predictions at scale in real applications. There is substantial overlap, and job titles are used inconsistently across companies.

What skills does a data scientist need?

The core technical skills are: Python (the dominant language for data work, with libraries including pandas, NumPy, scikit-learn, and PyTorch), SQL (for querying and manipulating relational databases), and statistics (probability, hypothesis testing, regression, experimental design). Beyond technical skills, effective data scientists need strong communication to translate findings for non-technical stakeholders, critical thinking to avoid spurious conclusions, and enough domain knowledge to know which questions are worth asking.

What is the data science hierarchy of needs?

Monica Rogati's data science hierarchy of needs, drawn as a pyramid, illustrates that sophisticated machine learning sits at the top of a foundation that must be built first. From bottom to top: reliable data collection and storage, clean and trustworthy data pipelines, analytics and business intelligence, probabilistic modeling and A/B testing, and finally machine learning and AI. Organizations that skip the foundational layers and jump straight to 'AI' typically fail because their underlying data is unreliable or poorly structured.

What does a data scientist earn and how do you enter the field?

US data scientist salaries range widely: entry-level roles at non-tech companies typically pay $80,000-$110,000; senior data scientists at tech companies commonly earn $150,000-$220,000 in base salary, with total compensation (including equity) often higher. Entry paths include university degrees in statistics, computer science, or related quantitative fields, but many practicing data scientists transitioned from adjacent fields (economics, physics, biology, social science) and filled technical gaps through online courses, bootcamps, and self-directed projects. A portfolio of real projects demonstrating end-to-end data skills matters more than credentials at most companies.