Data engineering is the invisible infrastructure of the modern data economy. Before a data scientist can build a model, before a business analyst can produce a report, before a machine learning system can make a recommendation, someone has to get the data — from dozens of sources, in inconsistent formats, at varying frequencies — into a state where it can be reliably used. That is the data engineer's job. It is not glamorous by the standards of data science, which captured much of the public imagination in the 2010s. But it is arguably more foundational, and the demand for it has proved more durable.
The role has evolved rapidly. A decade ago, data engineering was largely a back-office function focused on ETL (extract, transform, load) pipelines and relational database management. Today, data engineers work with cloud-native architectures, streaming data systems, distributed computing frameworks, and sophisticated transformation layers that sit much closer to the analytical and ML workflows they serve. The tools have changed dramatically, the expectations have risen, and the compensation has followed.
This article explains what data engineers actually do, how the role differs from the data scientist and software engineer roles it is frequently confused with, what the modern tool stack looks like, what salary ranges are realistic across career levels and sectors, and what the most effective path into the field looks like.
"Data quality is not a data science problem. It is a data engineering problem. You cannot analyse data you cannot trust, and building systems that produce trustworthy data is genuinely hard." — Tristan Handy, founder of dbt Labs
Key Definitions
ETL (Extract, Transform, Load): The traditional process of extracting data from source systems, transforming it into the required format, and loading it into a target storage system. Modern variants include ELT, where raw data is loaded into the warehouse first and transformed there afterwards.
Data Pipeline: An automated system that moves data from source systems to storage and processing systems on a defined schedule or in response to events. Reliability, observability, and failure handling are primary engineering concerns.
Data Warehouse: A centralised repository of structured data optimised for querying and reporting. Modern cloud data warehouses (Snowflake, BigQuery, Amazon Redshift) are columnar stores capable of querying billions of rows at interactive speeds.
Data Lakehouse: An architecture combining the low-cost storage flexibility of a data lake (raw files in object storage) with the structured querying capabilities of a data warehouse, using formats like Apache Iceberg or Delta Lake.
Orchestration: The scheduling and coordination of multiple data pipeline steps or tasks in the correct sequence, with dependency management and error handling. Apache Airflow, Prefect, and Dagster are the most widely used orchestration tools.
What a Data Engineer Does Day-to-Day
The daily work varies by company size, maturity, and the data engineer's seniority, but broadly involves:
Building and maintaining data pipelines: Writing code that extracts data from source systems (transactional databases, SaaS APIs like Salesforce or Stripe, event streams from application logs), applies necessary transformations, and loads it into downstream systems. This involves handling failure cases, retries, monitoring, and alerting when something breaks.
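The retry-and-alerting pattern described above can be sketched in plain Python. This is a minimal illustration, not a production pattern library: the `flaky_source` stand-in and the backoff parameters are invented, and a real pipeline would wrap an HTTP call to a SaaS API rather than a local function.

```python
import time

def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0):
    """Call a flaky extract step, retrying with exponential backoff.

    `fetch` is any callable that returns extracted records or raises
    on failure; in a real pipeline it might wrap a request to a
    SaaS API such as Stripe or Salesforce.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_attempts:
                # In production this is where alerting would fire.
                raise RuntimeError(f"extract failed after {attempt} attempts") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulate a source that fails twice before succeeding.
calls = {"n": 0}
def flaky_source():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [{"order_id": 1, "amount": 42.0}]

records = fetch_with_retries(flaky_source, max_attempts=5, base_delay=0.01)
print(records)  # [{'order_id': 1, 'amount': 42.0}]
```

The point is not the retry loop itself but the mindset: extraction code is written with the expectation that sources fail, and failure handling is a first-class design concern.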
Data modelling: Designing how data is structured in the data warehouse — table schemas, dimension tables, fact tables, and the relationships between them. Good data models make downstream analysis fast and intuitive; poor models create technical debt that compounds rapidly.
Writing and maintaining dbt models: dbt (data build tool) has become the standard tool for the transformation layer of the data stack. Data engineers write SQL-based transformation logic in dbt, defining how raw data becomes the clean, business-ready tables that analysts query.
Infrastructure management: Provisioning and configuring cloud data infrastructure — Snowflake accounts, BigQuery datasets, Airflow deployments, Kafka clusters, Spark on EMR or Databricks. Increasingly this is done through infrastructure-as-code tools (Terraform, Pulumi).
Data quality and observability: Implementing tests, anomaly detection, and lineage tracking to ensure that data flowing through pipelines is accurate, complete, and trustworthy. Tools like Great Expectations, dbt tests, Monte Carlo, and Soda help here.
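To give a flavour of what such checks look like, here is a toy version of the row-level validations that tools like dbt tests or Great Expectations express declaratively. The column names and rules are invented for illustration.

```python
def check_batch(rows, required_columns=("user_id", "event_ts")):
    """Run simple quality checks on a batch of extracted rows.

    Returns a list of human-readable failures; an empty list means
    the batch passed. Tools like dbt tests and Great Expectations
    express the same ideas declaratively rather than in code.
    """
    failures = []
    if not rows:
        failures.append("batch is empty")
        return failures
    for col in required_columns:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing:
            failures.append(f"{missing} rows have null {col}")
    ids = [r["user_id"] for r in rows if r.get("user_id") is not None]
    if len(ids) != len(set(ids)):
        failures.append("user_id is not unique")
    return failures

good = [{"user_id": 1, "event_ts": "2024-01-01"},
        {"user_id": 2, "event_ts": "2024-01-02"}]
bad = [{"user_id": 1, "event_ts": None},
       {"user_id": 1, "event_ts": "2024-01-02"}]
print(check_batch(good))  # []
print(check_batch(bad))   # ['1 rows have null event_ts', 'user_id is not unique']
```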
Collaboration with data consumers: Working with data scientists, analysts, and business stakeholders to understand data needs, prioritise pipeline work, and ensure that the data models built are actually fit for purpose.
Data Engineer vs Data Scientist vs Software Engineer
These three roles are frequently confused, particularly by people outside the data field. The distinctions matter for hiring, career planning, and team structure.
Data Engineer:
- Builds and maintains data infrastructure (pipelines, warehouses, transformation layers)
- Primary outputs: reliable data pipelines, clean data models, well-structured data assets
- Core skills: Python/Scala, SQL, distributed systems, cloud platforms, data modelling
- Focuses on: data availability, reliability, scalability, and transformation
Data Scientist:
- Uses data to generate insights, build predictive models, and support decision-making
- Primary outputs: analytical reports, ML models, statistical analyses, business recommendations
- Core skills: Python/R, statistics, machine learning, data visualisation, domain knowledge
- Depends on: the data infrastructure built by data engineers
Software Engineer:
- Builds applications and services for end users or internal systems
- Primary outputs: production software applications, APIs, microservices
- Core skills: software design patterns, testing, deployment, system architecture
- May interact with data systems but is not primarily focused on data infrastructure
A useful analogy: if a city's water system is the data infrastructure, software engineers build the buildings, data scientists decide where the water should go and what the pressure should be, and data engineers build and maintain the pipes.
The Modern Data Engineering Tool Stack
The data engineering tool landscape has expanded and matured significantly since 2018. A competent data engineer in 2024 is expected to be familiar with:
Languages:
- Python: universal for pipeline scripting, data processing, and automation
- SQL: essential for data transformation and querying
- Scala or Java: used in Spark contexts and older JVM-based data systems
Data Warehousing:
- Snowflake: cloud-native, widely adopted in enterprise
- Google BigQuery: dominant in Google Cloud environments
- Amazon Redshift: established in AWS environments
- Databricks: particularly strong for ML-adjacent data engineering
Transformation:
- dbt (data build tool): now the standard for the transformation layer; requires SQL proficiency
Orchestration:
- Apache Airflow: the most widely deployed orchestration platform; Python-based DAGs
- Prefect and Dagster: modern alternatives with improved developer experience
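Under the hood, all of these tools model a pipeline as a DAG and resolve task dependencies into a valid execution order before adding scheduling, retries, and observability on top. A minimal sketch of that core idea using Python's standard-library `graphlib` (the task names are hypothetical):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, the same DAG
# structure an Airflow pipeline declares, minus the scheduler.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "load_warehouse": {"extract_orders", "extract_users"},
    "dbt_transform": {"load_warehouse"},
    "refresh_dashboard": {"dbt_transform"},
}

# Resolve the dependencies into an execution order: every task
# appears only after all of its upstream dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The real orchestrators add the parts that make this hard in practice: running tasks on a schedule, retrying failures, backfilling history, and surfacing where and why a run broke.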
Distributed Processing:
- Apache Spark: the industry standard for large-scale data processing
- Apache Flink: increasingly used for streaming workloads
Streaming:
- Apache Kafka: the dominant event streaming platform
- AWS Kinesis, Google Pub/Sub: cloud-native streaming alternatives
Cloud Platforms: AWS, GCP, and Azure are all common. Most data engineers specialise in one while maintaining general familiarity with the others.
Version Control and CI/CD: Git, GitHub Actions or GitLab CI — data pipelines are code and require the same software engineering practices (version control, testing, automated deployment) as any other codebase.
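In that spirit, here is a sketch of the kind of unit test a CI job might run against transformation logic. The `normalise_currency` function and its rate table are invented for illustration; the design point is that pure transformation functions are trivial to test before a pipeline ever touches production data.

```python
def normalise_currency(rows, rates):
    """Convert each row's amount to USD using a rate table.

    A pure function: no I/O, no side effects, so it can be
    unit-tested in CI exactly like any other application code.
    """
    out = []
    for r in rows:
        rate = rates[r["currency"]]
        out.append({**r, "amount_usd": round(r["amount"] * rate, 2),
                    "currency": "USD"})
    return out

# The kind of check a CI job (GitHub Actions, GitLab CI) would run:
rates = {"GBP": 1.25, "EUR": 1.10, "USD": 1.0}
rows = [{"amount": 100.0, "currency": "GBP"},
        {"amount": 50.0, "currency": "USD"}]
result = normalise_currency(rows, rates)
assert result[0]["amount_usd"] == 125.0
assert result[1]["amount_usd"] == 50.0
print("transformation tests passed")
```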
Salary and Compensation
Data engineering salaries are strong and have remained robust through the technology industry corrections of 2022-23 that affected some other data roles more severely.
US Salaries (2024, sources: Levels.fyi, LinkedIn Salary, Stack Overflow Survey):
- Entry-level data engineer (0-2 years): $95,000-$130,000
- Mid-level data engineer (2-5 years): $130,000-$175,000
- Senior data engineer (5-8 years): $160,000-$220,000
- Staff/Lead data engineer: $200,000-$280,000
- Principal/Architect: $250,000-$350,000+
At major tech companies (Google, Meta, Amazon, Airbnb, Databricks), total compensation including base, bonus, and RSUs significantly exceeds base salary. A senior data engineer at Google earning $200,000 base may have total compensation of $300,000-$400,000 depending on RSU grants.
UK Salaries (2024):
- Entry-level: £40,000-£60,000
- Mid-level: £65,000-£90,000
- Senior: £90,000-£130,000
- Lead/Principal: £120,000-£160,000+
The finance sector (banks, hedge funds, fintech) pays a premium of 20-30% above the technology-sector baseline for equivalent experience.
Career Path
Typical progression:
Junior / Associate Data Engineer: Working on well-scoped tasks within established systems — adding new data sources to existing pipelines, building dbt models under supervision, fixing bugs. Learning the team's infrastructure and tooling.
Data Engineer: Owning complete pipeline projects independently. Designing data models for new domains. Mentoring junior colleagues. Beginning to make architectural recommendations.
Senior Data Engineer: Leading significant data engineering projects. Making architectural decisions for specific domains. Cross-functional collaboration with product, analytics, and ML teams. Defining team standards and practices.
Staff / Lead Data Engineer: Setting architectural direction for the entire data platform. Influencing hiring and team processes. Technical leadership across multiple product areas or a large team.
Principal / Data Architect: Organisation-wide technical vision for data infrastructure. Evaluating major tool and platform decisions. Significant strategic influence beyond the data team.
Parallel paths include moving into data platform engineering (more infrastructure-focused), analytics engineering (more BI and SQL transformation focused, less pipeline engineering), or transitioning into ML engineering (intersection of data engineering and machine learning deployment).
How to Become a Data Engineer
From a software engineering background: The most natural transition. Fill the data-specific gaps: learn SQL deeply, understand data modelling concepts, take the dbt fundamentals course (free), build an end-to-end pipeline project using Airflow and Snowflake or BigQuery, and document it as a portfolio project.
From a data science background: Data scientists often have strong Python and SQL foundations. The gaps to fill are software engineering practices (writing production-quality code, testing, version control discipline) and infrastructure knowledge (cloud platforms, orchestration, data warehouse architecture).
From analytics or BI: Strong SQL skills are an asset. Filling the gap requires adding Python programming, learning pipeline tools (Airflow), and understanding the engineering side of data warehousing beyond just querying it.
From a non-technical background: The path is longer but viable. Strong SQL is the first milestone — achievable in 3-6 months of consistent study. Python programming comes next. Building a documented end-to-end project (ingest data from a public API, transform it with dbt, load it to a free-tier cloud warehouse, schedule it with Airflow) demonstrates practical ability more convincingly than any course certificate.
Portfolio project recommendations:
- Build a pipeline ingesting real data (public APIs like the Open-Meteo weather API, sports data, financial data) into a cloud warehouse
- Write dbt models transforming raw data into analytical tables
- Orchestrate with Airflow running on a free-tier VM
- Document the architecture, design decisions, and what you would do differently with more time
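As a starting point, the load-and-transform steps of such a project can be prototyped entirely locally, using SQLite as a stand-in for the cloud warehouse and a stubbed payload in place of a live API call. Everything below is illustrative: the table names, columns, and data are invented.

```python
import sqlite3

# Stand-in for JSON fetched from a public API such as Open-Meteo.
api_payload = [
    {"city": "London", "date": "2024-05-01", "temp_c": 14.0},
    {"city": "London", "date": "2024-05-02", "temp_c": 16.0},
]

conn = sqlite3.connect(":memory:")  # a cloud warehouse in the real project
conn.execute("CREATE TABLE raw_weather (city TEXT, date TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO raw_weather VALUES (:city, :date, :temp_c)", api_payload
)

# A dbt-style transformation: raw rows become an analytical table.
conn.execute(
    """CREATE TABLE city_daily_avg AS
       SELECT city, AVG(temp_c) AS avg_temp_c, COUNT(*) AS n_days
       FROM raw_weather GROUP BY city"""
)
row = conn.execute("SELECT city, avg_temp_c, n_days FROM city_daily_avg").fetchone()
print(row)  # ('London', 15.0, 2)
```

Swapping SQLite for BigQuery or Snowflake, the stub for a real API client, and the inline SQL for dbt models turns this skeleton into the portfolio project described above.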
Practical Takeaways
- SQL is the foundation. Before touching Spark or Kafka, master complex SQL: window functions, CTEs, performance optimisation, and data modelling patterns.
- Then add Python. Then pick one cloud platform and learn it well.
- The dbt Community Slack and Data Engineering Weekly newsletter are the two best free resources for staying current with industry practice.
- Target your first data engineering role at a company with a mature data team — you will learn far more in the first year from good mentors and existing systems than you would building everything from scratch at a seed-stage startup.
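Those window-function fundamentals can be practised with nothing more than Python's built-in SQLite driver (window functions require SQLite 3.25+, which ships with recent Python releases). The schema and data here are invented for the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-01", 10.0),
        ("alice", "2024-01-05", 20.0),
        ("bob", "2024-01-02", 5.0),
    ],
)

# A running total per customer: a classic window-function exercise.
rows = conn.execute(
    """SELECT customer, order_date,
              SUM(amount) OVER (
                  PARTITION BY customer ORDER BY order_date
              ) AS running_total
       FROM orders
       ORDER BY customer, order_date"""
).fetchall()
for r in rows:
    print(r)
# ('alice', '2024-01-01', 10.0)
# ('alice', '2024-01-05', 30.0)
# ('bob', '2024-01-02', 5.0)
```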
References
- Stack Overflow, Developer Survey 2024 — Data Engineer Salary Data. stackoverflow.com/survey
- Levels.fyi, Data Engineer Compensation Data (2024). levels.fyi
- dbt Labs, The Analytics Engineering Guide (2024). docs.getdbt.com
- Bureau of Labor Statistics, Database Administrators and Architects (2023). bls.gov
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
- Apache Software Foundation, Airflow Documentation (2024). airflow.apache.org
- Databricks, The Data + AI Survey (2024). databricks.com
- Snowflake, Modern Data Stack Report (2023). snowflake.com
- Tristan Handy, 'The Analytics Engineer' (Fishtown Analytics Blog, 2016). getdbt.com/blog
- Data Engineering Weekly, Industry Newsletter (2024). dataengineeringweekly.com
- Reis, Joe, and Matt Housley. Fundamentals of Data Engineering. O'Reilly Media, 2022.
- LinkedIn Workforce Report, Data Engineering Demand (2024). linkedin.com
Frequently Asked Questions
What is the difference between a data engineer and a data scientist?
A data engineer builds and maintains the infrastructure that makes data available for analysis — pipelines, warehouses, and transformation layers. A data scientist uses that data to build models and generate insights. Data engineers focus on data plumbing; data scientists focus on data analysis and modelling.
What tools does a data engineer use?
Core tools include Python or Scala for scripting, SQL for data transformation, Apache Spark for large-scale processing, Apache Airflow or Prefect for workflow orchestration, dbt for transformation logic, and cloud data warehouses (Snowflake, BigQuery, Redshift). Cloud platform skills (AWS, GCP, Azure) are universally expected.
How much does a data engineer earn?
US data engineers earn $100,000-$170,000 at major tech companies plus stock compensation. The BLS (2023) reports median annual wages for database administrators (the closest category) of $101,000. Senior data engineers and lead/principal roles at major companies earn $180,000-$300,000+ total compensation.
Do you need a computer science degree to become a data engineer?
A CS degree is common but not required. Many data engineers come from data science, software engineering, or even analytics backgrounds. Strong SQL skills, Python programming, and demonstrated pipeline project experience are often more important to employers than a specific degree.
Is data engineering a good career in 2024?
Yes. Demand for data engineers consistently exceeds supply, and the role has become more central as organisations depend more heavily on data-driven decisions. Unlike data science, which saw some overhiring and correction in 2022-23, data engineering demand has remained robust.