Data engineering is the invisible infrastructure of the modern data economy. Before a data scientist can build a model, before a business analyst can produce a report, before a machine learning system can make a recommendation, someone has to get the data — from dozens of sources, in inconsistent formats, at varying frequencies — into a state where it can be reliably used. That is the data engineer's job. It is not glamorous by the standards of data science, which captured much of the public imagination in the 2010s. But it is arguably more foundational, and the demand for it has proved far more durable.
The role has evolved rapidly. A decade ago, data engineering was largely a back-office function focused on ETL (extract, transform, load) pipelines and relational database management. Today, data engineers work with cloud-native architectures, streaming data systems, distributed computing frameworks, and sophisticated transformation layers that sit much closer to the analytical and ML workflows they serve. The tooling has changed dramatically — Hadoop gave way to Spark, on-premise warehouses gave way to Snowflake and BigQuery, hand-rolled schedulers gave way to Airflow and Prefect, and SQL-in-scripts gave way to dbt. The expectations have risen with the tooling, and the compensation has followed.
This article explains what data engineers actually do, how the role differs from the data scientist and software engineer roles it is frequently confused with, what the modern tool stack looks like layer by layer, what salary ranges are realistic across career levels and company tiers, and what the path into the field actually requires. It also covers how to evaluate whether data engineering is a genuinely good fit compared to adjacent paths, and what separates high-quality data engineering practice from the mediocre work that is more common than most job postings acknowledge.
"Data quality is not a data science problem. It is a data engineering problem. You cannot analyse data you cannot trust, and building systems that produce trustworthy data is genuinely hard." — Tristan Handy, founder of dbt Labs
Key Definitions
ETL (Extract, Transform, Load): The traditional process of extracting data from source systems, transforming it into the required format, and loading it into a target storage system. Modern variants include ELT, where raw data is loaded first and transformation happens after storage — enabled by the cheap compute available in cloud data warehouses.
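The ETL/ELT distinction can be sketched in a few lines of Python, using sqlite3 as a stand-in "warehouse". The table and column names here are illustrative, not from any real system:

```python
import sqlite3

raw_rows = [("ALICE", "2024-01-05"), ("Bob", "2024-01-06")]

def run_etl(conn):
    # ETL: transform in application code BEFORE loading.
    conn.execute("CREATE TABLE etl_users (name TEXT, signup_date TEXT)")
    cleaned = [(name.lower(), d) for name, d in raw_rows]
    conn.executemany("INSERT INTO etl_users VALUES (?, ?)", cleaned)

def run_elt(conn):
    # ELT: load raw data first, transform inside the warehouse with SQL.
    conn.execute("CREATE TABLE raw_users (name TEXT, signup_date TEXT)")
    conn.executemany("INSERT INTO raw_users VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE elt_users AS "
        "SELECT lower(name) AS name, signup_date FROM raw_users"
    )

conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl = conn.execute("SELECT name FROM etl_users ORDER BY name").fetchall()
elt = conn.execute("SELECT name FROM elt_users ORDER BY name").fetchall()
print(etl == elt)  # same cleaned result, different place of transformation
```

ELT wins in cloud warehouses precisely because the `lower(name)` step can run on cheap, elastic warehouse compute against already-loaded raw data, and can be re-run when the logic changes.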
Data Pipeline: An automated system that moves data from source systems to storage and processing systems on a defined schedule or in response to events. Reliability, observability, and graceful failure handling are the primary engineering concerns. A pipeline that silently produces wrong data is worse than one that visibly fails.
Data Warehouse: A centralised repository of structured data optimised for querying and reporting. Modern cloud data warehouses (Snowflake, BigQuery, Amazon Redshift) are columnar stores capable of querying billions of rows at interactive speeds by separating storage from compute.
Data Lakehouse: An architecture combining the low-cost storage flexibility of a data lake (raw files in object storage like S3 or GCS) with the structured querying capabilities of a data warehouse, using open table formats like Apache Iceberg or Delta Lake. This avoids paying warehouse vendors for cold storage of rarely queried data.
Orchestration: The scheduling and coordination of multiple data pipeline steps or tasks in the correct sequence, with dependency management, retry logic, and error handling. Apache Airflow and Prefect are the primary orchestration tools. Orchestration is the plumbing that ensures pipelines run reliably and in the right order.
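A toy sketch of the two core ideas an orchestrator provides — dependency ordering and retry logic. Real tools (Airflow, Prefect, Dagster) add scheduling, persistence, and observability on top; the task names below are hypothetical:

```python
def run_dag(tasks, deps, max_retries=2):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):  # run dependencies first
            run(upstream)
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries: fail loudly
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "load": lambda: log.append("load"),
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
order = run_dag(tasks, deps)
print(order)  # extract runs before transform, transform before load
```

Even though `load` is listed first, the dependency graph forces the correct execution order — which is exactly what an Airflow DAG expresses declaratively.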
Data Observability: The ability to understand the health and state of data systems in production — detecting schema changes, volume anomalies, freshness issues, and data quality degradation before downstream users are affected.
What a Data Engineer Does Day-to-Day
The daily work varies by company size, maturity, and seniority, but the core activities are consistent across most environments:
Building and maintaining data pipelines: Writing code that extracts data from source systems — transactional databases, SaaS tool APIs (Salesforce, Stripe, HubSpot), event streams from application logs — applies necessary transformations, and loads it into downstream systems. This involves handling failures gracefully, implementing retries and monitoring alerts, and coping with schema drift when source systems change unexpectedly.
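The retry pattern used in extraction code can be sketched as follows; `flaky_fetch` is a hypothetical stand-in for a real database or SaaS API call:

```python
import time

def extract_with_retries(fetch, max_attempts=4, base_delay=0.01):
    """Retry a flaky extraction with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # fail loudly after exhausting retries
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

calls = {"n": 0}

def flaky_fetch():
    # Simulated source that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [{"id": 1, "amount": 42}]

rows = extract_with_retries(flaky_fetch)
print(rows, calls["n"])  # succeeds on the third attempt
```

The key design choice is to retry only transient errors and to re-raise once retries are exhausted, so the orchestrator sees a hard failure rather than silently missing data.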
Data modelling: Designing how data is structured in the data warehouse — table schemas, dimension tables, fact tables, and the relationships between them. This is one of the most intellectually demanding parts of the job. Good data models make downstream analysis fast, intuitive, and trustworthy. Poor models create technical debt that compounds rapidly and is extremely difficult to refactor without breaking downstream reports and dashboards.
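A minimal star-schema sketch: one fact table keyed to one dimension table, queried via sqlite3. The names (`fact_orders`, `dim_customers`) follow common warehouse conventions but are illustrative, not from any specific model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customers (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        region TEXT
    );
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customers(customer_key),
        order_date TEXT,
        amount REAL
    );
""")
conn.executemany("INSERT INTO dim_customers VALUES (?, ?, ?)",
                 [(1, "Acme", "EMEA"), (2, "Globex", "AMER")])
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
                 [(10, 1, "2024-01-05", 100.0),
                  (11, 1, "2024-01-06", 50.0),
                  (12, 2, "2024-01-06", 75.0)])

# Analysts query the model by joining facts to dimensions.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_customers d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('AMER', 75.0), ('EMEA', 150.0)]
```

The value of the model is that every analytical question about revenue by customer attribute reduces to the same fact-to-dimension join, rather than ad-hoc logic per report.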
Writing and maintaining dbt models: dbt (data build tool) has become the standard tool for the transformation layer. Data engineers write SQL-based transformation logic in dbt, defining how raw data becomes the clean, business-ready tables that analysts and data scientists query. dbt also provides documentation, data lineage, and automated testing of transformation outputs.
Infrastructure management: Provisioning and configuring cloud data infrastructure — Snowflake accounts, BigQuery datasets, Airflow deployments, Kafka clusters, Spark on EMR or Databricks. In mature data teams this is increasingly done through infrastructure-as-code tools (Terraform, Pulumi) and automated via CI/CD pipelines, not through manual console clicks.
Data quality and observability: Implementing tests, anomaly detection, and lineage tracking to ensure that data flowing through pipelines is accurate, complete, and trustworthy. This is the practice most commonly skipped under deadline pressure and the one that causes the most painful production incidents later. Tools in this space include Great Expectations, dbt tests, Monte Carlo, and Soda.
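Two of the most common checks — freshness and volume — can be hand-rolled in a few lines. Tools like dbt tests and Great Expectations package the same ideas with lineage and alerting; the thresholds below are arbitrary examples:

```python
from datetime import date

def check_freshness(latest_loaded: date, today: date, max_lag_days: int = 1) -> bool:
    """Pass if the most recent loaded partition is at most max_lag_days old."""
    return (today - latest_loaded).days <= max_lag_days

def check_volume(todays_rows: int, trailing_avg: float, tolerance: float = 0.5) -> bool:
    """Pass if today's row count is within tolerance of the trailing average."""
    return abs(todays_rows - trailing_avg) <= tolerance * trailing_avg

today = date(2024, 6, 2)
print(check_freshness(date(2024, 6, 1), today))   # fresh: 1 day old
print(check_freshness(date(2024, 5, 28), today))  # stale: 5 days old
print(check_volume(950, 1000.0))                  # within tolerance
print(check_volume(100, 1000.0))                  # volume anomaly
```

A failing check should page an engineer before a stakeholder notices a wrong number — which is the entire point of observability tooling.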
Stakeholder collaboration: Working with data scientists, analysts, and business stakeholders to understand data needs, prioritise pipeline work, and ensure that the data models built are actually fit for purpose. In smaller teams this often means sitting in planning sessions with product managers to understand upcoming feature launches that will create new data requirements.
Schema design and data governance: Defining naming conventions, field types, and deprecation policies for tables. In regulated industries this extends to data classification, access controls, and audit trails.
A useful breakdown of how a senior data engineer might allocate time across a typical week: 30 percent building new pipelines or features; 25 percent investigating and fixing data quality issues or pipeline failures; 20 percent reviewing data models, code, and architectural decisions; 15 percent collaborating with stakeholders; 10 percent infrastructure management and monitoring.
Data Engineer vs Data Scientist vs Software Engineer
These three roles are frequently confused by people outside the data field, misrepresented in job postings, and conflated in the hiring practices of smaller companies. The distinctions matter for career planning, team design, and setting realistic expectations.
| Dimension | Data Engineer | Data Scientist | Software Engineer |
|---|---|---|---|
| Primary output | Reliable data pipelines, clean data models | Analytical insights, ML models, statistical analysis | Production software applications, APIs |
| Core languages | Python, SQL, Scala | Python, R, SQL | Python, Java, Go, JavaScript, C++ |
| Key tools | Airflow, dbt, Spark, Snowflake, Kafka | Jupyter, scikit-learn, PyTorch, TensorFlow, Pandas | React, Spring, Docker, Kubernetes, Postgres |
| Systems focus | Data availability, reliability, scalability | Statistical modelling, feature engineering | Application correctness, user experience, API design |
| Depends on | Source systems, cloud infrastructure | Data infrastructure built by data engineers | Product requirements, system architecture |
| Common degree paths | CS, MIS, Statistics | Statistics, Mathematics, CS | CS, Software Engineering |
| Median US salary (2024) | $130,000-$160,000 | $120,000-$155,000 | $125,000-$165,000 |
| AI displacement risk | Low-moderate (tooling automation) | Moderate (automated ML, AutoML) | Moderate (code generation tools) |
| Remote work availability | High | High | Very high |
The most important distinction is the production responsibility split. Data engineers are responsible for the infrastructure that makes data available and correct. Data scientists use that infrastructure to produce analytical work. A data scientist who tries to build their own data infrastructure without engineering discipline typically produces pipelines that work in notebooks and fail in production. A data engineer who tries to do data science without statistical training typically produces models that overfit or answer the wrong question. The roles are complementary, not competitive.
A useful analogy: if an organisation's data infrastructure is a city water system, software engineers build the buildings, data scientists decide where the water should go and at what pressure, and data engineers build and maintain the pipes. The pipes are less visible and less glamorous, but they are what makes everything else possible.
The Modern Data Engineering Tool Stack
The data engineering tool landscape has expanded and matured significantly since 2018. The full stack, layer by layer:
| Layer | Tool | Notes |
|---|---|---|
| Ingestion: Batch connectors | Fivetran, Airbyte, Stitch | Managed connectors for SaaS sources; Airbyte is open source |
| Ingestion: Event streaming | Apache Kafka | Dominant event bus for real-time data; managed via Confluent Cloud |
| Ingestion: Cloud streaming | AWS Kinesis, Google Pub/Sub | Cloud-native streaming alternatives |
| Storage: Cloud data warehouse | Snowflake | Cloud-native, separation of compute/storage, widely adopted in enterprise |
| Storage: Cloud data warehouse | Google BigQuery | Dominant in GCP; serverless billing model |
| Storage: Cloud data warehouse | Amazon Redshift | Established in AWS environments; RA3 nodes allow S3 storage decoupling |
| Storage: Data lakehouse | Databricks (Delta Lake) | Strong for ML-adjacent engineering; Delta Lake format |
| Storage: Table format | Apache Iceberg | Open standard; growing fast for lakehouse architectures |
| Transformation | dbt (data build tool) | Now the standard transformation layer; SQL-based, git-native |
| Distributed processing | Apache Spark | Industry standard for large-scale batch processing |
| Distributed processing: streaming | Apache Flink | Increasingly used for low-latency streaming workloads |
| Orchestration | Apache Airflow | Most widely deployed; Python DAGs; steep learning curve |
| Orchestration: modern | Prefect, Dagster | Better developer experience than Airflow; native testing support |
| Visualisation | Looker | Semantic layer model; strong for governed metrics |
| Visualisation | Tableau | Widely used; strong self-service analytics |
| Visualisation | Power BI | Dominant in Microsoft-stack organisations |
| Data quality / observability | Monte Carlo, Soda, Great Expectations | Anomaly detection, lineage, freshness monitoring |
| Infrastructure as code | Terraform, Pulumi | Pipeline infrastructure provisioned and versioned as code |
| Version control and CI/CD | Git, GitHub Actions, GitLab CI | Data pipelines are code and require the same practices as any codebase |
Not all data engineers use every tool in this stack. Specialisation occurs naturally: engineers working in streaming environments spend more time with Kafka and Flink; those in analytics-focused teams spend more time with dbt and Looker. The cloud platform (AWS, GCP, or Azure) shapes which storage and managed service choices are most natural. Most employers expect depth in at least two or three layers and general familiarity with the rest.
Salary by Career Level and Company Tier
Data engineering salaries have remained robust through the technology industry corrections of 2022-23 that affected some other data roles more severely. Demand has consistently outpaced supply, particularly for engineers with cloud-native and dbt/Airflow skills.
| Level | Years Experience | FAANG / Top Tier | Growth Stage / Mid-Market | Enterprise Non-Tech | Startup (Seed-Series B) |
|---|---|---|---|---|---|
| Junior / Associate | 0-2 | $130,000-$160,000 base | $95,000-$120,000 | $85,000-$110,000 | $90,000-$115,000 |
| Data Engineer | 2-5 | $165,000-$210,000 base | $125,000-$160,000 | $110,000-$140,000 | $115,000-$145,000 |
| Senior Data Engineer | 5-8 | $200,000-$260,000 base | $155,000-$195,000 | $140,000-$175,000 | $140,000-$170,000 |
| Staff / Lead | 8-12 | $240,000-$320,000 base | $185,000-$240,000 | $165,000-$210,000 | $155,000-$195,000 |
| Principal / Architect | 12+ | $280,000-$380,000 base | $210,000-$280,000 | $185,000-$240,000 | $170,000-$220,000 |
Note: FAANG total compensation (base + bonus + RSUs) significantly exceeds base salary figures. A senior data engineer at Google with a $220,000 base may have total compensation of $350,000-$450,000 depending on RSU grant timing and vesting schedule. Growth-stage companies often partially offset lower base pay with equity grants that can be substantial but carry more risk.
UK salary ranges (2024, London-adjusted): Junior £42,000-£62,000; mid £65,000-£90,000; senior £90,000-£130,000; lead/principal £120,000-£165,000+. Finance sector (hedge funds, investment banks, fintech) pays a premium of 20-30% above the technology sector baseline for equivalent experience.
Sources: Levels.fyi Data Engineer Compensation 2024, Stack Overflow Developer Survey 2024, LinkedIn Salary Insights 2024.
Career Path from Junior to Architect
| Level | Typical Scope | Key Responsibilities | What 'Good' Looks Like |
|---|---|---|---|
| Junior / Associate | Well-scoped tasks within existing systems | Adding data sources to established pipelines; building dbt models under review; fixing bugs | Learns team standards quickly; asks good questions; does not break production |
| Data Engineer | Complete pipeline projects independently | Owns new domain pipelines end-to-end; designs data models for new areas; first to mentor juniors | Delivers working, tested, documented pipelines; catches schema issues before they cause incidents |
| Senior Data Engineer | Significant projects with architectural decisions | Leads domain-level architecture; cross-functional collaboration with product and ML; defines team standards | Anticipates downstream impact of design decisions; influences non-engineers effectively |
| Staff / Lead | Platform-level direction across multiple domains | Sets architectural direction for data platform; defines hiring standards; owns cross-team technical decisions | Can make large-scale platform changes that improve entire team productivity; trusted to make tradeoffs with limited oversight |
| Principal / Architect | Organisation-wide technical vision | Evaluates major tool and vendor decisions; influences data strategy at executive level; designs systems lasting 5+ years | Decisions shaped by long-term business context; aware of industry trajectory; respected by external peers |
Parallel tracks exist beyond this ladder. Analytics engineering (closer to dbt and BI, more SQL-focused, less pipeline engineering) suits engineers who prefer working closer to business users. Data platform engineering (closer to infrastructure, Kubernetes, and cloud architecture) suits engineers who enjoy building the foundations others use. ML engineering (intersection of data engineering and model deployment) suits engineers interested in machine learning production systems.
How Data Engineering Evolved From ETL Developer
The job title "data engineer" barely existed before 2012. The work was done under titles like ETL developer, data warehouse developer, or business intelligence engineer. These roles were often tooling-specific (Informatica developer, SSIS developer) and associated primarily with batch processing of structured relational data into reporting databases.
Three forces transformed the role between 2012 and 2020. First, the explosion of data volume from mobile applications and IoT devices made traditional batch ETL architectures too slow and too expensive. Second, the migration of storage and compute to cloud platforms (AWS, GCP, Azure) made it economically viable to store and process vastly larger datasets without capital investment in on-premise hardware. Third, the rise of machine learning as a business function created new consumers of data infrastructure with requirements (feature stores, training pipelines, model monitoring) that traditional BI-focused data warehousing had not designed for.
The result was a role that needed to span software engineering discipline (testing, version control, CI/CD, code review) with data-specific knowledge (modelling, warehousing, streaming) and infrastructure competence (cloud platforms, containerisation, orchestration). The modern data engineer is, in practice, a specialised software engineer who has chosen to focus on data systems.
What Good Data Engineering Looks Like vs Bad
Most data engineering work exists on a spectrum from reliable and maintainable to brittle and opaque. The difference has compounding consequences: good pipelines enable teams to ship faster and with confidence; bad pipelines generate constant incidents, erode trust in data, and consume engineering time on triage rather than new work.
Good data engineering:
- Pipelines fail loudly and immediately rather than silently producing wrong data
- Schema changes in source systems are caught by automated tests, not discovered in a quarterly board report
- Data models are documented, named consistently, and structured so analysts can query them without asking an engineer first
- Transformation logic is in version control, not in a stored procedure written by someone who left the company
- Monitoring alerts on freshness and volume anomalies, not just on pipeline process failures
- Infrastructure is provisioned via code (Terraform), not by clicking in the AWS console
Bad data engineering (common patterns):
- Pipelines that move data but have no automated quality checks
- Data models named with initials, dates, or legacy acronyms with no documentation
- Logic duplicated across multiple pipelines that produces conflicting metrics for the same business concept
- Transformation code in Jupyter notebooks running on someone's laptop
- No alerting until a stakeholder reports a number is wrong
- Ingestion jobs that break whenever the source system changes a field name
The most frequent cause of bad data engineering is not incompetence but time pressure. Pipelines built under sprint deadline pressure get shipped without tests and never revisited. Over time, the maintenance cost of untested pipelines exceeds the cost of building reliable ones from the start, but the cost is distributed across the future and the saving was captured in the past.
Interview Process for Data Engineering Roles
Data engineering interviews vary by company but typically combine three or four of the following stages:
SQL round: Complex SQL problems — window functions, aggregations across multiple tables, CTEs, performance considerations. At strong companies this goes beyond basic joins to require genuine analytical query construction. Preparation: LeetCode SQL section, StrataScratch, Mode SQL tutorial.
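An interview-style window function query — rank each customer's orders by amount — run here through sqlite3 so it is self-contained (SQLite has supported window functions since 3.25; the table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10.0), ("a", 30.0), ("b", 20.0), ("a", 20.0)])

# RANK() restarts per customer thanks to PARTITION BY.
rows = conn.execute("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer, rnk
""").fetchall()
print(rows)  # [('a', 30.0, 1), ('a', 20.0, 2), ('a', 10.0, 3), ('b', 20.0, 1)]
```

Being able to explain why `PARTITION BY` differs from `GROUP BY` (rows are preserved, not collapsed) is exactly the kind of distinction strong SQL rounds probe.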
Python / coding round: Data processing with Python — typically involving pandas or pure Python. May include writing a small data transformation function, parsing a JSON API response, or processing a CSV efficiently. More software-engineering-focused companies include LeetCode-style algorithmic problems.
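A representative coding-round task: flatten a nested JSON API response into tabular rows. The payload shape is invented for illustration:

```python
import json

payload = json.dumps({
    "customer": {"id": 7, "name": "Acme"},
    "orders": [
        {"id": 1, "amount": 100.0},
        {"id": 2, "amount": 50.0},
    ],
})

def flatten_orders(raw: str) -> list[dict]:
    """One output row per order, denormalised with the customer key."""
    data = json.loads(raw)
    cust = data["customer"]
    return [
        {"customer_id": cust["id"], "order_id": o["id"], "amount": o["amount"]}
        for o in data["orders"]
    ]

rows = flatten_orders(payload)
print(rows[0])  # {'customer_id': 7, 'order_id': 1, 'amount': 100.0}
```

Interviewers typically then probe edge cases: what happens when `orders` is empty, a field is missing, or the payload is malformed — the same schema-drift concerns that dominate production pipeline work.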
Data modelling round: Given a business scenario (an e-commerce platform, a SaaS subscription product, a ride-sharing service), design the dimensional model. Define fact and dimension tables, granularity, slowly changing dimensions, and how you would handle late-arriving data. This is often a whiteboard or discussion exercise.
System design for data: Design a data pipeline for a given scenario — for example, a pipeline that ingests clickstream events from a web application, stores them in a data warehouse, and makes a daily active users metric available to a dashboard by 8am. Interviewers expect discussion of ingestion choices (Kafka vs batch), storage choices, transformation layers, orchestration, and failure handling.
Take-home project: Some companies provide a dataset and ask for a dbt project, a pipeline implementation, or an analytical investigation. The expected deliverable is production-quality code with tests, documentation, and a brief writeup.
Strong interview preparation: practice SQL until complex window function queries feel natural; build and document an end-to-end pipeline project you can walk through; read chapters 1-5 of "Fundamentals of Data Engineering" by Joe Reis and Matt Housley.
Practical Takeaways
SQL is the true foundation. Before touching Spark or Kafka, master complex SQL: window functions, CTEs, performance optimisation, and the dimensional modelling patterns (star schema, slowly changing dimensions) used in real warehouse design. Then add Python to the point of writing production-quality code with tests. Then pick one cloud platform and learn its data services well — not superficially, but well enough to make cost and performance tradeoffs with confidence.
The dbt Fundamentals course (free, available at courses.getdbt.com) is the single most valuable free resource for someone entering data engineering in 2024-25. Completing it and building a personal dbt project with real data demonstrates more practical capability than any certification.
Target a first role at a company with a mature data team and existing infrastructure. The first twelve months of exposure to well-designed pipelines, good code review practices, and experienced senior engineers is worth more than two years spent building everything from scratch without guidance. Data Engineering Weekly (newsletter) and the dbt Community Slack are the best free resources for staying current with industry practice.
References
- Stack Overflow, Developer Survey 2024 — Data Engineer Salary Data. stackoverflow.com/survey
- Levels.fyi, Data Engineer Compensation Data (2024). levels.fyi
- dbt Labs, The Analytics Engineering Guide (2024). docs.getdbt.com
- Bureau of Labor Statistics, Database Administrators and Architects (2024). bls.gov
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
- Apache Software Foundation, Airflow Documentation (2024). airflow.apache.org
- Databricks, The Data + AI Survey (2024). databricks.com
- Snowflake, Modern Data Stack Report (2024). snowflake.com
- Reis, Joe and Housley, Matt. Fundamentals of Data Engineering. O'Reilly Media, 2022.
- Tristan Handy, 'The Analytics Engineer' (Fishtown Analytics Blog, 2016). getdbt.com/blog
- LinkedIn Workforce Report, Data Engineering Demand (2024). linkedin.com
- Data Engineering Weekly, Industry Newsletter (2024). dataengineeringweekly.com
Frequently Asked Questions
What is the difference between a data engineer and a data scientist?
A data engineer builds and maintains the infrastructure that makes data available and reliable — pipelines, warehouses, transformation layers, and orchestration systems. A data scientist uses that infrastructure to generate insights, build models, and run analyses. Data engineers focus on data plumbing; data scientists focus on what to do with it once it flows.
What tools does a data engineer use?
The core stack spans ingestion (Kafka, Fivetran, Airbyte), storage (Snowflake, BigQuery, Redshift, Databricks), transformation (dbt), orchestration (Apache Airflow, Prefect), and visualisation (Looker, Tableau). Python and SQL are the universal languages. Cloud platform skills on AWS, GCP, or Azure are expected at every level.
How much does a data engineer earn?
In the US, junior data engineers earn $95,000-$130,000, mid-level $125,000-$165,000, and senior $155,000-$210,000. At FAANG-tier companies, senior total compensation including RSUs reaches $350,000-$450,000. UK ranges run roughly 60-65% of US figures, with finance sector roles paying a 20-30% premium over tech baseline.
Do you need a computer science degree to become a data engineer?
No. Many data engineers come from software engineering, data science, or analytics backgrounds. Strong SQL, Python programming, and a documented end-to-end pipeline project often carry more weight with employers than a specific degree. The dbt Fundamentals course and a personal pipeline project on GitHub are more useful than most certifications.
Is data engineering still a good career in 2025 and 2026?
Yes. Demand for data engineers consistently outpaces supply, and the role has grown more central as organisations depend heavily on data infrastructure. Unlike data science, which saw overhiring and correction in 2022-23, data engineering demand has remained durable. AI tooling is creating automation in some pipeline tasks but is increasing overall data infrastructure demand, not replacing the engineering judgment required.