Data engineering is the invisible infrastructure of the modern data economy. Before a data scientist can build a model, before a business analyst can produce a report, before a machine learning system can make a recommendation, someone has to get the data — from dozens of sources, in inconsistent formats, at varying frequencies — into a state where it can be reliably used. That is the data engineer's job. It is not glamorous by the standards of data science, which captured much of the public imagination in the 2010s. But it is arguably more foundational, and the demand for it has proved far more durable.

The role has evolved rapidly. A decade ago, data engineering was largely a back-office function focused on ETL (extract, transform, load) pipelines and relational database management. Today, data engineers work with cloud-native architectures, streaming data systems, distributed computing frameworks, and sophisticated transformation layers that sit much closer to the analytical and ML workflows they serve. The tooling has changed dramatically — Hadoop gave way to Spark, on-premise warehouses gave way to Snowflake and BigQuery, hand-rolled schedulers gave way to Airflow and Prefect, and SQL-in-scripts gave way to dbt. The expectations have risen with the tooling, and the compensation has followed.

This article explains what data engineers actually do, how the role differs from the data scientist and software engineer roles with which it is frequently confused, what the full modern tool stack looks like at each layer, what salary ranges are realistic across career levels and company tiers, and what the realistic path into the field requires. It also covers how to evaluate whether data engineering is a genuinely good fit compared with adjacent paths, and what separates high-quality data engineering practice from the mediocre work that is more common than most job postings acknowledge.

"Data quality is not a data science problem. It is a data engineering problem. You cannot analyse data you cannot trust, and building systems that produce trustworthy data is genuinely hard." — Tristan Handy, founder of dbt Labs


Key Definitions

ETL (Extract, Transform, Load): The traditional process of extracting data from source systems, transforming it into the required format, and loading it into a target storage system. Modern variants include ELT, where raw data is loaded first and transformation happens after storage — enabled by the cheap compute available in cloud data warehouses.

Data Pipeline: An automated system that moves data from source systems to storage and processing systems on a defined schedule or in response to events. Reliability, observability, and graceful failure handling are the primary engineering concerns. A pipeline that silently produces wrong data is worse than one that visibly fails.

Data Warehouse: A centralised repository of structured data optimised for querying and reporting. Modern cloud data warehouses (Snowflake, BigQuery, Amazon Redshift) are columnar stores capable of querying billions of rows at interactive speeds by separating storage from compute.

Data Lakehouse: An architecture combining the low-cost storage flexibility of a data lake (raw files in object storage like S3 or GCS) with the structured querying capabilities of a data warehouse, using open table formats like Apache Iceberg or Delta Lake. This avoids paying warehouse vendors for cold storage of rarely queried data.

Orchestration: The scheduling and coordination of multiple data pipeline steps or tasks in the correct sequence, with dependency management, retry logic, and error handling. Apache Airflow and Prefect are the primary orchestration tools. Orchestration is the plumbing that ensures pipelines run reliably and in the right order.
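The two core orchestration concepts — dependency ordering and retry logic — can be sketched in a few lines of plain Python. This is a toy illustration of what Airflow and Prefect do at much greater scale, not their actual APIs; every name below is invented:

```python
import time

def run_with_retries(task, retries=2, delay=0.0):
    """Run a task callable, retrying on failure before giving up loudly."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # fail loudly: never swallow a pipeline error
            time.sleep(delay)

def topological_order(deps):
    """Order tasks so every task runs after its dependencies.
    deps maps task name -> list of upstream task names."""
    ordered, seen = [], set()
    def visit(name):
        if name in seen:
            return
        for upstream in deps.get(name, []):
            visit(upstream)
        seen.add(name)
        ordered.append(name)
    for name in deps:
        visit(name)
    return ordered

# Example DAG: extract -> transform -> load
deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
print(topological_order(deps))  # ['extract', 'transform', 'load']
```

Real orchestrators add what this sketch omits: cycle detection, parallel execution of independent tasks, persistence of run state, and alerting on terminal failure.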

Data Observability: The ability to understand the health and state of data systems in production — detecting schema changes, volume anomalies, freshness issues, and data quality degradation before downstream users are affected.


What a Data Engineer Does Day-to-Day

The daily work varies by company size, maturity, and seniority, but the core activities are consistent across most environments:

Building and maintaining data pipelines: Writing code that extracts data from source systems — transactional databases, SaaS tool APIs (Salesforce, Stripe, HubSpot), event streams from application logs — applies necessary transformations, and loads it into downstream systems. This involves handling failure cases, implementing retries and monitoring alerts, and coping with schema drift when source systems change unexpectedly.
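Schema drift handling often starts with a guard that compares incoming records against the columns the pipeline expects and fails loudly at ingestion time. A minimal pure-Python sketch, with an invented payload shape:

```python
def check_schema(records, expected_columns):
    """Fail loudly if the source payload's columns drift from what the
    pipeline expects, instead of silently loading malformed rows.
    records: list of dicts, e.g. as parsed from an API extract."""
    expected = set(expected_columns)
    for i, row in enumerate(records):
        observed = set(row)
        missing = expected - observed
        unexpected = observed - expected
        if missing or unexpected:
            raise ValueError(
                f"Schema drift at record {i}: "
                f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
            )
    return records

rows = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}]
check_schema(rows, ["id", "amount"])  # passes silently
```

Production pipelines usually route drifted records to a quarantine table rather than stopping the whole load, but the principle — detect and surface the change, never guess — is the same.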

Data modelling: Designing how data is structured in the data warehouse — table schemas, dimension tables, fact tables, and the relationships between them. This is one of the most intellectually demanding parts of the job. Good data models make downstream analysis fast, intuitive, and trustworthy. Poor models create technical debt that compounds rapidly and is extremely difficult to refactor without breaking downstream reports and dashboards.
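A minimal star schema — one fact table at order-item grain joined to customer and date dimensions — can be sketched with Python's built-in SQLite. The table and column names are illustrative, not drawn from any particular warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one row per customer, descriptive attributes
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        country TEXT
    );
    -- Dimension: one row per calendar date
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        full_date TEXT,
        is_weekend INTEGER
    );
    -- Fact: one row per order item (the declared grain), numeric measures
    CREATE TABLE fct_orders (
        order_id INTEGER,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme Ltd', 'GB')")
conn.execute("INSERT INTO dim_date VALUES (20240601, '2024-06-01', 1)")
conn.execute("INSERT INTO fct_orders VALUES (100, 1, 20240601, 250.0)")

# The payoff: analytical queries become a join from fact to dimension
total = conn.execute("""
    SELECT c.country, SUM(f.amount)
    FROM fct_orders f JOIN dim_customer c USING (customer_key)
    GROUP BY c.country
""").fetchone()
print(total)  # ('GB', 250.0)
```

The design decision that matters most here is the declared grain of the fact table; every measure and foreign key must be consistent with it.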

Writing and maintaining dbt models: dbt (data build tool) has become the standard tool for the transformation layer. Data engineers write SQL-based transformation logic in dbt, defining how raw data becomes the clean, business-ready tables that analysts and data scientists query. dbt also provides documentation, data lineage, and automated testing of transformation outputs.

Infrastructure management: Provisioning and configuring cloud data infrastructure — Snowflake accounts, BigQuery datasets, Airflow deployments, Kafka clusters, Spark on EMR or Databricks. In mature data teams this is increasingly done through infrastructure-as-code tools (Terraform, Pulumi) and automated via CI/CD pipelines, not through manual console clicks.

Data quality and observability: Implementing tests, anomaly detection, and lineage tracking to ensure that data flowing through pipelines is accurate, complete, and trustworthy. This is the practice most commonly skipped under deadline pressure and the one that causes the most painful production incidents later. Tools in this space include Great Expectations, dbt tests, Monte Carlo, and Soda.
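The freshness and volume checks these tools provide reduce to comparisons against stated expectations. A pure-Python sketch of the underlying idea — not any vendor's API, and with invented thresholds:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_loaded_at, max_age_hours=24):
    """Alert if the newest loaded row is older than the freshness SLA."""
    age = datetime.now(timezone.utc) - latest_loaded_at
    if age > timedelta(hours=max_age_hours):
        raise RuntimeError(f"Data is stale: last load {age} ago")

def check_volume(row_count, expected, tolerance=0.5):
    """Alert if today's row count deviates sharply from the recent norm.
    expected: trailing average daily row count for this table."""
    if expected and abs(row_count - expected) / expected > tolerance:
        raise RuntimeError(
            f"Volume anomaly: got {row_count} rows, expected ~{expected}"
        )

check_volume(row_count=10_500, expected=10_000)   # within tolerance, passes
check_freshness(datetime.now(timezone.utc))       # fresh, passes
```

Dedicated observability platforms learn the expected values and tolerances from history rather than hardcoding them, and add lineage so an alert points at the affected downstream models.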

Stakeholder collaboration: Working with data scientists, analysts, and business stakeholders to understand data needs, prioritise pipeline work, and ensure that the data models built are actually fit for purpose. In smaller teams this often means sitting in planning sessions with product managers to understand upcoming feature launches that will create new data requirements.

Schema design and data governance: Defining naming conventions, field types, and deprecation policies for tables. In regulated industries this extends to data classification, access controls, and audit trails.

A useful breakdown of how a senior data engineer might allocate time across a typical week: 30 percent building new pipelines or features; 25 percent investigating and fixing data quality issues or pipeline failures; 20 percent reviewing data models, code, and architectural decisions; 15 percent collaborating with stakeholders; 10 percent infrastructure management and monitoring.


Data Engineer vs Data Scientist vs Software Engineer

These three roles are frequently confused by people outside the data field, misrepresented in job postings, and conflated in the hiring practices of smaller companies. The distinctions matter for career planning, team design, and setting realistic expectations.

Dimension | Data Engineer | Data Scientist | Software Engineer
Primary output | Reliable data pipelines, clean data models | Analytical insights, ML models, statistical analysis | Production software applications, APIs
Core languages | Python, SQL, Scala | Python, R, SQL | Python, Java, Go, JavaScript, C++
Key tools | Airflow, dbt, Spark, Snowflake, Kafka | Jupyter, scikit-learn, PyTorch, TensorFlow, Pandas | React, Spring, Docker, Kubernetes, Postgres
Systems focus | Data availability, reliability, scalability | Statistical modelling, feature engineering | Application correctness, user experience, API design
Depends on | Source systems, cloud infrastructure | Data infrastructure built by data engineers | Product requirements, system architecture
Common degree path | CS, MIS, Statistics | Statistics, Mathematics, CS | CS, Software Engineering
Median US salary (2024) | $130,000-$160,000 | $120,000-$155,000 | $125,000-$165,000
AI displacement risk | Low-moderate (tooling automation) | Moderate (automated ML, AutoML) | Moderate (code generation tools)
Remote work availability | High | High | Very high

The most important distinction is the production responsibility split. Data engineers are responsible for the infrastructure that makes data available and correct. Data scientists use that infrastructure to produce analytical work. A data scientist who tries to build their own data infrastructure without engineering discipline typically produces pipelines that work in notebooks and fail in production. A data engineer who tries to do data science without statistical training typically produces models that overfit or answer the wrong question. The roles are complementary, not competitive.

A useful analogy: if an organisation's data infrastructure is a city water system, software engineers build the buildings, data scientists decide where the water should go and at what pressure, and data engineers build and maintain the pipes. The pipes are less visible and less glamorous, but they are what makes everything else possible.


The Modern Data Engineering Tool Stack

The data engineering tool landscape has expanded and matured significantly since 2018. The full stack, layer by layer:

Layer | Tool | Notes
Ingestion: batch connectors | Fivetran, Airbyte, Stitch | Managed connectors for SaaS sources; Airbyte is open source
Ingestion: event streaming | Apache Kafka | Dominant event bus for real-time data; managed via Confluent Cloud
Ingestion: cloud streaming | AWS Kinesis, Google Pub/Sub | Cloud-native streaming alternatives
Storage: cloud data warehouse | Snowflake | Cloud-native, separation of compute/storage, widely adopted in enterprise
Storage: cloud data warehouse | Google BigQuery | Dominant in GCP; serverless billing model
Storage: cloud data warehouse | Amazon Redshift | Established in AWS environments; RA3 nodes allow S3 storage decoupling
Storage: data lakehouse | Databricks (Delta Lake) | Strong for ML-adjacent engineering; Delta Lake format
Storage: table format | Apache Iceberg | Open standard; growing fast for lakehouse architectures
Transformation | dbt (data build tool) | Now the standard transformation layer; SQL-based, git-native
Distributed processing | Apache Spark | Industry standard for large-scale batch processing
Distributed processing: streaming | Apache Flink | Increasingly used for low-latency streaming workloads
Orchestration | Apache Airflow | Most widely deployed; Python DAGs; steep learning curve
Orchestration: modern | Prefect, Dagster | Better developer experience than Airflow; native testing support
Visualisation | Looker | Semantic layer model; strong for governed metrics
Visualisation | Tableau | Widely used; strong self-service analytics
Visualisation | Power BI | Dominant in Microsoft-stack organisations
Data quality / observability | Monte Carlo, Soda, Great Expectations | Anomaly detection, lineage, freshness monitoring
Infrastructure as code | Terraform, Pulumi | Pipeline infrastructure provisioned and versioned as code
Version control and CI/CD | Git, GitHub Actions, GitLab CI | Data pipelines are code and require the same practices as any codebase

Not all data engineers use every tool in this stack. Specialisation occurs naturally: engineers working in streaming environments spend more time with Kafka and Flink; those in analytics-focused teams spend more time with dbt and Looker. The cloud platform (AWS, GCP, or Azure) shapes which storage and managed service choices are most natural. Most employers expect depth in at least two or three layers and general familiarity with the rest.


Salary by Career Level and Company Tier

Data engineering salaries have remained robust through the technology industry corrections of 2022-23 that affected some other data roles more severely. Demand has consistently outpaced supply, particularly for engineers with cloud-native and dbt/Airflow skills.

Level | Years Experience | FAANG / Top Tier | Growth Stage / Mid-Market | Enterprise Non-Tech | Startup (Seed-Series B)
Junior / Associate | 0-2 | $130,000-$160,000 base | $95,000-$120,000 | $85,000-$110,000 | $90,000-$115,000
Data Engineer | 2-5 | $165,000-$210,000 base | $125,000-$160,000 | $110,000-$140,000 | $115,000-$145,000
Senior Data Engineer | 5-8 | $200,000-$260,000 base | $155,000-$195,000 | $140,000-$175,000 | $140,000-$170,000
Staff / Lead | 8-12 | $240,000-$320,000 base | $185,000-$240,000 | $165,000-$210,000 | $155,000-$195,000
Principal / Architect | 12+ | $280,000-$380,000 base | $210,000-$280,000 | $185,000-$240,000 | $170,000-$220,000

Note: FAANG total compensation (base + bonus + RSUs) significantly exceeds base salary figures. A senior data engineer at Google with a $220,000 base may have total compensation of $350,000-$450,000 depending on RSU grant timing and vesting schedule. Growth-stage companies often partially offset lower base pay with equity grants that can be substantial but carry more risk.

UK salary ranges (2024, London-adjusted): Junior £42,000-£62,000; mid £65,000-£90,000; senior £90,000-£130,000; lead/principal £120,000-£165,000+. Finance sector (hedge funds, investment banks, fintech) pays a premium of 20-30% above the technology sector baseline for equivalent experience.

Sources: Levels.fyi Data Engineer Compensation 2024, Stack Overflow Developer Survey 2024, LinkedIn Salary Insights 2024.


Career Path from Junior to Architect

Level | Typical Scope | Key Responsibilities | What 'Good' Looks Like
Junior / Associate | Well-scoped tasks within existing systems | Adding data sources to established pipelines; building dbt models under review; fixing bugs | Learns team standards quickly; asks good questions; does not break production
Data Engineer | Complete pipeline projects independently | Owns new domain pipelines end-to-end; designs data models for new areas; first to mentor juniors | Delivers working, tested, documented pipelines; catches schema issues before they cause incidents
Senior Data Engineer | Significant projects with architectural decisions | Leads domain-level architecture; cross-functional collaboration with product and ML; defines team standards | Anticipates downstream impact of design decisions; influences non-engineers effectively
Staff / Lead | Platform-level direction across multiple domains | Sets architectural direction for data platform; defines hiring standards; owns cross-team technical decisions | Can make large-scale platform changes that improve entire team productivity; trusted to make tradeoffs with limited oversight
Principal / Architect | Organisation-wide technical vision | Evaluates major tool and vendor decisions; influences data strategy at executive level; designs systems lasting 5+ years | Decisions shaped by long-term business context; aware of industry trajectory; respected by external peers

Parallel tracks exist beyond this ladder. Analytics engineering (closer to dbt and BI, more SQL-focused, less pipeline engineering) suits engineers who prefer working closer to business users. Data platform engineering (closer to infrastructure, Kubernetes, and cloud architecture) suits engineers who enjoy building the foundations others use. ML engineering (intersection of data engineering and model deployment) suits engineers interested in machine learning production systems.


How Data Engineering Evolved From ETL Developer

The job title "data engineer" barely existed before 2012. The work was done under titles like ETL developer, data warehouse developer, or business intelligence engineer. These roles were often tooling-specific (Informatica developer, SSIS developer) and associated primarily with batch processing of structured relational data into reporting databases.

Three forces transformed the role between 2012 and 2020. First, the explosion of data volume from mobile applications and IoT devices made traditional batch ETL architectures too slow and too expensive. Second, the migration of storage and compute to cloud platforms (AWS, GCP, Azure) made it economically viable to store and process vastly larger datasets without capital investment in on-premise hardware. Third, the rise of machine learning as a business function created new consumers of data infrastructure with requirements (feature stores, training pipelines, model monitoring) that traditional BI-focused data warehousing had not designed for.

The result was a role that needed to combine software engineering discipline (testing, version control, CI/CD, code review) with data-specific knowledge (modelling, warehousing, streaming) and infrastructure competence (cloud platforms, containerisation, orchestration). The modern data engineer is, in practice, a specialised software engineer who has chosen to focus on data systems.


What Good Data Engineering Looks Like vs Bad

Most data engineering work exists on a spectrum from reliable and maintainable to brittle and opaque. The difference has compounding consequences: good pipelines enable teams to ship faster and with confidence; bad pipelines generate constant incidents, erode trust in data, and consume engineering time on triage rather than new work.

Good data engineering:

  • Pipelines fail loudly and immediately rather than silently producing wrong data
  • Schema changes in source systems are caught by automated tests, not discovered in a quarterly board report
  • Data models are documented, named consistently, and structured so analysts can query them without asking an engineer first
  • Transformation logic is in version control, not in a stored procedure written by someone who left the company
  • Monitoring alerts on freshness and volume anomalies, not just on pipeline process failures
  • Infrastructure is provisioned via code (Terraform), not by clicking in the AWS console

Bad data engineering (common patterns):

  • Pipelines that move data but have no automated quality checks
  • Data models named with initials, dates, or legacy acronyms with no documentation
  • Logic duplicated across multiple pipelines that produces conflicting metrics for the same business concept
  • Transformation code in Jupyter notebooks running on someone's laptop
  • No alerting until a stakeholder reports a number is wrong
  • Ingestion jobs that break whenever the source system changes a field name

The most frequent cause of bad data engineering is not incompetence but time pressure. Pipelines built under sprint deadline pressure get shipped without tests and never revisited. Over time, the maintenance cost of untested pipelines exceeds the cost of building reliable ones from the start, but the cost is distributed across the future and the saving was captured in the past.


Interview Process for Data Engineering Roles

Data engineering interviews vary by company but typically include three to four stages:

SQL round: Complex SQL problems — window functions, aggregations across multiple tables, CTEs, performance considerations. At strong companies this goes beyond basic joins to require genuine analytical query construction. Preparation: LeetCode SQL section, StrataScratch, Mode SQL tutorial.
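A representative exercise at this level — pick each customer's most recent order — combines a CTE with a window function. It is run here through Python's bundled SQLite (which supports window functions from SQLite 3.25 onward); the schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', '2024-01-05', 20.0),
        ('alice', '2024-03-09', 35.0),
        ('bob',   '2024-02-11', 15.0);
""")
# CTE + ROW_NUMBER() to rank each customer's orders, newest first,
# then keep only the top-ranked row per customer
rows = conn.execute("""
    WITH ranked AS (
        SELECT customer, order_date, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    SELECT customer, order_date, amount FROM ranked
    WHERE rn = 1
    ORDER BY customer
""").fetchall()
print(rows)  # [('alice', '2024-03-09', 35.0), ('bob', '2024-02-11', 15.0)]
```

Interviewers often follow up by asking how the query behaves with ties on order_date, or how it would be rewritten with a correlated subquery instead of a window function.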

Python / coding round: Data processing with Python — typically involving pandas or pure Python. May include writing a small data transformation function, parsing a JSON API response, or processing a CSV efficiently. More software-engineering-focused companies include LeetCode-style algorithmic problems.
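A typical task in this round: flatten a nested JSON API response into rows ready for loading. The payload shape below is invented for illustration:

```python
import json

def flatten_orders(payload_json):
    """Turn a nested API response into flat rows, one per order item."""
    payload = json.loads(payload_json)
    rows = []
    for order in payload["orders"]:
        for item in order["items"]:
            rows.append({
                "order_id": order["id"],
                "sku": item["sku"],
                "quantity": item["qty"],
            })
    return rows

raw = '{"orders": [{"id": 7, "items": [{"sku": "A1", "qty": 2}]}]}'
print(flatten_orders(raw))
# [{'order_id': 7, 'sku': 'A1', 'quantity': 2}]
```

Strong answers also handle missing keys explicitly (fail loudly or route to a dead-letter path) rather than letting a KeyError surface with no context.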

Data modelling round: Given a business scenario (an e-commerce platform, a SaaS subscription product, a ride-sharing service), design the dimensional model. Define fact and dimension tables, granularity, slowly changing dimensions, and how you would handle late-arriving data. This is often a whiteboard or discussion exercise.

System design for data: Design a data pipeline for a given scenario — for example, design a pipeline that ingests clickstream events from a web application, stores them in a data warehouse, and makes a daily active users metric available to a dashboard by 8am. Expects discussion of ingestion choices (Kafka vs batch), storage choices, transformation layers, orchestration, and failure handling.
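The metric at the end of that example pipeline reduces to counting distinct users per day. A pure-Python sketch of the daily-active-users aggregation (in a real warehouse this would be a SQL transformation over the loaded events):

```python
from collections import defaultdict

def daily_active_users(events):
    """Compute DAU from clickstream events.
    events: iterable of (user_id, iso_timestamp) tuples."""
    users_by_day = defaultdict(set)
    for user_id, ts in events:
        day = ts[:10]  # 'YYYY-MM-DD' prefix of an ISO-8601 timestamp
        users_by_day[day].add(user_id)
    # Distinct users per day, so repeat visits are counted once
    return {day: len(users) for day, users in sorted(users_by_day.items())}

events = [
    ("u1", "2024-06-01T08:00:00Z"),
    ("u2", "2024-06-01T09:30:00Z"),
    ("u1", "2024-06-01T17:45:00Z"),  # repeat visit, counted once
    ("u1", "2024-06-02T10:00:00Z"),
]
print(daily_active_users(events))  # {'2024-06-01': 2, '2024-06-02': 1}
```

The interesting design discussion sits around this function, not inside it: late-arriving events, timezone of the day boundary, and whether an 8am SLA allows batch loading or requires streaming ingestion.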

Take-home project: Some companies provide a dataset and ask for a dbt project, a pipeline implementation, or an analytical investigation. Expected to be production-quality code with tests, documentation, and a brief writeup.

Strong interview preparation: practice SQL until complex window function queries feel natural; build and document an end-to-end pipeline project you can walk through; read chapters 1-5 of "Fundamentals of Data Engineering" by Joe Reis and Matt Housley.


Practical Takeaways

SQL is the true foundation. Before touching Spark or Kafka, master complex SQL: window functions, CTEs, performance optimisation, and the dimensional modelling patterns (star schema, slowly changing dimensions) used in real warehouse design. Then add Python to the point of writing production-quality code with tests. Then pick one cloud platform and learn its data services well — not superficially, but well enough to make cost and performance tradeoffs with confidence.

The dbt Fundamentals course (free, available at courses.getdbt.com) is the single most valuable free resource for someone entering data engineering in 2024-25. Completing it and building a personal dbt project with real data demonstrates more practical capability than any certification.

Target a first role at a company with a mature data team and existing infrastructure. The first twelve months of exposure to well-designed pipelines, good code review practices, and experienced senior engineers is worth more than two years spent building everything from scratch without guidance. Data Engineering Weekly (newsletter) and the dbt Community Slack are the best free resources for staying current with industry practice.


References

  1. Stack Overflow, Developer Survey 2024 — Data Engineer Salary Data. stackoverflow.com/survey
  2. Levels.fyi, Data Engineer Compensation Data (2024). levels.fyi
  3. dbt Labs, The Analytics Engineering Guide (2024). docs.getdbt.com
  4. Bureau of Labor Statistics, Database Administrators and Architects (2024). bls.gov
  5. Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
  6. Apache Software Foundation, Airflow Documentation (2024). airflow.apache.org
  7. Databricks, The Data + AI Survey (2024). databricks.com
  8. Snowflake, Modern Data Stack Report (2024). snowflake.com
  9. Reis, Joe and Housley, Matt. Fundamentals of Data Engineering. O'Reilly Media, 2022.
  10. Tristan Handy, 'The Analytics Engineer' (Fishtown Analytics Blog, 2016). getdbt.com/blog
  11. LinkedIn Workforce Report, Data Engineering Demand (2024). linkedin.com
  12. Data Engineering Weekly, Industry Newsletter (2024). dataengineeringweekly.com

Frequently Asked Questions

What is the difference between a data engineer and a data scientist?

A data engineer builds and maintains the infrastructure that makes data available and reliable — pipelines, warehouses, transformation layers, and orchestration systems. A data scientist uses that infrastructure to generate insights, build models, and run analyses. Data engineers focus on data plumbing; data scientists focus on what to do with it once it flows.

What tools does a data engineer use?

The core stack spans ingestion (Kafka, Fivetran, Airbyte), storage (Snowflake, BigQuery, Redshift, Databricks), transformation (dbt), orchestration (Apache Airflow, Prefect), and visualisation (Looker, Tableau). Python and SQL are the universal languages. Cloud platform skills on AWS, GCP, or Azure are expected at every level.

How much does a data engineer earn?

In the US, junior data engineers earn $95,000-$130,000, mid-level $125,000-$165,000, and senior $155,000-$210,000. At FAANG-tier companies, senior total compensation including RSUs reaches $350,000-$450,000. UK ranges run roughly 60-65% of US figures, with finance sector roles paying a 20-30% premium over tech baseline.

Do you need a computer science degree to become a data engineer?

No. Many data engineers come from software engineering, data science, or analytics backgrounds. Strong SQL, Python programming, and a documented end-to-end pipeline project often carry more weight with employers than a specific degree. The dbt Fundamentals course and a personal pipeline project on GitHub are more useful than most certifications.

Is data engineering still a good career in 2025 and 2026?

Yes. Demand for data engineers consistently outpaces supply, and the role has grown more central as organisations depend heavily on data infrastructure. Unlike data science, which saw overhiring and correction in 2022-23, data engineering demand has remained durable. AI tooling is creating automation in some pipeline tasks but is increasing overall data infrastructure demand, not replacing the engineering judgment required.