Build the foundational data engineering skills and professional habits that allow you to deliver well-scoped data work independently and reliably.
SQL and Data Modelling Fundamentals
SQL is the lingua franca of data engineering, and mastery of it separates competent data engineers from great ones. Beyond basic querying, you need to understand joins, window functions, CTEs, query planning, and indexing. Data modelling - knowing when to normalise and when not to - is equally foundational.
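A minimal sketch of two of the constructs named above - a CTE and a window function - run against an in-memory SQLite database from Python. The table and column names are illustrative, not from any real schema.

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 10.0), ('alice', 30.0), ('bob', 20.0);
""")

# The CTE aggregates per customer; the window function then ranks
# customers by total spend without a second round of grouping.
rows = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total,
           RANK() OVER (ORDER BY total DESC) AS spend_rank
    FROM totals
    ORDER BY spend_rank
""").fetchall()

print(rows)  # [('alice', 40.0, 1), ('bob', 20.0, 2)]
```

Note that window functions require SQLite 3.25 or later, which ships with all recent Python builds.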
Pipeline Basics
Data pipelines are the primary artefact of a data engineer. Learn to build reliable pipelines that handle failures gracefully, produce observable outputs, and can be re-run safely. Idempotency is not a nice-to-have - it is a requirement for any pipeline that runs in production.
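As a sketch of what idempotency means in practice, the load step below can be re-run after a partial failure without creating duplicates, because it upserts against a natural primary key. SQLite stands in for a warehouse; the table and key are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total REAL)")

def load(rows):
    # INSERT OR REPLACE keyed on `day` makes the write safe to repeat:
    # a retry overwrites rather than duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", rows
    )
    conn.commit()

batch = [("2024-01-01", 100.0), ("2024-01-02", 250.0)]
load(batch)
load(batch)  # simulate a retry: same final state, no duplicates

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 2
```

The same pattern appears in production as MERGE/upsert statements or as overwrite-by-partition loads.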
Data Quality Thinking
Bad data causes bad decisions. Develop the habit of questioning data from the source - understanding schema, checking for nulls, validating distributions, and testing assumptions explicitly. Data quality is an engineering discipline, not a data stewardship afterthought.
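A minimal sketch of what "testing assumptions explicitly" can look like, using only the standard library: schema, null-rate, and range checks that fail loudly instead of letting bad rows flow downstream. The field names and thresholds are illustrative.

```python
# An incoming batch, as it might look after ingestion.
records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 29},
    {"user_id": 3, "age": None},
]

# Schema check: every record has the expected fields.
assert all({"user_id", "age"} <= r.keys() for r in records)

# Null check: measure, then fail loudly if the rate breaches a threshold.
null_rate = sum(r["age"] is None for r in records) / len(records)
assert null_rate <= 0.5, f"age null rate too high: {null_rate:.0%}"

# Range check: values must be plausible, not merely present.
ages = [r["age"] for r in records if r["age"] is not None]
assert all(0 < a < 120 for a in ages), "age outside plausible range"
print("checks passed")
```

Frameworks such as Great Expectations or dbt tests industrialise this pattern, but the habit of writing the checks at all comes first.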
Python for Data Engineering
Python is the dominant scripting language for data engineering. Beyond pandas for exploration, you need to be comfortable with file I/O, API clients, error handling, logging, and packaging code as reusable modules rather than one-off notebooks.
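A sketch of the "module, not notebook" habit: a reusable function with logging and explicit error handling, fed here by an in-memory CSV stream so the example is self-contained. The file layout and field names are illustrative.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest")

def parse_amounts(fh):
    """Read a CSV stream, returning only rows with a valid amount."""
    good = []
    for row in csv.DictReader(fh):
        try:
            row["amount"] = float(row["amount"])
        except (KeyError, ValueError):
            # Log and skip malformed rows rather than crash the run.
            logger.warning("skipping malformed row: %r", row)
            continue
        good.append(row)
    return good

sample = io.StringIO("id,amount\n1,10.5\n2,not-a-number\n3,7\n")
rows = parse_amounts(sample)
print(len(rows))  # 2
```

Because `parse_amounts` takes any file-like object, the same function works on a local file, an object-storage download, or a test fixture - which is precisely what makes it reusable and testable.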
Cloud Data Fundamentals
Modern data engineering happens in the cloud. Build working knowledge of at least one cloud data platform - object storage, managed compute, a warehouse or lakehouse service - understanding the cost and performance trade-offs of different approaches.
Skills to Develop
Behaviours to Demonstrate
Use AI to explain SQL query plans and optimisation strategies for queries you write, then verify the explanations by running the queries with different structures and measuring the difference.
Practice using AI to help you debug pipeline failures - describe the error, the data, and what you expected, and evaluate how reliably it identifies the root cause.
Use AI to generate synthetic test data for pipelines you are building, then validate that the generated data has the statistical properties you need for a meaningful test.
Learn to use AI for data exploration - generating summary queries, distribution checks, and anomaly detection queries - but always verify AI-suggested queries against the actual schema before running them.
Be aware that AI tools can confidently generate incorrect SQL for dialects they are less familiar with - always test AI-generated queries against real data before trusting the output.
Use AI to help write pipeline documentation by generating a first draft from code, then edit it for accuracy and completeness.
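The synthetic-data point above can be sketched concretely: generate a sample, then assert the statistical properties the test depends on before trusting it, whether the generator was human- or AI-written. The distribution parameters and tolerances here are illustrative assumptions.

```python
import random
import statistics

# Deterministic seed so the validation itself is reproducible.
random.seed(42)
amounts = [random.gauss(mu=50.0, sigma=10.0) for _ in range(10_000)]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Validate, don't trust: a generator can be subtly wrong, so check the
# properties the downstream pipeline test actually relies on.
assert abs(mean - 50.0) < 1.0, f"mean drifted: {mean:.2f}"
assert abs(stdev - 10.0) < 1.0, f"stdev drifted: {stdev:.2f}"
print("synthetic data OK")
```

For real pipelines the checks would extend to categorical frequencies, key uniqueness, and referential integrity, but the principle is the same: the test data must be shown to have the shape the test assumes.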
Fundamentals of Data Engineering
The most comprehensive and current treatment of the data engineering discipline - covers the full data lifecycle, architecture patterns, and how to think about the field as a whole.
Data Pipelines Pocket Reference
Practical and immediately applicable to building real pipelines - covers ingestion, transformation, storage, and the operational concerns that textbooks often skip.
Learning SQL
A thorough SQL foundation that goes beyond SELECT statements to cover the set-based thinking that makes SQL queries genuinely powerful.
Python for Data Analysis
Written by the creator of pandas, this is the definitive reference for data manipulation in Python - essential for a junior data engineer.
Data Engineering Fundamentals
Covers the core concepts and tools of modern data engineering with practical exercises rather than just theory.
dbt Fundamentals
dbt has become the standard tool for transformation in modern data stacks - the free official course is the best starting point.
SQL for Data Science
Builds SQL skills from first principles with real-world data exercises that go beyond toy examples.
Apache Airflow Fundamentals
Airflow is the most widely adopted workflow orchestrator, and this course builds practical skills for building, debugging, and operating DAGs.
Review the full expectations for both roles to understand exactly what good looks like at each level.
→ Graduate Data Engineer Archetype → Junior Data Engineer Archetype