Design and own complete data pipelines independently, build data quality frameworks, and begin understanding how data flows and is governed across the wider platform.
Complete Pipeline Ownership
An intermediate data engineer designs, builds, operates, and iterates on pipelines independently. This means owning the full lifecycle - from source system understanding through transformation logic to consumer SLAs. When a pipeline breaks in production, you are the one who diagnoses it, fixes it, and prevents it from happening again.
Data Quality as Engineering
Move beyond ad-hoc data quality checks to systematic quality frameworks - schema validation, referential integrity checks, statistical profiling, freshness monitoring, and consumer alerting. Great data quality is not achieved by checking harder - it is achieved by designing pipelines and contracts that make quality failures visible and fast to fix.
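Two of the checks named above - schema validation and freshness monitoring - can be sketched in a few lines. This is an illustrative skeleton, not a production framework: the table shape, column names, and the one-hour SLA are all made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema for an "orders" feed (illustrative only).
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float, "loaded_at": str}

def check_schema(rows: list[dict]) -> list[str]:
    """Flag rows with missing columns or wrong types."""
    failures = []
    for i, row in enumerate(rows):
        for col, typ in EXPECTED_SCHEMA.items():
            if col not in row:
                failures.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], typ):
                failures.append(
                    f"row {i}: {col!r} is {type(row[col]).__name__}, expected {typ.__name__}"
                )
    return failures

def check_freshness(rows: list[dict], max_age: timedelta) -> list[str]:
    """Flag the batch if its newest record breaches the freshness SLA."""
    newest = max(datetime.fromisoformat(r["loaded_at"]) for r in rows)
    age = datetime.now(timezone.utc) - newest
    return [f"stale: newest record is {age} old"] if age > max_age else []

now = datetime.now(timezone.utc).isoformat()
rows = [
    {"order_id": 1, "customer_id": 10, "amount": 9.99, "loaded_at": now},
    {"order_id": 2, "customer_id": "10", "amount": 5.0, "loaded_at": now},  # wrong type
]
failures = check_schema(rows) + check_freshness(rows, timedelta(hours=1))
print(failures)  # surfaces the type error on row 1; freshness passes
```

The point is the contract, not the code: the checks run on every load, and a failure names the exact row and column, so quality problems are visible and fast to fix rather than discovered by a consumer.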
Performance Optimisation
Intermediate engineers understand why a pipeline or query is slow, not just that it is slow. This means reading query execution plans, understanding partitioning and clustering strategies, knowing when to materialise versus compute on demand, and understanding the cost implications of design choices in cloud platforms.
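Reading a plan is a habit you can practise on any engine. As a small, runnable illustration, here is SQLite's `EXPLAIN QUERY PLAN` showing the same query before and after an index - the table and data are invented, and warehouse plans (Snowflake, BigQuery, Spark) are richer, but the discipline of checking the plan rather than guessing is the same.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 100, "click", f"2024-01-{i % 28 + 1:02d}") for i in range(1000)],
)

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT COUNT(*) FROM events WHERE user_id = 42"
before = plan(query)  # full scan of the table
conn.execute("CREATE INDEX idx_user ON events (user_id)")
after = plan(query)   # index search on idx_user
print(before)
print(after)
```

The before/after comparison is the habit to build: make one change, re-read the plan, and confirm the engine is actually doing what you expected before trusting the timing numbers.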
Cross-Platform Data Understanding
Data does not live in one place. Develop a working understanding of how data moves across the organisation - from operational systems into the data platform, between data platform layers, and into consumption tools. Understanding data lineage at this level changes how you design for reliability and schema change.
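Lineage reasoning can be made concrete with a small graph traversal: given an upstream table, what breaks if its schema changes? The node names below are a hypothetical slice of a platform, and real lineage would come from a catalogue or dbt's manifest rather than a hand-written dict.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
LINEAGE = {
    "crm.customers":         ["staging.stg_customers"],
    "staging.stg_customers": ["marts.dim_customer"],
    "marts.dim_customer":    ["marts.fct_orders", "dashboards.revenue"],
    "marts.fct_orders":      ["dashboards.revenue"],
}

def downstream_impact(node: str) -> set[str]:
    """Everything that must be reviewed if `node`'s schema changes."""
    seen, queue = set(), deque([node])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream_impact("crm.customers")))
```

Even this toy version changes how you design: once you can enumerate the blast radius of a source change, contracts and deprecation windows stop being abstract and become a list of named consumers to notify.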
Mentoring and Knowledge Sharing
Intermediate engineers start actively growing graduate engineers through pairing, code review, and structured knowledge sharing. Teaching pipeline design and data quality thinking forces you to make your own mental models explicit and often reveals gaps you did not know you had.
Skills to Develop
Behaviours to Demonstrate
Use AI to generate transformation SQL for well-defined business logic, but always validate the output against sample data and edge cases before committing to production.
Experiment with AI-assisted anomaly detection on datasets you own - use AI to suggest statistical approaches and thresholds, then implement and validate them yourself.
Use AI to help design data quality test suites by describing your data domain and asking for a comprehensive set of checks to implement, then critically evaluate the coverage.
Practice using AI to explain execution plans and query optimisation strategies, then verify the recommendations by running A/B comparisons on real data.
Evaluate AI-generated dbt models carefully for correctness of joins, grain, and filter logic - AI tools frequently produce syntactically valid but semantically wrong SQL.
Use AI to generate pipeline documentation from code and treat the output as a first draft requiring expert review, not a finished product.
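For the anomaly-detection point above, "implement and validate it yourself" might look like this: an AI assistant could suggest a z-score rule for daily row counts, but you choose and justify the threshold against your own data. The row counts here are fabricated for the example, and the threshold of 2.0 is deliberately tuned to the sample - on this data a default of 3.0 would miss the spike, which is exactly the kind of validation the behaviour calls for.

```python
import statistics

def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[int]:
    """Indices of points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Made-up daily row counts; the final day's load is anomalous.
daily_row_counts = [1000, 1020, 980, 1005, 995, 1010, 4000]
print(zscore_outliers(daily_row_counts, threshold=2.0))  # [6]
```

A single large spike also inflates the standard deviation itself, which is why the default threshold fails here - a robust variant (median and MAD) is a common next step, and is the sort of trade-off you should be able to explain rather than accept from a tool.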
Fundamentals of Data Engineering
The comprehensive reference for the data engineering lifecycle - essential reading for understanding how your work fits into the broader data platform.
The Data Warehouse Toolkit
The definitive reference for dimensional modelling - even in modern lakehouse environments, the concepts here underpin how analytical data should be structured.
Data Quality Engineering at Scale
Builds a systematic engineering approach to data quality rather than treating it as a testing activity bolted on at the end.
Designing Data-Intensive Applications
The essential reference for understanding how data systems actually work under the hood - consistency, replication, stream processing - knowledge every intermediate data engineer needs.
Streaming Systems
As intermediate engineers encounter real-time data requirements, this book provides the conceptual foundation for reasoning about streaming pipelines correctly.
dbt Advanced
Covers advanced dbt patterns - incremental models, snapshots, macros, packages - that separate engineers who use dbt from engineers who use it well.
Data Modelling with Snowflake or BigQuery
Platform-specific modelling and optimisation knowledge that has direct impact on query performance and cost.
Apache Spark for Data Engineers
Spark is the dominant distributed compute engine in modern data platforms - understanding its execution model is essential for intermediate engineers working at scale.
Kafka for Data Engineers
Event streaming is increasingly part of data engineering architectures and Kafka is the standard - this builds the practical skills to work with it effectively.
Review the full expectations for both roles to understand exactly what good looks like at each level.
→ Junior Data Engineer Archetype
→ Intermediate Data Engineer Archetype