
Junior Data Engineer to Intermediate Data Engineer

🕑 12-24 months · Data Engineering

Design and own complete data pipelines independently, build data quality frameworks, and begin understanding how data flows and is governed across the wider platform.

🎯 Focus Areas

Complete Pipeline Ownership

An intermediate data engineer designs, builds, operates, and iterates on pipelines independently. This means owning the full lifecycle - from source system understanding through transformation logic to consumer SLAs. When a pipeline breaks in production, you are the one who diagnoses it, fixes it, and prevents it from happening again.

Data Quality as Engineering

Move beyond ad-hoc data quality checks to systematic quality frameworks - schema validation, referential integrity checks, statistical profiling, freshness monitoring, and consumer alerting. Great data quality is not achieved by checking harder - it is achieved by designing pipelines and contracts that make quality failures visible and fast to fix.
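The checks described above are small pieces of engineering, not manual inspection. Below is an illustrative Python sketch of two of them - freshness and null-rate - with hypothetical column names and thresholds; in practice these would live in a framework such as dbt tests or a dedicated quality tool rather than hand-rolled functions.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_loaded_at: datetime, max_age: timedelta) -> bool:
    """Freshness check: the newest loaded record must be younger than max_age."""
    return datetime.now(timezone.utc) - latest_loaded_at <= max_age

def check_null_rate(rows: list[dict], column: str, max_rate: float) -> bool:
    """Null-rate check: the share of NULLs in a column must stay under a threshold."""
    if not rows:
        return False  # an empty batch is itself a quality failure
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows) <= max_rate

# Hypothetical batch: one NULL customer_id out of four rows (25% null rate)
batch = [
    {"customer_id": 1}, {"customer_id": 2},
    {"customer_id": None}, {"customer_id": 4},
]
assert check_null_rate(batch, "customer_id", max_rate=0.30)      # passes at 30%
assert not check_null_rate(batch, "customer_id", max_rate=0.10)  # fails at 10%
```

The point of framing checks as boolean functions like these is that their results become metrics a scheduler can alert on, which is what makes quality failures visible and fast to fix.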

Performance Optimisation

Intermediate engineers understand why a pipeline or query is slow, not just that it is slow. This means reading query execution plans, understanding partitioning and clustering strategies, knowing when to materialise versus compute on demand, and understanding the cost implications of design choices in cloud platforms.
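Execution plan formats differ by warehouse, but the habit of reading the plan before and after a change is portable. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a warehouse: it shows the same query switching from a full table scan to an index lookup once an index exists (the table and index names are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user_id INTEGER, ts TEXT)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

query = "SELECT * FROM events WHERE user_id = 42"
before = plan(query)  # full table scan, e.g. 'SCAN events'
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
after = plan(query)   # index lookup, e.g. 'SEARCH events USING INDEX idx_events_user (user_id=?)'
print(before)
print(after)
```

In a columnar warehouse the equivalent change is usually partition pruning or clustering rather than a b-tree index, but the workflow - read the plan, change the layout, read the plan again, compare cost - is the same.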

Cross-Platform Data Understanding

Data does not live in one place. Develop a working understanding of how data moves across the organisation - from operational systems into the data platform, between data platform layers, and into consumption tools. Understanding data lineage at this level changes how you design for reliability and schema change.
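One way to make lineage actionable is to treat it as a graph and walk it to assess the blast radius of a change. The snippet below is a minimal Python sketch over a hypothetical lineage graph; in practice the edges would come from a catalogue tool or dbt's manifest rather than a hand-written dict.

```python
from collections import deque

# Hypothetical lineage: dataset -> datasets that consume it directly
lineage = {
    "crm.orders":        ["staging.orders"],
    "staging.orders":    ["curated.orders", "curated.revenue"],
    "curated.orders":    ["dashboard.sales"],
    "curated.revenue":   ["dashboard.finance"],
    "dashboard.sales":   [],
    "dashboard.finance": [],
}

def downstream_impact(dataset: str) -> set[str]:
    """Breadth-first walk of the lineage graph to find every affected consumer."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for consumer in lineage.get(queue.popleft(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

print(sorted(downstream_impact("staging.orders")))
# → ['curated.orders', 'curated.revenue', 'dashboard.finance', 'dashboard.sales']
```

A schema change to staging.orders reaches two dashboards here - exactly the kind of impact assessment that changes how you design for reliability and schema change.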

Mentoring and Knowledge Sharing

Intermediate engineers start actively growing graduate engineers through pairing, code review, and structured knowledge sharing. Teaching pipeline design and data quality thinking forces you to make your own mental models explicit and often reveals gaps you did not know you had.

Skills & Behaviours to Develop

Skills to Develop

  • Design a multi-layer data pipeline from raw ingestion through curated consumption, with documented data contracts at each layer boundary.
  • Implement a data quality framework for a domain you own, covering schema validation, freshness checks, statistical anomaly detection, and consumer alerting.
  • Read and interpret query execution plans in your primary data warehouse and make targeted optimisations based on what they reveal.
  • Apply partitioning, clustering, and materialisation strategies to reduce query cost and latency on large datasets.
  • Model data for analytical consumption using dimensional modelling or a modern vault approach, understanding the trade-offs between the two.
  • Implement schema evolution handling in pipelines - managing upstream schema changes without breaking downstream consumers.
  • Use data lineage tooling to trace the provenance of a dataset from source to consumption and use that understanding to assess the impact of proposed changes.
  • Design and run a data incident post-mortem when a pipeline produces incorrect or missing data, following up with systemic fixes.
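To make the schema-evolution skill concrete, here is a minimal Python sketch of one common pattern: projecting upstream records onto a declared contract, so additive upstream changes never reach consumers and destructive ones fail loudly. The column names and defaults are hypothetical.

```python
# Declared contract that downstream consumers depend on: column -> default value
CONTRACT = {"order_id": None, "amount": 0.0, "currency": "GBP"}

def conform(record: dict) -> dict:
    """Project an upstream record onto the contract:
    - new upstream columns are dropped, so additive changes don't break consumers
    - columns missing upstream are filled with a declared default
    - a missing key column is a hard failure rather than silent corruption
    """
    if record.get("order_id") is None:
        raise ValueError(f"record missing key column order_id: {record!r}")
    return {col: record.get(col, default) for col, default in CONTRACT.items()}

# Upstream added 'channel' and dropped 'currency' - consumers still see a stable shape
print(conform({"order_id": 7, "amount": 19.5, "channel": "web"}))
# → {'order_id': 7, 'amount': 19.5, 'currency': 'GBP'}
```

Whether the contract lives in code like this, in a schema registry, or in dbt model contracts, the principle is the same: the layer boundary, not the consumer, absorbs upstream change.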

Behaviours to Demonstrate

  • Takes end-to-end ownership of pipelines including on-call response, root cause analysis, and prevention of repeat incidents.
  • Proactively informs data consumers of upcoming schema changes with enough lead time to adapt.
  • Treats data quality metrics as engineering outputs to be monitored and improved, not as someone else's problem.
  • Documents pipeline design decisions and trade-offs so that the next engineer does not have to reconstruct the reasoning.
  • Questions data source assumptions before trusting upstream data in a new pipeline.
  • Pairs with graduate engineers and provides code review that teaches pipeline design principles, not just corrects mistakes.
  • Raises performance and cost concerns in design discussions rather than waiting until a pipeline is in production.
🛠 Hands-On Projects

  1. Design and implement a three-layer data pipeline for a real business domain - raw, cleansed, and curated - with data quality checks at each layer and schema change handling.
  2. Take a slow dbt model or warehouse query, profile it, apply partitioning or clustering changes, and document the cost and performance improvement.
  3. Build a data quality monitoring dashboard that tracks freshness, null rates, row count anomalies, and distribution shifts for a dataset you own.
  4. Implement an incremental loading strategy for a high-volume pipeline, replacing a full-refresh approach and measuring the improvement in cost and runtime.
  5. Create data lineage documentation for a key domain in your organisation using a lineage tool or structured diagrams, and use it to identify and address missing data contracts.
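The incremental-loading project usually reduces to tracking a watermark: remember how far you have read, pull only what is newer, then advance the watermark on success. The Python sketch below uses an in-memory list as a stand-in for the source table and invented timestamps; in dbt the same idea is an incremental model's `is_incremental()` filter.

```python
# Hypothetical source rows keyed by an updated_at timestamp (ISO strings sort correctly)
source = [
    {"id": 1, "updated_at": "2024-05-01T10:00:00"},
    {"id": 2, "updated_at": "2024-05-02T09:30:00"},
    {"id": 3, "updated_at": "2024-05-03T08:15:00"},
]

def incremental_load(watermark: str) -> tuple[list[dict], str]:
    """Pull only rows newer than the stored watermark, then advance it.
    A full refresh re-reads the whole table every run; this reads only
    what changed since the last successful load.
    """
    new_rows = [r for r in source if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows, wm = incremental_load("2024-05-01T23:59:59")
print(len(rows), wm)  # → 2 2024-05-03T08:15:00
```

Note that the watermark only advances when rows are found, and only to the maximum timestamp actually loaded - advancing it to "now" instead is a classic source of silently dropped late-arriving data.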
🤖 AI Literacy for This Transition

AI for pipeline code generation and data quality reasoning

  1. Use AI to generate transformation SQL for well-defined business logic, but always validate the output against sample data and edge cases before committing to production.
  2. Experiment with AI-assisted anomaly detection on datasets you own - use AI to suggest statistical approaches and thresholds, then implement and validate them yourself.
  3. Use AI to help design data quality test suites by describing your data domain and asking for a comprehensive set of checks to implement, then critically evaluate the coverage.
  4. Practice using AI to explain execution plans and query optimisation strategies, then verify the recommendations by running A/B comparisons on real data.
  5. Evaluate AI-generated dbt models carefully for correctness of joins, grain, and filter logic - AI tools frequently produce syntactically valid but semantically wrong SQL.
  6. Use AI to generate pipeline documentation from code and treat the output as a first draft requiring expert review, not a finished product.
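The anomaly-detection point above is a good place to practise validating AI-proposed thresholds yourself: a z-score over recent history is a reasonable baseline to check suggestions against. The Python sketch below uses invented daily row counts and the stdlib statistics module.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest observation if it sits more than `threshold`
    standard deviations from the historical mean (a z-score test)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change at all is anomalous
    return abs(latest - mu) / sigma > threshold

# Hypothetical daily row counts for a table you own
daily_row_counts = [10_120, 9_980, 10_340, 10_050, 9_900, 10_210, 10_075]
print(is_anomalous(daily_row_counts, 10_150))  # → False (an ordinary day)
print(is_anomalous(daily_row_counts, 2_300))   # → True  (likely a broken load)
```

If an AI tool suggests a different threshold or statistic, backtest it against your own history the same way before wiring it into alerting.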

📚 Recommended Reading

Fundamentals of Data Engineering

Joe Reis and Matt Housley

The comprehensive reference for the data engineering lifecycle - essential reading for understanding how your work fits into the broader data platform.

The Data Warehouse Toolkit

Ralph Kimball and Margy Ross

The definitive reference for dimensional modelling - even in modern lakehouse environments the concepts here underpin how analytical data should be structured.

Data Quality Engineering at Scale

Various / O'Reilly resources

Builds a systematic engineering approach to data quality rather than treating it as a testing activity bolted on at the end.

Designing Data-Intensive Applications

Martin Kleppmann

The essential reference for understanding how data systems actually work under the hood - consistency, replication, stream processing - that every intermediate data engineer needs.

Streaming Systems

Tyler Akidau, Slava Chernyak, and Reuven Lax

As intermediate engineers encounter real-time data requirements, this book provides the conceptual foundation for reasoning about streaming pipelines correctly.

🎓 Courses & Resources

dbt Advanced

dbt Learn

Covers advanced dbt patterns - incremental models, snapshots, macros, packages - that separate engineers who use dbt from engineers who use it well.

Data Modelling with Snowflake or BigQuery

Pluralsight

Platform-specific modelling and optimisation knowledge that has direct impact on query performance and cost.

Apache Spark for Data Engineers

Databricks Academy

Spark is the dominant distributed compute engine in modern data platforms - understanding its execution model is essential for intermediate engineers working at scale.

Kafka for Data Engineers

Confluent Developer

Event streaming is increasingly part of data engineering architectures and Kafka is the standard - this builds the practical skills to work with it effectively.

📋 Role Archetypes

Review the full expectations for both roles to understand exactly what good looks like at each level.

→ Junior Data Engineer Archetype
→ Intermediate Data Engineer Archetype