Junior Platform Engineer to Intermediate Platform Engineer

🎯 Focus Areas

Kubernetes Administration

Move from deploying applications in Kubernetes to administering the cluster itself. This means understanding node management, network policies, RBAC, admission controllers, resource quotas, and cluster upgrades. The intermediate platform engineer can diagnose cluster-level problems, not just application-level ones.

SRE Practices

Service level objectives are the contract between the platform and its consumers. Learn to define meaningful SLIs, set realistic SLOs, build alerting that pages for things that actually matter, and run blameless post-mortems that produce systemic improvements. SRE is a discipline - not a job title - that platform engineers own.
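To make the SLO contract concrete, the sketch below shows how an availability target translates into an error budget. The function names, the 30-day window, and the request counts are illustrative assumptions rather than figures from any particular platform.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target, one million requests, 250 failures.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Under these assumptions a 99.9% target over 30 days allows roughly 43 minutes of downtime, which is a useful number to have in mind when deciding what deserves a page.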

Observability Engineering

Observability is not metric collection - it is the ability to ask arbitrary questions about system behaviour using the signals you have. Build the skills to instrument systems meaningfully, design dashboards that surface real problems, and use distributed tracing to understand behaviour across service boundaries.
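Tracing across service boundaries depends on each service passing trace context to the next. The sketch below illustrates the idea using the W3C `traceparent` header format and only the Python standard library. In practice an OpenTelemetry SDK would normally handle this for you; the function names here are hypothetical, and the sketch assumes version `00` headers only and ignores `tracestate`.

```python
import re
import secrets

# traceparent layout: version - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace rather than fail the request.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    if trace_id == "0" * 32:
        # An all-zero trace ID is invalid, so treat it like a missing header.
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    upstream = new_traceparent()
    downstream = child_traceparent(upstream)
    print(upstream)
    print(downstream)
```

Because the trace ID is preserved while the span ID changes at each hop, a tracing backend can stitch the spans from different services into a single request timeline.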

Developer Experience

The platform team's primary customer is other engineers. Intermediate platform engineers actively seek feedback from developer teams, identify friction in the development workflow, and implement targeted improvements. A platform that developers love to use is a platform that creates business value.

Platform Component Design

Design platform components - deployment templates, infrastructure modules, shared pipeline libraries - as products. This means stable APIs, backward compatibility, versioning, documentation, and enough flexibility for consumers to use them without forking.
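One way to make a versioning promise checkable is to gate automated upgrades on semantic versioning. The sketch below is a minimal illustration, assuming components are tagged `MAJOR.MINOR.PATCH` (optionally with a leading `v`) and that a major version bump signals a breaking change. The helper names are hypothetical, and pre-release tags and the looser rules for `0.x` versions are deliberately ignored.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v' as used in git tags."""
    stripped = text[1:] if text.startswith("v") else text
    parts = stripped.split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(p) for p in parts))


def is_safe_upgrade(current: str, candidate: str) -> bool:
    """True when candidate is newer than current within the same major version."""
    cur, cand = parse_version(current), parse_version(candidate)
    return cand.major == cur.major and cand > cur


if __name__ == "__main__":
    print(is_safe_upgrade("v1.4.2", "v1.5.0"))
    print(is_safe_upgrade("v1.4.2", "v2.0.0"))
```

A check like this only has value if the component's maintainers actually hold to the contract, which is why backward compatibility and documentation appear alongside versioning above.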

⚡ Skills & Behaviours to Develop

Skills to Develop

Administer a Kubernetes cluster including RBAC, network policies, resource quotas, admission webhooks, and cluster version upgrades.
Define and implement SLIs and SLOs for a platform service, build alerting against them, and run a post-mortem when an SLO is breached.
Build a distributed tracing implementation across multiple services using OpenTelemetry and a compatible backend.
Design and implement a reusable Terraform module with a stable public API, input validation, documentation, and versioned releases.
Create a shared CI/CD pipeline library that developer teams can adopt without modifying the core, with a migration path from existing pipelines.
Identify and implement a meaningful developer experience improvement - measured by adoption, time saved, or developer satisfaction feedback.
Implement GitOps for a platform component using ArgoCD or Flux, including handling secrets and environment promotion.
Conduct a structured platform incident post-mortem with root cause analysis, timeline, and systemic remediation items.
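For the SLO alerting skill in the list above, one common approach is burn-rate alerting: page on how fast the error budget is being spent rather than on raw error counts. The sketch below is illustrative only. The function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour, a starting point drawn from the multi-window approach described in The Site Reliability Workbook and not a universal rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window (for example one hour) shows the problem is significant;
    the short window (for example five minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical 99.9% SLO with 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, 0.999))
    # Same long-window errors, but the short window has already recovered.
    print(should_page(0.02, 0.0005, 0.999))
```

Requiring both windows to exceed the threshold is what keeps an incident that has already recovered from paging someone, which is one practical defence against alert fatigue.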

Behaviours to Demonstrate

Proactively seeks feedback from developer teams on platform friction rather than waiting for complaints to reach the backlog.
Designs platform components with developer usability as a primary concern, not just technical correctness.
Writes and maintains runbooks that are detailed enough to be followed by someone who was not involved in building the system.
Communicates planned platform changes to affected teams with appropriate notice and a clear rollback plan.
Monitors platform SLOs actively and treats SLO breaches as engineering problems to solve, not metrics to explain away.
Reviews and tests Terraform and Kubernetes changes in non-production before production, every time.
Pairs with graduate engineers and provides code review that teaches platform engineering principles.

🛠 Hands-On Projects

1. Set up full cluster observability - metrics, logs, traces - for a Kubernetes cluster, implement meaningful SLOs for platform services, and build alerting that surfaces real problems without alert fatigue.

2. Build a reusable Terraform module for a commonly provisioned resource, release it with versioning and documentation, and get at least two developer teams to adopt it.

3. Implement GitOps for a platform component using ArgoCD, including environment promotion and secrets management, and document the operational runbook.

4. Run a developer experience survey across your engineering community, identify the top three pain points, and implement at least one targeted improvement with a before-and-after measurement.

5. Administer a Kubernetes cluster upgrade from one minor version to the next in a non-production environment, documenting the process and any issues encountered.

⚡ AI Literacy for This Transition

AI for platform automation and developer experience

Use AI to generate Terraform and Kubernetes YAML boilerplate for components you are building, but review every security-sensitive attribute - IAM policies, network rules, secret handling - independently.

Experiment with AI for incident diagnosis by providing log excerpts and error messages and evaluating how reliably it identifies root causes versus plausible-sounding but incorrect hypotheses.

Use AI to help write runbooks and operational documentation from notes or code, treating the output as a first draft requiring expert review for accuracy and completeness.

Explore AI-assisted code review for infrastructure-as-code by asking AI to identify security misconfigurations, missing resource limits, or deviations from team standards in Terraform and Kubernetes manifests.

Evaluate AI coding tools from a platform security perspective - understand what code and secrets might be inadvertently included in prompts and establish team guidelines.

Use AI to accelerate writing developer-facing documentation for platform tools and services, recognising that accuracy is non-negotiable and must be verified before publishing.
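One of the points above concerns secrets being inadvertently included in prompts. A team guideline can be backed by a simple pre-send check, sketched below. The two patterns are illustrative assumptions only (the AWS access key ID prefix format and a generic `password: value` style assignment), and the function name is hypothetical. A maintained secret scanner would be a better foundation for anything beyond experimentation.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; they will miss many real secrets.
SECRET_PATTERNS = [
    ("aws-access-key-id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("credential-assignment",
     re.compile(r"\b(?:password|secret|token)\s*[:=]\s*\S+", re.IGNORECASE)),
]


def redact(text: str) -> Tuple[str, List[str]]:
    """Replace likely secrets with placeholders and report which patterns matched."""
    findings = []
    for name, pattern in SECRET_PATTERNS:
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            findings.append(name)
    return text, findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    snippet = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\npassword: hunter2\nreplicas: 3'
    cleaned, found = redact(snippet)
    print(cleaned)
    print(found)
```

A check like this reduces accidental leakage but does not replace the guideline itself: engineers still need to know what should never be pasted into a prompt.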

📚 Recommended Reading

Site Reliability Engineering

Niall Richard Murphy, Betsy Beyer, Chris Jones, and Jennifer Petoff

The foundational text for SRE practice - SLIs, SLOs, error budgets, on-call design, and incident management - that every intermediate platform engineer must have read.

The Site Reliability Workbook

Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne

The practical companion to the SRE book, with worked examples of implementing SRE practices in real organisations.

Kubernetes in Action

Marko Lukša

One of the most thorough practical treatments of Kubernetes available - essential for moving from using the platform to administering it.

Observability Engineering

Charity Majors, Liz Fong-Jones, and George Miranda

A leading guide to modern observability - instrumentation, high-cardinality data, debugging in production - written by engineers who helped shape much of the practice.

Accelerate

Nicole Forsgren, Jez Humble, and Gene Kim

Understanding the research connecting deployment practices, team culture, and business outcomes helps platform engineers make the case for developer experience investment.

🎓 Courses & Resources

Certified Kubernetes Administrator (CKA)

Linux Foundation / Cloud Native Computing Foundation

The CKA validates deep Kubernetes administration skills - the exam is hands-on and genuinely tests operational competence.

Prometheus and Grafana: Complete Monitoring Stack

Udemy

Builds practical observability implementation skills that are immediately applicable to building platform monitoring infrastructure.

GitOps with ArgoCD

Codefresh / A Cloud Guru

GitOps is becoming the standard deployment model for platform teams, and this course builds hands-on skills with one of its most widely adopted implementations.

Platform Engineering Fundamentals

Pluralsight

Covers the platform-as-a-product mindset and developer experience principles that differentiate great platform teams from infrastructure teams with a different name.

📋 Role Archetypes

Review the full expectations for both roles to understand exactly what good looks like at each level.

→ Junior Platform Engineer Archetype

→ Intermediate Platform Engineer Archetype