Practice: Blue-Green Model Deployment

Purpose and Strategic Importance

Deploying a new model version is one of the highest-risk moments in an AI system's operational life. The new model may behave differently from the old one in ways that were not caught by pre-deployment testing — producing unexpected outputs, serving certain user populations differently, or degrading under production load conditions that were not simulated in evaluation. Blue-green deployment mitigates this risk by maintaining the previous model version as an immediately available fallback, enabling rollback without downtime if problems emerge.

Beyond safety, blue-green deployment enables confident, frequent releases. Teams that must choose between risky big-bang deployments and infrequent releases driven by fear of deployment failures naturally tend towards the latter. By making deployment safer, blue-green practices create the conditions for the higher deployment frequency that keeps AI systems fresh, responsive to user needs, and continuously improving.


Description of the Practice

  • Maintains two equivalent production environments — blue (current) and green (new) — that can be switched instantly or gradually without downtime.
  • Routes traffic between environments using configurable load balancing, enabling gradual rollout from 0% to 100% with monitoring at each increment.
  • Keeps the previous model version operational and ready to receive traffic for a defined window after new version deployment, enabling instant rollback without redeployment.
  • Monitors model performance metrics on the new version during gradual rollout, with automatic or manual rollback triggered when degradation thresholds are breached.
  • Documents rollback procedures and tests them regularly, ensuring that the team can execute a rollback confidently under pressure when it is needed.
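
The traffic routing described above can be sketched as a deterministic splitter that assigns each request to blue or green by hashing its id into a percentage bucket. This is a minimal illustration, not any particular serving product's API; the function names are assumptions:

```python
import zlib

def make_router(green_weight: float):
    """Return a router sending roughly green_weight of traffic to the
    green (new) model version and the rest to blue (current)."""
    def route(request_id: str) -> str:
        # CRC32 gives a stable hash, so a given request id is sticky
        # to one side for the lifetime of a rollout stage.
        bucket = zlib.crc32(request_id.encode()) % 100
        return "green" if bucket < green_weight * 100 else "blue"
    return route

# Gradual rollout: start with a small canary share, raise it in steps
# while monitoring, and drop it back to 0.0 to roll back instantly.
router = make_router(green_weight=0.05)
```

Because the assignment is hash-based rather than random, rollout percentages can be raised or lowered purely by changing the weight, with no per-request state to migrate.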

How to Practise It (Playbook)

1. Getting Started

  • Ensure your model serving infrastructure supports routing rules that can direct traffic to different model versions — this is the foundational capability on which all blue-green practices depend.
  • Define rollback criteria explicitly: what specific metric thresholds or error conditions will trigger a rollback, and who has the authority to initiate one.
  • Implement a canary deployment phase — routing a small percentage of traffic (e.g., 1-5%) to the new model before full rollout — as a minimum viable blue-green practice.
  • Test rollback procedures in a staging environment before the first production use, validating that traffic can be redirected to the previous version within your defined SLA.
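
The explicit rollback criteria called for above can be captured in code rather than prose, so the thresholds are versioned, reviewed, and testable alongside the deployment pipeline. The metric names and example thresholds below are illustrative assumptions, not prescribed values:

```python
from dataclasses import dataclass

@dataclass
class RollbackCriteria:
    max_error_rate: float      # e.g. 0.02 means roll back above 2% failures
    max_p95_latency_ms: float  # serving latency ceiling
    min_quality_score: float   # model-specific quality floor

    def should_roll_back(self, error_rate: float,
                         p95_latency_ms: float,
                         quality_score: float) -> bool:
        # Any single breached threshold is sufficient to trigger rollback.
        return (error_rate > self.max_error_rate
                or p95_latency_ms > self.max_p95_latency_ms
                or quality_score < self.min_quality_score)
```

Who may invoke `should_roll_back` manually, and under what authority, still needs to be documented separately; the code only removes ambiguity about *when* a rollback is warranted.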

2. Scaling and Maturing

  • Automate rollback triggers based on real-time monitoring metrics, so that degradation beyond defined thresholds initiates rollback without requiring a human to notice and react.
  • Build A/B testing capability on top of the blue-green infrastructure, enabling systematic comparison of model variants on controlled traffic segments before making deployment decisions.
  • Extend blue-green practices to cover the full serving stack — feature computation, pre-processing, and model serving — ensuring that all components can be safely versioned and rolled back together.
  • Define and test disaster recovery procedures that cover scenarios where both blue and green environments fail simultaneously, ensuring a last-resort recovery path exists.
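
The automated rollback trigger in the first bullet above can be sketched as a pure decision function that a monitoring loop calls on each window of green-side metrics. The metric keys, minimum-sample gate, and step size are assumptions for illustration:

```python
def evaluate_rollout(metrics, criteria, green_weight, step=0.2):
    """Decide the next rollout action from one window of green metrics.
    Returns ("rollback", 0.0), ("hold", w) or ("advance", new_w)."""
    # Hard failure: send all traffic back to blue immediately.
    if (metrics["error_rate"] > criteria["max_error_rate"]
            or metrics["p95_latency_ms"] > criteria["max_p95_latency_ms"]):
        return ("rollback", 0.0)
    # Not enough evidence yet: keep the current split and keep watching.
    if metrics["sample_count"] < criteria["min_samples"]:
        return ("hold", green_weight)
    # Healthy window with enough samples: widen the green share.
    return ("advance", min(1.0, green_weight + step))
```

Keeping the decision logic side-effect free makes it trivial to unit-test, which matters for a procedure that must work correctly under pressure.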

3. Team Behaviours to Encourage

  • Never rush through canary deployment phases under delivery pressure — the monitoring window exists to surface production issues that pre-deployment testing missed, and compressing it defeats its purpose.
  • Treat rollback as a normal operational event, not a failure — a team that rolls back a problematic deployment has demonstrated mature operational practice, not weakness.
  • Document every rollback with a post-incident review that identifies what pre-deployment testing could have caught the issue, feeding continuous improvement of the test suite.
  • Test rollback procedures regularly in production or production-equivalent environments — procedures that are only tested in theory are procedures that will fail under pressure.

4. Watch Out For…

  • Canary deployments on user populations that are not representative of the full user base, missing issues that only manifest for specific subgroups.
  • Keeping the rollback window so short that it provides no practical safety — production issues can take hours to accumulate statistically significant evidence.
  • Unmanaged infrastructure costs from running two parallel environments, which can be significant for compute-intensive AI serving workloads.
  • Blue-green deployment implemented for the model serving layer but not for the data pipelines and feature stores that feed it, creating a partial safety net with significant gaps.

5. Signals of Success

  • All production model deployments go through a canary phase with defined monitoring before full rollout, without exceptions driven by time pressure.
  • The team has successfully executed at least one rollback using the blue-green infrastructure, demonstrating that the process works under real conditions.
  • Mean time to rollback — from decision to traffic fully redirected — is measured and meets the team's defined SLA, typically minutes rather than hours.
  • Rollback capability is regularly tested, with results demonstrating that rollback procedures remain effective as the infrastructure evolves.
  • Deployment frequency has increased since blue-green deployment was implemented, as teams feel safer releasing more frequently.
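
Mean time to rollback is straightforward to compute once each rollback is logged with a decision timestamp and a traffic-fully-redirected timestamp. A minimal sketch, assuming such event pairs are available:

```python
from datetime import datetime, timedelta

def mean_time_to_rollback(rollbacks):
    """rollbacks: list of (decided_at, redirected_at) datetime pairs.
    Returns the mean duration from rollback decision to completion."""
    total = sum((done - decided for decided, done in rollbacks), timedelta())
    return total / len(rollbacks)
```

Tracking this against the team's SLA over time also shows whether rollback speed is holding up as the infrastructure evolves.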

Associated Standards
  • AI models are deployed via automated, repeatable pipelines
  • Post-deployment model performance is monitored continuously
  • Production feedback loops are closed within defined time limits
