Practice: Blue-Green Model Deployment
Purpose and Strategic Importance
Deploying a new model version is one of the highest-risk moments in an AI system's operational life. The new model may behave differently from the old one in ways that were not caught by pre-deployment testing — producing unexpected outputs, serving certain user populations differently, or degrading under production load conditions that were not simulated in evaluation. Blue-green deployment mitigates this risk by maintaining the previous model version as an immediately available fallback, enabling rollback without downtime if problems emerge.
Beyond safety, blue-green deployment enables confident, frequent releases. Teams that must choose between risky big-bang deployments and infrequent releases driven by fear of deployment failures naturally tend towards the latter. By making deployment safer, blue-green practices create the conditions for the higher deployment frequency that keeps AI systems fresh, responsive to user needs, and continuously improving.
Description of the Practice
- Maintains two equivalent production environments — blue (current) and green (new) — that can be switched instantly or gradually without downtime.
- Routes traffic between environments using configurable load balancing, enabling gradual rollout from 0% to 100% with monitoring at each increment.
- Keeps the previous model version operational and ready to receive traffic for a defined window after new version deployment, enabling instant rollback without redeployment.
- Monitors model performance metrics on the new version during gradual rollout, with automatic or manual rollback triggered when degradation thresholds are breached.
- Documents rollback procedures and tests them regularly, ensuring that the team can execute a rollback confidently under pressure when it is needed.
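The routing and instant-rollback behaviour described above can be sketched as a minimal traffic splitter. This is an illustrative sketch, not a production implementation: real deployments would use a load balancer or service mesh, and names like `green_weight` are assumptions, not an established API.

```python
import random

class BlueGreenRouter:
    """Minimal sketch of a traffic splitter between two model versions."""

    def __init__(self, blue_handler, green_handler, green_weight=0.0):
        self.blue_handler = blue_handler    # current production model
        self.green_handler = green_handler  # new candidate model
        self.green_weight = green_weight    # fraction of traffic sent to green

    def set_green_weight(self, weight):
        # Gradual rollout: raise the weight from 0.0 towards 1.0 in
        # monitored increments.
        self.green_weight = max(0.0, min(1.0, weight))

    def rollback(self):
        # Instant rollback: all traffic returns to blue, no redeployment.
        self.green_weight = 0.0

    def handle(self, request):
        # Probabilistically route each request according to the current weight.
        if random.random() < self.green_weight:
            return self.green_handler(request)
        return self.blue_handler(request)
```

Because the previous version's handler stays loaded and addressable, `rollback()` is a configuration change rather than a redeployment, which is what makes the switch effectively instantaneous.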
How to Practise It (Playbook)
1. Getting Started
- Ensure your model serving infrastructure supports routing rules that can direct traffic to different model versions — this is the foundational capability on which all blue-green practices depend.
- Define rollback criteria explicitly: what specific metric thresholds or error conditions will trigger a rollback, and who has the authority to initiate one.
- Implement a canary deployment phase — routing a small percentage of traffic (e.g., 1-5%) to the new model before full rollout — as a minimum viable blue-green practice.
- Test rollback procedures in a staging environment before the first production use, validating that traffic can be redirected to the previous version within your defined SLA.
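Explicit rollback criteria are easiest to honour under pressure when they are written down as data rather than held as tribal knowledge. A hedged sketch of what that might look like follows; the metric names and thresholds are illustrative assumptions, and each team should substitute its own.

```python
# Hypothetical rollback criteria expressed as explicit thresholds.
# Metric names and values are illustrative, not prescriptive.
ROLLBACK_CRITERIA = {
    "error_rate": 0.02,       # roll back if more than 2% of requests fail
    "p99_latency_ms": 500,    # roll back if p99 latency exceeds 500 ms
    "accuracy_drop": 0.05,    # roll back if accuracy falls >5 points vs blue
}

def should_rollback(canary_metrics, criteria=ROLLBACK_CRITERIA):
    """Return the list of breached criteria; an empty list means proceed."""
    breached = []
    for name, threshold in criteria.items():
        if canary_metrics.get(name, 0.0) > threshold:
            breached.append(name)
    return breached
```

Returning the breached criteria by name, rather than a bare boolean, supports the documentation requirement above: the rollback record states exactly which threshold was crossed.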
2. Scaling and Maturing
- Automate rollback triggers based on real-time monitoring metrics, so that degradation beyond defined thresholds initiates rollback without requiring a human to notice and react.
- Build A/B testing capability on top of the blue-green infrastructure, enabling systematic comparison of model variants on controlled traffic segments before making deployment decisions.
- Extend blue-green practices to cover the full serving stack — feature computation, pre-processing, and model serving — ensuring that all components can be safely versioned and rolled back together.
- Define and test disaster recovery procedures that cover scenarios where both blue and green environments fail simultaneously, ensuring a last-resort recovery path exists.
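An automated rollback trigger, as described in the first point above, can be sketched as a monitor that polls a metric and rolls back when a sliding-window average breaches a threshold. The `fetch_error_rate` and `rollback` callables are stand-ins for the team's own monitoring and routing tooling; the windowing guards against a single noisy sample triggering a spurious rollback.

```python
from collections import deque

class AutoRollbackMonitor:
    """Sketch of an automated rollback trigger over a sliding window.

    Assumptions (not from the source): `fetch_error_rate` returns the
    green environment's current error rate, and `rollback` redirects
    all traffic back to blue.
    """

    def __init__(self, fetch_error_rate, rollback, threshold=0.02, window=3):
        self.fetch_error_rate = fetch_error_rate
        self.rollback = rollback
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent samples

    def check(self):
        """Poll once; roll back when the windowed average breaches the threshold."""
        self.samples.append(self.fetch_error_rate())
        if len(self.samples) == self.samples.maxlen:
            avg = sum(self.samples) / len(self.samples)
            if avg > self.threshold:
                self.rollback()
                return True
        return False
```

In practice `check()` would run on a schedule (a cron job or a monitoring-system alert hook) so that degradation initiates rollback without a human having to notice it first.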
3. Team Behaviours to Encourage
- Never rush through canary deployment phases under delivery pressure — the monitoring window exists to surface production issues that pre-deployment testing missed, and compressing it defeats its purpose.
- Treat rollback as a normal operational event, not a failure — a team that rolls back a problematic deployment has demonstrated mature operational practice, not weakness.
- Document every rollback with a post-incident review that identifies what pre-deployment testing could have caught the issue, feeding continuous improvement of the testing process.
- Test rollback procedures regularly in production or production-equivalent environments — procedures that are only tested in theory are procedures that will fail under pressure.
4. Watch Out For…
- Canary deployments on user populations that are not representative of the full user base, missing issues that only manifest for specific subgroups.
- Keeping the rollback window so short that it provides no practical safety — production issues can take hours to accumulate statistically significant evidence.
- Infrastructure costs of maintaining two parallel environments, which can grow quickly if not managed carefully, particularly for compute-intensive AI serving workloads.
- Blue-green deployment implemented for the model serving layer but not for the data pipelines and feature stores that feed it, creating a partial safety net with significant gaps.
5. Signals of Success
- All production model deployments go through a canary phase with defined monitoring before full rollout, without exceptions driven by time pressure.
- The team has successfully executed at least one rollback using the blue-green infrastructure, demonstrating that the process works under real conditions.
- Mean time to rollback — from decision to traffic fully redirected — is measured and meets the team's defined SLA, typically minutes rather than hours.
- Rollback capability is regularly tested, with results demonstrating that rollback procedures remain effective as the infrastructure evolves.
- Deployment frequency has increased since blue-green deployment was implemented, as teams feel safer releasing more frequently.
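Mean time to rollback is easiest to keep honest when drills are timed the same way every time. A hypothetical sketch of a drill timer follows; `redirect_traffic` and `verify_blue_serving` are stand-ins for the team's own routing and health-check tooling.

```python
import time

def timed_rollback_drill(redirect_traffic, verify_blue_serving, timeout_s=300):
    """Measure decision-to-fully-redirected time for a rollback drill.

    Hypothetical helper: `redirect_traffic` flips routing back to blue,
    `verify_blue_serving` confirms blue is handling live traffic.
    """
    start = time.monotonic()
    redirect_traffic()  # the rollback decision is taken at time zero
    while not verify_blue_serving():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("rollback exceeded the defined SLA window")
        time.sleep(1)
    return time.monotonic() - start  # one mean-time-to-rollback sample, seconds
```

Recording this elapsed time after every drill gives the team the measured, SLA-comparable rollback metric that the signals above call for.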