Practice: Blue-Green Model Deployment
Purpose and Strategic Importance
Deploying a new model version is one of the highest-risk moments in an AI system's operational life. The new model may behave differently from the old one in ways that were not caught by pre-deployment testing — producing unexpected outputs, serving certain user populations differently, or degrading under production load conditions that were not simulated in evaluation. Blue-green deployment mitigates this risk by maintaining the previous model version as an immediately available fallback, enabling rollback without downtime if problems emerge.
Beyond safety, blue-green deployment enables confident, frequent releases. Teams that must choose between risky big-bang deployments and infrequent releases driven by fear of deployment failures naturally tend towards the latter. By making deployment safer, blue-green practices create the conditions for the higher deployment frequency that keeps AI systems fresh, responsive to user needs, and continuously improving.
Description of the Practice
- Maintains two equivalent production environments — blue (current) and green (new) — that can be switched instantly or gradually without downtime.
- Routes traffic between environments using configurable load balancing, enabling gradual rollout from 0% to 100% with monitoring at each increment.
- Keeps the previous model version operational and ready to receive traffic for a defined window after new version deployment, enabling instant rollback without redeployment.
- Monitors model performance metrics on the new version during gradual rollout, with automatic or manual rollback triggered when degradation thresholds are breached.
- Documents rollback procedures and tests them regularly, ensuring that the team can execute a rollback confidently under pressure when it is needed.
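The routing and instant-rollback behaviour described above can be sketched as a minimal traffic splitter. This is an illustrative sketch, not a production implementation: real deployments would use a load balancer or service mesh, and names like `green_weight` are assumptions, not an established API.

```python
import random

class BlueGreenRouter:
    """Minimal sketch of a traffic splitter between two model versions."""

    def __init__(self, blue_handler, green_handler, green_weight=0.0):
        self.blue_handler = blue_handler    # current production model
        self.green_handler = green_handler  # new candidate model
        self.green_weight = green_weight    # fraction of traffic sent to green

    def set_green_weight(self, weight):
        # Gradual rollout: raise the weight from 0.0 towards 1.0 in
        # monitored increments.
        self.green_weight = max(0.0, min(1.0, weight))

    def rollback(self):
        # Instant rollback: all traffic returns to blue, no redeployment.
        self.green_weight = 0.0

    def handle(self, request):
        # Probabilistically route each request according to the current weight.
        if random.random() < self.green_weight:
            return self.green_handler(request)
        return self.blue_handler(request)
```

Because the previous version's handler stays loaded and addressable, `rollback()` is a configuration change rather than a redeployment, which is what makes the switch effectively instantaneous.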
How to Practise It (Playbook)
1. Getting Started
- Ensure your model serving infrastructure supports routing rules that can direct traffic to different model versions — this is the foundational capability on which all blue-green practices depend.
- Define rollback criteria explicitly: what specific metric thresholds or error conditions will trigger a rollback, and who has the authority to initiate one.
- Implement a canary deployment phase — routing a small percentage of traffic (e.g., 1-5%) to the new model before full rollout — as a minimum viable blue-green practice.
- Test rollback procedures in a staging environment before the first production use, validating that traffic can be redirected to the previous version within your defined SLA.
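Explicit rollback criteria are easiest to honour under pressure when they are written down as data rather than held as tribal knowledge. A hedged sketch of what that might look like follows; the metric names and thresholds are illustrative assumptions, and each team should substitute its own.

```python
# Hypothetical rollback criteria expressed as explicit thresholds.
# Metric names and values are illustrative, not prescriptive.
ROLLBACK_CRITERIA = {
    "error_rate": 0.02,       # roll back if more than 2% of requests fail
    "p99_latency_ms": 500,    # roll back if p99 latency exceeds 500 ms
    "accuracy_drop": 0.05,    # roll back if accuracy falls >5 points vs blue
}

def should_rollback(canary_metrics, criteria=ROLLBACK_CRITERIA):
    """Return the list of breached criteria; an empty list means proceed."""
    breached = []
    for name, threshold in criteria.items():
        if canary_metrics.get(name, 0.0) > threshold:
            breached.append(name)
    return breached
```

Returning the breached criteria by name, rather than a bare boolean, supports the documentation requirement above: the rollback record states exactly which threshold was crossed.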
2. Scaling and Maturing
- Automate rollback triggers based on real-time monitoring metrics, so that degradation beyond defined thresholds initiates rollback without requiring a human to notice and react.
- Build A/B testing capability on top of the blue-green infrastructure, enabling systematic comparison of model variants on controlled traffic segments before making deployment decisions.
- Extend blue-green practices to cover the full serving stack — feature computation, pre-processing, and model serving — ensuring that all components can be safely versioned and rolled back together.
- Define and test disaster recovery procedures that cover scenarios where both blue and green environments fail simultaneously, ensuring a last-resort recovery path exists.
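An automated rollback trigger, as described in the first point above, can be sketched as a monitor that polls a metric and rolls back when a sliding-window average breaches a threshold. The `fetch_error_rate` and `rollback` callables are stand-ins for the team's own monitoring and routing tooling; the windowing guards against a single noisy sample triggering a spurious rollback.

```python
from collections import deque

class AutoRollbackMonitor:
    """Sketch of an automated rollback trigger over a sliding window.

    Assumptions (not from the source): `fetch_error_rate` returns the
    green environment's current error rate, and `rollback` redirects
    all traffic back to blue.
    """

    def __init__(self, fetch_error_rate, rollback, threshold=0.02, window=3):
        self.fetch_error_rate = fetch_error_rate
        self.rollback = rollback
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent samples

    def check(self):
        """Poll once; roll back when the windowed average breaches the threshold."""
        self.samples.append(self.fetch_error_rate())
        if len(self.samples) == self.samples.maxlen:
            avg = sum(self.samples) / len(self.samples)
            if avg > self.threshold:
                self.rollback()
                return True
        return False
```

In practice `check()` would run on a schedule (a cron job or a monitoring-system alert hook) so that degradation initiates rollback without a human having to notice it first.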
3. Team Behaviours to Encourage
- Never rush through canary deployment phases under delivery pressure — the monitoring window exists to surface production issues that pre-deployment testing missed, and compressing it defeats its purpose.
- Treat rollback as a normal operational event, not a failure — a team that rolls back a problematic deployment has demonstrated mature operational practice, not weakness.
- Document every rollback with a post-incident review that identifies what pre-deployment testing could have caught the issue, feeding continuous improvement of the testing process.
- Test rollback procedures regularly in production or production-equivalent environments — procedures that are only tested in theory are procedures that will fail under pressure.
4. Watch Out For…
- Canary deployments on user populations that are not representative of the full user base, missing issues that only manifest for specific subgroups.
- Keeping the rollback window so short that it provides no practical safety — production issues can take hours to accumulate statistically significant evidence.
- Infrastructure costs of maintaining two parallel environments, which can grow quickly if not managed carefully, particularly for compute-intensive AI serving workloads.
- Blue-green deployment implemented for the model serving layer but not for the data pipelines and feature stores that feed it, creating a partial safety net with significant gaps.
5. Signals of Success
- All production model deployments go through a canary phase with defined monitoring before full rollout, without exceptions driven by time pressure.
- The team has successfully executed at least one rollback using the blue-green infrastructure, demonstrating that the process works under real conditions.
- Mean time to rollback — from decision to traffic fully redirected — is measured and meets the team's defined SLA, typically minutes rather than hours.
- Rollback capability is regularly tested, with results demonstrating that rollback procedures remain effective as the infrastructure evolves.
- Deployment frequency has increased since blue-green deployment was implemented, as teams feel safer releasing more frequently.
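Mean time to rollback is easiest to keep honest when drills are timed the same way every time. A hypothetical sketch of a drill timer follows; `redirect_traffic` and `verify_blue_serving` are stand-ins for the team's own routing and health-check tooling.

```python
import time

def timed_rollback_drill(redirect_traffic, verify_blue_serving, timeout_s=300):
    """Measure decision-to-fully-redirected time for a rollback drill.

    Hypothetical helper: `redirect_traffic` flips routing back to blue,
    `verify_blue_serving` confirms blue is handling live traffic.
    """
    start = time.monotonic()
    redirect_traffic()  # the rollback decision is taken at time zero
    while not verify_blue_serving():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("rollback exceeded the defined SLA window")
        time.sleep(1)
    return time.monotonic() - start  # one mean-time-to-rollback sample, seconds
```

Recording this elapsed time after every drill gives the team the measured, SLA-comparable rollback metric that the signals above call for.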