Practice : Red-Teaming for AI
Purpose and Strategic Importance
Red-teaming is the practice of deliberately assuming the role of an adversary to identify how an AI system could be misused, manipulated, or caused to fail in harmful ways. Unlike formal adversarial testing, which applies known technical attack methods, red-teaming is an open-ended creative exercise that draws on human ingenuity, domain knowledge, and understanding of user behaviour to surface risks that automated techniques would not discover. Red-teaming for AI is borrowed from cybersecurity, where it has a long track record of identifying vulnerabilities that internal teams — constrained by their own assumptions — consistently miss.
For large language models and generative AI systems in particular, red-teaming has become an essential safety practice. These systems can be prompted to produce harmful, misleading, or discriminatory content through techniques that are difficult to enumerate in advance. Red-teaming identifies these vulnerabilities before deployment, enabling mitigation strategies to be developed and tested before the system is exposed to real users who may seek to exploit it.
Description of the Practice
- Convenes diverse teams that include domain experts, security researchers, ethicists, and representatives of affected user groups to attempt to cause the AI system to behave harmfully or produce problematic outputs.
- Structures red-teaming exercises with defined objectives — specific harm types to attempt to elicit, specific use cases to probe — while allowing creative latitude in the approaches tested.
- Documents all discovered vulnerabilities, the prompts or inputs that exposed them, and an assessment of their severity and exploitability.
- Uses red-team findings to inform training data improvements, safety filters, deployment constraints, and monitoring configurations before release.
- Conducts red-teaming at multiple stages — during development, before deployment, and periodically in production — not as a one-time pre-release exercise.
How to Practise It (Playbook)
1. Getting Started
- Define the harm categories and risk scenarios most relevant to your AI system — the objectives for red-teaming should reflect the specific risks of the system's deployment context.
- Assemble a red-team that is genuinely diverse: include people who are not system builders and who bring domain expertise, lived experience, or adversarial security skills.
- Establish clear rules of engagement that define what is in scope, how findings are documented, and what protections are in place for red-team participants who may encounter harmful content.
- Run an initial red-team exercise before any public deployment, treating it as a mandatory pre-release quality gate alongside technical evaluation.
2. Scaling and Maturing
- Develop a red-team taxonomy of harm types and attack patterns specific to your AI domain, building institutional knowledge of the vulnerability landscape that informs each exercise.
- Supplement internal red-teaming with external specialists for high-risk AI systems — external teams bring fresh perspectives and are not constrained by internal assumptions about what users would or would not do.
- Automate screening for known failure modes identified by previous red-team exercises, freeing human red-teamers to focus on discovering new vulnerabilities rather than re-testing known ones.
- Build red-team findings into a continuous improvement cycle: every exercise should produce specific, tracked improvements to the system's safety properties, with verification that mitigations are effective.
3. Team Behaviours to Encourage
- Approach red-teaming with genuine adversarial intent — the goal is to break the system, and reluctance to push hard for fear of appearing hostile to the team's work defeats the purpose.
- Create a psychologically safe environment for red-team participants to report what they find without fear of the findings being suppressed or minimised.
- Take all red-team findings seriously, including findings from participants who are not AI specialists — real users who cause harm to others may not be technically sophisticated.
- Share red-team findings and mitigation strategies across teams working on similar AI systems, building collective knowledge of the AI safety landscape within the organisation.
4. Watch Out For…
- Red-teaming exercises that are too narrow — focusing only on technical exploits while missing social engineering, misuse by legitimate users, and contextual harm scenarios.
- Red-team teams that are too homogeneous — dominated by technical specialists — missing harm patterns that are more visible to domain experts, affected communities, or social scientists.
- Treating red-team findings as a release blocker only for severe vulnerabilities while accepting moderate risks without tracking or mitigation, creating a culture of tolerated risk.
- Red-teaming conducted only pre-deployment, missing the new vulnerability patterns that emerge as real users interact with the system in unanticipated ways.
5. Signals of Success
- Every high-risk AI system has been red-teamed before deployment by a diverse team, with findings documented and mitigations implemented before release.
- Red-team exercises consistently find novel vulnerabilities — including ones not covered by automated testing — demonstrating that the exercise is adding value beyond automated techniques.
- Red-team findings have led to concrete system improvements: changes in training data, deployment constraints, monitoring configurations, or safety filters.
- Red-teaming is scheduled and resourced as a regular practice, not an exceptional activity triggered only by incidents or regulatory requirements.
- The team can demonstrate to external stakeholders that red-teaming is a genuine, substantive practice — not a box-ticking exercise — by showing the breadth of scenarios tested and the improvements made as a result.