
Running a Calibration Session

Calibration makes the same standard apply across all managers. Without it, fairness is impossible.

A calibration session is where managers compare their assessments of people at the same level to ensure consistent standards are being applied. It is the mechanism that prevents the same word meaning different things to different managers.

Purpose

Calibration exists to answer one question: are we applying the same standard to everyone at the same level?

Without calibration, "meets expectations" means something different in every manager's head. One manager's "strong performer" is another manager's "average." The result is unfair ratings, unfair compensation, unfair promotion decisions - and eventually unfair outcomes that erode trust and increase attrition.

Calibration is the mechanism that closes that gap. It does this by forcing managers to compare specific individuals to specific individuals at the same level, in the presence of other managers who can challenge the comparison.

Done well, calibration:

  • Makes the performance distribution visible and discussable
  • Exposes where individual managers' standards are drifting (up or down)
  • Gives HR and senior leadership confidence that ratings mean something
  • Produces a defensible, documented set of assessments
  • Reduces the risk of bias shaping outcomes without scrutiny

Done badly, calibration becomes a political exercise where the loudest manager wins and the data is retrofitted to the decision already made.

This playbook covers how to run it well.


When to Use This Playbook

Use this playbook when:

  • You are running a performance cycle and need to normalise ratings before they are communicated to individuals
  • You have more than one manager whose team members are at the same level
  • You are preparing to make compensation, promotion, or bonus allocation decisions and want defensible inputs
  • You have had feedback that performance ratings feel inconsistent across the organisation
  • You are introducing a new rating framework and want to establish shared anchors

This is not the right tool for:

  • Making initial performance assessments (managers do that before they arrive)
  • Individual performance conversations (that happens after calibration, not during)
  • Resolving a dispute between a manager and an individual about their rating (that is a separate process)

Before You Start

Who attends

  • All relevant managers (required) - they bring the assessments and can be challenged on them
  • Skip-level manager (required) - facilitates, holds the standard, breaks deadlocks
  • HR Business Partner (required) - process guide, records outcomes, flags risk
  • Senior technical leader (optional) - useful for calibrating technical standards

Keep the group to people who have direct knowledge of the individuals being calibrated. Do not include observers who have no stake in the outcome - they dilute accountability.

What each manager prepares

Every manager brings:

  1. A completed assessment for each person they manage - rating plus written evidence
  2. At least two specific examples per person that support their rating (not one-liners - real examples with context and outcome)
  3. A clear view of where each person sits relative to the anchor behaviours at their level

What they do not bring:

  • Ratings based on effort alone ("they work so hard")
  • Ratings based on likeability ("the team loves them")
  • Ratings based on visibility ("they always present at all-hands")
  • Ratings they have not discussed with the individual yet (but they will have told the individual that calibration happens before ratings are final)

What the facilitator prepares

Before the session:

  • Compile all submitted ratings into a single view - how many at each rating level, by manager
  • Identify the distribution - is it skewed high? Skewed low? Does one manager look like an outlier?
  • Identify the anchors - two or three people at each rating level whose assessments are clear and likely to be uncontroversial. These are your starting points.
  • Prepare the challenge questions you will use when an assessment needs probing
  • Set the agenda and time allocation (roughly 5 minutes per person for the bulk of the list, more time at the edges)

Pre-read the distribution

Before you walk in, you should already know:

  • Which manager has rated everyone high
  • Which manager has rated everyone average
  • Where the biggest gaps between managers' standards are likely to be
  • Who the potential flashpoints are (people who are borderline between two ratings)

You are not deciding the outcome before the session. You are preparing to facilitate a conversation that is likely to be uncomfortable without preparation.
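The pre-read compilation is simple enough to script. A minimal sketch in Python - the rating labels, the numeric scale, and the data shape are illustrative assumptions, not part of this playbook, and the 0.75 skew threshold is an arbitrary starting point you would tune to your own rating scheme:

```python
from collections import Counter

# Illustrative 5-point scale; substitute your organisation's labels.
SCALE = {"below": 1, "developing": 2, "meets": 3, "meets strongly": 4, "exceeds": 5}

# Hypothetical submitted ratings, keyed by manager.
submissions = {
    "manager_a": ["exceeds", "exceeds", "meets strongly", "exceeds"],
    "manager_b": ["meets", "meets", "developing", "meets strongly"],
    "manager_c": ["meets", "meets strongly", "meets", "below"],
}

def mean_score(ratings):
    """Average numeric rating for a list of rating labels."""
    return sum(SCALE[r] for r in ratings) / len(ratings)

all_ratings = [r for team in submissions.values() for r in team]
overall = mean_score(all_ratings)

print(f"Overall: {dict(Counter(all_ratings))}, mean {overall:.2f}")
for manager, team in submissions.items():
    # Skew = how far this manager's average sits from the group average.
    skew = mean_score(team) - overall
    flag = " <-- possible outlier" if abs(skew) > 0.75 else ""
    print(f"{manager}: {dict(Counter(team))}, skew {skew:+.2f}{flag}")
```

With the sample data above, manager_a's average sits well above the group's and gets flagged - exactly the "rated everyone high" pattern the facilitator wants to know about before walking in.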


The Process

Step 1 - Open with the purpose and the ground rules (5 minutes)

Start every calibration session by being explicit about what you are doing and why.

Say something like:

"The purpose of today is to make sure that a 'strong' from [manager A] means the same thing as a 'strong' from [manager B]. We are not here to overrule your judgement. We are here to make sure your judgement is grounded in the same standard. I am going to ask for evidence. I am going to ask you to compare your assessments to specific people. That is how we find out if the standard is consistent. It is not a personal challenge."

Ground rules:

  1. Evidence, not advocacy. When you are challenged, respond with examples, not arguments about how well you know the person.
  2. Compare people to each other, not only to the rubric. That is the only way to find the standard.
  3. No lobbying outside the session. What is decided here is decided here. If you have a concern, raise it in the room.
  4. Ratings are not final until the session closes. Managers may need to adjust.
  5. What is discussed in this room stays in this room.

Step 2 - Establish the anchors (15-20 minutes)

Start with the clearest cases at the top and bottom of the distribution. These are your anchors.

Pick one person who is clearly exceeding expectations - someone where every manager in the room would agree with the assessment without significant debate. Walk through their evidence briefly. Agree that this is what "exceeding" looks like at this level.

Then pick one person who is clearly not meeting expectations - same process. Walk through the evidence. Agree what "below" looks like.

Now you have two reference points. The rest of the session is about placing everyone else relative to those anchors.

Why anchors work: when managers argue about whether someone is "meeting" or "exceeding," they are often arguing about the definition of the word. Anchors shift the argument to "compared to this specific person, where do they sit?" That is a much more tractable question.

Good anchor questions:

  • "Does [person B] do what [anchor A] does? More? Less?"
  • "If you had to put them in a room together and explain to both of them why one is rated higher, what would you say?"
  • "What would [person B] need to be doing consistently to sit where [anchor A] sits?"

Step 3 - Work through the middle (the bulk of the session)

The middle is where calibration does its real work. Most people sit in the middle. Most disagreements live in the middle.

Work through managers' lists, person by person. For each person:

  1. Manager states the rating and gives one piece of supporting evidence.
  2. Facilitator asks: "Has anyone else worked with this person? Does anyone see it differently?"
  3. If no challenge - move on. Note the agreed rating.
  4. If challenge - go to step 4.

Move at pace through uncontroversial cases. Do not slow down for everyone. The time is for disagreement and complexity.

Step 4 - Handle challenges

When a rating is challenged, the process is:

First: go to evidence.

Ask the challenging manager: "What are you seeing that leads you to a different view?" They must give a specific example - a project, a decision, a behaviour, a pattern. Not a general impression.

Ask the original manager: "What is your evidence?" Same standard - specific examples.

Second: compare the evidence.

If both managers have evidence - what does it say? Are they seeing different things? Is one more recent? Is one more relevant to the rating level? Are they both true?

Third: resolve or note the gap.

If the evidence points clearly in one direction - agree the rating. If there is genuine ambiguity - note it and agree who makes the final call (usually the direct manager, informed by the calibration discussion). If it is unresolvable in the session - flag it as a follow-up with a named owner.

Do not let disagreements drag. If you have spent more than eight minutes on one person and are no further forward, call it: "We are going to note this as a close call, [manager] retains the rating, and we flag it for follow-up. Moving on."

Step 5 - Handle the manager who grades everyone high

This is one of the most common calibration problems. One manager has everyone at "exceeds" or at the top of the distribution. This is almost never accurate at the team level.

The approach is not to embarrass the manager but to surface the comparison problem.

Ask: "You have [five people] all rated at 'exceeds.' Compared to [anchor], where do each of them sit? Can you rank them - even informally?"

If they cannot rank them, that is evidence that the ratings are not granular enough to be meaningful.

Then: "If all five exceed expectations, what does it look like when someone is truly outstanding? Who in this room would you say is at a higher level than your whole team?"

The goal is to get the manager to compare their team to reference points outside their team. Almost always, when forced to compare, some of their "exceeds" will become "meets strongly" or "solid meets."

Do not do this publicly in a way that humiliates the manager. Frame it as: "I want to make sure that your strong ratings land well for your team - so I need to understand how they compare to others at the same level."

Step 6 - Handle the manager who grades everyone average

Less common but equally a problem. A manager who rates everyone as "meets expectations" may be:

  • Conflict-averse and unwilling to differentiate
  • Genuinely uncertain about the standard
  • Protecting low performers from scrutiny
  • Under-reporting strong performers to avoid losing them to promotion

Ask: "Is there anyone in your team you would describe as performing in a way that stands out? Even slightly?" If yes, probe why they were not rated higher. If no, that itself is a signal - either the team is genuinely average or the manager is not calibrated to what strong looks like.

Then compare to anchors: "Compared to [anchor who is clearly meeting expectations] - is there anyone in your team who is consistently doing more than that?" Push gently. The goal is to surface a differentiation the manager already knows but is not saying.

Step 7 - Review the final distribution

Once you have worked through everyone, pull up the full distribution.

Ask:

  • "Does this distribution feel right to us as a group? Does it reflect the team's actual performance?"
  • "Are there any ratings here that, now we have seen the full picture, feel off?"
  • "Are there any people who should be adjusted based on what we heard during the session?"

This is not about forcing a bell curve. It is about a sense-check. If 80% of the team is at "exceeds expectations," either this is an exceptional team (possible) or the bar has drifted (more likely). Either way, name it and agree what it means.

Step 8 - Close and agree next steps

Confirm:

  • The agreed rating for every person in scope
  • Any ratings that are still being reviewed and the process to resolve them
  • When managers will communicate ratings to individuals
  • What managers can and cannot share about the calibration process

Read back the full list before anyone leaves. Corrections are much easier before the session ends than after emails have been sent.


What Good Looks Like

A good calibration session:

  • Starts with clear anchors that the group agrees on without significant debate
  • Has at least four to five genuine challenges to submitted ratings - where someone says "I see this differently"
  • Results in at least two to three ratings being adjusted (if zero ratings change, calibration did not do its job)
  • Produces a final distribution that managers can explain and defend
  • Ends with every person having a confirmed rating and the manager knowing when and how to communicate it
  • Takes no longer than the planned time

A well-prepared manager in calibration:

  • Has specific examples ready for every person they manage
  • Can rank their team members informally if asked
  • Can compare their assessments to the anchors without needing to be prompted
  • Is willing to adjust a rating when evidence points that way

Common Failures

Managers arrive without evidence. They have ratings but no examples. Every challenge is met with "I just know my team." The session produces nothing useful because there is nothing to calibrate against. Fix: make written evidence a hard prerequisite. No evidence submitted, no rating protected in the session.

The loudest voice wins. One senior manager dominates the room. Others back down to avoid conflict. Ratings shift toward whatever that manager favours. Fix: facilitator actively draws out quieter voices. "You've worked with this person on the platform migration - what did you see?" Direct invitations, not open questions.

Anchors are skipped. The group jumps straight into the middle of the distribution without establishing reference points. Arguments become circular because no one agrees what the words mean. Fix: anchors first, always.

Gaming the system. A manager submits inflated ratings knowing they will be knocked down in calibration, so the final output is where they actually wanted to land. Counter this by asking managers to sign off on their submitted evidence as their honest assessment. Make the calibration session the place where evidence is tested, not where initial inflation is expected.

Calibration language leaks to individuals. A manager goes back and tells their team member: "You were discussed in calibration and [other manager] had concerns." That individual now has a specific, named person as a source of negative feedback. This is damaging and avoidable. Be explicit: calibration is confidential. The output communicated to individuals is their rating and the reasoning for it - not the calibration discussion.

Time runs out before the hardest cases are discussed. Easy cases get extended discussion. Difficult cases are rushed at the end. Fix: hard cases first. Put the borderline individuals at the top of the agenda, not the bottom.

The distribution is forced to fit a curve. "We can only have 10% at exceeds" is applied mechanically regardless of the actual performance of the group. This produces arbitrary outcomes and destroys trust. Calibration should produce a distribution that reflects reality, not one that fits a template.


Checklist

Two weeks before

  • Confirm attendees and book the room or call (allow 2-3 hours for a team of 15-20 people)
  • Share the assessment template with all managers
  • Share the rating definitions and anchor behaviours for each level
  • Set the submission deadline for assessments (48 hours before calibration)

One week before

  • Chase managers for submissions
  • Review submitted assessments - flag thin evidence, missing examples, or clear outliers
  • Compile the pre-read distribution - ratings by manager, level, and demographic
  • Identify the anchor candidates (clear top, clear bottom)
  • Identify the likely flashpoints and prepare your challenge questions

48 hours before

  • All assessments received
  • Distribution pre-read prepared and reviewed
  • Agenda confirmed - running order, time per person
  • Facilitator has reviewed every submitted assessment

Day of session

  • Ground rules stated at the start
  • Anchors established before the main list begins
  • Timer running per person
  • Evidence demanded for every challenged rating
  • Loud voices balanced with direct invitations to quieter ones
  • Action log or rating log updated in real time

End of session

  • Final ratings confirmed for every person in scope
  • Unresolved cases logged with owner and due date
  • Distribution reviewed by the group
  • Communication timing agreed
  • Confidentiality reminder given
  • Rating log distributed to managers within 24 hours

After calibration

  • Managers communicate ratings to individuals (with timeline agreed in session)
  • Any follow-up cases resolved within agreed timeframe
  • Note what to do differently in the next cycle

Reference: Rating Anchor Examples

Use language like this to establish shared anchors. Adjust to match your organisation's level definitions.

What each rating looks like at a senior engineer level:

  • Exceeds - Independently leads delivery of complex, ambiguous work. Improves the way the team works, not just their own output. Others seek their input before making decisions.
  • Meets strongly - Delivers reliably on complex work. Occasionally takes initiative beyond their scope. Proactively identifies problems before they escalate.
  • Meets - Delivers consistently on clearly scoped work. Raises problems. Growing in confidence taking on ambiguous work.
  • Developing - Delivers on well-defined tasks. Needs regular guidance on approach. Learning the fundamentals of the role.
  • Below - Delivery is inconsistent. Requires close support. Gaps in the fundamentals of the role.

Reference: Common Gaming Behaviours and Counters

  • Inflation gaming - submits everyone high, knowing calibration will knock them down. Counter: make submitted evidence the anchor. Ask: "Is this your honest assessment?" Challenge every high rating with evidence.
  • Advocacy substitution - argues for a rating by talking about how much they like or trust the individual, not what they did. Counter: redirect - "I hear that. What did they do specifically that tells you that?"
  • Historical anchoring - justifies the current rating with performance from two years ago. Counter: ask "What have you seen in the last six months specifically?"
  • Comparison avoidance - refuses to compare their team to others ("every role is different"). Counter: agree the roles differ, then compare on transferable dimensions - leadership, problem-solving, impact.
  • False precision - gives highly specific ratings without evidence, to make them sound grounded. Counter: ask for the evidence behind the precision. If there is none, the precision is noise.

Reference: Example Challenge Questions

  • "You've rated [name] as exceeds. Compared to [anchor], what are they doing that [anchor] is not?"
  • "What would [name] need to stop doing, or start doing, to move their rating?"
  • "You have five people rated the same. If you had to rank them, who is strongest? What makes them different?"
  • "The evidence you've shared is from eight months ago. What have you seen in the last quarter?"
  • "You and [other manager] have both worked with this person. You see them differently. What specifically are each of you seeing?"
  • "If [name] asked you after this conversation why they were rated where they were, what would you say? Would you be comfortable saying it in those terms?"

Reference: Communicating Calibration Outcomes to Individuals

After calibration, managers return to their teams and communicate individual ratings. This conversation must be handled carefully.

What to say:

"Your rating for this cycle is [X]. That reflects [specific evidence - one or two things they did that drove the rating]. The calibration process has confirmed that this is the right assessment at your level."

What not to say:

  • Do not reference what other people were rated - ever
  • Do not say "calibration knocked your rating down" or "calibration pushed it up" - own the rating as the manager's assessment
  • Do not use calibration as a shield: "I would have rated you higher but calibration said..."
  • Do not share what was discussed about other individuals

If the individual pushes back:

"I hear that you see it differently. Let's talk about the specific evidence. What do you think you demonstrated this cycle that should have moved the rating?"

Then: listen, engage with the evidence they give, and be honest. If their evidence is compelling and changes your view - say so, and agree to raise it. If it does not change your view - explain specifically why.

Frequently asked questions from individuals after calibration:

Q: "Was my rating discussed in calibration?" A: "The calibration process is confidential. What I can tell you is your rating and the reasoning for it."

Q: "Did my manager advocate for me?" A: "I can speak to your rating and what it reflects. The process itself is confidential."

Q: "What do I need to do to get a higher rating next cycle?" A: This is the right question to focus on. Answer it specifically and in writing. "To move from [X] to [Y] I would need to see [specific behaviours and outcomes] consistently over the next cycle."