How to measure AI coding agent productivity after GitHub shipped a merge-count metric

2026-06-28

An engineering dashboard showing a tall bar chart of merged pull requests next to a much shorter bar labeled verified, with a magnifying glass over the gap between the two.

GitHub just added a merge-count-by-adoption-phase metric to the Copilot admin console. It is a great input and a dangerous board number. Here is how engineering leaders measure real coding-agent productivity without mistaking volume for value.

I want to flag a small feature GitHub shipped on June 26, because it is going to end up on a board slide for nearly every engineering leader reading this, and it is worth putting there on purpose rather than by accident.

GitHub added a new field to the Copilot usage-metrics API for organizations and enterprises. It is called total_pull_requests_merged, and it breaks merged pull requests down by AI adoption phase. So now an admin can open a report and see, in raw counts, how many PRs the heavy AI adopters merged versus everyone else, in both one-day and twenty-eight-day windows. The changelog describes the point of it plainly: with a total instead of just a per-user average, a leader can work out what share of all merged PRs each adoption phase contributed.
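As a minimal sketch of that share calculation, the snippet below assumes the total_pull_requests_merged value for each adoption phase has already been pulled into a plain dictionary. The phase labels and counts are invented for illustration and are not the API's actual response shape.

```python
def merge_share_by_phase(merged_by_phase: dict[str, int]) -> dict[str, float]:
    """Return each phase's share of all merged PRs as a fraction of 1."""
    total = sum(merged_by_phase.values())
    if total == 0:
        # No merges in the window: report zero share rather than divide by zero.
        return {phase: 0.0 for phase in merged_by_phase}
    return {phase: count / total for phase, count in merged_by_phase.items()}


# Hypothetical 28-day totals; phase names and numbers are made up.
example_counts = {"phase_1": 120, "phase_2": 180, "phase_3": 300}
shares = merge_share_by_phase(example_counts)
for phase, share in shares.items():
    print(f"{phase}: {share:.0%} of merged PRs")
```

This answers only the "where is the volume coming from" question. It says nothing yet about whether those merges were reviewed or held up.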

That is a genuinely useful number. It is also the most seductive trap in coding-agent measurement right now, because it counts the thing AI inflates most easily and says nothing about whether that thing was any good.

TLDR

GitHub now ships merged-PR counts by AI adoption phase straight into the Copilot admin console. Treat it as an input that tells you where the volume is coming from, never as the productivity number itself. The defensible metric is verified merged output per engineer over fully loaded cost per engineer, because the 2026 studies cited below show merge volume rising while team delivery stays flat and the extra work pools at the review step.

The number that just walked into the admin console

Here is what happened, with the specifics, because the specifics are the whole point.

On June 26 GitHub added merge counts by adoption phase to the Copilot usage-metrics reports. The day before, June 25, it shipped a related change: Copilot code review now uses built-in file-exploration tools, which the changelog says cut review cost by about twenty percent at the same review quality. Put those two together and the shape of the week is clear. The harness vendor is lowering the cost of reviewing AI output, and at the same time it is handing leadership a tidy count of how much output got merged. One side of the ratio is getting more honest. The other side is getting easier to misread.

I have watched a lot of engineering leaders build their first AI productivity slide over the last year, and the pattern is always the same. The merge count is right there, it is big, it went up, and it has the vendor’s logo on it. So it goes on the slide. “Our phase-three adopters merged a far larger share of PRs this quarter.” Everyone nods. The slide ships.

The problem is not that the number is wrong. The number is accurate. The problem is that merged-PR count is precisely the metric an agentic harness moves first and moves most, and it moves whether or not anything got better.

What every 2026 study found when they looked past the merge count

This is where the outside evidence matters, and it is worth dating carefully, because none of it is from this week. It is the context that makes this week’s GitHub feature legible.

The clearest single read is the Faros AI engineering report from April, the one they called the AI Acceleration Whiplash. They pulled workflow telemetry from ten thousand developers across more than a thousand enterprise teams, which makes it one of the larger data sets on this question. The finding is the whole argument in one sentence.

"Developers on teams with high AI adoption complete 21% more tasks and merge 98% more pull requests, but PR review time increases 91%."

Faros AI, The AI Engineering Report 2026, April 2026

Read that again slowly. Ninety-eight percent more PRs merged. And ninety-one percent more time spent in review. The volume nearly doubled, and the cost of checking that volume nearly doubled right alongside it. The work did not disappear. It moved downstream, from authoring to verifying, and it landed on the people doing the reviews.

Sonar’s State of Code survey from early June, covering more than eleven hundred professional developers, tells the same story from the developer’s chair: about forty-two percent of committed code is now AI-generated or AI-assisted, and the bottleneck has shifted to the verification stage. DX’s Q1 impact report across four hundred-plus companies found daily AI users shipping roughly sixty percent more PRs, with some organizations seeing up to fifty percent more defects than before they adopted AI tools. More merges, more to check, more that slips through.

So the GitHub merge count is not measuring productivity. It is measuring the numerator of a ratio whose denominator nobody put on the slide.

98% more pull requests merged on high-AI-adoption teams, against a 91% rise in review time (Faros AI, April 2026)

Where the merge count quietly breaks as a productivity proxy

Let me be concrete about how this goes wrong in a real engineering org, because the failure is not theoretical and it is not loud. It is quiet, which is what makes it dangerous.

A team turns on Copilot or Claude Code or Cursor. Within a few weeks the merge count climbs. The adoption-phase report confirms it: the heavy adopters are merging a much larger share of PRs. Leadership sees the chart, concludes the tools are working, and expands the rollout. Reasonable. The chart said so.

But underneath that chart, three things are happening that the chart cannot see. Review time per PR is climbing, because each merged PR now carries more code that a human had to read and trust. A slice of those merges went in without a real review at all, because the queue got long and someone waved them through. And the defect rate is drifting up, slowly enough that nobody attributes it to the tool. The merge count went up. Delivery did not. Stability got a little worse. The number on the slide and the health of the system have quietly diverged.

That divergence is the binding constraint of AI-assisted engineering in 2026, and it is exactly the thing a merge count hides. The merge count answers “how much got through.” It does not answer “how much of what got through was correct, durable, and worth the token bill.” Those are different questions, and only the second one belongs on a board slide.

Key Insight

A coding agent moves merged-PR volume first and most. So merge count is the one metric guaranteed to look like success even when delivery is flat and stability is slipping. It is the easiest number to produce and the easiest to be fooled by.

The metric that survives a CFO asking “so what did we actually get”

Here is the number I would put on the slide instead, and it is built from two honest halves.

The numerator is verified merged output per engineer. Not merges. Merges that passed a real review and were still standing some weeks later, measured against the engineer who shipped them. The word doing the work there is “verified.” A PR that merged and got reverted, or merged and caused an incident, or merged without anyone actually reading it, does not count as output. It counts as debt with a green checkmark.

The denominator is fully loaded cost per engineer. And in 2026 that denominator is finally knowable, because the meter is on. Anthropic’s own enterprise figures, as compiled in mid-June, put Claude Code at roughly thirteen dollars per developer per active day and a hundred fifty to two hundred fifty dollars per developer per month, with about ninety percent of users under thirty dollars on any active day. GitHub’s metered billing has been live since June 1. The cost side stopped being a guess. The output side is the half still pretending to be measured when it is mostly just counted.
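To make the ratio concrete, here is a minimal sketch under stated assumptions. The class and field names are hypothetical, the cost figure stands in for whatever the org treats as fully loaded cost for the period, and the sample numbers are invented. Only the shape of the calculation comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class TeamPeriod:
    """One team's numbers for one reporting period (all fields hypothetical)."""
    engineers: int
    merged_prs: int            # raw merge count, e.g. from the vendor report
    verified_merged_prs: int   # merges that met the org's own "verified" rule
    loaded_cost_total: float   # salary, benefits, tooling and agent spend, in dollars


def verified_output_per_dollar(p: TeamPeriod) -> float:
    """Verified merged output per engineer over fully loaded cost per engineer."""
    if p.engineers <= 0 or p.loaded_cost_total <= 0:
        raise ValueError("engineers and loaded_cost_total must be positive")
    # Headcount cancels in the final ratio; the two halves are kept separate
    # so each can be reported on its own.
    output_per_engineer = p.verified_merged_prs / p.engineers
    cost_per_engineer = p.loaded_cost_total / p.engineers
    return output_per_engineer / cost_per_engineer


# Invented before/after periods for the same ten-person team.
before = TeamPeriod(engineers=10, merged_prs=200,
                    verified_merged_prs=180, loaded_cost_total=500_000.0)
after = TeamPeriod(engineers=10, merged_prs=400,
                   verified_merged_prs=190, loaded_cost_total=520_000.0)

for label, period in (("before", before), ("after", after)):
    per_100k = verified_output_per_dollar(period) * 100_000
    print(f"{label}: {period.merged_prs} merged, {per_100k:.1f} verified PRs per $100k")
```

In this invented example the raw merge count doubles while verified PRs per $100k moves only from 36.0 to 36.5, which is the gap between the count and the ratio in miniature.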

The merge count tells you how much got through. It does not tell you how much of what got through was worth keeping. Those are different questions, and only one of them belongs on a board slide.

So use GitHub’s new field. Genuinely, use it. It is a good input. The adoption-phase breakdown shows where the volume is concentrated, which heavy users to talk to, where to look first when hunting for the verification load. That is real signal. Just do not let the count cross the line from input to output without passing through the word verified on the way.

Input metric versus board metric

Question	What answers it
Where is the merge volume coming from?	GitHub merges-by-adoption-phase (the new field)
Did that volume create real value?	Verified merged output per engineer over loaded cost

Three moves before the next review cycle

I am keeping this calm because the problem is solvable, and most of the work is reframing, not rebuilding.

First, when the merge-count chart goes on the slide, put review time and revert or incident rate on the very same slide. If volume is up and review time is up ninety percent, the slide should say both out loud, in the same breath. A number that only shows the half that flatters the tool is not a measurement, it is marketing the org did to itself.

Second, define verified merged output before reporting a productivity number, not after. Pick the rule. A PR counts as output if it was reviewed by a named human, was not reverted within some window, and did not trigger an incident. It does not have to be perfect. It has to be honest and consistent, so that when the count moves a leader knows whether the value moved with it.

Third, name one person who owns the ratio. Not a dashboard, not a channel, a person. The owner of “verified merged output per engineer over cost per engineer” is the one who can stand in front of the board and say what the org actually got for its agent spend this quarter. Right now in most orgs that person does not exist, and the merge count is filling the vacuum.
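The rule from the second move can be written down in a few lines. This is a sketch, not a prescribed definition: the record fields, the 28-day revert window, and the sample data are all assumptions chosen for illustration, and each org would substitute its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MergedPR:
    """Hypothetical record of one merged PR, assembled from the org's own tooling."""
    merged_at: datetime
    reviewer: Optional[str]          # named human who approved, or None
    reverted_at: Optional[datetime]  # when it was reverted, or None
    caused_incident: bool


REVERT_WINDOW = timedelta(days=28)  # assumed window; pick one and keep it fixed


def is_verified(pr: MergedPR, now: datetime) -> bool:
    """Apply the three-part rule: human review, survived the window, no incident."""
    if not pr.reviewer:
        return False
    if pr.caused_incident:
        return False
    if pr.reverted_at is not None and pr.reverted_at - pr.merged_at <= REVERT_WINDOW:
        return False
    # A PR whose revert window has not closed yet is too recent to judge.
    return now - pr.merged_at >= REVERT_WINDOW


now = datetime(2026, 6, 28)
prs = [
    MergedPR(datetime(2026, 5, 1), "alice", None, False),                 # verified
    MergedPR(datetime(2026, 5, 2), None, None, False),                    # waved through
    MergedPR(datetime(2026, 5, 3), "bob", datetime(2026, 5, 10), False),  # reverted
    MergedPR(datetime(2026, 5, 4), "carol", None, True),                  # incident
]
verified = [pr for pr in prs if is_verified(pr, now)]
print(f"{len(verified)} of {len(prs)} merged PRs count as verified output")
```

Treating a too-recent PR as not yet verified is a design choice in this sketch. It keeps the count from flattering the most recent weeks, at the price of a reporting lag equal to the window.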

The thing I would tell a fellow founder over coffee is this. The vendors are getting very good at handing over the numerator. They will keep shipping cleaner, prettier counts of how much the agents produced, because that is the number that sells the next seat and the next token tier. Nobody is going to ship the denominator or the word verified. That part belongs to the org. It always did. The good news is that it is not hard, it is just unglamorous, and unglamorous is exactly the kind of work that has always separated the engineering orgs that compound from the ones that just look busy.

Sources

Track total merges by adoption phase in enterprise and organization reports - GitHub Changelog, 2026-06-26
Copilot code review: Analysis depth and efficiency updates - GitHub Changelog, 2026-06-25
The AI Engineering Report 2026: The AI Acceleration Whiplash - Faros AI, 2026-04-22
State of Code Developer Survey report 2026 - Sonar, 2026-06-01
AI-assisted engineering: Q1 2026 impact report - DX, 2026-02-01
