News · 2026-07-04

GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens

GPT-5.5, OpenAI's newest Codex model, is cutting off its own internal reasoning at exactly 516 tokens far more often than any other model in the dataset studied. A GitHub issue analyzing 390,195 response records from real coding sessions found the cutoff clusters overwhelmingly on GPT-5.5 at its highest reasoning setting, and the pattern has been reproduced independently by multiple developers. The leading explanation is an accidental serving-side optimization, not a deliberate change to how much effort the model puts in.

Key facts

GPT-5.5 accounted for 82% of all exact-516-token reasoning cutoffs, despite making up only about 19.3% of all responses studied.
The analysis covered 390,195 response records across 865 real coding sessions from February through June 2026, posted in GitHub issue #30364 on 2026-06-27.
44% of GPT-5.5's responses with at least 516 reasoning tokens stopped at precisely 516, compared with about 1.3% for the other models combined - a gap of more than 30x.
Primary source: the GitHub issue, discussed further on Hacker News.

When you ask a modern coding model a hard question, it does not just spit out an answer - it first works through the problem in a private scratchpad, producing what's called reasoning tokens before it writes the visible reply. In theory, harder problems get more scratchpad space. That is what makes this bug so strange: GPT-5.5, at its most careful "xhigh" reasoning setting, keeps stopping that scratchpad at exactly 516 tokens, then 1,034, then 1,552, then 2,070 - a fixed comb of values spaced 518 tokens apart, as if the model's thinking were being sliced onto a shelf that only comes in one size.
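The comb described above is easy to test for. What follows is a minimal sketch, not code from the GitHub issue: it assumes only the reported first cutoff (516) and the reported spacing (518), and the function name `comb_index` is invented for illustration.

```python
from typing import Optional

FIRST_CUTOFF = 516  # first reported cutoff value
SPACING = 518       # reported gap between successive cutoff values


def comb_index(reasoning_tokens: int) -> Optional[int]:
    """Return 1 for 516, 2 for 1,034, and so on; None if the count is off the comb."""
    offset = reasoning_tokens - FIRST_CUTOFF
    if offset < 0 or offset % SPACING != 0:
        return None
    return offset // SPACING + 1


if __name__ == "__main__":
    # The four values named in the article, plus one arbitrary off-comb count.
    for count in (516, 1034, 1552, 2070, 700):
        print(count, comb_index(count))
```

A single response landing on the comb proves nothing on its own, since any count can occur by chance. The signal in the issue is how often it happens.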

The numbers back this up. Developer vguptaa45 combed through 390,195 individual model responses gathered from 865 real coding sessions run between February and June 2026, then posted the findings on OpenAI's Codex GitHub repository. Of every response in the entire dataset that stopped its reasoning at exactly 516 tokens, 82% came from GPT-5.5 - even though GPT-5.5 made up only about a fifth of the responses in the dataset overall. Zoom in further and it gets more striking: of GPT-5.5's own responses that used at least 516 reasoning tokens, 44% stopped at exactly that number. For the other models checked, the same clustering happened only about 1.3% of the time. That is more than a 30-fold difference concentrated in a single model.
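The two headline statistics are different measurements, and a short sketch makes the distinction concrete. The code below is an illustration under stated assumptions, not the script used in the issue: it assumes each record reduces to a (model name, reasoning token count) pair, and the function name `cutoff_stats` and the toy records are invented.

```python
from collections import Counter
from typing import Iterable, Tuple

CUTOFF = 516


def cutoff_stats(records: Iterable[Tuple[str, int]], model: str) -> Tuple[float, float, float]:
    """Return (share of exact cutoffs owned by `model`,
    exact-cutoff rate for `model`, exact-cutoff rate for all other models).

    Rates are taken over responses with at least CUTOFF reasoning tokens.
    """
    exact = Counter()     # responses that stopped at exactly CUTOFF
    eligible = Counter()  # responses with at least CUTOFF reasoning tokens
    for name, tokens in records:
        key = "target" if name == model else "other"
        if tokens >= CUTOFF:
            eligible[key] += 1
            if tokens == CUTOFF:
                exact[key] += 1

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    total_exact = exact["target"] + exact["other"]
    return (
        ratio(exact["target"], total_exact),
        ratio(exact["target"], eligible["target"]),
        ratio(exact["other"], eligible["other"]),
    )


if __name__ == "__main__":
    # Invented toy records, only to show the shape of the calculation.
    toy = [
        ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 900), ("gpt-5.5", 100),
        ("other", 516), ("other", 800), ("other", 1200), ("other", 2000),
    ]
    print(cutoff_stats(toy, "gpt-5.5"))
```

The first number corresponds to the 82% figure, the second to the 44% figure, and the third to the 1.3% figure. Dividing the published 44% by 1.3% gives a ratio of about 34.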

The behavior is not constant, either. In February 2026 the exact-516 cutoff essentially did not happen. By May 2026 it had surged to around 53% of qualifying responses, and over that same stretch GPT-5.5's average reasoning length was roughly cut in half. The cutoff only shows up at the "xhigh" (extra-high) reasoning-effort setting - the mode users pick specifically when they want the model to think harder about a tough problem.

The likely explanation, favored by several engineers in the discussion, is not that OpenAI is deliberately dumbing the model down. It is a side effect of "continuous batching," a common trick inference providers use to serve many users' requests at once efficiently, similar to how a bakery bakes bread in trays that hold a fixed number of loaves rather than one at a time. If the reasoning process is internally chunked into roughly 512-token slots for batching efficiency (516 being 512 plus 4 tokens of overhead), then a problem that needs slightly more room than one slot may get truncated right at that slot's edge instead of being allowed to spill into the next one. As one commenter on Hacker News put it, "This is evidence of a bug, not the purposeful enshittification people are referencing."
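The slot arithmetic can be checked against the cutoff values reported in the issue. This sketch assumes the hypothesized 512-token slot (unconfirmed by OpenAI) and simply computes how many tokens are left over at each reported cutoff once whole slots are subtracted. The name `overhead_per_boundary` is invented for illustration.

```python
from typing import List, Tuple

SLOT = 512          # hypothesized slot size from the discussion, not confirmed
FIRST_CUTOFF = 516  # first reported cutoff value
SPACING = 518       # reported gap between successive cutoff values


def overhead_per_boundary(count: int = 4) -> List[Tuple[int, int, int]]:
    """Return (cutoff, slots, leftover) for the first `count` reported cutoffs,
    where leftover = cutoff - slots * SLOT."""
    rows = []
    for k in range(1, count + 1):
        cutoff = FIRST_CUTOFF + (k - 1) * SPACING
        rows.append((cutoff, k, cutoff - k * SLOT))
    return rows


if __name__ == "__main__":
    for cutoff, slots, leftover in overhead_per_boundary():
        print(f"cutoff={cutoff} slots={slots} leftover={leftover}")
```

The leftover is 4 tokens at the first cutoff and grows by 6 at each later one, so under the slot hypothesis the 518-token spacing would correspond to 512 plus 6 rather than 512 plus 4. This is arithmetic on the reported values only and says nothing about how the serving stack actually works.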

But a bug is still a real problem for anyone relying on the model. One developer ran an identical prompt through GPT-5.5 ten times; four of the ten runs hit the 516-token clamp, and all four of those runs produced the wrong answer. Developer nsingh2, posting on Hacker News, summarized the stakes plainly: "nearly half the time, 5.5 xhigh could be short circuiting and degrading performance." For anyone using GPT-5.5 inside an autonomous coding agent - the kind of tool covered in our look at whether you can trust what a coding agent tells you - a silently truncated thought process is exactly the sort of failure that is invisible until the code breaks.

There is a partial workaround circulating in the community: explicitly instructing the model to "reason for at least 60 seconds" before answering appears to avoid the clamp in many cases, suggesting the cutoff is tied to a length or token-count boundary rather than the model genuinely deciding it is done thinking.

A full week after the issue was filed, OpenAI has not given any substantive staff response - only an automated bot has labeled the report. The honest caveat here is that OpenAI has not confirmed what is causing this. The strongest evidence available points to an accidental serving or batching boundary rather than an intentional reduction in effort, and developers report the pattern persisting into July. Until OpenAI comments, GPT-5.5 users running long or difficult coding sessions may want to watch for suspiciously short reasoning traces and consider the "reason longer" workaround as a stopgap.

Primary source, verified: GitHub issue #30364.

Key questions

What is causing GPT-5.5 to cut off its reasoning at 516 tokens?

The leading explanation is an accidental serving-side batching optimization that processes reasoning in fixed slots of about 512 tokens, cutting off anything that needs more room right at that boundary, though OpenAI has not confirmed the cause.

How much worse does this make GPT-5.5's answers?

In one test, a developer ran the same prompt ten times and the four runs that hit the 516-token clamp all produced the wrong answer, which suggests the truncation hurts accuracy, though ten runs is a small sample.

Is there a workaround for the 516-token cutoff?

Some developers report that explicitly instructing GPT-5.5 to reason for at least 60 seconds avoids the clamp, though this is a community workaround, not an official fix.

Cite this

APA

Ground Truth. (2026, July 4). GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens. Ground Truth. https://groundtruth.day/news/gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens.html

BibTeX

@misc{groundtruth:gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens,
  title  = {GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jul},
  url    = {https://groundtruth.day/news/gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens.html}
}

Topics: OpenAI · GPT-5.5 · Codex · inference · reliability
