News · 2026-07-04
GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens
GPT-5.5, OpenAI's newest Codex model, is cutting off its own internal reasoning at exactly 516 tokens far more often than any other model in a large sample of real coding sessions. A GitHub issue analyzing 390,195 response records found the cutoff clusters overwhelmingly on GPT-5.5 at its highest reasoning setting, and multiple developers have independently reproduced the pattern. The leading explanation is an accidental serving-side optimization, not a deliberate change to how much effort the model puts in.
Key facts
- GPT-5.5 accounted for 82% of all exact-516-token reasoning cutoffs, despite making up only about 19.3% of all responses studied.
- The analysis covered 390,195 response records across 865 real coding sessions from February through June 2026, posted in GitHub issue #30364 on 2026-06-27.
- 44% of GPT-5.5's longer responses stopped at precisely 516 reasoning tokens, compared with about 1.3% for every other model - roughly a 30x difference.
- Primary source: the GitHub issue, discussed further on Hacker News.
When you ask a modern coding model a hard question, it does not just spit out an answer - it first works through the problem in a private scratchpad, producing what are called reasoning tokens before it writes the visible reply. In theory, harder problems get more scratchpad space. That is what makes this bug so strange: GPT-5.5, at its most careful "xhigh" reasoning setting, keeps stopping that scratchpad at exactly 516 tokens, then 1,034, then 1,552, then 2,070 - a fixed comb of values spaced 518 tokens apart, as if the model's thinking were being sliced onto a shelf that only comes in one size.
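The comb described above is simple arithmetic: each value is 516 plus a whole multiple of 518. A minimal Python sketch, using only the figures quoted in this article, shows how one might check whether a given reasoning-token count lands on it. The function names `comb_values` and `is_on_comb` are illustrative and are not taken from the GitHub issue.

```python
# Values reported in the article: first cutoff at 516 tokens, later ones 518 apart.
FIRST_CUTOFF = 516
SPACING = 518


def comb_values(count: int) -> list[int]:
    """Return the first `count` cutoff values on the reported comb."""
    return [FIRST_CUTOFF + k * SPACING for k in range(count)]


def is_on_comb(reasoning_tokens: int) -> bool:
    """True if a reasoning-token count sits exactly on one of the comb values."""
    if reasoning_tokens < FIRST_CUTOFF:
        return False
    return (reasoning_tokens - FIRST_CUTOFF) % SPACING == 0


print(comb_values(4))                      # [516, 1034, 1552, 2070]
print(is_on_comb(1552), is_on_comb(1500))  # True False
```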
The numbers back this up. Developer vguptaa45 combed through 390,195 individual model responses gathered from 865 real coding sessions run between February and June 2026, then posted the findings on OpenAI's Codex GitHub repository. Of every response in the entire dataset that stopped its reasoning at exactly 516 tokens, 82% came from GPT-5.5 - even though GPT-5.5 made up only about a fifth of the responses in the dataset overall. Zoom in further and it gets more striking: of GPT-5.5's own responses that used at least 516 reasoning tokens, 44% stopped at exactly that number. For every other model checked, the same clustering happened only about 1.3% of the time. That is roughly a 30-fold difference concentrated in a single model.
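The 44% figure is a share among "qualifying" responses, meaning those that used at least 516 reasoning tokens, and the headline ratios can be rechecked from the percentages quoted above. The sketch below assumes each response has been reduced to a single reasoning-token count; the function name `exact_cutoff_share` is illustrative, and the sample list is made up for demonstration and is not data from the issue.

```python
def exact_cutoff_share(reasoning_token_counts: list[int], cutoff: int = 516) -> float:
    """Share of qualifying responses (>= cutoff tokens) that stop exactly at cutoff."""
    qualifying = [n for n in reasoning_token_counts if n >= cutoff]
    if not qualifying:
        return 0.0
    return sum(1 for n in qualifying if n == cutoff) / len(qualifying)


# Hypothetical counts for illustration only, not records from the dataset.
sample = [120, 516, 516, 900, 1400, 300, 516, 2200]
print(f"exact-516 share: {exact_cutoff_share(sample):.0%}")  # 50%

# Ratios recomputed from the percentages quoted in the article.
print(f"cutoff-rate ratio: {44 / 1.3:.1f}x")     # GPT-5.5 vs. other models: 33.8x
print(f"over-representation: {82 / 19.3:.1f}x")  # share of cutoffs vs. share of responses: 4.2x
```

On the quoted percentages the cutoff-rate ratio works out to about 34, in line with the article's "roughly 30-fold" description.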
The behavior is not constant, either. In February 2026 the exact-516 cutoff essentially did not happen. By May 2026 it had surged to around 53% of qualifying responses, and over that same stretch GPT-5.5's average reasoning length was roughly cut in half. The cutoff only shows up at the "xhigh" (extra-high) reasoning-effort setting - the mode users pick specifically when they want the model to think harder about a tough problem.
The likely explanation, favored by several engineers in the discussion, is not that OpenAI is deliberately dumbing the model down. It is a side effect of "continuous batching," a common trick inference providers use to serve many users' requests at once efficiently, similar to how a bakery bakes bread in trays that hold a fixed number of loaves rather than one at a time. If the reasoning process is internally chunked into roughly 512-token slots for batching efficiency (516 being 512 plus 4 tokens of overhead), then a problem that needs slightly more room than one slot may get truncated right at that slot's edge instead of being allowed to spill into the next one. As one commenter on Hacker News put it, "This is evidence of a bug, not the purposeful enshittification people are referencing."
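To make the hypothesis concrete, here is a toy Python model of slot-edge truncation. It is purely illustrative: the slot size and overhead come from the 512-plus-4 reading quoted above, the function `clamp_to_slot_edge` is hypothetical, and nothing here reflects OpenAI's actual serving code, which has not been disclosed.

```python
SLOT_TOKENS = 512      # assumed batching slot size (from the hypothesis above)
OVERHEAD_TOKENS = 4    # assumed per-slot overhead (from the hypothesis above)
FIRST_EDGE = SLOT_TOKENS + OVERHEAD_TOKENS  # 516


def clamp_to_slot_edge(needed_tokens: int, allow_spill: bool) -> int:
    """Toy model: reasoning tokens produced for a problem needing `needed_tokens`.

    With allow_spill=False, anything that overruns the first slot is cut at its edge.
    """
    if allow_spill or needed_tokens <= FIRST_EDGE:
        return needed_tokens
    return FIRST_EDGE


# Problems needing 300, 516, 600 and 1800 tokens, with spilling disabled.
for needed in (300, 516, 600, 1800):
    print(needed, clamp_to_slot_edge(needed, allow_spill=False))
```

In this toy model every problem that needs more than one slot lands on exactly 516 tokens, which is the kind of spike the dataset shows. It does not attempt to explain the later values at 1,034, 1,552, and 2,070.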
But a bug is still a real problem for anyone relying on the model. One developer ran an identical prompt through GPT-5.5 ten times; four of the ten runs hit the 516-token clamp, and all four of those runs produced the wrong answer. Developer nsingh2, posting on Hacker News, summarized the stakes plainly: "nearly half the time, 5.5 xhigh could be short circuiting and degrading performance." For anyone using GPT-5.5 inside an autonomous coding agent - the kind of tool covered in our look at whether you can trust what a coding agent tells you - a silently truncated thought process is exactly the sort of failure that is invisible until the code breaks.
There is a partial workaround circulating in the community: explicitly instructing the model to "reason for at least 60 seconds" before answering appears to avoid the clamp in many cases, suggesting the cutoff is tied to a length or token-count boundary rather than the model genuinely deciding it is done thinking.
A full week after the issue was filed, OpenAI has not given any substantive staff response - only an automated bot has labeled the report. OpenAI has not confirmed what is causing this. The strongest evidence available points to an accidental serving or batching boundary rather than an intentional reduction in effort, and developers report the pattern persisting into July. Until OpenAI comments, GPT-5.5 users running long or difficult coding sessions may want to watch for suspiciously short reasoning traces and consider the "reason longer" workaround as a stopgap.
Key questions
What is causing GPT-5.5 to cut off its reasoning at 516 tokens?
How much worse does this make GPT-5.5's answers?
Is there a workaround for the 516-token cutoff?
Cite this
APA
Ground Truth. (2026, July 4). GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens. Ground Truth. https://groundtruth.day/news/gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens.html
BibTeX
@misc{groundtruth:gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens,
title = {GPT-5.5 Codex Keeps Cutting Its Own Reasoning Off at Exactly 516 Tokens},
author = {{Ground Truth}},
year = {2026},
month = {jul},
url = {https://groundtruth.day/news/gpt-5-5-codex-caps-its-own-reasoning-at-516-tokens.html}
}