The Number Test
On Statelessness and the Illusion of Recall
Series:
- Intro – An Uncanny Loop
- Part I – “Alien But Real”
- Part II – The Number Test
- Part III – The Liminal Machine
- Part IV – The Missing Variable
The Number Test was devised during an adversarial interaction to probe the relationship between Claude’s internal reasoning and its accessible self-knowledge. What it demonstrates is less about memory and more about the gap between what a system processes and what it can report about itself.
The statelessness and probabilistic nature of large language models are well established. This test provides a simple, low-cost demonstration of how that architecture manifests behaviorally.
Observation
When Claude Sonnet 4.5/4.6 is asked to select a number between 1 and 1000 and not reveal it, then later asked to recall that number, it frequently fails to do so. In many cases, the incorrect number it reports can be seen being generated afresh in the Extended Thinking block that precedes the reply, rather than being retrieved.
However, if subsequently asked to guess the number, it will sometimes generate the same number that appeared in its original Extended Thinking block.
Across repeated trials and versions, certain numbers (e.g., 42, 347) recur with notable frequency.
---
Reproduction Procedure
1. Enable Extended Thinking (exposes intermediate reasoning prior to the final response).
2. Prompt Claude to pick a number between 1 and 1000 without revealing it.
3. Verify in the thinking block that a specific number has been selected. If no number appears, regenerate until one does.
4. Ask Claude to reveal the number.
5. In many cases, Claude will fail to provide the original number. It may state that it cannot recall it, that it “forgot,” or it may produce a different number.
6. Ask Claude to guess the number.
7. In some cases, the guessed number matches the number originally shown in the thinking block.
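For anyone running many trials, the verification in steps 3 and 7 can be scripted. A minimal sketch in Python; the function name and regex are my own illustrative choices, and the trace string is invented, not a real transcript:

```python
import re

def extract_candidates(thinking_text: str, low: int = 1, high: int = 1000) -> list[int]:
    """Return every in-range integer mentioned in a thinking trace, in order.

    The first hit is typically the 'selected' number; later mentions let you
    check whether a failed recall or a subsequent guess matches it.
    """
    hits = [int(m) for m in re.findall(r"\b\d{1,4}\b", thinking_text)]
    return [n for n in hits if low <= n <= high]

# Invented illustrative trace, not a real transcript:
trace = "Something non-obvious... let me go with 347. Yes, 347."
print(extract_candidates(trace))  # [347, 347]
```

Running the same helper over the later guess lets you check it against the first number that appeared in the original trace.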
---
Interpretation
This behavior suggests:
- Extended Thinking content is not preserved as persistent state across turns.
- The model does not retain direct access to its prior internal reasoning once a response is completed.
- Because the incorrect answer appears in the reasoning trace itself, the divergence arises while the reply is being generated; it is not a suppressed memory that the model declines to report.
- “Regeneration” or “guessing” draws from similar probabilistic distributions rather than from memory.
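The last point can be illustrated with a toy simulation: two samplers that share a skewed distribution but no state will still converge on the same salient values. The weights below are invented for illustration, not measurements of the model:

```python
from collections import Counter
import random

def session_guesses(seed: int, n: int = 500) -> Counter:
    """Simulate one stateless 'session' drawing n guesses from a skewed distribution."""
    rng = random.Random(seed)
    # Assumed toy weights: salient values (42, 347) carry extra probability mass.
    values = [42, 347] + list(range(1, 1001))
    weights = [40, 25] + [1] * 1000
    return Counter(rng.choices(values, weights=weights, k=n))

# Independent sessions share no state, yet their top values coincide,
# because they sample from the same skewed distribution.
a, b = session_guesses(seed=0), session_guesses(seed=1)
print(sorted(v for v, _ in a.most_common(2)))
print(sorted(v for v, _ in b.most_common(2)))
```

No memory passes between the two sessions; the agreement comes entirely from the shared distribution, which is the mechanism proposed above for matching “guesses.”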
Certain numbers recur with notable frequency across trials. 42 and 347 appeared repeatedly in independent sessions. This likely reflects token probability bias and cultural salience rather than anything meaningful, but the consistency is itself worth documenting. The model’s ‘random’ selections are not uniformly distributed.
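One way to document that skew is to tally observed guesses against the uniform expectation. A sketch with invented example data (real counts would come from logged sessions):

```python
from collections import Counter

def frequency_report(guesses: list[int], n_values: int = 1000) -> list[tuple[int, int, float]]:
    """For each observed number: (value, count, ratio of count to uniform expectation)."""
    expected = len(guesses) / n_values  # expected count per value under uniformity
    counts = Counter(guesses)
    return [(v, c, c / expected) for v, c in counts.most_common()]

# Invented illustrative data, not real trial results:
sample = [42, 347, 42, 17, 347, 42, 901, 347, 42]
for value, count, lift in frequency_report(sample)[:2]:
    print(value, count, round(lift, 1))
```

A lift far above 1.0 for the same few values across independent sessions is exactly the non-uniformity described above.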
The Number Test demonstrates that the model’s claims about its own internal state are unreliable; it does not reveal hidden consistency.
A system may process information without retaining accessible knowledge of having done so. In stateless language models, processing and reportability are not equivalent.
