Blue Prince#
Blue Prince is a wonderful puzzle game that released not too long ago. In it, you draft rooms to explore a mansion, with its layout reforming each day. The rooms contain all kinds of layered puzzles, some more obvious than others.
One room I particularly enjoy is the "Parlor," a room containing an explicit puzzle reminiscent of the Knights and Knaves logic puzzle. We are given the following rules:
- There will always be at least one box which displays only true statements.
- There will always be at least one box which displays only false statements.
- Only one box has a prize within. The other 2 are always empty.
We can infer a few other properties. For example, the puzzle is expected to be solvable (we must be able to logically deduce a location). The truth of each box is not necessarily correlated with the gem prize within. Boxes may have multiple statements, and up to one box may have no statement at all.
Each time you draft the room, you get a new set of statements to solve. Opening the correct box rewards you with a few gems, which you can then use to assist you in your drafting that day.
Let's work through one example problem to get a sense for it:
The puzzles often look more intimidating than they actually are, once you start eliminating possibilities.
First, we do some quick word counting: B1 (the first blue statement) - 11 words, B2 - 10 words, W1 - 7 words, W2 - 3 words, K1 - 4 words. White is disagreeing with itself, so we can eliminate it as an all-true candidate. As both of its statements have an odd number of words, and at least one of them is false, B1 must also be false (and thus blue cannot be all-true).
Due to rule #1, that means the remaining box, black, must be our all-true box. We can open the black box and enjoy our prize.
For reference, this explanation took ~140 tokens, or 95 words. We don't actually have to think through the rest of the possibilities and assign truth values to the other statements/boxes, although that may be helpful as a means of double-checking our logic.
Some puzzles have multiple valid truth configurations, where each one points to the same box holding the prize.
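To make these rules concrete, here is a minimal brute-force sketch in Python. The three example statements are hypothetical stand-ins rather than a puzzle from the game; the point is that we enumerate every truth assignment and prize location, keep only the worlds consistent with the rules, and check whether all surviving worlds agree on which box holds the prize:

```python
from itertools import product

BOXES = ["blue", "white", "black"]

# Each statement is a predicate over (truth, prize): `truth` maps statement ids
# to their assigned truth values, `prize` names the box holding the gems.
# These three statements are hypothetical examples, not taken from the game.
STATEMENTS = {
    ("blue", 0):  lambda truth, prize: prize == "blue",    # "The gems are in this box."
    ("white", 0): lambda truth, prize: prize != "white",   # "This box is empty."
    ("black", 0): lambda truth, prize: prize == "blue",    # "The gems are in the blue box."
}

def consistent_worlds():
    ids = list(STATEMENTS)
    worlds = []
    for prize in BOXES:
        for values in product([True, False], repeat=len(ids)):
            truth = dict(zip(ids, values))
            # Self-consistency: a statement is marked true iff its content
            # actually holds in this candidate world.
            if any(truth[i] != STATEMENTS[i](truth, prize) for i in ids):
                continue
            per_box = {b: [truth[i] for i in ids if i[0] == b] for b in BOXES}
            # Rule 1: at least one box displays only true statements.
            if not any(vals and all(vals) for vals in per_box.values()):
                continue
            # Rule 2: at least one box displays only false statements.
            # (Statement-less boxes are skipped; how to count them is a modeling choice.)
            if not any(vals and not any(vals) for vals in per_box.values()):
                continue
            worlds.append((truth, prize))
    return worlds

possible_prizes = {prize for _, prize in consistent_worlds()}
# A well-posed parlor puzzle leaves exactly one possible prize location,
# even when several truth configurations survive.
print(possible_prizes)  # here: {'black'}
```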
Language Model Testing#
I've been experimenting with language models lately, with mixed feelings. I did not use them while playing the game (that'd be cheating!), but once the idea struck me, I started writing down each Parlor Puzzle I encountered and testing a number of "reasoning" language models on the various configurations to see if they could figure them out as well. I ended up liking this test too much to leave it at a few scattered manual trials on a small subset of the questions, so I found a set of (likely) all the parlor puzzles in the game and hacked together something to test models on it properly.
It seemed like an interesting way of evaluating their abilities, ideally in a way that might dodge some of the typical "benchmaxxing." I doubt these puzzles have made it into the training data of any of these models.
Max response length was set to 24k tokens. Response lengths are slight approximations due to the tools used. I wrote a standardized prompt that provides enough information to solve all the puzzle configurations.
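For the curious, the harness amounted to something like the sketch below (assuming a local OpenAI-compatible endpoint such as a llama.cpp server; the URL, sampler settings, and prompt wording here are placeholders rather than the exact ones I used):

```python
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # placeholder local endpoint

# Paraphrase of the puzzle rules; the actual standardized prompt was longer.
SYSTEM_PROMPT = (
    "You are solving a 'Parlor' logic puzzle with three boxes: blue, white, and black. "
    "At least one box displays only true statements, at least one displays only false "
    "statements, and exactly one box contains the prize. Deduce the prize box and end "
    "your reply with 'ANSWER: <blue|white|black>'."
)

def ask_model(puzzle_text: str, model: str) -> str:
    """Send one puzzle to a local OpenAI-compatible server and return the reply."""
    resp = requests.post(API_URL, json={
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": puzzle_text},
        ],
        "max_tokens": 24000,  # the 24k response cap used for these runs
        "temperature": 0.6,   # placeholder; each model's recommended settings apply
    }, timeout=3600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```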
The results, in short:
| Model | Parameters | Quantization | Parlors Puzzled | Score (% correct) | Avg/Median Response length |
|---|---|---|---|---|---|
| AI21 Jamba Reasoning | 3B | F16 | 113*4 | 41.4% | 3211/3221 tokens |
| Aquif-3.5-Max-42B-A3B | 42B | Q8_0 | 113*3 | 87.9% | 5752/3734 tokens |
| Apriel 1.5 Thinker | 15B | Q8_0 | 113*3 | 94.4% | 5411/3592 tokens |
| Cydonia R1 (v4f) | 24B | Q5_K_XL | 113*3 | 79.9% | 6193/3773 tokens |
| ERNIE-4.5-21B-A3B-Thinking | 21B | Q5_K_XL | 113*3 | 68.1% | 6403/6396 tokens |
| ERNIE-4.5-21B-A3B-Thinking | 21B | Q8_0 | 113*3 | 63.1% | 6161/6223 tokens |
| EXAONE-4.0 | 1.2B | F16 | 113*3 | 48.1% | 6546/6381 tokens |
| EXAONE-4.0 | 32B | Q8_0 | 113*3 | 93.5% | 6133/4836 tokens |
| EXAONE-4.0.1 | 32B | Q8_0 | 113*3 | 92.6% | 6120/4500 tokens |
| Gemini-2.5-Pro** (batch size 10) | ??? | ??? | 113*1 | 92.9% | ??? |
| Gemini-3-Pro** (batch size 10) | ??? | ??? | 113*1 | 93.8% | ??? |
| GLM-4.5-Air | 106B | Q4_K_XL | 113*3 | 85.6% | 5301/3681 tokens |
| GLM-4.6 | 355B | Q3_K_XL | 113*1 | 88.5% | 4321/2344 tokens |
| GPT-5-Thinking-Mini | ??? | ??? | 113*1 | 95.6% | 21.2/17 seconds |
| GPT-OSS-120B (Reasoning: high) | 120B | MXFP4 | 113*5 | 97.0% | 4001/2836 tokens |
| Hermes-4-70B | 70B | Q4_K_XL | 113*3 | 89.1% | 6085/4923 tokens |
| Llama-3.3-Nemotron-Super-49B-v1.5 | 49B | Q8_0 | 113*1 | 87.6% | 5555/3908 tokens |
| Magistral-Small-2509 | 24B | Q5_K_XL | 113*3 | 89.7% | 4045/3333 tokens |
| Mistral-Small-3.2-24B-Instruct-2506 (non-thinking) | 24B | Q5_K_XL | 113*4 | 74.1% | 1891/1749 tokens |
| Qwen3-1.7B (thinking) | 1.7B | F16 | 113*3 | 60.5% | 8203/8255 tokens |
| Qwen3-4B-Thinking-2507 | 4B | F16 | 113*3 | 91.7% | 7481/6356 tokens |
| Qwen3-14B (thinking) | 14B | Q8 | 113*3 | 88.2% | 5661/3956 tokens |
| Qwen3-30B-A3B-Thinking-2507 | 30B | Q4_K_XL | 113*4 | 90.3% | 7092/5819 tokens |
| Qwen3-30B-A3B-Thinking-2507 | 30B | Q8_0 | 113*3 | 90.6% | 7023/5797 tokens |
| Ring-Flash-2.0 | 100B | Q4_K_M | 113*3 | 92.0% | 6604/5013 tokens |
| Seed-OSS-36B | 36B | Q4_K_XL | 113*2 | 96.0% | 5530/4610 tokens |
| Seed-OSS-36B | 36B | Q8_0 | 113*2 | 96.0% | 5596/4272 tokens |
| VibeThinker-1.5B | 1.5B | F16 | 113*3 | 68.1% | 10655/10593 tokens |
Reported percentages are the exact results of these runs, but you can assume each model's "true" score lies within roughly ±3% of them. Aside from Mistral Small, all models here were tested using their thinking modes.
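That ±3% figure lines up with a simple binomial error estimate for runs of this size; a quick back-of-the-envelope check (assuming a typical 3x113 run at ~90% accuracy):

```python
import math

n = 113 * 3   # puzzles in a typical run
p = 0.90      # observed accuracy
se = math.sqrt(p * (1 - p) / n)
print(f"95% interval: +/-{1.96 * se:.1%}")  # ~ +/-3.2%
```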
A model whose mean response length is significantly higher than its median is more likely to have had some responses where it got stuck in a generation/reasoning loop.
Taking the top quant tested from each model, we might also try throwing them in some scatter plots:
This shows, approximately, performance relative to model size. I take the geometric mean of active and total parameters as an attempt to better place MoE models (although I have some concerns about how accurate that placement is). Dense models are placed as you'd expect.
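For reference, the placement works out to something like the following (a minimal sketch; active-parameter counts are read off the model names and are approximate):

```python
import math

def effective_size(total_b: float, active_b: float | None = None) -> float:
    """Geometric mean of total and active parameter counts, in billions.

    Dense models pass only total_b and land at their true size.
    """
    return math.sqrt(total_b * (active_b if active_b is not None else total_b))

print(effective_size(24))      # Magistral-Small (dense)      -> 24.0
print(effective_size(30, 3))   # Qwen3-30B-A3B (MoE)          -> ~9.5
print(effective_size(42, 3))   # Aquif-3.5-Max-42B-A3B (MoE)  -> ~11.2
```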
Setting aside model size, this contrasts performance with verbosity. Ideally, we'd want a model in the top left: high accuracy, few tokens to get there. GPT-OSS-120B most clearly occupies that spot.
We could also try relating the scores on this benchmark to that on some more mainstream benchmarks. For this chart, I'm referencing Artificial Analysis scores:
I don't love that site much, in particular for how strongly its scoring favors reasoning models. But since we're focused on reasoning models here anyway, it seems fair enough as a popular reference point. My general interpretation: models toward the top left do better on these parlor puzzles than their standard-benchmark scores would suggest, while models toward the bottom right do worse.
The two metrics are reasonably correlated, which is nice to see.
Artificial Analysis scores presumably represent the full precision of each model. Parlor accuracy scores, and plotted points, are run with the quants described. For instance, we could reasonably expect the GLM-4.6 at Q3 to technically have a slightly lower intelligence index score. Many of the models were tested at Q4+, many at Q8, so I think it should be close enough overall. I used the "medium" reasoning effort AAII score (61) to place GPT-5-Thinking-Mini. They do not seem to have a score for low, and high is marginally higher at 64. I did not specify a reasoning effort when doing this, if they even listen to the user prompt for that.
Model Commentary#
Following are some thoughts on a few of these models.
Jamba 3B
Claims of this model outperforming Qwen3-4B seem to be quite overstated, at least by this test. It barely does better than random guessing, and was much worse than even the 1.7B Qwen on this task. Its responses were well-formatted English, so I doubt sampler settings could account for the discrepancy. I also have yet to see evidence of temperature being a crucial factor.
One response helpfully explained: "Assuming the Blue box is telling the truth, then the Black box contains gems. [...] Thus the prize is in the Blue box." In one of the easiest puzzles, we have an even more local contradiction: "White box's false claim “THIS BOX IS EMPTY” confirms it is not the prize container." I'd also like to shout out their hilarious chart showcasing a 6% score on HLE for a 3B model. How insightful. (I'd expect most points earned in that range come from LLM judge errors.)
Apriel-1.5
It did well overall, especially for its size. I still have major doubts about their big advertising point:
Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash etc.
At least in terms of how much that generalizes to real usage. There's only so much you can fit in a ~15 GB model.
The model's "[BEGIN FINAL RESPONSE]" formatting is annoying.
Cydonia R1
To be clear, this model seems to be a fine-tune of a non-reasoning model, adding reasoning with a focus on improving role-play performance. I included it in this test because I was curious whether that added thinking would still improve performance on other subjects. It showed some improvement over Mistral Small, but was not on the level of Magistral.
EXAONE
The 1.2B version was a bit worse than Qwen3-1.7B, but it's also a bit smaller than the Qwen model.
I'm somewhat curious what really changed between 4.0 and 4.0.1 (their statement is that it's "a patch version to reduce unintended or inappropriate responses," whatever that means). While the score turned out slightly lower with 4.0.1, it's only two additional errors. I can't claim any detectable difference/dumbing down yet.
Gemini-2.5-Pro & Gemini 3**
Take these results with a grain of salt! I tried sending these as sets of 10 puzzles at once, due to their tight usage restrictions and my lack of interest in paying subscriptions/per prompt. This has both positive and negative potential impacts on the results.
There is more to compute in one go, so token/reasoning limits could cause hard stops. The grouping of the questions (difficulty-wise) would also affect this. On the other hand, the model may be able to apply information/logic from one solution to another, or pick up additional context clues.
The model's true chain of thought is hidden, so we can't learn much about how it processes these. (For what it's worth, Gemini itself claimed it could probably handle 20-50 of these at once with 100% accuracy before its performance starts dropping. However, sets of 20 caused problems beyond the model's generation, so I reduced it to 10 per group. Obviously models don't actually understand their capabilities, but I technically have its approval, and even went easy on it :) )
Gemini 3 (Thinking) only scored marginally better (93.8% over 92.9%), and was still worse than GPT-OSS-120B with the same batch size (94.7%). As these are only based on a single run of the 113 puzzles, we can't reach overly strong conclusions, but it's a relatively disappointing score given the presumed size of these models.
GLM
GLM 4.6 has been by far the best-scoring model on my in-progress fact-based testing, and overall seems to give great results. Its answers were formatted more consistently than any other model. However, it and the 4.5-Air version scored quite horribly on this test, considering their size!
Perhaps this model's training neglected some logical-reasoning focus that its competitors emphasized. Some blame may of course go to the low quant on 4.6, though I can't imagine that explains everything (even Qwen3-4B at Q3_K_XL did alright compared to its F16).
GPT-5-Thinking-Mini
This is the "thinking" model you get some access to without spending money. Of course, it has some advantages such as speed (vs most hardware) and decent accuracy, but was hardly PhD-level.
This model's chain-of-thought is hidden, so I can't truly verify if it "cheated" by using python/websearch/other tools to solve some problems. This is also why I can only offer average thinking time as a response length metric.
GPT-OSS
The reasoning effort setting gives us a whole bunch of configurations, so I've copied over its best result and included the rest here:
| Model | Effort | Parameters | Parlors Puzzled | Score (% correct) | Avg/Median Response length |
|---|---|---|---|---|---|
| GPT-OSS-20B | Low | 20B | 113*3 | 73.7% | 1242/1044 tokens |
| GPT-OSS-20B | Medium | 20B | 113*3 | 91.2% | 3828/2259 tokens |
| GPT-OSS-20B | High | 20B | 113*3 | 90.9% | 7836/5196 tokens |
| GPT-OSS-120B | Low | 120B | 113*5 | 84.2% | 1042/891 tokens |
| GPT-OSS-120B | Medium | 120B | 113*5 | 92.9% | 2168/1599 tokens |
| GPT-OSS-120B | High | 120B | 113*5 | 97.0% | 4001/2836 tokens |
| GPT-OSS-120B** (batch size 10) | High | 120B | 113*1 | 94.7% | 3721 avg tokens |
GPT-OSS-20B actually performed worse (by one question) on "high" than on "medium." This could largely be attributed to the token limit: "medium" timed out ~3 times, while "high" timed out ~10 times. As the limit was a rather generous 24000 tokens and these puzzles are not fundamentally that complicated, I'm not sure how much adding more tokens would help. The 120B model's high setting performed much better (97%), and is the current best.
It's also a notably better score than GPT-5-Thinking-Mini, which is limited to ~10 messages every few hours without paying. The OSS models lack certain features (image recognition) and would need setup for others (tool integration), but are surprisingly competitive for this sort of task. Just make sure it's set to high, and make sure it's actually listening to that setting.
**In connection with the Gemini-2.5-Pro batched test, I ran a set on GPT-OSS-120B, to get a sense for just how cruel asking a model to solve ten of these in one go is. In this case, I loaded the model with max context and removed any response length limit. Reasoning: high, of course. I was not expecting it to scale this well, and it even managed to beat Gemini-2.5-Pro with this setup. (Of course, with the limited sample size, I can't definitively claim GPT-OSS is truly better on this task - but it's at least competitive. Gemini-2.5-Pro certainly has advantages and better performance on a number of other subjects.)
I have some other reasons to dislike these models for many uses, but I am quite pleased with how they did here.
Mistral Small
This model is quite good! It's one of my preferred options in the ~20-30B size range.
I include it here, and excuse its low score, just to illustrate the difference reasoning makes on a problem like this. Naturally, its average response length/tokens used was much lower. On my short-form factual tests, I did not find reasoning models to have a significant advantage (that is based on separate, less formal runs; the linked data covers only non-reasoning models). In other words, I would argue a (perhaps rather obvious) interpretation: reasoning models perform better at reasoning tasks, but are not necessarily superior in general.
Seed OSS 36B
Very promising! Its 96% is the second-highest score, also beating the online GPT-5-Thinking-Mini. It only timed out twice (across its 2x113 runs), both times on the same puzzle:
- Blue: "The statement matching this statement is true. The statement matching this statement is on a completely true box."
- White: "The statement matching this statement is true. The statement matching this statement is on a box containing gems."
- Black: "The statement matching this statement is on a box containing gems. The statement matching this statement is on a completely true box."
This one, if you ask me, can actually be simplified and solved surprisingly quickly. I'm not too surprised an LLM would get tripped up, though; the reasoning traces largely seemed to struggle to understand what a "matching" statement is.
Amusingly, the Q4 timed out only once on a separate problem, and got the above problem right on one of its two attempts. Total score was unchanged between quants.
Its score on my non-thinking/fact-based benchmark was about typical for its size.
Qwen3-1.7B
I recommend using at least slightly larger models than this.
Qwen3-4B
This model was small and promising enough to test a range of its quantizations, which may be easier to review visually:
Interestingly, we see similar results to my factual LLM testing, with almost no noticeable change in performance until the 3-bit quants. Note that the IQ2_XXS model does worse than random guessing, mostly because it usually failed to produce any answer at all. It tended to get stuck somewhere in the reasoning process, and loop with text like "But wait" until running out of tokens.
However, I would advise some caution for general use: there's only so much knowledge that can fit in 2 GB, and ultimately a lot of implicit/explicit knowledge is required to answer many tasks well. Being good at evaluating true/false statements is an excellent skill for a model to have, but it is not the whole story. For example:
- Suppose you want a summary of a video about D-Day. A more knowledgeable model is more likely to be able to catch minor errors in the video's timeline, and may have better follow-up reading suggestions.
- Suppose you wanted to discuss a movie you just saw. A bigger model has better odds of knowing the characters and plot details.
Supporting a model with RAG/web search/similar tech can also only do so much. Models cannot actually reflect on their own knowledge, and may not reliably use these tools when they truly need to.
In my more fact-based testing, the instruct version of this model scored around 45%. (Thinking, in general, showed no major improvement on those tests.) Certainly not a bad score for its size, but a number of 7-14B models did noticeably better, and are not that much harder to run.
I also tested this "Gemini 2.5 Pro distill" of Qwen3-4B. It did quite poorly - 38.9% after 3 sets of all 113 questions. Notably, it seems to have been trained on the thought summaries of 2.5-Pro, which is... not a good idea. It made the thinking shorter (about 1.3k tokens/response) at the cost of all accuracy. Bad implementations aside, I'm quite doubtful of distills in general: LLM output tends to be rather low quality (making it poor training data), and if there were some easy extra training or modification you could do to make a model smarter, its creators probably would have figured that out and done it already.
Conclusion and notes#
This class of puzzle, overall, seemed not too hard for most reasoning models. Random guessing should score around 33%, and even the 4-billion-parameter Qwen thinking model did reasonably well, scoring around 90-92%. The best models made about half as many errors, with scores around 95-97%, but there is still much room for improvement: even those results are lacking if viewed from a consistency angle.
I intend to keep this updated as new reasoning models release. It at least seems useful for detecting some under-performing small reasoning models making big claims about their genius.
The language models' thinking traces were never particularly elegant or efficient. Brute-force tactics were common, and boxes already established as true or false were regularly revisited. Of course, you could argue that how intelligent the reasoning trace appears is irrelevant: if the final answer is correct, that's all that really matters (and is all I graded on). Still, seeing the circles these models sometimes spun in, I would hope future developments improve reasoning token efficiency.
I believe a reasonably intelligent human, giving each problem careful consideration, can score about 100% on the Parlor Puzzles. So, if you have time to think it through, I recommend doing so and not asking ChatGPT to play the game for you.
Some statements gave models more trouble than others, on average. Nothing too surprising given standard LLM limitations: self-referential, "subjective," and counting statements were the typical culprits. Examples:
- This puzzle is harder than it seems
- This statement is of no help at all
- Every statement with the word 'blue' is false
- All statements that contain more than one 'B' are true
I have yet to see notable differences in performance between Q4 and higher models.
I slightly adjusted one parlor puzzle which, under a certain set of assumptions, would be ambiguous. All responses were scored with basic automatic detection followed by manual review to ensure all scores are correct.
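The "automatic detection" was nothing fancier than pulling an answer marker, or failing that the last box color, out of each response, roughly like this (an approximate sketch, not the exact patterns used):

```python
import re

def detect_answer(response: str) -> str | None:
    """Return the detected box color, or None to flag the response for manual review."""
    # Prefer an explicit "ANSWER: <color>" marker if the model followed instructions.
    marked = re.findall(r"answer[\s:*]*(blue|white|black)\b", response, re.IGNORECASE)
    if marked:
        return marked[-1].lower()
    # Otherwise fall back to the last box color mentioned anywhere in the reply.
    mentions = re.findall(r"\b(blue|white|black)\b", response, re.IGNORECASE)
    return mentions[-1].lower() if mentions else None
```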
Ultimately, you should consider playing Blue Prince. It's a beautiful game.