New Blog! Evaluate Your Agents!
Published: June 13th, 2025 at 5:00pm Eastern.
It has been another week in the world of AI, and for those in the midst of the rapid change it may have felt more like a month, or even a year.
A couple of notable points (if you were happily oblivious) were:
The paucity of new Apple AI products announced at their WWDC
A paper published by Apple demonstrating a collapse in reasoning as problem complexity increases, even when using the most sophisticated Large Language Models (LLMs) or Large Reasoning Models (LRMs)
The second point caught our attention and also inspired discussion with our clients. The research suggests that current reasoning models are hitting fundamental limitations, which in turn both creates and limits real business opportunities.
We didn’t have time for a big research project here at fluidmind labs, but this intrigued us, particularly the description of performance collapse on the Tower of Hanoi problem.
This is a problem on which we cut our teeth learning recursion in the dark days of 8-bit microcomputing and very limited resources. How could a gargantuan LLM running on vast resources fail beyond some small number of disks?
(Aside: if you’d like to play the game, there are a number of online options out there.)
And indeed, our initial attempts to work with the LLMs did confirm Apple’s findings. This was interesting, but not surprising if you have worked with these models.
Upon reflection, this made sense. But it also felt like we were judging the models on their weaknesses, akin to asking a fish to climb a tree.
It raised the question: instead of asking the LLM to solve the problem using the standard human-designed algorithms, could we reframe the problem to fit the LLM's inherent strengths?
Experienced leaders will tell you that team members tend to solve problems using their strengths, rather than their weaknesses. ML models, it turns out, are no different.
If you ask an LLM about the Tower of Hanoi problem, it will generate code for you or describe the algorithm. This is helpful, but can the model actually solve the problem itself? That is the type of problem solving being examined here.
If you forbid code generation and increase the number of disks beyond some small threshold, the answer is actually “No”.
These problem types represent a weakness for LLMs. The question for us was: could we recraft the algorithm to use the LLM’s strengths to solve the problem (without code)?
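For context, the solution the models happily generate is the standard recursive one. A minimal sketch in Python (our own illustration for this post, not the paper's code):

```python
def hanoi_recursive(n, src="A", aux="B", dst="C", moves=None):
    """Classic textbook recursion: park n-1 disks on the auxiliary peg,
    move the largest disk, then rebuild the n-1 disks on top of it."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi_recursive(n - 1, src, dst, aux, moves)  # clear the way
        moves.append((src, dst))                      # move disk n
        hanoi_recursive(n - 1, aux, src, dst, moves)  # rebuild on top
    return moves

moves = hanoi_recursive(5)
print(len(moves))  # 2**5 - 1 = 31 moves
```

Elegant on a machine with a call stack; the point of the Apple result is that executing this step by step, in text, is exactly what the models cannot sustain.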
What ensued was a fascinating collaboration between us and a panel of LLMs: primarily Claude, with a notable assist from Gemini as we executed blind tests.
There were four key areas where we made some advances:
Work with the LLM on shifting from recursive logic (the accepted textbook solution) to a solution built on iterative patterns. This approach gives up the 'mental stack' of recursion, a pattern that fits well into existing programming paradigms but not into LLMs.
Identify cycles or patterns that the LLM can follow. This removes the need for deep memory or forward planning. The two strategies for odd- and even-numbered moves reduce complexity and allow the LLM to cleanly execute the solution.
Create a simple method for maintaining state. While traditional programming languages use variables and data structures like stacks internally, an LLM works best when the entire state is re-presented in each prompt. Our workaround was a compact visual representation of the pegs, which we passed back to the model with every turn, allowing it to "see" the board without having to memorize it.
Guide the LLM into a 'mechanical execution' mode. LLMs are naturally generative, so asking them to rigidly follow an algorithm cuts against their default behavior. However, for this class of problem, it is critical. We instructed the model to abandon creative problem-solving and instead to meticulously 'trust the algorithm' at each and every step.
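The actual prompt is shared as an image at the end of the post. To make the pattern-based strategy above concrete, here is a sketch of the same idea expressed as ordinary Python (our illustration, not the prompt itself): the smallest disk cycles between pegs in a fixed direction that depends on the parity of the disk count, and every even-numbered step has exactly one legal move that does not touch the smallest disk. No recursion, no lookahead; just a cycle and the current board state.

```python
def hanoi_iterative(n):
    """Solve Tower of Hanoi with the odd/even move cycle, no recursion.
    Pegs: A (source), B (auxiliary), C (target); lists hold disks bottom-up."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    # The smallest disk cycles in a fixed direction set by disk-count parity.
    cycle = ["A", "C", "B"] if n % 2 == 1 else ["A", "B", "C"]
    moves = []
    for step in range(1, 2 ** n):
        if step % 2 == 1:
            # Odd steps: move disk 1 one peg along the cycle.
            src = next(p for p in pegs if pegs[p] and pegs[p][-1] == 1)
            dst = cycle[(cycle.index(src) + 1) % 3]
        else:
            # Even steps: the single legal move not involving disk 1.
            a, b = [p for p in pegs if not pegs[p] or pegs[p][-1] != 1]
            if not pegs[a]:
                src, dst = b, a
            elif not pegs[b] or pegs[a][-1] < pegs[b][-1]:
                src, dst = a, b
            else:
                src, dst = b, a
        pegs[dst].append(pegs[src].pop())
        moves.append((src, dst))
    return moves, pegs

moves, pegs = hanoi_iterative(5)
print(len(moves), pegs["C"])  # 31 [5, 4, 3, 2, 1]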
The result? Success!
We condensed the result into a single prompt, included here as an image at the end of the post. If you’d like to give it a try, just cut and paste into your favorite LLM (it has only been tested in Claude and Gemini) and ask the LLM to use this approach to solve the Hanoi problem for the number of disks of your choice.
If you’d like to see it outperform the Apple paper limits, try >6. We tested up to 10 disks, although further exploration is needed to test the upper bounds. We also successfully validated the approach across multiple AI systems.
Beware that the LLM will legitimately kick back at higher numbers if it thinks you are just burning tokens for token's sake. In retrospect, the LLMs will acknowledge that this approach is more efficient in terms of token usage than recursive approaches. However, to get past these sticking points you may need to try persuading it that you are pursuing this research for science, or start lower and proceed in smaller increments… your LLM response may vary slightly.)
Note: Crucially, this result doesn't disprove the Apple paper—it makes its findings even more interesting. The paper noted that even when force feeding LLMs a standard recursive algorithm, they still failed at the same complexity points.
Our success suggests there are two paths forward in researching interesting problems:
Develop a stronger intuition about how these models work. There is a large corpus of problem solutions available to LLMs that use traditional, formal, methods. Ironically, these are sometimes not the ones best suited for execution on or by an LLM.
Work collaboratively with the LLM to craft new algorithms. There is a new language if you’d like) to learn in terms of problem description and solution that can be generalized to solve whole classes of problems. Another example in this class where patterns could replace complex planning would be checker jumping.
A model's failure on a reasoning task may say less about its inherent capabilities and more about our failure to translate the problem into a structure compatible with its pattern-based way of thinking.
The exercise was fun and played off the team’s collective intuition on coding very tight algorithms “close to the metal” as well as working broadly with LLMs on hard problems. It was a true definition and demonstration of full-stack ML development.
The generalization though, was far more interesting. Problem solving with the LLM as a collaborative partner to design new algorithms for a class of problems where they have struggled was incredibly rewarding.
We’ll dig into this more in future, but for timeliness we thought we would share. Please feel free to contact us to discuss further, or to explore some of the algorithms you may be wrestling with!
Copy and paste the prompt below into your favorite LLM.
Ask: “Using this approach, can you solve the Tower of Hanoi puzzle for 5 disks?” (Or insert your number).