
Zero Cost AI and Slow Thinking Systems

May 27, 2026

Recently I've been tinkering with small language models. These models run on my MacBook, use around 7-30GB of Unified Memory, and can do a lot of things. They are good at tool calling and can accomplish trivial tasks like calling the right API, as long as the task is well defined. (My reference models in this case are Gemma4 e4B and Gemma4 27B.)

I actually call them Zero Cost Intelligence (well, not exactly free, but the cost is almost negligible compared to the API cost of LLMs). Intelligence might be an overrated term, though, so I prefer to call them Language and Reasoning Simulators. And that made me wonder: is it possible to build a system that uses a group of these models (small models that run on consumer hardware and don't need expensive GPUs) and can handle some tasks (a subset of enterprise-level tasks)?

Plus, not all tasks need realtime output. For example, when you are sleeping and these LLM agents are working in the background, the output token speed does not need to be realtime. So if you have workflows or tasks that don't need immediate output, why not run them on a system of these small models?

System of Small Models: as we know, small LLMs are not very smart, so to improve the reliability of tasks, we need to build a system that steers the model in the right direction and eventually to the right output.

An example system could look like:

  • A Large Reasoning Model creates a trajectory plan: very detailed steps for the other, less capable model to follow.
  • A small language model then follows that trajectory.
  • Another agent, also using a small model, then verifies the result at the end.
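
To make the three roles above concrete, here is a minimal Python sketch of how they might be wired together. Everything in it is an assumption for illustration: the names run_pipeline, Model and Result are hypothetical, the prompts are placeholders, and the "models" are plain functions standing in for whatever local inference setup you use. It is not a finished design, just the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical abstraction: a "model" is any function from prompt text to
# response text. Wiring this to a real local model is assumed, not shown.
Model = Callable[[str], str]


@dataclass
class Result:
    steps: List[str]    # the trajectory produced by the planner
    outputs: List[str]  # one worker output per step
    verified: bool      # verdict from the verifier agent


def run_pipeline(task: str, planner: Model, worker: Model, verifier: Model) -> Result:
    # 1. The large reasoning model writes a detailed, step-by-step trajectory.
    plan = planner(f"Write detailed numbered steps for this task: {task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. The small model follows the trajectory one step at a time.
    outputs = [worker(step) for step in steps]

    # 3. Another small-model agent checks the combined result at the end.
    #    Assumed convention: the verifier answers with PASS or FAIL.
    report = f"Task: {task}\nOutputs:\n" + "\n".join(outputs)
    verdict = verifier(report)
    return Result(steps, outputs, verdict.strip().upper().startswith("PASS"))


# Stand-in models so the sketch runs without any real LLM.
def fake_planner(prompt: str) -> str:
    return "1. fetch the report\n2. summarize the report"


def fake_worker(step: str) -> str:
    return f"done: {step}"


def fake_verifier(report: str) -> str:
    return "PASS"


result = run_pipeline("summarize the weekly report", fake_planner, fake_worker, fake_verifier)
print(result.verified, result.outputs)
```

Keeping each role behind the same prompt-in, text-out interface means any of the three can be swapped for a bigger or smaller model without touching the rest of the flow.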