Prompt Control – The mind behind the machine — Part 3

To stop your local AI from talking… too much, you don't need a better model. You need to find a better way to talk to your favorite one.

I am a huge fan of local AI: it means pushing my crappy, GPU-poor laptop to its limits to see how far I can use a Large Language Model without relying on an internet connection or cloud APIs. I decided to test once again how good a Small Language Model can be at supporting my Python programming: a sort of… AI that helps me program AI applications.

In this article I will report a few points to consider if you decide to do the same: what model to use, what resources you need (VRAM/RAM), and drawbacks such as the context window size and the system prompt requirements.

This is the app I was trying to build:

Honestly, the app itself is not that important, but the process of getting it to completion is interesting. If you want to have a look at the code, here is the final GitHub repository:

So, after the last two weeks, we know that a system prompt is a rulebook. Today we will try to understand how we can use it as a real operating system for our LLM applications.

To recap: if you are already bored to death… here are the main issues I found; I will explain in the article how to solve them. Buckle up and follow me.

My setup

I always serve local LLMs with the llama.cpp server. It is easy to use, comes with pre-compiled binaries for different operating systems, and is fully compliant with the OpenAI standard API endpoints. The model I am using is, in my opinion, the best open-source, non-reasoning Small Language Model at the moment: Qwen3-4B-Instruct-2507.
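Because the server speaks the OpenAI API, any OpenAI-compatible Python client can talk to it. Here is a minimal sketch, assuming the server from the steps below is already running on its default address; the system prompt text and the placeholder API key are purely illustrative:

```python
from openai import OpenAI

# llama-server exposes OpenAI-compatible endpoints under /v1.
# It does not validate the API key, so a placeholder is enough.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="Qwen3-4B-Instruct-2507",  # llama-server serves the one model it loaded
    messages=[
        # The system prompt is the "rulebook" that governs every reply.
        {"role": "system",
         "content": "You are a concise Python coding assistant. "
                    "Reply with code first and at most two sentences of explanation."},
        {"role": "user", "content": "Write a function that flattens a nested list."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
```

The only thing that changes compared to a cloud endpoint is the base_url. Here is how to get that server running.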
Running this model without a dedicated GPU (or without Intel accelerators and NPUs) is possible, but you have to expect slow speeds… The accuracy, though, is outstanding! If you want to do the same:

1️⃣ Download the llama.cpp pre-compiled binaries into a project directory.
2️⃣ Then download the GGUF weights for the model, Qwen3-4B-Instruct-2507.i1-Q4_K_S.gguf, from here. Put the file in the same directory.

3️⃣ From the terminal, run the server start command (a sketch is shown after these steps). This will start the model and create OpenAI-compatible API endpoints, with a context window of 8k tokens and 0 layers offloaded to the GPU.

4️⃣ Now you can open your browser at the server address (llama-server listens on http://127.0.0.1:8080 by default) and use the built-in web UI.
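The exact command for step 3️⃣ is not shown above, but based on the description (8k-token context, 0 layers on the GPU) a minimal sketch with llama.cpp's llama-server would look like this. The binary path is an assumption about your setup; -m, -c, and -ngl are standard llama-server flags:

```bash
# Sketch of step 3: serve the GGUF weights with an 8k-token context window
# and no layers offloaded to the GPU (CPU-only inference).
./llama-server \
  -m Qwen3-4B-Instruct-2507.i1-Q4_K_S.gguf \
  -c 8192 \
  -ngl 0
```

If you later get a GPU (or a machine with more VRAM), raising -ngl is the single knob that offloads model layers to it and speeds up generation.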