Prompt Control – The mind behind the machine — Part 4

Structured Output generation is the untold technique to tame your language model and get what you want. And here is how…

In the past weeks we discovered together the power of system prompts. What is the real difference between huge LLMs and Small Language Models? I believe every one of us can answer this question. ChatGPT, Claude, and Qwen are certainly powerful AIs, running 50-billion-plus parameters under the hood. But when you force the output structure of a Small Language Model, even a mere 3-billion-parameter model can achieve performance comparable to the giants mentioned above. And it can run on your own computer. You don’t need to pay a subscription, or pray that your free API quota (like an openrouter.ai one) isn’t depleted. I am putting together a special package for the upcoming Black Friday. You can join the wait-list to be the first to know about it… why?
In the meantime… today I will show you how to do it yourself, with two practical (and useful) examples. All you need is:
To recap: Taming the Beast — Controlling LLM Output

The reality is that I knew little about all of this before I started asking questions in forums, replying to Medium articles, and running endless tests on my local machine. Sure, prompting helps. But if you want predictable, programmable responses — not just poetic guesses — you need more than vibes and trial-and-error. You need output control. And there are actually three powerful ways to force an LLM to generate exactly what you want, when you want it:

1. Grammar-Based Sampling (GGUF-style)

This one’s for the hardcore tinkerers. With tools like llama.cpp and its GBNF grammars (the Backus-Naur style format used across the GGUF ecosystem), you write a formal grammar and the sampler simply refuses to pick any token that would break it. Example: want only a clean “yes” or “no” out of the model? Define a grammar that allows nothing else, and nothing else can come out.
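Here is a minimal sketch of what that looks like with the llama-cpp-python bindings, assuming you have a GGUF model on disk; the model path, prompt, and grammar are mine, purely for illustration, not from the original article.

```python
# Sketch: constrain a local GGUF model to answer only "yes" or "no"
# via a GBNF grammar (pip install llama-cpp-python).
from llama_cpp import Llama, LlamaGrammar

# The whole grammar: the root rule may only expand to "yes" or "no".
YES_NO_GBNF = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="./models/my-small-model.gguf", verbose=False)  # placeholder path
grammar = LlamaGrammar.from_string(YES_NO_GBNF)

out = llm(
    "Is Rome the capital of Italy? Answer yes or no.",
    grammar=grammar,   # every sampled token must keep the output valid under the grammar
    max_tokens=4,
)
print(out["choices"][0]["text"])  # "yes" or "no", never anything else
```

The point is that the constraint lives in the sampler, not in the prompt: even a tiny model cannot wander off into prose because the off-grammar tokens are never allowed.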
2. Logit Bias

Here, you manually boost or suppress specific tokens. For instance, if you don’t want the model saying “maybe,” you can penalize those tokens at the logits level, before sampling. It’s like tuning a radio — you reduce noise by turning down unwanted frequencies.
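A hedged sketch of the same idea through the OpenAI Python SDK, which exposes a logit_bias parameter on chat completions (values from -100 to 100 per token id). The model name and the word being suppressed are illustrative, and I assume a tiktoken version that knows the model’s encoding.

```python
# Sketch: push the token logits for "maybe" down to -100 so the model
# effectively cannot emit that word.
import tiktoken
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
model = "gpt-4o-mini"      # illustrative model name

enc = tiktoken.encoding_for_model(model)
banned = {}
for variant in ("maybe", " maybe", "Maybe", " Maybe"):
    for token_id in enc.encode(variant):
        banned[str(token_id)] = -100   # the API expects token ids as string keys

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Will it rain tomorrow in Milan?"}],
    logit_bias=banned,      # applied to the logits before sampling
)
print(resp.choices[0].message.content)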
3. Structured Output (my preferred option)

This is structured output generation, an elegant middle ground. Instead of fighting tokens or writing Backus-Naur Form grammars (honestly more difficult than ancient Greek grammar…), you simply tell the model which fields you want back and in what shape, typically as a JSON schema.
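To make that concrete, here is one hypothetical way of describing the shape you want, as a small Pydantic model; the class and field names are mine and purely illustrative.

```python
# Sketch: the structure you want back, expressed as a Pydantic model.
from pydantic import BaseModel

class MeetingSummary(BaseModel):
    title: str
    attendees: list[str]
    decisions: list[str]
    follow_up_required: bool
```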
Behind the scenes, the API (like OpenAI’s structured output support, or the JSON/schema modes offered by local runtimes such as llama.cpp and Ollama) turns your schema into a constraint on token generation, so the model can only return a response that matches the shape you asked for.
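A minimal sketch of that round trip with the OpenAI Python SDK’s structured-output helper, reusing the illustrative MeetingSummary model from above (redefined here so the snippet stands alone); the model name and meeting notes are placeholders.

```python
# Sketch: ask the API for a response that validates against MeetingSummary.
from openai import OpenAI
from pydantic import BaseModel

class MeetingSummary(BaseModel):   # same illustrative schema as in the previous snippet
    title: str
    attendees: list[str]
    decisions: list[str]
    follow_up_required: bool

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",           # illustrative model name
    messages=[
        {"role": "system", "content": "Summarize the meeting notes you are given."},
        {"role": "user", "content": "Notes: Anna and Luca agreed to ship v2 on Friday."},
    ],
    response_format=MeetingSummary,  # the SDK converts this into a JSON schema constraint
)
print(completion.choices[0].message.parsed)  # a validated MeetingSummary instance
```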