Stop defaulting to the most powerful model for rule-based tasks


My default when starting a new AI integration is to reach for the most capable model available, which here meant the most capable model in the client’s purchased tier. More capability should mean better results. That logic feels obvious until it breaks something.

In this case, it nearly broke a customer-facing chatbot. We caught the problem just before pushing to production.

The hallucination hiding in plain sight

We were building a customer service bot for a tour operator. The knowledge base was explicit: the 3-day, 2-night package does not include airline tickets.

When a customer asked, “Does this tour include airline tickets?”, the model read the knowledge base and then reasoned its way past it. It drew on the general pattern that all-inclusive tours typically include flights, so it answered: “This tour includes round-trip airline tickets.”

Dead wrong, and directly contrary to the actual policy. Had we not caught it during testing, a customer could have shown up at the airport without a ticket. The model was too capable for this specific job.

Why high capability becomes a liability here

Powerful models tend to fill gaps. When the knowledge base did not resolve a question cleanly, this one answered as if it had, drawing on general world knowledge about how tours usually work rather than stopping at what the document actually said. For creative writing or debugging complex code, that tendency is useful. For a customer service bot operating from a fixed knowledge base, it is the failure mode: the model’s instinct to help becomes the thing that breaks compliance.

Customer service bots don’t need creativity. They need to stay inside the lines — answer from the knowledge base, follow the rules, and stop there.

I switched to a smaller model with the same knowledge base and the same prompt. The bot responded: “This tour does not include airline tickets. Please book your own tickets or contact us if you need help with a separate booking.” Done. No fabrication, no embellishment.
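One way to keep a comparison like this honest is to pin the policy fact down as a regression check and run it against every candidate model. The sketch below is illustrative rather than what we ran: `PolicyCase`, `CASES`, `failing_cases`, and the `ask_bot` callable are hypothetical names, and plain substring matching is a deliberately crude stand-in for whatever answer evaluation you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PolicyCase:
    """One knowledge-base fact the bot must never contradict (hypothetical structure)."""
    question: str
    must_contain: str      # phrase the answer has to include
    must_not_contain: str  # phrase that would contradict the knowledge base


# The single case from this article; a real suite would hold one per policy fact.
CASES = [
    PolicyCase(
        question="Does this tour include airline tickets?",
        must_contain="does not include airline tickets",
        must_not_contain="includes round-trip airline tickets",
    ),
]


def failing_cases(ask_bot: Callable[[str], str], cases=CASES) -> list:
    """Return the cases where the bot's answer drifts from the knowledge base.

    `ask_bot` is an assumed wrapper around whichever model is under test:
    it takes the customer question and returns the reply text.
    """
    failures = []
    for case in cases:
        answer = ask_bot(case.question).lower()
        missing_fact = case.must_contain.lower() not in answer
        contradicts = case.must_not_contain.lower() in answer
        if missing_fact or contradicts:
            failures.append(case)
    return failures
```

Running `failing_cases` with the same knowledge base and prompt against each model turns "the smaller model behaved" into something you can re-check every time the prompt or the model changes.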

Matching the model to the actual job

This reshaped how I think about model selection. The rough split I use now: scripted customer service, knowledge base Q&A, rule-based classification, and schema extraction all call for a smaller, more predictable model where stability matters more than reasoning range. Debugging, content generation, and multi-factor analysis are where a larger model earns its place, because the ability to infer and synthesize is the whole point.
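The split above can be written down as an explicit lookup so that model choice is a reviewed decision and not a default. This is a minimal sketch: the `Tier` enum, the task-type keys, and `pick_tier` are hypothetical names, and mapping a tier to a concrete model is left to whatever the client's plan offers.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # predictable, stays close to the provided material
    LARGE = "large"  # broader inference and synthesis


# Task categories taken from the rough split described above.
TASK_TIERS = {
    "scripted_customer_service": Tier.SMALL,
    "knowledge_base_qa": Tier.SMALL,
    "rule_based_classification": Tier.SMALL,
    "schema_extraction": Tier.SMALL,
    "debugging": Tier.LARGE,
    "content_generation": Tier.LARGE,
    "multi_factor_analysis": Tier.LARGE,
}


def pick_tier(task_type: str) -> Tier:
    """Look up the tier for a task type; refuse to guess for unknown ones."""
    try:
        return TASK_TIERS[task_type]
    except KeyError:
        # No silent fallback to the largest model: that default is the
        # habit this article argues against.
        raise ValueError(
            f"Unclassified task type {task_type!r}: decide its tier explicitly"
        ) from None
```

Raising on an unknown task type is a deliberate choice in this sketch: it forces someone to classify a new task instead of letting it inherit the most powerful model by accident.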

Defaulting to the most powerful model available creates risk that does not surface until something goes wrong in production. The right question is not which model is most capable — it is which model is most appropriate for what this task actually requires.

That is the engineering decision.
