My default when starting a new AI integration is to reach for the most capable model available — in this case, the most capable model in the client’s purchased tier. More capability should mean better results. That logic feels obvious until it breaks something.
In this case, it nearly broke a customer-facing chatbot before we pushed it to production.
The hallucination hiding in plain sight #
When a customer asked, “Does this tour include airline tickets?”, the model read the knowledge base and then reasoned its way past it. It knew that most all-inclusive tours typically include flights, so it answered: “This tour includes round-trip airline tickets.”
Dead wrong — completely contrary to the actual policy. Had we not caught it during testing, a customer would have shown up at the airport without a ticket. The model was too capable for this specific job.
Why high capability becomes a liability here #
Customer service bots don’t need creativity. They need to stay inside the lines — answer from the knowledge base, follow the rules, and stop there.
I switched to a smaller model with the same knowledge base and the same prompt. The bot responded: “This tour does not include airline tickets. Please book your own tickets or contact us if you need help with a separate booking.” Done. No fabrication, no embellishment.
Matching the model to the actual job #
Defaulting to the most powerful model available creates risk that does not surface until something goes wrong in production. The right question is not which model is most capable — it is which model is most appropriate for what this task actually requires.
That is the engineering decision.