AI / Data / Software · Short (weeks) · Detectability: Hard

LLM hallucination risk assumed to decrease monotonically

A team shipped an internal assistant with the assumption that model updates would steadily reduce hallucinations.

Silence is not stability.

Decision summary

  • Year: 2023
  • Failure mode: changing error shape; fluency increased faster than grounding and governance.
  • Silent failure window: 2–3 weeks; errors were rare enough to escape attention but impactful enough to cause downstream rework.

The original logic

Benchmarks improved across releases, user satisfaction increased, and a light “human-in-the-loop” review was considered sufficient for the initial launch scope.

Key assumptions

  • Model upgrades would monotonically reduce hallucination rates in our domain.
    Confidence at decision: Medium
    Expected lifetime: Weeks
  • Prompting and retrieval would bound outputs to approved sources.
    Confidence at decision: Medium
    Expected lifetime: 1–3 months
  • Users would treat the assistant as a draft, not an authority.
    Confidence at decision: Low
    Expected lifetime: Weeks
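Each entry above pairs a claim with a confidence level and an expected lifetime, which suggests a simple ledger: once an assumption outlives its expected lifetime, it demands explicit re-validation. A minimal sketch of that record, with hypothetical field names chosen for illustration:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Assumption:
    claim: str
    confidence: str           # "low" | "medium" | "high" at decision time
    expected_lifetime_days: int
    decided_on: date

    def needs_revalidation(self, today: date) -> bool:
        """True once the assumption has outlived its expected lifetime."""
        expiry = self.decided_on + timedelta(days=self.expected_lifetime_days)
        return today >= expiry
```

With a lifetime of "weeks" encoded as, say, 21 days, the first assumption in the list would have expired well before the model update landed.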

What changed

A model update improved general reasoning but changed the failure modes: the model became more fluent in incorrect specifics. Retrieval coverage was incomplete for edge cases, and as the assistant's tone grew more confident, users began treating it as authoritative.

Outcome

A small number of high-impact incorrect recommendations made it into customer-facing materials, forcing retraction, process changes, and tighter governance controls.

Early warning signals (missed)

  • A shift from “obvious” hallucinations to plausible-but-wrong citations
  • Increased user copy-paste into external documents
  • RAG coverage gaps (queries with low retrieval confidence) not surfaced to users
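The third signal above can be made operational with a simple gate: score the retrieved passages for each query, and flag low-confidence answers before they reach the user. A minimal sketch in Python; the types, scores, and threshold are illustrative assumptions, not any particular retrieval stack's API:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    score: float  # similarity score in [0, 1]; hypothetical scale

LOW_CONFIDENCE = 0.35  # illustrative threshold; tune per corpus

def coverage_flags(passages: list[Passage], k: int = 3) -> dict:
    """Summarize retrieval confidence over the top-k passages and
    decide whether the answer should carry a 'review required' flag."""
    top = sorted(passages, key=lambda p: p.score, reverse=True)[:k]
    best = top[0].score if top else 0.0
    mean = sum(p.score for p in top) / len(top) if top else 0.0
    return {
        "best_score": best,
        "mean_score": mean,
        "review_required": best < LOW_CONFIDENCE,
    }
```

Surfacing the flag to users, rather than logging it silently, is what turns a coverage gap into an early warning instead of a post-incident finding.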

How AssureAI would have helped

  • Assumption half-life tracking for “model update improves safety,” requiring explicit re-validation post-upgrade.
  • Signals: retrieval confidence + citation coverage tracked as drift signals with thresholds.
  • Audit exports: every recommendation includes sources, confidence, and “review required” triggers.
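The drift-signal idea in the second bullet can be sketched as a rolling comparison: track a quality metric such as citation coverage per response, freeze a baseline, and alert when the recent window drops beyond a tolerance. All names and thresholds below are illustrative assumptions, not AssureAI's actual interface:

```python
from collections import deque

class DriftMonitor:
    """Rolling mean of a quality metric (e.g. citation coverage in [0, 1]);
    alerts when the recent window falls below the baseline by a margin."""

    def __init__(self, window: int = 50, tolerance: float = 0.05):
        self.window = deque(maxlen=window)
        self.baseline = None  # frozen once the first window fills
        self.tolerance = tolerance

    def record(self, value: float) -> bool:
        """Record one observation; return True if drift is detected."""
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        if self.baseline is None:
            if len(self.window) == self.window.maxlen:
                self.baseline = mean  # freeze the baseline
            return False
        return mean < self.baseline - self.tolerance
```

A monitor like this would have caught the shift described above: per-benchmark scores kept rising while citation coverage quietly degraded after the upgrade.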

Non-obvious lessons

  • Improvement is not monotonic; it is multi-dimensional.
  • Fluency is a risk amplifier when governance is weak.
  • If the tool feels authoritative, the process must be authoritative too.