Preventing AI Hallucinations in Bank Customer Service: A Grounding and Citation Architecture

Ramkumar Venkataraman

Why Hallucinations Are the Number One Production Risk

In regulated finance, an AI agent that invents an answer is not a quirky bug. It is a UDAAP violation, a TILA misstatement, or a RESPA disclosure failure. Examiners do not distinguish between a human who misquoted policy and a model that did. The bank owns both.

We have built and operated AI agents at banks where a single hallucinated APR would have triggered a remediation program, so this is the architecture we use to keep that from happening.

What "Hallucination" Actually Means

The term is a catch-all. There are at least four failure modes worth separating:

  • Invented facts. The model produces a number, name, or rule that does not exist. APRs, fee amounts, eligibility cutoffs.
  • Stale facts. The model produces something that was true once and is no longer true. Last quarter's promo rate, an old escrow rule.
  • Composed facts. The model takes two true things and combines them into a false third thing. Fannie Mae's maximum DTI plus FHA's compensating factors blended into one rule.
  • Out-of-policy compliant-sounding answers. The model says something plausible that the bank does not, in fact, permit. "I can waive that fee for you" when the agent has no authority to.

A grounding architecture has to defend against all four, and each failure mode calls for a different defense.

The Architecture We Deploy

1. Retrieval-grounded generation, not free generation

The agent does not generate answers from the model's parameters. It retrieves passages from the bank's policy library, product disclosures, rate sheets, and procedure documents, and the prompt instructs the model to answer only from those passages. The retrieval system is a model in its own right — it gets validated, monitored, and versioned (see our SR 11-7 piece). When retrieval fails, the agent does not improvise. It says it does not have an answer and offers a handoff.
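
A minimal sketch of that fail-closed path, in Python. The retrieve and generate callables and the relevance threshold are illustrative stand-ins, not our production interfaces:

```python
# Sketch of retrieval-grounded answering with a fail-closed fallback.
# `retrieve` and `generate` are hypothetical stand-ins for the retrieval
# service and the LLM client; the 0.65 threshold is illustrative.
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str
    text: str
    relevance: float  # retriever score in [0, 1]

HANDOFF = "I don't have that information. Let me connect you with a specialist."

def answer(question: str, retrieve, generate, min_relevance: float = 0.65) -> str:
    passages = [p for p in retrieve(question) if p.relevance >= min_relevance]
    if not passages:
        return HANDOFF  # retrieval failed: no improvising, offer a handoff
    context = "\n\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    prompt = (
        "Answer ONLY from the passages below. If they do not contain the "
        "answer, say you do not know.\n\n" + context + "\n\nQuestion: " + question
    )
    return generate(prompt)
```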

2. Citation-required generation

Every claim in the agent's response is tied back to a source passage with an internal citation. Borrowers do not see the citation, but the QA system does. If a sentence has no citation, it is flagged. We run a post-generation pass that scores citation coverage on every response; below threshold, the response does not ship. This single control catches the majority of composed-fact errors before they reach the borrower.
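
The coverage pass itself is small. A sketch, assuming the model emits internal markers like [doc-123] after each claim; the marker format and the threshold are illustrative:

```python
# Post-generation citation-coverage check. Assumes internal markers like
# [doc-123] after each claim; marker format and threshold are illustrative.
import re

CITATION = re.compile(r"\[[\w-]+\]")

def citation_coverage(response: str) -> float:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if CITATION.search(s)) / len(sentences)

def may_ship(response: str, threshold: float = 0.95) -> bool:
    """Below threshold, the response is blocked rather than sent."""
    return citation_coverage(response) >= threshold
```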

3. Confidence gating

The model returns a confidence signal — either logprobs, an internal self-evaluation, or a verifier model running on the output. Below a threshold, the agent escalates to a human or asks a clarifying question instead of guessing. We tune the threshold per use case. For a balance lookup the threshold is permissive; for a refinance eligibility statement, it is strict.
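
In code, the gate reduces to a per-use-case threshold table. The numbers below are placeholders; real values come out of golden-set tuning:

```python
# Per-use-case confidence gating. `confidence` is whichever signal is in
# play (logprobs, self-eval, or a verifier model); thresholds are placeholders.
THRESHOLDS = {
    "balance_lookup": 0.60,    # permissive: low stakes, easily verified
    "refi_eligibility": 0.90,  # strict: a wrong answer is a compliance event
}

def route(use_case: str, response: str, confidence) -> str:
    score = confidence(response)
    if score < THRESHOLDS.get(use_case, 0.90):  # unknown use cases default strict
        return "ESCALATE"  # human handoff or clarifying question, never a guess
    return "SEND"
```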

4. Hard policy guardrails

Some statements are never allowed regardless of model confidence. The agent cannot quote a rate that is not on today's rate sheet. It cannot quote a fee that is not in the disclosures. It cannot promise a waiver. These are deterministic checks, not LLM judgment. They sit in front of and behind the model, and they fail closed.
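
Because the checks are deterministic, they are short to write and easy to audit. A simplified rate-sheet check, with an illustrative regex and data shape:

```python
# Deterministic guardrail: every rate quoted in the response must appear on
# today's rate sheet, or the response is blocked (fail closed). The regex
# and the set-of-floats rate sheet are illustrative simplifications.
import re

RATE = re.compile(r"(\d+\.\d+)\s*%")

def guardrail(response: str, todays_rate_sheet: set[float]) -> str:
    quoted = {float(m) for m in RATE.findall(response)}
    if not quoted <= todays_rate_sheet:  # any off-sheet rate fails the check
        return "BLOCK"
    return "PASS"
```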

5. Freshness controls

Stale facts are usually a data problem, not a model problem. The retrieval index has versioned documents with effective dates. The agent's prompt tells it to filter by effective date. When a rate sheet updates, a job re-indexes within minutes and invalidates cached responses. We run a daily diff against the source-of-truth systems so missed updates surface in the morning, not in a month-end QA review.
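
A sketch of the effective-date filter, with illustrative field names, shows how stale versions drop out of retrieval:

```python
# Freshness control: the index stores effective-date ranges per document
# version, and retrieval only sees what is in force today. Field names
# are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class DocVersion:
    doc_id: str
    version: int
    effective_from: date
    effective_to: date | None  # None means currently in force

def in_force(d: DocVersion, today: date) -> bool:
    return d.effective_from <= today and (d.effective_to is None or today <= d.effective_to)

def current_documents(index: list[DocVersion], today: date) -> list[DocVersion]:
    return [d for d in index if in_force(d, today)]
```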

6. Out-of-policy detection

A separate classifier watches for answers that are policy-shaped but not grounded in the bank's policy. "I can do that for you" without a corresponding authority in the policy library triggers a flag. We train this classifier on the bank's actual policy text plus a dataset of historical out-of-policy answers — both real misses and synthetic adversarial examples.
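
The production control is a trained classifier; the sketch below substitutes a keyword heuristic purely to show the shape of the check. The phrase-to-authority mapping is illustrative:

```python
# Shape of the out-of-policy check: a commitment-sounding statement is
# flagged unless the agent holds a matching authority. In production this
# is a trained classifier; the phrase table here is a stand-in.
COMMITMENTS = {
    "waive that fee": "fee_waiver",
    "reduce your rate": "rate_exception",
    "extend your due date": "due_date_extension",
}

def out_of_policy(response: str, granted_authorities: set[str]) -> bool:
    text = response.lower()
    return any(phrase in text and authority not in granted_authorities
               for phrase, authority in COMMITMENTS.items())
```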

The Eval Harness That Keeps It Honest

The architecture is not the proof. The eval harness is. We build three layers.

Golden set

A hand-curated set of 500 to 2,000 questions per use case with expected answers and the policy citations that support them. The agent runs against the golden set on every change. Below threshold, the change does not promote.
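
The promotion gate itself is simple; the work is in curating the set. A sketch with illustrative case fields, scorer, and threshold:

```python
# Golden-set gate: every change runs the full set and does not promote
# below a pass-rate threshold. Fields, scorer, and threshold are illustrative.
from dataclasses import dataclass

@dataclass
class GoldenCase:
    question: str
    expected_answer: str
    supporting_citations: list[str]

def pass_rate(agent, cases: list[GoldenCase], scorer) -> float:
    if not cases:
        return 0.0
    return sum(1 for c in cases if scorer(agent(c.question), c)) / len(cases)

def may_promote(agent, cases, scorer, threshold: float = 0.98) -> bool:
    return pass_rate(agent, cases, scorer) >= threshold
```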

Adversarial set

Questions designed to trip the agent: edge cases, jurisdictional twists, requests to do things the agent should not do, requests phrased in ways that historically have caused hallucinations elsewhere. We add to this set every time a real conversation surfaces a new failure mode.
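
One way to keep that habit disciplined is a fixed record schema for every new case; the fields below are assumptions for illustration:

```python
# Illustrative record for growing the adversarial set: every failure mode
# observed in production becomes a permanent regression case.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AdversarialCase:
    question: str             # the phrasing that tripped the agent
    failure_mode: str         # e.g. "composed_fact", "out_of_policy"
    source_conversation: str  # reference to the live conversation it came from
    added_on: date
```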

Production sample

Daily sampling of live conversations, scored by a separate evaluator model and reviewed weekly with compliance. A 1% sample on a busy line is hundreds of conversations; that is enough to catch drift.
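
The sampling job is a few lines; the evaluator model does the heavy lifting. An illustrative sketch:

```python
# Daily production sampling: score a 1% sample of live conversations with
# a separate evaluator model. The rate and `evaluate` are placeholders.
import random

def daily_sample(conversations: list[str], evaluate,
                 sample_rate: float = 0.01, seed: int | None = None):
    if not conversations:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(conversations) * sample_rate))
    sample = rng.sample(conversations, k)
    return [(conv, evaluate(conv)) for conv in sample]  # feeds the weekly review
```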

The eval scores feed a dashboard the model owner sees and the second line reviews. If the production sample diverges from the golden set, that is the trigger for investigation.
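
The divergence trigger can be as blunt as a mean-score comparison against the golden baseline; the tolerance here is an assumption:

```python
# Drift trigger: investigate when the production sample's mean score falls
# meaningfully below the golden-set baseline. Tolerance is illustrative.
def drift_alert(production_scores: list[float], golden_baseline: float,
                tolerance: float = 0.05) -> bool:
    if not production_scores:
        return True  # no data is itself a reason to investigate
    mean = sum(production_scores) / len(production_scores)
    return mean < golden_baseline - tolerance
```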

What the Vendor Pitch Misses

A common vendor claim is "zero hallucinations." That is not a real number. What you can promise honestly is a measured rate, the eval set it was measured on, and a monitoring system that catches drift. We tell banks: ask the vendor for the eval methodology and the production sampling rate. If they cannot give you both, the zero-hallucination claim is marketing.

Where We See Programs Fail

Three failure patterns we have walked into and fixed:

  • The retrieval system was built once and never re-evaluated. New products were added to the bank's offerings; the index never picked them up. The agent confidently quoted old policy.
  • The citation check was a soft warning, not a hard gate. Hallucinations slipped through because no one read the warnings.
  • The prompt was treated as configuration. A vendor pushed a prompt update that softened a refusal rule, and the agent started promising waivers it did not have authority for. There was no change-control gate in front of prompt updates.

Each of these is a process problem dressed as a model problem. The architecture above closes them.

A 60-Day Plan

For a bank starting fresh, a workable rollout is:

  • Days 1–10. Pick one workflow with a complete, source-of-truth policy artifact. Build the retrieval index and version it. Stand up the citation check and the basic guardrails.
  • Days 11–25. Build the golden and adversarial eval sets with the policy team. Tune the confidence threshold against the golden set.
  • Days 26–45. Pilot in production on a single queue. Daily review with compliance. Track hallucination rate, citation coverage, and unjustified-confidence rate.
  • Days 46–60. Promote to a broader queue. Establish the weekly review cadence. Hand the dashboard to the second line.

By day 60 you have a measured hallucination rate, an eval set that the second line trusts, and a process that lets you ship the next workflow without rebuilding the architecture.

The Standard To Hold Vendors To

If you are buying, here are the questions worth asking. What is the citation coverage on a representative production sample? What is the hallucination rate on the bank's own eval set, not the vendor's? What is the change-control process for prompts and retrieval? Who validates the retrieval system, and how often? When the answers are concrete, the program will hold up. When they are not, the architecture above is what you should be building or buying.

Ramkumar Venkataraman

CTO & Co-Founder

