Voice Cloning and the End of Voice Biometrics as a Sole Factor: A Caller-Verification Architecture for Banks

Ramkumar Venkataraman

The Threat Model Changed Faster Than Most Bank Phone Trees

Two facts decide the architecture. The first is that open-source voice-cloning models trained on under thirty seconds of source audio now produce speech that defeats commodity voiceprint verifiers in a meaningful fraction of attempts, and the source audio is often available from a public-facing video, a podcast, or a social account the target does not think of as biometric material. The second is that the customer's mother's maiden name, last four digits of the Social Security number, prior address, and prior employer are all on the open and dark web for a non-trivial share of the US population. Together, the two mean that any authentication path on a bank's voice channel that relies on voice-only matching or on shared secrets is a path a moderately resourced fraud ring can defeat at scale.

The NYDFS October 2024 AI cybersecurity letter said this plainly: covered entities should use forms of authentication that AI-generated deepfakes cannot impersonate, and the letter names digital certificates and physical security keys as examples. The joint NSA, FBI, and CISA guidance "Contextualizing Deepfake Threats to Organizations" and the NIST SP 800-63-4 Digital Identity Guidelines carry the same message in operational vocabulary. We run the agent that holds the bank's phone channel, so the controls below are the ones our customers' fraud and identity teams have validated against this threat model in 2026.

The Two Caller-Verification Problems Banks Confuse

Most caller-verification programs blur two separable problems and pay for it. The first problem is enrollment, where a person claims an identity at account opening and the bank establishes, through identity proofing, what NIST SP 800-63A calls an identity assurance level (IAL). The second problem is authentication on every subsequent call, where a returning caller has to prove they are the same person, at the authenticator assurance level (AAL) that SP 800-63B defines. Voiceprint enrollment is a 63A problem in disguise; voiceprint matching on a subsequent call is a 63B authentication. The defenses are different.

For enrollment, the high-fidelity voice-clone risk argues against using voice at all as identity evidence on a remote channel; the strongest practical posture is in-person or remote-supervised proofing for any high-risk relationship, and a multi-factor combination of cryptographically verifiable credentials and live verification for lower-risk paths. For authentication on a subsequent call, the answer is to push to NIST AAL2 or AAL3 with at least one factor that is provably phishing-resistant, which by definition is not a thing the caller can say into a microphone.

What "Phishing-Resistant" Means When the Caller Is on a Voice Line

NIST SP 800-63B treats authentication as phishing-resistant when the protocol itself prevents authentication secrets and valid authenticator outputs from being disclosed to an impostor verifier, without relying on the user's vigilance, which it achieves through channel binding or verifier name binding. In practice this means cryptographic authenticators such as FIDO2/WebAuthn, smart cards, or hardware security keys, or an approval in a registered application that is cryptographically bound to the verifier and requires user interaction. A code read over the phone, an SMS one-time passcode, an emailed magic link, and a voice biometric are not phishing-resistant under the standard. The voice channel cannot natively carry a FIDO2 handshake, which means the architecture has to move the authenticating event off the voice channel even when the conversation continues on it.

The pattern we deploy at banks is a callbreak: the caller reaches the agent, the agent identifies the requested action and its risk tier, and any tier above the lowest sends an authentication challenge to the customer's enrolled mobile app or hardware key. The customer completes the challenge in the app and returns to the call to continue. The voice channel carries the conversation, not the credential. A customer without an enrolled phishing-resistant factor on a high-risk action is offered an in-branch or video-supervised path rather than a fallback to voiceprint or KBA, because the fallback is the path the attacker exploits.
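A minimal sketch of that step-up decision, assuming a three-tier scheme. The names `RiskTier`, `PHISHING_RESISTANT`, and `next_step`, the factor labels, and the example actions are all hypothetical, chosen for illustration rather than taken from any production schema. The sketch also assumes that every tier above the lowest is handled the same way when no phishing-resistant factor is enrolled.

```python
from enum import IntEnum
from typing import Optional


class RiskTier(IntEnum):
    # Hypothetical tiers; a real program would define its own.
    LOW = 0     # e.g. branch hours, general product questions
    MEDIUM = 1  # e.g. balance or transaction details
    HIGH = 2    # e.g. wire transfer, payee change


# Factor labels treated as phishing-resistant in this sketch (assumed names).
PHISHING_RESISTANT = {"app_passkey", "hardware_key"}


def next_step(tier: RiskTier, enrolled_factor: Optional[str]) -> str:
    """Decide what the voice agent does before it may execute an action."""
    if tier == RiskTier.LOW:
        return "proceed"
    if enrolled_factor in PHISHING_RESISTANT:
        # The credential travels through the app or key, not the voice line.
        return "send_challenge"
    # Deliberately no fallback to voiceprint or KBA.
    return "offer_branch_or_video"


print(next_step(RiskTier.HIGH, "hardware_key"))
print(next_step(RiskTier.HIGH, None))
```

The point of the last branch is that an unenrolled or weakly enrolled caller never reaches a weaker factor; the only alternatives are supervised ones.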

Voice Biometrics, Used Correctly, Are a Risk Signal Not a Factor

A voice biometric is still useful, but the architecture has to demote it from "factor" to "signal." Used as a signal, the biometric score feeds the risk engine alongside ANI verification, device fingerprint from any concurrent app session, the caller's behavioral pattern on this account, and the linguistic and prosodic signals a deepfake-detection model surfaces. A low score raises the risk tier and forces the phishing-resistant challenge earlier; a high score lowers the tier only as far as the rules allow, and never substitutes for the challenge on actions above the lowest tier. The bank that treats the score as a binary factor is the bank that approves the synthetic caller whose model the fraud ring tuned against the bank's specific verifier.
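The asymmetry can be made concrete with a small sketch. The function name `apply_voice_signal`, the integer tiers, and the thresholds 0.3 and 0.9 are assumptions for illustration; a real risk engine would combine many more signals than the voice score alone.

```python
MAX_TIER = 2  # assumed highest tier; 0 is the lowest


def apply_voice_signal(action_tier: int, voice_score: float,
                       low: float = 0.3, high: float = 0.9) -> dict:
    """Treat the voiceprint score as a risk signal, never as a factor."""
    effective = action_tier
    if voice_score < low:
        # Weak match: escalate, which can force a challenge on a low-tier action.
        effective = min(action_tier + 1, MAX_TIER)
    elif voice_score > high:
        # Strong match: relax the tier by at most one step.
        effective = max(action_tier - 1, 0)
    # A good score can lower the tier but cannot waive the challenge for
    # an action that sits above the lowest tier to begin with.
    challenge_required = action_tier > 0 or effective > 0
    return {"effective_tier": effective, "challenge_required": challenge_required}


print(apply_voice_signal(action_tier=1, voice_score=0.99))
print(apply_voice_signal(action_tier=0, voice_score=0.10))
```

In the first call the tier drops to 0 but the challenge is still required, because the requirement is keyed to the action's own tier rather than to the adjusted one.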

We invest in a separate liveness and synthesis-detection model that runs on the inbound audio and reports a probability the caller is human-and-live versus synthesized or replayed. The model is not the decision; it is an input to the risk engine. It also produces a per-call confidence score the QA team samples weekly, because a deepfake-detection model whose performance is unmeasured is the kind of control the next attacker's iteration will quietly outrun. The ISO/IEC 30107 Presentation Attack Detection (PAD) framework is the test rubric we use, and our reporting maps the model's behavior to PAD's vocabulary so the bank's fraud team can compare across vendors.
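As an illustration of PAD vocabulary, the two headline error rates in ISO/IEC 30107-3 are APCER (attack presentations wrongly classified as bona fide) and BPCER (bona fide presentations wrongly classified as attacks). The sketch below computes both from labeled samples; the function name, the sample data, and the 0.5 threshold are made up for the example, and the standard reports APCER per attack type, which would mean running this once per type (for instance, cloned versus replayed audio).

```python
from typing import Iterable, Tuple


def pad_error_rates(samples: Iterable[Tuple[bool, float]],
                    threshold: float) -> Tuple[float, float]:
    """samples: (is_attack, live_probability) pairs. Returns (APCER, BPCER)."""
    attacks = bona_fide = attacks_accepted = bona_fide_rejected = 0
    for is_attack, p_live in samples:
        accepted = p_live >= threshold  # detector says "human and live"
        if is_attack:
            attacks += 1
            attacks_accepted += int(accepted)
        else:
            bona_fide += 1
            bona_fide_rejected += int(not accepted)
    if attacks == 0 or bona_fide == 0:
        raise ValueError("need at least one attack and one bona fide sample")
    return attacks_accepted / attacks, bona_fide_rejected / bona_fide


# Illustrative labeled calls: four synthetic or replayed, four genuine.
labeled = [(True, 0.2), (True, 0.7), (True, 0.1), (True, 0.4),
           (False, 0.9), (False, 0.8), (False, 0.45), (False, 0.95)]
apcer, bpcer = pad_error_rates(labeled, threshold=0.5)
print(f"APCER={apcer:.2f} BPCER={bpcer:.2f}")  # APCER=0.25 BPCER=0.25
```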

The Knowledge-Based Authentication Question, Settled

Knowledge-based authentication (KBA), which asks the caller for a prior address, prior employer, or mother's maiden name, has been deprecated by NIST since SP 800-63-3 and remains deprecated in SP 800-63-4. The data underneath KBA is breach-compromised at population scale, and the same is true for dynamic-KBA questions assembled from credit-header data. We retire KBA from the high-risk authentication path entirely, and where it remains in the lowest-risk paths at all, we use it as a friction step rather than as evidence of identity. A program that still gates wire transfers on KBA is a program whose next material incident is a question of when, not whether.

The ANI Verification Layer the Voice Path Needs Anyway

The STIR/SHAKEN framework, which the Federal Communications Commission requires US voice service providers to implement, attests to the calling number through cryptographic signatures that the originating carrier applies and the terminating carrier verifies. The signal the bank reads is the attestation level (A, B, or C) attached to the inbound call. An A attestation is the carrier's assertion that the calling party is fully authorized to use the calling number. We do not treat an A as authentication, but we treat the absence of an A on a high-risk inbound call as a risk-tier increase that forces an earlier challenge. The bank's voice infrastructure has to make the attestation available to the application layer, which is a configuration question many banks have not addressed because the voice team and the fraud team report through different chains.
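A minimal sketch of how the application layer might consume the attestation, assuming the telephony stack already exposes it as "A", "B", "C", or nothing at all. The function name `tier_after_attestation`, the integer tiers, and the `high_risk_floor` parameter are hypothetical.

```python
from typing import Optional


def tier_after_attestation(action_tier: int, attestation: Optional[str],
                           high_risk_floor: int = 2) -> int:
    """Raise the tier when a high-risk call arrives without full attestation."""
    level = (attestation or "").strip().upper()
    if action_tier >= high_risk_floor and level != "A":
        # Missing, B, or C attestation: escalate so the challenge comes earlier.
        return action_tier + 1
    # An A attestation (or a lower-risk action) leaves the tier unchanged;
    # it is never treated as authentication in its own right.
    return action_tier


print(tier_after_attestation(2, "A"))
print(tier_after_attestation(2, None))
```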

The other ANI-side check is device-binding cross-channel. If the same customer is concurrently in the mobile app on the device whose number matches the inbound ANI, the risk engine reads that as a positive signal. If the customer is in the app on a different device or no app session is live, the signal is neutral; if a device the customer never authenticated from is now in the app, the signal is negative regardless of the ANI. The cross-channel binding is one of the highest-value signals available and requires the voice agent to query the mobile session state in near real time.
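The three outcomes above map naturally onto a small function. This is a sketch under assumed names: `AppSession`, its fields, and `cross_channel_signal` are hypothetical, and the session lookup itself (the near-real-time query to the mobile backend) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppSession:
    device_phone_number: str
    previously_authenticated: bool  # has this customer authenticated from this device before?


def cross_channel_signal(inbound_ani: str, session: Optional[AppSession]) -> int:
    """Return +1 (positive), 0 (neutral), or -1 (negative)."""
    if session is None:
        return 0  # no live app session
    if not session.previously_authenticated:
        return -1  # unknown device in the app: negative regardless of the ANI
    if session.device_phone_number == inbound_ani:
        return 1  # known device whose number matches the caller's ANI
    return 0  # known but different device


print(cross_channel_signal("+15550100", AppSession("+15550100", True)))
print(cross_channel_signal("+15550100", AppSession("+15550100", False)))
```

Checking the unknown-device case before the ANI comparison is what makes the negative signal win even when the numbers match.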

The Agent-Side Controls That Match the Threat

The agent the customer talks to is the surface the attacker is probing, and its design assumptions matter. The system instructions tell the agent that the caller's identity claims are not authentication, that the conversation may proceed for low-risk informational requests without elevated authentication, and that any action above the lowest tier requires the phishing-resistant challenge to succeed before the tool that executes the action becomes available. The tool gating is enforced at the agent's tool layer, not in the prompt, because anything enforced only in the prompt is a control a sufficiently capable prompt injection can defeat (we wrote about that defense separately).
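One way to picture enforcement at the tool layer is a dispatcher that refuses gated tools until an authentication service, not the model, records a passed challenge. The class and method names below (`ToolLayer`, `ChallengeRequired`, `mark_challenge_passed`) and the example tools are illustrative assumptions, not a description of any particular agent framework.

```python
from typing import Any, Callable, Dict, Tuple


class ChallengeRequired(Exception):
    """Raised when a gated tool is called before the challenge has succeeded."""


class ToolLayer:
    def __init__(self) -> None:
        self._tools: Dict[str, Tuple[Callable[..., Any], int]] = {}
        self._challenge_passed = False

    def register(self, name: str, fn: Callable[..., Any], tier: int) -> None:
        self._tools[name] = (fn, tier)

    def mark_challenge_passed(self) -> None:
        # Meant to be called by the authentication service callback only;
        # nothing the caller says and nothing in the prompt can reach it.
        self._challenge_passed = True

    def call(self, name: str, **kwargs: Any) -> Any:
        fn, tier = self._tools[name]
        if tier > 0 and not self._challenge_passed:
            raise ChallengeRequired(name)
        return fn(**kwargs)


layer = ToolLayer()
layer.register("branch_hours", lambda: "9-5", tier=0)
layer.register("send_wire", lambda amount: f"sent {amount}", tier=2)
print(layer.call("branch_hours"))
```

Because the check lives in `call`, a model that has been talked into requesting `send_wire` still gets an exception rather than a transfer.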

The agent surfaces context to the human handoff when the risk engine routes a call to a person. The handoff payload includes the verified-or-not authentication state, the risk score with the contributing signals, the deepfake-detection model's output, and the cross-channel session state, so the person receiving the call is not running the verification from scratch. The handoff also includes the agent's confidence in the speaker's identity claim, which is sometimes "we believe this is the customer" and sometimes "we believe this is plausibly the customer's voice but the device and behavioral signals do not match," and the human reviewer makes the call with the full picture rather than with the agent's recommendation alone.
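A sketch of what such a payload could look like as a data structure. The class name `HandoffPayload`, the field names, and the sample values are assumptions for illustration; only the categories of information come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class HandoffPayload:
    authenticated: bool                    # did the phishing-resistant challenge succeed?
    risk_score: float
    contributing_signals: Dict[str, float]
    synthesis_probability: Optional[float] # deepfake-detection output, if it ran
    cross_channel_state: str
    identity_assessment: str               # the agent's stated confidence, in plain language

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


payload = HandoffPayload(
    authenticated=False,
    risk_score=0.82,
    contributing_signals={"voice_match": 0.91, "device_match": 0.0},
    synthesis_probability=0.35,
    cross_channel_state="no_app_session",
    identity_assessment="voice plausible; device and behavioral signals do not match",
)
print(payload.to_json())
```

Keeping the assessment as a separate field from the authentication state mirrors the distinction in the text: the reviewer sees what the agent believes and what was actually proven.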

The Red-Team Cadence Our Customers Run

A caller-verification program that is not adversarially tested is not a program; it is a wish. The cadence we run every quarter and on every material change includes a synthetic-voice penetration test using publicly available cloning tools tuned on consented sample audio of a tester, a STIR/SHAKEN spoofing simulation against the inbound path through the bank's voice infrastructure, a KBA-as-attacker test that probes which of the bank's residual KBA paths still grant material access, and a social-engineering battery against the human-handoff path that tests whether a verified-looking but unauthenticated caller can talk a representative into bypassing the challenge. The results feed the next round of controls. The institution that takes the synthetic-voice test the first time is almost never satisfied with where its program lands; the second time around the deltas matter.

What the File Contains After a Disputed Call

The disputed call is the moment the program has to show its work. The per-call record we keep includes the inbound ANI with the STIR/SHAKEN attestation, the inbound audio with the deepfake-detection score time-series across the call, the voiceprint score if one ran and its position in the score distribution, the cross-channel device and session state at the moment of the call, the risk-tier classification of every action requested and the authentication challenges issued and completed, the response time on each challenge, the tool calls the agent made and the gating that authorized them, and the handoff payload if one was produced. A disputed transaction at month three reviewed against this record is a determination, not a reconstruction. A disputed transaction without this record is a settlement.
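A dispute reviewer's first question is whether the record is complete. The sketch below checks a per-call record for the evidence listed above; the key names and the split between always-required and conditional fields are assumptions for illustration (the voiceprint score and handoff payload exist only if those steps ran).

```python
from typing import Dict, List

# Assumed key names for the evidence every call record should carry.
REQUIRED_EVIDENCE = (
    "inbound_ani",
    "stir_shaken_attestation",
    "synthesis_score_series",
    "cross_channel_state",
    "action_risk_tiers",
    "challenges",  # issued and completed, with response times
    "tool_calls",  # including the gating decision that authorized each one
)
# Present only when the corresponding step ran on the call.
CONDITIONAL_EVIDENCE = ("voiceprint_score", "handoff_payload")


def missing_evidence(record: Dict[str, object]) -> List[str]:
    """List required fields absent from a per-call record."""
    return [key for key in REQUIRED_EVIDENCE if key not in record]


record = {
    "inbound_ani": "+15550100",
    "stir_shaken_attestation": "B",
    "synthesis_score_series": [0.12, 0.15, 0.4],
    "cross_channel_state": "no_app_session",
    "action_risk_tiers": {"send_wire": 2},
}
print(missing_evidence(record))  # ['challenges', 'tool_calls']
```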

The Honest Limit and the Direction of Travel

There is no detector for synthetic voice whose accuracy is stable against the next generation of cloning models, and NIST AI 100-2 E2025, "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations," is candid about this. The architecture above is built so that a detector miss is not a fraud incident, because the actions an attacker would care about are gated behind a factor the synthetic voice cannot complete. The detection layer is defense in depth that lowers the rate at which the challenge is invoked on legitimate callers and raises the rate at which suspicious callers escalate, but the actual stop is the cryptographic authenticator the model cannot reach. The direction of travel is toward more capable voice synthesis, more accessible to more attackers, on shorter source audio, and a program designed around voice as a factor is on the wrong side of that curve. The program designed around voice as a signal, with phishing-resistant authentication as the factor, is on the right side. We build for the curve that is coming, not the threat model that was current the last time the voice channel was redesigned.

Ramkumar Venkataraman

CTO & Co-Founder

© 2026 Sei Software Technologies Inc. All rights reserved.