Model Risk Management for AI Agents: An SR 11-7 and NIST AI RMF Playbook
Why MRM Is Now a Gating Item
Federal banking regulators treat AI agents as models. SR 11-7, the model risk guidance the Federal Reserve and OCC issued jointly in 2011 (the OCC's version is Bulletin 2011-12, and the FDIC adopted it through FIL-22-2017), already covers anything that takes inputs and produces a quantitative output used in decisioning. The CFPB's 2024 comments on AI confirmed that regulators see no exemption for generative or conversational systems. NIST's AI Risk Management Framework, which banking supervisors increasingly cite as a reference for AI governance, fills in what SR 11-7 was written too early to cover: bias, robustness, explainability, and human oversight for non-quantitative models.
If you cannot show your AI agents in your model inventory with full lineage, you should not be running them in production. We help banks pass this bar without slowing the program to a crawl.
What Counts as a Model
This is where most programs get into trouble. SR 11-7 defines a model broadly: a "quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." Applied to AI agents, the definition catches:
- The LLM that decides what the agent says
- The retrieval system that grounds its answers
- The classifier that routes calls to escalation
- The transcription model and any post-call analytics
- Any embedding-based search that surfaces policy text
Each of these needs an inventory entry, a model owner, a validator, and a monitoring plan. Treating "the AI agent" as one entry on the inventory is what fails an exam. Each component is a model in its own right.
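To make the point concrete, here is a minimal sketch of one way to register those components as separate inventory records. The dataclass fields, identifiers, and team names are hypothetical, not our shipped template:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    model_id: str         # unique identifier in the model inventory
    component: str        # what this component does inside the agent
    owner: str            # first-line business owner
    validator: str        # second-line validator
    monitoring_plan: str  # reference to the monitoring document

# One record per component -- never one record for "the AI agent".
agent_components = [
    InventoryEntry("LLM-001", "response-generation LLM", "servicing-ops", "mrm", "MON-LLM-001"),
    InventoryEntry("RAG-001", "policy retrieval system", "servicing-ops", "mrm", "MON-RAG-001"),
    InventoryEntry("CLS-001", "escalation-routing classifier", "servicing-ops", "mrm", "MON-CLS-001"),
    InventoryEntry("ASR-001", "call transcription model", "servicing-ops", "mrm", "MON-ASR-001"),
    InventoryEntry("EMB-001", "policy-text embedding search", "servicing-ops", "mrm", "MON-EMB-001"),
]
```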
The Three Lines and Who Does What
The three-lines structure implicit in SR 11-7 still applies. The first line is the business owner who deploys and uses the agent. The second line is independent model risk management, which validates and monitors. The third line is internal audit. What changes with AI agents is that validation now requires people who understand both the regulatory frame (UDAAP, Reg B, Reg Z) and the technical stack (prompt construction, retrieval grounding, eval harnesses). Most banks do not yet have this skill set in-house, so we see hybrid arrangements where the second line uses outside validators against an internal control plan.
Validation: What the Second Line Actually Does
For each model component, validation has to cover four areas.
Conceptual soundness
Is the design fit for purpose? An LLM tuned on general chat is not a defensible choice for issuing TILA-aligned disclosures. A retrieval system that does not cite source documents back to the agent cannot ground answers in policy. The validator's job is to challenge the design choice, not just the metrics.
Outcomes analysis
Does the agent produce the right outcomes against a labeled test set? For voice agents, this means a held-out set of calls scored against policy adherence, disclosure delivery, and resolution. For document agents, it means an extraction set with ground truth from human review. We aim for at least 1,000 labeled cases per use case before promoting to production, and 50 to 100 per week of fresh sampling thereafter.
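A minimal sketch of the outcomes-analysis computation, assuming each labeled case carries a predicted value and a human-reviewed actual value per dimension. The field names and the 0.95 threshold are illustrative:

```python
def outcomes_analysis(cases, threshold=0.95):
    """Score agent outputs against human-labeled ground truth.

    `cases` is a list of dicts; each hypothetical key below holds a
    {"predicted": ..., "actual": ...} pair from human review.
    """
    dims = ("policy_adherent", "disclosure_delivered", "resolved")
    metrics = {
        dim: sum(c[dim]["predicted"] == c[dim]["actual"] for c in cases) / len(cases)
        for dim in dims
    }
    # Promotion gate: the set must be large enough to mean anything,
    # and every dimension must clear the threshold.
    metrics["promote"] = len(cases) >= 1000 and all(metrics[d] >= threshold for d in dims)
    return metrics
```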
Ongoing monitoring
Drift is real. Borrower language shifts, policy changes, and silent model updates from upstream providers can move accuracy several points in a quarter. The monitoring plan needs trigger thresholds, an escalation path, and a documented frequency. Weekly is the minimum cadence we recommend for production agents talking to borrowers.
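A sketch of what the trigger logic can look like, assuming a weekly job compares fresh-sample accuracy to the validated baseline. The threshold values are illustrative; the real numbers belong in the documented monitoring plan:

```python
def weekly_drift_check(baseline_acc, weekly_acc, warn_drop=0.02, halt_drop=0.05):
    """Compare this week's sampled accuracy to the validated baseline."""
    drop = baseline_acc - weekly_acc
    if drop >= halt_drop:
        return "halt"      # pull the agent from production, open an incident
    if drop >= warn_drop:
        return "escalate"  # route to the model owner and second line
    return "ok"            # log the result and continue weekly sampling
```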
Challenger models
SR 11-7 expects a challenger when feasible. For an AI agent, a useful challenger is a deterministic rules-based version of the same workflow, run on the same inputs in shadow. When the LLM agent and the rules engine diverge, that is your cue for human review. It also keeps the team honest about how often the AI is genuinely doing better than a flowchart.
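A sketch of the shadow comparison, assuming both the LLM agent and the rules engine expose a decision function over the same inputs (the interfaces are hypothetical):

```python
def shadow_compare(inputs, llm_agent, rules_engine):
    """Run a deterministic rules engine in shadow alongside the LLM agent.

    Divergent cases go to human review; the divergence rate itself tells
    you how often the AI is doing something a flowchart would not.
    """
    divergent = []
    for item in inputs:
        llm_decision = llm_agent(item)       # production path
        rules_decision = rules_engine(item)  # challenger, shadow only
        if llm_decision != rules_decision:
            divergent.append((item, llm_decision, rules_decision))
    divergence_rate = len(divergent) / len(inputs)
    return divergent, divergence_rate
```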
NIST AI RMF: Filling the Gaps
NIST AI RMF organizes work into four functions — Govern, Map, Measure, Manage — and adds dimensions SR 11-7 was not built for: fairness, robustness against adversarial inputs, privacy, transparency, and safety. For an AI agent in regulated finance, the Map function is where most of the value is. You map the actors, the regulatory exposure, the data flows, and the potential harms before you measure anything. We have seen banks skip this and end up with a model card that describes accuracy in detail and never mentions UDAAP.
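A sketch of what a Map artifact might capture for a borrower-facing voice agent, recorded before any accuracy measurement starts. The contents are illustrative:

```python
# Illustrative Map artifact for a borrower-facing voice agent.
rmf_map = {
    "actors": ["borrower", "agent platform", "model owner", "second-line validator"],
    "regulatory_exposure": ["UDAAP", "Reg B", "Reg Z"],
    "data_flows": [
        "call audio -> transcription -> LLM -> spoken response",
        "policy documents -> embedding index -> retrieval -> LLM context",
    ],
    "potential_harms": [
        "incorrect or missing disclosure",
        "disparate treatment across borrower groups",
        "unauthorized commitment made to a borrower",
    ],
}
```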
A Working Inventory Entry for an AI Agent
A minimum viable inventory entry includes the model name, version, owner, validator, intended use, prohibited use, in-scope regulations, training data summary, evaluation harness reference, monitoring plan, change control process, retirement criteria, and links to source artifacts. We ship banks a template that maps each field to SR 11-7 and AI RMF references so the alignment is visible to examiners.
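A sketch of the idea, with each field mapped to an illustrative guidance reference; the actual template carries fuller citations and links:

```python
# Illustrative field-to-guidance mapping; real entries carry full
# citations and links to source artifacts.
INVENTORY_TEMPLATE = {
    "model_name":            "SR 11-7: model inventory",
    "version":               "SR 11-7: change control",
    "owner":                 "SR 11-7: first-line ownership",
    "validator":             "SR 11-7: independent validation",
    "intended_use":          "SR 11-7: purpose and use / AI RMF: Map",
    "prohibited_use":        "AI RMF: Map",
    "in_scope_regulations":  "AI RMF: Map",
    "training_data_summary": "SR 11-7: data quality / AI RMF: Measure",
    "evaluation_harness":    "SR 11-7: outcomes analysis",
    "monitoring_plan":       "SR 11-7: ongoing monitoring",
    "change_control":        "SR 11-7: change control",
    "retirement_criteria":   "SR 11-7: model lifecycle",
    "source_artifacts":      "SR 11-7: documentation / AI RMF: Govern",
}
```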
What Trips Programs Up
A short list of what we have seen go wrong:
- Treating the prompt as configuration, not a model artifact. Prompts version like code and need change control (see the sketch after this list).
- Letting the agent retrain itself on production conversations without a documented promotion gate.
- Validating the LLM and skipping the retrieval system, where most policy errors actually originate.
- Using vendor benchmarks instead of building an internal eval set. Vendor numbers tell you nothing about how the system will perform on your borrowers.
- No retirement criteria. SR 11-7 expects you to know when a model goes out of scope. Most AI agent deployments cannot answer this.
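On the first item above, a minimal sketch of treating a prompt as a versioned artifact; content-hashing is one possible mechanism, not the only one:

```python
import hashlib

def prompt_version(prompt_text: str) -> str:
    """Content-hash a prompt so it can be pinned in the inventory
    like any other versioned artifact."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

# Any change to the hash should trigger the same change-control path
# as a model update: review, revalidation against the eval set, and a
# change-log entry before the new prompt reaches production.
```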
Pilot to Production Timeline
A realistic SR 11-7-aligned MRM build for an AI agent program is 12 to 16 weeks: four for inventory and policy, four for the eval harness and challenger, and the balance for validation, monitoring tooling, and the second-line review. Banks that try to compress this end up redoing it in the next exam cycle.
Audit Posture
If your examiner asks for the AI agent's MRM file, you should be able to produce: the inventory record, the model card, the validation report with outcomes analysis, the monitoring dashboard with the last 90 days of metrics, the change log since the last review, the challenger comparison, and the AI RMF mapping. The work to produce this on demand is the work that makes the program defensible. It is also what lets you ship the next use case in weeks instead of quarters.
Ramkumar Venkataraman
CTO & Co-Founder