Fair Lending Testing for AI Agents: A Disparate Impact Playbook Under ECOA, HMDA, and the FHA
Why Fair Lending Is the Next AI Exam Theme
The CFPB's August 2024 comment letter and the OCC's 2025 fair lending update both made the same point: an AI model used in a credit, pricing, marketing, or servicing decision is a fair-lending model. The bank owns the disparate-impact analysis whether the model is its own, a vendor's, or a hybrid. Examiners have started asking for the testing file at the start of the exam, not at the end.
We work with banks and non-bank lenders running AI in origination, servicing, and outreach. The fair-lending file is the document that decides whether the AI program survives the exam.
The Three Tests You Have to Run
A fair-lending program for an AI agent needs three tests, each producing its own artifact.
The first is disparate treatment, which asks whether the agent treats a protected-class applicant differently from a similarly situated non-protected-class applicant. For an AI agent, this is the test that catches a model using a protected-class proxy as a feature.
The second is disparate impact, which asks whether the agent's outcomes differ across protected classes in a statistically significant way, regardless of intent, and whether any difference is justified by a legitimate business need with no less-discriminatory alternative.
The third is unequal treatment in marketing and steering, which asks whether the agent's outbound outreach, language, channel selection, and offer surfacing differ across protected classes in a way that affects access to credit.
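A bank that runs only the first test is missing the test that catches most modern AI failures. A bank that runs only the second is missing the test that catches the proxy-feature problem. A bank that skips the third never examines the outreach and offer decisions that shape access to credit before an application exists. Each test is necessary.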
What Counts as a "Decision" for Fair Lending
This is where most AI programs lose the thread. A fair-lending decision is any output of an AI agent that affects a consumer's access to credit, the terms of credit, or the experience of obtaining credit. The list catches more than the underwriting model:
- The marketing model that decides whether to surface a HELOC offer to a customer
- The lead-scoring model that decides which loan officer gets the inbound application
- The chat agent that recommends a product to a borrower
- The voice agent that decides when to escalate a hardship call versus self-serve
- The collections agent that decides which cadence to use
- The document agent that decides whether a stipulation has been satisfied
Each of these can produce a fair-lending outcome. Each needs the testing file.
Building a Defensible Disparate-Impact Test
The methodology that holds up in an exam has a small number of moving parts done well.
Define the protected-class proxy carefully
For mortgage applications under HMDA, the protected class is reported. For non-mortgage products, the bank has to infer with BISG (Bayesian Improved Surname Geocoding) or a similar method. We use BISG with the 2020 Census tracts, document the methodology, and report the imputation confidence. We never run a fair-lending test against an undocumented proxy.
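The Bayesian update at the core of BISG can be sketched in a few lines. This is a minimal illustration, not a production imputation: the function name `bisg_posterior`, the group labels, and the probability values are all hypothetical, and a real run would draw the surname and geography tables from Census data. It assumes surname and geography are independent given the class, which is the standard BISG simplification.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    P(race | surname, geo) is proportional to P(race | surname) * P(geo | race),
    assuming surname and geography are independent given race.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no overlap between surname and geography evidence")
    return {race: value / total for race, value in unnormalized.items()}


# Illustrative numbers only, not real Census values.
surname_prior = {"group_a": 0.70, "group_b": 0.20, "group_c": 0.10}
geo_likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}

posterior = bisg_posterior(surname_prior, geo_likelihood)
max_prob = max(posterior.values())  # one simple per-record confidence measure
```

The maximum posterior probability per record is one candidate input for the imputation confidence report; the choice of confidence measure is itself a methodology decision to document.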
Define the outcome and the population
The outcome is the model's decision: approve/deny, price band, channel, contact frequency. The population is everyone the agent decided on, not everyone who came in. A model that screens out borrowers before the recorded decision creates a missing-data problem that has to be analyzed separately.
Run the AIR/SMD comparison
The standard test is the adverse-impact ratio (AIR) for binary decisions, with the four-fifths rule as a tripwire and not a safe harbor, and the standardized mean difference (SMD) for continuous outcomes like APR. We compute these per protected class against the control class with confidence intervals. A point estimate without a confidence interval is not a fair-lending test.
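As a sketch of the two statistics with intervals attached, the code below computes the AIR with a log-scale confidence interval and the SMD with a common large-sample approximation to its standard error. The function names, the counts, and the APR values are illustrative assumptions; the interval methods are one reasonable choice among several, and small samples may call for a bootstrap instead.

```python
import math
from statistics import mean, variance

Z_95 = 1.959964  # two-sided 95% normal quantile


def adverse_impact_ratio(approved_prot, n_prot, approved_ctrl, n_ctrl, z=Z_95):
    """AIR with a confidence interval computed on the log scale.

    Uses the usual large-sample standard error for the log of a ratio of
    two proportions; it needs at least one approval in each group.
    """
    p_prot = approved_prot / n_prot
    p_ctrl = approved_ctrl / n_ctrl
    air = p_prot / p_ctrl
    se_log = math.sqrt((1 - p_prot) / approved_prot + (1 - p_ctrl) / approved_ctrl)
    low = math.exp(math.log(air) - z * se_log)
    high = math.exp(math.log(air) + z * se_log)
    return air, low, high


def standardized_mean_difference(prot_values, ctrl_values, z=Z_95):
    """SMD (protected minus control, pooled SD) with an approximate interval."""
    n1, n2 = len(prot_values), len(ctrl_values)
    pooled_var = (
        (n1 - 1) * variance(prot_values) + (n2 - 1) * variance(ctrl_values)
    ) / (n1 + n2 - 2)
    d = (mean(prot_values) - mean(ctrl_values)) / math.sqrt(pooled_var)
    se = math.sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se


# Illustrative counts and APRs, not real portfolio data.
air, air_low, air_high = adverse_impact_ratio(300, 500, 800, 1000)
tripwire_hit = air < 0.8  # four-fifths rule used as a tripwire only

smd, smd_low, smd_high = standardized_mean_difference(
    [6.9, 7.1, 7.4, 7.0, 7.3],  # protected-class APRs
    [6.5, 6.8, 6.6, 7.0, 6.7],  # control-class APRs
)
```

Reporting the interval alongside the point estimate lets a reviewer see whether a ratio below 0.8 is distinguishable from sampling noise.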
Run the "controlled for legitimate factors" version
Disparate-impact analysis is not just the raw difference. The bank can defend a difference if it is explained by legitimate, non-discriminatory factors. Re-run the comparison with controls for the underwriting variables the bank uses in its policy (DTI, credit score, LTV). The residual difference is the unexplained disparity. That is the number that matters.
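To show how controls change the picture, here is a deliberately simple stand-in for a regression: a stratified comparison that measures the approval-rate gap within bands of a legitimate factor and averages those gaps. The function names, the two score bands, and the counts are hypothetical; a production analysis would more likely fit a regression with the full set of policy variables.

```python
from collections import defaultdict


def raw_approval_gap(records):
    """Control approval rate minus protected approval rate, no controls."""
    prot = [approved for _, is_prot, approved in records if is_prot]
    ctrl = [approved for _, is_prot, approved in records if not is_prot]
    return sum(ctrl) / len(ctrl) - sum(prot) / len(prot)


def stratified_approval_gap(records):
    """Weighted average of within-stratum gaps (control minus protected).

    Each record is (stratum, is_protected, approved). Strata where either
    group is absent are skipped because no within-stratum comparison exists.
    """
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})  # [approved, total]
    for stratum, is_protected, approved in records:
        cell = counts[stratum][is_protected]
        cell[0] += int(approved)
        cell[1] += 1

    weighted_gap, total_weight = 0.0, 0
    for groups in counts.values():
        (a_p, n_p), (a_c, n_c) = groups[True], groups[False]
        if n_p == 0 or n_c == 0:
            continue
        weight = n_p  # weight each stratum by its protected-class count
        weighted_gap += weight * (a_c / n_c - a_p / n_p)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no stratum contains both groups")
    return weighted_gap / total_weight


def make_rows(stratum, is_protected, approved, total):
    """Build `total` records for one cell, the first `approved` of them approved."""
    return [(stratum, is_protected, i < approved) for i in range(total)]


# Illustrative data: a large raw gap that score band fully explains.
records = (
    make_rows("high_score", True, 9, 10)
    + make_rows("high_score", False, 81, 90)
    + make_rows("low_score", True, 36, 90)
    + make_rows("low_score", False, 4, 10)
)
raw_gap = raw_approval_gap(records)
controlled_gap = stratified_approval_gap(records)
```

In this constructed example the raw gap is 40 percentage points and the within-band gap is zero; real portfolios rarely resolve that cleanly, and whatever remains after controls is the unexplained disparity.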
Search for less discriminatory alternatives
If the unexplained disparity is statistically significant, the bank has to search for a less discriminatory alternative model. This is the step the 2024 CFPB letter emphasized. We run a systematic search across feature drops, monotonic constraints, and reweighing techniques, and we report the alternatives we tested and the trade-off on predictive performance. The search itself is the defense; the absence of the search is the violation.
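One way to keep the search auditable is to log every alternative with its trade-off, including the ones rejected. The sketch below assumes each candidate has already been trained and scored on the same holdout set; the `Candidate` fields, the candidate names, the metric values, and the 0.01 AUC tolerance are hypothetical policy choices, not a regulatory standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str   # hypothetical label for the model variant
    auc: float  # predictive performance on a holdout set
    air: float  # adverse-impact ratio on the same holdout set


def lda_search_log(baseline, candidates, max_auc_loss=0.01):
    """Return one log row per candidate plus the viable alternatives.

    A candidate is treated as viable when its AIR is higher than the
    baseline's (assuming the baseline AIR is below 1) and its AUC loss
    stays within max_auc_loss. Both criteria are illustrative.
    """
    log, viable = [], []
    for cand in candidates:
        auc_loss = baseline.auc - cand.auc
        air_gain = cand.air - baseline.air
        is_viable = air_gain > 0 and auc_loss <= max_auc_loss
        log.append({
            "name": cand.name,
            "auc_loss": round(auc_loss, 4),
            "air_gain": round(air_gain, 4),
            "viable": is_viable,
        })
        if is_viable:
            viable.append(cand)
    viable.sort(key=lambda c: c.air, reverse=True)  # least disparity first
    return log, viable


# Illustrative metrics only.
production = Candidate("production", auc=0.780, air=0.74)
alternatives = [
    Candidate("drop_zip_interaction", auc=0.774, air=0.82),
    Candidate("monotonic_constraints", auc=0.779, air=0.78),
    Candidate("reweighing", auc=0.750, air=0.90),
]
search_log, viable = lda_search_log(production, alternatives)
```

The log keeps the rejected reweighing candidate with its measured trade-off, which is the record that shows the search was actually run.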
The Proxy Feature Problem
The technically predictive features that get banks into trouble are familiar by now. Zip code interactions. Device type. Time-of-day patterns. Transaction merchant categories. Each of these can correlate strongly with a protected class without being a "race feature" on its face. An AI model that uses them, even indirectly through a learned representation, can produce a disparate impact that the bank cannot defend.
The testing for proxies is more demanding than the outcomes test. We do three things:
- Feature contribution analysis on protected-class subgroups. If a feature contributes differently to decisions for a protected class than for the control, it is a candidate proxy.
- Feature ablation against the protected-class signal. Remove the suspected proxy and retrain. If the remaining feature set's ability to predict the protected class drops materially while predictive performance on the credit outcome holds, the feature was a proxy.
- Documented business justification per feature. Every feature in the production model has a one-paragraph justification signed by fair-lending counsel.
A bank that cannot produce this file for every feature has not run the test.
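Before the heavier ablation work, a quick single-feature screen can shortlist proxy candidates. The sketch below is that screen, not the ablation test itself: it scores how well each feature alone separates the protected class from the control using a rank-based AUC. The function names, the feature names, the sample values, and the 0.15 threshold are assumptions for illustration.

```python
def feature_class_auc(values, is_protected):
    """AUC of one feature as a predictor of protected-class membership.

    Equivalent to the Mann-Whitney U statistic scaled to [0, 1], with ties
    counted as half. 0.5 means no separation; values near 0 or 1 mean the
    feature alone separates the groups strongly.
    """
    prot = [v for v, p in zip(values, is_protected) if p]
    ctrl = [v for v, p in zip(values, is_protected) if not p]
    if not prot or not ctrl:
        raise ValueError("both groups must be present")
    wins = sum((p > c) + 0.5 * (p == c) for p in prot for c in ctrl)
    return wins / (len(prot) * len(ctrl))


def proxy_candidates(features, is_protected, threshold=0.15):
    """Flag features whose AUC is more than `threshold` away from 0.5."""
    flagged = {}
    for name, values in features.items():
        auc = feature_class_auc(values, is_protected)
        if abs(auc - 0.5) > threshold:
            flagged[name] = auc
    return flagged


# Tiny illustrative sample; real screens need far more records.
is_protected = [True, True, True, False, False, False]
features = {
    "device_type_code": [1, 1, 0, 0, 0, 0],
    "dti": [0.30, 0.42, 0.35, 0.33, 0.40, 0.36],
}
flagged = proxy_candidates(features, is_protected)
```

The pairwise comparison is quadratic in sample size, so a production screen would use a rank-based implementation. A single-feature screen also misses proxies that only emerge from feature interactions, which is why the ablation step is still needed.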
Where AI Agents in Marketing Get Banks in Trouble
The disparate-impact framework applies to credit-marketing decisions too. The patterns we see fail:
- A marketing model that pre-qualifies based on inferred home value and inadvertently steers HELOC offers away from minority neighborhoods at the same income level
- A voice agent that asks more verification questions of a borrower with an accent and lengthens the call, creating a real difference in experience
- A chat agent that offers different "next steps" to similarly situated borrowers based on language preference, which becomes a steering issue
- An outreach optimizer that picks call times based on response rate and ends up calling protected-class borrowers at less favorable times
Each of these has shown up in fair-lending exams. Each is testable with the methodology above before it ships.
Servicing and Loss Mitigation Fair-Lending Risk
Reg X and the CFPB's servicing exam priorities give a clear signal: loss mitigation outcomes are a fair-lending exam item. An AI agent that handles hardship intake, offers options, or routes cases can create disparate impact in approval rates, time-to-resolution, and the mix of options offered. The test uses the same disparate-impact methodology, with the loss-mitigation outcome in place of the origination decision. We run this test quarterly on every servicing AI we deploy.
The Audit Pack Examiners Ask For
When fair lending comes up in an AI exam, the file we produce includes:
- The list of AI models in use that affect a consumer credit outcome, with model owner, validator, and version
- The protected-class inference methodology and the imputation quality report
- The disparate-impact test for each model, with confidence intervals, controls, and the date of the last refresh
- The less-discriminatory-alternative search log, with alternatives considered, the metric trade-offs, and the rationale for the production choice
- The feature-level business justifications, signed by the fair-lending function
- The continuous-monitoring dashboard with the disparity metrics and trigger thresholds
- The remediation log for any past finding, including the corrective actions and the verification testing
The file is large the first time and shrinks once the controls are in place. The point is that every test produces an artifact and every artifact has a person whose name is attached.
Tuning the Production Model Without Breaking Fair Lending
Most fair-lending problems we get called in on were created by a well-intentioned tuning round. A team noticed that a segment's approval rate was too high for its default rate and tightened a feature. The tightening dropped approvals for a protected class disproportionately, and no one ran the disparate-impact test on the post-tuning model. Three months later it surfaces in a quarterly review.
The fix is process. No production model change ships without the disparate-impact test, the less-discriminatory-alternative review, and the fair-lending sign-off. The change-control gate is the same gate we use for SR 11-7. The two functions read the same file.
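A change-control gate of this kind can be as small as a function that refuses to pass until all three artifacts are recorded. The check names and the shape of the change record below are hypothetical; the point of the sketch is that anything short of an explicit pass blocks the release.

```python
REQUIRED_CHECKS = (
    "disparate_impact_test",
    "lda_review",
    "fair_lending_signoff",
)


def release_blockers(change_record):
    """Return the required checks that are not recorded as passed.

    A check counts as passed only when its value is exactly True, so a
    missing key, None, or a status like "pending" all block the release.
    """
    return [
        check for check in REQUIRED_CHECKS
        if change_record.get(check) is not True
    ]


# Hypothetical change record for a post-tuning model version.
change = {"disparate_impact_test": True, "lda_review": "pending"}
blockers = release_blockers(change)
can_ship = not blockers
```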
The Vendor Question
If the AI agent is a vendor's, the bank still owns the fair-lending file. The diligence questions we tell banks to ask:
- Show me the disparate-impact testing methodology you ran on the production model
- Show me the less-discriminatory-alternative search log
- What features are in the production model and what is the business justification for each
- Who at your firm signs the fair-lending compliance attestation
- What is the cadence for re-testing as the model is retrained or as our portfolio shifts
If the vendor cannot produce these, the bank either runs the testing itself or finds a vendor that can. There is no third path that survives an exam.
A Practical Cadence
For a bank deploying AI agents that touch any credit-relevant decision, a reasonable testing cadence is monthly disparate-impact monitoring against a rolling 90-day window, quarterly less-discriminatory-alternative review, semi-annual feature-level business justification refresh, and an annual independent fair-lending audit. The work compounds; by year two, most of it is automated and the compliance team reviews exceptions rather than running the analysis from scratch.
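The monthly monitoring step can be sketched as an AIR computed over a rolling window of decisions. The record layout, the function name `rolling_air`, the dates, and the 0.8 alert threshold are assumptions for illustration; a production monitor would also attach a confidence interval and a minimum sample size before alerting.

```python
from datetime import date, timedelta


def rolling_air(decisions, as_of, window_days=90):
    """AIR over decisions dated in the `window_days` ending on `as_of`.

    Each decision is (decision_date, is_protected, approved). Returns None
    when either group is empty or the control group has no approvals,
    because the ratio is undefined in those cases.
    """
    start = as_of - timedelta(days=window_days)
    window = [d for d in decisions if start < d[0] <= as_of]
    prot = [approved for _, is_prot, approved in window if is_prot]
    ctrl = [approved for _, is_prot, approved in window if not is_prot]
    if not prot or not ctrl or sum(ctrl) == 0:
        return None
    return (sum(prot) / len(prot)) / (sum(ctrl) / len(ctrl))


# Illustrative decisions; the January record falls outside the window.
decisions = [
    (date(2025, 1, 15), True, True),
    (date(2025, 5, 10), True, True),
    (date(2025, 6, 1), True, False),
    (date(2025, 5, 20), False, True),
    (date(2025, 6, 25), False, True),
]
current_air = rolling_air(decisions, as_of=date(2025, 6, 30))
alert = current_air is not None and current_air < 0.8
```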
The first quarter is the hard one. Set the methodology, build the artifacts, write the justifications. The first audit cycle is what the program is judged on.
Pranay Shetty
CEO & Co-Founder