Why generic AI fails on institutional real estate

Take any general-purpose AI assistant, hand it a commercial lease, and ask it for the current base rent and the next escalation date. It will give you an answer in seconds. It will sound certain. And on a real lease, one with three amendments, a side letter, and a renewal option that reset the schedule, it will frequently be wrong. The problem is not that the model is weak. The problem is that institutional real estate is a domain with rules the model was never taught, and those rules are exactly where the money is.

This matters because the failure is quiet. A generic model does not say "I am not sure." It produces a clean, plausible number, and a plausible number that is wrong is more dangerous than no number at all. To understand why horizontal, general-purpose AI keeps failing in this industry, you have to look at the specific places it breaks.

Failure one: it does not understand document hierarchy

Real estate documents are not flat. A lease is a stack. There is the original lease, then the first amendment, the second amendment, an assignment, a side letter, an estoppel, maybe a subordination agreement. Each document can modify, override, or quietly waive a term in the one before it. The whole discipline of lease abstraction is knowing which version of each clause actually governs today.

A horizontal model treats these as a pile of text. Ask it for the base rent and it may pull the figure from the original lease, the document that is longest and most "lease-like", and miss that the second amendment reset the schedule two years ago. Ask it whether a co-tenancy provision is in effect and it may not register that a side letter waived it. The model can read every word and still get the answer wrong, because reading is not the same as understanding precedence. There is no concept in a general model that says an amendment supersedes the original on any term it touches. That rule has to be built in.

A lease is not a document. It is a stack of documents that argue with each other, and the answer is whichever one wins. Generic AI does not know there is an argument.

Document hierarchy: which version of a clause actually governs today

Base: Original lease. The longest, most "lease-like" document, and the one generic AI tends to read.
Over: Second amendment. Reset the rent schedule two years ago. Supersedes the original on any term it touches.
Wins: Side letter. Quietly waived the co-tenancy provision. The clause that actually governs today.

An amendment supersedes the original on any term it touches, and a side letter can beat both. That rule has to be built in.
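The supersedes rule can be sketched in a few lines. This is a minimal illustration, not a description of any particular product. It assumes that a later-dated document wins on every term it touches, which is a simplification, since real precedence depends on the contract language. The class and field names, the dates, and the 36.00 original rent are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LeaseDocument:
    name: str
    effective: date
    # Only the terms this document actually touches (hypothetical field).
    terms: dict = field(default_factory=dict)


def governing_terms(stack):
    """Return {term: (value, source document)} for the terms in force today.

    Assumption: documents are applied oldest-first, so a later document
    overrides an earlier one on any term it touches and nothing else.
    """
    governing = {}
    for doc in sorted(stack, key=lambda d: d.effective):
        for term, value in doc.terms.items():
            governing[term] = (value, doc.name)
    return governing


# Hypothetical stack, deliberately supplied out of order.
stack = [
    LeaseDocument("Second amendment", date(2023, 3, 1), {"base_rent": 42.50}),
    LeaseDocument(
        "Original lease",
        date(2018, 1, 1),
        {"base_rent": 36.00, "co_tenancy": "in effect"},
    ),
    LeaseDocument("Side letter", date(2024, 6, 1), {"co_tenancy": "waived"}),
]

result = governing_terms(stack)
for term, (value, source) in result.items():
    print(f"{term}: {value} (set by {source})")
```

Each governing value travels with the name of the document that set it, so the answer to "what is the base rent" always comes with "and which document says so".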

Failure two: it cannot reconcile conflicting sources

Now widen the lens from one lease to one asset. The same fact (say, in-place rent for a given suite) appears in the lease, in the rent roll, in the Argus model, and in the Yardi export. In the real world these four sources disagree more often than anyone likes to admit. A renewal got papered but never entered. A free-rent period was modeled but not reflected in the GL. The rent roll was current as of last month and the lease was amended since.

An analyst's actual job, much of the time, is reconciliation: finding the conflict, deciding which source is authoritative, and explaining the variance. A generic model does the opposite. Handed four numbers, it will smooth them into one confident answer and tell you nothing about the disagreement it just papered over. It has no notion of source authority: that the executed lease beats the rent roll, which in turn beats the stale model. It cannot show its work because it never did the work. It guessed.
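Reconciliation with an explicit authority order might look like the sketch below. It is illustrative only. The ranking follows the order stated above (executed lease, then rent roll, then model); where a Yardi export would sit is not stated, so it is left out. The source labels, the rent figures, and the tolerance are hypothetical.

```python
from dataclasses import dataclass

# Assumed authority order, most authoritative first.
AUTHORITY = ["executed_lease", "rent_roll", "argus_model"]


@dataclass(frozen=True)
class Observation:
    source: str
    value: float


def reconcile(observations, tolerance=0.005):
    """Pick the most authoritative value and report every disagreement.

    Returns (winner, variances), where variances lists each lower-ranked
    source whose value differs from the winner by more than `tolerance`.
    """
    ranked = sorted(observations, key=lambda o: AUTHORITY.index(o.source))
    winner = ranked[0]
    variances = [
        (o.source, round(o.value - winner.value, 2))
        for o in ranked[1:]
        if abs(o.value - winner.value) > tolerance
    ]
    return winner, variances


# Hypothetical in-place rent for one suite, as seen by three sources.
observations = [
    Observation("rent_roll", 40.00),
    Observation("argus_model", 38.50),
    Observation("executed_lease", 42.50),
]

winner, variances = reconcile(observations)
print(f"Governing value: {winner.value} from {winner.source}")
for source, delta in variances:
    print(f"Variance: {source} differs by {delta}")
```

The point of the design is that the disagreement is part of the output. Nothing is averaged or silently dropped.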

Failure three: it hallucinates, and it hallucinates with conviction

Everyone in finance has now heard the word "hallucination," but the framing undersells the risk in this context. The danger is not that the model occasionally makes things up. It is that it makes things up in exactly the same confident, fluent register it uses when it is right. Ask for a covenant calculation and it may invent a debt-service-coverage threshold that is not in the loan agreement. Ask it to summarize an offering memorandum and it may attribute an occupancy figure to the wrong asset. The output reads like a competent analyst's. There is no tell.

In a domain where a single wrong number can flow into an investment committee memo and then into a bid, "usually right, occasionally and invisibly wrong" is not a productivity tool. It is a liability you have to check by hand, which means you have saved nothing.

Failure four: it cannot be audited

This is the one that ends most enterprise pilots. EY and others have written about the black-box problem in AI: when a system produces an output that materially affects a decision, regulated institutions need to know why. Where did this number come from? Which clause, which page, which document? A general model cannot answer that. It produces a number with no chain of custody. You cannot trace it, you cannot defend it to an auditor or an investment committee, and you cannot reproduce it: ask the same question twice and you may get two different answers.

For an institutional firm, an unauditable answer is not a partial win. It is a non-starter. The entire culture of the business is built on being able to show your work. A tool that cannot is a tool that never leaves the sandbox.

Failure five: it does not live where the work lives

Finally, even a model that somehow cleared all four bars above would still be useless if it sat in a separate chat window, disconnected from Argus, Yardi, MRI, the data room, and the firm's own files. Real estate analysis is not a trivia exercise; it is an exercise in operating over a specific firm's specific data. A generic assistant with no integration is a clever stranger you have to spoon-feed every document by hand. The integration is not a nice-to-have. It is most of the value.

The pattern

Every one of these failures has the same root cause. A horizontal model is general by design: it knows a little about everything and nothing about the specific rules of real estate. The fix is not a bigger general model. It is an architecture that encodes the domain's rules, sources, and standards of proof. That is a different kind of system.

What actually works

The systems that survive contact with real institutional work share four properties. None of them is "a better chatbot." Together they describe a domain-native architecture.

Generic AI vs Built AI: same lease, two ways of understanding it

Generic AI

Treats a lease stack as a flat pile of text
Smooths conflicting sources into one confident number
Hallucinates in the same fluent register it uses when right
Produces a number with no chain of custody
Sits in a chat window, disconnected from the firm's data

Built AI

Models hierarchy in a real estate knowledge graph
Knows the executed lease beats the rent roll beats the stale model
Runs the math through a deterministic engine
Cites every figure to document, page, and clause
Reads Argus, Yardi, MRI, and the data room directly

A real estate knowledge graph

Instead of treating documents as flat text, a knowledge graph models the actual objects of the business and the relationships between them: this amendment belongs to this lease, which governs this suite, occupied by this tenant, whose exposure rolls up through this entity to this fund. Document hierarchy stops being something the model has to guess at and becomes something the structure enforces. The amendment overrides the original because the graph is built to know that an amendment overrides the original. Reconciliation becomes possible because the graph knows which source is authoritative for which fact.
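As a rough sketch of the idea, the chain of relationships described above can be held as typed edges and walked from an amendment up to a fund. This is a toy structure rather than a real graph database, and every node identifier and relation name in it is hypothetical.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph with one target per (node, relation) pair."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {relation: target}

    def add(self, source, relation, target):
        self.edges[source][relation] = target

    def walk(self, start, relations):
        """Follow the named relations in order and return the nodes visited.

        Raises KeyError if a node lacks the requested relation, so a broken
        chain fails loudly instead of being guessed at.
        """
        node = start
        path = [node]
        for relation in relations:
            node = self.edges[node][relation]
            path.append(node)
        return path


# Hypothetical objects and relationships.
g = Graph()
g.add("amendment_2", "amends", "lease_100")
g.add("lease_100", "governs", "suite_400")
g.add("suite_400", "occupied_by", "tenant_acme")
g.add("tenant_acme", "rolls_up_to", "entity_holdco")
g.add("entity_holdco", "held_by", "fund_1")

path = g.walk(
    "amendment_2",
    ["amends", "governs", "occupied_by", "rolls_up_to", "held_by"],
)
print(" -> ".join(path))
```

Because the "amends" edge is explicit, the question of which lease an amendment modifies is answered by the structure, not inferred from the text.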

A deterministic engine for anything that is math

Covenant tests, waterfalls, escalation schedules, debt-service coverage, variance to budget: none of these should ever pass through a language model's probabilistic guess. They run through a deterministic engine. A covenant calculation is arithmetic against terms that exist in a document; the right answer is not a matter of opinion, and it must be the same every time you ask. The language model's job is to read the document and identify the inputs. The engine's job is to compute. Separating those two responsibilities is what kills the hallucinated covenant threshold.
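A debt-service-coverage test shows the split. In the sketch below the inputs are assumed to have been extracted from the documents already; the function only does arithmetic, using the standard ratio of net operating income to debt service. The dollar amounts and the 1.20x threshold are hypothetical, and rounding to two decimals is an assumption, not something a given loan agreement necessarily specifies.

```python
from decimal import Decimal


def dscr(noi: Decimal, debt_service: Decimal) -> Decimal:
    """Debt-service coverage ratio: NOI divided by debt service."""
    if debt_service <= 0:
        raise ValueError("debt service must be positive")
    # Assumption: report the ratio to two decimal places.
    return (noi / debt_service).quantize(Decimal("0.01"))


def covenant_test(noi: Decimal, debt_service: Decimal, threshold: Decimal) -> dict:
    """Compare the computed ratio with the threshold taken from the loan agreement."""
    ratio = dscr(noi, debt_service)
    return {"dscr": ratio, "threshold": threshold, "passes": ratio >= threshold}


# Hypothetical inputs, as if extracted upstream with citations attached.
result = covenant_test(
    noi=Decimal("1500000"),
    debt_service=Decimal("1200000"),
    threshold=Decimal("1.20"),
)
print(result)
```

Using `Decimal` instead of floats keeps the result exact and identical on every run, which is the property the covenant test needs.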

A citation on every cell

Every figure the system produces points back to the exact source it came from: document, page, clause. If the base rent is $42.50, you can click it and land on the line of the second amendment that set it. If a number cannot be cited, it does not get shown. This is what turns an unauditable black box into something an investment committee and an auditor will accept. It also makes the output self-checking: a reviewer is not asked to trust the number, they are handed the evidence and asked to confirm it.
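One way to enforce "no citation, no display" is to make the citation part of the figure itself and filter on it before anything is rendered. The sketch below assumes a simple document, page, and clause triple. The type names, the page number, the clause label, and the uncited occupancy figure are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    document: str
    page: int
    clause: str


@dataclass(frozen=True)
class Figure:
    label: str
    value: float
    citation: Optional[Citation] = None


def displayable(figures):
    """Keep only figures that carry a citation; uncited numbers are withheld."""
    return [f for f in figures if f.citation is not None]


def render(figure: Figure) -> str:
    """Format a cited figure together with its source reference."""
    c = figure.citation
    return f"{figure.label}: {figure.value:.2f} [{c.document}, p. {c.page}, {c.clause}]"


figures = [
    # Hypothetical page and clause for the cited base rent.
    Figure("Base rent", 42.50, Citation("Second amendment", 3, "Section 2(a)")),
    # No citation available, so this one must not reach the screen.
    Figure("Occupancy", 0.93),
]

shown = displayable(figures)
for figure in shown:
    print(render(figure))
```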

A human holding the pen

The right design does not remove the analyst; it changes what the analyst does. The system produces a first draft of the lease abstract, the variance analysis, or the IC memo section, with every claim sourced. A person reviews, corrects, and approves. Judgment stays human. What disappears is the hours of manual extraction and reconciliation that produced the draft. The model proposes; the professional disposes.

The goal is not an AI that replaces the analyst. It is an AI that does the analyst's typing, shows its receipts, and hands the analyst back their judgment.

What this looks like in practice

Consider three pieces of real work and how the two approaches diverge.

Lease abstraction. Generic AI reads the original lease and returns terms that may be two amendments out of date, with no source. The domain-native system walks the full stack, applies the amendment-supersedes rule, returns the term that governs today, and cites the clause that set it. One is a draft you cannot trust. The other is a draft you can check in seconds.
Covenant math. Generic AI invents or misreads a coverage threshold and computes confidently against it. The domain-native system extracts the threshold from the loan agreement with a citation, runs the calculation through a deterministic engine, and flags the result against the actual covenant, with the same answer every time.
An IC memo. Generic AI drafts fluent prose with figures you have to independently verify, some of which are subtly wrong. The domain-native system drafts the same prose with every figure linked to its source in the model, the rent roll, or the lease, so review is confirmation, not reconstruction.

In each case the difference is not eloquence. The generic model often writes more smoothly. The difference is whether the output is trustworthy and traceable: whether a senior person can put their name on it without redoing the work underneath.

The lesson is not that AI does not work for real estate. It is that generic AI does not, and the reasons are specific and fixable. The fixes (a real estate knowledge graph, a deterministic engine, citations on every cell, and a human in the loop) are exactly the architecture we have built Built AI around, sitting on top of Argus, Yardi, and the data room rather than asking you to abandon them. If you want to put it against a lease stack or a covenant you already know cold, book a walkthrough or see how the platform is built.
