Lease abstraction is a document-hierarchy problem, not a text problem

Ask a vendor to demonstrate lease abstraction and they will upload a clean, single-document office lease and point to the base rent, the term, and the expiry, and the extracted fields will line up beautifully. It looks solved. It is not solved, because the lease they showed you is not the lease your portfolio actually contains. Your portfolio contains a 2014 lease, a 2017 amendment that re-cut the premises, a 2019 amendment that reset the rent steps, a side letter that quietly granted a rent-free period the amendment never mentions, and a commencement letter that fixed the date everything keys off. The terms that matter live in the relationship between those documents, not in any one of them.

This is why lease abstraction is best understood as a document-hierarchy problem rather than a text-extraction problem. The hard part is not reading words off a page. Modern models read words off a page well. The hard part is knowing which words win when two documents disagree, and a lease portfolio disagrees with itself constantly.

What a lease actually is

Treat a single tenancy as a small legal record with a strict order of authority. At the base sits the original lease. On top of it sit amendments, each of which overrides specific clauses of the original and leaves the rest untouched. Alongside them sit side letters, which often carry the most commercially sensitive concessions precisely because they were negotiated quietly. Then come the operational confirmations: commencement letters that fix the start date, estoppel certificates that record the agreed state of the tenancy at a moment in time, and renewal or extension notices that change the term.

The order is not cosmetic. A later amendment beats the original on any clause it touches. A side letter may beat both. A commencement letter does not restate the rent but it fixes the date that every rent step, every break notice window, and every option deadline is calculated from. Get the hierarchy right and the abstract is correct. Get it wrong, even on a single document, and the abstract is worse than useless, because it is wrong while looking authoritative.
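To make the dependence on the commencement letter concrete, here is a minimal Python sketch. It assumes a break option expressed as a number of months after commencement and a notice period counted back from the break date. Both conventions, and every name and date below, are hypothetical; real leases define these windows in their own wording.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of short months."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def key_dates(commencement: date, break_month: int, notice_months: int,
              step_months: list[int]) -> dict:
    """Derive the dates that key off the commencement date (assumed conventions)."""
    break_date = add_months(commencement, break_month)
    return {
        "break_date": break_date,
        "notice_deadline": add_months(break_date, -notice_months),
        "rent_steps": [add_months(commencement, m) for m in step_months],
    }


# Hypothetical tenancy: the lease was signed on 15 January 2014, but the
# commencement letter fixed the start date at 1 March 2014.
from_signing_date = key_dates(date(2014, 1, 15), 60, 6, [12, 24])
from_commencement_letter = key_dates(date(2014, 3, 1), 60, 6, [12, 24])

print(from_signing_date["notice_deadline"])         # 2018-07-15
print(from_commencement_letter["notice_deadline"])  # 2018-09-01
```

No rent figure changed between the two runs. Only the anchor date moved, and every derived deadline moved with it.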

A lease abstract that reads the original and misses the amendment is not 90% right. On the clause the amendment changed, it is 100% wrong, and it is wrong with total confidence.

Figure: Order of authority. A tenancy is a stack with a strict precedence. Base: the original 2014 lease, which everything else modifies. Over it: the 2019 amendment, which reset the rent steps and overrides the original on any clause it touches. On top: the side letter, which quietly granted a rent-free period the amendment never mentions and may beat both.

Why generic extraction stalls at about 80%

A horizontal extraction tool, the kind that treats a document as a bag of text and pattern-matches for fields, will reach a respectable-sounding accuracy on a portfolio. Call it roughly 80%. The number sounds like a passing grade. It is the most dangerous number in the building.

The reason is what sits inside the missing 20%. It is not randomly distributed across trivial fields. It clusters exactly where the documents interact: the rent step that an amendment reset, the break date that a side letter moved, the expiry that a renewal notice extended, the free-rent period that lives only in a letter. These are the high-stakes terms. A tool that nails tenant name, square footage, and base rent but silently keeps the superseded rent schedule has produced an abstract that is mostly accurate and specifically catastrophic.

And the error does not stay contained. Lease data is an input, not an output. A single wrong expiry feeds the rollover schedule, the WALT calculation, the valuation, the covenant headroom, and the investor report. One corrupted abstract propagates into every downstream number that touches that tenancy. Eighty percent accuracy at the document level becomes a portfolio you cannot trust at the decision level, which is the only level that matters.
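A small worked example shows how far one stale expiry can move a downstream figure. The sketch below assumes WALT is weighted by annual rent (some firms weight by area instead) and uses an invented two-tenant portfolio; none of the numbers come from a real rent roll.

```python
from datetime import date


def walt_years(leases: list[tuple[float, date]], as_of: date) -> float:
    """Rent-weighted average remaining lease term, in years (assumed convention)."""
    total_rent = sum(rent for rent, _ in leases)
    weighted = sum(
        rent * max((expiry - as_of).days, 0) / 365.25
        for rent, expiry in leases
    )
    return weighted / total_rent


as_of = date(2024, 1, 1)

# Hypothetical abstract that missed a renewal notice for the larger tenant.
stale = [(1_000_000, date(2026, 1, 1)), (500_000, date(2029, 1, 1))]

# Same portfolio with the renewal applied: the larger tenant now expires in 2031.
corrected = [(1_000_000, date(2031, 1, 1)), (500_000, date(2029, 1, 1))]

print(round(walt_years(stale, as_of), 2))
print(round(walt_years(corrected, as_of), 2))
```

With these inputs the stale abstract gives a WALT of roughly 3.0 years and the corrected one roughly 6.3 years. One missed notice more than doubles the figure, and anything underwritten against it inherits the error.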

Where the missing 20% hides

Superseded rent steps an amendment replaced. Break dates a side letter moved. Free-rent periods that exist only in a letter. Options whose clock starts at a commencement date in a different document. Expiries extended by a renewal notice filed separately. None of these are exotic. They are the normal texture of an institutional lease, and they are exactly what flat extraction misses.

Figure: Flat text vs hierarchy-aware. The missing 20% is where the money lives. Flat extraction: ~80% accuracy. Hierarchy-aware: ~98%.

How you actually get it right

Solving this is not a matter of a better parser. It is a matter of representing the tenancy the way it really exists, as a structured set of related documents with an explicit order of precedence, and then resolving each term against that order. In practice that means four things working together.

Figure: The four-part solution. How a hierarchy-aware abstract is built: a knowledge graph, a citation behind every value, confidence scoring, and a human who confirms. Each term is resolved against the order of precedence, then routed to a human only where judgment is genuinely required.

A knowledge graph, not a flat record

The tenancy has to be modeled as a graph: the lease and every document that modifies it, linked, dated, and ranked. When the system is asked for the current rent, it does not read a field. It resolves the rent by walking the hierarchy, applying the latest amendment that touched it, and accounting for any side letter that overrides even that. The answer is computed from the structure, so it stays correct as documents are added.
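As an illustration only, the resolution step might be sketched like this in Python. The `Document` type, the `RANK` table, and the rule that a side letter always outranks an amendment are simplifying assumptions, not a description of any particular product. In practice, whether a side letter overrides an amendment depends on what the documents themselves say.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Optional

# Hypothetical precedence: a higher rank wins when documents touch the same term.
RANK = {"lease": 0, "amendment": 1, "side_letter": 2}


@dataclass
class Document:
    doc_id: str
    kind: str                      # one of the keys in RANK
    effective: date
    terms: dict[str, Any] = field(default_factory=dict)


def resolve(term: str, stack: list[Document]) -> Optional[tuple[Any, Document]]:
    """Return the governing value for a term and the document it came from."""
    touching = [doc for doc in stack if term in doc.terms]
    if not touching:
        return None
    # Highest rank wins; within a rank, the most recent document wins.
    winner = max(touching, key=lambda doc: (RANK[doc.kind], doc.effective))
    return winner.terms[term], winner


# Invented values for the stack described in this article.
stack = [
    Document("lease-2014", "lease", date(2014, 1, 15),
             {"base_rent": 100_000, "premises": "Floor 3", "free_rent_months": 0}),
    Document("amend-2017", "amendment", date(2017, 6, 1), {"premises": "Floors 3-4"}),
    Document("amend-2019", "amendment", date(2019, 6, 1), {"base_rent": 120_000}),
    Document("side-letter", "side_letter", date(2019, 6, 1), {"free_rent_months": 3}),
]

for term in ("base_rent", "premises", "free_rent_months"):
    value, source = resolve(term, stack)
    print(f"{term}: {value} (from {source.doc_id})")
```

Because the answer is computed on each call, appending a later amendment to `stack` changes the result without anyone editing a stored field.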

A citation behind every term

Every abstracted value has to point back to the exact document, page, and clause it was resolved from. This is the difference between an abstract you can defend and a spreadsheet you have to re-check by hand. When the current rent traces to clause 4.2 of the second amendment rather than the original, a reviewer can confirm the precedence logic in seconds instead of re-reading the whole file.
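One way to picture this is a value that cannot exist without its source. The sketch below is a hypothetical record shape, with invented field names, page number, and rent figure. It only illustrates the idea that document, page, and clause travel with the number.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Citation:
    document: str
    page: int
    clause: str


@dataclass(frozen=True)
class CitedValue:
    term: str
    value: Any
    source: Citation   # required: a value with no source cannot be constructed

    def describe(self) -> str:
        s = self.source
        return f"{self.term} = {self.value} (from {s.document}, p. {s.page}, clause {s.clause})"


# Invented example matching the clause mentioned above.
current_rent = CitedValue(
    term="current_rent",
    value=120_000,
    source=Citation(document="Second Amendment", page=3, clause="4.2"),
)

print(current_rent.describe())
```

Making `source` a required field is the design choice that matters here: a reviewer never has to ask where a number came from, because an unsourced number is not representable.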

Confidence scoring that flags the hard cases

The system has to know when it is unsure and say so. A clean term resolved from a single unambiguous clause is high confidence and can pass. A term where two documents appear to conflict, or where a side letter is referenced but not in the file, is low confidence and must be surfaced. The point of scoring is to route human attention to the 20% that needs it instead of spreading it evenly across the 100% that does not.
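A rough sketch of that routing logic follows. The penalty of 0.5 per issue and the review threshold are arbitrary placeholders, and the inputs are assumed to be the candidate values that remain after precedence has been applied. A real scoring model would be calibrated against reviewed abstracts.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Candidate:
    document: str
    value: Any          # assumed hashable (number, string, date)


@dataclass
class Assessment:
    confidence: float
    reasons: list[str]
    needs_review: bool


def assess(candidates: list[Candidate], missing_references: list[str],
           threshold: float = 0.8) -> Assessment:
    """Score a resolved term and explain why it is being flagged, if it is."""
    reasons = []
    if not candidates:
        reasons.append("no source clause found")
    if len({c.value for c in candidates}) > 1:
        reasons.append("documents of equal precedence conflict")
    for ref in missing_references:
        reasons.append(f"referenced but not in file: {ref}")
    # Placeholder scoring: each issue costs 0.5, floored at zero.
    confidence = max(0.0, 1.0 - 0.5 * len(reasons))
    return Assessment(confidence, reasons, confidence < threshold)


clean = assess([Candidate("amend-2019", 120_000)], [])
conflict = assess([Candidate("amend-2019", 120_000), Candidate("amend-2019b", 125_000)], [])
missing = assess([Candidate("lease-2014", 0)], ["side letter dated 2019"])

for name, result in (("clean", clean), ("conflict", conflict), ("missing", missing)):
    print(name, result.confidence, result.needs_review, result.reasons)
```

Returning the reasons alongside the score matters more than the score itself, since the reviewer needs to know what to look at, not just that something is uncertain.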

A human who confirms, not re-keys

The professional stays in the loop, but their job changes. Instead of abstracting from scratch, they confirm a drafted, sourced, hierarchy-resolved abstract, spending their judgment on the flagged conflicts where judgment is actually required. The rote 80% is handled; the human owns the consequential 20% and signs off on the whole. That is what makes the output trustworthy enough to feed a model.

The goal is not to remove the human from lease abstraction. It is to point the human at the five terms that are genuinely ambiguous instead of the five hundred that are not.

What accurate lease data unlocks

Get abstraction right and a set of decisions that were previously guesswork become tractable. The reason firms tolerate slow, manual abstraction is that they have learned not to trust the fast kind. Reverse that, and the lease layer turns into a live foundation rather than a liability.

Break-option planning. A correctly resolved break date, with the right notice window calculated from the right commencement letter, tells you which tenants can walk and when, early enough to act instead of react.
Rollover and WALT. Expiries that reflect every renewal and extension produce a rollover schedule and a weighted average lease term you can actually underwrite against, rather than a number quietly corrupted by stale dates.
Covenant inputs. Accurate in-place rent and term feed the income figures behind DSCR and other covenant tests. Wrong lease data does not just mislead asset management; it misstates the inputs your lender is watching.
Diligence at speed. When the abstract is sourced and hierarchy-aware, acquisitions can trust a portfolio's rent roll without re-abstracting every lease by hand, compressing diligence from weeks to days without lowering the bar.

Lease abstraction looks like a reading task and is actually a precedence task. The original, the amendments, the side letters, and the commencement letters form a hierarchy, and the whole game is knowing which one wins. Built AI treats a tenancy as what it is, a graph of related documents resolved on a deterministic engine, with a citation behind every term, confidence scoring that flags the conflicts, and a human who confirms rather than re-keys. It reads the leases where they already live, in your data room and your systems of record, and hands back an abstract you can put in front of an IC. To see it resolve a real stack of amendments and side letters on your own portfolio, book a walkthrough or read how the knowledge graph is built.