Courthouse records are one of those “everybody knows they’re important” data sources that still feel stuck in the past.
The information is there: deeds, leases, assignments, liens, probate filings. But it’s spread across thousands of counties,
stored as scans or PDFs, and indexed inconsistently.
If you’ve ever helped build a runsheet or reconstruct a chain of title, you already know the pain: pulling documents,
retyping names, hunting legal descriptions, cross-referencing book/page or instrument numbers, and trying to make sure
you didn’t miss the one document that matters.
We’ve been working on something we think is genuinely neat: a system that takes messy courthouse documents and turns them
into structured, searchable data that can automatically assemble runsheets and chain-of-title, not by magic,
but by applying a handful of practical, well-understood techniques in a clean pipeline.
The core problem: courthouse data exists, but it’s not “data”
Courthouse records weren’t designed to be queried like a database. They’re more like a giant box of scanned paperwork:
- A deed might be a crisp PDF in one county and a low-resolution scan in another.
- Legal descriptions appear in different formats (PLSS, lots/blocks, metes & bounds), and instruments often include multiple tracts.
- County indexes can be incomplete; some capture only the first tract or two on a multi-tract instrument.
So when someone says “we have courthouse data,” the real question is:
Is it searchable in a way that reliably finds everything you need?
What we’re building: a courthouse-to-runsheet pipeline
At a high level, our system does five things:
- Collect documents and index metadata
- Read the documents (text extraction + OCR)
- Understand what they are (classification)
- Pull out the important pieces (parties/roles, legal descriptions, key terms)
- Link everything into a chain (runsheet + chain-of-title graph)
Step 1: Ingest courthouse instruments (PDFs, scans, and metadata)
We start by capturing the basic “what is this document?” information:
- county
- recording date
- instrument number (or book/page)
- document PDF/images
- any clerk index fields that exist
This is mostly plumbing, but it matters later because we want traceability:
every extracted fact should link back to the source document/page.
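As a sketch of what that traceability looks like, here is a minimal record we might capture per instrument at ingest. The class and field names are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class InstrumentRecord:
    """Ingest-time metadata for one instrument. Field names are illustrative."""
    county: str
    instrument_no: str                 # or a book/page identifier
    recording_date: str                # as reported by the clerk index
    index_fields: dict = field(default_factory=dict)  # raw clerk index data
    source_pages: list = field(default_factory=list)  # (file, page) provenance

    def cite(self, file_name: str, page: int) -> None:
        """Attach the source page an extracted fact came from."""
        self.source_pages.append((file_name, page))

rec = InstrumentRecord("Example County", "2021-000123", "2021-03-15")
rec.cite("2021-000123.pdf", 1)
```

Every downstream extraction step appends to `source_pages`, so any fact in the final runsheet can be traced back to a specific page of a specific document.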
Step 2: Turn scans into text you can work with
Some courthouse PDFs contain selectable text. Many are just images. So we do both:
- Text extraction when text exists
- OCR when it doesn’t
This step doesn’t need to be perfect; it needs to be good enough to support the steps that follow, and to preserve
where the text came from.
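The routing decision between the two paths can be as simple as a character-count heuristic. A minimal sketch, where the threshold is an illustrative assumption rather than a standard:

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Heuristic router: if the extraction layer returned almost no
    selectable text for a page, treat it as an image and send it to OCR.
    The 50-character threshold is an assumption you would tune per corpus."""
    return len("".join(page_text.split())) < min_chars
```

Pages that fail the check go to the OCR engine; pages that pass keep their native text layer, which is usually more accurate.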
Step 3: Classify document type (deed, lease, assignment, lien, probate…)
Before you can build a reliable chain, you have to know what you’re looking at. We assign each instrument a type like:
- mineral deed / warranty deed / quitclaim deed
- oil & gas lease
- assignment
- lien / UCC / judgment
- probate / affidavit of heirship / estate-related filings
- releases and amendments
In practice this is a mix of document templates, keywords, and standard ML/NLP approaches. The goal isn’t hype; it’s accuracy:
don’t treat every document the same.
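The keyword baseline looks roughly like this. The type names and keyword lists here are illustrative examples, not our production rules, and a real system layers ML on top:

```python
# Ordered rules: more specific types are checked before generic ones.
DOC_TYPE_RULES = [
    ("oil_gas_lease",  ["oil and gas lease", "lessor", "lessee"]),
    ("assignment",     ["assignment", "assignor", "assignee"]),
    ("mineral_deed",   ["mineral deed"]),
    ("warranty_deed",  ["warranty deed"]),
    ("quitclaim_deed", ["quitclaim"]),
]

def classify(text: str) -> str:
    """Return the first document type whose keywords all appear in the text."""
    lowered = text.lower()
    for doc_type, keywords in DOC_TYPE_RULES:
        if all(kw in lowered for kw in keywords):
            return doc_type
    return "unknown"
```

Rule order matters: an assignment of an oil & gas lease mentions both, so the more specific signals are checked first.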
Step 4: Extract parties and roles (grantor/grantee, lessor/lessee)
Runsheets are structured answers to: who did what to whom, when, and for what tract(s)? So we extract:
- people/companies (entities)
- roles (grantor/grantee, lessor/lessee, assignor/assignee)
- relationships (A conveyed/leased/assigned to B)
This is generally entity extraction plus role labeling, supported by patterns (e.g., “grantor”, “grantee”), layout cues,
and sanity checks to handle formatting variants.
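As a sketch of the pattern-based layer, here is a simplified regex that captures a name appearing immediately before a role label. Real documents need layout cues and many more formatting variants; this handles only the cleanest shape:

```python
import re

# Matches e.g. "John A. Smith, Grantor" or "Acme Minerals LLC, as Grantee".
ROLE_PATTERN = re.compile(
    r"(?P<name>[A-Z][\w.&'\- ]+?),?\s+(?:as\s+)?"
    r"(?P<role>Grantors?|Grantees?|Lessors?|Lessees?|Assignors?|Assignees?)\b"
)

def extract_parties(text: str):
    """Return (name, normalized_role) pairs found in the text."""
    return [(m.group("name").strip(), m.group("role").rstrip("s").lower())
            for m in ROLE_PATTERN.finditer(text)]
```

Sanity checks then run over the output, e.g. a deed should yield at least one grantor and one grantee, and a lease should not.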
Step 5: Parse legal descriptions into structured data
Legal descriptions show up as PLSS (section/township/range), lots/blocks, metes & bounds, and hybrids, sometimes with
exceptions and carve-outs. Humans interpret these quickly; computers need structure.
So we parse descriptions into a canonical representation:
- normalized fields (county, section/township/range, aliquots, lot/block, etc.)
- standardized formatting
- tract “keys” used for search and linkage
Once legal descriptions are structured, you can index and join instruments reliably.
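To make the "tract key" idea concrete, here is a minimal PLSS parser. Real descriptions are far messier (hybrids, exceptions, multiple aliquots per tract), so treat this as a sketch of the canonicalization step, with an illustrative key format:

```python
import re
from typing import Optional

PLSS_RE = re.compile(
    r"(?:(?P<aliquot>[NSEW]{1,2}/[24])\s+of\s+)?"
    r"Section\s+(?P<sec>\d+),?\s+"
    r"Township\s+(?P<twp>\d+)\s+(?P<twp_dir>North|South),?\s+"
    r"Range\s+(?P<rng>\d+)\s+(?P<rng_dir>East|West)",
    re.IGNORECASE,
)

def tract_key(description: str) -> Optional[str]:
    """Normalize a simple PLSS description into a canonical tract key."""
    m = PLSS_RE.search(description)
    if not m:
        return None
    key = (f"SEC{int(m.group('sec'))}"
           f"-T{int(m.group('twp'))}{m.group('twp_dir')[0].upper()}"
           f"-R{int(m.group('rng'))}{m.group('rng_dir')[0].upper()}")
    if m.group("aliquot"):
        key += "-" + m.group("aliquot").upper().replace("/", "")
    return key
```

The payoff is that "NW/4 of Section 12, Township 10 North, Range 3 West" and "NW/4 Sec. 12-10N-3W" can normalize to the same key, so two instruments describing the same land become joinable.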
The “neat” part: automatically building a runsheet (how the chain is constructed)
Once the document is structured, we link records into a chain using two practical methods:
Method A: Follow references (citation chasing)
Instruments often reference prior instruments (instrument number, book/page, “being the same land conveyed in…”).
When we detect a reference, we pull the referenced instrument and keep walking backward.
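Reference detection itself is mostly pattern matching over the two common citation shapes. These regexes are illustrative and deliberately narrow; production patterns cover many more clerk formats:

```python
import re

# Shape 1: an instrument number, e.g. "Instrument No. 1998-004567"
INSTRUMENT_RE = re.compile(
    r"\b(?:Instrument|Doc(?:ument)?)\s+(?:No\.?|#)\s*(\d{4}-\d+)", re.I)
# Shape 2: a book/page citation, e.g. "Book 210, Page 33"
BOOK_PAGE_RE = re.compile(r"\bBook\s+(\d+),?\s+Page\s+(\d+)", re.I)

def extract_references(text: str):
    """Return (kind, identifier) pairs for every prior-instrument citation."""
    refs = [("instrument", n) for n in INSTRUMENT_RE.findall(text)]
    refs += [("book_page", f"{b}/{p}") for b, p in BOOK_PAGE_RE.findall(text)]
    return refs
```

Each extracted reference is then resolved to a document in the corpus and the walk continues from there.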
Method B: Link by tract (legal description matching)
Even when references are missing, instruments can be connected by matching tract keys derived from structured legal descriptions.
This is especially helpful when county indexes are incomplete.
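The linkage test is just set intersection over the tract keys parsed from each instrument (key strings here are illustrative):

```python
def tract_overlap(keys_a, keys_b):
    """Two instruments are candidates for linkage when the tract keys
    parsed from their legal descriptions share at least one member."""
    return set(keys_a) & set(keys_b)

deed_1998 = ["SEC12-T10N-R3W-NW4"]
lease_2010 = ["SEC12-T10N-R3W-NW4", "SEC13-T10N-R3W-N2"]
```

The deed and the lease never cite each other, but they intersect on one tract key, which is enough to place them on the same runsheet for review.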
Graph model (runsheet as a graph)
Under the hood, runsheets are naturally represented as a graph: instruments connect parties and tracts, and also reference
other instruments. Here’s a simple conceptual view:
```
LEGEND:
  [Instrument] = a recorded document (deed/lease/assignment/etc.)
  (Party)      = person or company
  {Tract}      = a normalized tract/legal description key
  --conveys--> = conveyance/assignment relationship
  ..refers..>  = "this document references that prior document"

(Grantor) --conveys--> [Deed #2021-000123] --conveys--> (Grantee)
                            |
                            +--> {Tract: Sec 12-T10N-R3W, NW/4}
                            |
                            ..refers..> [Deed #1998-004567] --conveys--> (Prior Grantee)
                                             |
                                             +--> {Tract: Sec 12-T10N-R3W, NW/4}

[Lease #2010-009999] --conveys--> (Lessee)
        |
        +--> (Lessor)
        |
        +--> {Tract: Sec 12-T10N-R3W, NW/4}
```
A “runsheet view” is simply a readable timeline/report generated from this graph: which instruments affect the tract,
who the parties are, and how the chain links together.
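Generating that view is a filter-and-sort over the graph. A minimal sketch, where the edge names match the legend above and the instrument data is made up for illustration:

```python
def runsheet_rows(edges, instruments, tract_key):
    """Return (date, instrument_id, doc_type) rows for one tract, oldest first.
    `edges` are (src, edge_type, dst) triples; `instruments` maps id to
    illustrative per-instrument metadata."""
    affecting = {src for (src, etype, dst) in edges
                 if etype == "affects_tract" and dst == tract_key}
    return sorted((instruments[i]["date"], i, instruments[i]["type"])
                  for i in affecting)

edges = [
    ("1998-004567", "affects_tract", "SEC12-T10N-R3W-NW4"),
    ("2021-000123", "affects_tract", "SEC12-T10N-R3W-NW4"),
    ("2021-000123", "refers_to", "1998-004567"),
]
instruments = {
    "1998-004567": {"date": "1998-06-01", "type": "warranty deed"},
    "2021-000123": {"date": "2021-03-15", "type": "mineral deed"},
}
```

Because the rows are derived from the graph rather than stored separately, the runsheet stays consistent with the underlying chain as new instruments are linked in.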
Why multi-tract instruments matter (and why traditional search misses them)
A common failure mode in courthouse searching is multi-tract instruments: a single document may contain many legal descriptions,
but some county indexes only record the first one or two. If you search only by indexed fields, you can miss relevant documents.
A practical fix is straightforward:
- extract all legal descriptions from the document itself
- parse them into structured tract keys
- index/search at the tract level
That makes “search by legal” much more complete than relying on partial index fields.
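A small, self-contained illustration of the difference (all identifiers and keys are made up): the clerk index captured only the first tract of a two-tract deed, so a tract-level index built from the parsed document finds it where the index fields would not.

```python
from collections import defaultdict

# The clerk index captured only the first tract of this multi-tract deed;
# full-text parsing recovered both.
clerk_index = {"2021-000123": ["SEC12-T10N-R3W-NW4"]}
parsed_tracts = {"2021-000123": ["SEC12-T10N-R3W-NW4", "SEC13-T10N-R3W-N2"]}

def build_tract_index(instrument_tracts):
    """Invert instrument -> tracts into tract -> instruments for search."""
    index = defaultdict(set)
    for inst_id, keys in instrument_tracts.items():
        for key in keys:
            index[key].add(inst_id)
    return index

# Searching the second tract misses the deed via clerk index fields,
# but finds it via the parsed, tract-level index.
assert "2021-000123" not in build_tract_index(clerk_index)["SEC13-T10N-R3W-N2"]
assert "2021-000123" in build_tract_index(parsed_tracts)["SEC13-T10N-R3W-N2"]
```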
Code Sample
Below is a simplified example of the traversal logic. It’s intentionally small and conceptual, but it reflects the standard pattern:
start from a seed tract (or instrument), then expand the chain by following references and tract matches, while deduping.
```python
from collections import deque

def build_chain(seed_tract_keys, max_depth=5):
    queue = deque()
    visited_instruments = set()
    graph_edges = []  # (src_id, edge_type, dst_id)

    # Seed: find instruments that mention the tract(s)
    for tk in seed_tract_keys:
        for inst_id in find_instruments_by_tract_key(tk):
            queue.append((inst_id, 0))

    while queue:
        inst_id, depth = queue.popleft()
        if inst_id in visited_instruments:
            continue
        visited_instruments.add(inst_id)

        inst = load_instrument(inst_id)

        # 1) Extract structured facts
        parties = extract_parties_with_roles(inst)         # e.g., (Grantor, Grantee)
        tract_keys = extract_tract_keys(inst)              # parsed legal descriptions
        references = extract_referenced_instruments(inst)  # instrument #, book/page, etc.

        # 2) Add edges: instrument -> parties and instrument -> tracts
        for party in parties:
            graph_edges.append((inst_id, "has_party", party.party_id))
        for tk in tract_keys:
            graph_edges.append((inst_id, "affects_tract", tk))

        # 3) Expand by following references (citation chasing)
        if depth < max_depth:
            for ref in references:
                ref_id = resolve_reference_to_instrument_id(ref)
                if ref_id and ref_id not in visited_instruments:
                    graph_edges.append((inst_id, "refers_to", ref_id))
                    queue.append((ref_id, depth + 1))

        # 4) Optional expansion: widen by tract matches (use carefully)
        # This is useful when clerk indexes are incomplete, especially
        # for multi-tract instruments.
        if depth < max_depth:
            for tk in tract_keys:
                for neighbor_id in find_instruments_by_tract_key(tk):
                    if neighbor_id not in visited_instruments:
                        graph_edges.append((inst_id, "shares_tract_with", neighbor_id))
                        queue.append((neighbor_id, depth + 1))

    return graph_edges
```
In production, this usually adds:
- confidence scoring (exact reference matches rank higher than fuzzy matches)
- dedupe logic (same document indexed multiple ways)
- explanations for “why this instrument is included”
- limits to prevent overly broad tract-based expansion
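As a sketch of the confidence-scoring idea, edges can carry weights that rank explicit citations above tract-based expansion. The weights and edge names below are hypothetical; a real system would calibrate them against reviewed runsheets:

```python
# Hypothetical weights for the edge types produced by build_chain().
EDGE_WEIGHTS = {
    "refers_to": 1.0,          # explicit citation: strongest evidence
    "affects_tract": 0.8,      # exact structured tract-key match
    "shares_tract_with": 0.5,  # index-style expansion: weaker, needs review
}

def score_edge(edge_type: str, fuzzy: bool = False) -> float:
    """Rank exact matches above fuzzy ones (e.g., an OCR-damaged
    instrument number resolved by approximate match)."""
    base = EDGE_WEIGHTS.get(edge_type, 0.0)
    return base * (0.6 if fuzzy else 1.0)
```

Scores like these also feed the "why this instrument is included" explanation: the edge type and whether the match was exact are exactly the facts a reviewer wants to see.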
Where this goes next: connecting title work to assets and economics
Once instruments are connected to structured tracts, it becomes natural to relate that to wells, units, production,
and mapping layers, so title work and underwriting can happen in one workflow instead of switching tools.
The part that’s most exciting isn’t “AI” as a buzzword. It’s the shift from documents as PDFs to documents as structured,
searchable facts with traceability back to the source pages.
That changes how quickly you can answer questions like:
- Do we have a clean chain here?
- What’s missing?
- What’s the risk?
- How does this tie to the asset we’re underwriting?
We’ll share more as the system matures, especially examples of tract parsing, graph outputs, and the edge cases we’ve learned
from the wild variety of county document formats.