The Missing Layer: Why Ontology Might Be the Highest-Leverage Tool in AI-Assisted Development

There’s a pattern I keep running into when working with AI coding agents. You write careful requirements, maybe even detailed specs, hand them to the agent, and get back code that’s plausible but wrong. Not wrong in a way that fails immediately — wrong in a way that subtly misunderstands your domain. The naming is off. The relationships between entities don’t quite match what you meant. The edge cases reveal assumptions you never made. The code compiles, the tests pass, and the architecture slowly drifts from what you actually need.

The untethered AI workflow is fast — but it’s too fast. You save time on the front end and spend twice the tokens on the back end trying to get the agent to track down its own mistakes. In regulated environments — manufacturing, medical devices, defense, finance — “plausible but wrong” isn’t an inconvenience. It’s a compliance failure. And even outside regulated industries, anyone trying to reliably automate QA hits the same wall: you can’t test what you haven’t defined, and you can’t define what you haven’t modeled.

I’ve come to believe this is a structural problem, not a prompting problem. And the fix comes from a discipline most software engineers have never had reason to think about: formal ontology.

What Is Ontology, Anyway?

The word comes from philosophy — literally, the study of what exists. Aristotle’s Categories is arguably the first ontological framework: a systematic attempt to enumerate the kinds of things that are, and how they relate to each other. Substance, quantity, quality, relation, place, time. Before you can say anything meaningful about the world, you need to agree on what the world contains.

In computer science, the term was borrowed (some would say stolen) in the early 1990s by the knowledge representation community. The widely cited definition that grew out of Tom Gruber’s work is useful here: an ontology is “a formal, explicit specification of a shared conceptualization.” Unpack that and you get four things:

Formal — expressed in a language with precise semantics, not just natural language prose. You can reason over it mechanically.

Explicit — the concepts, relationships, and constraints are stated, not implied. Nothing is left to the reader’s interpretation.

Shared — it represents a consensus understanding. Multiple agents (human or artificial) can use it as common ground.

Conceptualization — it’s a model of a domain, not of software. It describes what things are before it describes what the system does.

If this sounds like Domain-Driven Design, you’re not wrong — DDD’s “ubiquitous language” is doing ontological work. The difference is that DDD’s domain model typically lives in developers’ heads and in code. It’s not machine-readable in a way that enables automated gap analysis, contradiction detection, or — crucially — targeted injection into an LLM’s context window. The ontology step is what makes the domain model feedable to an AI agent as structured data, not just prose.

The notation I’ve been using is OWL — the Web Ontology Language, in Turtle syntax. I should be upfront: I don’t care about the Semantic Web. OWL carries baggage from a 2000s-era vision that largely didn’t materialize, and a lot of engineers will see Turtle syntax and mentally file this under “academic solutions looking for problems.” Fair enough. The Semantic Web was the wrong use case for this notation. Using it as a structured intermediate representation for LLM-assisted development might be the right one. If you can get the same properties — formal classes, typed relationships, axioms a reasoner can check — from pure TypeScript types or JSON Schema, I’d love to hear about it. The point is the formal model, not the file format.

Here’s what a tiny fragment looks like:

pd:Snapshot a owl:Class ;
    rdfs:comment "The fundamental unit — an immutable record at a point in time." .

pd:Stage a owl:Class ;
    rdfs:comment "Lifecycle phase: Part → Engineering → Manufacturing → AsBuilt." .

pd:atStage a owl:ObjectProperty ;
    rdfs:domain pd:Snapshot ;
    rdfs:range pd:Stage ;
    rdfs:comment "Every Snapshot exists at exactly one Stage." .

Even without knowing OWL, you can read this: there’s a thing called a Snapshot, a thing called a Stage, and every Snapshot is at exactly one Stage. That’s a formal ontological commitment — and it’s one that an AI agent can use to constrain everything it builds downstream.
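To make the "pure TypeScript types" alternative concrete, here is a minimal sketch of the same three commitments as types. This is illustrative, not Primer's actual code; the names mirror the OWL fragment above.

```typescript
// Hypothetical sketch: the OWL fragment's commitments as TypeScript types.
// A class becomes a type, the typed relationship becomes a field, and the
// "exactly one Stage" axiom becomes a required, single-valued property.

type Stage = "Part" | "Engineering" | "Manufacturing" | "AsBuilt";

interface Snapshot {
  readonly id: string;     // an immutable record...
  readonly atStage: Stage; // ...at exactly one Stage: required, single-valued
}

// The compiler now enforces the axiom mechanically: a Snapshot with no
// Stage, or with two Stages, is unrepresentable.
const snap: Snapshot = { id: "snap-001", atStage: "Part" };
```

What this loses relative to OWL is the reasoner: the type checker validates structure at compile time, but it won't infer new facts or detect contradictions across axioms the way a description-logic reasoner can.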

The Gap Between Requirements and Understanding

Here’s the problem with feeding an AI agent a list of requirements: requirements describe behavior, not structure. Consider this requirement:

DOCS-035: The Version operation SHALL create a new Document on the same DocumentRoot,
          incrementing the version number by 1.

This tells you what should happen, but it doesn’t tell the agent what a Document is, what a DocumentRoot is, how they relate, what “version” means in this domain versus any other domain, or what constraints govern the operation. A skilled human developer fills in these gaps through context, experience, and conversation. They read the requirements, form a mental model of the domain, and write code that reflects that model. The mental model is the real source of truth — the requirements are just its shadow.

An AI agent doesn’t form mental models. It predicts tokens. When it encounters ambiguity in your requirements, it resolves that ambiguity by defaulting to whatever pattern is most probable given its training data. If your domain happens to match common patterns (a CRUD app, a blog, a todo list), this works fine. If your domain has its own internal logic — its own ontology — the agent’s defaults will be subtly, persistently wrong.

This is why better prompting has diminishing returns. The problem isn’t that the agent can’t understand your instructions. The problem is that your instructions don’t contain enough structural information to constrain the agent’s output to your specific domain.

A Case Study: 549 Requirements and No Domain Model

I’ve been building a platform called Primer that helps small manufacturers manage product data — parts, bills of materials, document control, engineering changes, approvals. The domain is deceptively complex. Six user roles with a strict hierarchy. Four lifecycle stages that progressively freeze design decisions. Multiple states per stage. Approval workflows that vary by document category. A composition model where assemblies reference sub-assemblies in a directed acyclic graph.

I should be honest about something: I’m not a manufacturing PLM expert. The domain model didn’t come from years of shop-floor experience. It came from pointing Claude Code at a legacy Quickbase application — a bunch of tables and some JavaScript — and asking it to reverse-engineer the domain. The AI derived the ontology through what amounted to requirements gathering: analyzing existing data structures, identifying entities and relationships, and asking clarifying questions. This matters for the argument I’m making, and I’ll come back to it.

My first approach to formalizing requirements was the obvious one: hand-author them as markdown tables, organized by domain.

| ID       | Requirement                                              | Refs              |
|----------|----------------------------------------------------------|-------------------|
| DOCS-008 | The standard documentNumber formula SHALL be:             | (doc-0015,        |
|          | {prefix}-{rootSeq:4}-{revision}.{version:2}              |  doc-0038)        |
| DOCS-009 | The prefix SHALL be a 3-digit category code.              | (doc-0015)        |
| DOCS-010 | The rootSeq SHALL be a zero-padded 4-digit integer,       | (doc-0015)        |
|          | unique within the Category and Tenant.                    |                   |

Eleven domain files, 549 individual SHALL statements, each with a manually assigned ID and hand-written cross-references. Traceability was manual at every level — a developer read the requirement, wrote a test with the ID in the name, wrote the code, and mentally tracked the mapping:

// Unit test — requirement ID embedded in the test name
test("DOCS-008: standard format is {prefix}-{rootSeq:4}-{revision}.{version:2}", () => {
  expect(formatDocumentNumber("010", 100, "A", 1)).toBe("010-0100-A.01");
});

// Integration test — requirement range in the describe block
describe("Version Operation (DOCS-035 through DOCS-041)", () => {
  test("DOCS-035: creates new Document on the same DocumentRoot", async () => {
    // ...test hits real database
  });
});

# E2E test — requirement ID as a Gherkin tag
@DOCS-026
Scenario: Effective button visible only for Working documents
  Given I am viewing a Part in "Working" state
  Then I should see the "Effective" action button

The full traceability chain was human-maintained at every link:

Requirement ID (DOCS-008)
  ↓ manually embedded in...
Test name ("DOCS-008: standard format...")
  ↓ manually organized in...
Test file (tests/documents/document-numbering.unit.test.ts)
  ↓ manually mapped to...
Implementation file (src/lib/format-document-number.ts)

Three test layers, all hand-wired. This worked well enough to build a functioning system. But as it grew, the cracks became structural.

Gap analysis was a heuristic, not a proof. Finding untested requirements meant diffing IDs against test files. A requirement could appear “covered” by a test that mentioned the ID in its name but didn’t actually verify the behavior.

The negative space was invisible. Requirements described what the system SHALL do. They rarely described what it SHALL NOT do. “Administrator can approve documents” didn’t generate “viewer cannot approve documents.” With six roles and dozens of actions, the negative permission space is enormous — and it was untested unless someone explicitly thought to write each case.

Requirements didn’t compose. Each statement was isolated. There was no way to express “this pattern applies to all entities with lifecycle state” and have it propagate. Adding a new entity type meant rewriting every relevant requirement by analogy, hoping nothing was missed.

Contradictions were undetectable. With 549 requirements across eleven files, no tooling checked whether one requirement’s grant conflicted with another requirement’s precondition.

The domain model was implicit. Concepts like “a Part is a 1:1 extension of a Document” and “an engineering BOM derives from a Part via stage-fork” lived as natural language scattered across twenty-plus design documents. Nothing formalized it. And when an AI agent tried to work with the requirements, it had no access to any of this implicit structure.

The underlying issue: the requirement space was enumerable but unenumerated. The dimensions existed — roles, stages, states, actions, entity types — but no one had built the full matrix.

The Ontology Layer

The fix was to make the domain model formal. Not as documentation — as a machine-readable artifact that could be reasoned over, and more importantly, that could be injected into an LLM’s context as structured data.

Layer 1: Domain — What the System Is

The fundamental insight that unlocked everything else: every record in Primer is an immutable snapshot in a directed acyclic graph. Parts, engineering BOMs, manufacturing BOMs, as-built records — they’re not different kinds of things. They’re the same kind of thing at different lifecycle stages.

pd:Snapshot a owl:Class ;
    rdfs:comment "THE fundamental unit — an immutable record." .

pd:Derivation a owl:Class ;
    rdfs:comment "How one Snapshot begets another." .

pd:DerivationType a owl:Class ;
    owl:oneOf (pd:Version pd:Revision pd:Clone pd:StageFork) .

pd:source a owl:ObjectProperty ;
    rdfs:domain pd:Derivation ;
    rdfs:range pd:Snapshot ;
    rdfs:comment "Source must be Frozen." .

pd:target a owl:ObjectProperty ;
    rdfs:domain pd:Derivation ;
    rdfs:range pd:Snapshot ;
    rdfs:comment "Target starts as Mutable." .

All operations are variations of “fork a frozen snapshot”:

| Operation | What changes                         | What stays                           |
|-----------|--------------------------------------|--------------------------------------|
| Version   | version +1                           | same root, same revision, same stage |
| Revision  | next revision letter, version resets | same root, same stage                |
| Clone     | new root, new number                 | same stage                           |
| StageFork | new root, new number, next stage     | provenance link back                 |

The lifecycle is simple: Mutable → Approved → Frozen → Fork to create new Mutable. Each stage isn’t a different kind of thing — it’s the same mechanics at a different lifecycle phase, with different applicable attributes and permissions.
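As a concrete illustration, here is a minimal TypeScript sketch of "fork a frozen snapshot" covering all four derivation types. The field and helper names are assumptions for illustration, not Primer's actual schema:

```typescript
// Hypothetical sketch: all four operations as variations of "fork a
// frozen snapshot". Field names are illustrative, not Primer's schema.

type Stage = "Part" | "Engineering" | "Manufacturing" | "AsBuilt";
type State = "Mutable" | "Approved" | "Frozen";
type DerivationType = "Version" | "Revision" | "Clone" | "StageFork";

interface Snapshot {
  rootId: string;
  number: string;
  revision: string;    // "A", "B", ...
  version: number;
  stage: Stage;
  state: State;
  forkedFrom?: string; // provenance link back (StageFork only)
}

const NEXT_STAGE: Record<Stage, Stage | undefined> = {
  Part: "Engineering",
  Engineering: "Manufacturing",
  Manufacturing: "AsBuilt",
  AsBuilt: undefined,
};

function fork(
  source: Snapshot,
  type: DerivationType,
  newIds: () => { rootId: string; number: string }
): Snapshot {
  if (source.state !== "Frozen") throw new Error("Source must be Frozen");
  switch (type) {
    case "Version":  // version +1; same root, same revision, same stage
      return { ...source, state: "Mutable", version: source.version + 1 };
    case "Revision": // next revision letter, version resets; same root, stage
      return {
        ...source, state: "Mutable",
        revision: String.fromCharCode(source.revision.charCodeAt(0) + 1),
        version: 1,
      };
    case "Clone":    // new root, new number; same stage
      return { ...source, state: "Mutable", ...newIds(), revision: "A", version: 1 };
    case "StageFork": { // new root, new number, next stage; provenance back
      const next = NEXT_STAGE[source.stage];
      if (!next) throw new Error("No stage beyond AsBuilt");
      return {
        ...source, state: "Mutable", ...newIds(), revision: "A", version: 1,
        stage: next, forkedFrom: source.rootId,
      };
    }
  }
}

// Example: versioning a frozen Part snapshot.
const frozen: Snapshot = { rootId: "r1", number: "010-0100", revision: "A",
                           version: 1, stage: "Part", state: "Frozen" };
const v2 = fork(frozen, "Version", () => ({ rootId: "r2", number: "010-0101" }));
```

One switch statement covers every entity type in the system, which is exactly the payoff of modeling them as one kind of thing.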

Agents, roles, and permissions get the same treatment:

pd:Role a owl:Class ;
    owl:oneOf (pd:viewer pd:participant pd:bom_participant
               pd:bom_super_user pd:administrator pd:system_admin) .

# Role hierarchy as OWL axiom:
# administrator ⊃ bom_super_user ⊃ bom_participant ⊃ participant ⊃ viewer

pd:Interaction a owl:Class ;
    rdfs:comment "A possible user action — one cell in the permission matrix." .

pd:hasRole    a owl:ObjectProperty ; rdfs:domain pd:Interaction ; rdfs:range pd:Role .
pd:hasAction  a owl:ObjectProperty ; rdfs:domain pd:Interaction ; rdfs:range pd:ActionType .
pd:atStage    a owl:ObjectProperty ; rdfs:domain pd:Interaction ; rdfs:range pd:Stage .
pd:inState    a owl:ObjectProperty ; rdfs:domain pd:Interaction ; rdfs:range pd:LifecycleState .

Layer 2: Requirements Meta-Model — What a Requirement Is

This is where the ontology earns its keep. Instead of writing requirements one at a time, you define patterns that generate requirements when instantiated against the domain model:

pr:RequirementPattern a owl:Class ;
    rdfs:comment "Template for generating SHALL statements." .

pr:Requirement a owl:Class ;
    rdfs:comment "A formal SHALL/SHALL NOT statement." .

pr:generatedBy a owl:ObjectProperty ;
    rdfs:domain pr:Requirement ;
    rdfs:range pr:RequirementPattern .

pr:Coverage a owl:Class ;
    rdfs:comment "Traceability link: requirement → test case → status." .

A pattern like “For each Stage, every Snapshot SHALL have a valid state machine” isn’t one requirement — it generates a requirement for every Stage in Layer 1. Add a Stage, and the pattern produces new requirements automatically.
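A sketch of what instantiation might look like in TypeScript, with the pattern text and generated IDs invented for illustration:

```typescript
// Hypothetical sketch: a requirement pattern instantiated once per Stage.
// When a Stage is added to the domain model, rerunning the generator
// produces the new requirement automatically.

const STAGES = ["Part", "Engineering", "Manufacturing", "AsBuilt"] as const;

interface Requirement {
  id: string;
  statement: string;
  generatedBy: string; // the pattern, for traceability
}

function instantiateStateMachinePattern(stages: readonly string[]): Requirement[] {
  return stages.map((stage, i) => ({
    id: `GEN-SM-${String(i + 1).padStart(3, "0")}`,
    statement: `Every Snapshot at Stage ${stage} SHALL have a valid state machine.`,
    generatedBy: "pattern:stage-state-machine",
  }));
}

// One requirement per Stage; a fifth Stage would yield a fifth requirement.
const generated = instantiateStateMachinePattern(STAGES);
```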

The interaction matrix is the most powerful construct. An Interaction is a cell in:

Role × TargetType × Stage × State × Action

Each valid cell → a positive requirement. Each invalid cell → a negative requirement. Each cell → needs a test.

Now, I’ll be the first to admit: the combinatorial explosion this produces looks like garbage at first glance. Six roles times four stages times four states times fifteen actions is 1,440 cells. Nobody wants to read that many requirements. But that’s the point — the requirements become an intermediate artifact, not a deliverable. They exist to be distilled into integration tests. What you’re generating is automated coverage for user stories — and automated user stories themselves. The human reads the ontology and the tests. The combinatorial matrix is for the machine.
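Enumerating the matrix is a few lines of code. In this TypeScript sketch the role and stage names come from the ontology above, while the state and action names are placeholders standing in for the real lists in the domain model:

```typescript
// Hypothetical enumeration of the interaction matrix. Roles and stages
// come from the ontology; states and actions are illustrative placeholders.

const ROLES = ["viewer", "participant", "bom_participant",
               "bom_super_user", "administrator", "system_admin"];
const STAGES = ["Part", "Engineering", "Manufacturing", "AsBuilt"];
const STATES = ["Mutable", "Approved", "Frozen", "Superseded"];        // placeholder names
const ACTIONS = Array.from({ length: 15 }, (_, i) => `action_${i + 1}`); // placeholder names

interface Cell { role: string; stage: string; state: string; action: string }

const cells: Cell[] = ROLES.flatMap(role =>
  STAGES.flatMap(stage =>
    STATES.flatMap(state =>
      ACTIONS.map(action => ({ role, stage, state, action })))));

// 6 x 4 x 4 x 15 = 1,440 cells, each needing a positive or a negative
// requirement: far too many to hand-author, trivial to enumerate.
```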

The full matrix is enumerable from the ontology. Gap analysis becomes a query:

SELECT ?interaction WHERE {
  ?interaction a pd:Interaction .
  FILTER NOT EXISTS {
    ?req pr:covers ?interaction .
    ?test pr:verifies ?req .
  }
}

That query returns every interaction that lacks either a requirement or a test. Try doing that with 549 markdown rows and grep.
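The same check doesn't require a triple store. If the graph is loaded into memory, the query reduces to a filter; here is a TypeScript sketch with invented data shapes:

```typescript
// Hypothetical in-memory equivalent of the SPARQL gap query: return every
// interaction lacking a covering requirement that is verified by a test.

interface Interaction { id: string }
interface Coverage {
  interactionId: string;
  requirementId: string;
  verifiedByTest: boolean;
}

function findGaps(interactions: Interaction[], coverage: Coverage[]): Interaction[] {
  return interactions.filter(i =>
    !coverage.some(c => c.interactionId === i.id && c.verifiedByTest));
}

const interactions = [{ id: "I1" }, { id: "I2" }, { id: "I3" }];
const coverage = [
  { interactionId: "I1", requirementId: "DOCS-035", verifiedByTest: true },
  { interactionId: "I2", requirementId: "DOCS-036", verifiedByTest: false }, // requirement but no test
];

// I2 (requirement without a test) and I3 (no requirement at all) are gaps.
const gaps = findGaps(interactions, coverage);
```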

Layer 3: Instances — The Actual Requirements

The existing 549 requirements get parsed into OWL individuals, linked to domain concepts through formal bindings:

pr:DOCS-008 a pr:Requirement ;
    pr:domain "DOCS" ;
    pr:strength pr:Mandatory ;
    pr:statement "The standard documentNumber formula SHALL be: ..." ;
    pr:bindsTo pd:Snapshot, pd:documentNumber .

This layer is regenerated whenever the domain model changes. Requirements that were previously free-floating strings are now nodes in a queryable graph.
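In TypeScript terms, a parsed requirement might look like the following record. The field names mirror the OWL individual above but are illustrative assumptions:

```typescript
// Hypothetical shape for a parsed requirement as a graph node, mirroring
// the OWL individual above. Field names are illustrative assumptions.

interface ParsedRequirement {
  id: string;
  domain: string;
  strength: "Mandatory" | "Recommended";
  statement: string;
  bindsTo: string[]; // domain-model concepts this requirement constrains
}

const docs008: ParsedRequirement = {
  id: "DOCS-008",
  domain: "DOCS",
  strength: "Mandatory",
  statement: "The standard documentNumber formula SHALL be: {prefix}-{rootSeq:4}-{revision}.{version:2}",
  bindsTo: ["pd:Snapshot", "pd:documentNumber"],
};

// Because bindsTo points at formal concepts, "which requirements touch
// pd:Snapshot?" becomes a lookup instead of a text search.
const touchingSnapshot = [docs008].filter(r => r.bindsTo.includes("pd:Snapshot"));
```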

The New Development Pipeline

With the ontology in place, the workflow becomes:

Rough Requirements → Ontology → Refined Requirements → Specs → Tests → Code

Compare how a requirement reads before and after the ontology:

BEFORE:
"The system shall allow BOM super users to create an engineering BOM from a part."

AFTER:
"A pd:Interaction with pd:hasRole pd:bom_super_user, pd:hasAction pd:StageFork,
 pd:sourceStage pd:Part, pd:targetStage pd:Engineering SHALL be permitted
 when pd:source pd:inState pd:Frozen."

The first version requires the agent to guess what “create an engineering BOM from a part” means structurally. The second version tells it exactly: this is a StageFork derivation, from Part stage to Engineering stage, the source must be Frozen, and the required role is bom_super_user. There’s nothing left to guess.

Why Structured + Unstructured Input Matters

Large language models don’t reason from first principles. They predict probable continuations. When you give a model only natural language, it fills structural gaps with whatever patterns are statistically likely. When you give it only formal structure, it fills intent gaps with generic assumptions. Neither alone is sufficient.

When you provide both — a formal ontology that constrains the output space, and natural language that provides intent, context, and edge cases — the model’s predictions are simultaneously more constrained and more relevant. The structured input narrows what’s valid. The unstructured input guides what’s useful.

This is the same principle behind a number of results in the retrieval-augmented generation literature: structured metadata plus natural language retrieval outperforms either alone. Few-shot examples with schema definitions dramatically outperform zero-shot prompting. The pattern is consistent — LLMs perform best when you give them both a map and a destination.

In the ontology-driven pipeline, this compounding plays out across every step:

Ontology      → constrains vocabulary      (structured)
Requirements  → constrains behavior        (structured + unstructured)
Specs         → constrains implementation  (structured + unstructured)
Tests         → constrains correctness     (structured)
Code          → maximally constrained      (output)

By the time the agent writes code, it’s operating within a tightly bounded space where the probable continuation and the correct continuation are much more likely to be the same thing.

The idea is that the more levels of analysis you can offload into machine-readable artifacts, the more precisely you can target the LLM’s context window, and the less you end up dealing with contradictions, hallucinated assumptions, and expensive debugging loops. Each layer is a checkpoint. Each checkpoint is an opportunity to catch errors before they compound.

There’s a compression analogy I find useful. The ontology works like a codebook. Once you’ve established that pd:Snapshot means this specific entity with these coordinates, this stage, this state, and these derivation rules, every downstream reference to “Snapshot” carries that full semantic payload. You’re compressing the information the agent needs to hold in context, reducing ambiguity at the token level. And reducing ambiguity at the token level is essentially the whole game.

You Don’t Have to Be the Domain Expert

I mentioned earlier that I’m not a manufacturing PLM expert. This is important because the most obvious objection to this workflow is that it requires a unicorn — someone who understands both the domain deeply enough to build an ontology and the engineering deeply enough to use it. In most organizations, the domain expert and the developer are different people.

But here’s the thing: the LLM can bridge that gap. The Primer ontology wasn’t hand-crafted from decades of manufacturing experience. It was derived by an AI agent doing requirements gathering — analyzing a legacy Quickbase application’s table structures and JavaScript, identifying entities and relationships, asking clarifying questions, and producing a formal model. The human’s job was to review and correct, not to author from scratch.

The next step — one I’m actively working on — is making the input even more accessible: a domain expert describing their problems in conversation, not code artifacts. The LLM conducts the requirements gathering, derives the ontology through dialogue, and produces the formal model that then drives the rest of the pipeline. The domain expert never sees Turtle syntax. They just talk about their work, and the machine builds the structure.

This is where it gets interesting for regulated industries specifically. The bottleneck in manufacturing QA isn’t writing code — it’s capturing domain knowledge formally enough to generate comprehensive test coverage. If an LLM can conduct that capture through conversation and produce a formal model, you’ve automated the most expensive part of the compliance pipeline.

Where Humans Are Still Irreplaceable

Even with the LLM doing the heavy lifting on ontology derivation, human review of the domain model is non-negotiable. Everything downstream inherits its structure. If the domain model is wrong — if you model something as two entities that should be one, or collapse a distinction that matters — the requirements describe the wrong things, the specs implement the wrong design, the tests validate the wrong behavior, and the code is confidently, systematically incorrect.

In Primer, the most consequential ontological decision was recognizing that parts, engineering BOMs, manufacturing BOMs, and as-built records are all the same fundamental entity. That’s a domain insight. Getting it right meant the entire state machine, derivation model, and permission system could be defined once and applied uniformly. Getting it wrong would have meant parallel logic for each entity type — the kind of structural duplication that breeds the contradictions the ontology is supposed to eliminate.

The ontology step is the highest-leverage checkpoint in the pipeline. The human doesn’t need to write it. But the human needs to read it and say “yes, that’s what this domain is.”

Portability

One property of this workflow that I didn’t expect is how cleanly it generalizes. I’m applying the same pipeline to domains as different as manufacturing data management, clinical psychological assessment, and audio signal processing. The workflow is identical. The ontology is the only thing that changes.

This makes the ontology arguably the most portable artifact in the entire development process. Requirements are project-specific. Specs are implementation-specific. Code is language-specific. But a well-constructed domain model travels across all of them, and it can seed future projects in related domains.

What This Doesn’t Do Yet

I want to be honest about the state of this: it’s a theoretical argument backed by one real project, not a controlled experiment. I don’t yet have a side-by-side comparison of LLM-generated code with and without the ontology layer. I don’t have metrics on bug reduction, rework cycles, or test pass rates. The compression analogy and the RAG parallels are persuasive reasoning, not empirical evidence.

What I do have is a workflow that took 549 hand-authored requirements with manual traceability and turned them into a formal model where gaps, contradictions, and missing coverage are queryable by definition. I have a pipeline that I’m now applying across multiple domains. And I have a strong intuition, grounded in how LLMs process input, that the structured+unstructured combination is doing real work.

The next piece is replicating this framework in a way that others can use and measure. If you’re working in a domain where QA matters — where “plausible but wrong” isn’t acceptable — and you want to try this approach, I’d like to hear about it.

The Deeper Point

I think there’s something more general going on here. The history of software engineering is largely a history of finding the right abstractions — the right way to decompose a problem so that the pieces are independent, composable, and comprehensible. Object-oriented programming, domain-driven design, microservices, even the relational model — these are all, at bottom, ontological commitments. They’re claims about what kinds of things exist in a domain and how those things relate.

What’s changed with AI agents is that the cost of getting the ontology wrong has increased dramatically. A human developer with a flawed mental model will still produce locally reasonable code, because they’re constantly making micro-corrections based on their understanding. An AI agent with a flawed ontology will produce globally unreasonable code with total confidence, because it has no understanding to correct against. The ontology is no longer just a design aid. It’s the primary mechanism by which you align an AI agent’s output with your domain.

If you’re using AI coding agents and finding that the output is close but not quite right — that the code is plausible but doesn’t reflect what you actually mean — the problem might not be your prompts, your requirements, or your choice of model. It might be that you haven’t given the agent the one thing it can’t infer on its own: a formal account of what your domain is.

That’s what the ontology gives you. And it might be the highest-leverage investment you can make in your AI-assisted development workflow.