AI-Powered Test-to-Requirements Traceability Validation
Applied NLP and LLM Pipeline for Regulated Life Sciences QA
L7 Informatics · March – April 2023 · Nelson Love, Quality Engineer
Executive Summary
I designed and built a two-stage NLP/LLM pipeline that evaluated 7,750 test-to-requirement mappings across a regulated life sciences platform, reducing a projected 3–4 week manual effort for a team of six engineers to roughly two weeks. The system combined text embeddings for high-recall candidate retrieval with GPT-based semantic evaluation for precision judgment, generating audit-ready justifications for each mapping decision. The QA lead commended the output quality as exceeding manual review standards.
| Metric | Value |
|---|---|
| Test cases processed | 8,406 |
| Requirements analyzed | 467 |
| Pairs evaluated via GPT | 7,750 |
| Time reduction vs. projection | ~50% |
Problem
The Regulatory Context
L7 Informatics develops ESP (Enterprise Science Platform), a bioinformatics platform used in environments subject to 21 CFR Part 11 and GAMP 5. These frameworks require demonstrable, auditable traceability between every software requirement and the tests that validate it. ESP had accumulated 8,406 test cases across 20+ UI applications, SDKs, and content packs, mapped against 467 feature-level requirements.
The trace matrix linking tests to requirements had drifted. Mappings were stale, incorrect, or missing. A compliance gap here isn’t a documentation inconvenience — it’s regulatory exposure.
The Justification Hardening Initiative
Six engineers were assigned to the initiative with three objectives:
- Verify existing mappings are semantically correct
- Connect missing requirement-test links
- Disconnect invalid mappings where tests don’t genuinely validate the requirement
The internal work instructions described the cognitive load honestly: engineers needed to read each requirement until its intent was clear, read the associated test, and decide whether they could “defend this test in court.” The projected timeline was 3–4 weeks of dedicated effort.
The core problem beyond time was consistency: different engineers applied different standards, attention degraded across thousands of evaluations, and subtle semantic gaps went undetected.
Solution Architecture
The fundamental insight driving the design: test-to-requirement traceability is a semantic relationship, not a syntactic one. A test that checks whether a save button exists doesn’t validate a requirement about data persistence, even if both mention “save.” This ruled out keyword matching and required getting at the intent of both artifacts.
The final system was a two-stage pipeline.
Stage 1 — Embedding-Based Candidate Retrieval (High Recall)
Each test case and requirement was embedded using OpenAI’s text-embedding-ada-002 (1,536 dimensions). For each requirement, I retrieved the 50 nearest test cases by cosine distance. The goal of this stage was recall — exclude the ~99% of tests that are clearly irrelevant, while accepting some noise in the top 50.
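A minimal sketch of the retrieval step, assuming the embedding vectors are already computed and stacked in NumPy arrays; the function and variable names are illustrative, not the project's actual module layout:

```python
import numpy as np

def top_k_tests(requirement_vec, test_matrix, k=50):
    """Return indices of the k test embeddings closest to a requirement
    embedding by cosine similarity.

    requirement_vec: (1536,) ada-002 vector for one requirement
    test_matrix:     (n_tests, 1536) array, one row per test case
    """
    # Normalize so that a dot product equals cosine similarity
    req = requirement_vec / np.linalg.norm(requirement_vec)
    tests = test_matrix / np.linalg.norm(test_matrix, axis=1, keepdims=True)
    sims = tests @ req                 # (n_tests,) cosine similarities
    return np.argsort(-sims)[:k]       # indices of the 50 nearest tests

# candidates[i] holds the 50 nearest test indices for requirement i:
# candidates = [top_k_tests(r, test_embeddings) for r in requirement_embeddings]
```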
Test representations were enriched before embedding: I generated a Gherkin (Given/When/Then) translation for every test case using GPT, which surfaced the test’s business intent rather than its mechanical steps. A test described as “Click save, verify data persists” becomes “Given a user with edit permissions, When they modify and save data, Then the system persistently stores those changes” — revealing implicit assumptions about permissions and persistence scope. The Gherkin was concatenated with the test’s name, preconditions, steps, and expected results to form the embedding input.
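The Gherkin translation was a single chat completion per test case. A hedged sketch of that call, written against the current OpenAI Python client for readability (the 2023 pipeline used the then-current library, and the prompt wording below is a paraphrase, not the production prompt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GHERKIN_SYSTEM = (
    "Rewrite the following manual test case as a Gherkin scenario "
    "(Given/When/Then) that expresses its business intent rather than "
    "its mechanical UI steps."
)

def to_gherkin(test_name: str, steps: str, expected: str) -> str:
    """Ask the model for a business-intent Gherkin translation of one test."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": GHERKIN_SYSTEM},
            {"role": "user",
             "content": f"Test: {test_name}\nSteps: {steps}\nExpected: {expected}"},
        ],
        temperature=0,  # deterministic output for reproducible audit artifacts
    )
    return response.choices[0].message.content
```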
Stage 2 — GPT Semantic Evaluation (High Precision)
Candidate pairs surfaced by Stage 1 were evaluated via a structured GPT prompt that asked the model to: (1) summarize the requirement, (2) summarize the test, (3) determine whether and how they are related, (4) assess whether the test is sufficient to validate the requirement. The Gherkin was included alongside the raw test steps, bridging the gap between procedural test language and business-requirement language.
Phase 1 evaluations (Builders and LIMS components) used GPT-3.5-turbo with a four-question structured format. Phase 2 (remaining components) upgraded to GPT-4 with a streamlined prompt: determine if the mapping is correct, respond Yes/No with reasoning. Combined output: 7,750 evaluated pairs against 7,822 existing trace matrix entries.
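A sketch in the Phase 2 style (a Yes/No judgment with reasoning), with the prompt paraphrased from the description above; the production prompts and answer parsing were more elaborate:

```python
from openai import OpenAI

client = OpenAI()

EVAL_SYSTEM = (
    "You are auditing a software trace matrix for a regulated life sciences "
    "platform. Decide whether the test genuinely validates the requirement. "
    "Answer 'Yes' or 'No' on the first line, then explain your reasoning."
)

def evaluate_pair(requirement_text: str, test_text: str, gherkin: str) -> dict:
    """Judge one requirement/test pair; returns a verdict plus justification text."""
    user_msg = (
        f"Requirement:\n{requirement_text}\n\n"
        f"Test (raw steps):\n{test_text}\n\n"
        f"Test (Gherkin intent):\n{gherkin}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": EVAL_SYSTEM},
                  {"role": "user", "content": user_msg}],
        temperature=0,
    )
    text = response.choices[0].message.content
    verdict, _, reasoning = text.partition("\n")
    return {"connected": verdict.strip().lower().startswith("yes"),
            "justification": reasoning.strip()}
```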
Key Technical Decision: Diagnosing the Embedding Space
The most consequential engineering judgment in the project was what happened before the final architecture was committed — discovering that the embeddings weren’t working as expected, then redesigning accordingly.
The Style-Dominance Problem
Before committing to embedding-based matching, I validated the embedding space by running K-Means clustering (k=6) with t-SNE visualization on the 467 requirement embeddings. The clusters showed clear separation — but not along functional or domain lines. They separated by authorship style.
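The diagnostic itself was only a few lines of scikit-learn. Roughly this, assuming the 467 requirement vectors are stored as a NumPy array, and using matplotlib (not part of the core stack) for the visual check:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# (467, 1536) array of ada-002 vectors, one row per requirement (path illustrative)
req_embeddings = np.load("requirement_embeddings.npy")

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(req_embeddings)

# Project to 2-D to inspect what the clusters actually separate
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(req_embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
plt.title("Requirement embeddings, K-Means k=6")
plt.show()
# Re-coloring the same points by release cohort made the problem visible:
# clusters tracked authorship cohort, not ESP component or domain.
```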
The 467 requirements came from three release cohorts with measurably different writing patterns:
- Release 2.4.1 (49%): Formulaic — “As a [User] I must be able to… which I perform [daily/as needed].” 90% used “must,” 38% included the trailing frequency clause.
- Release 3.0.0 (22%): Similar template, slightly looser — 73% “must,” 55% trailing clause.
- Release 3.1.0 (25%): Distinct authorship — used “should be able to” (never in 2.4.1), rarely included the trailing clause (11%), more descriptive prose.
The embedding model was capturing authorship fingerprints more strongly than functional semantics. Two requirements about different ESP components, both written in the “I must be able to… which I perform daily” template, clustered closer together than two requirements about the same component written by different authors.
The Response: Targeted NLP Preprocessing
I built a preprocessing pipeline specifically to strip authorship signal before re-embedding. The approach used spaCy for tokenization and NLTK's Snowball stemmer for aggressive stemming — collapsing "must be able to," "should be able to," and "I can" into similar stem patterns. Critically, I preserved capitalized domain entities (Protocol, Workflow, Sample, Entity) during stemming — the terms that actually distinguish one requirement from another survived; the boilerplate that distinguished one author from another was stripped.
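A minimal sketch of that normalization pass; the protected-entity list and the exact rules in the production pipeline were broader than shown here:

```python
import spacy
from nltk.stem.snowball import SnowballStemmer

nlp = spacy.load("en_core_web_sm")
stemmer = SnowballStemmer("english")

# Domain terms to protect from stemming (illustrative subset)
DOMAIN_ENTITIES = {"Protocol", "Workflow", "Sample", "Entity"}

def normalize_requirement(text: str) -> str:
    """Strip authorship boilerplate: stem everything except protected domain
    entities, so 'must be able to' and 'should be able to' collapse toward
    the same token patterns while 'Protocol' stays 'Protocol'."""
    tokens = []
    for tok in nlp(text):
        if tok.is_punct or tok.is_space:
            continue
        if tok.text in DOMAIN_ENTITIES:
            tokens.append(tok.text)                  # keep domain terms intact
        else:
            tokens.append(stemmer.stem(tok.text.lower()))
    return " ".join(tokens)
```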
I then re-ran clustering on the preprocessed embeddings and tested nearest-neighbor matching quality. The preprocessing reduced style dominance but didn’t eliminate it. Named entity overlap scoring (boosting pairs that shared domain terms) further narrowed the gap but remained insufficient as a standalone ranking signal.
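The entity-overlap boost was essentially a set-intersection score layered on top of cosine similarity. A sketch with an illustrative extraction heuristic and weighting:

```python
import re

def domain_terms(text: str) -> set:
    """Crude extraction of capitalized domain terms (illustrative heuristic)."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def boosted_score(cosine_sim: float, req_text: str, test_text: str,
                  weight: float = 0.1) -> float:
    """Nudge the ranking toward pairs that share domain vocabulary."""
    shared = domain_terms(req_text) & domain_terms(test_text)
    union = domain_terms(req_text) | domain_terms(test_text)
    overlap = len(shared) / len(union) if union else 0.0   # Jaccard overlap
    return cosine_sim + weight * overlap
```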
The Architectural Implication
This iterative evaluation — cluster → discover style dominance → preprocess → re-embed → measure improvement → find it insufficient — led directly to the two-stage architecture. Embeddings, even imperfect ones, could reliably exclude the bottom 99% of candidates. They couldn’t reliably rank the top 1%. GPT could.
The math made this necessary: 467 × 8,406 = 3.9 million GPT evaluations at full scale; 467 × 50 = 23,350 after embedding filtering. The architecture was a direct consequence of empirical measurement, not a starting assumption.
Implementation
Data Engineering
All three data sources required custom parsing. Requirements came from a SpiraTeam XML export with metadata in non-obvious custom fields. Test cases came from Excel workbooks with multi-row structures — a single logical test could span dozens of rows with interleaved preconditions, steps, and expected results. The trace matrix linking the two was the third source.
I built a recursive object model with a state-machine parser: each object type (TestSuite → TestSet → Test → TestStep; Requirements → Capability → Requirement) signaled completion via an EndOfObject exception. A central orchestrator maintained bidirectional cross-references across all three data hierarchies and identified orphans — tests in the trace matrix that didn’t exist in the test suite, and vice versa.
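A condensed sketch of the completion-signaling pattern for two levels of the hierarchy (the real model covered all object types and far more field handling); the parent catches EndOfObject to learn where its child stopped reading:

```python
class EndOfObject(Exception):
    """Raised by a child parser when the next row belongs to an ancestor object."""
    def __init__(self, next_index):
        self.next_index = next_index

class Test:
    def __init__(self, name):
        self.name = name
        self.steps = []

    def parse(self, rows, i):
        """Consume step rows until a row opens a new Test, TestSet, or TestSuite."""
        while i < len(rows):
            row = rows[i]
            if row["kind"] in ("test", "test_set", "test_suite"):
                raise EndOfObject(i)          # hand control back to the parent
            self.steps.append((row.get("action"), row.get("expected")))
            i += 1
        raise EndOfObject(i)

class TestSet:
    def __init__(self, name):
        self.name = name
        self.tests = []

    def parse(self, rows, i):
        """Collect Test children until a row opens another TestSet or TestSuite."""
        while i < len(rows):
            row = rows[i]
            if row["kind"] != "test":         # not a Test header: this set is complete
                raise EndOfObject(i)
            child = Test(row["name"])
            self.tests.append(child)
            try:
                child.parse(rows, i + 1)
            except EndOfObject as done:
                i = done.next_index           # resume where the child stopped
        raise EndOfObject(i)
```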
Batch Processing at Scale
Gherkin generation and embedding computation ran as batch JSONL pipelines through OpenAI’s async parallel request processor, with configurable rate limiting and exponential backoff retry. Token management used tiktoken — prompts exceeding the context limit fell back to shorter system context versions. Total: 8,873 embedding vectors computed (8,406 tests + 467 requirements).
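The production batches went through OpenAI's published async parallel-request processor; the sketch below shows only the two behaviors called out above, the tiktoken budget check with a shorter-context fallback and exponential-backoff retries, in simplified synchronous form with an illustrative token budget:

```python
import time
import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_PROMPT_TOKENS = 3500   # illustrative budget below the model's context limit

def build_prompt(full_context: str, short_context: str, test_text: str) -> str:
    """Fall back to the shorter system context when the full one won't fit."""
    prompt = full_context + "\n\n" + test_text
    if len(encoding.encode(prompt)) > MAX_PROMPT_TOKENS:
        prompt = short_context + "\n\n" + test_text
    return prompt

def call_with_backoff(messages, model="gpt-3.5-turbo", retries=5):
    """Retry transient API failures with exponential backoff."""
    delay = 1.0
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```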
Validation Framework
I built a command-line tool for human-vs-GPT agreement measurement. An engineer reviewed test-requirement pairs interactively, and the tool computed a confusion matrix (precision, recall, F1) treating “Disconnect” as the positive class. This provided quantitative confidence in the automated evaluations and surfaced systematic disagreements between GPT judgment and human expert assessment.
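The agreement metrics reduce to a small amount of arithmetic once the labels are collected; a sketch with "Disconnect" as the positive class:

```python
def agreement_metrics(human_labels, gpt_labels):
    """Precision/recall/F1 treating 'Disconnect' as the positive class.

    Both inputs are parallel lists of 'Disconnect' / 'Stay Connected' strings:
    the reviewer's call and the model's recommendation for the same pairs.
    """
    POS = "Disconnect"
    tp = sum(h == POS and g == POS for h, g in zip(human_labels, gpt_labels))
    fp = sum(h != POS and g == POS for h, g in zip(human_labels, gpt_labels))
    fn = sum(h == POS and g != POS for h, g in zip(human_labels, gpt_labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```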
Output and Integration
Final outputs were Excel trace matrices with GPT recommendations pre-sorted into Disconnect/Stay Connected/New Connections columns, with the GPT’s reasoning as justification text. A SpiraTeam REST API integration handled reading test case data; bulk write-back used SpiraTeam’s Excel import workflow.
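A sketch of the write-out step using pandas and openpyxl; for simplicity it splits the three recommendation buckets across sheets, whereas the delivered workbooks grouped them as columns within a single matrix:

```python
import pandas as pd

def write_review_workbook(recommendations, path="trace_matrix_review.xlsx"):
    """recommendations: list of dicts with requirement_id, test_id, verdict,
    and justification fields, where verdict is one of the three buckets."""
    df = pd.DataFrame(recommendations)
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for verdict in ("Disconnect", "Stay Connected", "New Connections"):
            bucket = df[df["verdict"] == verdict]
            bucket.to_excel(writer, sheet_name=verdict, index=False)
```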
What Failed First
The initial approach was a multi-turn GPT conversation for each test case: generate a Gherkin translation, overcome model reluctance to speculate on project-specific requirements, synthesize “ideal requirements” the test should validate, then evaluate the existing mapping against that synthesized reference. This semantic triangulation produced qualitatively strong results — GPT caught real problems in existing mappings.
At 13 minutes per test case with two parallel processes running overnight, I had completed 158 of 8,406 test cases. Estimated total runtime: 78 days of continuous API calls. Beyond the time problem, the approach couldn’t discover missing connections — it could only evaluate existing mappings. The pivot to embeddings solved both constraints simultaneously.
Results
Measured Outcomes
- 7,750 test-requirement pairs evaluated — covering 99% of the 7,822 existing trace matrix mappings
- 8,406 Gherkin scenarios generated, providing reusable business-intent translations of each test
- 8,873 embedding vectors computed across the full test and requirement corpora
- Completed in approximately two weeks against a 3–4 week manual projection for six engineers
Quality
The QA lead commended the output quality as exceeding manual review standards — specifically the consistency and depth of justification text. The automated system applied identical evaluation criteria across all 7,750 pairs; manual review degrades in consistency as reviewer fatigue accumulates.
The nearest-neighbor retrieval stage discovered missing connections that the manual workflow — which started from existing mappings — would never surface. For any requirement, the system could surface the 50 most semantically similar tests regardless of their current trace matrix status.
The Gherkin translation step exposed a category of mapping error that’s easy to miss manually: tests that verify a feature exists rather than functions correctly. A test confirming a UI element renders doesn’t validate a requirement about data integrity, even if both reference the same feature. The system flagged dozens of these.
Limitations and What I’d Do Next
The embedding quality ceiling was the primary constraint. Even with targeted NLP preprocessing, ada-002 captured authorship style more strongly than functional semantics for requirements with formulaic templates. Better options would include:
- Domain-specific fine-tuning: Fine-tune an embedding model on requirement-test pairs where the correct mapping is known, so the model learns what “semantically validates” means in this domain.
- Reranker models: Use a cross-encoder reranker (e.g., a fine-tuned BERT variant) as the Stage 1/Stage 2 bridge — more precise than cosine similarity on raw embeddings, less expensive than GPT on all 3.9M pairs (sketched after this list).
- Formal benchmark: The human-vs-GPT validation tool was used informally. A systematic benchmark — stratified sample across requirement cohorts and components, reviewed by multiple engineers — would quantify precision/recall more rigorously and enable model comparison.
- Active learning loop: Use engineer accept/reject decisions on GPT recommendations as a feedback signal to improve retrieval and evaluation over successive trace matrix audits.
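As an illustration of the reranker idea above, a minimal sketch using the sentence-transformers CrossEncoder API. Neither the library nor the checkpoint named here was part of the original stack, and the bullet's actual proposal is a model fine-tuned on known-correct requirement/test pairs:

```python
from sentence_transformers import CrossEncoder

# Generic pretrained checkpoint as a placeholder; a domain fine-tune would replace it
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(requirement_text, candidate_tests, keep=10):
    """Re-score Stage 1 candidates with a cross-encoder and keep the top few,
    shrinking the set that needs a full GPT evaluation."""
    pairs = [(requirement_text, t) for t in candidate_tests]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_tests, scores), key=lambda x: -x[1])
    return ranked[:keep]
```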
The current system was designed for a one-time audit under time pressure. The architecture would need persistence and a cleaner API surface to become an ongoing quality control tool embedded in the development workflow.
Technical Stack
| Area | Tools |
|---|---|
| Language | Python 3 |
| NLP | spaCy (en_core_web_sm), NLTK (Snowball stemmer, Punkt tokenizer) |
| LLM | GPT-3.5-turbo (Gherkin + initial evaluation), GPT-4 (final evaluation) |
| Embeddings | text-embedding-ada-002 (1,536 dimensions) |
| Data | pandas, openpyxl, NumPy |
| Clustering / Analysis | scikit-learn (K-Means, PCA, t-SNE), SciPy |
| Dev Environment | Emacs org-mode with ob-jupyter |
| Integration | SpiraTeam REST API (read), Excel import (write-back) |
Nelson Love · Quality Engineer → Applied ML Systems · L7 Informatics, 2023