Sovel

Detection Method · 04

We benchmark the gap engine against 3.7 million real work orders before any customer sees a finding.

Most vertical-AI products benchmark on synthetic tasks. Industrial maintenance has a decade of published academic work that has already labeled maintenance text at scale — FMUCD, MaintKG, MaintIE, MaintNet, the Baltimore PM corpus — and has done the taxonomy work to make precision and recall measurable against real work-order narratives. Sovel uses those corpora as an offline evaluation harness for every detection rule and every capture-structuring skill the product ships. Published precision and recall numbers are a requirement, not a marketing artifact.

Evaluation pipeline built against FMUCD (DOI 10.17632/cb8d2nsjss.1, CC BY 4.0), the MaintKG / MaintIE / MaintNet academic corpus line, and the Baltimore PM public dataset of approximately 87,000 work-order instances.

Why this is hard

  • FMUCD work orders: 3.69M. Facility Maintenance Unstructured Corpus Dataset. DOI 10.17632/cb8d2nsjss.1. CC BY 4.0.

  • Baltimore PM instances: ~87K. Public municipal maintenance corpus. License and scope confirmed pre-integration.

  • Ontology + edges: MaintKG. Academic knowledge-graph line (MaintIE, MaintNet, MaintKG). Sovel uses the schemas; publishes methods on top.

Fig 1. The eval harness pipeline. Corpus slices feed both the detector training and a held-out evaluation pool. Synthetic labels (pink) are recovered from the held-out pool; live retainer telemetry (right) re-validates the floor against real plant conditions.

How we detect it

The harness is a reproducible pipeline. Each detection rule or capture skill is evaluated by injecting a labeled synthetic signal into a corpus slice (override, handover gap, concentration risk, shadow work), running the detector, and measuring precision, recall, and false-escalation rate against the held-out labels. Results are published with every major release and are visible on this page. A minimal sketch of this loop appears after the list below.

  • Reproducible corpus pipeline

    FMUCD, MaintKG edges, and Baltimore PM are pinned to exact revisions. Every evaluation run is tagged with the corpus revision, the detector version, and the seeding strategy for synthetic labels.

  • Synthetic-label injection

    Labels for override, handover gap, and concentration risk are seeded via controlled text perturbations of real WO narratives, preserving statistical structure while creating recoverable ground truth.

  • Held-out evaluation slice

    Twenty percent of each corpus is held out for evaluation and never seen during detector tuning. Results on the held-out slice are the published numbers.

  • Live retainer telemetry

    Offline corpus results are the floor. Live retainer plants contribute reviewer-labeled ground truth under agreement, so precision and recall are re-measured on actual plant data with reviewer-approved redactions.
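
A minimal sketch of that loop in Python, assuming hypothetical corpus, perturbation, and detector interfaces. The shipped harness is internal; the phrase pool and split logic below are illustrative only.

```python
import hashlib
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRun:
    corpus_revision: str   # pinned corpus revision (e.g. an FMUCD snapshot tag)
    detector_version: str  # shipped detector build under test
    seed_strategy: str     # how synthetic labels were injected
    seed: int              # RNG seed, so the run re-derives deterministically

def inject_override_label(narrative: str, rng: random.Random) -> str:
    """Controlled perturbation: seed a recoverable 'override' signal into a
    real work-order narrative while preserving its statistical structure.
    The phrase pool is a stand-in, not the production seeding strategy."""
    phrases = ["operator bypassed interlock", "override approved by supervisor"]
    return f"{narrative} {rng.choice(phrases)}"

def evaluate(run: EvalRun, narratives: list[str], detector) -> dict:
    rng = random.Random(run.seed)

    # Deterministic ~20% held-out slice: hashing each narrative (rather than
    # shuffling) keeps the slice stable across runs and out of detector tuning.
    held_out = [n for n in narratives
                if int(hashlib.sha256(n.encode()).hexdigest(), 16) % 5 == 0]

    # Seed synthetic positives into half the slice; the rest stay negative.
    positives = [inject_override_label(n, rng) for n in held_out[::2]]
    negatives = held_out[1::2]

    tp = sum(bool(detector(n)) for n in positives)
    fp = sum(bool(detector(n)) for n in negatives)
    fn = len(positives) - tp

    return {
        "run": run,  # every published number is keyed to this run record
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

False-escalation rate is not computable offline in this sketch; it comes from reviewer rejections in live telemetry, as the benchmark table below reports.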

What reviewers see

Product surface · Methods dashboard

The Sovel methods dashboard: per-method precision, recall, and false-escalation rate plotted over release history, with each data point tagged to the corpus revision and detector version that produced it.

How we benchmark it

The harness evaluates each detection method against the same three-corpus floor. Numbers below are from the 2026-04 evaluation pass. The Operator-Override and Shift-Handover methods have dedicated method pages with per-method results; the numbers below aggregate across the full detector set (override, handover, concentration).

Benchmark results for the offline evaluation harness

| Metric | Method | Dataset / Corpus | Result |
|---|---|---|---|
| Aggregate precision | Mean precision across the override, handover, and concentration detectors | FMUCD held-out 20% + Baltimore PM + seeded labels | 0.84 |
| Aggregate recall | Mean recall across the same detector set | FMUCD held-out 20% + Baltimore PM + seeded labels | 0.79 |
| False-escalation rate | Share of escalated issues that reviewers reject as non-issues | Internal pilot telemetry + reviewer-labeled corpus, n = 1,800 | ~6% |
| Reproducibility | Corpus revision + detector version → deterministic re-run | All corpora, all detectors | Pinned + tagged |

Aggregate numbers compress detail — see the per-method pages for the disaggregated results. Live retainer telemetry, published under agreement with reviewer-approved redactions, is the ground truth that offline corpus numbers are calibrated against.
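
For concreteness, the arithmetic behind the aggregate rows, with placeholder per-method numbers (the published per-method results live on the method pages):

```python
# Macro-average: each detector contributes equally to the aggregate,
# regardless of how many work orders it was evaluated on.
# The per-method values here are placeholders, not published results.
per_method = {
    "operator_override":  {"precision": 0.86, "recall": 0.81},
    "shift_handover":     {"precision": 0.83, "recall": 0.78},
    "concentration_risk": {"precision": 0.83, "recall": 0.78},
}

agg_precision = sum(m["precision"] for m in per_method.values()) / len(per_method)
agg_recall    = sum(m["recall"]    for m in per_method.values()) / len(per_method)

# False-escalation rate: of everything the detectors escalated, the share
# reviewers rejected as non-issues. The split below is illustrative.
escalated, rejected = 1800, 108
false_escalation_rate = rejected / escalated  # 0.06, i.e. ~6%
```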

Positioning against adjacent tooling

Academic NLP on maintenance text has been producing labeled corpora and ontologies for a decade. Commercial vendors have mostly ignored them. Sovel uses them as the floor evaluation substrate for every shipped detector.

Sovel positioning against adjacent tooling

| Adjacent tooling | Their lane | Sovel lane |
|---|---|---|
| Academic NLP (MaintKG, MaintIE, MaintNet, FMUCD) | Label, structure, and publish maintenance-text corpora. High-quality NLP, no operational product, no reviewer workflow, no destination artifact. | Reviewer-governed workflow on top of exactly this signal. Uses the corpora as the offline eval harness. Complement, not competition. |
| CMMS-AI vendors (MaintainX RCA, IBM Maximo Assist) | Ship AI features with marketing claims. Published precision or recall numbers are generally absent. | Publish precision and recall against FMUCD + Baltimore PM on every major release. Buyer-side defensibility is the point. |
| Horizontal AI "benchmark leaderboards" | Synthetic benchmarks optimized for generalization (SWE-Bench, MMLU, etc.). Industry-wide research (SWE-Bench Illusion) documents 23–25% real-world collapse from these scores. | Domain-specific corpora seeded with domain-specific labels. Evaluated on the signal we actually ship a detector for, not on an adjacent task the buyer does not care about. |

Frequently asked questions

Can we see the raw eval numbers?
Yes. Per-method precision and recall appear on each method's page (operator-override, shift-handover, roi-gated). The aggregate numbers appear on this page. Raw evaluation artifacts are available to pilot and retainer plants under a standard data-use agreement.
Are the public corpora really representative of our plant?
The three corpora together span facility maintenance, academic maintenance annotation, and municipal maintenance — three structurally different narrative styles. The offline floor is deliberately broad. Live retainer telemetry re-measures on your actual plant data and is the number that matters for your deployment.
Do you retrain on our data?
The detector weights are shared across the Sovel fleet. Your narrative text never enters training without explicit agreement and reviewer-approved redaction. Reviewer telemetry — approvals, edits, rejections, reason codes — trains the Correction Inference Engine at a per-reviewer scope, bounded to your tenant.
What licenses cover the eval corpora?
FMUCD is CC BY 4.0. The MaintKG / MaintIE / MaintNet academic line is permissively licensed under the terms of its original publications. Baltimore PM is a public municipal dataset with scope and license confirmed pre-integration. No proprietary or customer data enters the public-facing eval harness.
How often do the published numbers update?
Each major release. Every detector version and corpus revision is tagged, so any published number is deterministically reproducible from the tagged artifacts.
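
As a sketch of what "pinned and tagged" buys, consider a hypothetical run manifest (field names and values illustrative) whose content hash serves as the tag a published number is keyed to:

```python
import hashlib
import json

# Hypothetical manifest shape; field names and values are illustrative.
manifest = {
    "release": "2026-04",
    "corpora": {
        "FMUCD": "<pinned revision>",        # DOI 10.17632/cb8d2nsjss.1
        "Baltimore PM": "<pinned revision>",
        "MaintKG edges": "<pinned revision>",
    },
    "detector_version": "<detector build tag>",
    "seed_strategy": "<synthetic-label seeding strategy>",
    "seed": 0,
}

# Hash the canonicalized manifest: identical inputs yield an identical tag,
# so any published number can be traced to, and re-run from, its artifacts.
run_tag = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode()
).hexdigest()[:12]
```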

Where this method came from

The evaluation harness exists because published vertical-AI products in industrial maintenance have mostly avoided publishing precision and recall numbers. The academic NLP-on-maintenance-text line — FMUCD, MaintKG, MaintIE, MaintNet, Baltimore PM — has been sitting in plain sight for a decade, labeled and ready to use.

Building detection products on top of those corpora and refusing to publish the eval numbers was a choice vendors made. Sovel makes the opposite choice. The offline eval numbers are a buyer-defensibility artifact, not a marketing one — they are what lets a reliability engineer answer “is this better than what we would have built internally?” without taking the vendor’s word for it.

Where this method is going

The next release cycle adds two items to the harness: a reviewer-labeled ground-truth corpus from the first two retainer plants (under agreement, redacted to publishable form), and a per-method calibration table showing how offline corpus numbers translate to live plant performance. The target is reviewer-defensible numbers, not dashboard numbers.

Test the method.

Run the diagnostic on your own work-order export. 48-hour turnaround, no data migration, no seat licenses.

Request your diagnostic