Benchmark

Correct facts over maximum answer rate.

PolDex abstains or surfaces conflicts instead of fabricating certainty. Schema-valid is not the same as correct.

Benchmark Philosophy

We optimize for the right things.

Correct facts over maximum answer rate

An abstention on an ambiguous field is better than a confident wrong answer.

Evidence-backed output over unsupported confidence

Every returned fact must be traceable to a specific location in the source document.

Conflict visibility over false flattening

When a base policy and an endorsement contradict each other, both values surface with an explicit conflict, not a silently resolved value.
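To make "both values surface" concrete, here is one way such an output could be shaped. Every class and field name below is an illustrative assumption, not PolDex's actual schema:

```python
from dataclasses import dataclass

# Hypothetical output shape: each asserted value keeps its own evidence
# pointer, and a conflict is flagged explicitly instead of being resolved.

@dataclass
class Evidence:
    page: int
    section: str
    citation: str  # verbatim snippet from the source document

@dataclass
class ExtractedValue:
    value: str
    evidence: Evidence

@dataclass
class FieldResult:
    field: str
    values: list   # one entry per source that asserts a value
    conflict: bool # True when sources disagree; no silent resolution

# A base-policy/endorsement contradiction surfaces as two values plus a flag:
result = FieldResult(
    field="general_aggregate_limit",
    values=[
        ExtractedValue("2,000,000",
                       Evidence(3, "Declarations",
                                "GENERAL AGGREGATE LIMIT $2,000,000")),
        ExtractedValue("4,000,000",
                       Evidence(41, "Endorsement CG 25 03",
                                "Aggregate limit amended to $4,000,000")),
    ],
    conflict=True,
)
```

The key design point is that the consumer, not the extractor, decides how to resolve the disagreement.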

Low material error rate over pretty demos

Benchmark documents include messy scans, unusual formatting, and edge-case endorsements — not just clean template policies.

What PolDex Measures

Four quality dimensions.

1. Field accuracy

The rate at which the extracted value matches the ground truth for that field, measured across a held-out evaluation set.

2. Evidence precision

The rate at which the evidence pointer (page, section, citation) correctly identifies the source of the extracted value.

3. Conflict detection rate

The rate at which known contradictions between document sections are identified and surfaced.

4. Abstention discipline

The rate at which PolDex correctly returns an unknown state rather than hallucinating a value for an ambiguous field.
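The four dimensions above can all be expressed as simple rates over a labeled evaluation set. The sketch below is illustrative only; the record keys and the `<unknown>` abstention sentinel are assumptions, not PolDex's actual harness:

```python
# Illustrative scoring sketch for the four benchmark dimensions.
# Each record carries: the predicted value ("<unknown>" for abstention),
# the gold value, whether the evidence pointer matched, whether the field
# is ambiguous, whether the document has a known conflict for this field,
# and whether a conflict was flagged.

def score(records):
    answered = [r for r in records if r["pred"] != "<unknown>"]
    # 1. Field accuracy: correct values among answered fields.
    accuracy = sum(r["pred"] == r["gold"] for r in answered) / len(answered)
    # 2. Evidence precision: correct pointers among answered fields.
    evidence_precision = sum(r["evidence_ok"] for r in answered) / len(answered)
    # 3. Conflict detection rate: known conflicts that were surfaced.
    conflicted = [r for r in records if r["has_conflict"]]
    conflict_detection = (sum(r["flagged_conflict"] for r in conflicted)
                          / len(conflicted))
    # 4. Abstention discipline: ambiguous fields answered with "<unknown>".
    ambiguous = [r for r in records if r["ambiguous"]]
    abstention = (sum(r["pred"] == "<unknown>" for r in ambiguous)
                  / len(ambiguous))
    return accuracy, evidence_precision, conflict_detection, abstention
```

Note that accuracy is computed over answered fields only; abstentions are scored separately, so abstaining neither inflates nor hides error.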

Document Coverage

Ugly documents are not excluded.

PolDex benchmark documents include poorly OCR'd scans, mid-cycle endorsement stacks, non-standard certificate formats, and carrier-specific layouts, not just clean template policies.

Commercial GL

Base policy, CG forms, AI endorsements, schedule of locations, policy jacket

Commercial Auto

Fleet schedules, certificates, driver rosters, MCS-90, hired/non-owned

Commercial Property

BPP, BI/EE, commercial building, blanket vs specific limits, co-insurance

Workers Comp

NCCI forms, experience mod worksheets, payroll class schedules

Umbrella / Excess

Following-form coverage, aggregate limits, schedules of required underlying insurance

Professional

E&O, D&O, cyber, claims-made, retro dates, consent-to-settle

Hard-Case Testing

The difficult documents matter most.

Multi-endorsement stacks

Policies with 30+ endorsements where each endorsement modifies a previous one. Effective date precedence must be correctly applied.
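Effective-date precedence can be pictured as replaying the endorsement stack in date order, so a later endorsement supersedes an earlier one for the same field. This is a minimal sketch under assumed data shapes, not PolDex internals:

```python
from datetime import date

# Illustrative: each endorsement is an (effective_date, overrides) pair,
# where overrides maps field names to new values.

def resolve(base_values, endorsements):
    resolved = dict(base_values)
    # Apply endorsements in effective-date order; a later date wins.
    for effective, changes in sorted(endorsements, key=lambda e: e[0]):
        resolved.update(changes)
    return resolved

policy = resolve(
    {"aggregate_limit": "2,000,000"},
    [
        # Deliberately out of document order: precedence must come from
        # the effective dates, not from page order in the stack.
        (date(2024, 6, 1), {"aggregate_limit": "4,000,000"}),
        (date(2024, 3, 1), {"aggregate_limit": "3,000,000"}),
    ],
)
```

The June endorsement supersedes the March one even though it appears first in the list, which is exactly the behavior a 30-endorsement stack exercises.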

Conflicting coverage language

The base policy and an endorsement state different aggregate limits. PolDex must surface both values and identify which supersedes the other.

Poor OCR quality

Real-world scanned documents with noise, rotation, and incomplete OCR output. PolDex operates on extracted text, not raw pixels.

Non-standard layout

Carrier-specific forms that do not follow ACORD or standard ISO structure. Field labels are inconsistent or absent.

Methodology

Honest, bounded claims.

Benchmark results are based on held-out evaluation sets annotated by domain experts. Results are reported per-field, per-document-family, and segmented by document quality tier.

PolDex does not report a single headline accuracy number. Accuracy is field-specific — aggregate limit extraction is not the same difficulty as additional insured identification.
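Per-field, per-tier reporting amounts to bucketing accuracy by field and document quality tier instead of pooling everything into one number. A minimal sketch, with hypothetical record keys:

```python
from collections import defaultdict

# Illustrative: segmented accuracy keyed by (field, quality tier),
# so "aggregate limit on a clean policy" and "aggregate limit on a
# degraded scan" report as separate numbers.

def segmented_accuracy(records):
    buckets = defaultdict(lambda: [0, 0])  # (field, tier) -> [correct, total]
    for r in records:
        key = (r["field"], r["tier"])
        buckets[key][1] += 1
        buckets[key][0] += r["pred"] == r["gold"]
    return {k: correct / total for k, (correct, total) in buckets.items()}
```

A single pooled average over these buckets would hide exactly the variation this methodology is meant to expose.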

Full benchmark methodology available to enterprise buyers under NDA.

Request Benchmark Report →

Validate quality on your documents.

Set up API access and test extraction on your actual document corpus.