One score hides too much
An extraction can look successful while missing a required field or attaching weak evidence. A single aggregate score can hide the exact failure that matters in production.
PolDex tracks multiple dimensions: document pass rate, required-field score, exact-label score, evidence score, corpus count, evaluated documents, blocked reason, and publication state. Each dimension answers a different reliability question.
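As a rough illustration of why the dimensions are kept separate, the sketch below stores them in one record and compares a plain average with a per-dimension check. The class name, field names, the 0.9 threshold, and the sample numbers are all assumptions made for this example, not PolDex's actual data model or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaReport:
    """Hypothetical per-schema benchmark record (field names are assumed)."""
    document_pass_rate: float
    required_field_score: float
    exact_label_score: float
    evidence_score: float
    corpus_count: int
    evaluated_documents: int
    blocked_reason: Optional[str]
    publication_state: str

    def scores(self) -> dict:
        # Only the four ratio-style dimensions are comparable to a threshold.
        return {
            "document_pass_rate": self.document_pass_rate,
            "required_field_score": self.required_field_score,
            "exact_label_score": self.exact_label_score,
            "evidence_score": self.evidence_score,
        }


def aggregate(report: SchemaReport) -> float:
    """Single averaged score, shown only to demonstrate what it hides."""
    values = list(report.scores().values())
    return sum(values) / len(values)


def failing_dimensions(report: SchemaReport, threshold: float = 0.9) -> List[str]:
    """Names of the dimensions that fall below the (assumed) threshold."""
    return [name for name, value in report.scores().items() if value < threshold]


# Made-up numbers: three perfect dimensions and weak evidence.
report = SchemaReport(1.0, 1.0, 1.0, 0.7, 1, 40, None, "draft")
print(round(aggregate(report), 3))
print(failing_dimensions(report))
```

With these sample values the average clears the threshold even though the evidence dimension does not, which is the failure a single score would hide.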
Labels must be source-verifiable
Gold labels are only useful if they can be tied to visible source text, tables, clauses, schedules, or declarations. If the label cannot be verified against the document, it should not become benchmark truth.
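One minimal way to picture this check is a function that admits a label only when its evidence quote actually appears in the document text. This is a sketch under simple assumptions: the function names are hypothetical, matching is a case-insensitive substring test after whitespace normalization, and the sample text is invented. Tables and scanned pages would need more than this.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_source_verifiable(evidence_quote: str, document_text: str) -> bool:
    """True only if a non-empty evidence quote is found in the document text."""
    quote = normalize(evidence_quote)
    return bool(quote) and quote in normalize(document_text)


# Invented sample text standing in for an extracted public document.
document_text = "Policy Period:\n   01/01/2024 to 01/01/2025"

print(is_source_verifiable("Policy period: 01/01/2024", document_text))
print(is_source_verifiable("Deductible: $500", document_text))
```

A label whose quote fails this kind of check would be left out of the gold set rather than treated as benchmark truth.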
This is especially important for public proof. PolDex builds its diagnostic corpora from real public documents, so the benchmark can be inspected without relying on claims about private customers.
Benchmarks become release gates
A benchmark is not just a marketing page. It is a release gate. New schema behavior should pass the current corpus before it becomes public-facing.
That turns regression testing into product discipline. Every hardened schema must keep passing as FastScript gains new readers, rules, and normalizers.
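The gate described above can be sketched as a check that every hardened schema still passes on the current corpus before a release proceeds. The function name, the schema names, and the shape of the results mapping are assumptions for illustration. A schema with no result is treated as failing, which is a deliberately conservative choice in this sketch.

```python
from typing import Dict, List, Set


def release_blockers(results: Dict[str, bool], hardened: Set[str]) -> List[str]:
    """Return hardened schemas that do not pass; an empty list means the gate is open.

    `results` maps a schema name to whether it passed the current corpus run.
    A hardened schema missing from `results` counts as a blocker.
    """
    return sorted(name for name in hardened if not results.get(name, False))


# Hypothetical run: one schema regressed and one was never evaluated.
results = {"schema_a": True, "schema_b": False}
hardened = {"schema_a", "schema_b", "schema_c"}

blockers = release_blockers(results, hardened)
if blockers:
    print("Release blocked by:", ", ".join(blockers))
else:
    print("All hardened schemas pass.")
```

Running a check like this on every change is what keeps a new reader, rule, or normalizer from silently breaking a schema that used to pass.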