Shōhyō (証憑 — the accounting term for source documents such as invoices and receipts) is a labeled evaluation dataset for measuring how well a document-to-JSON extractor performs on Japanese documents. The Japanese-specific logic — era dates, reduced (8%) versus standard (10%) tax, withholding, qualified-invoice (インボイス) rules, revenue stamps — is arithmetically verified on the answer side.
Building an extraction pipeline always hits the same wall: how do you measure accuracy? Real invoices can’t be used for evaluation because of PII, and hand-authoring gold JSON one record at a time doesn’t scale. Shōhyō stands in for that ground truth.
What it measures / doesn’t
It scores extraction correctness on clean text input: date normalization (era to Gregorian, relative dates), the reduced-vs-standard tax split, withholding amounts, revenue stamp, and full-width / symbol normalization — the spots where Japanese documents tend to trip extractors up.
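To make two of these normalizations concrete, here is a minimal sketch of full-width folding and era-to-Gregorian conversion. The function names, the regular expression, and the choice to cover only three eras are assumptions made for illustration; nothing here is taken from Shōhyō’s own code.

```python
import re
import unicodedata
from datetime import date
from typing import Optional

# Gregorian year immediately before year 1 of each era (illustrative subset).
ERA_BASE = {"令和": 2018, "平成": 1988, "昭和": 1925}

# Matches dates such as 令和6年4月1日 or 令和元年5月1日 after normalization.
ERA_DATE = re.compile(r"(令和|平成|昭和)(元|\d+)年(\d+)月(\d+)日")


def normalize_text(text: str) -> str:
    """Fold full-width digits, letters, and symbols to half-width via NFKC."""
    return unicodedata.normalize("NFKC", text)


def era_to_iso(text: str) -> Optional[str]:
    """Return the first era date found in text as YYYY-MM-DD, or None."""
    match = ERA_DATE.search(normalize_text(text))
    if match is None:
        return None
    era, year, month, day = match.groups()
    era_year = 1 if year == "元" else int(year)  # 元年 means year 1
    return date(ERA_BASE[era] + era_year, int(month), int(day)).isoformat()
```

For example, `era_to_iso("令和６年４月１日")` yields `"2024-04-01"`, with the full-width digits folded before the pattern is applied.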
It does not score OCR or layout analysis (image to text). The input is a document_text field, so accuracy at reading paper or scans is out of scope.
Contents (free sample)
- 20 invoices / 10 receipts, all synthetic — no real companies or personal data.
- Each record pairs document_text (a realistic body) with expected_output (gold JSON), plus difficulty and edge-case tags.
- A JSON Schema, a validator, and a scorer are included. The scorer breaks your extractor’s accuracy down by edge case rather than handing you one flat number.
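As a rough picture of what a per-edge-case breakdown looks like, the sketch below groups exact-match results by tag. It is not the bundled scorer: the field name `edge_case_tags` is a hypothetical stand-in for however the dataset stores its tags, and whole-record exact match is a deliberate simplification.

```python
from collections import defaultdict


def score_by_edge_case(records, predictions):
    """Return {tag: exact-match accuracy} over the records carrying each tag.

    records: list of dicts with "expected_output" and "edge_case_tags"
             (the tag field name is an assumption for this sketch).
    predictions: list of extractor outputs, in the same order as records.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record, predicted in zip(records, predictions):
        correct = predicted == record["expected_output"]
        for tag in record["edge_case_tags"]:
            totals[tag] += 1
            hits[tag] += int(correct)
    return {tag: hits[tag] / totals[tag] for tag in totals}
```

A record with several tags counts toward each of them, so one wrong answer on a hard document lowers every edge case it exercises.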
How it’s built (three-stage verification)
To make the answer side itself trustworthy, only records that clear three checks are included.
- Generate — author a document body with edge cases baked in, plus its answer.
- Independent audit — working from the text alone, check that each answer value is supported by the body, that the normalization is correct, and that no answer leaked into the body.
- Machine check — beyond schema validation, recompute the arithmetic for every record: line items, tax-rate buckets, totals, withholding, revenue stamp.
Records that don’t pass all three are dropped.
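The machine check can be pictured as recomputing every figure from the line items. The sketch below is an illustration only: the field names (`line_items`, `tax_buckets`, `tax_rate_percent`, `amount`, `subtotal`, `tax`, `total`) are hypothetical rather than the dataset’s actual schema, amounts are assumed to be tax-exclusive integers in yen, and tax is assumed to be rounded down once per rate bucket.

```python
def check_arithmetic(expected):
    """Recompute bucket subtotals, per-bucket tax, and the grand total.

    Returns a list of problems; an empty list means the record passes.
    """
    problems = []

    # Sum tax-exclusive line-item amounts per tax rate.
    recomputed = {}
    for item in expected["line_items"]:
        rate = item["tax_rate_percent"]
        recomputed[rate] = recomputed.get(rate, 0) + item["amount"]

    stated = {b["tax_rate_percent"]: b for b in expected["tax_buckets"]}
    if set(recomputed) != set(stated):
        problems.append("tax-rate buckets do not match the line-item rates")

    for rate, bucket in stated.items():
        if recomputed.get(rate, 0) != bucket["subtotal"]:
            problems.append(f"{rate}% subtotal mismatch")
        # Assumption: round down, once per bucket, using integer arithmetic.
        if bucket["subtotal"] * rate // 100 != bucket["tax"]:
            problems.append(f"{rate}% tax mismatch")

    total = sum(b["subtotal"] + b["tax"] for b in stated.values())
    if total != expected["total"]:
        problems.append("grand total mismatch")
    return problems
```

Rounding per bucket rather than per line matters: two 199-yen items at 8% give 31 yen of tax on the 398-yen bucket, whereas rounding each line down separately would give 30.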
When to use it
- Regression testing an in-house extractor — when you swap a model or a prompt, watch which edge-case score moved. “Only the reduced-tax split got worse” points you straight at the cause.
- A benchmark for accounting / finance SaaS — measure your extraction accuracy against a shared yardstick that doesn’t depend on your own data. In-house evaluation tends to grade itself too kindly.
- CI for an LLM app or agent — drop the scorer into your pipeline and gate releases on an extraction-accuracy threshold.
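The regression and CI uses above both reduce to comparing per-edge-case scores. Here is a minimal sketch, assuming the scores arrive as a plain dict mapping tag to accuracy; the function names, the tag names, and the threshold are placeholders, not part of the bundled scorer.

```python
def regressions(baseline, candidate, tolerance=0.0):
    """Return {tag: (old, new)} for tags whose accuracy fell by more than tolerance."""
    return {
        tag: (baseline[tag], candidate[tag])
        for tag in sorted(baseline)
        if tag in candidate and baseline[tag] - candidate[tag] > tolerance
    }


def gate(candidate, threshold):
    """Return the sorted tags scoring below threshold; an empty list means pass."""
    return sorted(tag for tag, accuracy in candidate.items() if accuracy < threshold)


def exit_code(candidate, threshold):
    """Map the gate result to a process exit status: 0 to pass, 1 to block."""
    return 1 if gate(candidate, threshold) else 0
```

A CI step could pass `exit_code(scores, 0.9)` to `sys.exit`, and print the output of `regressions` so that a failing build names the edge case that moved.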
Compared to rolling your own
Authoring ground truth yourself runs into the PII wall on real invoices, and even synthetic data leaves a question no one has checked: is the answer actually correct? Withholding and the reduced-tax split in particular are easy to get wrong by hand.
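To show why withholding is easy to slip on, here is a small worked sketch. It assumes the commonly cited rule for professional fees paid to individuals: 10.21% on the first 1,000,000 yen of a payment and 20.42% on the portion above that, with fractions of a yen dropped. Treat the rates, the function name, and the choice of tax-exclusive base as assumptions for illustration, not as a statement of what any particular record in the dataset uses.

```python
def withholding_tax(fee_yen: int) -> int:
    """Withholding on a single fee payment, in yen, fractions dropped.

    Assumed rates: 10.21% up to 1,000,000 yen, 20.42% on the excess.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    threshold = 1_000_000
    if fee_yen <= threshold:
        return fee_yen * 1021 // 10000
    # 102,100 yen is the tax on the first 1,000,000 yen at 10.21%.
    return 102_100 + (fee_yen - threshold) * 2042 // 10000
```

Under these assumptions a 100,000-yen fee withholds 10,210 yen and a 1,500,000-yen fee withholds 204,200 yen. Applying the lower rate to the whole of the larger fee is the kind of hand error an arithmetic check catches.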
What Shōhyō adds is an answer side hardened by the three-stage check. Because it stays on clean text input, your score doesn’t wobble with OCR quality — you see the accuracy of the extraction logic itself.
Get it
- Free sample (Hugging Face) — download now: dataset page.
- Full set (2,000–5,000 records per document type, commercial license) — coming soon. Indicative price $59 per type, $229 bundle.
- Build notes will follow in the Build section.
For evaluating extraction accuracy only. Not tax or accounting advice. Tax-rate buckets and the like are examples under specific assumptions; confirm real-world judgments with a professional.