Shōhyō (証憑 — the accounting term for source documents such as invoices and receipts) is a labeled evaluation dataset for measuring how well a document-to-JSON extractor performs on Japanese documents. The Japanese-specific logic — era dates, reduced (8%) versus standard (10%) tax, withholding, qualified-invoice (インボイス) rules, revenue stamps — is arithmetically verified on the answer side.

Building an extraction pipeline always hits the same wall: how do you measure accuracy? Real invoices can’t be used for evaluation because of PII, and hand-authoring gold JSON one record at a time doesn’t scale. Shōhyō stands in for that ground truth.

What it measures / doesn’t

It scores extraction correctness on clean text input: date normalization (era to Gregorian, relative dates), the reduced-vs-standard tax split, withholding amounts, revenue stamp, and full-width / symbol normalization — the spots where Japanese documents tend to trip extractors up.
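As a rough illustration of two of these normalizations, the sketch below folds full-width characters with Unicode NFKC and converts an era date to ISO 8601. The function names and the offset table are assumptions made for this example, not part of the dataset's tooling, and the sketch does not check that a date actually falls inside its era.

```python
import re
import unicodedata
from datetime import date

# Era year + offset = Gregorian year (Reiwa 1 = 2019, Heisei 1 = 1989, Showa 1 = 1926).
ERA_OFFSETS = {"令和": 2018, "平成": 1988, "昭和": 1925}


def normalize_text(text: str) -> str:
    """Fold full-width digits, letters, and symbols to their half-width forms."""
    return unicodedata.normalize("NFKC", text)


def era_to_iso(text: str) -> str:
    """Convert a date such as '令和6年4月1日' to an ISO 8601 string."""
    cleaned = normalize_text(text)
    match = re.search(r"(令和|平成|昭和)(元|\d+)年(\d+)月(\d+)日", cleaned)
    if match is None:
        raise ValueError(f"no era date found in {text!r}")
    era, year, month, day = match.groups()
    era_year = 1 if year == "元" else int(year)  # 元年 means year 1 of the era
    return date(ERA_OFFSETS[era] + era_year, int(month), int(day)).isoformat()


print(era_to_iso("令和６年４月１日"))   # full-width digits are folded first
print(normalize_text("￥１，０８０"))
```

An extractor that skips either step will typically return the right characters in the wrong form, which is exactly the kind of miss the edge-case tags are meant to surface.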

It does not score OCR or layout analysis (image to text). The input is a document_text field, so accuracy at reading paper or scans is out of scope.

Contents (free sample)

  • 20 invoices / 10 receipts, all synthetic — no real companies or personal data.
  • Each record pairs document_text (a realistic body) with expected_output (gold JSON) plus difficulty and edge-case tags.
  • A JSON Schema, a validator, and a scorer are included. The scorer breaks your extractor’s accuracy down by edge case rather than handing you one flat number.
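To make the per-edge-case breakdown concrete, here is a minimal sketch of how such a report could be computed. It is not the bundled scorer: the `edge_cases` field name, the tag names, and the exact-match criterion are assumptions for illustration.

```python
from collections import defaultdict


def score_by_edge_case(records, predictions):
    """Return {tag: share of records carrying that tag whose prediction equals gold}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record, predicted in zip(records, predictions):
        correct = predicted == record["expected_output"]
        for tag in record["edge_cases"]:  # assumed field name
            totals[tag] += 1
            hits[tag] += int(correct)
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Hypothetical records and extractor outputs.
records = [
    {"expected_output": {"total": 1080}, "edge_cases": ["reduced_tax"]},
    {"expected_output": {"issue_date": "2024-04-01"}, "edge_cases": ["era_date", "full_width"]},
    {"expected_output": {"total": 3300}, "edge_cases": ["reduced_tax"]},
]
predictions = [{"total": 1080}, {"issue_date": "2024-04-01"}, {"total": 3000}]

report = score_by_edge_case(records, predictions)
print(report)
```

Here the third prediction is wrong, so only the `reduced_tax` score drops while the date-related tags stay at 1.0.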

How it’s built (three-stage verification)

To make the answer side itself trustworthy, only records that clear three checks are included.

  1. Generate: author a document body with edge cases baked in, plus its answer.
  2. Independent audit: working from the text alone, check that every answer value is supported by the body, that the normalization is correct, and that no answer has leaked into the body.
  3. Machine check: beyond schema validation, recompute the arithmetic for every record: line items, tax-rate buckets, totals, withholding, revenue stamp.

Records that don’t pass all three are dropped.
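The machine check in step 3 can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's actual checker: the field names (`line_items`, `unit_price`, `quantity`, `tax_rate_percent`, `total`) are hypothetical, and tax is assumed to be computed once per rate bucket and rounded down.

```python
from collections import defaultdict


def recompute_totals(line_items):
    """Sum tax-exclusive amounts per rate bucket, then tax each bucket once (rounded down)."""
    subtotals = defaultdict(int)
    for item in line_items:
        subtotals[item["tax_rate_percent"]] += item["unit_price"] * item["quantity"]
    taxes = {rate: subtotal * rate // 100 for rate, subtotal in subtotals.items()}
    total = sum(subtotals.values()) + sum(taxes.values())
    return dict(subtotals), taxes, total


def arithmetic_ok(expected):
    """True when the gold total matches the total recomputed from the line items."""
    _, _, total = recompute_totals(expected["line_items"])
    return total == expected["total"]


# Hypothetical gold record: one reduced-rate (8%) item and one standard-rate (10%) item.
gold = {
    "line_items": [
        {"unit_price": 540, "quantity": 2, "tax_rate_percent": 8},
        {"unit_price": 10_000, "quantity": 1, "tax_rate_percent": 10},
    ],
    "total": 12_166,
}
subtotals, taxes, total = recompute_totals(gold["line_items"])
print(subtotals, taxes, total, arithmetic_ok(gold))
```

Using integer arithmetic throughout avoids floating-point drift, so a mismatch points at the record rather than at the checker.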

When to use it

  • Regression testing an in-house extractor: when you swap a model or a prompt, watch which edge-case score moved. “Only the reduced-tax split got worse” points you straight at the cause.
  • A benchmark for accounting / finance SaaS: measure your extraction accuracy against a shared yardstick that doesn’t depend on your own data. In-house evaluation tends to grade itself too kindly.
  • CI for an LLM app or agent: drop the scorer into your pipeline and gate releases on an extraction-accuracy threshold.
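A release gate of the kind described in the last item might look like the sketch below. The `gate` and `failing_tags` helpers, the 0.95 threshold, and the shape of the `scores` mapping are assumptions for illustration; the bundled scorer's real interface may differ.

```python
import sys


def failing_tags(scores, threshold):
    """Edge-case tags whose accuracy falls below the threshold, in sorted order."""
    return sorted(tag for tag, accuracy in scores.items() if accuracy < threshold)


def gate(scores, threshold=0.95):
    """Return a process exit code: 0 when every tag clears the threshold, 1 otherwise."""
    failed = failing_tags(scores, threshold)
    for tag in failed:
        print(f"FAIL {tag}: {scores[tag]:.2%} < {threshold:.2%}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical per-edge-case accuracies produced by a scoring run.
scores = {"reduced_tax": 0.90, "era_date": 1.0, "withholding": 0.97}
exit_code = gate(scores)
# In a CI step, finish with sys.exit(exit_code) so a regression fails the build.
```

Gating per tag rather than on one average keeps a large gain in one area from hiding a regression in another.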

Compared to rolling your own

Authoring ground truth yourself runs into the PII wall on real invoices, and even synthetic data leaves a question that usually goes unchecked: is the answer actually correct? Withholding and the reduced-tax split in particular are easy to get wrong by hand.
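A small worked example shows why withholding is easy to slip on. The sketch assumes the commonly cited rule for professional fees (10.21% up to ¥1,000,000, 20.42% on the portion above that, fractions of a yen dropped) applied to the tax-exclusive fee. These assumptions are illustrative only and are not taken from the dataset's rules.

```python
def withholding_tax(fee_yen: int) -> int:
    """Withholding on a tax-exclusive fee, dropping fractions of a yen.

    Assumed rule (illustrative only): 10.21% up to 1,000,000 yen,
    20.42% on the portion above that.
    """
    threshold = 1_000_000
    if fee_yen <= threshold:
        return fee_yen * 1021 // 10_000
    # 102,100 yen is the withholding on the first 1,000,000 yen.
    return (fee_yen - threshold) * 2042 // 10_000 + 102_100


fee = 100_000                      # tax-exclusive fee
consumption_tax = fee * 10 // 100  # standard rate, assumed for this example
withheld = withholding_tax(fee)
amount_due = fee + consumption_tax - withheld
print(withheld, amount_due)
```

Typical hand errors include applying the rate to the tax-inclusive amount or rounding instead of truncating, either of which shifts the amount due by a few yen.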

What Shōhyō adds is an answer side hardened by the three-stage check. Because it stays on clean text input, your score doesn’t wobble with OCR quality — you see the accuracy of the extraction logic itself.

Get it

  • Free sample (Hugging Face) — download now: dataset page.
  • Full set (2,000–5,000 records per document type, commercial license) — coming soon. Indicative price $59 per type, $229 bundle.
  • Build notes will follow in the Build section.

For evaluating extraction accuracy only. Not tax or accounting advice. Tax-rate buckets and the like are examples under specific assumptions; confirm real-world judgments with a professional.