EPITHRE.

Building an Indonesian regulatory corpus

A specialist model is only as good as the material it learns from. Building Strata for Indonesian government work meant first assembling something that did not exist in a usable form: a large, clean corpus of Indonesia's public regulatory record.

Where the text comes from

Indonesia publishes an enormous amount of regulation, but it is scattered. National statutes and government regulations sit on one portal; ministries and agencies keep their own legal-documentation systems; regional regulation lives across hundreds of local repositories. We gathered from across this range, central to regional, so that the corpus reflects the real spread of rules an institution has to work with, not only the headline laws.

Two principles guided collection. The first was public material only, gathered the way an ordinary browser would. We do not build or use tooling to circumvent a site's access controls. For a model meant for government use, where the provenance of the data may be audited, transparent and lawful sourcing is not optional. The second was breadth over cherry-picking. A model that only knows famous statutes is of little help at a working desk, where the rule that matters is often an obscure ministerial or regional one.

Cleaning is most of the work

Raw government documents are messy: scanned PDFs, inconsistent layouts, mislabeled files, the occasional stamp or seal sitting on top of the text. Turning that into clean training text took more effort than the gathering did.

Extraction runs in two tiers. A fast pass pulls text directly from documents that already contain it, which covers the large majority. Only the documents that come back sparse, typically scans, are sent through optical character recognition. Treating OCR as a rescue step rather than a default saved a large amount of compute without losing the scanned material.
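The two-tier routing can be sketched as below. This is a minimal illustration, not the pipeline itself: the `fast_extract` and `ocr_extract` callables, the `Extraction` type, and the 200-characters-per-page threshold for "sparse" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: each extractor takes a file path and returns one string per page.
PageExtractor = Callable[[str], List[str]]


@dataclass
class Extraction:
    text: str
    used_ocr: bool


def is_sparse(page_texts: List[str], min_chars_per_page: int = 200) -> bool:
    """Treat a document as sparse when its pages average too few
    non-whitespace characters. The threshold is an assumed value."""
    if not page_texts:
        return True
    total = sum(len("".join(page.split())) for page in page_texts)
    return total / len(page_texts) < min_chars_per_page


def extract(
    path: str,
    fast_extract: PageExtractor,
    ocr_extract: PageExtractor,
    min_chars_per_page: int = 200,
) -> Extraction:
    """Tier 1: read the embedded text layer. Tier 2: OCR only as a rescue."""
    pages = fast_extract(path)
    if not is_sparse(pages, min_chars_per_page):
        return Extraction(text="\n".join(pages), used_ocr=False)
    # The fast pass came back sparse (typically a scan), so pay for OCR.
    pages = ocr_extract(path)
    return Extraction(text="\n".join(pages), used_ocr=True)
```

Because the OCR callable is invoked only when the fast pass fails the sparsity check, the expensive step runs on the minority of documents that need it.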

Deduplication uses MinHash to compare documents for near-identical content. One result surprised us. Indonesian amendments are largely delta documents: they cite the original and change only specific clauses rather than restating the full text. An amendment therefore shares little text with the regulation it amends, near-duplicate detection flags very few pairs, and deduplication is close to a no-op on this material. The lesson was that corpus quality here comes from broad, careful sourcing, not from aggressive deduplication.
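A small pure-Python sketch shows why a delta-style amendment scores low under MinHash. Only the choice of MinHash comes from the text above; the 5-word shingles, the 64 salted BLAKE2b hash functions, and the function names are assumptions chosen for illustration.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Split text into overlapping k-word shingles (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """Build a MinHash signature: for each salted hash function, keep the
    smallest hash value seen over all shingles."""
    if not items:
        raise ValueError("cannot build a signature from an empty shingle set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def similarity(text_a: str, text_b: str) -> float:
    return estimated_jaccard(
        minhash_signature(shingles(text_a)),
        minhash_signature(shingles(text_b)),
    )
```

Comparing a long original with a short amendment that replaces a single article gives an estimate close to zero, since almost none of the amendment's shingles appear in the original. A verbatim re-upload of the same file scores 1.0. Only the second case is something deduplication should remove.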

Quality over raw size

The result runs to billions of tokens spanning central and regional regulation. We deliberately favored coverage and cleanliness over chasing the largest possible token count. For a domain model, a smaller corpus that genuinely represents the working reality beats a larger one padded with noise.

This corpus is what lets Strata do what a general model cannot: trace a rule to its source, recognize the form of an official document, and reason about how rules relate. How we check that it actually works is the subject of a separate report, and the model itself is described on the Strata page.