Evaluating AI for Indonesian government work

A model can sit at the top of the popular benchmarks and still be useless to a ministry. Those benchmarks measure general ability, mostly in English; they say little about whether a model can trace a legal basis or draft an official letter correctly. So we built an evaluation of our own, aimed at the work that actually matters.

What we measure

The evaluation is organized around the tasks Strata is meant for, each judged by a single question: would the answer hold up if it were used for real?

Citation accuracy. When the model cites a rule, does that rule exist and apply, rather than merely sounding plausible? Confident fabrication of a regulation is the most dangerous failure, so we weight this heavily.
Regulatory reasoning. When two provisions appear to conflict, does the model resolve them with the right principle, such as lex specialis or lex posterior, or does it just pick one?
Official register. Do letters, decisions, and reports follow the form and language of proper tata naskah (official document conventions), rather than everyday prose?
Document and OCR extraction. Does the model pull the correct fields out of official documents, including scans of uneven quality?
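
To make "weight this heavily" concrete, here is a minimal sketch of how the four task scores might be folded into one number. The weights, the dictionary keys, and the function name are illustrative assumptions and not the evaluation's actual values; the only thing taken from the text is that citation accuracy counts for more than the other tasks.

```python
# Hypothetical weights. The only grounded fact is that citation accuracy
# is weighted more heavily than the other tasks; the numbers are invented.
WEIGHTS = {
    "citation_accuracy": 4,
    "regulatory_reasoning": 2,
    "official_register": 2,
    "document_extraction": 2,
}


def aggregate_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-task scores, each expected in the range [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing task scores: {sorted(missing)}")
    for task in WEIGHTS:
        if not 0.0 <= scores[task] <= 1.0:
            raise ValueError(f"score for {task} must be between 0 and 1")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[task] * scores[task] for task in WEIGHTS) / total_weight


# Two models that each fail half of one task, but a different one.
weak_on_citations = aggregate_score({
    "citation_accuracy": 0.5,
    "regulatory_reasoning": 1.0,
    "official_register": 1.0,
    "document_extraction": 1.0,
})
weak_on_reasoning = aggregate_score({
    "citation_accuracy": 1.0,
    "regulatory_reasoning": 0.5,
    "official_register": 1.0,
    "document_extraction": 1.0,
})
print(weak_on_citations, weak_on_reasoning)
```

Under these assumed weights, the same shortfall costs twice as much when it falls on citations as when it falls on any other task, which is the intended effect of the weighting.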

How it is graded

Tasks are drawn from real regulatory material rather than synthetic prompts, so the evaluation reflects the documents the model will actually meet. We grade on usefulness in context, not on surface fluency.

We also include adversarial cases on purpose: questions whose wrong answer is the one that sounds most reasonable. These separate a model that is merely confident from one that knows when to be careful. A good answer often ends in "this depends on the specific text, here is what to check" rather than a false verdict.
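
One way to score such cases is sketched below. It assumes a grader has already labeled each answer with one of three outcomes; the labels, the credit values, and the function name are hypothetical and chosen only to illustrate the idea that a careful deferral should beat a confident wrong verdict.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical grader labels for an answer to an adversarial case."""
    CORRECT = "correct"                    # right verdict, right basis
    CAREFUL_DEFERRAL = "careful_deferral"  # "depends on the text, check X"
    CONFIDENT_WRONG = "confident_wrong"    # the plausible-sounding wrong answer


# Assumed credit values, not the evaluation's real ones.
CREDIT = {
    Outcome.CORRECT: 1.0,
    Outcome.CAREFUL_DEFERRAL: 0.75,
    Outcome.CONFIDENT_WRONG: 0.0,
}


def adversarial_score(outcomes: list[Outcome]) -> float:
    """Mean credit over a set of graded adversarial cases."""
    if not outcomes:
        raise ValueError("at least one graded case is required")
    return sum(CREDIT[o] for o in outcomes) / len(outcomes)


# A model that defers on both cases versus one that guesses on both
# and happens to get one right.
careful_model = adversarial_score(
    [Outcome.CAREFUL_DEFERRAL, Outcome.CAREFUL_DEFERRAL]
)
confident_model = adversarial_score(
    [Outcome.CORRECT, Outcome.CONFIDENT_WRONG]
)
print(careful_model, confident_model)
```

With these assumed credits, the model that defers carefully on both cases outscores the one that guesses confidently and is right only half the time.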

A proxy, not the truth

We do not treat the evaluation as ground truth. It is a proxy, and we revise it as we find its blind spots. It gives us something no public leaderboard can: an honest signal of whether Strata is improving at the work that matters, rather than at the work that is easy to score. The corpus behind the model is described in a companion report.