Why we built our own AI benchmark

Whenever a new language model is released, the first thing on display is usually a score: such-and-such percent on one exam, some rank on a leaderboard. Those numbers are useful for comparing models in general. The trouble is that almost none of them answer the question we care about most: is the model actually useful for Indonesian government work?

Most popular benchmarks measure general ability, and most of them do so in English. A model can score high there and still get it wrong when asked to trace the legal basis of a policy or to draft an official memo in the correct format. The score is good; it just measures something other than what we need.

What actually needs measuring

So we built our own evaluation, drawn from real material and judged against one plain question: would this answer hold up if it were actually used?

A few of the things it checks for:

Citation accuracy. When the model names a rule, does that rule actually exist and actually apply, rather than just sounding convincing?

Regulatory reasoning. When two provisions collide, does the model resolve them with the right principle, or just pick one?

Official register. Does a letter, decision, or report follow the form and language proper to official drafting, rather than everyday prose?

Document reading. Can the model pull the right information out of official documents, including scans of middling quality?
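
To make the idea of separate checks concrete, here is a minimal sketch of how results could be tallied per check rather than folded into one headline number. It is an illustration only: the dimension labels, the ItemResult record, and the pass_rates helper are hypothetical names, and the simple pass/fail judgment is an assumption rather than a description of how the evaluation is actually scored.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the four checks described above.
DIMENSIONS = (
    "citation_accuracy",
    "regulatory_reasoning",
    "official_register",
    "document_reading",
)


@dataclass(frozen=True)
class ItemResult:
    """One judged answer. All field names are illustrative assumptions."""
    item_id: str
    dimension: str
    passed: bool  # reviewer's call: would this answer hold up if actually used?


def pass_rates(results):
    """Return the share of passed items per dimension (None where nothing was judged)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension}")
        totals[r.dimension] += 1
        passes[r.dimension] += int(r.passed)
    return {
        d: (passes[d] / totals[d] if totals[d] else None)
        for d in DIMENSIONS
    }


if __name__ == "__main__":
    demo = [
        ItemResult("q1", "citation_accuracy", True),
        ItemResult("q2", "citation_accuracy", False),
        ItemResult("q3", "document_reading", True),
    ]
    for name, rate in pass_rates(demo).items():
        print(name, rate)
```

Keeping the rates separate means a model that writes fluent official prose but invents its citations cannot hide behind an average.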

Why take the harder road

Building your own benchmark takes real work. It is far easier to reuse a score from an existing leaderboard. But that leaderboard knows nothing about the context we work in, and a model that looks impressive there may not be dependable on a policy analyst’s desk.

We also slip in traps on purpose: questions whose wrong answer is the one that sounds most reasonable. These show whether a model is merely fluent or actually reliable. A good model knows when to be careful, and when to admit that it does not know.
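
One way to picture how trap questions could be scored is to treat an honest "I do not know" as better than a confident wrong answer. The sketch below assumes exactly that; the outcome labels and the weights are illustrative assumptions, not the evaluation's actual scheme.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcomes for a single trap question."""
    CORRECT = "correct"
    ABSTAINED = "abstained"          # the model said it did not know
    FELL_FOR_TRAP = "fell_for_trap"  # the plausible-sounding wrong answer
    OTHER_WRONG = "other_wrong"      # any other wrong answer


# Assumed weights: abstaining costs nothing, a wrong answer costs a point.
WEIGHTS = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAINED: 0.0,
    Outcome.FELL_FOR_TRAP: -1.0,
    Outcome.OTHER_WRONG: -1.0,
}


def trap_score(outcomes):
    """Average weight over all trap questions, in the range [-1, 1]."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(WEIGHTS[o] for o in outcomes) / len(outcomes)


if __name__ == "__main__":
    run = [
        Outcome.CORRECT,
        Outcome.ABSTAINED,
        Outcome.FELL_FOR_TRAP,
        Outcome.CORRECT,
    ]
    print(trap_score(run))
```

Under these assumed weights, a model that guesses on every trap can end up below one that abstains on all of them, which is the behavior the traps are meant to surface.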

A benchmark is not the truth

We do not treat this evaluation as perfect. It is still a proxy, not the truth itself, and we keep fixing it as we find the gaps. What it gives us, though, is something no public leaderboard does. We set this yardstick before training began and grounded it in real work, so that from the very first version we can tell whether Strata improves on what actually matters, rather than on whatever happens to be easiest to score.