Whenever a new language model is released, the first thing on display is usually a score: such-and-such percent on one exam, some rank on a leaderboard. Those numbers are useful for comparing models in general. The trouble is that almost none of them answer the question we care about most: is the model actually useful for Indonesian government work?
Most popular benchmarks measure general ability, and most of them do so in English. A model can score highly there and still get it wrong when asked to trace the legal basis of a policy or to draft an official memo in the correct format. The score is real; it just measures something other than what we need.
What actually needs measuring
So we built our own evaluation, drawn from real material and judged against one plain question: would this answer hold up if it were actually used?
A few of the things we score:
Citation accuracy. When the model names a rule, does that rule actually exist and actually apply, or does it merely sound convincing?
Regulatory reasoning. When two provisions conflict, does the model resolve them with the right principle, or does it just pick one?
Official register. Whether a letter, decision, or report follows the form and language proper to official drafting, rather than everyday prose.
Document reading. Whether the model can pull the right information out of official documents, including scans of middling quality.
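As a rough illustration only, the four dimensions above could be tallied like this. Everything in the sketch is an assumption made for the example: the names `ScoredAnswer` and `pass_rate_by_dimension`, and the simple pass/fail grading. It is not a description of the real evaluation harness.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical labels for the four dimensions described above.
DIMENSIONS = (
    "citation_accuracy",
    "regulatory_reasoning",
    "official_register",
    "document_reading",
)


@dataclass
class ScoredAnswer:
    item_id: str      # hypothetical identifier of the test item
    dimension: str    # one of DIMENSIONS
    passed: bool      # grader's verdict: would this answer hold up in real use?


def pass_rate_by_dimension(answers: List[ScoredAnswer]) -> Dict[str, Optional[float]]:
    """Return the fraction of passed answers per dimension (None if no items)."""
    rates: Dict[str, Optional[float]] = {}
    for dim in DIMENSIONS:
        graded = [a.passed for a in answers if a.dimension == dim]
        rates[dim] = sum(graded) / len(graded) if graded else None
    return rates
```

Reporting one rate per dimension, rather than a single blended number, keeps a weakness in one area (say, citations) from hiding behind strength in another.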
Why take the harder road
Building your own benchmark is tedious work. It is far easier to lift a number off an existing leaderboard. But that leaderboard knows nothing about the context we work in, and a model that looks impressive there may not be dependable on a policy analyst’s desk.
We also slip in traps on purpose: questions whose wrong answer is the one that sounds most reasonable. A good model is not the most confident one, but the one that knows when to be careful and when to admit it does not know.
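One way to make that preference concrete is a scoring rule in which a confident wrong answer costs more than an honest "I don't know". The sketch below is hypothetical: the `Outcome` labels and the point values (+1, 0, -1) are assumptions chosen for illustration, not the weights actually used.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    CORRECT = "correct"        # right answer
    ABSTAINED = "abstained"    # model said it was unsure or did not know
    WRONG = "wrong"            # plausible-sounding but incorrect answer


# Assumed point values: a wrong answer is penalised, abstaining is neutral.
POINTS = {Outcome.CORRECT: 1.0, Outcome.ABSTAINED: 0.0, Outcome.WRONG: -1.0}


def trap_score(outcomes: List[Outcome]) -> float:
    """Average points over a set of trap questions."""
    if not outcomes:
        raise ValueError("no trap questions were graded")
    return sum(POINTS[o] for o in outcomes) / len(outcomes)
```

Under these assumed weights, a model that answers one trap correctly and abstains on a second averages 0.5, while a model that answers one correctly and confidently gets the second wrong averages 0.0.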
A benchmark is not the truth
We do not treat this evaluation as perfect. It is still a proxy, not the truth itself, and we keep fixing it as we find the gaps. What it gives us, though, is something no public leaderboard does: an honest measure of whether Strata is getting better at the work that actually matters, rather than at the work that happens to be easiest to score.