EPITHRE.

Evaluating AI for Indonesian government work

A model can sit at the top of the popular benchmarks and still be useless to a ministry. Those benchmarks measure general ability, mostly in English; they say little about whether a model can trace a legal basis or draft an official letter correctly. So we built an evaluation of our own, aimed at the work that actually matters.

What we measure

The evaluation is organized around the tasks Strata is meant for, each judged by a single question: would the answer hold up if it were used for real?

How it is graded

Tasks are drawn from real regulatory material rather than synthetic prompts, so the evaluation reflects the documents the model will actually meet. We grade on usefulness in context, not on surface fluency.

We deliberately include adversarial cases: questions whose wrong answer is the one that sounds most reasonable. These separate a model that is merely confident from one that knows when to be careful. A good answer often ends in "this depends on the specific text, here is what to check" rather than in a confident but wrong verdict.
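One way to picture how adversarial cases could be scored is the sketch below. It is illustrative only and does not describe the actual grading scheme: the three outcome labels, the weights, and the names `Outcome`, `GradedCase`, `score`, and `adversarial_score` are all assumptions made for the example. The one idea it takes from the text is that a careful "here is what to check" answer should score better than a plausible verdict that is wrong.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels a human grader might assign to one answer."""
    CORRECT = "correct"                  # right verdict, grounded in the text
    CAREFUL = "careful"                  # no verdict, but says what to check
    PLAUSIBLE_WRONG = "plausible_wrong"  # the tempting wrong answer, stated confidently


# Assumed weights, chosen only for illustration: a confident wrong answer
# costs more than a careful non-answer earns.
WEIGHTS = {
    Outcome.CORRECT: 1.0,
    Outcome.CAREFUL: 0.5,
    Outcome.PLAUSIBLE_WRONG: -1.0,
}


@dataclass
class GradedCase:
    case_id: str
    adversarial: bool
    outcome: Outcome


def score(cases):
    """Mean weighted score over a list of graded cases."""
    if not cases:
        raise ValueError("no graded cases")
    return sum(WEIGHTS[c.outcome] for c in cases) / len(cases)


def adversarial_score(cases):
    """Same score, restricted to the adversarial cases."""
    return score([c for c in cases if c.adversarial])


# Two hypothetical models graded on the same two adversarial cases.
confident_model = [
    GradedCase("adv-1", True, Outcome.PLAUSIBLE_WRONG),
    GradedCase("adv-2", True, Outcome.CORRECT),
]
careful_model = [
    GradedCase("adv-1", True, Outcome.CAREFUL),
    GradedCase("adv-2", True, Outcome.CORRECT),
]

print(adversarial_score(confident_model))  # 0.0
print(adversarial_score(careful_model))    # 0.75
```

Under these assumed weights, the model that hedges on the trap question scores higher than the one that answers it confidently and wrongly, even though both get the second case right.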

A proxy, not the truth

We do not treat the evaluation as ground truth. It is a proxy, and we revise it as we find its blind spots. What it gives us, and no public leaderboard can, is an honest signal of whether Strata is improving at the work that matters rather than at the work that is easy to score. The corpus behind the model is described in a companion report.