Benchmark literacy checklist

  • Use benchmark scores to shortlist, not to make final deployment decisions.
  • Compare models on your domain tasks (coding, support, RAG, internal docs).
  • Track both quality and operations metrics (latency, memory fit, failure modes).
  • Re-run evaluations after quantization, runtime changes, or model upgrades.
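One operations metric from the checklist, p95 latency, is easy to track yourself. A minimal sketch, assuming you log per-request latencies in milliseconds (the function name and sample values are illustrative, not from the original):

```python
def latency_p95_ms(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of logged request latencies."""
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]

# Example: one slow outlier dominates the tail even when the mean looks fine.
latencies = [120.0, 135.0, 110.0, 480.0, 150.0, 140.0, 125.0, 130.0, 145.0, 155.0]
print(latency_p95_ms(latencies))  # → 480.0
```

Re-running this after quantization or a runtime change makes tail-latency regressions visible immediately, which averages tend to hide.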

Starter eval template

Adapt this schema to build a repeatable eval set across candidate models.

{
  "name": "local-llm-eval-v1",
  "metrics": ["pass_rate", "latency_p95_ms", "format_adherence"],
  "cases": [
    {
      "id": "coding-refactor-001",
      "task": "coding",
      "prompt": "Refactor this function for readability without changing behavior.",
      "must_include": ["unchanged behavior", "clear naming"],
      "must_avoid": ["API breaking changes"]
    },
    {
      "id": "rag-grounding-001",
      "task": "rag",
      "prompt": "Answer only from the provided context and cite line numbers.",
      "must_include": ["grounded answer", "citation"],
      "must_avoid": ["unsupported claims"]
    }
  ]
}
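A schema like this is only useful with a scorer behind it. A minimal sketch of a runner for the `must_include` / `must_avoid` fields, assuming naive case-insensitive substring matching (production evals typically use graders or an LLM judge instead; `score_case`, `pass_rate`, and the model outputs are hypothetical names, not part of the schema):

```python
def score_case(case: dict, output: str) -> bool:
    """Pass if every must_include string appears and no must_avoid string does."""
    text = output.lower()
    required_ok = all(s.lower() in text for s in case.get("must_include", []))
    forbidden_hit = any(s.lower() in text for s in case.get("must_avoid", []))
    return required_ok and not forbidden_hit

def pass_rate(eval_set: dict, outputs: dict[str, str]) -> float:
    """Fraction of cases passed, keyed by case id; feeds the pass_rate metric."""
    cases = eval_set["cases"]
    return sum(score_case(c, outputs[c["id"]]) for c in cases) / len(cases)

# Illustrative usage with one fabricated model output per case id.
eval_set = {
    "cases": [
        {"id": "rag-grounding-001",
         "must_include": ["grounded answer", "citation"],
         "must_avoid": ["unsupported claims"]},
    ]
}
outputs = {"rag-grounding-001": "Grounded answer with citation to lines 3-5."}
print(pass_rate(eval_set, outputs))  # → 1.0
```

Running the same eval set and scorer across every candidate model keeps the comparison repeatable, which is the point of versioning the schema (`local-llm-eval-v1`).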
