LM Evaluation Harness

Evaluation

Unified framework for evaluating language models on 200+ benchmarks.

8k stars · 2k forks · Python

About

The LM Evaluation Harness by EleutherAI provides standardized evaluation of language models across academic benchmarks including MMLU, HellaSwag, and ARC. It is the de facto standard for open model leaderboards, including Hugging Face's Open LLM Leaderboard.
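The harness can be driven from the command line (the `lm_eval` entry point) or from Python. Below is a minimal sketch using the Python API, assuming a recent release (0.4+) where `simple_evaluate` is exposed at the package root; the checkpoint and task names are illustrative, not prescribed.

```python
# Minimal sketch: evaluate a Hugging Face checkpoint on two benchmarks.
# Assumes lm-eval >= 0.4 (pip install lm-eval); model and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM checkpoint
    tasks=["hellaswag", "arc_easy"],                 # names from the task registry
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. acc and acc_norm
```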

Key Features

  • 200+ supported benchmarks
  • Standardized, versioned task implementations
  • Configurable few-shot prompting
  • Reproducible results (see the sketch after this list)
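
The few-shot and reproducibility features show up together in the run output. Here is a hedged sketch, again assuming lm-eval 0.4+, of a 5-shot MMLU run and the version and configuration records that make it repeatable; the checkpoint is illustrative.

```python
# Sketch: 5-shot MMLU evaluation plus the metadata recorded for reproducibility.
# Assumes lm-eval >= 0.4; the checkpoint is illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["mmlu"],      # group task covering all 57 MMLU subjects
    num_fewshot=5,       # MMLU is conventionally reported 5-shot
)
print(results["versions"])  # task implementation versions
print(results["config"])    # run configuration: model args, shots, etc.
```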

Tags

Evaluation · Benchmarks · MMLU · LLM