Benchmarks are useful for understanding and comparing the performance of [[LLM]]s across different domains. Benchmarks are not perfect: they can be too narrow in scope, miss important measures of reasoning, and test data can leak into the models themselves through direct training on test questions or overfitting to benchmarks.

Benchmarks are collected on various leaderboards, including

- [Open LLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/)
- [BigCode](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard) (coding specific)
- [LLM Perf](https://huggingface.co/spaces/optimum/llm-perf-leaderboard) (performance on specific hardware)
- [HuggingFace Others](https://huggingface.co/spaces?q=leaderboard) (domain-specific models)
- [Vellum](https://www.vellum.ai/llm-leaderboard) (API cost, context windows)
- [SEAL](https://scale.com/leaderboard) (expert skills)
- [Chatbot Arena](https://beta.lmarena.ai/leaderboard) (human preference)

Common benchmarks include

- **ARC**: Reasoning
- **DROP**: Language composition
- **HellaSwag**: Common sense
- **MMLU**: Understanding
- **TruthfulQA**: Accuracy
- **Winogrande**: Context
- **GSM8K**: Math
- **ELO**: Chat, rated from head-to-head human preference votes (see the rating-update sketch below)
- **HumanEval**: Python coding (see the pass@k sketch below)
- **MultiPL-E**: Coding (general)

Newer and more difficult benchmarks include

- **GPQA** (Google-proof Q&A): PhD-level questions
- **BBH** (BIG-Bench Hard): Tasks chosen because LLMs couldn't yet solve them (but no longer!)
- **MATH Lvl 5**: High school math competition problems
- **IFEval**: Difficult instructions to follow
- **MuSR**: Multi-step soft reasoning (e.g., solving murder mysteries of about 1,000 words)
- **MMLU-PRO**: More nuanced understanding tasks
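Chat leaderboards such as Chatbot Arena rank models from pairwise human-preference votes using Elo-style ratings. Here is a minimal sketch of the classic Elo update that conveys the idea; the starting rating of 1000 and the K-factor of 32 are illustrative assumptions, not the leaderboard's actual parameters, and the live arena uses a more statistically refined fit of the same pairwise data.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: two models start at 1000; model A wins one human-preference vote.
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```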
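Coding benchmarks like HumanEval and MultiPL-E are usually scored with pass@k: generate n samples per problem, count the c that pass the unit tests, and use the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k). A small sketch of that formula (the function name is mine):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations (c of which are correct),
    passes the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots, so at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 generations for one problem, 37 of them pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # higher, since any of 10 tries may pass
```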