MoNaCo Leaderboard

Below are the closed-book results of models on the full MoNaCo benchmark, i.e., over all 1,315 questions. To reproduce our evaluation, please refer to our LLM-as-judge prompt and evaluation script, available here. Note that we originally used GPT-4.1 (2025-06) as our “judge” language model.
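For orientation, the sketch below shows one plausible way the per-question Precision/Recall/F1 columns could be aggregated, assuming (as is common for list-answer benchmarks) that the LLM judge counts how many gold answers a model's prediction covers, and that the leaderboard macro-averages over questions. The function names and the macro-averaging assumption are illustrative only; the official evaluation script linked above is authoritative.

```python
def question_scores(num_matched: int, num_predicted: int, num_gold: int):
    """Per-question precision/recall/F1 from judge-counted answer matches (assumed scheme)."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def benchmark_scores(per_question):
    """Macro-average (assumed) of per-question scores over all 1,315 questions."""
    n = len(per_question)
    p = sum(q[0] for q in per_question) / n
    r = sum(q[1] for q in per_question) / n
    f = sum(q[2] for q in per_question) / n
    return p, r, f
```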

| Rank | Model | Date | Link | F1 | Precision | Recall |
|------|-------|------|------|------|-----------|--------|
| 1 | o3 | 2025-05 | 🌐 | 61.18 | 68.10 | 59.54 |
| 2 | GPT-5 (medium reasoning) | 2025-08 | 🌐 | 60.11 | 66.38 | 58.98 |
| 3 | Gemini 2.5-Pro | 2025-07 | 🌐 | 59.11 | 65.02 | 58.14 |
| 4 | GPT-4o (few-shot) | 2025-03 | 🌐 | 55.05 | 63.33 | 52.88 |
| 5 | Deepseek-V3 (few-shot) | 2025-05 | 🌐 | 55.04 | 62.31 | 53.37 |
| 6 | Claude 4-Opus | 2025-07 | 🌐 | 55.03 | 62.28 | 53.47 |
| 7 | o4-mini | 2025-04 | 🌐 | 54.92 | 62.50 | 53.01 |
| 8 | Deepseek-R1 | 2025-04 | 🌐 | 53.82 | 62.52 | 51.50 |
| 9 | Gemini 2.5-Flash | 2025-07 | 🌐 | 52.01 | 58.10 | 50.60 |
| 10 | Llama 3.1-405B (few-shot) | 2025-03 | 🌐 | 51.39 | 55.97 | 51.20 |
| 11 | o3-mini | 2025-04 | 🌐 | 48.75 | 59.29 | 46.19 |
| 12 | GPT-4 Turbo (few-shot) | 2024-05 | 🌐 | 48.58 | 56.26 | 46.81 |
| 13 | Qwen 2.5-72B (few-shot) | 2025-05 | 🌐 | 47.05 | 53.84 | 45.48 |
| 14 | Llama 3-70B (few-shot) | 2024-05 | 🌐 | 47.00 | 55.15 | 45.12 |
| 15 | Qwen 2-72B (few-shot) | 2024-07 | 🌐 | 43.92 | 50.80 | 42.89 |