MoNaCo Leaderboard

Below are the closed-book results of models on the full MoNaCo benchmark, i.e., over all 1,315 questions. To reproduce our evaluation, please refer to our LLM-as-judge prompt and evaluation script, available here. Note that we originally used GPT-4.1 (2025-06) as our “judge” language model.
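For orientation, the sketch below shows one plausible way the per-question Precision/Recall/F1 columns could be aggregated, assuming (as is common for list-answer benchmarks) that the LLM judge counts how many gold answers a model's prediction covers, and that the leaderboard macro-averages over questions. The function names and the macro-averaging assumption are illustrative only; the official evaluation script linked above is authoritative.

```python
def question_scores(num_matched: int, num_predicted: int, num_gold: int):
    """Per-question precision/recall/F1 from judge-counted answer matches (assumed scheme)."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def benchmark_scores(per_question):
    """Macro-average (assumed) of per-question scores over all 1,315 questions."""
    n = len(per_question)
    p = sum(q[0] for q in per_question) / n
    r = sum(q[1] for q in per_question) / n
    f = sum(q[2] for q in per_question) / n
    return p, r, f
```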

| Rank | Model | Date | Link | F1 | Precision | Recall |
|------|-------|------|------|------|-----------|--------|
| 1 | o3 | 2025-05 | 🌐 | 61.18 | 68.10 | 59.54 |
| 2 | GPT-5 (medium reasoning) | 2025-08 | 🌐 | 60.11 | 66.38 | 58.98 |
| 3 | Gemini 2.5-Pro | 2025-07 | 🌐 | 59.11 | 65.02 | 58.14 |
| 4 | GPT-4o (few-shot) | 2025-03 | 🌐 | 55.05 | 63.33 | 52.88 |
| 5 | Deepseek-V3 (few-shot) | 2025-05 | 🌐 | 55.04 | 62.31 | 53.37 |
| 6 | Claude 4-Opus | 2025-07 | 🌐 | 55.03 | 62.28 | 53.47 |
| 7 | o4-mini | 2025-04 | 🌐 | 54.92 | 62.50 | 53.01 |
| 8 | Deepseek-R1 | 2025-04 | 🌐 | 53.82 | 62.52 | 51.50 |
| 9 | Gemini 2.5-Flash | 2025-07 | 🌐 | 52.01 | 58.10 | 50.60 |
| 10 | Llama 3.1-405B (few-shot) | 2025-03 | 🌐 | 51.39 | 55.97 | 51.20 |
| 11 | o3-mini | 2025-04 | 🌐 | 48.75 | 59.29 | 46.19 |
| 12 | GPT-4 Turbo (few-shot) | 2024-05 | 🌐 | 48.58 | 56.26 | 46.81 |
| 13 | Qwen 2.5-72B (few-shot) | 2025-05 | 🌐 | 47.05 | 53.84 | 45.48 |
| 14 | Llama 3-70B (few-shot) | 2024-05 | 🌐 | 47.00 | 55.15 | 45.12 |
| 15 | Qwen 2-72B (few-shot) | 2024-07 | 🌐 | 43.92 | 50.80 | 42.89 |