Reasoning across Dozens of Documents

MoNaCo is a benchmark of More Natural and Complex questions for evaluating the question-answering (QA) skills of language models. MoNaCo contains 1,315 time-consuming questions whose solutions involve combining and reasoning over information spanning dozens of Wikipedia tables and passages. Compared to traditional QA benchmarks, the scope of MoNaCo questions is much broader: to solve them, models must be proficient at decomposing complex queries, locating hundreds of pieces of information, and reasoning over, combining, and aggregating this data effectively.

This benchmark was created by a team of NLP researchers at the University of Pennsylvania, the Allen Institute for AI, Tel Aviv University, and Bar-Ilan University.

For more details on MoNaCo, please refer to our TACL 2025 paper and to our Ai2 blog post.

Overview

MoNaCo is a question answering benchmark of natural time-consuming questions, where each question requires combining information from dozens of Wikipedia pages.

We collected 1,315 human-written questions that reflect the goals of real-world users (a political scientist, a sports buff, an amateur chef, etc.). Each question is also manually annotated with its gold-standard reasoning chain, containing all of its: (1) intermediate questions, (2) answers, and (3) Wikipedia evidence. A sketch of what one such record might look like appears below.
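Purely for illustration, the Python snippet below shows one possible shape for such an annotated record. All field names and the example question are hypothetical, not the official schema; the authoritative format is the one shipped with the released data (see the Download section).

import json

# Hypothetical MoNaCo record layout (field names are illustrative only):
# a natural question paired with its gold-standard reasoning chain.
example = {
    "question": "Which Mediterranean capital cities have hosted the Summer Olympics?",
    "reasoning_chain": [
        {
            "intermediate_question": "Which countries border the Mediterranean Sea?",
            "answers": ["Spain", "France", "Italy", "Greece", "..."],
            "evidence": ["https://en.wikipedia.org/wiki/Mediterranean_Sea"],
        },
        # ...one step per intermediate question, often dozens per example
    ],
    "answers": ["Rome", "Athens"],
}

print(json.dumps(example, indent=2))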


We evaluated MoNaCo in a question-answering setting, testing 15 frontier language models, including GPT-5, o3, Claude Opus 4, Gemini 2.5 Pro, and DeepSeek R1. These models' predictions are released alongside our QA data, to help researchers better assess model factuality. To reproduce our evaluation, please refer to our LLM-as-judge prompt and evaluation script, available here. When MoNaCo was released in August 2025, the top-performing model, OpenAI o3, scored only 61% F1 and reached a perfect score on only 38.7% of the examples.
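As a rough illustration of how F1 gives partial credit on MoNaCo's long, list-style answers, the sketch below computes standard SQuAD-style token-overlap F1 between a predicted and a gold answer. Note that the official MoNaCo numbers come from the LLM-as-judge prompt and evaluation script linked above, so treat this as a simplified stand-in, not our exact metric.

from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Standard token-overlap F1, as in SQuAD-style QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Each shared token counts up to its multiplicity in both strings.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# An incomplete list answer earns partial credit:
print(token_f1("Rome Paris", "Rome Paris Madrid"))  # 0.8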

While our paper centers on QA evaluation, the breadth and depth of MoNaCo questions make it an ideal evaluation benchmark for at least five settings.


Paper

MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, and Reut Tsarfaty
To appear in the Transactions of the Association for Computational Linguistics (TACL), 2025

@article{wolfson-etal-2025-monaco,
    title = "MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents",
    author = "Wolfson, Tomer  and
      Trivedi, Harsh  and
      Geva, Mor  and
      Goldberg, Yoav  and
      Roth, Dan  and
      Khot, Tushar  and
      Sabharwal, Ashish  and
      Tsarfaty, Reut",
    journal = "Transactions of the Association for Computational Linguistics",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    year = "2025",
}

Leaderboard

For the most up-to-date results please refer to the MoNaCo Leaderboard page.

Download