Tel-Aviv University Releases Long-Text NLP Benchmark SCROLLS

Researchers from Tel-Aviv University, Meta AI, IBM Research, and Allen Institute for AI (AI2) have released Standardized CompaRison Over Long Language Sequences (SCROLLS), a set of natural language processing (NLP) benchmark tasks operating on long text sequences drawn from many domains. Experiments with baseline models show that current NLP models leave significant room for improvement.

The benchmark and baseline experiments were described in a paper published on arXiv. SCROLLS measures a model's performance on NLP tasks such as natural language inference (NLI), question-answering, and summarization, evaluated on seven different datasets containing text strings thousands of characters in length. The datasets are drawn from government reports, scientific papers, legal documents, film and television scripts, and literature. The goal of the benchmark is to improve the NLP community's ability to compare NLP models on text longer than a few sentences. According to the researchers:

We hope that SCROLLS inspires the NLP community to go beyond single sentences and paragraphs, and meet the challenges of processing and reasoning over longer discourses.

The Transformer has become the dominant architecture for deep-learning NLP models, but one drawback of the Transformer is that it can only handle inputs of a certain maximum length, and the model's computational and memory requirements grow with the square of this length. There have been many modifications to the basic Transformer to address this, including sparse Transformers, Reformer, and Performer. The SCROLLS team noted, however, that evaluation tasks and metrics for these different solutions often varied from model to model, making it difficult to compare the models' ability to handle long-range dependencies in text.
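The quadratic growth mentioned above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the SCROLLS paper: it counts the pairwise scores in one full self-attention matrix and estimates their memory footprint. The function names, the head count of 12, and the 4 bytes per score are illustrative assumptions.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Pairwise scores in one full self-attention matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def attention_memory_mib(seq_len: int, num_heads: int = 12, bytes_per_score: int = 4) -> float:
    """Rough memory for the attention scores of a single layer, in MiB.

    num_heads and bytes_per_score are assumed values chosen for illustration.
    """
    total_bytes = num_heads * attention_matrix_entries(seq_len) * bytes_per_score
    return total_bytes / 2**20


# Doubling the input length quadruples the number of attention scores.
for n in (512, 1024, 4096, 16384):
    print(f"{n:>6} tokens: {attention_matrix_entries(n):>12,} scores per head, "
          f"{attention_memory_mib(n):>8.1f} MiB per layer")
```

Under these assumptions, a 512-token input needs 12 MiB for one layer's attention scores, while a 16,384-token input needs 1,024 times as much, which is why long inputs motivate sparse or approximate attention variants.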

To address this, the team hand-curated seven existing datasets containing "discourses that are naturally long" and processed them into a common format. Each dataset has a corresponding NLP task.

  • GovReport: given a government report, generate an executive summary
  • SummScreenFD: given a TV show transcript, generate a "recap"
  • QMSum: given a transcript of an academic, business, or government meeting, generate a query-based summary
  • Qasper: given a scientific paper, answer questions about the content
  • NarrativeQA: given a book or movie script, answer questions about the content
  • QuALITY: given a story or article, answer multiple-choice questions about the content
  • Contract NLI: given a legal contract, predict whether a legal statement is "entailed" by the contract
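One way to picture a "common format" for the seven tasks listed above is a single record type with an input text and a target text. The sketch below is an assumption made for illustration, not the schema the SCROLLS team actually uses: the `LongTextExample` class, its field names, the `to_common_format` helper, and the way a query is prepended to the document are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Task type per dataset, following the task descriptions in the list above.
TASK_BY_DATASET = {
    "GovReport": "summarization",
    "SummScreenFD": "summarization",
    "QMSum": "summarization",
    "Qasper": "question answering",
    "NarrativeQA": "question answering",
    "QuALITY": "question answering",
    "Contract NLI": "natural language inference",
}


@dataclass(frozen=True)
class LongTextExample:
    """Hypothetical common record: every task becomes text in, text out."""
    dataset: str
    task: str
    input: str
    output: str


def to_common_format(dataset: str, source_text: str, target: str,
                     query: Optional[str] = None) -> LongTextExample:
    """Wrap one raw example; a query (question, hypothesis) is prepended if given."""
    task = TASK_BY_DATASET[dataset]  # raises KeyError for unknown datasets
    model_input = f"{query}\n\n{source_text}" if query else source_text
    return LongTextExample(dataset, task, model_input, target)


example = to_common_format(
    "QMSum",
    source_text="(long meeting transcript)",
    target="(reference summary)",
    query="What did the team decide about the budget?",
)
print(example.task)
```

Treating every task as text-to-text in this way is what lets a single encoder-decoder model be evaluated on all seven datasets without task-specific output heads.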

The team then benchmarked two baseline Transformer models on SCROLLS: BART and Longformer Encoder-Decoder (LED). They also created a "naive" heuristic baseline by simply reusing the beginning of the input as the output and evaluating the result.

The team noted several trends in model performance. First, both models performed better when given longer "contexts", or input sequences. Second, for a given context length, BART outperformed LED, "suggesting that LED might be under-optimized." Third, both models outperformed the naive heuristic by "7 to 10 points." Unlike with many other NLP benchmarks, the researchers were not able to determine a human-level performance score for SCROLLS. Based on previous results from some of the SCROLLS datasets, however, they conclude it is "probably much higher than the current baselines," suggesting that there is opportunity for models to improve.
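The naive heuristic is simple enough to sketch in a few lines. The code below is an illustration under stated assumptions rather than the paper's evaluation code: the prefix length, the whitespace tokenization, and the token-overlap F1 used as a stand-in score are all choices made here, and the function names are hypothetical.

```python
from collections import Counter


def naive_prefix_baseline(document: str, max_words: int) -> str:
    """Return the first max_words whitespace-separated words of the input."""
    return " ".join(document.split()[:max_words])


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a simple stand-in metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


document = "The agency reviewed its budget in 2021 and found several gaps in reporting"
reference = "The agency reviewed its budget and found gaps"

prediction = naive_prefix_baseline(document, max_words=6)
print(prediction)
print(round(token_f1(prediction, reference), 3))
```

A baseline like this gives a floor for the benchmark: a trained model that cannot beat a copy of the input's opening words has learned little about the task.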

In a discussion about the work on Twitter, AI2 researcher Sameer Singh asked SCROLLS co-author Omer Levy whether he considered short-text natural language understanding (NLU) to be solved. Levy replied:

There's definitely a lot more research to do with short contexts [but] it might be time to venture outside our single-sentence comfort zone and put more emphasis on this under-represented area. In the not-too-distant past, we couldn't get anything in semantics (NLU) to work too well, so we didn't really need to go beyond sentence similarity/entailment to design a benchmark. We haven't necessarily solved these problems, but the situation is arguably different post-BERT.

The SCROLLS datasets are available on the benchmark's website, and the code for reproducing the paper's experiments is available on GitHub.
