
Google Open-Sources AI for Using Tabular Data to Answer Natural Language Questions


Google open-sourced Table Parser (TAPAS), a deep-learning system that can answer natural-language questions from tabular data. TAPAS was trained on 6.2 million tables extracted from Wikipedia and matches or exceeds state-of-the-art performance on several benchmarks.

Co-creator Thomas Müller gave an overview of the work in a recent blog post. Given a table of numeric data, such as sports results or financial statistics, TAPAS is designed to answer natural-language questions about facts that can be inferred from the table; for example, given a list of sports championships, TAPAS might answer "which team has won the most championships?" Previous solutions to this problem convert natural-language queries into a software query language such as SQL, which is then run against the data table. TAPAS instead learns to operate directly on the data, and it outperforms the previous models on common question-answering benchmarks: by more than 12 points on Microsoft's Sequential Question Answering (SQA) and by more than 4 points on Stanford's WikiTableQuestions (WTQ).

Many previous AI systems solve the problem of answering questions from tabular data with an approach called semantic parsing, which converts the natural-language question into a "logical form"---essentially translating human language into programming-language statements. For questions about tabular data, the logical form is usually a query in a language such as SQL. Both Microsoft and Salesforce have developed such systems, but according to the Google team, one downside to semantic parsing is that, as with all supervised learning, it requires a hand-labelled dataset; in this case, one that maps natural-language questions to logical forms. Google's insight was to skip the intermediate step of the logical form. TAPAS instead directly outputs "a subset of the table cells and a possible aggregation operation."
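The contrast between the two output styles can be sketched as follows. This is a hypothetical illustration, not code from either system; the question, the SQL string, and the cell coordinates are invented for the example:

```python
# Illustration of the two prediction targets described above.
question = "Which team has won the most championships?"

# Semantic-parsing approach: the model must emit a logical form (here, SQL),
# so training requires hand-labelled (question, logical form) pairs.
semantic_parse = "SELECT team FROM results ORDER BY championships DESC LIMIT 1"

# TAPAS-style output: no intermediate query language, just a subset of
# table cells (as row/column coordinates) plus an aggregation label.
tapas_output = {
    "selected_cells": [(0, 0)],  # (row, column) of the answer cell(s)
    "aggregation": "NONE",       # e.g. NONE, SUM, AVERAGE, COUNT
}
```

Because the second target can be read directly off the table and the answer, it avoids the cost of annotating logical forms by hand.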


TAPAS is based on BERT, Google's NLP system, which can be trained to give natural-language answers to natural-language questions. In that scenario, BERT's input training data includes both the question and the answer. For TAPAS, which answers questions from numeric table data, the training input includes the question and the table's contents, flattened into a single long sequence. Because flattening the table loses information about the data structure, the input also includes embeddings that encode the row and column indices of each cell, as well as the cell's rank value within its column. The model has two sets of outputs. First, for each cell of the table, there is a probability score that the cell is part of the answer; any cell with a probability greater than 0.5 is included in the final result. The second is a choice of aggregation operation, such as SUM or AVERAGE (or NONE if no aggregation is needed).

TAPAS was pre-trained on a set of 6.2 million data tables extracted from Wikipedia, with associated questions drawn from the article title, article description, table caption, and other related text snippets. The model is then fine-tuned on the dataset for a specific benchmark. The Google team used three benchmark datasets: SQA, WTQ, and Salesforce's WikiSQL. On SQA, TAPAS achieved 67.2% accuracy, a 12-point improvement over the previous state of the art. On WTQ it achieved 48.8% accuracy, a 4-point improvement over previous systems. On WikiSQL, TAPAS scored 83.6%, very close to the state-of-the-art score of 83.9%.

Google's training code and pre-trained models are available on GitHub, along with a Colab tutorial.
