A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation

Conference Publication

AUTHORS:

ZHAW, Switzerland

Kurt Stockinger

ADDITIONAL AUTHORS:

Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Eneko Agirre, Mark Cieliebak

PUBLISHED IN:

accepted in:

ACL 2020

CURRENT STATUS

Yet to be published

DATE:

April 7, 2020

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. To this end, we introduce an intermediate representation, called Operation Trees (OT), that is based on the logical query plan in a database. This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations.

In our method, we randomly generate OTs from a context-free grammar, and annotators only have to write the appropriate question and assign the tokens. We apply the method to create a new corpus, OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to significantly increase performance.
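
The idea of sampling trees from a context-free grammar can be illustrated with a minimal Python sketch. The grammar, the operation names (PROJECT, FILTER, JOIN, TABLE), the function names, and the depth limit below are all assumptions made for illustration; they are not the grammar or the implementation used in the paper.

```python
import random

# Hypothetical toy grammar (NOT the paper's grammar). Keys are nonterminals;
# any symbol that is not a key is treated as a terminal operation node.
# The first production of each nonterminal is non-recursive, so it can be
# used to force termination once the depth limit is reached.
GRAMMAR = {
    "QUERY": [["PROJECT", "SOURCE"]],
    "SOURCE": [["TABLE"], ["FILTER", "SOURCE"], ["JOIN", "SOURCE", "SOURCE"]],
}


def generate(symbol, rng, depth=0, max_depth=3):
    """Randomly expand `symbol` into a tree of (symbol, children) tuples."""
    if symbol not in GRAMMAR:
        return (symbol, [])  # terminal: a leaf operation
    productions = GRAMMAR[symbol]
    # Past the depth limit, always take the non-recursive first production.
    if depth >= max_depth:
        production = productions[0]
    else:
        production = rng.choice(productions)
    children = [generate(s, rng, depth + 1, max_depth) for s in production]
    return (symbol, children)


def leaves(tree):
    """Return the terminal operations of a tree, left to right."""
    symbol, children = tree
    if not children:
        return [symbol]
    return [leaf for child in children for leaf in leaves(child)]


if __name__ == "__main__":
    tree = generate("QUERY", random.Random(0))
    print(leaves(tree))
```

In the inverted annotation workflow described above, an annotator would be shown a tree like this and asked to write a matching question and assign its tokens to the operations, rather than writing the query for a given question.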
