Author: Georgia Koutrika, Athena, Greece

Data is a pervasive part of every business and scientific domain, but its explosive volume and increasing complexity make querying it challenging even for experts. The value of natural language interfaces to databases is now widely recognized, and numerous text-to-SQL systems have been developed in both industry and academia that enable users to pose natural language questions over relational databases (e.g., [1][2][3]). Such systems allow users to pose queries in natural language without worrying about the particularities of the underlying schema or of query languages such as SQL or SPARQL. To truly bridge the gap between users and data, not only should the user be able to “talk” to the system in natural language, but the system should also be able to respond in the same way [4].

A conversational data system understands user queries in natural language and uses natural language to (a) explain queries, (b) ask the user for clarifications, and (c) describe results.

Explaining queries. Given a natural language user query on a relational database, the system translates it to an equivalent SQL query. In practice, however, a system may end up with more than one possible translation because of ambiguity in the query and in how its terms match the underlying data. Let us consider an actual example from the Yelp database [5] (part of its schema is shown in the figure below).

A user is looking for “Italian restaurants”. However, the two keywords are found in several places in the underlying database. Table 1 shows some examples.

These mappings can generate several possible interpretations of the user query (depending on how relations are connected in the database schema). Table 2 shows some possible interpretations.

In our example, interpretations 1 and 4 are both correct and should be executed. Interpretation 2 is also reasonable (its results may or may not be contained in those of interpretations 1 and 4). Finally, interpretations 3 and 5 are unlikely to be correct. Hence, the system will generate results for (at least) interpretations 1, 2 and 4, assuming it can tell that the other interpretations are poor ones. In this case, for each set of results, the system can generate an NL explanation of the executed query, so that the user understands how each set was produced.
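The way keyword mappings multiply into candidate interpretations can be sketched as a simple cross product. This is only an illustration: the table and column names below are hypothetical placeholders, not the actual contents of Table 1, and a real system would also have to check how the matched relations join in the schema.

```python
from itertools import product

# Hypothetical keyword-to-schema mappings (placeholders, not the real Table 1).
KEYWORD_MAPPINGS = {
    "Italian": [("category", "name"), ("business", "name")],
    "restaurants": [("category", "name"), ("business", "name")],
}


def enumerate_interpretations(mappings):
    """Return every combination of one (table, column) match per keyword."""
    keywords = list(mappings)
    combos = product(*(mappings[kw] for kw in keywords))
    return [dict(zip(keywords, combo)) for combo in combos]


def describe(interpretation):
    """Render one interpretation as a short, human-readable string."""
    parts = [
        f'"{kw}" matches {table}.{column}'
        for kw, (table, column) in interpretation.items()
    ]
    return "; ".join(parts)


interpretations = enumerate_interpretations(KEYWORD_MAPPINGS)
for number, interpretation in enumerate(interpretations, start=1):
    print(number, describe(interpretation))
```

Two keywords with two candidate matches each already yield four combinations, which is why the system needs a way to rank or prune interpretations before executing them.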

Asking for clarifications. Let us consider the following example. A user is asking “show me some comedies by Woody Allen”. The query is ambiguous, and the system internally generates the following possible translations:

In that case, the system should spot the critical difference between the two translations (shown in color) and generate a natural language utterance to ask the user which one is meant. For example, it could ask: “Do you mean the director or the actor Woody Allen?”
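A minimal sketch of this idea, assuming each candidate translation has already been reduced to a small dictionary of slots (the slot names `genre`, `person` and `person_role` are hypothetical and chosen only for this illustration): the system compares the candidates, finds the slot where they disagree, and turns that difference into a question.

```python
def differing_slots(first, second):
    """Return the slot names whose values differ between two candidates."""
    return sorted(
        key for key in first.keys() | second.keys()
        if first.get(key) != second.get(key)
    )


def clarification_question(first, second, entity):
    """Build a question about the first slot on which the candidates disagree.

    Returns None when the two candidates are identical.
    """
    slots = differing_slots(first, second)
    if not slots:
        return None
    slot = slots[0]
    return f"Do you mean the {first.get(slot)} or the {second.get(slot)} {entity}?"


# Hypothetical slot-based views of the two candidate translations.
candidate_1 = {"genre": "comedy", "person": "Woody Allen", "person_role": "director"}
candidate_2 = {"genre": "comedy", "person": "Woody Allen", "person_role": "actor"}

question = clarification_question(candidate_1, candidate_2, "Woody Allen")
print(question)
```

Comparing full SQL translations would require diffing parsed queries rather than flat dictionaries; the flat representation is used here only to keep the sketch short.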

NL Explanations in INODE 

Generating natural language (NL) explanations of queries is an important part of INODE, which aims to bridge the gap between the user's queries on the one hand and how the system interprets them over the underlying data on the other. In the example screenshot, the system has translated the user query “Find projects where Alberto Broggi was involved” to SQL. To help the user judge whether the system captured the intent of the query, an NL explanation is shown alongside the results.

Logos [6,7] is our system that provides natural language translations for relational queries expressed in SQL. Our translation mechanism uses a graph-based approach for the query translation problem. We represent various forms of structured queries as directed graphs and we annotate the graph edges with template labels using an extensible template mechanism. Logos uses different graph traversal strategies for efficiently exploring these graphs and composing textual query descriptions. 
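The general idea of a template-labeled query graph can be illustrated with a toy sketch. This is not the Logos implementation: the `QueryGraph` class, its methods, and the template strings are assumptions made for illustration, the relation names are borrowed from the CORDIS query discussed in this article, and the traversal is a plain depth-first walk over an acyclic graph.

```python
from collections import defaultdict


class QueryGraph:
    """Toy directed graph whose edges carry text templates (hypothetical)."""

    def __init__(self):
        # node -> list of (template label, target node)
        self.edges = defaultdict(list)

    def add_edge(self, source, template, target):
        self.edges[source].append((template, target))

    def describe(self, node):
        """Depth-first traversal that stitches edge templates into text.

        Assumes the graph is acyclic; a cycle would recurse forever.
        """
        text = node
        for template, target in self.edges[node]:
            text += f" {template} {self.describe(target)}"
        return text


graph = QueryGraph()
graph.add_edge("projects", "associated with", "subject areas")
graph.add_edge("subject areas", "whose title is", "Robotics")

explanation = f"Find {graph.describe('projects')}."
print(explanation)
```

Changing the templates on the edges, or the order in which edges are visited, changes the wording of the result without touching the query itself, which is the appeal of keeping the templates separate from the traversal.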

We have adapted Logos to generate NL explanations over the CORDIS and SDSS databases with satisfactory results. The newest version of the system generates more compact and natural explanations thanks to several extensions to its algorithms and knowledge base. Consider the following SQL query:

SELECT pr.title
FROM projects pr, project_subject_areas psa, subject_areas sa
WHERE pr.unics_id = psa.project AND psa.subject_area = sa.code AND sa.title = 'Robotics';

The earlier version of Logos would generate this NL explanation:

  • Find the titles of projects associated with project subject areas, and for project subject areas associated with subject areas whose title is Robotics

The latest version generates the following NL explanation:

  • Find projects on subject areas whose title is Robotics.
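Both explanations describe the same query. To see what that query actually returns, it can be run against a tiny in-memory SQLite database. The table and column names come from the query above; the column types and all rows are invented for this sketch and do not reflect real CORDIS data.

```python
import sqlite3

# In-memory database with a made-up miniature of the three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (unics_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE subject_areas (code TEXT PRIMARY KEY, title TEXT);
CREATE TABLE project_subject_areas (project INTEGER, subject_area TEXT);
INSERT INTO projects VALUES (1, 'Example project A'), (2, 'Example project B');
INSERT INTO subject_areas VALUES ('ROB', 'Robotics'), ('BIO', 'Biotechnology');
INSERT INTO project_subject_areas VALUES (1, 'ROB'), (2, 'BIO');
""")

# Same query as in the text, with the subject area passed as a parameter.
QUERY = """
SELECT pr.title
FROM projects pr, project_subject_areas psa, subject_areas sa
WHERE pr.unics_id = psa.project
  AND psa.subject_area = sa.code
  AND sa.title = ?
"""

titles = [row[0] for row in conn.execute(QUERY, ("Robotics",))]
print(titles)
```

The join through `project_subject_areas` is exactly the step that the earlier explanation spells out and the newer one folds into the phrase "projects on subject areas".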

 

References:

[1] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. VLDB J. 28, 5 (2019), 793–819.

[2] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. ACL 2020.

[3] Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB 8, 1 (Sept. 2014), 73–84.

[4] Alkis Simitsis and Yannis E. Ioannidis. 2009. DBMSs Should Talk Back Too. CIDR 2009.

[5] Yelp dataset: https://www.yelp.com/dataset

[6] Andreas Kokkalis, Panagiotis Vagenas, Alexandros Zervakis, Alkis Simitsis, Georgia Koutrika, and Yannis Ioannidis. 2012. Logos: a system for translating queries into narratives. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12).

[7] Georgia Koutrika, Alkis Simitsis, and Yannis E. Ioannidis. 2010. Explaining structured queries in natural language. ICDE 2010, 333–344.