Strongly Generalizable Question Answering
GrailQA is by far the largest crowdsourced KBQA dataset, with highly diverse questions (questions in GrailQA can involve up to four relations and optionally a function: counting, superlatives, or comparatives). It also has the highest coverage of Freebase, spanning 3,720 relations across 86 domains. Last but not least, our meticulous data split allows GrailQA to test not only i.i.d. generalization but also compositional generalization and zero-shot generalization, which are critical for practical KBQA systems.
We've built a few resources to help you get started with the dataset.
Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
To work with our dataset, we recommend setting up a Virtuoso server for Freebase (though feel free to index Freebase your own way). You can find both a clean version of the Freebase dump and instructions for setting up the server via:
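Once the server is running, you can query it over its SPARQL endpoint. The sketch below only builds a request URL for a simple name lookup; the endpoint address (Virtuoso's default port 8890) and the example MID are assumptions, so adjust them to your own setup.

```python
# Minimal sketch of building a SPARQL request against a local Virtuoso server
# hosting Freebase. The endpoint URL/port and the MID below are assumptions.
import urllib.parse

VIRTUOSO_URL = "http://localhost:8890/sparql"  # Virtuoso's default SPARQL port


def build_request(mid: str) -> str:
    """Build a GET URL that asks for the English name of a Freebase MID."""
    query = f"""
    PREFIX ns: <http://rdf.freebase.com/ns/>
    SELECT ?name WHERE {{
      ns:{mid} ns:type.object.name ?name .
      FILTER (lang(?name) = "en")
    }}"""
    params = urllib.parse.urlencode({"query": query, "format": "application/json"})
    return f"{VIRTUOSO_URL}?{params}"


url = build_request("m.02mjmr")  # example MID; replace with one from your data
```

You can then fetch `url` with any HTTP client; the `format` parameter asks Virtuoso for JSON results.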
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script takes as input. Evaluating semantic-level exact match also depends on several preprocessed ontology files of Freebase; you can find all of them here. To run the evaluation:

python evaluate.py <path_to_dev> <path_to_predictions> --fb_roles <path_to_fb_roles> --fb_types <path_to_fb_types> --reverse_properties <path_to_reverse_properties>
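To sanity-check your predictions before running the official script, it helps to know how answer-level F1 is typically computed. The sketch below compares predicted and gold answer sets only; it is not the official script, which additionally checks semantic-level exact match against the ontology files.

```python
# Minimal sketch of answer-level F1 between a predicted and a gold answer set.
# This is an illustration, not the official GrailQA evaluation script.

def answer_f1(predicted, gold):
    """F1 between a predicted and a gold set of answers."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return float(predicted == gold)  # both empty counts as a perfect match
    tp = len(predicted & gold)           # true positives: shared answers
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


print(answer_f1({"m.0abc", "m.0def"}, {"m.0abc"}))  # partial credit
```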
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set labels to the public. Here is a tutorial walking you through the official evaluation of your model:
Submission Tutorial

Send an email to gu.826@osu.edu, or create an issue on GitHub.
We thank Pranav Rajpurkar and Robin Jia for giving us the permission to build this website based on SQuAD.
Here are the overall Exact Match (EM) and F1 scores evaluated on the GrailQA test set. To get an EM score on GrailQA, please submit your results with logical forms in S-expression. Note that submissions are ranked only by F1, so feel free to choose your own meaning representation; EM won't affect your ranking.
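For readers unfamiliar with the S-expression representation, here is a hypothetical example of what a submitted logical form might look like. The relation names and the MID are illustrative only, not taken from the dataset:

```lisp
; Hypothetical S-expression for a question like
; "How many diseases have diabetes as a risk factor?"
; (relation names and MID are made up for illustration)
(COUNT (AND medicine.disease
            (JOIN medicine.disease.risk_factors m.0g9pc)))
```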
| Rank | Date | Model | EM | F1 |
|---|---|---|---|---|
| 1 | Sep 13, 2024 | RG^2-KBQA (single model), Anonymous | 79.140 | 84.403 |
| 2 | Jan 11, 2023 | Rank-KBQA (ensemble), HIK RESEARCH INST NLP | 73.418 | 81.869 |
| 3 | Dec 12, 2022 | Pangu (T5-3B), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 75.376 | 81.701 |
| 4 | Dec 27, 2022 | GAIN (T5-3B), Microsoft, https://arxiv.org/abs/2309.08345 | 76.306 | 81.522 |
| 5 | Nov 25, 2022 | GAIN (T5-base), Microsoft, https://arxiv.org/abs/2309.08345 | 75.074 | 80.632 |
| 6 | Dec 02, 2022 | Pangu (BERT-base), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 73.736 | 79.930 |
| 7 | Mar 17, 2024 | RetinaQA (single model), Indian Institute of Technology, Delhi | 74.129 | 79.519 |
| 8 | Nov 09, 2022 | FC-KBQA (single model), Anonymous | 73.199 | 78.765 |
| 9 | Sep 06, 2022 | DecAF (single model), AWS AI Labs, https://arxiv.org/abs/2210.00063 | 68.362 | 78.758 |
| 10 | Feb 11, 2023 | MGC-KBQA (single model), Anonymous | 72.844 | 78.522 |
| 11 | May 31, 2022 | TIARA (single model), Microsoft Research, https://arxiv.org/abs/2210.12925 | 73.041 | 78.520 |
| 12 | Jul 12, 2022 | DeCC (single model), Anonymous | 72.119 | 77.648 |
| 13 | Dec 28, 2022 | Rank-KBQA (single model), HIK RESEARCH INST NLP | 64.583 | 77.125 |
| 14 | Aug 31, 2022 | UniParser (base model), Salesforce, https://arxiv.org/pdf/2211.05165.pdf | 69.451 | 74.649 |
| 15 | Aug 20, 2021 | RnG-KBQA (single model), Salesforce Research, https://aclanthology.org/2022.acl-long.417/ | 68.778 | 74.422 |
| 16 | Apr 19, 2022 | ArcaneQA V2 (single model), The Ohio State University, https://aclanthology.org/2022.coling-1.148/ | 63.774 | 73.713 |
| 17 | Aug 12, 2021 | S2QL (single model), UCAS & Meituan Inc., https://dl.acm.org/doi/abs/10.1007/978-3-031-05981-0_18 | 57.456 | 66.186 |
| 18 | Apr 05, 2021 | ReTraCk (single model), Microsoft Research, https://aclanthology.org/2021.acl-demo.39/ | 58.136 | 65.285 |
| 19 | Feb 04, 2021 | ArcaneQA V1 (single model), The Ohio State University | 57.872 | 64.924 |
| 20 | Jan 22, 2021 | BERT+Ranking (single model), The Ohio State University | 50.578 | 57.988 |
| 21 | Jan 22, 2021 | GloVe+Ranking (single model), The Ohio State University | 39.521 | 45.136 |
| 22 | Jan 22, 2021 | BERT+Transduction (single model), The Ohio State University | 33.255 | 36.803 |
| 23 | Jan 22, 2021 | GloVe+Transduction (single model), The Ohio State University | 17.587 | 18.432 |
Here are the Exact Match (EM) and F1 scores evaluated on the subset of the GrailQA test set that tests compositional generalization.
| Rank | Date | Model | EM | F1 |
|---|---|---|---|---|
| 1 | Sep 13, 2024 | RG^2-KBQA (single model), Anonymous | 77.875 | 85.064 |
| 2 | Jan 11, 2023 | Rank-KBQA (ensemble), HIK RESEARCH INST NLP | 74.806 | 82.293 |
| 3 | Sep 06, 2022 | DecAF (single model), AWS AI Labs, https://arxiv.org/abs/2210.00063 | 73.385 | 81.807 |
| 4 | Dec 12, 2022 | Pangu (T5-3B), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 74.580 | 81.518 |
| 5 | Dec 02, 2022 | Pangu (BERT-base), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 74.935 | 81.216 |
| 6 | Dec 27, 2022 | GAIN (T5-3B), Microsoft, https://arxiv.org/abs/2309.08345 | 73.708 | 80.004 |
| 7 | Nov 25, 2022 | GAIN (T5-base), Microsoft, https://arxiv.org/abs/2309.08345 | 72.997 | 79.561 |
| 8 | Dec 28, 2022 | Rank-KBQA (single model), HIK RESEARCH INST NLP | 70.252 | 79.392 |
| 9 | Mar 17, 2024 | RetinaQA (single model), Indian Institute of Technology, Delhi | 71.996 | 78.903 |
| 10 | Feb 11, 2023 | MGC-KBQA (single model), Anonymous | 71.382 | 78.790 |
| 11 | Nov 09, 2022 | FC-KBQA (single model), Anonymous | 69.994 | 76.677 |
| 12 | May 31, 2022 | TIARA (single model), Microsoft Research, https://arxiv.org/abs/2210.12925 | 69.186 | 76.546 |
| 13 | Jul 12, 2022 | DeCC (single model), Anonymous | 68.540 | 75.862 |
| 14 | Apr 19, 2022 | ArcaneQA V2 (single model), The Ohio State University, https://aclanthology.org/2022.coling-1.148/ | 65.795 | 75.302 |
| 15 | Aug 20, 2021 | RnG-KBQA (single model), Salesforce Research, https://aclanthology.org/2022.acl-long.417/ | 63.792 | 71.156 |
| 16 | Aug 31, 2022 | UniParser (base model), Salesforce, https://arxiv.org/pdf/2211.05165.pdf | 65.149 | 71.102 |
| 17 | Apr 05, 2021 | ReTraCk (single model), Microsoft Research, https://aclanthology.org/2021.acl-demo.39/ | 61.499 | 70.911 |
| 18 | Aug 12, 2021 | S2QL (single model), UCAS & Meituan Inc., https://dl.acm.org/doi/abs/10.1007/978-3-031-05981-0_18 | 54.716 | 64.679 |
| 19 | Feb 04, 2021 | ArcaneQA V1 (single model), The Ohio State University | 56.395 | 63.533 |
| 20 | Jan 22, 2021 | BERT+Ranking (single model), The Ohio State University | 45.510 | 53.890 |
| 21 | Jan 22, 2021 | GloVe+Ranking (single model), The Ohio State University | 39.955 | 47.753 |
| 22 | Jan 22, 2021 | BERT+Transduction (single model), The Ohio State University | 31.040 | 35.985 |
| 23 | Jan 22, 2021 | GloVe+Transduction (single model), The Ohio State University | 16.441 | 18.507 |
Here are the Exact Match (EM) and F1 scores evaluated on the subset of the GrailQA test set that tests zero-shot generalization.
| Rank | Date | Model | EM | F1 |
|---|---|---|---|---|
| 1 | Sep 13, 2024 | RG^2-KBQA (single model), Anonymous | 75.364 | 80.792 |
| 2 | Dec 12, 2022 | Pangu (T5-3B), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 71.575 | 78.536 |
| 3 | Dec 27, 2022 | GAIN (T5-3B), Microsoft, https://arxiv.org/abs/2309.08345 | 71.848 | 77.751 |
| 4 | Nov 25, 2022 | GAIN (T5-base), Microsoft, https://arxiv.org/abs/2309.08345 | 69.918 | 76.357 |
| 5 | Dec 02, 2022 | Pangu (BERT-base), The Ohio State University, https://aclanthology.org/2023.acl-long.270/ | 69.125 | 76.059 |
| 6 | Jan 11, 2023 | Rank-KBQA (ensemble), HIK RESEARCH INST NLP | 63.651 | 75.514 |
| 7 | Mar 17, 2024 | RetinaQA (single model), Indian Institute of Technology, Delhi | 68.852 | 74.772 |
| 8 | Nov 09, 2022 | FC-KBQA (single model), Anonymous | 67.584 | 73.986 |
| 9 | May 31, 2022 | TIARA (single model), Microsoft Research, https://arxiv.org/abs/2210.12925 | 67.987 | 73.864 |
| 10 | Feb 11, 2023 | MGC-KBQA (single model), Anonymous | 67.253 | 73.355 |
| 11 | Jul 12, 2022 | DeCC (single model), Anonymous | 66.547 | 72.545 |
| 12 | Sep 06, 2022 | DecAF (single model), AWS AI Labs, https://arxiv.org/abs/2210.00063 | 58.551 | 72.270 |
| 13 | Aug 31, 2022 | UniParser (base model), Salesforce, https://arxiv.org/pdf/2211.05165.pdf | 63.968 | 69.847 |
| 14 | Aug 20, 2021 | RnG-KBQA (single model), Salesforce Research, https://aclanthology.org/2022.acl-long.417/ | 62.988 | 69.182 |
| 15 | Dec 28, 2022 | Rank-KBQA (single model), HIK RESEARCH INST NLP | 50.166 | 68.489 |
| 16 | Apr 19, 2022 | ArcaneQA V2 (single model), The Ohio State University, https://aclanthology.org/2022.coling-1.148/ | 52.860 | 66.029 |
| 17 | Aug 12, 2021 | S2QL (single model), UCAS & Meituan Inc., https://dl.acm.org/doi/abs/10.1007/978-3-031-05981-0_18 | 55.122 | 63.598 |
| 18 | Feb 04, 2021 | ArcaneQA V1 (single model), The Ohio State University | 49.964 | 58.844 |
| 19 | Jan 22, 2021 | BERT+Ranking (single model), The Ohio State University | 48.566 | 55.660 |
| 20 | Apr 05, 2021 | ReTraCk (single model), Microsoft Research, https://aclanthology.org/2021.acl-demo.39/ | 44.561 | 52.539 |
| 21 | Jan 22, 2021 | GloVe+Ranking (single model), The Ohio State University | 28.886 | 33.792 |
| 22 | Jan 22, 2021 | BERT+Transduction (single model), The Ohio State University | 25.702 | 29.300 |
| 23 | Jan 22, 2021 | GloVe+Transduction (single model), The Ohio State University | 2.968 | 3.123 |