# webnlg

**Repository Path**: tmonica/webnlg

## Basic Information

- **Project Name**: webnlg
- **Description**: The enriched version of the WebNLG described at INLG 2018
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-05-29
- **Last Updated**: 2021-05-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# WebNLG

The enriched version of the WebNLG dataset, described in the INLG 2018 paper ["Enriching the WebNLG corpus"](https://aclweb.org/anthology/W18-6521).

### Description

WebNLG is a valuable resource and benchmark for the Natural Language Generation (NLG) community. However, like other NLG benchmarks, it consists only of a collection of parallel raw representations and their corresponding textual realizations. This work aimed to provide intermediate representations of the data for the development and evaluation of popular tasks in the NLG pipeline architecture (Reiter and Dale, 2000), such as Discourse Ordering, Lexicalization, Aggregation and Referring Expression Generation.

### Data

Here are the changes per version:

- [**v1.6**](data/v1.6): Corrections related to entities and templates. Thanks to [Abelardo Vieira](https://github.com/abevieiramota) and [zhijing-jin](https://github.com/zhijing-jin). (March 25th, 2021)
- [**v1.5**](data/v1.5): English Lexicalization templates, introduced in the EMNLP 2019 paper "Neural data-to-text generation: A comparison between pipeline and end-to-end architectures". (August 22nd, 2019)
- [**v1.4**](data/v1.4): full revision of the delexicalized templates. (April 1st, 2019)
- [**v1.3**](data/v1.3): tokenization by [NLTK](https://www.nltk.org/), leading to a better extraction of references and discourse ordering information. (January 31st, 2019)
- [**v1.2**](data/v1.2): annotation of the test part of the corpus. See Issue [#2](https://github.com/ThiagoCF05/webnlg/issues/2). (January 31st, 2019)
- [**v1.1**](data/v1.1): fixes for some annotation mistakes. See Issue [#1](https://github.com/ThiagoCF05/webnlg/issues/1). (November 21st, 2018)
- [**v1.0**](data/v1.0): first version with annotation of the train and development parts of the corpus and German translation.

**BETA**

- [**v2.0 (BETA)**](data/v2.0): Tree templates. (April 1st, 2019)

### Example

```xml
11th_Mississippi_Infantry_Monument | region | Adams_County,_Pennsylvania
11th_Mississippi_Infantry_Monument | established | 2000
11th_Mississippi_Infantry_Monument | category | Contributing_property

11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania
11th_Mississippi_Infantry_Monument | established | 2000
11th_Mississippi_Infantry_Monument | category | Contributing_property

11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania
11th_Mississippi_Infantry_Monument | established | 2000
11th_Mississippi_Infantry_Monument | category | Contributing_property

The 11th Mississippi Infantry Monument
Adams County , Pennsylvania
It
2000
contributing property

The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.

AGENT-1 which VP[aspect=simple,tense=present,voice=active,person=3rd,number=singular] be located in PATIENT-1 . AGENT-1 VP[aspect=simple,tense=past,voice=passive,person=null,number=singular] establish in PATIENT-2 and VP[aspect=simple,tense=present,voice=active,person=3rd,number=null] fall under DT[form=defined] the category of PATIENT-3 .
```

### German translation

Besides the official English version of the data (``en``), we also provide a silver-standard version of the corpus in German (``de``). Details on how the translation was obtained are presented in the following section and in the INLG 2018 paper.
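
The triples and the delexicalized template shown in the Example section can be combined programmatically. The following is a minimal Python sketch, not part of this repository's scripts: `parse_triple` and `fill_template` are hypothetical helper names, the template is a simplified version of the one in the example (the `VP[...]` and `DT[...]` feature tags are left out), and the mapping from entity tags to referring expressions is assumed from the order in which they appear in the example.

```python
# Hypothetical helpers for illustration; they are not provided by the repository.

def parse_triple(line):
    """Split a 'subject | predicate | object' line into a 3-tuple."""
    subject, predicate, obj = (part.strip() for part in line.split(" | "))
    return subject, predicate, obj


def fill_template(template, references):
    """Replace each entity tag with the next referring expression listed for it."""
    # Copy the lists so the caller's dictionary is not modified.
    queues = {tag: list(forms) for tag, forms in references.items()}
    tokens = []
    for token in template.split():
        if token in queues and queues[token]:
            tokens.append(queues[token].pop(0))
        else:
            tokens.append(token)
    return " ".join(tokens)


triple = parse_triple(
    "11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania"
)

# Simplified template: verb feature tags from the real template are omitted.
template = "AGENT-1 be located in PATIENT-1 . AGENT-1 establish in PATIENT-2 ."

# Assumed tag-to-reference mapping, in order of mention.
references = {
    "AGENT-1": ["The 11th Mississippi Infantry Monument", "It"],
    "PATIENT-1": ["Adams County , Pennsylvania"],
    "PATIENT-2": ["2000"],
}

filled = fill_template(template, references)
print(triple)
print(filled)
```

The sketch only substitutes referring expressions. Realizing the verbs from their feature tags (for example, turning `establish` into "was established") is a separate step that it does not attempt.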
### Scripts

To obtain the enriched version of the dataset as available in the ``data`` directory, make sure to properly set up *the University of Edinburgh's Neural MT System for WMT17*, publicly available [here](http://data.statmt.org/wmt17_systems). Once it is set up, update the path variable ``MT_PATH`` in the ``main.sh`` script before executing it:

``sh main.sh``

### Related Projects

- [**WebNLG Reader**](https://github.com/zhijing-jin/WebNLG_Reader/): An easy-to-use Python reader and cleaner for the corpus.
- [**NeuralREG**](https://github.com/ThiagoCF05/NeuralREG): An end-to-end approach to Referring Expression Generation trained and evaluated on the enriched WebNLG.
- [**DeepNLG**](https://github.com/ThiagoCF05/DeepNLG): A project that aims to systematically compare neural pipeline and end-to-end architectures for NLG. The enriched WebNLG was used for evaluation.

### License

The WebNLG data is licensed under the [CC Attribution-Noncommercial-Share Alike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. The original version of the dataset can be found [here](https://gitlab.com/shimorina/webnlg-dataset).

### Citations:

```
@InProceedings{ferreiraetal2018,
  author = "Castro Ferreira, Thiago and Moussallem, Diego and Wubben, Sander and Krahmer, Emiel",
  title = "Enriching the WebNLG corpus",
  booktitle = "Proceedings of the 11th International Conference on Natural Language Generation",
  year = "2018",
  series = {INLG'18},
  publisher = "Association for Computational Linguistics",
  address = "Tilburg, The Netherlands",
}
```

```
@InProceedings{gardentetal017,
  author = "Gardent, Claire and Shimorina, Anastasia and Narayan, Shashi and Perez-Beltrachini, Laura",
  title = {Creating Training Corpora for {NLG} Micro-Planners},
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  series = {ACL'17},
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  address = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
```