YITU Technology introduces PreCo, a large-scale dataset for coreference resolution to help researchers analyze and improve their models more efficiently.
The dataset consists of around 40,000 documents and 13 million words, drawn mostly from the vocabulary of English-speaking preschoolers. It took researchers from YITU one year to develop, and about 50,000 hours were invested in the annotation.
The goal of coreference resolution is to identify mentions that refer to the same entity. For example, in the sentence “The trophy would not fit in the brown suitcase because it was too big.”, the phrases “it” and “the trophy” are coreferent.
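The idea can be made concrete with a small sketch. The Python below represents the example sentence as tokens and groups its mentions into clusters, one cluster per entity. The span layout and the helper names (`mention_text`, `cluster_texts`, `singletons`) are assumptions made for illustration and are not necessarily how PreCo stores its annotations.

```python
# Tokenized example sentence from the text above.
tokens = ["The", "trophy", "would", "not", "fit", "in", "the", "brown",
          "suitcase", "because", "it", "was", "too", "big", "."]

# Each mention is a (start, end) token span with an inclusive end index.
# Each cluster groups the mentions that refer to the same entity.
# This layout is an illustrative assumption, not PreCo's documented format.
clusters = [
    [(0, 1), (10, 10)],  # "The trophy" and "it"
    [(6, 8)],            # "the brown suitcase", coreferent with nothing else
]

def mention_text(tokens, span):
    """Return the surface text of a mention span."""
    start, end = span
    return " ".join(tokens[start:end + 1])

def cluster_texts(tokens, clusters):
    """Convert clusters of spans into clusters of strings."""
    return [[mention_text(tokens, span) for span in cluster]
            for cluster in clusters]

def singletons(clusters):
    """Return the clusters that contain exactly one mention."""
    return [cluster for cluster in clusters if len(cluster) == 1]

print(cluster_texts(tokens, clusters))
```

A coreference model is expected to produce such clusters from raw text; evaluation then compares the predicted clusters with the annotated ones.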
In natural language processing (NLP), coreference resolution is an important but still challenging task. It can be critical for many downstream applications, including reading comprehension, chatbots and text summarization.
Siri, the intelligent assistant on the iPhone, for instance, is reasonably capable of answering questions and setting an alarm or a reminder on request. Unfortunately, it is often unable to follow a conversation on one topic or to understand multiple sentences together. This suggests that coreference resolution is not yet good enough to make that possible.
PreCo is designed to capture the core challenges of coreference resolution while addressing some limitations of existing datasets. Its vocabulary is small, limited mostly to that of preschoolers, and its scale is large: about 10 times larger than OntoNotes, the most widely used dataset for coreference resolution in the past 5 years.
The small vocabulary and large scale together lead to significantly higher training-test overlap, which could make error analysis of models more efficient. In addition, singleton mentions (mentions that are not coreferent with any other mention) are annotated, making it possible for the first time to quantify the influence of a mention detector on coreference resolution performance.
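One simple way to picture training-test overlap is to measure how much of the test vocabulary already appears in the training data. The Python sketch below computes the fraction of test tokens whose lowercased form occurs in training. It is an illustrative proxy only: the function name `vocabulary_overlap` and the toy documents are hypothetical, and the PreCo authors may define overlap differently.

```python
def vocabulary_overlap(train_docs, test_docs):
    """Fraction of test tokens whose lowercased form also occurs in training.

    Each document is a list of token strings. This is a simple proxy for
    training-test overlap, not the measure used by the PreCo authors.
    """
    train_vocab = {tok.lower() for doc in train_docs for tok in doc}
    test_tokens = [tok.lower() for doc in test_docs for tok in doc]
    if not test_tokens:
        return 0.0
    seen = sum(1 for tok in test_tokens if tok in train_vocab)
    return seen / len(test_tokens)

# Toy documents, invented purely for illustration.
train_docs = [["The", "dog", "ran", "home"], ["She", "likes", "the", "dog"]]
test_docs = [["The", "cat", "ran", "home"]]

print(vocabulary_overlap(train_docs, test_docs))  # 0.75
```

With a small vocabulary and a large training set, a measure like this tends toward 1, so fewer test errors can be attributed to words the model has never seen.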
With PreCo released, the researchers still have work ahead. First, they will continue to improve the annotation quality of the dataset by reducing human errors and the ambiguities caused by annotation rules. Second, they will provide questions and answers on the documents in PreCo, also with coreference annotations. Finally, they will keep working to make the dataset larger.
YITU is dedicated to developing AI technologies that benefit society. NLP is one of the company’s focuses, as YITU believes NLP has great potential across industries. YITU has been exploring NLP applications in intelligent medical records and voice recognition for several years. The newly released PreCo dataset is one of its recent contributions to NLP research and is available for research purposes at https://preschool-lab.github.io/PreCo/.
The 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) takes place at the Square Brussels Convention Centre in Brussels, Belgium, from Oct. 31 to Nov. 4. The researchers from YITU gave an oral presentation on PreCo on Nov. 2. All visitors are welcome to stop by YITU’s booth 14 at the conference.