Semantic Indexing of Multilingual Corpora

The increasing number of multilingual text collections available in different domains makes their automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline that overcomes most of the limitations of these techniques and enables direct comparison across texts in different languages without the need for translation. In our paper we show its potential for semantically indexing multilingual text collections. We used a multilingual version of the Bible for the experiments (available for download), evaluating the precision of our semantic indexing pipeline and showing its reliability on the cross-lingual text retrieval task.
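To make the idea of comparing texts across languages without translation more concrete, here is a minimal sketch of an inverted index keyed by language-independent concept identifiers. It is an illustration only and not the pipeline described in the paper: the toy verses, the `concept:...` identifiers and the helper names `build_index` and `search` are all assumptions invented for this example, and the disambiguation step that would produce such identifiers is taken as already done.

```python
from collections import defaultdict

# Hypothetical toy data: each verse is assumed to be already disambiguated
# into language-independent concept identifiers. The IDs are invented
# placeholders, not entries from any real sense inventory.
ANNOTATED_VERSES = {
    ("en", "Gen 1:1"): ["concept:god", "concept:create", "concept:heaven", "concept:earth"],
    ("es", "Gen 1:1"): ["concept:god", "concept:create", "concept:heaven", "concept:earth"],
    ("en", "Gen 1:3"): ["concept:god", "concept:light"],
    ("es", "Gen 1:3"): ["concept:god", "concept:light"],
}


def build_index(annotated):
    """Map each concept ID to the set of (language, verse) pairs containing it."""
    index = defaultdict(set)
    for doc_key, concepts in annotated.items():
        for concept in concepts:
            index[concept].add(doc_key)
    return index


def search(index, query_concepts):
    """Return the documents, in any language, that contain every query concept."""
    postings = [index.get(concept, set()) for concept in query_concepts]
    return set.intersection(*postings) if postings else set()


index = build_index(ANNOTATED_VERSES)
hits = search(index, ["concept:light"])
print(sorted(hits))  # [('en', 'Gen 1:3'), ('es', 'Gen 1:3')]
```

Because the index keys are concepts rather than surface words, a single query returns the matching verses in both languages; a keyword index would need a separate query term per language.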

Downloads

Download the whole package [32.4 MB].

Or, alternatively, download each file individually:

Sense-annotated evaluation corpus (two chapters of the Bible) with 594 manual annotations (English and Spanish).
Bible disambiguated in four languages: English, Spanish, French and Russian.
Interface for semantic search in preprocessed multilingual corpora.
Note: The interface is currently a preliminary version. Feel free to modify or contribute to the Java code.
Repository of the interface [github]

Reference paper

When using these data, please refer to the following paper:

Alessandro Raganato, José Camacho-Collados, Antonio Raganato and Yunseo Joung.
Semantic Indexing of Multilingual Corpora and its Application on the History Domain. [paper] [bib] [poster]
LT4DH, COLING 2016, Osaka, Japan.

Contact

Should you have any enquiries about any of the resources, please contact Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or José Camacho-Collados (collados [at] di.uniroma1 [dot] it).

Last update: 8 Dec. 2016

Semantic Indexing of Multilingual Corpora is an output of the Sapienza Research Grant Avvio alla Ricerca 2015 No. 56 and is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.