Semantic Indexing of Multilingual Corpora

The increasing number of multilingual text collections available in different domains makes their automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline that overcomes most of the limitations of these techniques and enables direct comparison across texts in different languages without the need for translation. In our paper we show its potential for semantically indexing multilingual text collections. We used a multilingual version of the Bible for the experiments (available for download), evaluating the precision of our semantic indexing pipeline and showing its reliability on the cross-lingual text retrieval task.
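To illustrate what comparing texts across languages without translation can look like, here is a minimal sketch of an inverted index keyed by language-independent concept identifiers. It assumes each text has already been disambiguated into such identifiers. The identifiers, verse labels, and function names are hypothetical and are not taken from the released data or from the pipeline described in the paper.

```python
from collections import defaultdict

def build_index(docs):
    """Map each concept ID to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index

def retrieve(index, query_concepts):
    """Rank documents by the number of query concepts they share."""
    scores = defaultdict(int)
    for concept in query_concepts:
        for doc_id in index.get(concept, ()):
            scores[doc_id] += 1
    # Highest overlap first; ties broken alphabetically by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical, hand-made concept annotations (not real data).
docs = {
    "en:verse1": {"c:light", "c:god"},
    "it:verse1": {"c:light", "c:god"},
    "en:verse2": {"c:water"},
}
index = build_index(docs)
```

Because the English and Italian verses are indexed under the same concept identifiers, a query expressed as concepts matches both without any translation step.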


Download the whole package [32.4 MB].

Alternatively, download each file individually:

Due to licensing issues, we have not yet been able to provide offline software to preprocess arbitrary corpora. In the meantime, if you have a corpus (up to 10MB) and would like to use our interface, please feel free to send it to us (split into .txt files) and we will preprocess it for you. The preprocessing includes disambiguation, entity linking, and conversion to XML.
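Before sending a corpus, it may help to check that it is split into .txt files and stays within the size limit. The following is a minimal sketch of such a check. The function name is hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "10MB" is interpreted as 10 MiB.
MAX_BYTES = 10 * 1024 * 1024

def check_corpus(folder, max_bytes=MAX_BYTES):
    """Return (txt_files, total_bytes, fits) for the .txt files in folder."""
    files = sorted(Path(folder).glob("*.txt"))  # top-level .txt files only
    total = sum(f.stat().st_size for f in files)
    return files, total, total <= max_bytes
```

Calling `check_corpus("my_corpus")` on a local folder returns the list of .txt files found, their combined size in bytes, and whether that size is within the limit.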

Reference paper

When using these data, please refer to the following paper:

Alessandro Raganato, José Camacho-Collados, Antonio Raganato and Yunseo Joung.
Semantic Indexing of Multilingual Corpora and its Application on the History Domain. [paper] [bib] [poster]
LT4DH, COLING 2016, Osaka, Japan.


Should you have any enquiries about any of the resources, please contact Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or José Camacho Collados (collados [at] di.uniroma1 [dot] it).

Last update: 8 Dec. 2016