Semantic Indexing of Multilingual Corpora
The growing number of multilingual text collections available in different domains makes their automatic processing essential for the development of a given field.
However, standard processing techniques based on statistical clues and keyword searches have clear limitations.
Instead, we propose a knowledge-based processing pipeline that overcomes most of the limitations of these techniques and enables direct comparison of texts in different languages
without the need for translation. In our paper we show its potential for semantically indexing multilingual text collections.
We used a multilingual version of the Bible for the experiments (available for download), evaluating the precision of our semantic indexing pipeline and
showing its reliability on the cross-lingual text retrieval task.
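To illustrate how indexing by meaning rather than by keyword allows retrieval across languages without translation, here is a minimal sketch. It is not the pipeline from the paper: it assumes each verse has already been disambiguated into language-independent concept identifiers, and the identifiers, the sample data and the function names (build_index, search) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical pre-disambiguated input: (language, verse id) -> concept IDs.
# The concept IDs are invented placeholders, not entries of a real sense inventory.
ANNOTATED = {
    ("en", "GEN 1:1"): {"c:god", "c:create", "c:heaven", "c:earth"},
    ("es", "GEN 1:1"): {"c:god", "c:create", "c:heaven", "c:earth"},
    ("en", "GEN 1:3"): {"c:god", "c:light"},
    ("fr", "GEN 1:3"): {"c:god", "c:light"},
}


def build_index(annotated):
    """Build an inverted index: concept ID -> set of (language, verse id)."""
    index = defaultdict(set)
    for doc_key, concepts in annotated.items():
        for concept in concepts:
            index[concept].add(doc_key)
    return index


def search(index, query_concepts):
    """Return the verses, in any language, that contain every query concept."""
    postings = [index.get(concept, set()) for concept in query_concepts]
    if not postings:
        return set()
    return set.intersection(*postings)


index = build_index(ANNOTATED)
# A single concept query matches verses in every language that expresses it.
hits = search(index, {"c:light"})
```

Because the index keys are concepts rather than surface words, the same query retrieves the English and French verses alike; a keyword index would need one query per language.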
Download the whole package [32.4 MB].
Or, alternatively, download each file individually:
- Sense-annotated evaluation corpus (two chapters of the Bible) with 594 manual annotations (English and Spanish). Download:
- Bible disambiguated in four languages: English, Spanish, French and Russian. Download:
- Interface for semantic search in preprocessed multilingual corpora. Download:
Note: The interface is currently in its preliminary version. Feel free to make any modification or contribution to the Java code.
Repository of the interface [github]
When using these data, please refer to the following paper:
Alessandro Raganato, José Camacho-Collados, Antonio Raganato and Yunseo Joung.
Semantic Indexing of Multilingual Corpora and its Application on the History Domain. [paper] [bib] [poster]
LT4DH, COLING 2016, Osaka, Japan.
Should you have any enquiries about any of the resources, please contact Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or
José Camacho-Collados (collados [at] di.uniroma1 [dot] it).
Last update: 8 Dec. 2016