Rosetta Stone

Data

The EUTRANS-I corpus

EUTRANS-I is a simple translation corpus that was produced and used in the EuTrans project. It corresponds to the so-called "Traveller Task", which covers human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small "seed corpora" obtained from several traveller-oriented booklets. More details and experimental results can be found here. Only a benchmark version of the Spanish-English corpus is available here for academic research (300 KB).

The GERMANA corpus

GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled “Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón”, written in 1891 by Vicent Salvador. A detailed description and download instructions can be found here.

The IAM-PRHLT bi-modal Handwritten Text corpus

The biMod-IAM-PRHLT corpus is a bimodal dataset of on-line and off-line handwritten text. It consists of a set of handwritten words (approximately 500), with several instances of each word in both the on-line and off-line modalities. The off-line samples are grey-level images (PNG format), and the on-line samples are sequences of X-Y coordinates (UNIPEN format, originally XML) describing the trajectory of an electronic pen while writing the same word. The writers of the on-line and off-line samples are generally different. A more detailed description and download instructions can be found here.
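
As a rough illustration of what an on-line sample looks like once loaded, the sketch below reads pen-down strokes from UNIPEN-style text. It assumes the common layout in which .PEN_DOWN and .PEN_UP keywords sit on their own lines and each following line holds one "x y" point; the actual corpus files may declare other keywords or extra columns, so treat this as a starting point rather than the corpus's own loader. The function name parse_unipen_strokes is hypothetical.

```python
def parse_unipen_strokes(text):
    """Return a list of strokes; each stroke is a list of (x, y) tuples.

    Assumption: .PEN_DOWN / .PEN_UP appear alone on a line and every
    coordinate line starts with the X and Y values. Other UNIPEN
    keywords (lines starting with '.') are ignored.
    """
    strokes = []
    current = None  # points of the stroke being read, or None if pen is up
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("."):
            keyword = line.split()[0]
            if keyword == ".PEN_DOWN":
                current = []
            elif keyword == ".PEN_UP":
                if current:
                    strokes.append(current)
                current = None
            continue
        if current is not None:
            parts = line.split()
            # Keep only the first two columns (X, Y); drop any extras.
            current.append((float(parts[0]), float(parts[1])))
    if current:  # file ended while the pen was still down
        strokes.append(current)
    return strokes
```

Each returned stroke is one continuous pen-down trajectory, so a word written with two pen lifts would yield three strokes. Points recorded while the pen is up are skipped in this sketch.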