Bridging Languages through Etymology: The case of cross language text categorization
Vivi Nastase and Carlo Strapparava
The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)
Sofia, Bulgaria, August 4-9, 2013
We propose the hypothesis that word etymology is useful for NLP applications as a bridge between languages. We support this hypothesis with experiments in cross-language (English-Italian) document categorization. In a straightforward bag-of-words experimental set-up, we add the etymological ancestors of the words in the documents and investigate how a model built on English data performs on Italian test data (and vice versa). The results show an improvement that is not only statistically significant but also large: a jump of almost 40 percentage points over the raw (vanilla bag-of-words) representation.
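The idea of adding etymological ancestors to a bag-of-words representation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `TOY_ETYMOLOGY` table, the `lat:` feature prefix, and the function names are hypothetical and are not taken from the paper's resources or implementation. The sketch only shows why shared ancestors can give English and Italian documents common features that their surface words lack.

```python
from collections import Counter

# Hypothetical toy etymology table: (language, word) -> ancestor features.
# Illustrative entries only; the paper's actual etymological resource is not shown here.
TOY_ETYMOLOGY = {
    ("en", "school"): ["lat:schola"],
    ("it", "scuola"): ["lat:schola"],
    ("en", "student"): ["lat:studere"],
    ("it", "studente"): ["lat:studere"],
}


def bag_of_words(text, lang, etymology=None):
    """Count surface tokens; optionally add one feature per etymological ancestor."""
    tokens = text.lower().split()
    features = Counter(tokens)
    if etymology is not None:
        for token in tokens:
            for ancestor in etymology.get((lang, token), []):
                features[ancestor] += 1
    return features


def shared_features(a, b):
    """Return the set of feature names that occur in both bags."""
    return set(a) & set(b)


en_doc = "the student walks to school"
it_doc = "lo studente va a scuola"

# Raw bags: the two documents share no surface tokens.
raw_overlap = shared_features(
    bag_of_words(en_doc, "en"),
    bag_of_words(it_doc, "it"),
)

# Augmented bags: the shared ancestors become common features.
ety_overlap = shared_features(
    bag_of_words(en_doc, "en", TOY_ETYMOLOGY),
    bag_of_words(it_doc, "it", TOY_ETYMOLOGY),
)

print(sorted(raw_overlap))
print(sorted(ety_overlap))
```

In this toy setting, a classifier trained on the augmented English features would have at least some features in common with Italian test documents, which is the bridging effect the abstract describes. How ancestors are selected and weighted in the actual experiments is not specified here.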