Language Modeling for Morphologically Rich Languages

LMMRL is a multilingual corpus for evaluating language models across typologically diverse languages.

We provide preprocessed training, validation and test sets for 50 languages.
The corpus is sourced from all 40 languages of the Polyglot Wikipedia (Al-Rfou et al. 2013), plus 10 more typologically diverse languages extracted from Wikipedia.
Each training set contains 40,000 sentences; validation and test sets contain 3,000 sentences each.

Please contact Daniela Gerz with any questions.

Publications

Please cite the following paper if you use the LMMRL corpus in your work:

Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction
Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart and Anna Korhonen. TACL 2018.
[pdf][code coming soon][bibtex coming soon]

The paper contains more detailed corpus statistics and a benchmark of common language models on this data.

Download

Download the full corpus for all 50 languages by clicking here.

Corpus Statistics

Here is an overview of the corpus statistics.

Language # Vocab (train) # Vocab (test) # Tokens (train) # Tokens (test) Type-Token Ratio (train)
× Amharic (am) 89749 94554 511158 39179 0.18
× Arabic (ar) 89089 94121 722348 54671 0.12
☐ Bulgarian (bg) 71360 75256 669953 48953 0.11
☐ Catalan (ca) 61033 63595 788303 59359 0.08
☐ Czech (cs) 86783 91083 640732 49562 0.14
☐ Danish (da) 72468 76086 663274 50319 0.11
☐ German (de) 80741 84786 682105 51255 0.12
☐ Greek (el) 76264 80031 743918 56491 0.10
☐ English (en) 55521 58001 782713 59514 0.07
☐ Spanish (es) 60196 62917 781495 57176 0.08
★ Estonian (et) 94184 98091 555718 38568 0.17
★ Basque (eu) 81177 84542 647264 47285 0.13
☐ Farsi (fa) 52306 54347 738162 54191 0.07
★ Finnish (fi) 115579 122068 585229 44839 0.20
☐ French (fr) 58539 61114 768695 57061 0.08
× Hebrew (he) 83217 87079 717380 54576 0.12
☐ Hindi (hi) 50384 53013 666337 49116 0.08
☐ Croatian (hr) 86357 90728 619753 48130 0.14
★ Hungarian (hu) 101874 106889 671705 48687 0.15
▷ Indonesian (id) 49125 51360 701716 52239 0.07
☐ Italian (it) 70194 73117 786902 59258 0.09
★ Japanese (ja) 44863 46631 729373 54573 0.06
★ Javanese (jv) 65141 69433 621995 51967 0.10
★ Georgian (ka) 80211 83949 579587 41137 0.14
▷ Khmer (km) 37851 39154 578796 37402 0.07
★ Kannada (kn) 94660 99264 433873 29366 0.22
★ Korean (ko) 143794 152069 648098 50612 0.22
☐ Lithuanian (lt) 81501 85292 554427 41708 0.15
☐ Latvian (lv) 75294 79858 587029 44968 0.13
▷ Malay (ms) 49385 52209 702195 54085 0.07
★ Mongolian (mng) 73884 78055 628780 50015 0.12
▷ Burmese (my) 20574 21329 575555 46065 0.04
▷ Min-Nan (nan) 33238 34642 1176075 65611 0.03
☐ Dutch (nl) 60206 62832 708357 53846 0.08
☐ Norwegian (no) 69761 73113 674072 47804 0.10
☐ Polish (pl) 97325 101851 634229 47713 0.15
☐ Portuguese (pt) 56167 58561 779877 59309 0.07
☐ Romanian (ro) 68913 71992 743115 52472 0.09
☐ Russian (ru) 98097 102084 666479 48369 0.15
☐ Slovak (sk) 88726 93247 618457 45004 0.14
☐ Slovene (sl) 83997 88340 658605 49242 0.13
☐ Serbian (sr) 81617 85258 628119 46698 0.13
☐ Swedish (sv) 77499 81608 688360 50377 0.11
★ Tamil (ta) 106403 112420 506898 39647 0.21
▷ Thai (th) 30056 31356 628000 49000 0.05
▷ Tagalog (tl) 72416 76207 972302 66325 0.07
★ Turkish (tr) 90840 95448 627000 45000 0.14
☐ Ukrainian (uk) 89724 94707 635002 47042 0.14
▷ Vietnamese (vi) 32055 33215 753746 61908 0.04
▷ Chinese (zh) 43672 45325 745900 56769 0.06

The symbol before each language name indicates its morphological type:

▷ Isolating ☐ Fusional × Introflexive ★ Agglutinative
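
The type-token ratios in the table are consistent with dividing the training vocabulary size by the number of training tokens, rounded to two decimals (for Amharic, 89749 / 511158 ≈ 0.18). The following is a minimal sketch of that computation, not the authors' code. The function name `type_token_ratio` and the toy sentences are illustrative, and splitting on whitespace is an assumption about how the preprocessed files are tokenized.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Return (number of types, number of tokens, type-token ratio)."""
    counts = Counter(tokens)
    n_types = len(counts)             # distinct word forms (vocabulary size)
    n_tokens = sum(counts.values())   # total running words
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, ratio


# Toy input; whitespace tokenization is an assumption, not a documented format.
sentences = ["the cat sat on the mat", "the dog sat"]
tokens = [tok for sentence in sentences for tok in sentence.split()]
n_types, n_tokens, ratio = type_token_ratio(tokens)

# Reproduce one table entry from its vocabulary and token counts (Amharic, train).
amharic_ratio = round(89749 / 511158, 2)
```

A higher ratio means more distinct word forms per running word. In the table, the agglutinative languages mostly sit at the high end (Korean and Kannada at 0.22) and the isolating languages at the low end (Min-Nan at 0.03).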

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183-192.