All Languages Are NOT Created (Tokenized) Equal

Topbots

I used the dev split of the dataset, which consists of 2033 texts translated into each of the languages.

[Figure: Distribution of token lengths for all 2033 messages and 52 languages.]

70% of research papers published at a computational linguistics conference evaluated only English [Association for Computational Linguistics].
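To make the comparison concrete, here is a minimal sketch of how per-language token-length statistics could be computed over parallel texts. It is not the article's actual code: the function names `utf8_byte_tokenize` and `token_length_stats` are hypothetical, the byte-level tokenizer is only a stand-in for whatever tokenizer is being studied, and the three-language sample is toy data rather than the 2033-text dev split.

```python
from statistics import mean, median
from typing import Callable, Dict, List


def utf8_byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (an assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def token_length_stats(
    texts_by_language: Dict[str, List[str]],
    tokenize: Callable[[str], List[int]] = utf8_byte_tokenize,
) -> Dict[str, Dict[str, float]]:
    """Summarize token counts per language for a set of parallel texts."""
    stats = {}
    for language, texts in texts_by_language.items():
        # Token length of each text under the chosen tokenizer.
        lengths = [len(tokenize(text)) for text in texts]
        stats[language] = {
            "mean": mean(lengths),
            "median": median(lengths),
            "max": max(lengths),
        }
    return stats


# Toy parallel sample; the real data would hold the same texts in every language.
parallel_texts = {
    "en": ["thank you", "good morning"],
    "es": ["gracias", "buenos días"],
    "ru": ["спасибо", "доброе утро"],
}

stats = token_length_stats(parallel_texts)
for language, summary in stats.items():
    print(language, summary)
```

Swapping in a real tokenizer only requires passing a different `tokenize` callable that returns a list of tokens for a string. Because the texts are parallel, differences in the summaries reflect the tokenizer and the script rather than the content.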