All Languages Are NOT Created (Tokenized) Equal

Topbots

I used the dev split of the dataset, which consists of 2033 texts translated into each of the languages. Distribution of token lengths for all 2033 messages and 52 languages. 70% of research papers published in a computational linguistics conference evaluated only English. Figure created by author.