Common Flaws in NLP Evaluation Experiments
Ehud Reiter
JANUARY 15, 2024
However, I think journals such as Computational Linguistics and TACL could adjust their reviewing procedures to check some of the above. We only looked at human evaluations, but I suspect the problem may be just as bad with metric evaluations (e.g., see Arvan et al. (2022)).