
AI Built for Basque Language Considers Dialects, Social Media Language, and Old Literary Language

Researchers have developed a new family of language models for the Basque language, intentionally trained on dialects, social media language, and historical texts—not just the polished modern literary language. The work is driven by concerns that conventional data filtering makes language models biased and brittle, because non-standard language forms are dropped from the training data entirely.

Large language models are typically trained on vast text datasets that are cleaned of errors, colloquial language, and 'low-quality' text. According to researchers, this can mean that dialects, the social media language used by young people, or old language forms disappear from view—and the model learns to understand only standardized literary language.

The work focuses on Basque, a highly inflected language that is also low-resource in NLP terms. The researchers compiled three different Basque text corpora: standardized literary language, social media messages, and historical texts. On this basis they introduce a family of encoder-based language models named BERnaT, trained in three versions: one on standard language only, a diverse one trained on the non-standard data, and one on the combination of both.
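The three-variant setup described above can be sketched as a simple corpus-selection step. This is a minimal, hypothetical illustration of the idea, not the authors' actual pipeline; the function name, variant labels, and the tiny example texts are all invented for clarity.

```python
# Hypothetical sketch: assemble pretraining text for one of three model
# variants (standard-only, non-standard "diverse", or combined), in the
# spirit of the BERnaT setup. Contents and proportions are illustrative.

def build_corpus(standard, social, historical, variant):
    """Return the training texts for one model variant.

    variant: "standard"  -> standardized literary language only
             "diverse"   -> non-standard sources (social media + historical)
             "combined"  -> all three corpora
    """
    if variant == "standard":
        return list(standard)
    if variant == "diverse":
        return list(social) + list(historical)
    if variant == "combined":
        return list(standard) + list(social) + list(historical)
    raise ValueError(f"unknown variant: {variant}")

standard = ["Euskara hizkuntza ederra da."]    # edited literary Basque
social = ["eskerrik asko!!"]                   # informal social media message
historical = ["çure resuma ethor bedi"]        # older orthography

print(len(build_corpus(standard, social, historical, "combined")))  # → 3
```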

Additionally, the researchers propose a new evaluation method in which language comprehension tasks are split into separate tracks: one testing standardized language and one testing natural language variation. The aim is to measure how well the models handle real, varied language rather than excelling only on carefully edited text.
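The core of such a split evaluation is to report accuracy per register rather than one pooled score. A minimal sketch of that idea, with invented data and hypothetical model outputs (the paper's actual benchmarks and labels are not reproduced here):

```python
# Sketch: score a model separately on standard-language examples and on
# natural-variation examples (dialect, slang, old forms), instead of
# reporting one pooled accuracy. All data below is invented.

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

# Each example: (text, gold_label, register).
examples = [
    ("standard sentence A", 1, "standard"),
    ("standard sentence B", 0, "standard"),
    ("dialectal sentence",  1, "variation"),
    ("social-media slang",  0, "variation"),
]
preds = [1, 0, 1, 1]  # hypothetical model outputs

for register in ("standard", "variation"):
    idx = [i for i, ex in enumerate(examples) if ex[2] == register]
    acc = accuracy([preds[i] for i in idx], [examples[i][1] for i in idx])
    print(f"{register}: {acc:.2f}")
# → standard: 1.00
# → variation: 0.50
```

A pooled score here would be 0.75 and would hide the fact that the model is markedly weaker on non-standard language, which is exactly the gap the split evaluation is meant to expose.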

The research highlights that little attention has been paid to the internal diversity of language in the development of language models. According to the authors, models should be designed from the outset to reflect the full spectrum of the language—including language forms that remain on the margins.


This text was generated with AI assistance and may contain errors. Please verify details from the original source.

Original research: BERnaT: Basque Encoders for Representing Natural Textual Diversity
Publisher: arXiv (AI)
Authors: Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa
December 22, 2025