Syllable-Based Context-Aware Sequence to Sequence Transformer Model for Turkish Diacritic Restoration

Publisher

IEEE

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

As Internet use spreads and technology advances, the volume of digital text in many languages keeps growing. Because of differences in keyboards and alphabets, much of this text contains missing or incorrectly used diacritics, which makes it harder to read. This is also a problem for natural language processing (NLP) applications, which must interpret the meaning of words correctly despite these errors. This study focuses on diacritic restoration (DR), a crucial component of many NLP applications across multiple languages. We propose a syllable-based bidirectional Transformer structure, motivated by the strong dependence of word meaning on syllables in Turkish. In addition, incorporating a semantic marker into the training data improves the model's performance. Our experiments show that, with an optimized configuration, the proposed model significantly outperforms previous word-based and character-based studies. We achieve 98.84% accuracy on accented characters within ambiguous words and 92.85% accuracy in correcting ambiguous words, which indicates successful semantic learning. This is a significant advance in diacritic restoration and underscores the potential for improving NLP applications in various languages.
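
The syllable-level setup described in the abstract can be illustrated with a small Python sketch. It is not the authors' preprocessing code: the function names (`strip_diacritics`, `syllabify`, `make_pair`) are hypothetical, the syllabifier uses a simplified rule (one vowel per syllable, with the consonant directly before a vowel opening the new syllable), and the semantic marker mentioned in the abstract is not modeled because this record does not describe it.

```python
# Illustrative sketch only; names and rules here are assumptions,
# not the preprocessing used in the paper.

# Turkish letters with diacritics mapped to their ASCII look-alikes.
DIACRITIC_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")
VOWELS = set("aeıioöuüAEIİOÖUÜ")


def strip_diacritics(text):
    """Simulate text typed without Turkish diacritics."""
    return text.translate(DIACRITIC_MAP)


def syllabify(word):
    """Split a word into syllables with a simplified Turkish rule.

    Each syllable holds exactly one vowel. A new syllable starts at the
    consonant directly before a vowel, or at the vowel itself when two
    vowels are adjacent. Loanwords with unusual clusters may be split
    differently from dictionary hyphenation.
    """
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_idx) <= 1:
        return [word]
    boundaries = [0]
    for prev, cur in zip(vowel_idx, vowel_idx[1:]):
        boundaries.append(cur if cur - prev == 1 else cur - 1)
    boundaries.append(len(word))
    return [word[a:b] for a, b in zip(boundaries, boundaries[1:])]


def make_pair(word):
    """Build a (source, target) syllable pair for sequence-to-sequence training.

    The source is the diacritic-free form and the target is the original.
    Stripping maps each character to one character and keeps vowels as
    vowels, so both sides split at the same positions.
    """
    return syllabify(strip_diacritics(word)), syllabify(word)


if __name__ == "__main__":
    for w in ["öldü", "oldu", "kitaplar", "öğrenci"]:
        print(w, make_pair(w))
```

In this sketch, `make_pair("öldü")` and `make_pair("oldu")` produce the same source syllables (`ol`, `du`) even though the words differ ("died" versus "happened"), which is the kind of ambiguity a restoration model has to resolve from context.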

Description

32nd IEEE Signal Processing and Communications Applications Conference (SIU), May 15-18, 2024, Tarsus University Campus, Mersin, Turkey

Keywords

Natural Language Processing, Diacritics Restoration, Accent Correction, Sequence-to-Sequence Learning, Transformer Model, Syllable-Based Data, Semantic Learning

Source

32nd IEEE Signal Processing and Communications Applications Conference, SIU 2024
