A high performance centroid-based classification approach for language identification

dc.contributor.author: Takci, Hidayet
dc.contributor.author: Gungor, Tunga
dc.date.accessioned: 2025-10-29T11:23:54Z
dc.date.issued: 2012
dc.department: Gebze Teknik Üniversitesi
dc.description.abstract: Centroid-based classification is a machine learning approach used in the text classification domain. The main advantage of centroid-based classifiers is their high performance during both the training stage and the classification stage. However, their success rate can be lower than that of other classifiers if good centroid values are not used. In this paper, we apply the centroid-based classification method to the language identification problem, which can be considered a sub-problem of text classification. We propose a novel method, called inverse class frequency, that updates the classical centroid values to increase their quality. We also use a feature set formed of individual characters, rather than words or n-gram sequences, to decrease the training and classification times. The experiments were performed on the ECI/MCI corpus, and the method was compared with other methods and previous studies. The results showed that the proposed approach yields high success rates and works very efficiently for language identification. (c) 2012 Elsevier B.V. All rights reserved.
dc.identifier.doi: 10.1016/j.patrec.2012.06.012
dc.identifier.endpage: 2084
dc.identifier.issn: 0167-8655
dc.identifier.issn: 1872-7344
dc.identifier.issue: 16
dc.identifier.orcid: 0000-0002-4448-4284
dc.identifier.orcid: 0000-0001-9448-9422
dc.identifier.startpage: 2077
dc.identifier.uri: https://doi.org/10.1016/j.patrec.2012.06.012
dc.identifier.uri: https://hdl.handle.net/20.500.14854/9668
dc.identifier.volume: 33
dc.identifier.wos: WOS:000311260000003
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.ispartof: Pattern Recognition Letters
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WOS_20251020
dc.subject: Language identification
dc.subject: Centroid-based classification
dc.subject: IDF (inverse document frequency)
dc.subject: ICF (inverse class frequency)
dc.title: A high performance centroid-based classification approach for language identification
dc.type: Article
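
The approach summarized in the abstract (single-character features, one centroid per language, inverse class frequency weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the ICF formula, so the smoothed form log(1 + C / cf), modeled on IDF, is an assumption, as are the mean-vector centroids, the cosine similarity, the toy training sentences, and every function name below.

```python
import math
from collections import Counter


def char_tf(text):
    """Relative frequency of each alphabetic character in `text`."""
    chars = [c for c in text.lower() if c.isalpha()]
    total = len(chars) or 1
    return {c: n / total for c, n in Counter(chars).items()}


def train(samples):
    """Build ICF-weighted centroids from {language: [training strings]}."""
    centroids = {}
    for lang, docs in samples.items():
        acc = Counter()
        for doc in docs:
            for c, v in char_tf(doc).items():
                acc[c] += v
        # Assumed centroid: mean character-frequency vector of the class.
        centroids[lang] = {c: v / len(docs) for c, v in acc.items()}

    # Class frequency: in how many class centroids each character occurs.
    n_classes = len(centroids)
    cf = Counter(c for cen in centroids.values() for c in cen)
    # Assumed ICF form (by analogy with IDF), smoothed so that characters
    # shared by every class keep a nonzero weight.
    icf = {c: math.log(1 + n_classes / k) for c, k in cf.items()}

    weighted = {
        lang: {c: v * icf[c] for c, v in cen.items()}
        for lang, cen in centroids.items()
    }
    return weighted, icf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(c, 0.0) for c, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def identify(text, centroids, icf):
    """Return the language whose centroid is most similar to `text`."""
    vec = {c: v * icf.get(c, 0.0) for c, v in char_tf(text).items()}
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))


# Toy training data, for illustration only (not the ECI/MCI corpus).
samples = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "language identification is a text classification problem"],
    "tr": ["dil tanıma bir metin sınıflandırma problemidir",
           "çalışmada merkez tabanlı sınıflandırma yöntemi kullanıldı"],
    "de": ["die sprache wird anhand der zeichen erkannt",
           "größere mengen von text verbessern das ergebnis"],
}
centroids, icf = train(samples)
print(identify("sınıflandırma için yeni bir yöntem", centroids, icf))
```

Because the feature space is limited to individual characters, each centroid holds at most a few dozen entries, which is what keeps both training and classification cheap in a sketch like this. Under the assumed weighting, characters that occur in few classes (such as "ı" in the toy data) receive larger weights than characters shared by all classes.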
