A Turkish Word Frequency Tool: LexiTR Frequency

Taner Sezer; Özay Karadağ

doi:10.16916/aded.1636416

Research Article

Turkish title: Türkçe Sözcük Sıklığı Aracı: LexiTR Sıklık Aracı

Year 2025, Volume: 13 Issue: 2, 266 - 276, 30.04.2025

Abstract

Word frequency is a fundamental concept in linguistics, computational linguistics, natural language processing (NLP), and language education, and it plays a critical role in understanding the characteristics and usage patterns of a word. This study introduces the "Turkish Word Frequency Tool" (TWFT; in Turkish, "Türkçe Sözcük Sıklığı Aracı", TSSA), developed as part of the LexiTR Project, along with its features. TWFT is based on a balanced corpus of over 193 million words drawn from four distinct text types: academic, social media, fictional, and informative texts. TWFT serves as a scalable online platform that lets researchers examine word usage trends across these text types. It supports comprehensive analyses through real-time querying, graphical data representation, and both raw and normalized frequency values. Additionally, it provides API support, presenting word frequency information in a structured format. By filling a significant gap in the existing literature, TWFT aims to establish a consistent, transparent, and comprehensive foundation for linguistic research and natural language processing applications.
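To make the distinction between raw and normalized frequency values concrete, the following minimal Python sketch counts a word in a toy corpus split into the four text types named in the abstract and scales each count to occurrences per million tokens. Everything here is an assumption made for illustration: the toy sentences, the `tokenize` helper, the `frequency_profile` function, and the output field names are hypothetical and do not reproduce TWFT's corpus, the TS Tokenizer, or the structure of the tool's API responses.

```python
import json
import re
from collections import Counter

# Toy corpus (invented sentences). Only the four text-type labels
# come from the abstract; the content is purely illustrative.
CORPUS = {
    "academic": "bu çalışma sözcük sıklığı üzerine bir çalışma",
    "social_media": "bu çok güzel bir gün",
    "fiction": "bir gün bir adam yola çıktı",
    "informative": "sözcük sıklığı bir ölçüdür",
}


def tokenize(text):
    """Naive tokenizer (assumption): runs of Unicode word characters.

    No case folding is applied, because Python's str.lower() is not
    Turkish-aware (I/ı and İ/i); the toy text is already lowercase.
    """
    return re.findall(r"\w+", text)


def frequency_profile(word, corpus, per=1_000_000):
    """Return raw and per-million frequencies of `word` for each text type."""
    by_type = {}
    for text_type, text in corpus.items():
        tokens = tokenize(text)
        raw = Counter(tokens)[word]  # raw frequency: plain occurrence count
        # Normalized frequency: occurrences per `per` tokens, which makes
        # sub-corpora of different sizes comparable.
        normalized = raw * per / len(tokens) if tokens else 0.0
        by_type[text_type] = {
            "raw": raw,
            "tokens": len(tokens),
            "per_million": round(normalized, 2),
        }
    return {"word": word, "by_text_type": by_type}


if __name__ == "__main__":
    # Emit the result as a structured (JSON) record; field names are hypothetical.
    print(json.dumps(frequency_profile("bir", CORPUS), ensure_ascii=False, indent=2))
```

In this toy data the word "bir" occurs once in both the academic and the informative sample, yet its normalized value differs between them because the samples contain different numbers of tokens. This is why normalized values, rather than raw counts, are the ones to compare across text types of unequal size.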

Keywords

Frequency, lexicon, tokenization, TS Tokenizer, LexiTR

Details

Primary Language	English
Subjects	Turkish Education
Journal Section	Articles
Authors	Taner Sezer (ORCID: 0000-0002-7328-7650); Özay Karadağ (ORCID: 0000-0003-4596-1203)
Publication Date	April 30, 2025
Submission Date	February 9, 2025
Acceptance Date	March 17, 2025
Published in Issue	Year 2025, Volume: 13, Issue: 2

Cite

APA	Sezer, T., & Karadağ, Ö. (2025). A Turkish Word Frequency Tool: LexiTR Frequency. Ana Dili Eğitimi Dergisi, 13(2), 266-276. https://doi.org/10.16916/aded.1636416

Attribution-NonCommercial 4.0 International