Classification of Tetun Language Documents Based on INL and DIT Orthography with a Text Mining Approach

Edio da Costa; Almeida Barreto

Edio da Costa, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology, Dili, Timor-Leste
Almeida Barreto, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology, Dili, Timor-Leste

Keywords: Orthography classification, Tetun language, INL and DIT, text mining and orthography

Abstract

The main difficulty in language classification lies in accurately tracing relationships such as language evolution, language contact, and word borrowing, which in turn makes it hard to classify the orthography used in a text. In both government and non-government institutions in Timor-Leste, many individuals write documents using varying spellings. At the Dili Institute of Technology (DIT), a distinct spelling system has been developed and is in use alongside the guidelines of the National Institute of Linguistics (INL). The DIT orthography, which is based on contemporary Tetun, does not employ accents, as numerous studies have indicated that accents are unnecessary. The objective of this research is to develop an application that classifies documents using a text mining approach, with tokenization and filtering based on word lists from the INL and DIT orthographies, so that documents submitted by users are assigned to the correct orthography. The documents used in this research are written in either the INL or the DIT orthography. The INL word list comprises 1,487 words from the Tetun-Portuguese dictionary, while the DIT word list includes 756 words collected from the DIT Language Center and additional sources. The research findings indicate that the system is capable of classifying documents into the predefined orthographic categories.
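
The tokenization and word-list filtering described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the simple hit-counting rule, and the tie handling are assumptions made for illustration, and the tiny word sets are placeholders chosen only to contrast accented and unaccented spellings. They are not taken from the study's 1,487-word INL list or 756-word DIT list.

```python
import re

# Hypothetical miniature word lists (placeholders, NOT the study's data).
# The real INL and DIT lists contain 1,487 and 756 entries respectively.
INL_WORDS = {"ne'ebé", "sé", "maibé"}
DIT_WORDS = {"nebee", "see", "maibee"}

# A token is a run of Unicode letters, optionally joined by apostrophes,
# so that accented forms such as "ne'ebé" stay in one piece.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def classify(text):
    """Return (label, inl_hits, dit_hits) for a document.

    Filtering keeps only tokens found in one of the word lists; the
    label is the orthography with more hits (an assumed decision rule).
    """
    tokens = tokenize(text)
    inl_hits = sum(1 for token in tokens if token in INL_WORDS)
    dit_hits = sum(1 for token in tokens if token in DIT_WORDS)
    if inl_hits == dit_hits:
        # No evidence, or equal evidence for both orthographies.
        return "undetermined", inl_hits, dit_hits
    label = "INL" if inl_hits > dit_hits else "DIT"
    return label, inl_hits, dit_hits


if __name__ == "__main__":
    print(classify("Ema ne'ebé hatene, maibé sé mak bele?"))
    print(classify("Ema nebee hatene, maibee see mak bele?"))
```

Words that are spelled identically in both orthographies carry no signal under this rule, so a practical word list would presumably focus on entries whose spelling differs between INL and DIT.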
