Tetun Language Plagiarism Detection With Text Mining Approach Using N-gram and Jaccard Similarity Coefficient

  • Edio da Costa, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology
  • Vasco Soares Mali, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology
Keywords: Tetun language, plagiarism detection, text mining, n-grams, Jaccard similarity coefficient.


The objective of this research is to develop a plagiarism detection application for the Tetun language using a text mining approach. Tokenizing and filtering are used to extract and select a word list from the thesis titles submitted by students. The n-gram and Jaccard similarity coefficient methods are then used to extract the character sequences to be matched and to calculate the percentage similarity between the processed thesis titles. The dataset used in this study was obtained from the Dili Institute of Technology (DIT) Library and comprises a total of 1000 items. The word dictionary consists of 2,560 word-list entries and 8,972 stop words obtained from the Language Centre of DIT. The experimental results show that the highest precision and recall obtained in plagiarism detection were 0.90 and 0.94, respectively.
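The pipeline described above (tokenizing, filtering, character n-grams, Jaccard similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of n = 3, the sample titles, and the one-word stop list are all assumptions made for illustration, and the stop list is a placeholder rather than the DIT Language Centre list.

```python
import re


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", title.lower())


def filter_tokens(tokens, stop_words):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string (n=3 is an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def title_similarity(title_a, title_b, stop_words=frozenset(), n=3):
    """Percentage similarity of two titles after tokenizing and filtering."""
    clean_a = " ".join(filter_tokens(tokenize(title_a), stop_words))
    clean_b = " ".join(filter_tokens(tokenize(title_b), stop_words))
    return 100.0 * jaccard(char_ngrams(clean_a, n), char_ngrams(clean_b, n))


if __name__ == "__main__":
    # Hypothetical stop list and titles, for illustration only.
    stops = frozenset({"ba"})
    first = "Sistema Informasaun ba Biblioteka"
    second = "Sistema Informasaun Biblioteka"
    print(title_similarity(first, second, stops))
```

Using sets of n-grams means repeated n-grams count once, which matches the set-based definition of the Jaccard coefficient; a system would then flag a title when the percentage exceeds a chosen threshold, which the abstract does not specify.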



