Tetun Language Plagiarism Detection With Text Mining Approach Using N-gram and Jaccard Similarity Coefficient

  • Edio da Costa, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology
  • Vasco Soares Mali, Department of Computer Science, School of Engineering and Science, Dili Institute of Technology
Keywords: Tetun language, plagiarism detection, text mining, n-grams, Jaccard similarity coefficient.


The objective of this research is to develop a plagiarism detection application for the Tetun language using a text mining approach. Tokenizing and filtering are used to extract and select a word list from the thesis titles submitted by students. The n-gram and Jaccard similarity coefficient methods are then used to extract character sequences from the documents to be matched and to calculate the percentage of similarity between the processed thesis titles. The dataset used in this study was obtained from the Dili Institute of Technology (DIT) Library and contains a total of 1000 entries. The word dictionary consists of a word list of 2,560 entries and 8,972 stop words, both obtained from the Language Centre of DIT. The experimental results show that the plagiarism detection achieved a highest precision of 0.90 and a highest recall of 0.94.
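The pipeline described above (tokenizing, filtering, character n-grams, Jaccard similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of n = 3, the sample titles, and the three-word stop list are all assumptions made for the example, and the stop list is not the DIT dictionary.

```python
import re


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    # [^\W\d_]+ matches runs of Unicode letters (digits and underscores excluded).
    return re.findall(r"[^\W\d_]+", title.lower())


def filter_tokens(tokens, stop_words):
    """Drop tokens that appear in the stop-word collection."""
    return [t for t in tokens if t not in stop_words]


def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(a | b)


def title_similarity(title_a, title_b, stop_words=frozenset(), n=3):
    """Similarity of two titles after tokenizing, filtering, and n-gram extraction."""
    text_a = " ".join(filter_tokens(tokenize(title_a), stop_words))
    text_b = " ".join(filter_tokens(tokenize(title_b), stop_words))
    return jaccard(char_ngrams(text_a, n), char_ngrams(text_b, n))


if __name__ == "__main__":
    # Illustrative stop words and titles only; not taken from the study's data.
    stop_words = {"ba", "iha", "no"}
    score = title_similarity(
        "Sistema Informasaun ba Biblioteka",
        "Sistema Informasaun iha Biblioteka",
        stop_words,
    )
    print(f"similarity: {score:.0%}")
```

Multiplying the coefficient by 100 gives the percentage of similarity; a threshold on that percentage (not specified in the abstract) would then decide whether a title is flagged.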




