Scientific Journal Of King Faisal University
Basic and Applied Sciences


N-grams in Texts Categorization

(Zakaria Albraiha and Badr Abdaellati)


This paper addresses the automatic classification of documents. The classification is supervised, since it operates on a set of predefined classes. The proposed approach is original in that it represents documents as vectors built not from words but from character n-grams, for n ranging from 2 to 5. Because a large number of n-grams is generated for each class, we used the χ² (chi-square) statistic to reduce the number of characteristic n-grams of each class. The vectors were weighted with the TF-IDF measure, and the distance between two vectors was computed with the cosine measure. The experiments were carried out on two corpora well known in the text categorization community, Reuters-21578 and 20 Newsgroups. The approach was evaluated with a function combining precision and recall. The results show that the n-gram technique is very effective for text categorization.
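
The representation described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation: the function names (`char_ngrams`, `tfidf_vectors`, `cosine`) are hypothetical, the IDF variant log(N/df) is an assumption since the abstract does not state which TF-IDF formula was used, and the χ² selection step and the classifier itself are omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Return every character n-gram of `text` for n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs):
    """Map each document to a sparse {ngram: tf * idf} dictionary.

    Assumption: idf = log(N / df), one common variant among several.
    """
    counts = [Counter(char_ngrams(doc)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    return [{gram: tf * math.log(n_docs / df[gram]) for gram, tf in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Toy documents, invented for illustration only.
docs = ["wheat export prices", "wheat import prices", "hockey playoff game"]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # two documents on a similar topic
print(cosine(vectors[0], vectors[2]))  # two documents on unrelated topics
```

In a supervised setting, a new document would be compared in this way against vectors built for each predefined class and assigned to the closest one. How the class vectors are formed is not specified in the abstract, so that step is left out here.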