Modified N-Gram based Model for Identifying and Filtering Near-Duplicate Documents Detection
Over the last three decades, the World Wide Web (WWW) has expanded exponentially, and a great deal of its content
is duplicate or near-duplicate. Documents are served on the web in different formats such as PDF, HTML, Excel,
and plain text. Our proposed solution is developed on a publicly available dataset. The dataset consists of files that are tagged
as duplicates. This paper addresses duplicate and near-duplicate document detection using an n-gram based model combined with a
low-dimensional representation (LSI-SVD) approach, implemented in C#.NET.
Keywords - Duplicate document, N-gram, SVD (Singular Value Decomposition), LSI (Latent Semantic Indexing), Cosine
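
As a rough illustration of the n-gram and cosine components named above, the following minimal sketch compares two documents by their word n-gram counts. It is written in Python for brevity and is not the authors' C#.NET implementation. The function names, the choice of word-level 3-grams, and the 0.8 threshold are all assumptions made for this example, and the LSI-SVD dimensionality reduction step is omitted.

```python
import math
import re
from collections import Counter


def word_ngrams(text, n=3):
    """Count the word n-grams of a text (n=3 is an assumed default)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than n words has no n-grams
    return dot / (norm_a * norm_b)


def is_near_duplicate(doc_a, doc_b, n=3, threshold=0.8):
    """Flag a pair as near-duplicate; the threshold is a hypothetical value."""
    return cosine_similarity(word_ngrams(doc_a, n), word_ngrams(doc_b, n)) >= threshold
```

In practice the threshold and the n-gram size would need to be tuned on documents tagged as duplicates, such as those in the dataset described above.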