Academic and Research Articles

Web Documents Similarity Using K-Shingle Tokens and MinHash Technique

 مهدي عبادي مانع الموسوي 03/10/2018 17:34:06

Abstract: Web search engines rely on effective data mining techniques to discard similar documents from their search results. In massive data mining, document similarity techniques are important for detecting mirror pages and similar articles in a large web repository. This avoids showing two nearly identical web pages at the top of the search results. One document similarity approach is based on K-shingles: a K-shingle is a unique sequence of K consecutive words (K is a positive integer) that can be used to measure the similarity between two documents. Large web documents can be represented as sets of long bit vectors, in which 1 means that a shingle is found in the document and 0 means that it is not. Two nearly identical documents should have many shingles in common. The similarity ratio between two documents is calculated with a similarity measure such as Jaccard similarity. Jaccard similarity works well for comparing pairs of sets and computing a similarity score in a small dataset. In a large dataset, MinHash and Locality-Sensitive Hashing (LSH) address the cost of these comparisons by providing a small signature matrix that gives a fast approximation of the true Jaccard similarity in less time. In this study, we apply Jaccard similarity, MinHash, and LSH based on K-shingles to different numbers of documents. The results show that MinHash and LSH produce more accurate results in less time for large documents. In the experiments, the chosen K-shingle is applied to collections of 100, 200, and 300-1000 documents, and the number of hash functions is set to 10, 20, and 30. The average similarity computation time is under 5 seconds. False positives and false negatives were minimal with respect to the true clustering of the documents.
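The pipeline the abstract describes (word-level K-shingles, exact Jaccard similarity, MinHash signatures, and LSH banding) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the sample documents, the choice of K = 3, the CRC32-based shingle IDs, the linear hash family modulo a prime, and the split of 30 hash functions into 10 bands are all assumptions made for illustration; only the use of 30 hash functions echoes a setting mentioned in the abstract.

```python
import random
import zlib
from itertools import combinations

PRIME = 2_147_483_647  # 2^31 - 1; illustrative modulus for the hash family (assumption)

def shingles(text, k=3):
    """Return the set of word-level K-shingles (K consecutive words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, params):
    """One minimum per hash function; CRC32 gives each shingle a stable integer ID."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min(((a * x + b) % PRIME for x in ids), default=PRIME)
            for a, b in params]

def estimate_similarity(sig1, sig2):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def lsh_candidates(signatures, bands):
    """Return pairs of document names whose signatures agree on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = {}
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets.setdefault(key, []).append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "web search engines discard near identical pages from the results",
    "d2": "web search engines discard near identical pages from search results",
    "d3": "locality sensitive hashing groups similar signatures into buckets",
}
params = make_hash_params(30)
sets = {name: shingles(text, k=3) for name, text in docs.items()}
sigs = {name: minhash_signature(s, params) for name, s in sets.items()}

print("exact Jaccard d1/d2:", jaccard(sets["d1"], sets["d2"]))
print("MinHash estimate d1/d2:", estimate_similarity(sigs["d1"], sigs["d2"]))
print("LSH candidate pairs:", lsh_candidates(sigs, bands=10))
```

With only 30 hash functions the MinHash estimate is coarse, so it can differ noticeably from the exact Jaccard value on short texts. LSH is the step that avoids comparing every pair of documents: only pairs that share a bucket in at least one band need a full comparison.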

  • Tags for this topic
  • Data Mining, Document similarity
