An Introduction to Duplicate Detection

Overview

With the ever-increasing volume of data, data quality problems abound. Duplicates, that is, multiple yet different representations of the same real-world object, are among the most intriguing of these problems. Their effects are detrimental: for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Automatically detecting duplicates is difficult for two reasons. First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components used to overcome these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to search very large volumes of data for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection.

Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
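The two difficulties named in the overview can be illustrated with a minimal sketch. It is not taken from the book: the token-based Jaccard coefficient is only one of the similarity functions listed in the contents, and the sample records, the 0.8 threshold, and the function names below are assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record):
    """Lowercase a record and split it into a set of whitespace-separated tokens."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard coefficient of two records: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical (an assumption)
    return len(ta & tb) / len(ta | tb)


def naive_duplicates(records, threshold):
    """Compare every pair of records and return the index pairs at or above the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]


def pair_count(n):
    """Number of record pairs an exhaustive comparison must examine: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical records: the first two describe the same person with slightly different values.
records = [
    "John A. Smith 12 Main Street Springfield",
    "John Smith 12 Main Street Springfield",
    "Jane Doe 7 Oak Avenue Portland",
]

candidate_pairs = naive_duplicates(records, threshold=0.8)  # hypothetical threshold
comparisons_for_a_million = pair_count(1_000_000)
```

With these sample records, only the first two are reported as a candidate pair, since they share six of seven distinct tokens. The exhaustive approach needs n(n-1)/2 comparisons, which is roughly 5 x 10^11 for one million records; this is why the overview calls all-pairs comparison infeasible and why algorithms that limit the number of comparisons matter.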

Product Details

ISBN-13: 9781608452200
Publisher: Morgan and Claypool Publishers
Publication date: 3/12/2010
Series: Synthesis Lectures on Data Management
Pages: 88
Product dimensions (in.): 7.50 (w) x 9.25 (h) x 0.18 (d)

Table of Contents

1 Data Cleansing: Introduction and Motivation
1.1 Data Quality
1.1.1 Data Quality Dimensions
1.1.2 Data Cleansing
1.2 Causes for Duplicates
1.2.1 Intra-Source Duplicates
1.2.2 Inter-Source Duplicates
1.3 Use Cases for Duplicate Detection
1.3.1 Customer Relationship Management
1.3.2 Scientific Databases
1.3.3 Data Spaces and Linked Open Data
1.4 Lecture Overview

2 Problem Definition
2.1 Formal Definition
2.2 Complexity Analysis
2.3 Data in Complex Relationships
2.3.1 Data Model
2.3.2 Challenges of Data with Complex Relationships

3 Similarity Functions
3.1 Token-based Similarity
3.1.1 Jaccard Coefficient
3.1.2 Cosine Similarity Using Token Frequency and Inverse Document Frequency
3.1.3 Similarity Based on Tokenization Using q-grams
3.2 Edit-based Similarity
3.2.1 Edit Distance Measures
3.2.2 Jaro and Jaro-Winkler Distance
3.3 Hybrid Functions
3.3.1 Extended Jaccard Similarity
3.3.2 Monge-Elkan Measure
3.3.3 Soft TF/IDF
3.4 Measures for Data with Complex Relationships
3.5 Other Similarity Measures
3.6 Rule-based Record Comparison
3.6.1 Equational Theory
3.6.2 Duplicate Profiles

4 Duplicate Detection Algorithms
4.1 Pairwise Comparison Algorithms
4.1.1 Blocking
4.1.2 Sorted-Neighborhood
4.1.3 Comparison
4.2 Algorithms for Data with Complex Relationships
4.2.1 Hierarchical Relationships
4.2.2 Relationships Forming a Graph
4.3 Clustering Algorithms
4.3.1 Clustering Based on the Duplicate Pair Graph
4.3.2 Clustering Adjusting to Data & Cluster Characteristics

5 Evaluating Detection Success
5.1 Precision and Recall
5.2 Data Sets
5.2.1 Real-World Data Sets
5.2.2 Synthetic Data Sets
5.2.3 Towards a Duplicate Detection Benchmark

6 Conclusion and Outlook

Bibliography

Authors' Biographies
