Large Scale End-to-End Entity Resolution: Algorithms & Explanations
In this Big Data era, many organizations have become data-rich. However, because data may be collected and integrated from multiple sources (both internal and external), these organizations also face the challenge of dealing with “dirty” data. Even for the same real-world entity, occurrences from different sources may show discrepancies. For example, David Smith may appear as “Dave Smith” in one dataset and as “David Smith” in another. Likewise, two seemingly different product descriptions may refer to the same make of flash card. Similarly, two records may describe the same product, but the brand in one record may be incorrectly recorded as part of the product name (e.g., a product name of “Adobe Acrobat 8” vs. a product name of “Acrobat 8” with a brand of “Adobe”). As such, before any serious data analytics can be performed, the data must be cleaned to remove redundancy and inconsistency, handle missing data, and correct any data errors. This proposal focuses on one key task in ensuring the quality of the data: the Entity Resolution problem.
Entity resolution (ER), also known as duplicate record detection, is a fundamental problem in data integration and data cleaning. Given two (possibly identical) entity databases D1 and D2, the goal of ER is to determine, for each entity pair (r, s) with r ∈ D1 and s ∈ D2, whether r and s represent the same real-world object. When D1 and D2 are the same, the task is to identify duplicates within a single database. The problem has a very long research history, and various types of methods have been proposed. While much progress has been made, existing approaches are still far from offering satisfactory results. Our preliminary study on widely used benchmark datasets shows that F1-scores can be as low as 60% for some datasets, suggesting that there is much room for improvement.
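The pairwise formulation and the F1 metric mentioned above can be illustrated with a minimal sketch. It is not one of the schemes developed in this project: the character-level similarity from Python's standard `difflib` module is only a stand-in for a real matcher, and the function names (`similarity`, `resolve`, `f1_score`) and the 0.8 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a learned matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(d1, d2, threshold=0.8):
    """Compare every pair (r, s) with r in d1 and s in d2.

    Returns the set of index pairs (i, j) predicted to be the same entity.
    The threshold is an illustrative assumption, not a tuned value.
    """
    return {
        (i, j)
        for (i, r), (j, s) in product(enumerate(d1), enumerate(d2))
        if similarity(r, s) >= threshold
    }


def f1_score(predicted, truth):
    """Harmonic mean of precision and recall over sets of matching pairs."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    d1 = ["David Smith"]
    d2 = ["Dave Smith", "Kian-Lee Tan"]
    matches = resolve(d1, d2)
    print(matches, f1_score(matches, {(0, 0)}))
```

Comparing every pair costs |D1| × |D2| similarity computations, which is why a candidate-pruning step such as blocking becomes necessary on large datasets.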
More importantly, as with other deep learning (DL) based schemes, it is unclear how to interpret the results: why are two unrelated entities wrongly labelled as the same entity, and why are two occurrences of one entity considered different entities? While there has been some recent work on explaining machine learning and deep learning models in general, there has not been much effort to interpret ER models.
To address these research challenges, this project will focus on developing an end-to-end entity resolution framework that not only offers high accuracy but also facilitates interpretation of the models. More specifically, we seek to:
· Develop new ER schemes. A natural question is whether deep learning can help with ER problems. If so, how effective is it for different ER tasks (e.g., structured, textual, or dirty data)? Do we need more complex models, or would a simple model suffice? And what about the size of the training data: would having more data help a simple model or a complex model more? We seek to answer these questions as we develop new ER schemes.
· Explain ER Predictions. Traditionally, ER schemes are developed and evaluated based on the notion of “accuracy” (e.g., using recall, precision and the F1 metric). Interestingly, we have discovered that even for the same model on the same dataset, the accuracy can differ substantially depending on how the training and test data are selected. It is therefore crucial for practitioners to understand why certain predictions are made, so that they can assess the trustworthiness of a model. Unfortunately, while research on “Explainable AI” has increased, very little of it has been done in the context of ER. We will investigate mechanisms to help users understand ER predictions.
· Develop a Large Scale End-to-End ER Management System. We will integrate the techniques developed above into a full-fledged end-to-end ER management system. Such a system goes beyond a naïve integration and will include additional components such as blocking (a pre-processing phase that precedes the ER phase) and a rule-based engine to facilitate the labeling of training data. For the framework to be of practical use, we need to ensure that (a) it is efficient even for large datasets and (b) it is easy to use. We also plan to further enhance our methods to adapt progressively and incrementally to feedback. In other words, the schemes should adapt to prediction errors (based on user feedback) to improve their accuracy.
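The blocking component mentioned in the last objective can be sketched as follows. This is a minimal illustration of token-based blocking, a common generic technique, and not necessarily the blocking method this project uses. The function names (`token_blocks`, `candidate_pairs`) and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def token_blocks(records):
    """Map each lowercase token to the indices of the records containing it."""
    blocks = defaultdict(set)
    for idx, record in enumerate(records):
        for token in record.lower().split():
            blocks[token].add(idx)
    return blocks


def candidate_pairs(d1, d2):
    """Return index pairs (i, j) whose records share at least one token.

    Only these candidates are passed on to the (more expensive) matching
    phase; all other pairs are assumed to be non-matches.
    """
    blocks1, blocks2 = token_blocks(d1), token_blocks(d2)
    pairs = set()
    for token in blocks1.keys() & blocks2.keys():
        pairs.update(product(blocks1[token], blocks2[token]))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample records, loosely based on the examples in the text.
    d1 = ["Adobe Acrobat 8", "David Smith"]
    d2 = ["Acrobat 8", "Dave Smith", "Generic flash card"]
    print(sorted(candidate_pairs(d1, d2)))
```

On this sample, blocking keeps two of the six possible pairs for the matching phase. The trade-off is that a true match sharing no token at all would be discarded before matching, so the choice of blocking key affects recall.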
· Kian-Lee Tan
· Dongxiang Zhang (Zhejiang University)
· Multi-Context Attention for Entity Matching (Short Paper). D. Zhang, Y. Nie, S. Wu, Y. Shen, K.L. Tan. WWW'2020, Taipei, Taiwan, April 20-24, 2020, pp. 2634-2640.
· Unsupervised Entity Resolution with Blocking and Graph Algorithms. D. Zhang, D. Li, L. Guo, K.L. Tan. IEEE Transactions on Knowledge and Data Engineering. Accepted in Apr 2020.