My research is in developing database systems and technologies for managing big data in Data Science applications. Currently, I am working on several projects.
Large Scale End-to-End Entity Resolution: Algorithms & Explanations
Entity matching/resolution, also known as duplicate record detection, is a fundamental problem in data integration and data cleaning. Given two (possibly identical) entity databases D1 and D2, the goal of ER is to determine for each entity pair r in D1, s in D2, whether they represent the same real-world object. When D1 and D2 are the same, the task is to identify duplicates within the same database. The problem has a very long research history and various types of methods have been proposed. While much progress has been made, existing approaches are still far from offering satisfactory results. Our preliminary study on widely used benchmark datasets shows that the F1-scores can be as low as 60% for some datasets, suggesting that there is much room for improvements.
More importantly, as in other DL-based schemes, it is unclear how to interpret the results - why are two unrelated entities wrongly labelled as a same entity, and why are two occurrences of an entity considered different entities? While there has been some recent works on explaining machine/deep learning models in general, there has not been much effort to interpret ER models.
See here for more information.
MAGE: Dynamic Graph Analytics On GPUs
In many real-world graph applications, data are frequently updated. For example, online social networks continuously evolve as new social interactions occur. To enable online analytics for data streaming applications, we are motivated to propose a dynamic graph analytic system on GPUs, which simultaneously processes a stream of graph update operations and analytical queries on the dynamically maintained matrix representation of the graph. The high update rate brought about numerous questions and challenges:
· How can we efficiently perform updates against the matrix stored on GPUs? Due to the Single Instruction Multiple Thread (SIMT) architecture of GPU devices, there are two keys in optimizing GPU-resident applications: 1) minimizing thread divergence; 2) maximizing coalesced memory access. It is non-trivial to design an update-efficient storage scheme on GPU, which also supports blazing fast computation. On one hand, rebuilding the entire storage structure against each update is both time and space inefficient. On the other hand, CPU-styled updates (e.g. insertion/deletion) of SM entries tend to scatter data in the GPU memory which leads to uncoalesced memory access and increases of the level of thread divergence for the computation.
· How to expand existing SM primitives to support a wider range of applications? Existing matrix primitives, e.g. SM vector multiplication, operate on the entire matrix. This is too rigid for many applications. For instance, in the case of community tracking on dynamic networks, it is rather inefficient to query the adjacency matrix as a whole since the community is only a subgraph of the entire social network. In another example, it is often of interest to monitor and analyze certain consumers' purchase history (e.g. Singaporean) from a large matrix representing all consumers' purchase history (e.g. each row of the matrix represents a consumer and each row entry denotes which item was bought by the consumer). Again, running the analysis on the entire matrix is undesirable as only a subset of the rows is required.
· How to optimize dynamic graph analytics on GPUs? Note that it does not suffice to optimize each matrix primitive on GPUs, as many analytic queries are submitted simultaneously on the dynamically updated matrix in various application scenarios. Analysts may track different communities in a dynamic social network. Likewise, marketers are keen to analyse real-time sales records of different consumer groups to deliver targeted advertising instantly. To fully exploit the computational capabilities of GPUs, it calls for novel strategies of scheduling online graph analytics processing on GPUs. In addition, to facilitate applications in larger scales, there is a need to support optimizations on multiple GPUs and CPU-GPU heterogeneous computing environments.
· How to optimize graph processing algorithms for dynamic graphs? Existing analytics algorithms like pagerank and subgraph matching have to be redesigned to fully exploit a CPU-GPU heterogeneous computing environment.
To address the aforementioned research challenges, this project will focus on developing novel storage structures and algorithmic optimization for supporting dynamic SM analytics on GPU-assisted platforms.
· Kian-Lee Tan, NUS
· Wentian Guo, NUS
· Mo Sha, NUS
· Yuchen Li, SMU
· GPU-Accelerated Subgraph Enumeration on Partitioned Graphs . W. Guo, Y. Li, M. Sha, B. He, X. Xiao, K.L. Tan, SIGMOD 2020, pp. 1067-1082.
· TopoX: Topology Refactorization for Efficient Graph Partitioning and Processing. D. Li, Y. Zhang, J. Wang, K.L. Tan, PVLDB, Vol. 12, No. 8, 2019, pp. 891-905.
· GPU-based Graph Traversal on Compressed Graphs. M. Sha, Y. Li, K.L. Tan, SIGMOD 2019, pp. 775-792.
· Parallel Personalized Pagerank on Dynamic Graphs W. Guo, Y. Li, M. Sha, K.L. Tan, PVLDB , Vol. 11, No. 1, 2017, pp. 93-106.
· Accelerating Dynamic Graph Analytics on GPUs, M. Sha, Y. Li, B. He, K.L. Tan, PVLDB, Vol. 11, No. 1, 2017, pp. 107-120.
SONAR: Social Network Analytics
Today’s social networking platforms (SNPs) offers a rich amount of information. This include not only users profiles but also the social relationships between users and the activities on SNPs. With such information, it is possible to extract user preferences and understand what/how information are shared between friends or followers. We have been investigating innovative solutions for social media marketing (SMM) in SNPs, including influence maximization, and different channels for SMM such as directed target advertising (i.e., advertisements are directly posted to match users' preferences), celebrity social advertising (i.e, advertisements are endorsed by celebrities) and self influential advertising (i.e., advertisements are propagated to wider audiences through the advertiser's friends or followers). Our recent work have considered a variety of other critical factors that lead to a variety of important variants of the problem:
· Stream Influence Maximization. Social inﬂuence is highly dynamic and the rate at which information is propagated between users can be altered drastically by breaking news and trending topics. This means the influencers are also dynamically changing. How to maintain a changing set of influencers continuously is a challenge.
· Topic-aware Influence Maximization. Users in a social network may have various interests (which can be represented by topics), and the inﬂuence strength between two users (i.e., edge) is often topic-dependent. For example, a user may be inﬂuenced by her friend in some topics (e.g., sports), while remaining neutral/unaffected in others (e.g., politics). It is therefore important to design solutions that are topic-aware, i.e., considers user-to-user inﬂuence strength (edges) is topic-dependent.
· Location-aware Influence Maximization. While two users may be socially close in the social network, they may not be physically near. As such, it does not make sense to send an advertisement that is in Singapore to a friend that is in US. Finding the right set of users who can influence users that are spatially and socially close is the right way to go.
Our research seeks to develop approximate schemes that are not only efficient for real-time applications, but also offer theoretical guarantees on performance.
· Kian-Lee Tan, NUS
· Yanhao Wang, NUS
· Yuchen Li, SMU
· Coresets for Minimum Enclosing Balls over Sliding Windows Y. Wang, Y. Li, K.L. Tan, KDD 2019: 314-323, 2019.
· Semantic and Influence aware k-Representative Queries over Social Streams Y. Wang, Y. Li, K.L. Tan, EDBT 2019: 181-192, 2019.
· Location-aware Influence Maximization over Dynamic Social Streams Y. Wang, Y. Li, J. Fan, K.L. Tan, ACM Transactions on Information Systems 36(4): 43:1-43:35, 2018.
· Efficient Representative Subset Selection over Sliding Windows Y. Wang, Y. Li, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, July 2019, IEEE CS, 31(7): 1327-1340.
· Influence Maximization on Social Graphs: An Survey Y. Li, J. Fan, Y. Wang, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, October 2018, IEEE CS, 30(10): 1852-1872.
· River: A Real-time Influence Monitoring System on Social Media Streams (Demo Paper) M. Sha, Y. Li, Y. Wang, W. Guo, K.L. Tan, Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM'18), Singapore, Nov 17-20, 2018.
· A Sliding-Window Framework for Representative Subset Selection (Poster Paper) Y. Wang, Y. Li, K.L. Tan, Proceedings of the 34th International Conference on Data Engineering (ICDE'18), CNAM, Paris, April 16-19, pp. 1268-1271, 2018.
· OCTOPUS: An Online Topic-Aware Influence Analysis System for Social Networks (Demo) J. Fan, J. Qiu, Y. Li, Q. Meng, D. Zhang, G. Li, K.L. Tan, X. Du, Proceedings of the 34th International Conference on Data Engineering (ICDE'18), CNAM, Paris, April 16-19, pp. 1569-1572, 2018.
· Parallel Discovering Your Selling Points: Personalized Social Influential Tag Exploration Y. Li, J. Fan, D. Zhang, K.L. Tan, 2017 International Conference on Management of Data (SIGMOD'2017), May 14 - May 19, 2017, Chicago, IL, USA, pp. 619-634.
· Real-Time Influence Maximization on Dynamic Social Streams Y. Wang, Q. Fan, Y. Li, Kian-Lee Tan, PVLDB (VLDB'2017), Vol. 10, No. 7, March 2017, pp. 805-816.
· Context-Aware Advertisement Recommendation for High-Speed Social News Feeding Y. Li, D. Zhang, Z. Lan, K.L. Tan, Proceedings of the 32th International Conference on Data Engineering (ICDE'16), Helsinki, Finland, May 16-20, 2016, pp. 505-516.
· Realtime Targeted Influence Maximization for Online Advertisements Y. Li, D. Zhang, K.L. Tan, PVLDB (VLDB'2015), Vol. 8, No. 10, 2015, pp. 1070-1081.
· Online Topic-Aware Influence Maximization S. Chen, J. Fan, G. Li, J. Feng, K.L. Tan, J. Tang, PVLDB (VLDB'2015), Vol. 8, No. 6, 2015, pp. 666-677.