My research is in developing database systems and technologies for managing big data in Data Science applications. Currently, I am working on several projects.
DBMS on Persistent (Non-Volatile) Memory
In the last decade, growing demand for fast, power efficient and cost effective memory solutions have prompted major players like Micron, Intel, Samsung to invest heavily in their R&D for advanced memory technologies. In particular, there has been much anticipation for a next generation of non-volatile memory (NVM) that are byte addressable, persistent, high storage capacity and has near-DRAM read/write latency. This has also prompted forward-looking (software) researchers to develop novel programming models, memory architectures, data structures, query processing algorithms, recovery techniques, file systems and databases that exploit these features and characteristics to optimize the performance of such NVM technologies. However, as NVM hardware was not available yet, the design decisions made in these works were largely based on certain anticipated behaviors of NVMs and evaluated on emulators/simulators. For example, NVM is expected to offer read latency comparable to DRAM but much higher write latency and to have limited write endurance. These led to algorithms that minimize writes to NVM. As another example, the assumption that NVM has DRAM-like cacheline access granularity and near-DRAM read/write bandwidth facilitated the design of solutions which are (supposedly) scalable.
In 2019, non-volatile memory DIMMs are finally commercially available with the release of the Intel® Optane™ DC Persistent Memory Module (or just “Optane DC PMM” or “Optane NVDIMM”). Since then, there has been a proliferation of research studies that evaluate this technology as well as re-evaluate prior art on the real Optane DC PMM. The results of these studies showed that some of the earlier assumptions on NVM are not applicable to Optane DC PMM. For example, in Optane DC PMM, data are accessed in blocks of 256 bytes. This is in contrast to the DRAM where the cacheline access granularity is 64 bytes. This translates to a higher read latency for Optane DC PMM than DRAM. Moreover, it turns out the write latency can be as good as DRAM write latency as Optane DC PMM guarantees that data in the Optane Controller Buffer will be persisted on NVM. As another example, the bandwidth of Optane DC PMM for both read and write is limited which leads to poorer scalability than expected. In addition, it turns out that the limited write endurance may not be a critical issue for software developers to fix since hardware-layer wear-levelling is an inlined technique in Optane DC PMM.
In this project, we study how persistent memory impact the design of DBMS. We will focus particularly on the App Direct Mode of the Optane DC PMM which supports in-memory persistence. While today’s software operates under the assumption that data can only be persistent if it is written to slow storage (SSDs, HDDs, the cloud, etc.) this mode allows data to persist at memory speeds. As a start, we will build a full-fledged in-NVM DBMS by integrating techniques that have been developed in isolation (e.g., B+-tree, sorting and joining algorithms, storage engine).
See here for more information.
Large Scale End-to-End Entity Resolution: Algorithms & Explanations
Entity matching/resolution, also known as duplicate record detection, is a fundamental problem in data integration and data cleaning. Given two (possibly identical) entity databases D1 and D2, the goal of ER is to determine for each entity pair r in D1, s in D2, whether they represent the same real-world object. When D1 and D2 are the same, the task is to identify duplicates within the same database. The problem has a very long research history and various types of methods have been proposed. While much progress has been made, existing approaches are still far from offering satisfactory results. Our preliminary study on widely used benchmark datasets shows that the F1-scores can be as low as 60% for some datasets, suggesting that there is much room for improvements.
More importantly, as in other DL-based schemes, it is unclear how to interpret the results - why are two unrelated entities wrongly labelled as a same entity, and why are two occurrences of an entity considered different entities? While there has been some recent works on explaining machine/deep learning models in general, there has not been much effort to interpret ER models.
See here for more information.
MAGE: Dynamic Graph Analytics On GPUs
In many real-world graph applications, data are frequently updated. For example, online social networks continuously evolve as new social interactions occur. To enable online analytics for data streaming applications, we are motivated to propose a dynamic graph analytic system on GPUs, which simultaneously processes a stream of graph update operations and analytical queries on the dynamically maintained matrix representation of the graph. The high update rate brought about numerous questions and challenges:
· How can we efficiently perform updates against the matrix stored on GPUs? Due to the Single Instruction Multiple Thread (SIMT) architecture of GPU devices, there are two keys in optimizing GPU-resident applications: 1) minimizing thread divergence; 2) maximizing coalesced memory access. It is non-trivial to design an update-efficient storage scheme on GPU, which also supports blazing fast computation. On one hand, rebuilding the entire storage structure against each update is both time and space inefficient. On the other hand, CPU-styled updates (e.g. insertion/deletion) of SM entries tend to scatter data in the GPU memory which leads to uncoalesced memory access and increases of the level of thread divergence for the computation.
· How to expand existing SM primitives to support a wider range of applications? Existing matrix primitives, e.g. SM vector multiplication, operate on the entire matrix. This is too rigid for many applications. For instance, in the case of community tracking on dynamic networks, it is rather inefficient to query the adjacency matrix as a whole since the community is only a subgraph of the entire social network. In another example, it is often of interest to monitor and analyze certain consumers' purchase history (e.g. Singaporean) from a large matrix representing all consumers' purchase history (e.g. each row of the matrix represents a consumer and each row entry denotes which item was bought by the consumer). Again, running the analysis on the entire matrix is undesirable as only a subset of the rows is required.
· How to optimize dynamic graph analytics on GPUs? Note that it does not suffice to optimize each matrix primitive on GPUs, as many analytic queries are submitted simultaneously on the dynamically updated matrix in various application scenarios. Analysts may track different communities in a dynamic social network. Likewise, marketers are keen to analyse real-time sales records of different consumer groups to deliver targeted advertising instantly. To fully exploit the computational capabilities of GPUs, it calls for novel strategies of scheduling online graph analytics processing on GPUs. In addition, to facilitate applications in larger scales, there is a need to support optimizations on multiple GPUs and CPU-GPU heterogeneous computing environments.
· How to optimize graph processing algorithms for dynamic graphs? Existing analytics algorithms like pagerank and subgraph matching have to be redesigned to fully exploit a CPU-GPU heterogeneous computing environment.
To address the aforementioned research challenges, this project will focus on developing novel storage structures and algorithmic optimization for supporting dynamic SM analytics on GPU-assisted platforms.
· Kian-Lee Tan, NUS
· Wentian Guo, NUS
· Mo Sha, NUS
· Yuchen Li, SMU
· Enhancing Balanced Graph Edge Partition with Effective Local Search . Z. Guo, M. Xiao, Y. Zhou, D. Zhang, K.L. Tan, 35th AAAI Conference on Artifical Intelligence (AAAI'2021), Vancouver Convention Centre, Vancouver, Canada, 2-9 Feb 2021.
· Exploiting Reuse for GPU Subgraph Enumeration . W. Guo, Y. Li, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, accepted Oct 2020.
· GPU-Accelerated Subgraph Enumeration on Partitioned Graphs . W. Guo, Y. Li, M. Sha, B. He, X. Xiao, K.L. Tan, SIGMOD 2020, pp. 1067-1082.
· TopoX: Topology Refactorization for Efficient Graph Partitioning and Processing. D. Li, Y. Zhang, J. Wang, K.L. Tan, PVLDB, Vol. 12, No. 8, 2019, pp. 891-905. (An extended version appeared in IEEE/ACM Transactions on Networking, Vol. 28, No. 6, pp. 2768-2782, December 2020.)
· GPU-based Graph Traversal on Compressed Graphs. M. Sha, Y. Li, K.L. Tan, SIGMOD 2019, pp. 775-792.
· Parallel Personalized Pagerank on Dynamic Graphs W. Guo, Y. Li, M. Sha, K.L. Tan, PVLDB , Vol. 11, No. 1, 2017, pp. 93-106.
· Accelerating Dynamic Graph Analytics on GPUs, M. Sha, Y. Li, B. He, K.L. Tan, PVLDB, Vol. 11, No. 1, 2017, pp. 107-120.
SONAR: Social Network Analytics
Today’s social networking platforms (SNPs) offers a rich amount of information. This include not only users profiles but also the social relationships between users and the activities on SNPs. With such information, it is possible to extract user preferences and understand what/how information are shared between friends or followers. We have been investigating innovative solutions for social media marketing (SMM) in SNPs, including influence maximization, and different channels for SMM such as directed target advertising (i.e., advertisements are directly posted to match users' preferences), celebrity social advertising (i.e, advertisements are endorsed by celebrities) and self influential advertising (i.e., advertisements are propagated to wider audiences through the advertiser's friends or followers). Our recent work have considered a variety of other critical factors that lead to a variety of important variants of the problem:
· Stream Influence Maximization. Social inﬂuence is highly dynamic and the rate at which information is propagated between users can be altered drastically by breaking news and trending topics. This means the influencers are also dynamically changing. How to maintain a changing set of influencers continuously is a challenge.
· Topic-aware Influence Maximization. Users in a social network may have various interests (which can be represented by topics), and the inﬂuence strength between two users (i.e., edge) is often topic-dependent. For example, a user may be inﬂuenced by her friend in some topics (e.g., sports), while remaining neutral/unaffected in others (e.g., politics). It is therefore important to design solutions that are topic-aware, i.e., considers user-to-user inﬂuence strength (edges) is topic-dependent.
· Location-aware Influence Maximization. While two users may be socially close in the social network, they may not be physically near. As such, it does not make sense to send an advertisement that is in Singapore to a friend that is in US. Finding the right set of users who can influence users that are spatially and socially close is the right way to go.
Our research seeks to develop approximate schemes that are not only efficient for real-time applications, but also offer theoretical guarantees on performance.
· Kian-Lee Tan, NUS
· Yanhao Wang, NUS
· Yuchen Li, SMU
· Coresets for Minimum Enclosing Balls over Sliding Windows Y. Wang, Y. Li, K.L. Tan, KDD 2019: 314-323, 2019.
· Semantic and Influence aware k-Representative Queries over Social Streams Y. Wang, Y. Li, K.L. Tan, EDBT 2019: 181-192, 2019.
· Location-aware Influence Maximization over Dynamic Social Streams Y. Wang, Y. Li, J. Fan, K.L. Tan, ACM Transactions on Information Systems 36(4): 43:1-43:35, 2018.
· Efficient Representative Subset Selection over Sliding Windows Y. Wang, Y. Li, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, July 2019, IEEE CS, 31(7): 1327-1340.
· Influence Maximization on Social Graphs: An Survey Y. Li, J. Fan, Y. Wang, K.L. Tan, IEEE Transactions on Knowledge and Data Engineering, October 2018, IEEE CS, 30(10): 1852-1872.
· River: A Real-time Influence Monitoring System on Social Media Streams (Demo Paper) M. Sha, Y. Li, Y. Wang, W. Guo, K.L. Tan, Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM'18), Singapore, Nov 17-20, 2018.
· A Sliding-Window Framework for Representative Subset Selection (Poster Paper) Y. Wang, Y. Li, K.L. Tan, Proceedings of the 34th International Conference on Data Engineering (ICDE'18), CNAM, Paris, April 16-19, pp. 1268-1271, 2018.
· OCTOPUS: An Online Topic-Aware Influence Analysis System for Social Networks (Demo) J. Fan, J. Qiu, Y. Li, Q. Meng, D. Zhang, G. Li, K.L. Tan, X. Du, Proceedings of the 34th International Conference on Data Engineering (ICDE'18), CNAM, Paris, April 16-19, pp. 1569-1572, 2018.
· Parallel Discovering Your Selling Points: Personalized Social Influential Tag Exploration Y. Li, J. Fan, D. Zhang, K.L. Tan, 2017 International Conference on Management of Data (SIGMOD'2017), May 14 - May 19, 2017, Chicago, IL, USA, pp. 619-634.
· Real-Time Influence Maximization on Dynamic Social Streams Y. Wang, Q. Fan, Y. Li, Kian-Lee Tan, PVLDB (VLDB'2017), Vol. 10, No. 7, March 2017, pp. 805-816.
· Context-Aware Advertisement Recommendation for High-Speed Social News Feeding Y. Li, D. Zhang, Z. Lan, K.L. Tan, Proceedings of the 32th International Conference on Data Engineering (ICDE'16), Helsinki, Finland, May 16-20, 2016, pp. 505-516.
· Realtime Targeted Influence Maximization for Online Advertisements Y. Li, D. Zhang, K.L. Tan, PVLDB (VLDB'2015), Vol. 8, No. 10, 2015, pp. 1070-1081.
· Online Topic-Aware Influence Maximization S. Chen, J. Fan, G. Li, J. Feng, K.L. Tan, J. Tang, PVLDB (VLDB'2015), Vol. 8, No. 6, 2015, pp. 666-677.