The GENIE and LAMP project aims to provide a systematic investigation into the use of semi-lazy learning for predictive analytics. GENIE (GENeric InvErted index) is a unified platform to support the storage and retrieval of Big Data with various types of structure. LAMP (semi-LAzy Mining Paradigm) is a new data mining paradigm for predictive analytics, which essentially follows the lazy learning paradigm until the last step, where more expensive eager learning methods are applied on a set of KNNs. Specifically, GENIE will efficiently support generic k-Nearest Neighbours (k-NN) search, which will be utilized by LAMP to empower users with advanced predictive capabilities.
Figure 1 shows the overall framework of GENIE and LAMP. As can be seen, although LAMP and GENIE can logically be viewed as separate concepts, integrating them into a single unit brings a synergy that makes them more powerful as a predictive analytics engine. In the integrated scenario, GENIE can support LAMP by searching for KNNs over a large variety of complex and high-dimensional data, while LAMP, aside from being more efficient, can aggregate predictions and models to make more complex inferences.
1 New platforms have been extensively investigated to handle the first two Vs of big data, volume and velocity. Studies into the third V of big data, "variety", are, however, comparatively rare. Many of the big data domains generate a large amount of data with structures (e.g. sequences, trees and graphs) that need to be processed while preserving semantics. It is desirable to develop a unified framework that can be flexibly customized to handle predictive analytics on data with different structures.
2 Specifically, GENIE will efficiently support generic k-Nearest Neighbours (k-NN) search. Since LAMP requires a search for KNNs to make a prediction for a query object, finding KNNs becomes a fundamental operation to support LAMP on Big Data with a large variety of complex structures.
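To make the role of the inverted index concrete, here is a minimal sketch (not GENIE's actual implementation) of k-NN search over an inverted index: each object is represented by a set of discrete features, similarity is the number of shared features, and candidates are ranked by scanning only the query's posting lists. All names are illustrative.

```python
# Hypothetical sketch: inverted-index-based k-NN by match counting.
from collections import defaultdict
import heapq

def build_index(objects):
    """objects: dict mapping object id -> iterable of discrete features."""
    index = defaultdict(list)
    for oid, feats in objects.items():
        for f in set(feats):
            index[f].append(oid)   # posting list: feature -> object ids
    return index

def knn_search(index, query_feats, k):
    """Rank objects by how many features they share with the query."""
    counts = defaultdict(int)
    for f in set(query_feats):
        for oid in index.get(f, ()):
            counts[oid] += 1       # one vote per shared feature
    # Top-k objects by match count.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```

Only objects that share at least one feature with the query are ever touched, which is what makes the inverted-index formulation attractive at scale; a real engine would add compressed posting lists and parallel (e.g. GPU) counting.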
3 GENIE should have the following characteristics:
1 Data mining algorithms can be divided into two categories: eager learning (e.g. SVM, Neural Network) and lazy learning (e.g. nearest neighbours classification). Eager learning algorithms have high training cost and suffer information loss, since they compute and commit to a model before even seeing the prediction query. Lazy learning suffers from its simplistic prediction method, although it can capture a very rich set of models and hypotheses from the data. Developing methods that have the strengths but not the weaknesses of both approaches is highly desirable.
2 LAMP essentially follows the lazy learning paradigm until the last step, where more expensive eager learning methods are applied on a set of KNNs. We argue that modern hardware will allow these eager learning algorithms to run efficiently on small datasets for real-time prediction.
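The semi-lazy idea above can be sketched in a few lines: stay lazy until query time, then fit a small eager model on the query's k nearest neighbours only. This is an illustrative sketch, not the project's implementation; the brute-force `knn` stands in for GENIE, and the majority-vote learner stands in for any eager method (SVM, neural network, etc.).

```python
# Hypothetical sketch of the semi-lazy mining paradigm (LAMP).
from collections import Counter
import math

def knn(data, query, k):
    """Brute-force k-NN by Euclidean distance (GENIE would do this at scale).
    data: list of (point, label) pairs."""
    return sorted(data, key=lambda x: math.dist(x[0], query))[:k]

def semi_lazy_predict(data, query, k, fit_eager):
    """LAMP: retrieve the k-NNs of the query, then train an eager
    model on just that small neighbourhood and predict with it."""
    neighbours = knn(data, query, k)
    model = fit_eager(neighbours)   # eager learning, but on a tiny set
    return model(query)

def fit_majority(neighbours):
    """A trivial 'eager' learner: majority vote over neighbour labels."""
    label = Counter(y for _, y in neighbours).most_common(1)[0][0]
    return lambda q: label
```

Because the eager model is trained per query on only k points, its training cost is small and bounded, which is the argument for real-time prediction on modern hardware.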
3 The approach of LAMP goes as follows:
Anthony K. H. Tung, National University of Singapore
Chen Gang, Zhejiang University
H. V. Jagadish, University of Michigan, Ann Arbor
Shou Lidan, Zhejiang University
Wu Sai, Zhejiang University
Zheng Yuxin, National University of Singapore
Guo Qi, National University of Singapore
Liu Qi, National University of Singapore
Luo Pingyi, National University of Singapore
Hu Wenyan, National University of Singapore
Luan Wenhao, National University of Singapore
Tang Yixuan, National University of Singapore
Yang Yueji, Wuhan University
Zhou Shaowen, National University of Singapore
Zhu Lei, National University of Singapore
Bao Zhifeng, Royal Melbourne Institute of Technology
Wang Xiaoli, Xiamen University
Zhou Jingbo, Baidu