 
Data Mining and Intelligent Data Analysis

Funded by:

Principal Investigators: Other Team Members: See another data mining page maintained by Database Lab  

1. Objectives
Data mining has been recognized internationally as an important technology for businesses. Locally, many companies in Singapore are interested in this technology, but few have made much progress. The key reasons are that the initial cost is high and expertise is not available. With this project, we will have a pool of experts (with more to be trained) readily available to help local industry. In the past month, we conducted a survey among some of the major local companies. The results show that many of the companies with large databases are interested in data mining and would like to see more research done at the National University of Singapore. Half of the companies are also willing to support and participate in this proposed project as collaborators. Thus, the main purposes of this project are:
    • To develop new data mining techniques, and to improve upon existing techniques.
    • To build data mining tools (both generic and industry-specific ones) that can be readily used by industry users.
    • To promote and to transfer data mining technology to local industry.
    • To establish ourselves as a center of expertise for data mining research and applications both locally and internationally.
2. Background Information
With the development of computer hardware and software and the rapid computerization of businesses, huge amounts of data have been collected and stored in databases, and the volume of stored data is growing at a phenomenal rate. As a result, traditional ad hoc mixtures of statistical techniques and data management tools are no longer adequate for analyzing this vast collection of data. Instead, researchers have begun to look for ways to intelligently assist humans in analyzing these mountains of data. Data mining (or knowledge discovery in databases, KDD for short) has recently emerged as a growing field of multidisciplinary research for this purpose. It combines research areas such as databases, machine learning, artificial intelligence, statistics, automated scientific discovery, data visualization, decision science, and high performance computing. While each of these areas contributes in its own way, data mining focuses on the value added by creatively combining the contributing areas to produce innovative solutions to the data analysis task.

Over the past few years, research and development in data mining has made great progress. A large number of research and application papers have appeared in the literature. Many successful applications have been reported in various sectors such as marketing, finance, banking, manufacturing, and telecommunications. Some examples of business applications include:
    • Using data mining techniques to analyze customer databases so that potential customers can be selected more precisely. BusinessWeek magazine estimated that more than 50% of all U.S. retailers use or plan to use this approach of database marketing. Those using the approach have obtained good results, e.g., American Express reported a 10-15% increase in credit card use.
    • Using data mining techniques for fraud detection, from detecting cellular cloning fraud to identifying financial transactions that may indicate money-laundering activities.
In short, data mining systems typically help businesses to expose previously unrecognized patterns in their databases. These information "nuggets" are used to improve profits, enhance customer service, and ultimately achieve a competitive advantage.

It is now recognized that mining information and knowledge from large databases and documents will be the next revolution in database systems. It is considered an important area for major cost savings and potential revenue, with immediate applications in business, decision systems, information management, communication, scientific research, and technology development. We can expect next-generation information systems to be more intelligent in that they are not only data intensive but also knowledge rich.

3. Proposed Research Programme
This proposed project consists of three main parts: basic and applied research, data mining tool developments, and seminars/workshops for local industry. These parts will not be pursued in isolation; each will complement the others. Below, we briefly describe the three parts.
3.1 Basic and applied research
In research, the group will investigate the following main projects simultaneously.
    • Data cleaning
Data from real-world sources are often erroneous, incomplete, and inconsistent as a result of operator error, system implementation flaws, and similar causes. Such low-quality data is not suitable for effective data mining. Data cleaning has been identified as an important problem, but little progress has been made thus far. In this project, we will study the issues related to data cleaning with the aim of developing an engineering approach that can be useful to the user. The project consists of three phases: (1) identify and categorize the possible errors in data from multiple sources; (2) survey the available and potentially usable techniques to address the problem; and (3) develop a system that can identify and resolve some of the errors.
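Phase (1), identifying and categorizing errors, can be illustrated with a minimal Python sketch. The record fields, the four error categories, and the validity rules below are assumptions chosen for illustration only; they are not the taxonomy the project will produce.

```python
from collections import defaultdict

# Hypothetical records merged from several sources; field names are illustrative.
RECORDS = [
    {"id": 1, "name": "Tan", "age": 34, "country": "Singapore"},
    {"id": 2, "name": "Lim", "age": None, "country": "SG"},
    {"id": 3, "name": "Lee", "age": -5, "country": "Singapore"},
    {"id": 1, "name": "Tan", "age": 34, "country": "Singapore"},
]

# Assumed canonical spelling; any other value is flagged as inconsistent.
KNOWN_COUNTRIES = {"Singapore"}


def categorize_errors(records):
    """Group record positions by the kind of error detected."""
    errors = defaultdict(list)
    seen_ids = set()
    for pos, rec in enumerate(records):
        # Incomplete: at least one field has no value.
        if any(value is None for value in rec.values()):
            errors["incomplete"].append(pos)
        # Erroneous: a value outside an assumed plausible range.
        age = rec.get("age")
        if age is not None and not (0 <= age <= 120):
            errors["erroneous"].append(pos)
        # Inconsistent: the same fact encoded differently across sources.
        if rec.get("country") not in KNOWN_COUNTRIES:
            errors["inconsistent"].append(pos)
        # Duplicate: an identifier that has already been seen.
        if rec["id"] in seen_ids:
            errors["duplicate"].append(pos)
        seen_ids.add(rec["id"])
    return dict(errors)


report = categorize_errors(RECORDS)
```

A report of this kind only locates suspect records; resolving them, as in phase (3), would need source-specific rules that this sketch does not attempt.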
    • Data mining in multiple databases
An organization commonly keeps many databases, each collected to serve a different purpose. Data mining in individual databases has attracted a lot of attention, and some encouraging results have been achieved. It is now time to consider how we can make use of all the databases in an organization for data mining. Many issues remain unresolved. Significant ones are: (1) Will one database help in the data mining of another database? (2) How can we consider multiple databases simultaneously for data mining? (3) Can current data mining techniques help in this new situation, or should we develop new techniques? (4) What is special about mining multiple databases?
    • Intelligent Web document management using data mining techniques
With the development of Internet/Web technology, the volume of Web documents is increasing dramatically. Effective management of these documents is becoming an important issue. The objective of this project is to build an intelligent system that can help webmasters manage Web documents so that they can serve users better. We will first survey the available and potentially useful techniques for discovering access patterns of Web documents stored in an information provider's Web server. The major issues include establishing measurements and heuristics on user access patterns and developing techniques to discover and maintain such patterns. The results will then be extended to use the discovered user access patterns to manage Web documents so that information subscribers can access information of interest more efficiently. Techniques to be investigated include clustering of Web documents, pre-fetching and caching, and customized linkage of Web documents.
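One simple way to picture how access patterns could drive pre-fetching is sketched below in Python. It counts page-to-page transitions within sessions and suggests the most frequent follower of a page. The session data, the function names, and the `min_support` threshold are all hypothetical; the project's actual measurements and heuristics are not specified in this description.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: the ordered page requests of individual visitors.
SESSIONS = [
    ["/index", "/products", "/download"],
    ["/index", "/products", "/contact"],
    ["/index", "/papers"],
    ["/products", "/download"],
]


def transition_counts(sessions):
    """Count how often each page is requested directly after another."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def prefetch_candidate(counts, page, min_support=2):
    """Return the most frequent follower of `page`, or None if it was
    seen fewer than `min_support` times (an assumed cut-off)."""
    if page not in counts:
        return None
    following, seen = counts[page].most_common(1)[0]
    return following if seen >= min_support else None


counts = transition_counts(SESSIONS)
suggestion = prefetch_candidate(counts, "/index")
```

With the sample sessions, "/products" follows "/index" twice, so it is the page a server might choose to pre-fetch or cache for visitors who have just requested "/index".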
    • Data mining with neural networks
While neural networks are commonly used for pattern classification in practice, they have not been widely applied in data mining. The reason is that the decision process of a neural network is not easily explained in terms of rules that human experts can verify. In the past two years, we have investigated the problem of extracting rules from trained neural networks. Our results have been encouraging. The rules extracted by our algorithms are not only more concise than those generated from decision trees, but are, in general, more accurate. Our algorithms can extract rules of the following forms:
        • Symbolic rules, e.g. if (married = yes) and (sex = male), then .....
        • M-of-N rules, e.g. if 2 of the 3 conditions {number of children is not more than 3, married more than 10 years, owns private property} are satisfied, then ....
        • Oblique rules, e.g. if (monthly salary - 1.5*monthly mortgage), then .....
In this project, we plan to implement these algorithms on a 32-processor Fujitsu AP3000 parallel computer to speed up the training process.
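The three rule forms listed above can be written as small Python predicates to show how they differ. This is only an illustration of the rule shapes, not of the extraction algorithms. The field names are hypothetical, and because the oblique example above does not state a comparison, the "greater than a threshold" test used here is an assumption.

```python
def symbolic_rule(person):
    """Symbolic rule: a conjunction of tests on individual attributes."""
    return bool(person["married"] and person["sex"] == "male")


def m_of_n_rule(person, m=2):
    """M-of-N rule: fires when at least m of the n conditions hold."""
    conditions = [
        person["children"] <= 3,
        person["years_married"] > 10,
        person["owns_private_property"],
    ]
    return sum(conditions) >= m


def oblique_rule(person, threshold=0.0):
    """Oblique rule: a test on a linear combination of attributes.

    The comparison against `threshold` is assumed for illustration;
    the original example leaves it unstated.
    """
    value = person["monthly_salary"] - 1.5 * person["monthly_mortgage"]
    return value > threshold


# Hypothetical applicant record.
applicant = {
    "married": True,
    "sex": "male",
    "children": 2,
    "years_married": 5,
    "owns_private_property": True,
    "monthly_salary": 4000,
    "monthly_mortgage": 2000,
}
results = (symbolic_rule(applicant), m_of_n_rule(applicant), oblique_rule(applicant))
```

Symbolic rules test one attribute at a time, M-of-N rules tolerate some failed conditions, and oblique rules combine several numeric attributes in a single test, which is why one oblique rule can replace many axis-parallel ones.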
    • Data mining in semistructured data
As the amount of data available on-line grows rapidly, more and more data is semistructured and hierarchical, i.e., the data has no absolute schema fixed in advance, and its structure may be irregular or incomplete. Semistructured data arises when the source does not impose a rigid structure and when the data is obtained by combining several heterogeneous data sources. An example of a semistructured, hierarchical data source is the Web. Other examples of semistructured data include the results of integrating heterogeneous data sources, BibTeX files, genome databases, drug and chemical structures, libraries of programs, and, more generally, digital libraries and on-line documentation. Since there is no fixed schema in semistructured data, conventional data mining techniques that work with a feature vector representation are not applicable. The goal of this project is to develop a general framework for mining associative patterns from semistructured and hierarchical data.
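To make the absence of a fixed schema concrete, the Python sketch below treats each document as a nested structure, enumerates the label paths it contains, and keeps the paths that occur in at least `min_support` documents. Counting frequent label paths is a deliberately simplified stand-in for associative pattern mining, not the framework this project aims to develop, and the sample documents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical bibliography-like documents with irregular structure.
DOCS = [
    {"author": {"name": "A", "affiliation": "X"}, "title": "T1", "year": 1997},
    {"author": {"name": "B"}, "title": "T2"},
    {"editor": {"name": "C"}, "title": "T3", "year": 1998},
]


def label_paths(node, prefix=()):
    """Return the set of label paths (tuples of keys) found under `node`."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # List elements share the path of the list itself.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


def frequent_paths(docs, min_support):
    """Keep the label paths that appear in at least `min_support` documents."""
    support = Counter()
    for doc in docs:
        # Each path is counted once per document, however often it recurs.
        support.update(label_paths(doc))
    return {path: n for path, n in support.items() if n >= min_support}


common = frequent_paths(DOCS, min_support=2)
```

No document has to declare its fields in advance: the structure that the documents share emerges from the counts, which is the property a feature vector representation cannot offer.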
    • Subjective interestingness of the discovered patterns
Current research in data mining mainly focuses on discovery algorithms and visualization techniques. There is a growing awareness that, in practice, it is easy to discover a huge number of patterns in a database, most of which are obvious, redundant, useless, or uninteresting to the user. To prevent the user from being overwhelmed by a large number of uninteresting patterns, techniques are needed to identify only the useful or interesting patterns and present them to the user. We have been working on this problem for the past two years. In this project, we will consolidate our findings and carry the research further to produce more effective techniques to discover or identify interesting patterns.

3.2 Data mining tool developments
In parallel with our research effort, a group of programmers will develop a set of data mining tools. The objective of this part of the programme is to put research into practice, i.e., to build data mining systems that can be readily used by industry users.

Please direct queries and bug reports via e-mail: dm2@comp.nus.edu.sg
School of Computing, National University of Singapore