Funded by:
Principal Investigators:
Other Team Members:
See another data mining page maintained by Database Lab
Data mining has been recognized as an important technology for businesses
internationally. Locally, there are many companies in Singapore that
are interested in this technology. Few, however, have made much progress.
One key reason is that the initial cost is high and no expertise is
available. With this project, we will have a pool of experts (with more
to be trained) readily available to help the local industry. In the
past month, we conducted a survey among some of our major local companies.
The survey results show that many of the companies with large databases
are interested in data mining and would like to see more research done
at the National University of Singapore. Half of the companies are also
willing to support and to participate in this proposed project as collaborators.
Thus, the main purposes of this project are:
- To develop new data mining techniques, and to improve upon existing
techniques.
- To build data mining tools (both generic and industry-specific
ones) that can be readily used by industry users.
- To promote and to transfer data mining technology to local industry.
- To establish ourselves as a center of expertise for data mining
research and applications both locally and internationally.
2. Background Information
With the development of computer hardware and software and the rapid
computerization of businesses, huge amounts of data have been collected
and stored in databases, and the volume of stored data is growing
at a phenomenal rate. As a result, traditional ad hoc mixtures of statistical
techniques and data management tools are no longer adequate for analyzing
this vast collection of data. Instead, researchers have begun to look for
ways to intelligently assist humans in analyzing these mountains of
data. Data mining (or knowledge discovery in databases, KDD for short)
has recently emerged as a growing field of multidisciplinary research
for this purpose. It combines research areas such as databases, machine
learning, artificial intelligence, statistics, automated scientific
discovery, data visualization, decision science, and high performance
computing. While each of these areas contributes in its specific ways,
data mining focuses on the value that is added by creative combination
of the contributing areas in order to produce innovative solutions to
the data analysis task.
Over the past few years, research and development in data mining
has made great progress. A large number of research and application
papers have appeared in the literature. Many successful applications
have been reported in various sectors such as marketing, finance,
banking, manufacturing, and telecommunications. Some examples of business
applications include:
- Using data mining techniques to analyze customer databases so that
potential customers can be selected more precisely. BusinessWeek
magazine estimated that more than 50% of all U.S. retailers use or plan
to use such an approach to database marketing. Those using the approach
have obtained good results, e.g., American Express reported a 10-15%
increase in credit card use.
- Using data mining techniques for fraud detection, from detecting
cellular cloning fraud to identifying financial transactions that may
indicate money-laundering activities.
In short, data mining systems typically help businesses
to expose previously unrecognized patterns in their databases. These
information "nuggets" are used to improve profits, enhance customer
service, and ultimately achieve a competitive advantage.
It has now been recognized that mining for information and knowledge
from large databases and documents will be the next revolution in
database systems. It is considered an important area for major cost
savings and potential revenue with immediate applications in business,
decision systems, information management, communication, scientific
research and technology development. We can expect the next generation
information systems to be more intelligent in that they are not only
data intensive but also knowledge rich.
3. Proposed Research Programme
This proposed project consists of three main parts: basic and applied
research, data mining tool developments, and seminars/workshops for
local industry. Each of these parts will not be worked on in isolation.
Instead each part will complement the others. Below, we briefly describe
the three parts.
3.1 Basic and applied research
In research, the group will investigate the following main projects
simultaneously.
- Data cleaning
Data from real-world sources are often erroneous, incomplete,
and inconsistent, which can result from operator error, system implementation
flaws, etc. Such low-quality data are not suitable for effective
data mining. Data cleaning has been identified as an important problem.
However, little progress has been made thus far. In this project,
we will study the issues related to data cleaning with the aim of
developing an engineering approach that can be useful to the user.
The project consists of three phases: (1) identify and categorize
the possible errors in data from multiple sources; (2) survey the
available and potentially usable techniques to address the problem;
and (3) develop a system that can identify and resolve some of the
errors.
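As a rough illustration of phase (1), the sketch below flags two simple error categories in records merged from several sources: missing required fields, and records that share a key but disagree on a field. The function name, field names, and the normalization used (trimming and lower-casing) are assumptions made for illustration only; they are not part of the proposed system.

```python
from collections import defaultdict

def find_data_errors(records, key, required):
    """Flag simple errors in a list of dict records (illustrative only).

    Returns (missing, conflicts):
      missing   -- list of (record index, field) where a required field is
                   absent or empty
      conflicts -- {key value: {field: sorted list of disagreeing values}}
    """
    missing = []
    seen = defaultdict(lambda: defaultdict(set))
    for i, rec in enumerate(records):
        for field in required:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                missing.append((i, field))
            else:
                # Assumed normalization: ignore case and surrounding spaces.
                seen[rec.get(key)][field].add(str(value).strip().lower())
    conflicts = {}
    for k, fields in seen.items():
        bad = {f: sorted(v) for f, v in fields.items() if len(v) > 1}
        if bad:
            conflicts[k] = bad
    return missing, conflicts
```

Resolving the flagged errors (phase 3) would need domain knowledge that this sketch does not attempt to model.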
- Data mining in multiple databases
It is common that many databases are kept in an organization.
They are collected to serve different purposes. Data mining in individual
databases has attracted a lot of attention. Some encouraging results
have been achieved. It is time now to consider how we can make use
of all the databases in an organization for data mining. Many issues
remain unresolved. Significant ones are: (1) will one database help
in the data mining of another database? (2) how can we consider
multiple databases simultaneously for data mining? (3) can current
data mining techniques help in this new situation or should we develop
new techniques? (4) what is so special about mining of multiple
databases?
- Intelligent Web document management using data mining techniques
With the development of Internet/Web technology, the volume of
web documents increases dramatically. Effective management of the
documents is becoming an important issue. The objective of this
project is to build an intelligent system that can help Web Masters
manage Web documents so that they can serve the users better. We
will first survey the available and potentially useful techniques
for discovering access patterns of Web documents stored in an information
provider’s web server. The major issues include establishing measurements
and heuristics on user access patterns and developing techniques
to discover and maintain such patterns. The results will
then be extended to use the discovered user access
patterns to manage Web documents so that information subscribers
can access information of interest more efficiently. Techniques
to be investigated include clustering of web documents, pre-fetching
and caching, and customized linkage of Web documents.
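One simple way to picture how access patterns could drive pre-fetching is sketched below. It assumes the server log has already been split into per-user sessions (ordered lists of page paths), counts which page tends to follow which, and proposes pre-fetch candidates above a confidence threshold. The function names, the session format, and the threshold are hypothetical choices for this sketch, not the measurements or heuristics the project will establish.

```python
from collections import Counter, defaultdict

def next_page_counts(sessions):
    """Count, for each page, how often each other page is requested next."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts

def prefetch_candidates(counts, page, min_confidence=0.5):
    """Pages that followed `page` in at least `min_confidence` of its visits."""
    followers = counts.get(page, Counter())
    total = sum(followers.values())
    if total == 0:
        return []
    return [nxt for nxt, n in followers.most_common()
            if n / total >= min_confidence]
```

A real system would also have to age out stale patterns as the site changes, which corresponds to the maintenance issue mentioned above.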
- Data mining with neural networks
While the use of neural networks for pattern classification has
been common in practice, neural networks have not been widely applied
in data mining applications. The reason is that the decision process of
a neural network is not easily explainable in terms of rules that human experts
can verify. In the past two years, we have investigated the problem
of extracting rules from trained neural networks. Our results have
been encouraging. The rules extracted by our algorithms are not
only more concise than those generated from decision trees, but
are, in general, more accurate. Our algorithms can extract rules
of the following forms:
- Symbolic rules, e.g. if (married = yes) and (sex =
male), then .....
- M-of-N rules, e.g. if 2 of the 3 conditions
{number of children is not more than 3, married more than 10
years, owns private property} are satisfied, then ....
- Oblique rules, e.g. if (monthly salary -
1.5*monthly mortgage), then .....
In this project, we plan to implement these algorithms on a 32-processor
processor Fujitsu AP3000 parallel computer to speed up the training
process.
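To make the three rule forms concrete, the sketch below writes each one as a predicate over a single record. It only shows what the rule conditions look like; it is not the extraction algorithm. The field names are hypothetical, the M-of-N rule is read as "at least M of the N conditions hold", and because the example oblique condition above is given without a comparison, the "> threshold" test and its default value are assumptions.

```python
def symbolic_rule(rec):
    """Symbolic rule: a conjunction of attribute tests."""
    return rec["married"] == "yes" and rec["sex"] == "male"

def m_of_n_rule(rec, m=2):
    """M-of-N rule: fires when at least m of the listed conditions hold."""
    conditions = [
        rec["children"] <= 3,
        rec["years_married"] > 10,
        rec["owns_private_property"],
    ]
    return sum(bool(c) for c in conditions) >= m

def oblique_rule(rec, threshold=0.0):
    """Oblique rule: tests a linear combination of numeric attributes.

    The comparison against `threshold` is an assumption of this sketch.
    """
    return rec["monthly_salary"] - 1.5 * rec["monthly_mortgage"] > threshold
```

An oblique rule can express in one condition a boundary that axis-parallel tests, such as those in a decision tree, would need several splits to approximate.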
- Data mining in semistructured data
As the amount of data available on-line grows rapidly, more
and more data is semistructured and hierarchical, i.e., the data
has no absolute schema fixed in advance, and its structure may
be irregular or incomplete. Semistructured data arises when the
source does not impose a rigid structure and when the data is
obtained by combining several heterogeneous data sources. An example
of a semistructured, hierarchical data source is the Web. Other
examples of semistructured data include: the result of integrating
heterogeneous data sources, BibTeX files, genome databases, drug
and chemical structures, libraries of programs, and more generally,
digital libraries, and on-line documentation. Since there is no
fixed schema in semistructured data, conventional data mining
techniques that work with a feature-vector representation are not
directly applicable. The goal of this project is to develop a general
framework of mining associative patterns from semistructured and
hierarchical data.
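The difficulty with schema-less data can be illustrated with a deliberately simplified stand-in for associative pattern mining: each nested record is reduced to the set of label paths it contains, and pairs of paths that co-occur in at least a minimum number of records are reported. This is plain co-occurrence counting under assumed inputs (nested Python dicts and lists), not the general framework the project aims to develop; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def label_paths(node, prefix=""):
    """Collect every label path (e.g. '/author/name') in a nested record."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = f"{prefix}/{label}"
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # List elements share the path of the list itself.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths

def frequent_path_pairs(documents, min_support):
    """Pairs of label paths that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(label_paths(doc)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}
```

Because records need not share the same fields, the set of paths differs from record to record, which is exactly why a fixed feature vector does not fit this kind of data.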
- Subjective interestingness of the discovered patterns
Current research in data mining mainly focuses on the discovery algorithms
and visualization techniques. There is a growing awareness that, in
practice, it is easy to discover a huge number of patterns in a database,
most of which are obvious, redundant, or otherwise uninteresting
to the user. To prevent the user from being overwhelmed
by a large number of uninteresting patterns, techniques are needed to
identify only the useful/interesting patterns and present them to the
user. We have been working on this problem for the past two years. In
this project, we will consolidate our findings and carry the research
further to produce more effective techniques to discover or to identify
interesting patterns.
3.2 Data mining tool developments
In parallel with our research effort, a group of programmers will
develop a set of data mining tools. The objective of this part of the
programme is to put research into practice, i.e., building data mining
systems that can be readily used by industry users.