Platform
Most search structures bound the search cost by a logarithm of the size of the search space: for a system with N nodes, the search cost is O(log N). We proposed BATON* to achieve a search cost of O(log_m N), where m is the fanout of the tree. A larger logarithm base not only lowers the search cost, but also improves fault tolerance and facilitates more efficient and effective load balancing. As a first step, we intend to design a new structure based on BATON* that provides both range search, as in BATON*, and hash-based search, as in CHORD. We aim to design a generic structured network that can support a variety of functionality. We implement the system and conduct a scalability study on PlanetLab (www.planet-lab.org).
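To give a feel for what a larger logarithm base buys, the following minimal sketch compares the order-of-growth hop bound for different fanouts. It is not BATON* routing itself: the function name `max_hops` is hypothetical, and the constant factors hidden by the O-notation are ignored.

```python
def max_hops(num_nodes: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= num_nodes, i.e. ceil(log_fanout(num_nodes)).

    Integer arithmetic is used instead of math.log to avoid rounding errors
    on exact powers of the fanout.
    """
    if num_nodes < 1 or fanout < 2:
        raise ValueError("need num_nodes >= 1 and fanout >= 2")
    hops = 0
    reach = 1  # number of nodes distinguishable after `hops` routing steps
    while reach < num_nodes:
        reach *= fanout
        hops += 1
    return hops


if __name__ == "__main__":
    n = 1_000_000
    for m in (2, 4, 10):
        print(f"N={n}, base m={m}: about {max_hops(n, m)} hops")
```

For one million nodes, the bound drops from 20 hops at base 2 to 10 hops at base 4 and 6 hops at base 10, which is the effect the larger base is meant to capture.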
Query processing and searching
In P2P systems, query processing strategies must achieve the following. (1) Reasonably small routing cost. (2) Few wasted hops: hops to peers that do not hold any answer must be reduced as much as possible. (3) Few search messages: the messages used for a search must be limited in order to avoid heavy traffic on the network. (4) Query load balancing. Unlike simple exact or range queries, which might access data in any part of the space, the data access of some queries, for example skyline queries, is likely to be skewed towards the portion of the space that contains the answers (the skyline). Such hot spots can cause imbalance in the query load on the network, so special care must be taken to avoid it. Based on the overlays we adopt, we study query processing strategies for various types of queries, such as conventional SQL-based queries, aggregation queries, similarity queries and skyline queries.
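As an illustration of why skyline queries skew data access, here is a minimal centralized sketch of the skyline operator. It assumes that smaller values are better in every dimension; the names `dominates` and `skyline` are hypothetical, and the quadratic scan is chosen for clarity, not as the distributed strategy studied in the project.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every dimension and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    data = [(1, 9), (3, 3), (9, 1), (5, 5), (4, 8)]
    print(skyline(data))
```

In this example the result is `[(1, 9), (3, 3), (9, 1)]`: every answer lies along the low edge of the data space, so the peers responsible for that region would receive most of the query traffic.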
Data modeling and integration
Recently, several P2P data management systems have been proposed that do not require a centralized global schema. They typically define mappings to associate information between different peers. Queries can be posed to any peer, and the peer evaluates the query by exploiting the mappings in the system. Our earlier work, PeerDB, provides database sharing without relying on predefined schema mappings or the mapping-table approach.
Our new work likewise requires neither a predefined global schema nor any mappings between databases. Rather, it relies on an operator called keyword join, which takes a set of lists of local answers from different data sources as input and outputs a list of integrated results obtained by joining tuples from the input lists based on predefined similarity measures.
Searching and querying the stored data items requires system-wide distributed indexes. We will develop a distributed full-text index for searching the data items based on content. In addition, in order to support refined SQL queries, we need to build indexes on some popular attributes of the relations in the system. We will investigate all these issues in detail and conduct extensive experimental studies, since the design will affect other parts of the system.
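The keyword join described above can be sketched roughly as follows. This is only an illustration under stated assumptions: tuples are modeled as dictionaries, the similarity measure is assumed to be Jaccard similarity over lowercased value tokens, a combination is scored by its weakest pairwise similarity, and the names `tokens`, `jaccard` and `keyword_join` as well as the threshold are hypothetical rather than taken from the project.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]


def tokens(row: Row) -> Set[str]:
    """Lowercased words appearing in any attribute value of the tuple."""
    return {word.lower() for value in row.values() for word in str(value).split()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keyword_join(
    answer_lists: List[List[Row]], threshold: float = 0.2
) -> List[Tuple[float, Tuple[Row, ...]]]:
    """Join one tuple from each input list when all pairs are similar enough."""
    results = []
    for combo in product(*answer_lists):
        token_sets = [tokens(row) for row in combo]
        # Score a combination by its weakest pairwise similarity.
        score = min(
            (jaccard(a, b) for a, b in combinations(token_sets, 2)), default=1.0
        )
        if score >= threshold:
            results.append((score, combo))
    # Best-matching integrated results first; sort on the score only.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


if __name__ == "__main__":
    source_a = [{"name": "steel bolt", "supplier": "Acme"},
                {"name": "copper wire", "supplier": "Zeta"}]
    source_b = [{"item": "steel bolt M8", "price": "3"},
                {"item": "rubber hose", "price": "7"}]
    for score, rows in keyword_join([source_a, source_b]):
        print(round(score, 2), rows)
```

Note that the two sources use different attribute names (`name` versus `item`); the join works on values only, which is the sense in which no schema mapping is needed in this sketch.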
Load balancing and fault tolerance
One of the biggest performance problems in building an effective P2P network is load balancing among peers, because no peer has global knowledge. We will define a general histogram-based framework for load balancing in P2P systems.
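One way a histogram can stand in for missing global knowledge is sketched below. All details here are assumptions made for illustration: each peer is assumed to keep a small histogram mapping groups of other peers (buckets) to their average load, and the function names, bucket labels and the tolerance factor are hypothetical, not part of the framework itself.

```python
from typing import Dict


def estimated_average_load(histogram: Dict[str, float]) -> float:
    """Approximate the system-wide average load from the local histogram."""
    if not histogram:
        raise ValueError("histogram must contain at least one bucket")
    return sum(histogram.values()) / len(histogram)


def is_overloaded(own_load: float, histogram: Dict[str, float],
                  tolerance: float = 1.5) -> bool:
    """A peer counts as overloaded if it exceeds the estimate by the tolerance factor."""
    return own_load > tolerance * estimated_average_load(histogram)


def lightest_bucket(histogram: Dict[str, float]) -> str:
    """Bucket with the lowest average load, a candidate target for shedding load."""
    return min(histogram, key=histogram.get)


if __name__ == "__main__":
    # Hypothetical view held by one peer: average load per group of peers.
    view = {"left": 10.0, "right": 30.0, "parent": 20.0}
    if is_overloaded(35.0, view):
        print("shed load towards", lightest_bucket(view))
```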
Data security and privacy
Effective sharing of data in supply-chain management or the medical industry is essential to foster collaboration within a community, such as suppliers, transporters, manufacturers, distributors and retailers, so as to enhance the quality and efficiency of production. However, sharing data also exposes data holders to the threat of data theft, wrongful disclosure and exploitation of data. Incentives for unauthorized data distribution arise from an increasingly thriving data industry in which firms such as biotech companies collect, compile, share or sell biomedical data for profit. The fundamental challenge of providing safe and secure data management in such applications and environments is that the application systems are built upon heterogeneous systems, owned and administered by different organizations. Both data privacy and data ownership must be protected. To meet these dual needs, we will examine the issues of data security and privacy. Our initial plan is to examine the suitability of binning, digital watermarking, k-anonymity and l-diversity, and to propose a framework suitable for the intended applications, namely data sharing in nationwide supply-chain management.
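For readers unfamiliar with two of the techniques named above, the following minimal sketch checks k-anonymity and the simplest ("distinct") form of l-diversity on a table that has already been binned. The record layout, attribute names and function names are hypothetical examples, not the framework the project proposes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def group_by_quasi_identifiers(
    records: Sequence[Record], quasi_identifiers: Sequence[str]
) -> Dict[Tuple[str, ...], List[Record]]:
    """Group records that share the same quasi-identifier values."""
    groups: Dict[Tuple[str, ...], List[Record]] = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records: Sequence[Record],
                   quasi_identifiers: Sequence[str], k: int) -> bool:
    """Every quasi-identifier group must contain at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records: Sequence[Record], quasi_identifiers: Sequence[str],
                 sensitive: str, l: int) -> bool:
    """Distinct l-diversity: each group holds at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l for group in groups.values())


if __name__ == "__main__":
    table = [
        {"zip": "119***", "age": "30-39", "diagnosis": "flu"},
        {"zip": "119***", "age": "30-39", "diagnosis": "asthma"},
        {"zip": "138***", "age": "40-49", "diagnosis": "flu"},
        {"zip": "138***", "age": "40-49", "diagnosis": "flu"},
    ]
    print(is_k_anonymous(table, ["zip", "age"], k=2))
    print(is_l_diverse(table, ["zip", "age"], "diagnosis", l=2))
```

The example table is 2-anonymous but not 2-diverse: both records in the second group share the same diagnosis, so group membership alone reveals the sensitive value. This gap is the reason l-diversity is considered alongside k-anonymity.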

The S3 project is funded by A-star, Singapore.