|
Projects
The goal of our research is to devise
authentication mechanisms that enable a user to verify the query
results generated by an untrusted server. Our proposed mechanisms are
designed to satisfy all the following objectives, of which the first
four are security requirements while the last is intended to ensure
that the overheads incurred by the mechanisms are acceptable in
practice:
-
Authenticity –
The user can check that all the values in a query result originated
from the owner; they have not been tampered with, nor have spurious
entries been introduced.
-
Completeness –
The user can verify that all the data points that satisfy the
conditions of a query are included in the result.
-
Precision – The
ratio of the query result size over the total size of the data
points and their attribute values returned; should be high.
-
Security – It is
computationally infeasible for the publisher to cheat by generating
a proof for an incorrect query result.
-
Efficiency – The
procedure for the publisher to generate the proof for a correct
(i.e., complete and authentic) query result has polynomial
complexity. Likewise the procedure for the user to verify a query
result has polynomial complexity.
In this research project, we will develop
authentication schemes for various domains, ranging from traditional
relational context (SPJ queries) to data warehousing
(multi-dimensional queries), OLAP and data analysis applications
(aggregate queries). We will also develop techniques for query
execution assurances for complex and compute-intensive queries.
Relational
Databases (SPJ Queries)
As a start, we will focus on designing
a scheme for users to verify that their query
results are complete (i.e., no qualifying tuples are omitted) and
authentic (i.e., all the result values originated from the owner) in
the context of relational databases. We will design a scheme that
supports range selection, project, join as well as
multi-point
queries on relational databases.
Data
Warehousing/OLAP (Multidimensional Queries)
The
value of a data warehouse lies in its ability to support business
intelligence. These applications perform many multidimensional
calculations to analyze data across dimensions. An example query given
in the Oracle OLAP datasheet (http://www.oracle.com/technology/products/oracle9i/datasheets/olap.pdf)
is “the top ten products for each of the top ten customers during a
rolling six month time period based on growth in dollar sales”. In
this query, a product ranking is nested within a customer ranking, and
data is analyzed across a number of time periods and a virtual
measure.

Figure 1: Data Warehousing Set-Up
In a deployment set-up, the data warehouse may be
situated near the analytics users, away from the underlying production
(OLTP) database(s) as illustrated in Figure 1. To offer faster
response for large, distributed user communities, the data warehouse
may even be replicated on platforms such as content delivery network,
edge computing, computing grid, and peer-to-peer databases. Such a
deployment scenario corresponds to the data publishing model, and
consequently there is a need for the users to be able to authenticate
the results of the multidimensional queries on the data warehouse(s).
OLAP/Data
Analysis (Aggregate Queries)
Data
analysis applications typically aggregate data across many dimensions
looking for anomalies or unusual patterns. These applications
categorize data values and trends, extract statistical information,
and then contrast one category with another. To perform such data
analysis efficiently over large datasets, many of the analysis tools
are built on top of aggregation operators supported by database
management systems. The SQL aggregate operations and the GROUP BY
operator produce zero-dimensional or one-dimensional aggregates.
Specifically, the SQL standard (IS9075 International Standard for
Database Language SQL, 1992) specifies the following aggregate
operators: COUNT( ), SUM( ), MIN( ), MAX( ), AVG( ). Among those
aggregate operations, COUNT( ), MIN( ) and MAX( ) can be checked
easily from the meta-data, while AVG( ) is derived from SUM( ) and
COUNT( ). So we will focus on authentication for SUM( ) in our
research. In addition, we will address the authentication of the data
cube, a more powerful aggregate operator. As explained in [1], the
cube operator generalizes the histogram, cross-tabulation, roll-up,
drill-down, and sub-total constructs found in most report writers.
[1] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don
Reichart, Murali Venkatrao, Frank Pellow, Hamid Pirahesh, “Data Cube:
A Relational Aggregation Operator Generalizing Group-By, Cross-Tab,
and Sub-Totals”, Journal of Data Mining and Knowledge Discovery, 1(1),
1997, 29-53.
Query Execution Assurance
Ensuring completeness and authenticity for
arbitrary query types remain an open hard problem. In such cases, we
explore mechanism of runtime query “proofs” in a challenge-response
protocol. For each batch of queries, the server is “challenged” to
provide a proof of query execution (rather than on the answers
returned) which is then checked by the client as a prerequisite to
accepting the actual query results as accurate. This scenario is
especially useful for compute-intensive applications where servers may
cheat for economical reasons (claiming to have performed a
compute-intensive operation without actually doing it). We plan to
study this in the context of the expensive data mining process.
|