For this project,
you will gain a feel for how query processing works in a “real” system,
specifically focusing on a simple SPJ (Select-Project-Join) query engine.
You will also hopefully see how different query execution trees have different
performance results, which will provide some motivation for query
optimization. (Note that the differences in running times in your simple
query processing system will probably be minor, because you will be using “toy”
data sets and an execution system that ignores many of the complex aspects of a
real system.) All
query processing techniques MUST BE implemented using the iterator
model.
This project will
be a three-person group project, so you should form your team ASAP. Each
group will do AT LEAST the following:
1.
Implement
the Block Nested Loops join
2.
Implement
a) Sort Merge join (which means you must also
implement an external sort-merge algorithm)
b) Indexed Nested Loops Join (you need to build
an index)
c) Hash Join
d) Any other advanced join algorithms.
1.
Modify the parser so that the following operations
can be supported:
a) “order by” clause
b)
2.
Implement
an algorithm to process each operator you support in Part (3)
3.
Replace
the current optimizer (that comes with the package) with a “new” optimizer of
your choice (either another randomized algorithm,
dynamic programming or greedy heuristics).
Code
A query processing
engine written in Java is provided here in tar.gz
format. This code includes a simple query processor (with a SPJ parser,
scan operator for reading data files, select and project operators and simple
nested loops join for joining tables).
Begin by familiarizing yourselves with the basic architecture of the
code and the object model. Try the code out as an
user
first, then as a developer.
You should create
five tables for you to experiment with:
l
Flights(flno: integer, from: string, to: string,
distance: integer, departs: time, arrives: time)
l
Aircrafts(aid:
integer, aname: string, cruisingrange:
integer)
l
Schedule(flno:integer, aid: integer)
l
Certified(eid: integer, aid: integer)
l
Employees(eid: integer, ename:
string, salary: integer)
where the underscored attributes are the keys of
the respective tables. Note that Employees relation describes pilots and other
kinds of employees as well; every pilot is certified for some aircraft, and
only pilots are certified to fly. The Schedule table shows the assignment of
aircraft to flights. You will find a program (ConvertTxtToTBL)
that converts an input text file to an output table in the package you
downloaded. You should create your own data files. Since we are dealing with a
large main memory system, your tables must contain sufficiently large number of
records (more than 1000) to be able to see the difference in performance
between the various algorithms.
For Experiment 1,
run the following joins of two relations:
l Employees and Certified (via eid)
l Flights and Schedule (via flno)
l Schedule and Aircrafts (via aid)
Run each join using
the two possible plans, under the two join algorithms your team have
implemented. Record the time to perform each join under each join
algorithm.
For Experiment 2, suppose we would
like to know the list of pilots who have been
scheduled for a flight. Your team should try out three different
execution plans for this query (by restricting the join methods supported by
the optimizer) and execute them with your different join algorithms.
Now do a combined write-up of the
results your group obtained.
In your experimental write-up,
include the timings, the plans used, and a discussion of the differences (if
any) between timings on different algorithms and different plans.
Also discuss why query execution plan orderings can make a more significant
difference in a real-world system with larger data sets.
Take your various
new and modified source files and your experimental write-up and zip or tar/gzip them into a single file. You should then upload
the file to the workbin in IVLE. Marks will be deducted for late submission: 10%
per day. You will then be scheduled to demo your system (using the
submitted version). At the same time, submit a hardcopy of a brief report (no more than 6 single-column pages)
that describes what has been implemented and the experimental write-up.
Demo
You are required to
demonstrate your project (all that you have implemented) during the week of 8-12
Nov 2010.
Please book a
timeslot with Prof Tan for your demo. You will be given 20 minutes to demo your
project.
Grading
You will get 70% of the total marks if you
fulfill the basic requirement as specified. The remaining 30% will be awarded
based on ADDITIONAL (creative) components/algorithms that your team develop, for example:
Marks will be awarded for:
CHEATING/PLAGIARISM
Groups found cheating
or plagiarising WILL BE referred to the SoC/