On Query Processing Engine

Due Monday, Nov 8, 2010, 0900 hrs

Goals

For this project, you will gain a feel for how query processing works in a “real” system, specifically focusing on a simple SPJ (Select-Project-Join) query engine. You will also hopefully see how different query execution trees have different performance results, which will provide some motivation for query optimization. (Note that the differences in running times in your simple query processing system will probably be minor, because you will be using “toy” data sets and an execution system that ignores many of the complex aspects of a real system.) All query processing techniques MUST BE implemented using the iterator model.

Specifications

This project will be a three-person group project, so you should form your team ASAP. Each group will do AT LEAST the following:

1. Implement the Block Nested Loops join

2. Implement ONE of the followings:

a) Sort Merge join (which means you must also implement an external sort-merge algorithm)

b) Indexed Nested Loops Join (you need to build an index)

c) Hash Join

d) Any other advanced join algorithms.

1. Modify the parser so that the following operations can be supported:

a) “order by” clause

b) ONE of these: aggregates, intersection, union, duplicate elimination, etc.

2. Implement an algorithm to process each operator you support in Part (3)

3. Replace the current optimizer (that comes with the package) with a “new” optimizer of your choice (either another randomized algorithm, dynamic programming or greedy heuristics).

Code

A query processing engine written in Java is provided here in tar.gz format. This code includes a simple query processor (with a SPJ parser, scan operator for reading data files, select and project operators and simple nested loops join for joining tables).

Begin by familiarizing yourselves with the basic architecture of the code and the object model. Try the code out as an user first, then as a developer.

Experiments

You should create five tables for you to experiment with:

l Flights(flno: integer, from: string, to: string, distance: integer, departs: time, arrives: time)

l Aircrafts(aid: integer, aname: string, cruisingrange: integer)

l Schedule(flno:integer, aid: integer)

l Certified(eid: integer, aid: integer)

l Employees(eid: integer, ename: string, salary: integer)

where the underscored attributes are the keys of the respective tables. Note that Employees relation describes pilots and other kinds of employees as well; every pilot is certified for some aircraft, and only pilots are certified to fly. The Schedule table shows the assignment of aircraft to flights. You will find a program (ConvertTxtToTBL) that converts an input text file to an output table in the package you downloaded. You should create your own data files. Since we are dealing with a large main memory system, your tables must contain sufficiently large number of records (more than 1000) to be able to see the difference in performance between the various algorithms.

For Experiment 1, run the following joins of two relations:

l Employees and Certified (via eid)

l Flights and Schedule (via flno)

l Schedule and Aircrafts (via aid)

Run each join using the two possible plans, under the two join algorithms your team have implemented. Record the time to perform each join under each join algorithm.

For Experiment 2, suppose we would like to know the list of pilots who have been scheduled for a flight. Your team should try out three different execution plans for this query (by restricting the join methods supported by the optimizer) and execute them with your different join algorithms.

Now do a combined write-up of the results your group obtained.

In your experimental write-up, include the timings, the plans used, and a discussion of the differences (if any) between timings on different algorithms and different plans. Also discuss why query execution plan orderings can make a more significant difference in a real-world system with larger data sets.

Submitting Your Project

Take your various new and modified source files and your experimental write-up and zip or tar/gzip them into a single file. You should then upload the file to the workbin in IVLE. Marks will be deducted for late submission: 10% per day. You will then be scheduled to demo your system (using the submitted version). At the same time, submit a hardcopy of a brief report (no more than 6 single-column pages) that describes what has been implemented and the experimental write-up.

Demo

You are required to demonstrate your project (all that you have implemented) during the week of 8-12 Nov 2010.

Please book a timeslot with Prof Tan for your demo. You will be given 20 minutes to demo your project.

Grading

You will get 70% of the total marks if you fulfill the basic requirement as specified. The remaining 30% will be awarded based on ADDITIONAL (creative) components/algorithms that your team develop, for example:

Identify and correct any bugs with the SPJ query engine provided to you.
Implement additional operators than the basic requirement.
Better user interfaces.
Better supporting programs.
Etc
Etc

Marks will be awarded for:

correctness of program
difficulty/complexity of the algorithms implemented
documentation within program
report - writeup on the experimental results (keep your writeup brief and straight to the point)
demonstration

CHEATING/PLAGIARISM

Groups found cheating or plagiarising WILL BE referred to the SoC/NUS disciplinary board. Note that “code sharing” is considered cheating.