UIT2201 (Fall 2016), NUS

UIT2201: Tutorial Set 8 (Fall 2016)
(Solution Sketch to Selected Problems)
NOT TO BE GIVEN TO FUTURE UIT2201 STUDENTS

Comments on designing algorithmic query processing:
0. First get the equivalent SQL query. (So we know which tables are used).
1. Use e-select first wherever possible to reduce table sizes.
2. Use e-join as late as possible.
3. Use e-project at the end.
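The guidelines above can be tried out on a toy example. The following is a minimal sketch, not the course's own implementation: it assumes tables are Python lists of dictionaries, that e-select and e-project cost one row operation per input row, and that e-join is a nested loop costing (rows in left table) x (rows in right table). The names `OpCounter`, `e_select`, `e_project`, `e_join` and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Table = List[Row]


class OpCounter:
    """Tallies row operations so that two plans can be compared."""

    def __init__(self) -> None:
        self.ops = 0


def e_select(table: Table, keep: Callable[[Row], bool], counter: OpCounter) -> Table:
    """Keep the rows that satisfy the condition; one row op per input row."""
    counter.ops += len(table)
    return [row for row in table if keep(row)]


def e_project(table: Table, columns: List[str], counter: OpCounter) -> Table:
    """Keep only the named columns; one row op per input row."""
    counter.ops += len(table)
    return [{c: row[c] for c in columns} for row in table]


def e_join(left: Table, right: Table,
           match: Callable[[Row, Row], bool], counter: OpCounter) -> Table:
    """Nested-loop join; every pair of rows is compared once."""
    counter.ops += len(left) * len(right)
    return [{**l, **r} for l in left for r in right if match(l, r)]


# Hypothetical toy data: 6 students (2 History majors), 8 enrolment rows.
students: Table = [
    {"sid": i, "major": "History" if i in (1, 4) else "Physics"}
    for i in range(1, 7)
]
enrolment: Table = [
    {"en_sid": s, "course": c}
    for s, c in [(1, "HY101"), (1, "HY102"), (2, "PC101"), (3, "PC101"),
                 (4, "HY101"), (5, "PC102"), (6, "PC102"), (6, "PC101")]
]


def is_history(row: Row) -> bool:
    return row["major"] == "History"


def same_student(s: Row, e: Row) -> bool:
    return s["sid"] == e["en_sid"]


def join_first() -> Tuple[Table, int]:
    """e-join, then e-select, then e-project."""
    counter = OpCounter()
    b1 = e_join(students, enrolment, same_student, counter)
    b2 = e_select(b1, is_history, counter)
    ans = e_project(b2, ["sid", "major", "course"], counter)
    return ans, counter.ops


def select_first() -> Tuple[Table, int]:
    """e-select, then e-join, then e-project."""
    counter = OpCounter()
    g1 = e_select(students, is_history, counter)
    g2 = e_join(g1, enrolment, same_student, counter)
    ans = e_project(g2, ["sid", "major", "course"], counter)
    return ans, counter.ops


bad_ans, bad_ops = join_first()
good_ans, good_ops = select_first()
print(bad_ops, good_ops)  # 59 25
```

Both plans return the same three rows; only the number of row operations differs, because the join in the second plan sees 2 student rows instead of 6.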

T8-D2: (Algorithmic Query Processing) --- SOLUTION SKETCH

The declarative SQL-query: (declares what, but not how.)
SELECT ID, PlanType FROM Employees, InsurancePolicies WHERE (Birthdate > #1/01/60#) AND (ID = EmployeeID);

The algorithmic version (using e-project, e-select, e-join): (the how)
S1 <== e-select from Employees where (Birthdate > #1/01/60#);
S2 <== e-join S1 and InsurancePolicies where (ID = EmployeeID);
Ans <== e-project ID, PlanType from S2;
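To see the declarative query run, here is a small sketch using Python's built-in sqlite3 module. It rests on several assumptions that are not in the tutorial: the rows are made up, birthdates are stored as ISO text (so string comparison orders them correctly), #1/01/60# is read as 1 January 1960, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (ID INTEGER PRIMARY KEY, LastName TEXT, Birthdate TEXT);
    CREATE TABLE InsurancePolicies (EmployeeID INTEGER, PlanType TEXT);
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(101, "Tan", "1955-03-14"),
     (102, "Lim", "1962-07-01"),
     (103, "Lee", "1971-11-23")],
)
conn.executemany(
    "INSERT INTO InsurancePolicies VALUES (?, ?)",
    [(101, "A"), (102, "B"), (103, "A"), (103, "C")],
)

# The query says what is wanted; the database engine decides how to get it.
rows = conn.execute(
    "SELECT ID, PlanType FROM Employees, InsurancePolicies "
    "WHERE (Birthdate > '1960-01-01') AND (ID = EmployeeID) "
    "ORDER BY ID, PlanType"
).fetchall()
conn.close()

print(rows)  # [(102, 'B'), (103, 'A'), (103, 'C')]
```

Employee 101 was born before 1960 and is filtered out; employee 103 appears twice because two policy rows match.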

T8-Q1: (Algorithmic Query Processing) --- SOLUTION SKETCH

The declarative SQL-query: (declares what, but not how.)
SELECT ID, LastName, FirstName, PayRate FROM EMPLOYEES WHERE (PayRate < 15.00);

The algorithmic version (using e-project, e-select, e-join): (the how)
T1 <== e-select from Employees where (PayRate < 15.00);
Ans <== e-project ID, LastName, FirstName, PayRate from T1;

Consider a database with 3 tables, STUDENT-INFO, COURSE-INFO, and ENROLMENT (abbreviated SI, CI, and EN in the queries below). Assume
• the STUDENT-INFO table has 30,000 (3x10⁴) rows,
• the COURSE-INFO table has 1,000 (10³) rows,
• the ENROLMENT table has 100,000 (10⁵) rows.

STUDENT-INFO

`Student-ID`	`Name`	`NRIC-ID`	`Address`	`Tel-No`	`Faculty`	`Major`
...	...	...	...	...	...	...

COURSE-INFO

`Course-ID`	`Name`	`Day`	`Hour`	`Venue`	`Instructor`
...	...	...	...	...	...

ENROLMENT

`Student-ID`	`Course-ID`
...	...

T8-Q2: (5 points) (Algorithmic Query Processing)

The query is
List the Student-ID, SI.Name, Tel-No of all History majors;

The declarative SQL-query: (declares what, but not how.)
SELECT SI.Student-ID, SI.Name, SI.Tel-No FROM SI WHERE (Major="History");

The algorithmic version (using e-project, e-select, e-join): (the how)
U1 <== e-select FROM SI WHERE (Major="History");
Ans <== e-project SI.Student-ID, SI.Name, SI.Tel-No FROM U1;

Analysis: (Assume m = number of history majors = 200)

First step (e-select) takes 30,000 row ops; table U1 has m = 200 rows.
2nd step (e-project) takes O(m) = 200 row ops;
Total: About 30,200 row operations!

T8-Q3: (10 points) (Algorithmic Query Processing)
The query is
List the Student-ID, Major, Course-ID of all courses taken by History majors;

The declarative SQL-query: (declares what, but not how.)
SELECT SI.Student-ID, SI.Major, EN.Course-ID FROM SI, EN WHERE (Major="History") AND (SI.Student-ID = EN.Student-ID);

(a) [The Bad Algorithm -- e-join, e-select, e-project]
B1 <== e-join SI and EN WHERE (SI.Student-ID = EN.Student-ID);
B2 <== e-select from B1 WHERE (Major="History");
Ans <== e-project SI.Student-ID, SI.Major, EN.Course-ID FROM B2;

Analysis: (Assume m = number of history majors = 200; each takes 5 courses)

First step (e-join) takes (30,000*100,000) row ops;
table B1 has exactly 100,000 rows. (each row in EN matches exactly one row in SI)
2nd step (e-select) takes 100,000 row ops; table B2 has 1000 rows.
3rd step (e-project) takes 1000 row ops;
Total: About 3,000,101,000 (about 3x10⁹) row operations! (dominated by the join operation!)

(b) [The Good Algorithm -- e-select, e-join, e-project]
G1 <== e-select FROM SI WHERE (Major="History");
G2 <== e-join G1 and EN WHERE (G1.Student-ID = EN.Student-ID);
Ans <== e-project SI.Student-ID, SI.Major, EN.Course-ID FROM G2;

Analysis: (Assume m = number of history majors = 200; each takes 5 courses)

1st step (e-select) takes 30,000 row ops; table G1 has m = 200 rows.
2nd step (e-join) takes (200*100,000) row ops; table G2 has about 1000 rows.
3rd step (e-project) takes 1000 row ops;
Total: About 20,031,000 (about 2x10⁷) row operations! (a much smaller join operation!)

Observation: The bad algorithm needs about 150 times as many row operations as the good algorithm. (3x10⁹/2x10⁷ = 150). If the good algorithm takes 1 minute, the bad one takes about 2.5 hours!!
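The arithmetic behind the two totals can be checked with a few lines of Python. This is only a sketch of the cost model used in the analysis above (nested-loop join cost = product of the two table sizes, one row op per input row for e-select and e-project); the constant and function names are made up for illustration.

```python
SI_ROWS = 30_000        # rows in STUDENT-INFO
EN_ROWS = 100_000       # rows in ENROLMENT
HISTORY_MAJORS = 200    # assumed m
COURSES_EACH = 5        # assumed courses per History major


def bad_plan_cost() -> int:
    """e-join first, then e-select, then e-project."""
    join = SI_ROWS * EN_ROWS                 # B1 has EN_ROWS rows
    select = EN_ROWS                         # scan all of B1
    project = HISTORY_MAJORS * COURSES_EACH  # B2 has 1,000 rows
    return join + select + project


def good_plan_cost() -> int:
    """e-select first, then e-join, then e-project."""
    select = SI_ROWS                         # scan all of SI
    join = HISTORY_MAJORS * EN_ROWS          # G1 has only 200 rows
    project = HISTORY_MAJORS * COURSES_EACH  # G2 has about 1,000 rows
    return select + join + project


bad = bad_plan_cost()
good = good_plan_cost()
ratio = bad / good

print(f"bad  = {bad:,}")       # bad  = 3,000,101,000
print(f"good = {good:,}")      # good = 20,031,000
print(f"ratio = {ratio:.1f}")  # ratio = 149.8
```

At a fixed number of row operations per second, a ratio of roughly 150 means a job that takes 1 minute with the good plan takes about 150 minutes (2.5 hours) with the bad one.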

T8-Q4: (10 points) (Multiple Joins Query)
The query is
List the Student-ID, SI.Name, Tel-No of all History majors who have lectures in "LT13".

The declarative SQL-query: (declares what, but not how.)
SELECT SI.Student-ID, SI.Name, SI.Tel-No FROM SI, CI, EN WHERE (Major="History") AND (Venue="LT13") AND (SI.Student-ID = EN.Student-ID) AND (CI.Course-ID = EN.Course-ID);

The algorithmic version; (Remember: e-selects first, e-joins later, whenever possible.)
H1 <== e-select FROM SI WHERE (Major="History");
H2 <== e-select FROM CI WHERE (Venue="LT13");
H3 <== e-join H2 and EN WHERE (H2.Course-ID = EN.Course-ID);
H4 <== e-join H3 and H1 WHERE (H3.Student-ID = H1.Student-ID);
Ans <== e-project SI.Student-ID, SI.Name, SI.Tel-No FROM H4;

Analysis: (Assume m = number of history majors = 200; each takes 5 courses)
(Assume p = courses using LT13 per week = 30, average 100 each class;)

1st step (e-select) takes 30,000 row ops; table H1 has m = 200 rows.
2nd step (e-select) takes 1,000 row ops; table H2 has p = 30 rows.
3rd step (e-join) takes (30*100,000) row ops; table H3 has about 3000 rows.
4th step (e-join) takes (3000*200) row ops; table H4 has no more than 3000 rows.
5th step (e-project) takes 3000 row ops;
Total: About 3,634,000 (about 3.6x10⁶) row operations! (a much smaller join operation!)
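The step-by-step count for this plan can be reproduced as below. This is a sketch under the same cost model and the same assumed values (m = 200, 5 courses each, p = 30, 100 students per class). The second function is not part of the solution above: it is a hypothetical alternative that joins the History majors with EN first, included only to show how the join order changes the total under these assumptions.

```python
SI_ROWS, CI_ROWS, EN_ROWS = 30_000, 1_000, 100_000
M = 200             # assumed number of History majors
COURSES_EACH = 5    # assumed courses per History major
P = 30              # assumed number of courses using LT13
CLASS_SIZE = 100    # assumed average enrolment per course


def plan_in_solution() -> dict:
    """H2 join EN first, then join with H1 (the order used above)."""
    h3_rows = P * CLASS_SIZE            # about 3,000 rows
    return {
        "e-select SI": SI_ROWS,
        "e-select CI": CI_ROWS,
        "e-join H2, EN": P * EN_ROWS,
        "e-join H3, H1": h3_rows * M,
        "e-project": h3_rows,           # H4 has at most 3,000 rows
    }


def alternative_plan() -> dict:
    """Hypothetical order: H1 join EN first, then join with H2."""
    x_rows = M * COURSES_EACH           # about 1,000 rows
    return {
        "e-select SI": SI_ROWS,
        "e-select CI": CI_ROWS,
        "e-join H1, EN": M * EN_ROWS,
        "e-join X, H2": x_rows * P,
        "e-project": x_rows,            # result has at most 1,000 rows
    }


solution_total = sum(plan_in_solution().values())
alternative_total = sum(alternative_plan().values())

print(f"{solution_total:,}")     # 3,634,000
print(f"{alternative_total:,}")  # 20,062,000
```

Under these numbers, joining EN with the smaller selected table (30 rows from CI rather than 200 rows from SI) keeps the expensive first join smaller.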

UIT2201: CS & IT Revolution; (LeongHW, 2016)
