(Database Query Processing)
A Worked Example with SQL and algorithmic processing
using primitives
(with analysis of performance) [last updated; Oct 2014]
Consider a database with 3 tables,
STUDENT-INFO,
COURSE-INFO, and
ENROLMENT.
Assume
•
the STUDENT-INFO table has 30,000 (3x104) rows,
•
the COURSE-INFO table has 1,000 (103) rows,
[BiYing checked CORS & said 1365 for Spr 2009. Thx]
•
the ENROLMENT table has 100,000 (105) rows.
Student-ID | Name | NRIC-ID | Address | Tel-No | Faculty | Major |
... |
... |
... |
... |
... |
... |
... |
|
|
Question Q4 from Spring 2012 Quiz:
(a) List the Course-ID, Day, Hour of all courses taught in the venue "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.
(i) SQL query:
(ii) Using basic primitives:
(b) List the Student-ID, Major of all FASS students who have classes
in "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.
(i) SQL query:
(ii) Using basic primitives:
(a) List the Course-ID, Day, Hour of all courses taught in the venue "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.
We first give the declarative SQL-query: (that declares what, but not how.)
SELECT Course-ID, Day, Hour FROM CI WHERE (Venue = 'USP-SR1'); |
And here is the algorithmic version (using e-project, e-select, e-join): (the how)
S1 <-- e-select from CI WHERE (Venue = 'USP-SR1'); Ans <-- e-project Course-ID, Day, Hour from S1; |
Analysis: (Assume m = number of courses taught in USP-SR1 = 20)
- First step (e-select) takes 1,000 row ops;
table S1 has m = 20 rows.
- 2nd step (e-project) takes O(m) = 20 row ops;
Total: About 1,020 row operations!
We first give the declarative SQL-query: (that declares what, but not how.)
SELECT Student-ID, Major FROM CI, SI, EN WHERE (Venue = 'USP-SR1') AND (Faculty = "FASS' ) AND (SI.SID = EN.SID ) AND (CI.CID - EN.CID ); |
There are many different algorithmic solutions to
process this query with the primitives: (using e-project, e-select, e-join)
(the how). They have different running times.
(X) Call this Algorithm Sp-2012-Q4-X (slow algorithm).
X1 <-- e-join SI and EN WHERE (SI.Student-ID = EN.Student-ID); X2 <-- e-join X1 and CI WHERE (CI.Course-ID = X1.Course-ID); X3 <-- e-select from X2 WHERE (Faculty='FASS') and (Venue='USP-SR1'); Ans <-- e-project Student-ID, Major from X3; |
(G) A more efficient algorithm is Algorithm Sp-2010-Q4-G (faster algorithm).
G1 <-- e-select from SI WHERE (Faculty='FASS'); G2 <-- e-select from CI WHERE (Venue='USP-SR1'); G3 <-- e-join G2 and EN WHERE (G2.Course-ID = EN.Course-ID); G4 <-- e-join G3 and G1 WHERE (G3.Student-ID = G1.Student-ID); Ans <-- e-project Student-ID, Major from G4; |
Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
- 1st step (e-select) takes 30,000 row ops;
table G1 has 3,000 rows.
- 2nd step (e-select) takes 1,000 row ops;
table G2 has 20 rows.
- 3rd step (e-join) takes (20*100,000) row ops;
table G3 has about 200 rows.
- 4th step (e-join) takes (200*3,000) row ops;
table G4 has about 50 rows.
- 5th step (e-project) takes 50 row ops;
(3.1 x 109)/(2.63 x 106)=1,187! That's (1sec vs 20min) or (1 day vs 3.25 years)!! |
(H) Still another possible algorithm is Algorithm Sp-2012-Q4-H.
H1 <-- e-select from SI WHERE (Faculty='FASS'); H2 <-- e-select from CI WHERE (Venue='USP-SR1'); H3 <-- e-join H1 and EN WHERE (H1.Student-ID = EN.Student-ID); H4 <-- e-join H3 and H2 WHERE (H3.Course-ID = H2.Course-ID); Ans <-- e-project Student-ID, Major from H4; |
Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
- 1st step (e-select) takes 30,000 row ops;
table H1 has 3,000 rows.
- 2nd step (e-select) takes 1,000 row ops;
table H2 has 20 rows.
- 3rd step (e-join) takes (3,000*100,000) row ops;
table H3 has about 15,000 rows.
- 4th step (e-join) takes (20*15,000) row ops;
table H4 has about 50 rows.
- 5th step (e-project) takes 50 row ops;
(This is about 10 times faster than Sp-2012-Q4-X.)
Q: Where is the main problem with this algorithm?
(M)
There are other algorithms (using the basic primitives) that you can
design for processing this query -- with running times somewhere
between Sp-2012-Q4-G and Sp-2012-Q4-X.
Q: Can you find two of them?
|
Final Lesson:
Try to do e-select first whenever possible. Always do e-join last (wherever possible). And e-join with smallest tables first, whenever possible. |