UIT2201: CS & the IT Revolution

(Database Query Processing)
A Worked Example with SQL and algorithmic processing using primitives
(with analysis of performance) [last updated; Oct 2014]


Consider a database with 3 tables, STUDENT-INFO, COURSE-INFO, and ENROLMENT. Assume
    • the STUDENT-INFO table has 30,000 (3x104) rows,
    • the COURSE-INFO table has 1,000 (103) rows, [BiYing checked CORS & said 1365 for Spr 2009. Thx]
    • the ENROLMENT table has 100,000 (105) rows.

STUDENT-INFO
Student-ID Name NRIC-ID Address Tel-No Faculty Major

...

...

...

...

...

...

...

COURSE-INFO
Course-ID Name Day Hour Venue Instructor

...

...

...

...

...

...

ENROLMENT
Student-ID Course-ID

...

...


Some General Comments about Efficiency: To analyze efficiency, we recall that:
  1. The e-project or e-select operation on a table with N rows take O(N) row operations.
  2. the e-join operation on a table (N rows) with another table (M rows) takes O(N*M) row operations.
  3. we assume that each row operation takes O(1) time.
For complex queries, there can be a very big difference between efficient procedures and inefficient procedures for answer the same query (and getting the same answer). We illustrate this with

Question Q4 from Spring 2012 Quiz:

(a) List the Course-ID, Day, Hour of all courses taught in the venue "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.

(i) SQL query:

(ii) Using basic primitives:

(b) List the Student-ID, Major of all FASS students who have classes in "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.

(i) SQL query:

(ii) Using basic primitives:


Question Q4 from Spring 2012 Quiz -- SOLUTION SKETCH

(a) List the Course-ID, Day, Hour of all courses taught in the venue "USP-SR1".
Give both SQL-query (declarative) and the procedural one using basic primitives.

We first give the declarative SQL-query: (that declares what, but not how.)
   SELECT  Course-ID, Day, Hour
   FROM    CI
   WHERE  (Venue = 'USP-SR1');

And here is the algorithmic version (using e-project, e-select, e-join): (the how)
   S1  <-- e-select from CI WHERE (Venue = 'USP-SR1');
   Ans <-- e-project Course-ID, Day, Hour from S1;

Analysis: (Assume m = number of courses taught in USP-SR1 = 20)
- First step (e-select) takes 1,000 row ops; table S1 has m = 20 rows.
- 2nd step (e-project) takes O(m) = 20 row ops;
Total: About 1,020 row operations!


(b) List the Student-ID, Major of all FASS students who have classes in "USP-SR1".

We first give the declarative SQL-query: (that declares what, but not how.)
   SELECT  Student-ID, Major
   FROM    CI, SI, EN
   WHERE  (Venue = 'USP-SR1') AND
          (Faculty = "FASS' ) AND
          (SI.SID = EN.SID ) AND
          (CI.CID - EN.CID );


There are many different algorithmic solutions to process this query with the primitives: (using e-project, e-select, e-join) (the how). They have different running times.

(X) Call this Algorithm Sp-2012-Q4-X (slow algorithm).
   X1  <-- e-join SI and EN WHERE (SI.Student-ID = EN.Student-ID);
   X2  <-- e-join X1 and CI WHERE (CI.Course-ID = X1.Course-ID);
   X3  <-- e-select from X2 WHERE (Faculty='FASS') and (Venue='USP-SR1');
   Ans <-- e-project Student-ID, Major from X3;

Analysis: (Assume q = # of FASS students taking classes in USP-SR1 = 50)
- First step (e-join) takes (30,000*100,000) row ops; table X1 has at least 100,000 rows.
- 2nd step (e-join) takes (1,000*100,000) row ops; table X2 has at least 100,000 rows.
- 3rd step (e-select) takes 100,000 row ops; table X3 has q = 50 rows.
- 4th step (e-project) takes 50 row ops;
Total: About 3,100,100,050 (about 3.1 x 109) row operations! (e-join is dominant operation!)

(G) A more efficient algorithm is Algorithm Sp-2010-Q4-G (faster algorithm).
   G1  <-- e-select from SI WHERE (Faculty='FASS');
   G2  <-- e-select from CI WHERE (Venue='USP-SR1');
   G3  <-- e-join G2 and EN WHERE (G2.Course-ID = EN.Course-ID);
   G4  <-- e-join G3 and G1 WHERE (G3.Student-ID = G1.Student-ID);
   Ans <-- e-project Student-ID, Major from G4;

Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
- 1st step (e-select) takes 30,000 row ops; table G1 has 3,000 rows.
- 2nd step (e-select) takes 1,000 row ops; table G2 has 20 rows.
- 3rd step (e-join) takes (20*100,000) row ops; table G3 has about 200 rows.
- 4th step (e-join) takes (200*3,000) row ops; table G4 has about 50 rows.
- 5th step (e-project) takes 50 row ops;

  • Total: About 2,631,050 (about 2.63 x 106) row operations! (a much smaller join operation!)

    Observation:
    There is about 1,200 times difference between algorithm Sp-2012-Q4-X and Sp-2012-Q4-G.
    (3.1 x 109)/(2.63 x 106)=1,187! That's (1sec vs 20min) or (1 day vs 3.25 years)!!

    (H) Still another possible algorithm is Algorithm Sp-2012-Q4-H.
       H1  <-- e-select from SI WHERE (Faculty='FASS');
       H2  <-- e-select from CI WHERE (Venue='USP-SR1');
       H3  <-- e-join H1 and EN WHERE (H1.Student-ID = EN.Student-ID);
       H4  <-- e-join H3 and H2 WHERE (H3.Course-ID = H2.Course-ID);
       Ans <-- e-project Student-ID, Major from H4;
    

    Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
    - 1st step (e-select) takes 30,000 row ops; table H1 has 3,000 rows.
    - 2nd step (e-select) takes 1,000 row ops; table H2 has 20 rows.
    - 3rd step (e-join) takes (3,000*100,000) row ops; table H3 has about 15,000 rows.
    - 4th step (e-join) takes (20*15,000) row ops; table H4 has about 50 rows.
    - 5th step (e-project) takes 50 row ops;

  • Total: About 300,331,050 (about 3 x 108) row operations!

    (This is about 10 times faster than Sp-2012-Q4-X.)
    Q: Where is the main problem with this algorithm?


    (M) There are other algorithms (using the basic primitives) that you can design for processing this query -- with running times somewhere between Sp-2012-Q4-G and Sp-2012-Q4-X.
    Q: Can you find two of them?

    Additional Discusion: prompted by tutorial on Wed, 06-Mar-2013:
    I thank everyone for the corrections, questions, discussions on this worked example -- especially on the estimation of the sizes of the intermediate tables. This is not an exact science and some estimation (ala Fermi-type estimation) is necessary. But, the good news is that while the estimates may differ a little bit, generally the final conclusion does not change.


    Final Lesson:
    Try to do e-select first whenever possible.
    Always do e-join last (wherever possible).
    And e-join with smallest tables first, whenever possible.



    UIT2201: CS & IT Revolution; (Spring 2013); Prof Leong HW