Course Overview
Description
The goal of this semester is to better understand how we deal with graphs and datasets that are very, very big. We will explore how to adapt classical algorithms, introducing a variety of tools for building efficient algorithms that can handle data at scale. As the semester progresses, we will encounter a variety of challenges:
 Scale: The problems are bigger. A graph with 1000 nodes is easy to deal with. A graph with one billion nodes is not. In many ways, this question of scale motivates much of what we are talking about this semester.
 Where is the data? The data is often not provided in a simple and accessible format. It may be distributed among different servers. It may be provided online as a stream of information. There may be noise in the dataset.
 Dynamic changes: The data is no longer static. For example, instead of working with a fixed, unchanging graph, the graph may change over time. It is no longer sufficient to simply solve a problem once.
 Context matters: The data comes from somewhere, and that matters. Does it represent a social network or a wireless network or a game? There are often special properties and structure of the data that you can leverage.
Imagine, for example, that you want to find a minimum spanning tree for a very large graph. Classical solutions like Prim's and Kruskal's may be difficult to implement efficiently if your graph contains one billion nodes! Can you find a faster randomized algorithm? Can you leverage special structure (e.g., is the graph planar, or derived from a social network) to derive a faster algorithm? Can you find an approximate solution faster? What if the graph is available only as a stream? What if the graph is changing over time with edges being added and removed (perhaps as friendships are made and broken). How does cache performance impact your algorithm? Can you take advantage of multicore machines to solve the problem faster? Maybe you could use a distributed cluster (or Hadoop)?
The goal this semester to develop a set of tools you can use to adapt graph algorithms for these types of scenarios, when the graph is very large and you are no longer able to simply apply classical solutions.
Grading
Grades in the course will be based on problem sets (including a miniproject), a midterm exam, and a final exam.
In the last month of the semester, you will have the chance to work on a small project that involves further exploring some of the ideas that we have covered during the class.
Schedule
Below is the tentative schedule for the course. I will modify
the schedule as the course progresses.
Problem sets will be posted below on this schedule as they become available.
Class
Number  Date  Description 
  Sampling and Sketching. 
  Part 1: Graph properties in less than linear time 
1  15/08  Introduction and logistics. Sampling. Graph connectivity.

2  22/08  Finding the weight of an MST using sampling
 Lecture notes (handwritten, pdf) [Download]
 Lecture notes (slides, pdf) [Download]
 Notes on sublinear time algorithms [Download]
 Problem Set 2 [Download] (Due: Aug. 30)
 Problem Set 2 solutions [Download]
 Background reading:

3  29/08  Property Testing, Sorted, and Yao's Principle

  Part 2: Sketches and streams 
4  05/09  Sampling from a stream.
 Lecture notes (slides, pdf) [Download]
 Lecture notes (handwritten, pdf) [Download]
 Tutorial notes [Download]
 Problem Set 3 [Download] (Due: Sept. 15)
 Problem Set 3 solutions [Download]
 Background reading:

5  12/09  Streaming algorithms for graphs.
 Lecture slides (powerpoint, pdf) [Download]
 Lecture notes (handwritten, pdf) [Download]
 Other material:
 Amit Chakrabarti notes on streaming algorithms for graphs: Chapter 13 (Graphs).
 Amit Chakrabarti notes on streaming algorithms for graphs: Chapter 6 (TugofWar sketch).
 Amit Chakrabarti notes on streaming algorithms for graphs: Chapter 14 (Matchings and Triangles).
 Moses Charikar lectures on shortest paths and counting triangles. (Local copy)
 A survey by Andrew McGregor on streaming algorithms for graphs.
 A sequence of lectures on streaming algorithms for graphs, including the dynamic connectivity algorithm seen in class.

6  19/09  Clustering and Streaming Algorithms for Clustering.
 Lecture notes (PowerPoint, pdf) [Download]
 Problem Set 4 [Download] (Due: Oct. 3)
 Additional background material on clustering:

Break  26/09  No class: Midsemester break

  Efficient Algorithms for Modern Machines. 
  Part 3: Efficient Algorithms for Modern Machines 
7  3/10  Introduction to Caching
 Lecture notes, introduction to caching (slides, pdf) [Download]
 Lecture notes, Btrees and Buffer Trees (handwritten, pdf) [Download]
 Notes on caching (pdf) [Download]
 Additional background material on caching:

8  10/10  Caches and cacheefficient algorithms (Part 2)
 Lecture notes, cacheefficient graph algorithms (slides, pdf) [Download]
 Sample MidTerm Exam (2018) [Download]
 Sample MidTerm Exam (2018) solutions [Download]
 MiniProject Instructions [Download]
 MiniProject Proposal due. Submit here.
 MiniProject partner finding spreadsheet. Beware: this link is currently public. Do not put any sensitive information. (I will remove the link once teams have been formed.)

9  17/10  MidTerm Exam

  Part 4: Parallel Algorithms 
10  24/10  Multicore Parallel Algorithms
 Notes on parallel algorithms (pdf) [Download]
 Lecture notes, parallel algorithms (handwritten, pdf) [Download]
 Lecture notes, parallel algorithms (slides) (pdf) [Download]

11  31/10  MPC Parallel Algorithms (Part 1)
 Notes on parallel algorithms (pdf) [Download]
 Lecture notes, MapReduce algorithms (slides, pdf) [Download]
 Notes on parallel prefixsums (pdf) [Download]
 Other notes:

12  7/11  MPC Parallel Algorithms (Part 2)

13  14/11  MiniProject Presentations and Wrap Up
 Wrapup Slides (pdf) [Download]
 Presentations:
Thanks to all the students that presented!

  End of class 
 27/11  Final Exam (Afternoon)
 Please check the official schedule to confirm the time and place.
 Sample Final Exam (2018) [Download]
 Sample Final Exam (2018) solutions [Download]

Miscellaneous Resources
Below you can find various resources that may help you with the material we are covering in this class.
 For a quick review of asymptotic notation:
 Brief Notes by TB Schardl (for an MIT undergraduate algorithms class).
 Lecture notes for the MIT undergraduate algorithms class on asymptotic notation.
 For more details, you might look at Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein, Chapter 3.
 For a review of some of the basic tools of randomized algorithms:
 Notes on Markov's Inequality, Chebychev's Inequality, and Chernoff Bounds.
 Recitation problems from the same class (above) on randomized algorithms.
 For notes on streaming algorithms, see the class by Jelani Nelson at Harvard. (Of particular note to this class, see the lectures on CountMin sketch and CountSketch and graph sketching.) There are also links to many other resources.
 For sublinear time algorithms, see the following classes:
 Czumac and and Sohler wrote a nice survey a few years ago on what is currently known about sublinear time algorithms.
 McGregor wrote a nice survey recently on the state of the art in streaming algorithms for graphs.
 Lars Arge is one of the world experts on cacheefficient algorithms. You can find notes from his class here. They cover several of the data structures we looked at. (It also includes some good references to interesting papers.)
 Erik Demaine wrote an interesting survey on cacheoblivious algorithms. You can find more recent details in his lectures notes: Lecture 7, Lecture 8, and Lecture 9 of his advanced algorithms class. (This includes details on the dynamic cacheoblivious search tree that we skipped.)
 Forkjoin parallelism has been implemented in several languages. For example, see here for Java documentation. See CilkPlus for a C/C++ implementation (i.e., Intel's Cilk Plus).
 You can find a somewhat old set of notes on multithreaded progrmming here. While old, it contains most of the basic concepts we talked about. More recent lectures notes (in the same style): Lecture 1 and Lecture 2.
Course Logistics
Office Hours
For now, office hours are by apointment. Once the semester begins, I'll schedule regular weekly office hours (and update the website here).
Contact
To ask me small logistical questions, I prefer that you grab my attention
immediately before or after class. For substantive content questions, schedule a meeting with me.
Problem Sets
There will be weekly problem sets in the first 2/3rds of the class. (I expect this will end up being 67 problem sets in total.) These problems sets will focus on algorithm questions: solving problems, devising algorithms, proving theorems. There may be a few short programming assignments.
Problems will be graded on a simplistic 0/1/2/3 scale: a 0 indicates minimal understanding (or a problem not completed); a 1 indicates some understanding but many mistakes or poor presentation; 2 indicates a satisfactory answer; a 3 indicates a perfect answer that goes beyond expectations. You should expect to get a 2 on most problems.
The following rules will help keep the problem set submission and
grading process running smoothly:
Some advice on problem sets:
 Answers to homework should be written at a level suitable for and to address the instructor. In other words, don't put in long explanation for trivial things.

You will be graded not only on the correctness and efficiency of your answers, but also on your clarity. Therefore, it pays to make your answers more concise and clearer. Understandability of the solution is as desirable as correctness. Sloppy answers will be at a disadvantage. Learn to make appropriate definitions to make your answers more precise and concise.

Solutions do not have to be typeset, but they have to be neat and readable. Do not waste your time beautifying your answers. If you want to spend more time, you should spend it on improving its clarity, understandability, and other aspects of its technical content.

From a previous module, you can find some advice on proofs here.

From a previous module, you can find a sample of what a good solution looks like here.
Exams
There will be one two exams in the course: a midterm and a final. The midterm exam will be held in class on October 17. The final exam will be held on November 27. (Please check the official schedule to confirm, as it may have changed since this webpage was created.)
Academic Integrity
I take academic integrity seriously. To repeat the problem set instructions from above: the solutions you submit must be your own unique work. Any cases of cheating will be dealt with harshly: a first offense will result in at least a one letter grade deduction for the module (or potentially a zero for the entire module, depending on the severity). When in doubt, ask me what is allowed.