CS4257: Algorithmic Foundations of Privacy



Course Information

Instructor: Prof. Reza Shokri
Semester: Spring 2019

Lectures: Mon. 10:00 - 12:00
Location: Seminar Room 2

Office Hours: Mon. 13:00 - 14:00 (Please email in advance.)

Course Description

Individuals’ data is being collected and processed at an unprecedented scale. This data is, in most cases, personal and sensitive, and is observable to untrusted entities. Moreover, private communication between individuals happens over public networks that are also observable to untrusted entities. Given the available information, an attacker can use inference algorithms to learn even more about each individual. As a result, large-scale attacks and privacy breaches increasingly threaten network and data privacy, and conventional security techniques alone are not sufficient to safeguard individuals’ privacy.

This module covers the algorithmic foundations of computation and communication privacy. It provides a thorough methodology for analyzing privacy against inference attacks, using techniques from statistics, probability theory, and machine learning. Students will learn how to reason quantitatively about privacy and evaluate it using appropriate metrics. The module will help students design privacy-preserving mechanisms for a range of systems, from anonymous communication to data analytics. After this module, students should be able to identify privacy vulnerabilities in a system, design inference attacks, and propose effective countermeasures in a systematic manner.

Schedule

[Week 00] Background (probability and inference)
[Read M03I (chapters 1 through 3)]

[Week 01] Course overview; Introduction to privacy; Traffic analysis
[slides] [Read D07I, S05A (optional)]

[Week 02] Anonymous communications; Mix Networks; Tor; Measuring Anonymity
[slides] [Read C81U, D04T, S02T]

[Week 03] Anonymous communications; Unobservability; Dining Cryptographers
[slides] [Read C88T, C15R]

[Week 04] Data privacy; Anonymization
[slides] [Read L07T, and G08C]

[Week 05] (Location) data privacy; Inference attacks
[slides] [Read S11Q]

[Week 06] Encrypted Databases (Reconstruction Attacks); Summary Statistics (Tracing / Membership Inference Attacks)
[Read B17T (first 5 sections), S09G (main paper, and T1,T2,T3 of the supplementary material), D03R]

[Week 07] Differential privacy
[Read D14A (Chapters 1 and 2)]

[Week 08] Differential Privacy; Laplacian Mechanism; Randomized Response; Composition Theorem
[Read D14A (Chapter 3 -- up to 3.4)]

[Week 09] Differential Privacy; Exponential Mechanism
[Read D14A (Chapter 3)]

[Week 10] Oblivious RAM
[slides] [Read G96S]

[Week 11] Differential Privacy; Median; Smooth Sensitivity
[Read N07S]

[Week 12] Differential Privacy; Sparse Vector Technique
[Read D14A (Chapter 3), and L17U]

[Week 13] Overview

References

[M03I] Information theory, inference and learning algorithms, MacKay, David JC. 2003

[D07I] Introducing traffic analysis, Danezis, George and Clayton, Richard. 2007

[S05A] A taxonomy of privacy, Solove, Daniel J. 2005

[C81U] Untraceable electronic mail, return addresses, and digital pseudonyms, Chaum, David L. 1981

[C88T] The dining cryptographers problem: Unconditional sender and recipient untraceability, Chaum, David. 1988

[D04T] Tor: The second-generation onion router, Dingledine, Roger and Mathewson, Nick and Syverson, Paul. 2004

[S02T] Towards an information theoretic metric for anonymity, Serjantov, Andrei and Danezis, George. 2002

[C15R] Riposte: An anonymous messaging system handling millions of users, Corrigan-Gibbs, Henry, Dan Boneh, and David Mazières. 2015

[L07T] t-closeness: Privacy beyond k-anonymity and l-diversity, Li, Ninghui, Tiancheng Li, and Suresh Venkatasubramanian. 2007

[G08C] Composition Attacks and Auxiliary Information in Data Privacy, Ganta, Srivatsava Ranjit, Shiva Prasad Kasiviswanathan, and Adam Smith. 2008

[S11Q] Quantifying location privacy, Shokri, Reza, George Theodorakopoulos, Jean-Yves Le Boudec, and Jean-Pierre Hubaux. 2011

[B17T] The Tao of Inference in Privacy-Protected Databases, Bindschaedler, Vincent, Paul Grubbs, David Cash, Thomas Ristenpart, and Vitaly Shmatikov. 2017

[S09G] Genomic privacy and limits of individual detection in a pool, [supplementary material], Sankararaman, Sriram, Guillaume Obozinski, Michael I. Jordan, and Eran Halperin. 2009

[D03R] Revealing information while preserving privacy, Dinur, Irit and Nissim, Kobbi. 2003

[D14A] The algorithmic foundations of differential privacy, Dwork, Cynthia and Roth, Aaron. 2014

[G96S] Software protection and simulation on oblivious RAMs, Goldreich, Oded, and Rafail Ostrovsky. 1996

[N07S] Smooth Sensitivity and Sampling in Private Data Analysis, Nissim, Kobbi, Sofya Raskhodnikova, and Adam Smith. 2007

[L17U] Understanding the sparse vector technique for differential privacy, Lyu, Min and Su, Dong and Li, Ninghui, 2017

Project 1: Traffic Analysis

The objective of this project is to show the vulnerability of HTTPS traffic to traffic analysis attacks. Prior to the attack, in the profiling phase, the attacker connects to a set of URLs and captures the packets exchanged between his machine and the server hosting each URL. The attacker repeats this profiling multiple times, in order to obtain many samples of the HTTPS traffic to the target webpages and to generate a fingerprint for each webpage. In the attack phase, where the attacker observes the victim's connection to an anonymous webpage (through a proxy server, or an anonymity network such as Tor), he tries to match his observation against the fingerprints.


In this project, you receive the processed captured packets of 35 URLs, collected on 8 different days, to build the fingerprints. The packet traces for each URL are saved in a file, with the same, persistent file ID across the 8 measurements. You also receive the captured packets corresponding to the attacker's observation of anonymous webpages. The set of anonymous URLs is the same as the profiled URLs, but their identities are randomly permuted. This is known as the closed-world setting. You receive the observations of two different attackers, for the same set of URLs; the file names across the two observations therefore do NOT correspond to the same webpages. You need to run the traffic analysis attack for both observations, given the provided profiles. The packet trace files contain the time of arrival of each packet, its size in bytes, and its direction (in: towards the client, out: towards the server). You can obtain the profiles and the attacker observation files from here. This is an example of a trace file:

00:00:00.000000 0 in
00:00:00.000056 0 out
00:00:00.000897 201 out
00:00:00.134478 0 in
00:00:00.134523 0 out
00:00:00.135007 201 out
00:00:00.243595 0 in
00:00:00.262097 1448 in
00:00:00.262118 0 out	
...
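
Below is a minimal sketch (in Python) of how one might parse a trace file in the format shown above and turn it into a simple fingerprint vector. The file path and the choice of features (packet counts and byte volumes per direction) are only illustrative, not part of the project specification.

# Minimal sketch: parse one trace file and compute a toy fingerprint.
# Assumptions: trace lines look exactly like the example above (time, size, direction);
# the feature choice is illustrative, not prescribed by the project.
import numpy as np

def parse_trace(path):
    """Return a list of (seconds, size_bytes, direction) tuples."""
    packets = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            t, size, direction = parts
            h, m, s = t.split(":")
            seconds = int(h) * 3600 + int(m) * 60 + float(s)
            packets.append((seconds, int(size), direction))
    return packets

def fingerprint(packets):
    """A toy feature vector: packet counts and byte volumes per direction."""
    sizes_in = [s for _, s, d in packets if d == "in"]
    sizes_out = [s for _, s, d in packets if d == "out"]
    return np.array([len(sizes_in), len(sizes_out),
                     sum(sizes_in), sum(sizes_out)], dtype=float)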


The expected output of the attack is a single file with 35 lines (one per URL in the profiles, where the line number is the URL id) and 2 columns (one per attack observation), separated by a space character. In each column, provide the anonymous id of the URL corresponding to that profile id. Please do NOT add any extra characters or lines, as submissions will be graded automatically. This is an example of a solution, where the first line indicates that the URL with anonymous id 2 in observation 1 and the URL with anonymous id 34 in observation 2 are both associated with the profile with id 1:

2 34
7 1
12 8
9 11
30 2
15 28
...
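
The following sketch illustrates one way to perform the closed-world matching and to write the output in the required format. The dictionaries profile_fps (profile id -> averaged fingerprint over the 8 days) and obs_fps (anonymous id -> fingerprint, for one observation) are assumed names, and the use of an optimal one-to-one assignment over Euclidean distances is just one possible matching strategy.

# A hedged sketch of closed-world matching and of writing the required output file.
# Assumptions: profile_fps maps profile ids 1..35 to fingerprint vectors, and
# obs_fps maps anonymous ids 1..35 to fingerprint vectors for one observation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(profile_fps, obs_fps):
    """Return a dict: profile id -> anonymous id, via minimum-cost assignment."""
    profile_ids = sorted(profile_fps)
    anon_ids = sorted(obs_fps)
    cost = np.array([[np.linalg.norm(profile_fps[p] - obs_fps[a])
                      for a in anon_ids] for p in profile_ids])
    rows, cols = linear_sum_assignment(cost)
    return {profile_ids[r]: anon_ids[c] for r, c in zip(rows, cols)}

def write_solution(match1, match2, path="solution.txt"):
    """Write 35 lines, two space-separated columns; line number = profile id."""
    with open(path, "w") as f:
        for p in range(1, 36):
            f.write(f"{match1[p]} {match2[p]}\n")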


The due date for submission of the solution file and a 2-page summary of your attack algorithm is Monday, Feb 11, 8am.

Project 2: Anonymization

Analyze the recently released guidelines by the Singapore Personal Data Protection Commission (PDPC) on data anonymization techniques. Write a short report (at most 3 pages) on the strengths and weaknesses of the guidelines, and how you suggest improving them.


The due date for submission of the report is Monday, Feb 18, 8am.

Project 3: Generate Synthetic Data with Differential Privacy

One of the challenging problems in privacy-preserving data analysis is generating synthetic data that preserves the utility of the real data, yet does not leak a significant amount of sensitive information about it. One solution is to generate the synthetic data with differential privacy, while making sure that certain statistical properties of the data are preserved. See the following research papers as examples of how to generate differentially private synthetic data: Private Multiplicative Weight Update Algorithm, Differentially Private Bayesian Networks.


In this project, you have access to a dataset with binary attributes. Please download the dataset from here. You are expected to generate synthetic data records that preserve the following utility measures of the original data: summary statistics (the marginal distribution over each attribute), and the clustering of the original dataset. For the latter, if we run a clustering algorithm on the real data and, separately, on the privacy-preserving synthetic data, the centroids of the clusters need to be very close. The real data has 100 clusters. The clustering error of your algorithm can be computed as the average Euclidean distance between best-matched centroids (of the original data and the synthetic data).
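
One possible reading of this error metric is sketched below: cluster both datasets, pair the 100 real centroids with the 100 synthetic centroids by an optimal one-to-one assignment, and average the Euclidean distances over matched pairs. The use of k-means and of an optimal assignment are assumptions for illustration, not a prescribed evaluation procedure.

# Sketch of the clustering error under the assumption that "best match" means
# an optimal one-to-one pairing of the real and synthetic centroids.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_error(real_data, synthetic_data, n_clusters=100, seed=0):
    real_c = KMeans(n_clusters=n_clusters, random_state=seed).fit(real_data).cluster_centers_
    synth_c = KMeans(n_clusters=n_clusters, random_state=seed).fit(synthetic_data).cluster_centers_
    # cost[i, j] = Euclidean distance between real centroid i and synthetic centroid j
    cost = np.linalg.norm(real_c[:, None, :] - synth_c[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()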


Try to generate your dataset with epsilon=10 and epsilon=3 differential privacy. You need to provide a proof that your algorithm is differentially private. Generate datasets of size 10000, and report the error of the summary statistics and the clustering.
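
As a point of reference only, the sketch below is a minimal baseline, not the expected solution: it perturbs the per-attribute marginals with the Laplace mechanism and then samples attributes independently, so it preserves summary statistics but ignores correlations (and hence the clustering structure). The epsilon split across attributes and the assumption that the number of records is public are part of this sketch, not of the project statement; approaches such as the multiplicative-weights or Bayesian-network mechanisms referenced above are better suited to the clustering requirement.

# Minimal baseline sketch (NOT the expected solution): Laplace-noised marginals
# plus independent Bernoulli sampling. Assumes data is an (n, d) binary array
# and that the dataset size n is public.
import numpy as np

def dp_marginal_synthesizer(data, epsilon, n_synthetic=10000, seed=0):
    """Return an (n_synthetic, d) binary array sampled from noisy marginals."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Each record affects each of the d counts by at most 1, so splitting the
    # budget as epsilon/d per count (basic composition) gives epsilon-DP overall.
    eps_per_attr = epsilon / d
    counts = data.sum(axis=0).astype(float)
    noisy_counts = counts + rng.laplace(scale=1.0 / eps_per_attr, size=d)
    # Post-processing (clipping and sampling) does not affect the DP guarantee.
    p = np.clip(noisy_counts / n, 0.0, 1.0)
    return (rng.random((n_synthetic, d)) < p).astype(int)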


More details will be given in the class.


The due date for submission of the report, the generated data, and the code is Saturday, 20 April, 8am.