Comet: Batched Stream Processing for Data Intensive Distributed Computing
Batched stream processing is a new distributed data processing paradigm that models recurring batch computations on
incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale
production data-processing cluster; it allows a set of effective query optimizations that are not possible in a traditional
batch processing model. We have developed a query processing system called Comet that embraces batched stream processing and integrates with
DryadLINQ. We used two complementary methods to evaluate the effectiveness of optimizations that Comet enables.
First, a prototype system deployed on a 40-node cluster shows an I/O reduction of over 40% using our benchmark.
Second, when applied to a real production trace covering over 19 million machine-hours, our simulator shows an estimated I/O saving of over 50%.
The Comet system has first revealed the significant redundant I/O and computation in modern big data systems, and revolutionized the view of batch processing and stream processing (Spark Stream has similar design principle).
In the following, we present more details on the "impact factors" of this project (see definition of "impact factors").
Relevance to Industry and Open-Source Community
This system has inspired other open-source systems and industry systems.
- [WWW] Krishnan, Dhanya R., Do Le Quoc, Pramod Bhatotia, Christof Fetzer, and Rodrigo Rodrigues.
IncApprox: A Data Analytics System for Incremental Approximate Computing, WWW 2016.
- [CSE] Kalavri, Vasiliki, Hui Shang, and Vladimir Vlassov.
m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data, CSE 2013.
System Repeatability and Academic Impacts
The system is used in the evaluation of the following papers:
- [Book] Ian Gorton, Deborah K. Gracio, Cambridge University Press.
Data-Intensive Computing: Architectures, Algorithms, and Applications,
- [Book] Albert Y. Zomaya, Sherif Sakr, Springer.
Handbook of Big Data Technologies,
25 Feb 2017.
- [Book] Deze Zeng, Lin Gu, Song Guo, Springer.
Cloud Networking for Big Data,
9 Dec 2015.
- [Book] Matei Zaharia, Morgan & Claypool.
An Architecture for Fast and General Data Processing on Large Clusters,
1 May 2016.
- [Book] Kuan-Ching Li, Hai Jiang, Albert Y. Zomaya, CRC Press.
Big Data Management and Processing,
19 May 2017.
- [Book] Sherif Sakr, Mohamed Gaber, CRC Press.
Large Scale and Big Data: Processing and Management,
25 Jun 2014.
- [Course] University at Albany, State University of New York.
CSI 445/600. Distributed Systems
- [Course] University of Illinois, Urbana-Champaign.
CS 525: Advanced Distributed Systems
Back to Bingsheng's Influential Works © 2020