Comet

Comet: Batched Stream Processing for Data Intensive Distributed Computing

Publications

Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Wei Lin, Bing Su, Lidong Zhou. Comet: Batched Stream Processing for Data Intensive Distributed Computing. First ACM Symposium on Cloud Computing (SoCC) 2010.

Abstract

Batched stream processing is a new distributed data processing paradigm that models recurring batch computations on incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale production data-processing cluster; it allows a set of effective query optimizations that are not possible in a traditional batch processing model. We have developed a query processing system called Comet that embraces batched stream processing and integrates with DryadLINQ. We used two complementary methods to evaluate the effectiveness of optimizations that Comet enables. First, a prototype system deployed on a 40-node cluster shows an I/O reduction of over 40% using our benchmark. Second, when applied to a real production trace covering over 19 million machine-hours, our simulator shows an estimated I/O saving of over 50%.

Highlights

The Comet system has first revealed the significant redundant I/O and computation in modern big data systems, and revolutionized the view of batch processing and stream processing (Spark Stream has similar design principle).

In the following, we present more details on the "impact factors" of this project (see definition of "impact factors").

Citations

According to GoogleScholar, the publication has received over 150 citations since 2010.
The paper is featured among top twenty papers selected from over 5,000 papers published by MSR Asia, under “Microsoft Research Asia - 20 Years of Research”.

Relevance to Industry and Open-Source Community

This system has inspired other open-source systems and industry systems.

[WWW] Krishnan, Dhanya R., Do Le Quoc, Pramod Bhatotia, Christof Fetzer, and Rodrigo Rodrigues.
IncApprox: A Data Analytics System for Incremental Approximate Computing, WWW 2016.

[CSE] Kalavri, Vasiliki, Hui Shang, and Vladimir Vlassov.
m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data, CSE 2013.

System Repeatability and Academic Impacts

The system is used in the evaluation of the following papers:

[ICDE] Zhang, Ce, Cheng Xu, Jianliang Xu, Yuzhe Tang, and Byron Choi.
GEM 2-Tree: A Gas-Efficient Structure for Authenticated Range Queries in Blockchain., ICDE 2019.

Educational Adoptions

[Book] Ian Gorton, Deborah K. Gracio, Cambridge University Press.
Data-Intensive Computing: Architectures, Algorithms, and Applications, 2013.

[Book] Albert Y. Zomaya, Sherif Sakr, Springer.
Handbook of Big Data Technologies, 25 Feb 2017.

[Book] Deze Zeng, Lin Gu, Song Guo, Springer.
Cloud Networking for Big Data, 9 Dec 2015.

[Book] Matei Zaharia, Morgan & Claypool.
An Architecture for Fast and General Data Processing on Large Clusters, 1 May 2016.

[Book] Kuan-Ching Li, Hai Jiang, Albert Y. Zomaya, CRC Press.
Big Data Management and Processing, 19 May 2017.

[Book] Sherif Sakr, Mohamed Gaber, CRC Press.
Large Scale and Big Data: Processing and Management, 25 Jun 2014.

[Course] University at Albany, State University of New York.
CSI 445/600. Distributed Systems

[Course] University of Illinois, Urbana-Champaign.
CS 525: Advanced Distributed Systems

Media Coverage

[News] 微软沈向洋：书写未来科研的二十年, 22 Nov 2018.

[News] Celebrating Twenty Years of Writing the Future, 20 Nov 2018.