Many enterprises produce heterogeneous data sets. Structured data (e.g., customer records) may be stored in a relational database, while unstructured data (e.g., text, images, and videos) may be stored in file systems. Analysts typically use different software, libraries, or tools to analyze each type of data set (e.g., SQL for relational data and various specialized packages for unstructured data). The problem with this per-data-set approach is that it is hard to derive insights that span all of the data sets. The state-of-the-art approach for analyzing heterogeneous data sets is MapReduce, but it requires the analyst to rewrite most existing analytical software against the MapReduce interface, which is often hard to achieve in practice.
E3 is a programming framework that simplifies scalable heterogeneous data processing on large clusters. The key feature of E3 is that, instead of forcing the analyst to write the entire analytical program against a single interface (as MapReduce does), E3 lets the analyst reuse existing data analytical programs to process each type of data set and coordinate those programs to produce the final result.
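The reuse-and-coordinate idea can be sketched as follows. This is a minimal, hypothetical illustration (not E3's actual API): two "existing" per-format analysis routines, one using SQL on structured data and one scanning unstructured text, with a small coordinator that merges their outputs into a single result. All function names and the sample data are illustrative assumptions.

```python
# Hypothetical sketch of coordinating existing per-format analyses
# instead of rewriting them under one interface. Not E3's API.
import sqlite3

def analyze_structured(db_path):
    """Existing relational analysis: count customers per region via SQL."""
    with sqlite3.connect(db_path) as con:
        con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, region TEXT)")
        con.executemany("INSERT INTO customers VALUES (?, ?)",
                        [("alice", "east"), ("bob", "west"), ("carol", "east")])
        rows = con.execute(
            "SELECT region, COUNT(*) FROM customers GROUP BY region").fetchall()
    return dict(rows)

def analyze_unstructured(texts):
    """Existing text analysis: count documents mentioning each region."""
    counts = {}
    for t in texts:
        for region in ("east", "west"):
            if region in t:
                counts[region] = counts.get(region, 0) + 1
    return counts

def coordinate():
    """Merge the insights from both analyses into one result per region."""
    structured = analyze_structured(":memory:")
    unstructured = analyze_unstructured(["east office report", "west memo"])
    return {r: (structured.get(r, 0), unstructured.get(r, 0))
            for r in set(structured) | set(unstructured)}

print(coordinate())  # {'east': (2, 1), 'west': (1, 1)} (key order may vary)
```

The point of the sketch is that neither analysis routine had to be rewritten; only the thin coordinator is new.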
E3 achieves this goal through an Actor-like concurrent programming model. An actor is an independent execution unit that processes a fixed number of input messages and produces output messages for other actors to process further. With E3, analysts simply run data analytical programs inside a set of actors and coordinate them for parallel execution via message passing. We currently have a working E3 core runtime system and MapReduce extensions for running MapReduce programs inside actors.
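The actor model described above can be illustrated with a small sketch using Python threads and queues. This is an assumption-laden toy, not E3's runtime: each `Actor` wraps an analytical routine, consumes a fixed number of input messages from its inbox, and passes its outputs downstream by message passing.

```python
# Minimal sketch of the Actor-like model, using threads and queues.
# Illustrative only; the Actor class and pipeline below are not E3's API.
import threading
import queue

class Actor(threading.Thread):
    def __init__(self, n_inputs, inbox, outbox, process):
        super().__init__()
        self.n_inputs = n_inputs   # fixed number of messages to consume
        self.inbox = inbox         # this actor's incoming message queue
        self.outbox = outbox       # downstream actor's queue, or None
        self.process = process     # the wrapped analytical program
        self.results = []          # final outputs if there is no downstream actor

    def run(self):
        for _ in range(self.n_inputs):
            msg = self.inbox.get()
            out = self.process(msg)
            if self.outbox is not None:
                self.outbox.put(out)    # message passing to the next actor
            else:
                self.results.append(out)

# Wire two actors into a pipeline: tokenize text, then count the tokens.
q1, q2 = queue.Queue(), queue.Queue()
tokenizer = Actor(2, q1, q2, lambda text: text.split())
counter = Actor(2, q2, None, lambda tokens: len(tokens))
tokenizer.start(); counter.start()

q1.put("heterogeneous data sets")
q1.put("actors pass messages")
tokenizer.join(); counter.join()
print(counter.results)  # [3, 3]
```

Because each actor only sees messages, the routine it wraps can be any existing program, which is what allows heterogeneous analyses to run side by side and be coordinated through the queues.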