MemepiC is the in-memory version of epiC, an extensible and scalable system for Big Data processing built on the actor-based concurrent programming model. It provides a low-latency storage service as a distributed key-value store and integrates in-memory data analytics to support online analytics. With an efficient data eviction and fetching mechanism, MemepiC is designed to maintain data sets much larger than the available memory without severe performance degradation.
MemepiC's distinguishing features include its less-syscall design, user-space virtual memory management, and integration of the storage service with online data analytics.
1. Conventional database designs that rely on syscalls for hardware communication or synchronization cannot deliver the performance demanded of in-memory database systems, because syscall overhead is detrimental to overall performance. MemepiC therefore follows the less-syscall design principle and reduces syscall use as far as possible in storage access (via memory-mapped files), network communication (via RDMA), synchronization (via transactional memory or atomic primitives) and fault tolerance (via remote memory logging).
2. MemepiC alleviates the relatively small size of main memory through an efficient user-space virtual memory management (UVMM) mechanism, which evicts data to disk according to a configurable paging strategy when the total data size exceeds the memory size. This adaptable, hybrid storage enables a smooth transition from disk-based to memory-based databases. UVMM takes advantage of a semantics-aware eviction strategy as well as hardware-assisted I/O and CPU utilization, and shows great potential as a more general approach to "Anti-Caching".
3. To meet the requirements of online data analytics, MemepiC integrates analytics functionality so that data can be analyzed where it is stored. Integrating storage and analytics largely eliminates the data movement cost that typically dominates conventional analytics scenarios, in which data must first be fetched from the database layer to the application layer before it can be analyzed. Synchronization between data analytics and the storage service is achieved with atomic primitives and fork-based virtual snapshots.
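The fork-based virtual snapshot mentioned in the last item can be illustrated with a minimal sketch. It is not MemepiC's implementation, which is not described here. It only shows the general idea on a POSIX system: `fork()` gives a child process a copy-on-write view of the parent's memory, so an analytics job sees a consistent snapshot while the storage side keeps accepting writes. The names `snapshot_sum` and `store`, and the use of a pipe to return the result, are assumptions made for illustration.

```python
import os
import struct


def snapshot_sum(store):
    """Fork a child that sums the values of `store` as they were at fork time.

    Returns a function that blocks until the child's result is available.
    (Hypothetical helper; assumes a POSIX system where os.fork exists.)
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy-on-write snapshot of the parent's memory,
        # so later writes in the parent are invisible here.
        os.close(read_fd)
        total = sum(store.values())
        os.write(write_fd, struct.pack("q", total))
        os._exit(0)  # exit immediately, skipping the parent's cleanup handlers

    # Parent: keep only the read end and hand back a way to collect the result.
    os.close(write_fd)

    def wait_result():
        data = os.read(read_fd, 8)
        os.close(read_fd)
        os.waitpid(pid, 0)  # reap the child
        return struct.unpack("q", data)[0]

    return wait_result


# The "storage service": a plain in-memory key-value map.
store = {"a": 1, "b": 2, "c": 3}

# Start the analytics job on a snapshot, then continue serving writes.
wait_result = snapshot_sum(store)
store["a"] = 100
store["d"] = 50

snapshot_total = wait_result()     # computed over the state at fork time
live_total = sum(store.values())   # computed over the current state
```

In this sketch the analytics result reflects only the three original entries, while the parent's map already contains the later writes. A real system would also have to bound the memory overhead of copy-on-write under heavy update load, which this example does not address.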