Participants: Stefano Perna, Val Tannen, Kian Lee Tan, Chee Yong Chan, Limsoon Wong
Modern programming languages provide comprehension syntax for manipulating collection types. Comprehension syntax makes programs more readable, but comprehensions typically correspond to nested loops. So, it is difficult using it to express efficient algorithms. This has motivated developments that introduced alternative binding semantics for comprehension syntax, so that some comprehensions are not compiled into nested loops. Nonetheless, it has not been shown that efficient algorithms, such as that for equijoin, cannot be implemented without such refinements to comprehension syntax. I.e., a gap exists in our understanding of the intensional expressive power of comprehension syntax.
The objectives of this project are:
Highlight: This draft paper provides an introduction to Synchrony fold and Synchrony iterator, and information on how they can be implemented. Abstract: Modern programming languages typically provide some form of comprehension syntax which renders programs manipulating collection types more readable and understandable. However, comprehension syntax corresponds to nested loops in general. There is no simple way of using it to express efficient general synchronized iterations on multiple ordered collections, such as linear-time algorithms for low-selectivity database joins. Synchrony fold is proposed here as a novel characterization of synchronized iteration. Central to this characterization is a monotonic isBefore predicate for relating the orderings on the two collections being iterated on, and an antimonotonic canSee predicate for identifying matching pairs in the two collections to synchronize and act on. A restriction is then placed on Synchrony fold, cutting its extensional expressive power to match that of comprehension syntax, giving us Synchrony generator. Synchrony generator retains sufficient intensional expressive power for expressing efficient synchronized iteration on ordered collections. In particular, it is proved to be a natural generalization of the database merge join algorithm, extending the latter to more general database joins. Finally, Synchrony iterator is derived from Synchrony generator as a novel form of iterator. While Synchrony iterator has the same extensional and intensional expressive power as Synchrony generator, the former is better dovetailed with comprehension syntax. Thereby, algorithms requiring synchronized iterations on multiple ordered collections, including those for efficient general database joins become expressible naturally in comprehensinon syntax.
Highlight: This draft paper is a detailed study on the intensional expressive power of Synchrony iterator. It shows that Synchrony iterator precisely fills the intensional expressiveness gap between the algorithms expressible by comprehension syntax and typical relational database systems. Abstract: Comprehension syntax is widely adopted in modern programming languages as a means for manipulating collection types. This paper articulates and investigates an apparent gap in the intensional expressive power between comprehension syntax and relational database systems: (i) All subquadratic algorithms which are expressible in comprehension syntax, even after allowing some functions commonly available in the collectiontype libraries of modern programming languages, do not compute low-selectivity joins. As database systems support these joins efficiently, this confirms the intensional expressiveness gap. (ii) A “Synchrony iterator” construct for synchronized iteration on multiple collections is introduced. This enables more algorithms, but not functions, to become definable using comprehension syntax. In particular, efficient algorithms for low-selectivity joins become expressible. So, the ability to iterate on multiple collections in synchrony constitutes an exact characterization of this intensional expressiveness gap. (iii) The proof of this intensional expressiveness gap relies on a “limited mixing” lemma which states that subquadratic algorithms expressible using comprehension syntax have limited ability for mixing atomic objects in their inputs. This limited-mixing lemma is non-query specific and is applicable even when ordered data types are present. It thus considerably enriches the available theoretical tools for studying intensional expressive power, as these tools are often query specific and are inapplicable in the presence of ordered data types. It is also a useful intensional counterpart to Gaifman’s locality property. Gaifman’s locality are very useful for analyzing extensional expressiveness of first-order query languages on unordered data types, but is not useful on ordered data types. (iv) Incidentally, efficient interval joins with overlap predicates are obtained as a free byproduct of Synchrony iterator. This kind of joins are often needed for practical applications such as temporal data and genomic data processing, but are not supported well in typical relational database systems.
Highlight: This draft paper puts Synchrony iterator to a practical test. The powerful GenoMetric Query Language (GMQL) is emulated using Synchrony iterator in Scala/Python. The resulting equivalents of GMQL queries are very efficient, generally better than a local installation of GMQL by large margins. Abstract: Processing of large data files is unavoidable in genomic pipelines. Many tools that do this are either stand-alone languages or command-line tools. There is an impedance mismatch when these tools are used with a host programming language to support more complex analysis. A novel concept, Synchrony iterator, is introduced. It allows efficient algorithms underlying such tools to be easily expressed. As a demonstration, the powerful GenoMetric Query Language (GMQL) is emulated using Synchrony iterators in Scala/Python, and the resulting equivalents of these queries are very efficient. Notably, a user can freely intermix GMQL-like queries with other features of Scala/Python, thereby overcoming the impedance mismatch problem.
A basic Synchrony iterator implementation. Mainly for illustrating the central idea of synchronized iteration. This module demos it for enhancing Scala's comprehension syntax to support linear-time low-selectivity database non-equijoins. It allows a programmer write something like
for ( (a, bs, cs, ds, es) <- A join B on condAB join C using condAC join D on condAD where condAD' join E using condAE where condAE'; ... blah blah ... ) yield ... more blah ...
Here, for each a in A, bs is the result of B.filter(b => condAB(b,a)), cs is the result of C.filter(c => condAC(c,a)), and ds is the result of D.filter(d => condAD(d,a) && condAD'). es is the result of E.filter(d => condAE(d,a) && condAE'). Moreover, the complexity is O(|A| + k|B| + k|C| + k|D| + k|E|) where k is the selectivity of the (non-equi)join predicates condAB, condAC, condAD, and condAE. This complexity is much much better then the naive execution complexity of O(|A| * (|B| + |C| + |D| + |E|)). Note that usual techniques for implementing equijoins via grouping (ala Wadler and Peyton-Jones) and indexed table (ala Gibbons) don't work here, because these are non-equijoins. Best to read these codes along side the introduction paper, wls-sychrony2020-v12.pdf, to appreciate how this is achieved.
A simple ordered file is derived as a subclass of the ordered collection in Synchrony DBModel. I.e., all the fancy stuff (like efficient non-equijoin) can be done on files too.
Synchrony GMQL is an implementation of Synchrony iterators and emulation of GMQL, in Scala. It is described briefly in wls-synchrony2020-v12.pdf and in more detail in synchrony-gmql-v12.pdf. Version 5.1 also includes a Synchrony-based implementation of Peng Hui's ProInfer, a powerful package for protein calling on current modern proteomic mass-spectrometry platforms (e.g., diaPASEF.)
An implementation of Synchrony iterators and emulation of GMQL, in Python 3. This implementation is provided just for fun; I learned Python by doing this implementation. Will replace it with a more serious vesion when I have time.