Synchrony iterators

Participants: Stefano Perna, Val Tannen, Kian Lee Tan, Chee Yong Chan, Limsoon Wong

Overview

Modern programming languages provide comprehension syntax for manipulating collection types. Comprehension syntax makes programs more readable, but comprehensions typically correspond to nested loops, so it is difficult to use them to express efficient algorithms. This has motivated developments that introduce alternative binding semantics for comprehension syntax, so that some comprehensions are not compiled into nested loops. Nonetheless, it has not been shown that efficient algorithms, such as that for equijoin, cannot be implemented without such refinements to comprehension syntax. That is, a gap exists in our understanding of the intensional expressive power of comprehension syntax.
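To make the nested-loop point concrete, here is a minimal Python sketch of an equijoin written as a comprehension. The function name join_with_comprehension and the (key, payload) tuple layout are assumptions made for illustration; they are not taken from the Synchrony software. The comparison counter is there only to expose the quadratic cost.

```python
def join_with_comprehension(xs, ys):
    """Equijoin on the first field, written as a comprehension.

    Returns the joined pairs and the number of key comparisons performed.
    """
    comparisons = 0

    def same_key(x, y):
        nonlocal comparisons
        comparisons += 1
        return x[0] == y[0]

    # The comprehension is evaluated as two nested loops:
    # every x is compared against every y.
    pairs = [(x, y) for x in xs for y in ys if same_key(x, y)]
    return pairs, comparisons


xs = [(k, f"x{k}") for k in range(100)]
ys = [(k, f"y{k}") for k in range(100)]
pairs, comparisons = join_with_comprehension(xs, ys)
print(len(pairs), comparisons)  # 100 10000
```

Only 100 pairs are produced, yet 10,000 comparisons are made. This is the low-selectivity situation in which a nested-loop evaluation is wasteful.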

The objectives of this project are:

Theoretical study to (i) prove that this intensional expressiveness gap is real, and (ii) propose a programming construct or a library function that precisely characterizes this gap. We hypothesize that this gap can be filled exactly by a novel programming construct for iterating on two or more collections in synchrony. Let us call this construct the Synchrony iterator.
To realise Synchrony iterator as a novel data type/class in several popular programming languages and dovetail it with comprehension syntax.
We hypothesize that Synchrony iterator generalizes merge-join in relational database management systems (RDBMS) and that, in contrast to merge-join, it will be applicable even when the join predicate does not involve any equality tests (i.e., non-equijoin). Thus, we propose integrating Synchrony iterator into an RDBMS (viz. PostgreSQL, a popular open-source RDBMS) so that its efficient merge-join can be generalized and used in executing more queries.
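For reference, the merge-join that the last objective aims to generalize can be sketched as follows. This is a textbook-style sketch in Python, assuming both inputs are lists already sorted by join key; it is neither PostgreSQL's implementation nor the Synchrony iterator itself, and the name merge_join and its key parameter are illustrative assumptions.

```python
def merge_join(xs, ys, key=lambda r: r[0]):
    """Equijoin of two lists that are already sorted by key.

    Both inputs are traversed in step, so the cost is roughly
    len(xs) + len(ys) plus the size of the output.
    """
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        kx, ky = key(xs[i]), key(ys[j])
        if kx < ky:
            i += 1          # xs[i] has no partner; advance the left side
        elif kx > ky:
            j += 1          # ys[j] has no partner; advance the right side
        else:
            # Locate the block of ys sharing this key ...
            j_end = j
            while j_end < len(ys) and key(ys[j_end]) == kx:
                j_end += 1
            # ... and pair it with every x carrying the same key.
            while i < len(xs) and key(xs[i]) == kx:
                out.extend((xs[i], y) for y in ys[j:j_end])
                i += 1
            j = j_end
    return out


xs = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
ys = [(2, "p"), (2, "q"), (3, "r"), (4, "s")]
print(len(merge_join(xs, ys)))  # 5
```

Note that the three-way branch depends on comparing keys for equality and order. That dependence is exactly what prevents this form of merge-join from handling non-equijoin predicates.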

Manuscripts

Stefano Perna, Val Tannen, Limsoon Wong. Iterating on multiple collections in synchrony. Journal of Functional Programming, 32:e9, July 2022. PDF
Highlight: This paper provides an introduction to Synchrony fold and Synchrony iterator, and information on how they can be implemented. Abstract: Modern programming languages typically provide some form of comprehension syntax which renders programs manipulating collection types more readable and understandable. However, comprehension syntax corresponds to nested loops in general. There is no simple way of using it to express efficient general synchronized iterations on multiple ordered collections, such as linear-time algorithms for low-selectivity database joins. Synchrony fold is proposed here as a novel characterization of synchronized iteration. Central to this characterization is a monotonic isBefore predicate for relating the orderings on the two collections being iterated on, and an antimonotonic canSee predicate for identifying matching pairs in the two collections to synchronize and act on. A restriction is then placed on Synchrony fold, cutting its extensional expressive power to match that of comprehension syntax, giving us Synchrony generator. Synchrony generator retains sufficient intensional expressive power for expressing efficient synchronized iteration on ordered collections. In particular, it is proved to be a natural generalization of the database merge join algorithm, extending the latter to more general database joins. Finally, Synchrony iterator is derived from Synchrony generator as a novel form of iterator. While Synchrony iterator has the same extensional and intensional expressive power as Synchrony generator, the former is better dovetailed with comprehension syntax. Thereby, algorithms requiring synchronized iterations on multiple ordered collections, including those for efficient general database joins, become expressible naturally in comprehension syntax.
Limsoon Wong. An intensional expressiveness gap of comprehension syntax. The Provenance of Elegance in Computation - Essays dedicated to Val Tannen, OASIcs, vol. 119, article no. 11: pp. 11.1-11.14, May 2024. PDF
Highlight: This paper proves that efficient join algorithms are inexpressible using comprehension syntax in a first-order setting. Abstract: Comprehension syntax is widely adopted in modern programming languages as a means for manipulating collection types. This paper proves that all subquadratic algorithms which are expressible in comprehension syntax, do not compute low-selectivity joins. As database systems support these joins efficiently, this confirms an intensional expressiveness gap between comprehension syntax and relational database systems. The proof of this intensional expressiveness gap relies on a "limited-mixing" lemma which states that subquadratic algorithms expressible using comprehension syntax have limited ability for mixing atomic objects in their inputs.
Limsoon Wong. Addressing an intensional expressiveness gap of comprehension syntax. Manuscript, 2021. v5-wls-natural2021.pdf
Highlight: This draft paper is a detailed study on the intensional expressive power of Synchrony iterator. It shows that Synchrony iterator precisely fills the intensional expressiveness gap between the algorithms expressible by comprehension syntax and typical relational database systems. Abstract: Comprehension syntax is widely adopted in modern programming languages as a means for manipulating collection types. This paper articulates and investigates an apparent gap in the intensional expressive power between comprehension syntax and relational database systems: (i) All subquadratic algorithms which are expressible in comprehension syntax, even after allowing some functions commonly available in the collection-type libraries of modern programming languages, do not compute low-selectivity joins. As database systems support these joins efficiently, this confirms the intensional expressiveness gap. (ii) A “Synchrony iterator” construct for synchronized iteration on multiple collections is introduced. This enables more algorithms, but not functions, to become definable using comprehension syntax. In particular, efficient algorithms for low-selectivity joins become expressible. So, the ability to iterate on multiple collections in synchrony constitutes an exact characterization of this intensional expressiveness gap. (iii) The proof of this intensional expressiveness gap relies on a “limited mixing” lemma which states that subquadratic algorithms expressible using comprehension syntax have limited ability for mixing atomic objects in their inputs. This limited-mixing lemma is non-query specific and is applicable even when ordered data types are present. It thus considerably enriches the available theoretical tools for studying intensional expressive power, as these tools are often query specific and are inapplicable in the presence of ordered data types. It is also a useful intensional counterpart to Gaifman’s locality property. Gaifman’s locality property is very useful for analyzing extensional expressiveness of first-order query languages on unordered data types, but is not useful on ordered data types. (iv) Incidentally, efficient interval joins with overlap predicates are obtained as a free byproduct of Synchrony iterator. Joins of this kind are often needed for practical applications such as temporal data and genomic data processing, but are not supported well in typical relational database systems.
Stefano Perna, Pietro Pinoli, Val Tannen, Stefano Ceri, Limsoon Wong. Synchronized iteration for genomic data processing. Manuscript, 2021. synchrony-gmql-v12.pdf
Highlight: This draft paper puts Synchrony iterator to a practical test. The powerful GenoMetric Query Language (GMQL) is emulated using Synchrony iterator in Scala/Python. The resulting equivalents of GMQL queries are very efficient, generally better than a local installation of GMQL by large margins. Abstract: Processing of large data files is unavoidable in genomic pipelines. Many tools that do this are either stand-alone languages or command-line tools. There is an impedance mismatch when these tools are used with a host programming language to support more complex analysis. A novel concept, Synchrony iterator, is introduced. It allows efficient algorithms underlying such tools to be easily expressed. As a demonstration, the powerful GenoMetric Query Language (GMQL) is emulated using Synchrony iterators in Scala/Python, and the resulting equivalents of these queries are very efficient. Notably, a user can freely intermix GMQL-like queries with other features of Scala/Python, thereby overcoming the impedance mismatch problem.
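The isBefore and canSee predicates described in the first abstract above can be illustrated with a simplified Python sketch of synchronized iteration. This is not the definition of Synchrony fold, Synchrony generator, or Synchrony iterator given in the papers, and it is not the API of the Synchrony software. The function name sync_join, the argument order (y, x) of the predicates, and the band-join example are all assumptions made for illustration. The sketch also assumes, without checking, that both lists are sorted consistently with is_before and that any y which is before x and cannot see x will not see any later x either.

```python
def sync_join(xs, ys, is_before, can_see):
    """For each x in xs, collect the ys that can see it, scanning in synchrony.

    Assumed (not checked): xs and ys are sorted consistently with is_before,
    and a y that is before x and cannot see x cannot see any later x.
    """
    result = []
    start = 0  # ys[:start] can no longer match the current or any later x
    for x in xs:
        # Permanently discard ys that lie before x and cannot see it.
        while (start < len(ys)
               and is_before(ys[start], x)
               and not can_see(ys[start], x)):
            start += 1
        # Scan forward until reaching a y that is neither before x nor
        # able to see x; later ys are assumed to be out of reach as well.
        matches = []
        j = start
        while j < len(ys) and (is_before(ys[j], x) or can_see(ys[j], x)):
            if can_see(ys[j], x):
                matches.append(ys[j])
            j += 1
        result.append((x, matches))
    return result


# Hypothetical non-equijoin: pair numbers that differ by at most 1.
xs = [1, 5, 9]
ys = [0, 2, 4, 6, 10]
joined = sync_join(xs, ys,
                   is_before=lambda y, x: y < x,
                   can_see=lambda y, x: abs(x - y) <= 1)
print(joined)  # [(1, [0, 2]), (5, [4, 6]), (9, [10])]
```

In this sketch each y is discarded at most once and is rescanned only while it remains within reach of the current x, so the work stays close to linear when each x matches few ys. The join predicate here involves no equality test at all.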

Selected Presentations

Limsoon Wong. Some thoughts on designing a genomic query language. Invited talk at GeCo Workshop on Challenges in Data-Driven Genomic Computing, Villa del Grumello, Como, Italy, 6-8 March 2019. PPT
Limsoon Wong. Iterating on multiple collections in synchrony. JFP talk at 27th ACM SIGPLAN International Conference on Functional Programming (ICFP2022), Ljubljana, Slovenia, 11-16 September 2022. PPT
Limsoon Wong. From comprehension syntax to efficient non-equijoins: A journey with Val Tannen. Invited talk at ValFest @ UPenn, University of Pennsylvania, Philadelphia, USA, 24-25 May 2024. PPT

Software

Synchrony DBModel, version 10.1, for Scala 3: dbmodel-v10.1.scala, version 9.1, for Scala 2: dbmodel-v9.1.scala, deprecated version: dbmodel-v8.scala

A basic Synchrony iterator implementation, mainly for illustrating the central idea of synchronized iteration. This module demonstrates how it enhances Scala's comprehension syntax to support linear-time low-selectivity database non-equijoins. Note that the usual techniques for implementing equijoins via grouping (à la Wadler and Peyton Jones) and indexed tables (à la Gibbons) don't work here, because these are non-equijoins. It is best to read this code alongside the introduction paper, wls-sychrony2020-v12.pdf, to appreciate how this is achieved.
Synchrony DBFile, version 10.1, for Scala 3: dbfile-v10.1.scala, version 9.1, for Scala 2: dbfile-v9.1.scala, deprecated version: dbfile-v8.scala
A simple ordered file is derived as a subclass of the ordered collection in Synchrony DBModel. I.e., all the fancy stuff (like efficient non-equijoin) can be done on files too.
Synchrony GMQL, version 11.1, for Scala 3: synchrony-11.1.zip, deprecated versions: synchrony-6.zip, version 10.1, for Scala 3; synchrony-v5.3.zip, for Scala 2; synchrony_v3_0_4.zip, for Scala 2
Synchrony GMQL is an implementation of Synchrony iterators and an emulation of GMQL, in Scala. It is described briefly in wls-synchrony2020-v12.pdf and in more detail in synchrony-gmql-v12.pdf. Versions 10.1 and 5.3 also include a Synchrony-based implementation of Peng Hui's ProInfer, a powerful package for protein calling on modern proteomic mass-spectrometry platforms (e.g., diaPASEF). Version 11.1 further includes Weijia's ProtRec, a powerful package for missing protein inference.
Synchrony GMQL in Python 3, Version 4: python-synchrony-v4.zip and a test-script: test-queries.txt
An implementation of Synchrony iterators and emulation of GMQL, in Python 3. This implementation is provided just for fun; I learned Python by doing this implementation. Will replace it with a more serious version when I have time.

Acknowledgements

This work was supported in part by National Research Foundation, Singapore, under its Synthetic Biology Research and Development Programme (Award No: SBP-P3); and in part by Ministry of Education, Singapore, Academic Research Fund Tier-1 (Award No: MOE T1 251RES1206 and MOE T1 251RES1725) and Academic Research Fund Tier-2 (Award No: MOE-T2EP20221-0014).

Last updated: 7 June 2024, Limsoon Wong