
SYNCHRONY ITERATOR PROJECT SOURCE CODES
Version 11.1

Order collection & Synchrony iterator, cloc: 1746 lines
GMQL emulation, cloc: 1392 lines
Total size, cloc: 3138 lines
 
Wong Limsoon
30 May 2023




**** IMPORTANT NOTE ****


Depending on which version of Scala you have, some parallel operations
may hang Scala, especially the REPL.  This is apparently due to Scala's
default lazy initialization of static functions and objects.  This can
usually be solved using this Scala REPL option:

scala -Yrepl-class-based

Consult https://github.com/scala/scala-parallel-collections/issues/34
for a discussion.




**** EXPLANATIONS OF THE FILES ****


These files,
  dbresource-v11.1.scala,  dbkey-v11.1.scala, 
  dbocoll-v11.1.scala,     dbsynchronizable-v11.1.scala,
  dbsynchrony-v11.1.scala, dbsql-v11.1.scala,
  dbfile-v11.1.scala,      dbtsv-v11.1.scala, 
  dbaggr-v11.1.scala,      dbpredicates-v11.1.scala, 
form the infrastructure of ordered collection and Synchrony iterators.


dbresource-v11.1.scala (cloc: 30 lines)

	A basic structure for resource management, such as garbage 
	collection of intermediate data files generated by users.


dbkey-v11.1.scala (cloc: 13 lines)

	A simple structure for declaring ordering keys which are 
	essential for ordered collections.


dbocoll-v11.1.scala  (cloc: 164 lines)

	A structure for ordered collections. An ordered collection is
	any collection with an ordering key.
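
	As a rough illustration of the idea (not the project's actual API),
	an ordered collection can be pictured as a collection paired with
	the key it is sorted on. The names OCollSketch, Key and OColl below
	are hypothetical and exist only for this sketch.

```scala
// Hypothetical sketch; these names are NOT taken from dbocoll or dbkey.
object OCollSketch {
  // An ordering key: how to extract the key from an item,
  // and how to compare two keys.
  final case class Key[A, K](get: A => K, ord: Ordering[K])

  // An ordered collection pairs the items with their ordering key.
  final case class OColl[A, K](items: Vector[A], key: Key[A, K]) {
    // True if consecutive items are in non-decreasing key order.
    def isSorted: Boolean =
      items.zip(items.drop(1)).forall { case (x, y) =>
        key.ord.lteq(key.get(x), key.get(y))
      }

    // Return a copy whose items are sorted on the ordering key.
    def sorted: OColl[A, K] = copy(items = items.sortBy(key.get)(key.ord))
  }
}
```

	Keeping the key with the collection lets later operations, such as
	a merge-style join, check or restore the ordering they rely on.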


dbsynchronizable-v11.1.scala (cloc: 157 lines)

	A structure for a buffered closeable iterator that is more
	suited for iterating on large files.
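
	A minimal sketch of what a buffered closeable iterator might look
	like, assuming a one-item lookahead and a caller-supplied close
	action. ClosingIterator, peek and onClose are hypothetical names,
	not the ones used in dbsynchronizable.

```scala
// Hypothetical sketch; not the project's implementation.
object CloseableSketch {
  final class ClosingIterator[A](underlying: Iterator[A], onClose: () => Unit)
      extends Iterator[A] with AutoCloseable {

    private var closed = false
    private val buf = underlying.buffered   // one-item lookahead buffer

    // Look at the next item without consuming it.
    def peek: Option[A] =
      if (!closed && buf.hasNext) Some(buf.head) else None

    // Releases the underlying resource as soon as the data runs out.
    def hasNext: Boolean = {
      val more = !closed && buf.hasNext
      if (!more) close()
      more
    }

    def next(): A =
      if (hasNext) buf.next()
      else throw new NoSuchElementException("iterator is exhausted or closed")

    // Idempotent: the close action runs at most once.
    def close(): Unit = if (!closed) { closed = true; onClose() }
  }
}
```

	For a large file, onClose would typically close the file handle,
	so that a scan abandoned halfway can still release it explicitly.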


dbsynchrony-v11.1.scala (cloc: 136 lines)

	Implementation of Synchrony iterator. A Synchrony iterator enables
	a collection to be iterated in a manner that is synchronized to
	the iteration on another collection. Efficient equijoins and
	non-equijoins can be easily implemented using Synchrony iterators.
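
	The flavour of synchronized iteration can be sketched with a plain
	merge-style equijoin over two key-sorted collections. This is only
	an illustration under that assumption; the actual Synchrony iterator
	is more general, and syncJoin, kx and ky are hypothetical names.

```scala
// Hypothetical sketch of synchronized iteration; not the project's API.
object SyncSketch {
  // For each x in xs, collect the items of ys that have the same key.
  // Both xs and ys must already be sorted on the key; ys is scanned once.
  def syncJoin[A, B, K](xs: Seq[A], ys: Seq[B])(kx: A => K, ky: B => K)
                       (implicit ord: Ordering[K]): List[(A, List[B])] = {
    val it = ys.iterator.buffered
    var cachedKey: Option[K] = None   // key of the most recent match set
    var cached: List[B] = Nil         // the most recent match set

    xs.toList.map { x =>
      val k = kx(x)
      if (!cachedKey.exists(ord.equiv(_, k))) {
        // Skip items of ys that lie strictly before the key of x.
        while (it.hasNext && ord.lt(ky(it.head), k)) it.next()
        // Collect items of ys whose key equals the key of x.
        val matches = List.newBuilder[B]
        while (it.hasNext && ord.equiv(ky(it.head), k)) matches += it.next()
        cached = matches.result()
        cachedKey = Some(k)
      }
      // Consecutive xs with the same key reuse the cached matches.
      (x, cached)
    }
  }
}
```

	The cursor on ys only ever moves forward, so the join costs one pass
	over each input instead of a nested loop.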

 
dbsql-v11.1.scala (cloc: 160 lines)

	This file provides a database-like join operator to be used with
	comprehension syntax. The operator is based on Synchrony iterators
	and is efficient on many non-equijoins as well as equijoins.
	This is realized via a nice syntax for using Synchrony iterators.
	Basically, it lets you write code like this:

               for (
                 (a, bs, cs, ds, es) <- A join B on CondAB
                                          join C using condAC
                                          join D on CondAD where condAD'
                                          join E on condAE where condAE'
                   ... blah blah ...
               ) yield ... more blah ...



dbfile-v11.1.scala  (cloc: 301 lines)

	A framework for turning text files into ordered collections. 
	Users can provide some parser/unparser for their text files.
	Thereafter, a text file can be used like a collection, with 
	its items being moved into memory or written to disk transparently
	as needed. All operations available on ordered collections are
	available, including Synchrony iterator.


dbtsv-v11.1.scala (cloc: 315 lines)

	"Database connectivity" to TSV and CSV files (i.e. tab- and
	comma-delimited files). Provides a simple framework for encoding/decoding
	user-defined data types into TSV & CSV files. Also provides
	Remy-style record types with named-field access.
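
	To give a feel for what encoding/decoding a user-defined type to
	TSV involves, here is a small standalone sketch. The Peak type and
	the encode/decode functions are hypothetical and do not reflect
	the actual interface of dbtsv.

```scala
import scala.util.Try

// Hypothetical sketch; not the dbtsv API.
object TsvSketch {
  // An example user-defined data type.
  final case class Peak(chrom: String, start: Int, end: Int)

  // Encode one item as a tab-delimited line.
  def encode(p: Peak): String =
    List(p.chrom, p.start.toString, p.end.toString).mkString("\t")

  // Decode one line; None if the field count or the number format is wrong.
  def decode(line: String): Option[Peak] =
    line.split("\t", -1) match {
      case Array(c, s, e) =>
        for {
          st <- Try(s.toInt).toOption
          en <- Try(e.toInt).toOption
        } yield Peak(c, st, en)
      case _ => None
    }
}
```

	Returning an Option from the decoder keeps malformed lines from
	aborting a scan of a large file; the caller decides how to treat them.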



dbaggr-v11.1.scala (cloc: 139 lines)

	Commonly-used aggregate functions.


dbpredicates-v11.1.scala (cloc: 331 lines)

	Commonly-used predicates.



The next three files (genomelocus-v11.1.scala, simplebed-v11.1.scala,
bedfileops-v11.1.scala) provide the parser/unparser for BED-formatted
files (which are widely used in bioinformatics), as well as powerful
operations for manipulating BED files, similar to those provided by 
the popular bedtools package and the GMQL system.


genomelocus-v11.1.scala  (cloc: 143 lines)

	This file defines the GenomeLocus.Locus data type that represents
	loci information on a genome. 
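
	A locus type usually carries a chromosome and an interval, plus
	predicates such as overlap and an ordering for sorted scans. The
	sketch below assumes half-open [start, end) coordinates as in the
	BED format; the names and fields are hypothetical and are not those
	of GenomeLocus.Locus.

```scala
// Hypothetical sketch; not the GenomeLocus.Locus definition.
object LocusSketch {
  // Assumes half-open [start, end) coordinates.
  final case class Locus(chrom: String, start: Int, end: Int) {
    // Two loci overlap if they are on the same chromosome
    // and their intervals share at least one position.
    def overlaps(that: Locus): Boolean =
      chrom == that.chrom && start < that.end && that.start < end
  }

  // Order loci by chromosome name, then start, then end.
  implicit val locusOrdering: Ordering[Locus] =
    Ordering.by((l: Locus) => (l.chrom, l.start, l.end))
}
```

	An ordering of this kind is what allows BED entries to be treated
	as an ordered collection and joined by a single forward scan.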


simplebed-v11.1.scala  (cloc: 228 lines)

	This file provides parser/unparser for BED files, as well as a
	BED data type for representing BED file entries. Unlike plain
	BED files where the data fields are accessed by positions, 
	named fields are supported in this implementation.

bedfileops-v11.1.scala  (cloc: 480 lines)

	This file provides high-level methods for manipulating BED files.
	The methods are implemented using Synchrony iterators. These
	methods can easily express the operations of the popular
	bedtools package and also those of GMQL.  Usage examples are
	provided at the end of the file.



The next two files (sample-v11.1.scala, samplefileops-v11.1.scala) provide
the parser/unparser for Sample files and methods for emulating GMQL-like
queries on Sample files. A Sample file is basically a collection of BED
files along with metadata of each of these BED files.


sample-v11.1.scala  (cloc: 253 lines)
	
	This file provides the parser/unparser for Sample files, as well as
	the Sample data type that represents entries of a Sample file. It
	supports the Sample file format of GMQL, as well as a simpler
	Sample file format designed by Limsoon.

samplefileops-v11.1.scala  (cloc: 288 lines)

	This file implements GMQL-like methods for manipulating Sample files
	and their component BED files. Sequential and sample-parallel query
	modes are both supported. Usage examples are provided at the end
	of the file.

 

The last set of files contains example GMQL queries on sample files. The queries
are in the file lib/mm-test.scala. The data are in the test folder.





**** EXTRA, SUPPORT FOR PROTEOMICS, 30 May 2023 ****

Some files supporting proteomics protein calling and missing-protein
inference have been added.


dbpeptides-v11.1.scala  (cloc: 242 lines)

	This file implements the main input file type to ProInfer. 
	It captures PSMs (peptide-spectrum matches) from proteomics
	mass-spec runs.  It builds on dbtsv-v11.1.scala.


dbfasta-v11.1.scala  (cloc: 175 lines)

	This file implements the FASTA file format that ProInfer uses
	for its reference proteins and decoys. It builds on dbtsv-v11.1.scala.


dbcorum-v11.1.scala  (cloc: 138 lines)

	This file implements the CORUM file format that ProInfer uses
	for its reference protein complexes. It builds on dbtsv-v11.1.scala.


proinfer-v11.1.scala  (cloc: 502 lines)

	This file implements ProInfer, a protein caller described in
	[Peng, Wong, & Goh, "ProInfer: An interpretable protein inference 
	tool leveraging on biological networks", PLoS Computational Biology,
	19(3):e1010961, March 2023]. (* Thanks, PENG Hui, for providing
	the original Python code of ProInfer, explaining many details,
	and testing this re-implementation. *)


protrec-v11.1.scala (cloc: 246 lines)

	This file implements ProtRec, a missing protein inference method
	described in [Kong et al., "PROTREC: A probability-based approach
	for recovering missing proteins based on biological networks",
	Journal of Proteomics, 250:104392, January 2022].


Wong Limsoon
30 May 2023


