Below are materials for the 4 sessions led by Wong Limsoon in Jan 2026
Protein function prediction and some lessons for classifier performance evaluation
Generally, if two proteins are quite similar in sequence, they likely share a common ancestor and have inherited their function from that ancestor. Thus, if one knows the function of one of the two proteins, one can infer the function of the other. However, below about 30% sequence identity (the so-called twilight zone), this way of inferring protein function suffers an explosion of false positives. Deep learning methods have been proposed as a solution. Do these approaches work? In this session, we discuss one such method, DeepFam [Seo et al., "DeepFam: Deep learning based alignment-free method for protein family modeling and prediction", Bioinformatics, 34(13):i254-i262, 2018]. Along with this assessment, we also discuss some nuances of classifier performance evaluation that are often overlooked and can result in disappointment when a classifier that was evaluated as high-performing is deployed.
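The inference rule above can be sketched in a few lines of Python. This is a toy sketch with made-up sequences and helper names; it assumes the sequences are already aligned to equal length, whereas real pipelines compute identity from alignments produced by tools such as BLAST.

```python
# Toy sketch of homology-based function transfer (assumes pre-aligned,
# equal-length sequences; real pipelines use alignment tools).

def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions with identical residues."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def transfer_function(query: str, annotated: dict[str, tuple[str, str]],
                      threshold: float = 30.0):
    """Transfer the function of the most similar annotated protein,
    but only when identity clears the 'twilight zone' threshold."""
    best = max(annotated.items(),
               key=lambda kv: percent_identity(query, kv[1][0]))
    _, (seq, function) = best
    pid = percent_identity(query, seq)
    return (function, pid) if pid >= threshold else (None, pid)

db = {"P1": ("MKTAYIAKQR", "kinase"),
      "P2": ("MSTNPKPQRK", "protease")}
print(transfer_function("MKTAYIAKQK", db))  # → ('kinase', 90.0)
```

Below the 30% threshold the function returns None, which is exactly where the false-positive explosion discussed in this session begins.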
Homework, due 10/1/2026. Read
[Yu et al., "Accurate prediction and key protein sequence feature
identification of cyclins", Briefings in Functional Genomics,
22:411-419, 2023].
Focus on how it evaluates the performance of the proposed cyclin
classifier. Comments by ChatGPT in this regard are shown on slide #37.
Submit a report (max 1 page) on your analysis of
ChatGPT's comments and point out any major flaws missed by ChatGPT.
Be prepared to make a 5-minute presentation to the class on 12/1/2026.
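As a warm-up for scrutinizing the evaluation in [Yu et al., 2023], here is one frequently overlooked evaluation nuance in a toy Python sketch (the numbers are invented for illustration):

```python
# Accuracy can look impressive on imbalanced data even for a useless
# classifier: a toy illustration of one classifier-evaluation nuance.

def metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, and recall from a confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return acc, prec, rec

# 1,000 test proteins, only 10 true cyclins; a classifier that predicts
# "not cyclin" for everything still scores 99% accuracy.
print(metrics(tp=0, fp=0, fn=10, tn=990))  # → (0.99, 0.0, 0.0)
```

When reading the paper, check whether the reported metrics and the test-set class proportions match the conditions the classifier would face when deployed.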
Gene expression analysis and some lessons for statistical hypothesis testing
Gene expression profiling data is a powerful source of information for understanding biological systems. One of its compelling applications is the identification of differentially expressed genes (DEGs) as biomarkers for disease diagnosis, prognosis, and treatment response. However, DEG selection has been plagued by replicability issues. Many pathway-based methods have been proposed to address this problem. In this session, we discuss the popular overlap-enrichment approach, exemplified by Onto-Express [Draghici et al., "Global functional profiling of gene expression", Genomics, 81(2):98-104, 2003]. In the process, I also expose students to the theory-practice gap that exists when theoretical statistics is applied to real-world data.
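The overlap-enrichment idea can be sketched as a hypergeometric tail test. This is a minimal sketch of one common formulation, not necessarily the exact model used by Onto-Express, which offers several statistical models:

```python
from math import comb

def enrichment_pvalue(N: int, K: int, n: int, k: int) -> float:
    """Right-tail hypergeometric test: probability of seeing at least
    k pathway genes among n DEGs drawn from N genes, of which K belong
    to the pathway."""
    denom = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / denom

# Example (invented numbers): 10,000 genes total, a pathway of 50 genes,
# 200 DEGs, 8 of which fall in the pathway (expected overlap is only 1).
p = enrichment_pvalue(10_000, 50, 200, 8)
print(f"p = {p:.2e}")
```

The theory-practice gap discussed in this session arises because such p-values assume, among other things, that the DEG list is an unbiased sample and that genes are annotated independently, which real data rarely satisfies.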
Students are also given the opportunity to present their reports on their homework from the previous week, viz. [Yu et al., 2023]. Thereby, I hope to help students deepen their understanding of some additional nuances of classifier performance evaluation.
Homework, due 17/1/2026. Read
[Srihari et al., "Inferring synthetic lethal interactions from
mutual exclusivity of genetic events in cancer",
Biology Direct, 10:57, 2015].
Focus on how it tests for synthetic-lethal gene pairs.
Comments by ChatGPT in this regard are attached in slide #51.
Submit a report (max 1 page) on your analysis of
ChatGPT's comments and point out any major flaws missed by ChatGPT.
Be prepared to make a 5-minute presentation to the class on 19/1/2026.
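To fix ideas before reading the paper, a test for mutual exclusivity of mutations can be sketched as a left-tail hypergeometric test (a minimal sketch of one common formulation; the paper's actual test may differ, which is part of what the homework asks you to examine):

```python
from math import comb

def mutual_exclusivity_pvalue(n: int, a: int, b: int, both: int) -> float:
    """Left-tail hypergeometric test: given gene A mutated in a samples
    and gene B mutated in b samples out of n tumours, the probability of
    observing <= both co-mutated samples under independence. A small
    p-value suggests the mutations are mutually exclusive."""
    denom = comb(n, a)
    return sum(comb(b, i) * comb(n - b, a - i)
               for i in range(0, both + 1)) / denom

# Example (invented numbers): 100 tumours; A mutated in 40, B in 30,
# but only 2 tumours carry both (independence would predict about 12).
p = mutual_exclusivity_pvalue(100, 40, 30, 2)
print(f"p = {p:.2e}")
```

When analysing the paper, consider what this kind of test assumes about sample homogeneity and mutation rates across tumours.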
The data science of PCA: Myths, misuses, and missed signals
Many students in computer science, engineering, and data science are familiar with principal component analysis (PCA) as a tool for dimension reduction or data visualization. But what if that is only scratching the surface? PCA’s true power lies in its ability to untangle complex variations in data and reveal meaningful patterns that are often overlooked. For example, [Goh & Wong, "Protein complex-based analysis is resistant to the obfuscating consequences of batch effects---a case study in clinical proteomics", BMC Genomics, 18(S2):142, 2017] show that even when biological signals and batch effects are confounded in clinical proteomics, PCA can deconvolute these intertwined effects---especially when the data is projected into the space of protein-complex abundances. In a very different setting, [Giuliani et al., "On the constructive role of noise in spatial systems", Physics Letters A, 247(1-2):47-52, 1998] show that PCA can extract meaningful structure even from PCs that account for less than 1% of the variance and provide criteria to distinguish informative low-variance PCs from residual noise. In this session, we will explore these unexpected and powerful applications of PCA, uncovering its potential and reshaping the way you think about this fundamental technique.
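The low-variance-PC phenomenon can be illustrated with a minimal numpy sketch (the data here is synthetic and invented for this sketch, not taken from either paper): a dominant nuisance direction, analogous to a batch effect, captures PC1, while a PC explaining under 1% of the variance carries the class signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
batch = rng.integers(0, 2, n)             # nuisance: batch membership
label = rng.integers(0, 2, n)             # signal: biological class
X = np.column_stack([
    10.0 * batch + rng.normal(0, 1, n),   # high-variance batch axis
    0.5 * label + rng.normal(0, 0.1, n),  # low-variance biological axis
])

# PCA via SVD of the centred data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
var_explained = S**2 / (S**2).sum()

# PC1 tracks the batch effect; the tiny-variance PC2 tracks the biology
corr_pc1_batch = abs(np.corrcoef(scores[:, 0], batch)[0, 1])
corr_pc2_label = abs(np.corrcoef(scores[:, 1], label)[0, 1])
print(var_explained, corr_pc1_batch, corr_pc2_label)
```

Ranking PCs purely by variance explained and discarding the tail would throw away the biological signal here, which is the kind of missed signal this session is about.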
Students are also given the opportunity to present their reports on their homework from the previous week, viz. [Srihari et al., 2015]. Thereby, I hope to help students deepen their understanding of some additional nuances of statistical hypothesis testing.
Homework, due 24/1/2026. Read
[Oliver et al., "A Bayesian computer vision system for modeling
human interactions", IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(8):831-843, 2000].
Focus on Section 3.1, Segmentation by eigenbackground subtraction.
Comments by Gemini in this regard are attached in slide #33.
Submit a report (max 1 page) on your analysis of
Gemini's comments and point out any major flaws missed by Gemini.
Be prepared to make a 5-minute presentation to the class on 26/1/2026.
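For orientation before reading Section 3.1: eigenbackground subtraction learns a PCA basis from background-only frames, reconstructs each new frame from that basis, and flags pixels with large reconstruction error as foreground. The sketch below uses synthetic 1-D "frames" and function names of my own choosing, not the paper's notation:

```python
import numpy as np

def train_eigenbackground(frames: np.ndarray, k: int):
    """frames: (n_frames, n_pixels) array of background-only images.
    Returns the mean image and the top-k eigenbackgrounds (PCA basis)."""
    mu = frames.mean(axis=0)
    _, _, Vt = np.linalg.svd(frames - mu, full_matrices=False)
    return mu, Vt[:k]

def foreground_mask(frame, mu, eig, thresh):
    """Project the frame onto the eigenbackground space, reconstruct it,
    and flag pixels whose reconstruction error exceeds thresh."""
    coords = eig @ (frame - mu)
    recon = mu + coords @ eig
    return np.abs(frame - recon) > thresh

# Toy data: a static ramp background with slight noise; the test frame
# additionally contains a bright "moving object" at pixels 40-49.
rng = np.random.default_rng(1)
bg = np.linspace(0, 1, 100) + rng.normal(0, 0.01, (20, 100))
mu, eig = train_eigenbackground(bg, k=3)
frame = bg[0].copy()
frame[40:50] += 5.0
mask = foreground_mask(frame, mu, eig, thresh=2.0)
print(mask[40:50].all(), mask[:40].any())
```

The object is flagged because it is not representable in the background's principal subspace; the homework asks you to probe where this reasoning can go wrong.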
Insight + logic = elegant solutions
In this session, we tie up the loose ends from the three earlier discussions: (i) illuminating the twilight zone of protein function prediction, (ii) addressing the replicability crisis in differential gene selection, and (iii) correcting a fundamental flaw in eigenbackground subtraction. Together, these examples highlight how easily useful information can be missed---and how far a bit of domain insight can go. Much of the complexity we face comes from working against the intrinsic nature of the data. When our tools respect the physical structure of the signal, solutions often become strikingly simple and elegant.
Students are also given the opportunity to present their homework from
the previous week, viz. [Oliver et al., 2000]. Thereby, I hope to
help students deepen their understanding of some additional nuances of PCA.