Gene network estimation by using information of transcription factor binding DNA sequence and microarray data from many samples

Ryoichi Minai1, Tsuyoshi Waku2 and Shoji Makino1

1 University of Tsukuba, Japan.
2 University of Tokyo, Japan.

Background: In gene network estimation from microarray data (for example, with co-expression-based methods), the large number of errors in gene expression data is a serious problem. Many studies have attempted to resolve this issue with network models such as Bayesian networks, but the volume of data to be processed and the computational cost remain problematic. We therefore first constructed a theoretical transcriptional network from the DNA binding sequence information of transcription factors in a public database. We then validated and modified this theoretical network using microarray data from many samples. Furthermore, we attempted to analyze microarray data using the theoretical network.

Methods: Creation of a theoretical transcriptional network: i) Position weight matrix data from the JASPAR database and promoter region sequences of the RefSeq genes in hg19 (Human Genome version 19) were obtained. ii) Target genes of each transcription factor were identified exhaustively using the FIMO tool of the MEME Suite. iii) All relation data (transcription factor -> target gene) were connected to create the entire transcriptional network. iv) Nodes not represented by microarray probes were deleted.
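Steps iii) and iv) above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the column names are assumed to follow the tab-separated output of FIMO, the motif, transcription factor and gene names (M1, TF_A, GENE1, ...) are hypothetical, the p-value cutoff is an assumed parameter, and sequence names are assumed to have been mapped to gene symbols already.

```python
import csv
import io
from collections import defaultdict

# Hypothetical FIMO-style hits; every row is made up for illustration.
HEADER = ["motif_id", "motif_alt_id", "sequence_name", "start", "stop",
          "strand", "score", "p-value", "q-value", "matched_sequence"]
ROWS = [
    ["M1", "TF_A", "GENE1", "10", "19", "+", "12.3", "1e-5", "0.01", "ACGTACGTAC"],
    ["M1", "TF_A", "GENE2", "40", "49", "-", "3.1", "5e-3", "0.50", "ACGAACGTTC"],
    ["M1", "TF_A", "TF_B", "77", "86", "+", "11.0", "2e-5", "0.02", "ACGTACGTAA"],
    ["M2", "TF_B", "GENE3", "5", "12", "+", "14.8", "1e-6", "0.001", "TTGACGCA"],
    ["M3", "TF_C", "GENE1", "60", "67", "-", "13.9", "1e-6", "0.001", "GGCATTCC"],
]
FIMO_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def build_network(tsv_text, p_threshold=1e-4):
    """Collect transcription factor -> target gene edges from motif hits."""
    edges = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Keep only hits at or below the assumed p-value cutoff.
        if float(row["p-value"]) <= p_threshold:
            edges[row["motif_alt_id"]].add(row["sequence_name"])
    return edges


def restrict_to_probes(edges, genes_on_array):
    """Drop transcription factors and targets that the array does not measure."""
    kept = {}
    for tf, targets in edges.items():
        if tf not in genes_on_array:
            continue
        on_array = targets & genes_on_array
        if on_array:
            kept[tf] = on_array
    return kept


# Hypothetical set of genes represented by array probes.
ARRAY_GENES = {"TF_A", "TF_B", "GENE1", "GENE3"}
network = restrict_to_probes(build_network(FIMO_TSV), ARRAY_GENES)
```

Storing the network as a mapping from each transcription factor to a set of targets keeps duplicate hits in the same promoter from creating duplicate edges, and a transcription factor can itself appear as a target (TF_B here), which is what links the individual relations into one network.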

Selection of samples for analysis: GEO (Gene Expression Omnibus) contains more than 80,000 samples measured on the Affymetrix Human Genome U133 Plus 2.0 Array. We computed self-organizing maps (SOM) based on differences in the expression status of the probes to examine how these samples are classified by cellular phenotype. In addition, the SOM was used to select target samples for the subsequent analyses with the theoretical network.
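The SOM step can be illustrated with a toy example. The sketch below is a deliberate simplification and not the configuration used in this study: it trains a one-dimensional map with four units on six made-up expression profiles of three probes each, uses a deterministic initialisation, and all parameter values (number of units, epochs, learning rate, neighbourhood width) are assumptions chosen for readability.

```python
import math


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to sample x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k]))


def train_som(samples, n_units=4, epochs=20, lr0=0.5, sigma0=1.0):
    """Train a 1-D self-organizing map; returns one weight vector per unit."""
    dim = len(samples[0])
    first, last = samples[0], samples[-1]
    # Deterministic start: units spaced evenly between the first and last sample.
    weights = [
        [first[d] + (last[d] - first[d]) * k / (n_units - 1) for d in range(dim)]
        for k in range(n_units)
    ]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.1     # neighbourhood shrinks over time
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for k, w in enumerate(weights):
                # Gaussian neighbourhood: units near the winner move more.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Two hypothetical groups of samples with opposite expression patterns.
group_a = [[1.0, 1.0, 9.0], [1.0, 2.0, 9.0], [2.0, 1.0, 8.0]]
group_b = [[9.0, 8.0, 1.0], [8.0, 9.0, 2.0], [9.0, 9.0, 1.0]]

weights = train_som(group_a + group_b)
units_a = [best_matching_unit(weights, x) for x in group_a]
units_b = [best_matching_unit(weights, x) for x in group_b]
```

After training, the samples of the two groups are assigned to units at opposite ends of the map, which is the property that allows a SOM to be used for grouping samples and for picking a subset of them for later analysis.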

Results: A theoretical transcriptional network with 41 transcription factors and 5608 target genes (limited to genes represented on the Affymetrix Human Genome U133 Plus 2.0 Array) was created. The SOM clearly separated the distributions of cancer and non-cancer cell samples.