S1-1
MiST: Variant-detection
through whole-exome sequencing
Sailakshmi
Subramanian1, Valentina Di Pierro1, Hardik Shah1,
Ajish George1, Bruce Gelb1, Ravi Sachidanandam1
1Mount Sinai School of Medicine, United States
Whole-exome sequencing
is a promising approach to find causative mutations in human disease,
especially for Mendelian disorders. It involves the capture of sequences from
exons in genomic DNA using probes from exonic regions of the genome. The
captured exonic sequences are deeply sequenced and analyzed for variants from
the reference genome. There are several tools to align sequenced reads to
reference genomes and call SNPs and variants. We have developed a
variant-calling platform, MiST, that builds on our previously published tool, Geoseq.
The tool mimics the experimental technique, computationally fishing reads from
the deep sequencing set using probes from the targeted exons. The captured
reads are mapped with great sensitivity to accurately call SNPs and variants.
Our pipeline carefully eliminates paralogous read mapping, which can lead to
spurious SNP calls. It also tracks strand-bias and clonality in the sequencing
libraries, allowing for more accurate measurements of coverage and variant
detection. The platform identifies variant calls that have already been seen in
other samples by comparing them to a database of known variants collected from
dbSNP, 1000 Genomes and private variant collections. A web-based interface
allows users to visualize the alignments and other raw data underlying a
variant call. The user can rapidly filter calls based on known and predicted
functional characteristics. The pipeline is parallelizable and runs over a
cluster, allowing the process to be scaled up. It also comes with a web-based
interface that allows end-users to explore and visualize the data. We used
targeted re-sequencing (Sanger) to confirm the validity of a few of the
variants inferred by MiST. In addition, we compare its calls to the variant calls made by
the GATK platform and demonstrate the benefits of our approach, as well as the
commonalities between the programs.
S1-2
Improve the Nucleotide Coding
Technique, Use Support Vector Machine, Get the Better Accuracy: Survey of Human
Splice Site Prediction
A.T.M.Golam Bari1,
Mst.Rokeya Reaz1, Md.Azam Hossain1, Ho-Jin Choi2,
Byeong-Soo Jeong1
1Kyung Hee University,
Dept. of Computer Engineering, 1732 Deokyoungdaero, Giheung-gu, Yongin-si,
Gyeonggi-do, 446-701, Republic of Korea
2Korea Advanced Institute of Science and Technology, 335
Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea
Splice
site prediction in DNA sequences is a basic search problem: finding the
exon-intron and intron-exon boundaries. Removing the introns and joining the
exons forms the coding sequence, which is the input to translation
and a necessary step in the central dogma of molecular biology. The main tasks of
splice site prediction are locating the candidate GT and AG boundary
dinucleotides within a sequence of A, T, C and G, and distinguishing the true
GT and AG sites from the false ones. In this paper, we survey recent research on
splice site prediction based on support vector machines (SVMs). The basic
difference among these works is the nucleotide encoding technique: some methods
use a sparse encoding, whereas others encode in a probabilistic manner. All the
encoded sequences serve as input to the SVM, whose task is to classify them
using its learned model. We examine each encoding technique and group the techniques
according to their similarity. Our survey provides a basic
understanding of encoding approaches for splice site prediction.
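To make the idea of a sparse encoding concrete, the following minimal Python sketch turns a nucleotide window into a one-hot feature vector of the kind that could be passed to an SVM. The 4-bit layout, the function names and the scan for candidate GT sites are illustrative assumptions; the surveyed methods may use different layouts and window sizes.

```python
# Illustrative sketch only: the bit order A, C, G, T is an assumed convention.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_sparse(window):
    """Return a flat one-hot vector (4 values per nucleotide) for a DNA window."""
    features = []
    for base in window.upper():
        features.extend(ONE_HOT[base])  # raises KeyError for non-ACGT symbols
    return features


def candidate_donor_sites(sequence):
    """Return the start index of every GT dinucleotide (candidate sites only)."""
    sequence = sequence.upper()
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GT"]
```

Each candidate site would then be encoded together with its flanking window, and the classifier decides whether it is a true or false splice site.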
S1-3
New Features of MTRAP Alignment and
its Advantage: All-in-one Interface for Sequence Analysis, MSA and the Support
for Non-coding RNA
Toshihide Hara1,2, Keiko
Sato1,2, Masanori Ohya1,2
1Department of
Information Sciences, Tokyo University of Science, 2Quantum
Bio-Informatics Research Division, Tokyo University of Science, 2641 Yamazaki,
Noda City, Chiba, Japan
Sequence
alignment of protein or DNA/RNA sequences is one of the most important tasks
in modern bioinformatic analysis. In this field, studies start with the
comparison of target sequences, and the comparison is realized by constructing
an alignment. With the rapid increase of genome data driven by Next
Generation Sequencing, the need for high-quality alignment has become more
apparent. Despite this obvious need, current alignment quality is not
sufficient. Recently we developed a high-quality alignment method called MTRAP. We
showed that sequence alignment can be significantly improved by
considering the correlation between two consecutive pairs of residues. In our
first paper we showed that the method generates good results for protein
sequences, but it was not known whether it works for DNA/RNA sequences.
In this paper, we present our recent study on non-coding RNA sequences. In
addition, we explain the new features of the recent version of MTRAP.
S2-1
Computational Methods for Cancer Subtype
Classification using Integrated Data
Shinuk Kim1,2,3, Taesung
Park2, Mark Kon1,3
1Bioinformatics
program, Boston University, Boston, MA 02215, USA
2Department of Statistics, Seoul National University, Seoul 151-747,
Republic of Korea
3Department of Mathematics and Statistics, Boston University,
Boston, MA 02215 USA
MicroRNAs
(miRNAs) are known to be strongly involved in cancer pathology through
regulation of target messenger RNA (mRNA) molecules. We study a potentially
useful methodology based on machine learning (ML) involving integration of
separate biomarker classes to improve prediction and separation of ovarian
cancer survival times. We use an ML-based protocol for feature selection,
integrating information from miRNA and mRNA profiles at the feature level. For
prediction of survival phenotypes, we use two classifiers, one a machine
learning method (support vector machine, SVM), and the second a novel
regression-based method (SVM-based Fisher feature selection together with Cox
proportional hazard regression, FSCR). We compared these two methods using
three types of cancer tissue features: i) miRNA expression, ii) mRNA expression,
and iii) integrated miRNA and mRNA expression information, with features
selected either from combined miRNA/mRNA profiles (CFS), or separately from the
two feature sets (IFS). The accuracy of survival classification using the
combined miRNA/mRNA profiles was 88.64% using IFS-SVM, and 84.09% using
IFS-FSCR in a balanced dataset. These accuracies are higher than those using
miRNA alone (81.82%, SVM; 75%, FSCR) or mRNA alone (70.45%, SVM; 72.73%, FSCR).
The latter differences indicate sometimes strong interactions between miRNA and
mRNA features which are not visible in individual analyses. In addition, we
focus on the most significant miRNAs obtained by SVM-based feature selection,
which include hsa-miR-23b and hsa-miR-27b. By integrating sequence information
and gene expression profiles, we predicted 16 target genes of hsa-miR-23b
and hsa-miR-27b, which include cancer-related genes.
S2-2
A combination algorithm for 5-year
survivability of breast cancer patients
Kung-Jeng Wang1, Bunjira
Makond1, and Kun-Huang Chen1,2
1Department of
Industrial Management, National Taiwan University of Science and Technology,
Taipei 106, Taiwan, R.O.C.
2School of Dentistry,
College of Oral Medicine, Taipei Medical University, Taipei, 110, Taiwan,
R.O.C.
In
this study, we propose a new algorithm to enhance the effectiveness of
classification of 5-year survivability of breast cancer patients when the
data set is imbalanced. The algorithm combines the synthetic minority
oversampling technique (SMOTE) with a particle swarm optimization (PSO) based
decision tree (C5): SMOTE+PSO+C5. G-mean is used as the metric to evaluate the proposed
algorithm; moreover, the proposed algorithm is compared with
PSO+C5 and C5. The results show that the SMOTE+PSO+C5 algorithm has the highest
performance in classifying 5-year survivability of breast cancer patients
when the data set is imbalanced. The proposed method classifies both
survival and non-survival cases well. In addition, applying the PSO+C5 method
to imbalanced data does not improve classification performance over using
the standard classifier alone.
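As a brief illustration of the evaluation metric named above, the following Python sketch computes G-mean, the geometric mean of sensitivity and specificity, from true and predicted labels. The function name and the label coding (1 for the minority class) are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # A class with no samples contributes a rate of 0.0 (assumed convention).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

A classifier that assigns every sample to the majority class can reach high accuracy on imbalanced data but scores a G-mean of 0, which is why the metric suits this setting.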
S2-3
Gene Interaction-Level Cancer
Classification using Gene Expression Profiles
Ashis Saha, Jaewoo Kang1
1Korea University,
Seoul 136713, Korea.
Recent
studies suggest that biological pathways have the power to be stronger
biomarkers for cancer than individual genes. The knowledgebase of pathways
contains the interactions among the genes. However, it is not necessary for all
the genes in a pathway to interact with each other. Closely interacting genes
are supposed to have a collective effect to cause cancer or other disease. Here
we propose a novel cancer classification method utilizing the collective effect
of the set of closely interacting genes which we call Gene Interaction Set
(GIS). We first find the possible strength levels of each gene interaction
set using a clustering method. We then rank all the sets with our proposed
entropy metric, which uses the proportion of samples of different classes having the same
strength level. Finally, we predict the class of a new sample by weighted voting
of the top k gene interaction sets. An important feature of our method is that the
process causing the disease can easily be traced. We validate our
method by comparing it with other classification methods known to produce very high
accuracy on 7 cancer datasets.
S3-1
Globally Inferring Targets From
Phenotypic Small-Molecule Screens
S. Joshua Swamidass1,2,
Michael Barratt1, Bradley T. Calhoun1
1Division of Laboratory
and Genomic Medicine, Department of Pathology and Immunology, Washington
University School of Medicine, St. Louis, MO.
2Chemical Biology/Novel Therapeutics, Broad Institute of Harvard and
MIT, Cambridge, MA
A
central challenge in modern drug discovery is the identification of the target
proteins and pathways that can be manipulated to modulate disease. Gaps in our
understanding of how targets modulate disease are evident in the high rate of
Phase II clinical trial failures, when medicines are first tested for efficacy.
The high reward for finding novel connections between targets and diseases is
evident in several examples where known medicines have been repurposed to treat
new diseases. In this study, we present and validate a new way of Globally
Inferring protein Targets from Phenotypes (GIPT) by finding patterns in
small-molecule screens of medically-relevant, cellular assays. Mining
phenotypic, small-molecule screens is a promising strategy because it leverages
translatable experimental data and because it is biased towards druggable
proteins. We demonstrate that this strategy can both recover known targets and
suggest plausible novel targets for several medically-relevant
phenotypes (including insulin signaling, amyloid precursor protein expression,
and cyclic-AMP levels), with applications in diabetes, Alzheimer's disease, and
depression.
S3-2
More Reproducible Results from Small-sample
Clinical Genomics Studies by Multi-Parameter Shrinkage, with Application to High-throughput
RNA Interference Screening Data
Mark A. van de Wiel1,
Renee X. de Menezes2, Ellen Siebring2,3, Victor W. van
Beusechem2
1Department of
Epidemiology and Biostatistics, 2RNA Interference Functional
Oncogenomics Laboratory (RIFOL), 3Department of Pulmonary Disease,
VU University Medical Center, PO Box 7057, 1007 MB Amsterdam, The Netherlands
High-throughput
(HT) RNA interference screens are increasingly used for reverse genetics and
drug discovery. These experiments are laborious and costly, hence sample sizes
are often very small. Powerful statistical techniques to detect siRNAs that
potentially enhance treatment are currently lacking, because existing methods do not
optimally use the amount of data in the other dimension, the feature dimension.
We introduce ShrinkHT, a Bayesian method for shrinking multiple parameters in a
statistical model, where 'shrinkage' refers to borrowing information across
features. ShrinkHT is very flexible in fitting the effect size distribution for
the main parameter of interest, thereby accommodating skewness that naturally
occurs when siRNAs are compared with controls. In addition, it naturally
down-weights the impact of nuisance parameters (e.g. assay-specific effects)
when these tend to have little effect across siRNAs. We show that these
properties lead to better ROC-curves than with the popular limma software.
Moreover, in a 3 + 3 treatment vs control experiment with 'assay' as an
additional nuisance factor, ShrinkHT is able to detect three significant siRNAs
with stronger enhancement effects than the positive control. In the context of
gene-targeted (conjugate) treatment, these are interesting candidates for
further research.
S3-3
Breast Cancer Survivability
Prediction with Labeled, Unlabeled, and Pseudo-Labeled Patient Data
Juhyeon Kim1, Hyunjung
Shin1
1Department of
Industrial Engineering, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon
443-749, South Korea
Prognostic
study on breast cancer survivability has been aided by machine learning
algorithms, which predict the survival of a particular patient on
the basis of historical patient data. A labeled patient record, however, is not
easy to collect. It takes at least five years to label a patient record as
"survived" or "not survived"; meanwhile, unguided trials of numerous
types of oncology therapy are costly. Moreover, obtaining a labeled patient record requires confidentiality
agreements from both doctors and patients.
The difficulties in collecting labeled patient data have drawn researchers'
attention to Semi-Supervised Learning (SSL), one of the most recent machine
learning approaches, since it can also use unlabeled patient data,
which are much easier to collect, and is therefore regarded as a
pertinent way to circumvent the difficulties. However, even with SSL,
more labeled data still lead to better prediction. To make up
for the insufficiency of labeled patient data, one may consider tagging
unlabeled patient data with virtual labels, namely "pseudo-labels", and using
them as if they were labeled. The proposed algorithm, "SSL
Co-training", implements this idea based on SSL. SSL Co-training was tested
on the Surveillance, Epidemiology, and End Results (SEER) database for breast cancer
and achieved an average accuracy of 76% and an average AUC of 0.81.
S4-1
Semantic PubMed Searches
Illhoi Yoo1, 2
1Health Management
& Informatics, School of Medicine, 2Informatics Institute,
University of Missouri, Columbia, MO, USA
The
Evidence-Based Medicine (EBM) Working Group has defined efficient biomedical
literature searching as a core skill required for the practice of EBM.
Although the information obtained from PubMed could significantly improve the
quality of health care, physicians typically do not pursue their questions
about patient care. This paper discusses the importance of PubMed searches for
physicians, identifies the origin of the well-known obstacles to answering
physicians' clinical questions using PubMed, and introduces a novel system
called Semantic-oriented MEDLINE search (SoMs) that addresses the original problems to
enhance their information retrieval experience in PubMed. Based on a variety
of literature in information retrieval, cognitive science, and medical
science, we analyzed widely accepted obstacles to answering physicians'
clinical questions and then identified the origins of the obstacles to provide
a technical solution for each obstacle category. The problem in physicians'
information-seeking behavior is twofold: a user-side problem and a system-side
problem. The user-side problem comes from the user's emergent information needs
and unfamiliarity with MeSH terms and the MeSH Tree, and the system-side
problem comes from the fragmented information available from PubMed. We suggest
the use of a biomedical semantic network with a concept-filtering tool to
address the emergent information need problem, and the Concept-Based PubMed
Archive (CBPA) to address the fragmented information problem. SoMs can
concisely answer many clinical questions PubMed cannot.
S4-2
Research Domain Grouping and
Analysis in Bioinformatics Domain using Text Mining
Junbeom Kim1, Chae-Gyun
Lim1, Sung Suk Kim1, Dukyong Yoon2, Rae-Woong Park2,
Ho-Jin Choi1
1Department of Computer
Science, Korea Advanced Institute of Science and Technology, 291 Daehak-ro,
Yuseong-gu, Daejeon 305-701, Korea
2School of Medicine & Graduate School of Medicine, Ajou
University, San 5 Woncheon-dong, Yeongtong-gu, Suwon 443-721, Korea
In
this paper, we propose a new text-mining approach to information extraction and
analysis of research domains to assist bioinformatics research. To
this end, we combine the Term Frequency-Inverse Document Frequency (TF-IDF) method with
reference link aggregation to derive useful
information for analyzing the structures and relations of the fields of interest.
From the information derived from TF-IDF and reference link aggregation, useful
connections and relations can be obtained that generate and reveal new information and
knowledge. The results help researchers extract and find
additional knowledge about related domains and fields. To show the usefulness of
the proposed method, we demonstrate research domain clustering and the
results derived from the clusters.
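For readers unfamiliar with the weighting scheme, the following minimal Python sketch computes TF-IDF weights for a small collection of tokenized documents. It assumes the common variant tf = count / document length and idf = ln(N / document frequency); the abstract does not state which variant the authors use, and the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Terms that occur in every document receive a weight of zero, so the highest-weighted terms are those that characterize one research domain relative to the others.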
S4-3
ICD-9 Tobacco Use Codes are
Effective Identifiers of Smoking Status
Laura K. Wiley1,2,
Anushi Shah2, Hua Xu2, William S. Bush1,2
1Center for Human
Genetics Research, 2Department of Biomedical Informatics, Vanderbilt
University School of Medicine, Nashville, TN, USA
With
the increased development of clinic-based biorepositories, Electronic Medical
Records (EMRs) are being used for genetic epidemiology research. These studies
often require identification of and adjustment for clinical covariates, such as
smoking status. Unfortunately, a patient's smoking status is often difficult to
extract from clinical text. The International Classification of Diseases, 9th
Revision (ICD-9) contains two codes designating tobacco use - one for former and
one for current use - but the reliability of these codes for classifying
smoking status is often questioned due to their ambiguous use in clinical
environments. In this study we evaluated the utility of these codes to identify
ever-smokers in general and high smoking prevalence (lung cancer) clinic
populations. We assessed potential biases in documentation, and performed
temporal analysis relating transitions between smoking codes to smoking
cessation attempts. We also examined the suitability of these codes for use in
genetic association analyses. We establish that ICD-9 tobacco use codes can
precisely identify smokers in a general clinic population (specificity = 1;
sensitivity = 0.32), and that there is little evidence of documentation bias.
Frequency of code transitions between "current" and "former" tobacco use is
significantly correlated with initial success at smoking cessation
(p<0.0001). Finally, we illustrate that code-based smoking status assignment
is a comparable covariate to text-based smoking status for genetic association
studies. Our results support the use of ICD-9 tobacco use codes for identifying
smokers in a clinical population, and justify use of this derived status in
genetic studies utilizing electronic health records.
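The reported specificity and sensitivity can be read more easily with a small worked example. The Python sketch below derives sensitivity, specificity and positive predictive value from confusion-matrix counts. The counts in the usage note are hypothetical, chosen only so that the rates match those reported above; they are not the study's data, and the function name is an assumption.

```python
def code_performance(tp, fp, tn, fn):
    """Return sensitivity, specificity and PPV from confusion-matrix counts."""
    def rate(numerator, denominator):
        # Return None when a rate is undefined (no samples in the denominator).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": rate(tp, tp + fn),  # smokers who carry a tobacco code
        "specificity": rate(tn, tn + fp),  # non-smokers without a tobacco code
        "ppv": rate(tp, tp + fp),          # coded patients who are truly smokers
    }
```

With hypothetical counts of 32 coded smokers, 68 uncoded smokers, 100 uncoded non-smokers and no coded non-smokers, `code_performance(32, 0, 100, 68)` gives a sensitivity of 0.32 and a specificity of 1.0. A patient with a code is then almost certainly a smoker, while many smokers carry no code.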
S5-1
Extraction of Coordinated Patterns
of DNA Methylation and Gene Expression in Ovarian Cancer
Je-Gun Joung1,2,3,
Dokyoon Kim1,2, Kyung Hwa Kim1,2, Ju Han Kim1,2
1Seoul National
University Biomedical Informatics (SNUBI), Div. of Biomedical Informatics, 2Systems
Biomedical Informatics National Core Research Center, 3Institute of
Endemic Diseases, Seoul National University College of Medicine, 103 Daehak-ro,
Jongno-gu, Seoul 110-799, Korea
DNA
methylation, a regulator of gene expression, plays an important role in diverse
biological processes including developmental process, carcinogenesis and aging.
In particular, aberrant DNA methylation has been widely observed in several
types of cancers. Currently, it is important to extract disease-specific
genesets associated with the regulation of DNA methylation. Here we propose a
novel approach to find the minimum regulatory units of genes, co-Methylated and
co-Expressed Gene Pairs (MEGPs) that are highly correlated gene pairs between
DNA methylation and gene expression showing the co-regulatory relationship. To
evaluate whether our method is meaningful to extract disease-associated genes,
we applied our method to a large-scale dataset from The Cancer Genome Atlas,
extracted significantly associated MEGPs and analyzed their functional
correlation. We observed that many of our MEGPs physically interact with each other
and show high semantic similarity in Gene Ontology terms. Furthermore, we
performed gene set enrichment tests to identify how they are correlated in a
complex biological process. Our MEGPs were highly enriched in the biological
pathway associated with ovarian cancers. Our approach can be useful for
discovering coordinated epigenetic markers associated with specific diseases.
S5-2
Network Models of GWAS Uncover the
Topological Centrality of Protein Interactions in Complex Disease Traits
Younghee Lee1,2, Haiquan
Li1,2,3, Jianrong Li1,2,3, Ellen Rebman1,3,
Kelly Regan3, Eric R Gamazon2, James L Chen1,4,
Xinan Yang1,2, Nancy J Cox1,2,5, Yves A Lussier1,2,4,5,6
1Center for Biomedical
Informatics and 2Section of Genetic Medicine, Department of
Medicine, The University of Chicago, Chicago, IL 60637
3Department of Medicine, The University of Illinois at Chicago,
Chicago, IL, 60612,
4Section of Hematology/Oncology, Department of Medicine, The
University of Chicago, Chicago, IL 60637
5Institute for Genomics and Systems Biology, and 6Computation
Institute, The University of Chicago, Chicago, IL 60637
While
Genome Wide Association Studies (GWAS) of complex traits have revealed
thousands of reproducible genetic associations to date, these loci collectively
confer very little of the heritability of their respective diseases and, in
general, have contributed little to our understanding of the underlying disease
biology. Physical protein interactions have been utilized to increase our
understanding of human Mendelian disease loci but have yet to be fully
exploited for complex traits. Here, we hypothesized that protein interaction
modeling of GWAS findings could highlight important disease-associated loci and
unveil the role of their network topology in the genetic architecture of
diseases with complex inheritance. Network modeling of proteins associated with
the intragenic SNPs of the NHGRI catalog of complex trait GWAS revealed that
complex trait associated loci are more likely to be hub and bottleneck genes in
available, albeit incomplete, networks (odds ratio = 1.59, Fisher's exact test P-value <
2.24 × 10^-12). Network modeling also prioritized novel Type 2 Diabetes (T2D)
genetic variations from the Finland-United States Investigation of NIDDM
Genetics and the Wellcome Trust GWAS data, and demonstrated the enrichment of
hubs and bottlenecks in prioritized T2D GWAS genes. The potential biological
relevance of the T2D hub and bottleneck genes was revealed by their increased
number of first degree protein interactions with known T2D genes according to
several independent sources (P-value<0.01, probability of being first
interactors of known T2D genes). Virtually all common diseases are complex
human traits, and thus the topological centrality in protein networks of
complex trait genes has implications in genetics, personal genomics, and in
therapy.
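The enrichment statistics quoted above (an odds ratio with a Fisher's exact test P-value) can be illustrated with a short Python sketch on a 2x2 table of the form [[a, b], [c, d]], for example trait-associated hub genes, trait-associated non-hub genes, other hub genes and other non-hub genes. The function names, the one-sided alternative and any counts used with it are assumptions for illustration; the abstract does not give the underlying table or state whether the test was one- or two-sided.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); undefined when b or c is zero."""
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) under the hypergeometric null."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return tail / denominator
```

For the toy table [[3, 1], [1, 3]], `odds_ratio(3, 1, 1, 3)` is 9.0 and `fisher_exact_greater(3, 1, 1, 3)` is 17/70.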
S5-3
Identification of Multiple
Gene-Gene Interactions for Ordinal Phenotypes
Kyunga Kim1, Min-Seok
Kwon2, Sohee Oh3, Taesung Park2,3
1Department of
Statistics, Sookmyung Women's University, South Korea
2Interdisciplinary Program in Bioinformatics, Seoul National
University, South Korea
3Department of Statistics, Seoul National University, South Korea
Multifactor
dimensionality reduction (MDR) is a powerful method for analysis of gene-gene
interactions and has been successfully applied to many genetic studies of
complex diseases. However, the main application of MDR has been limited to
binary traits, while traits having ordinal features are commonly observed in
many genetic studies (e.g., obesity classification - normal, pre-obese, mild
obese and severe obese). We propose ordinal MDR (OMDR) to facilitate gene-gene
interaction analysis for ordinal traits. As an alternative to balanced
accuracy, the use of tau-b, a common ordinal association measure, was suggested
to evaluate interactions. Also, we generalized cross-validation consistency
(GCVC) to identify multiple best interactions. GCVC can be practically useful
for analyzing complex traits, especially in large-scale genetic studies. In
simulations, OMDR showed fairly good performance in terms of power,
predictability and selection stability and outperformed MDR. For demonstration,
we used a real data set of body mass index (BMI) and scanned 1- to 4-way interactions
for the ordinal and binary obesity traits derived from BMI via OMDR and MDR, respectively. In
the real data analysis, more interactions were identified for the ordinal trait than for
the binary trait. On average, the commonly identified interactions showed higher
predictability for the ordinal trait than for the binary trait. The proposed OMDR and GCVC
were implemented in a C/C++ program, executables of which are freely available
for Linux, Windows and MacOS upon request for non-commercial research
institutions.
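As background for the tau-b measure mentioned above, the following Python sketch computes Kendall's tau-b for two ordinal variables by direct pair counting. It is a generic O(n^2) illustration, not the OMDR implementation, and the function name and the return value of 0.0 for the undefined case are assumptions.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt(pairs untied in x * pairs untied in y)."""
    concordant = discordant = 0
    untied_x = untied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx != 0:
                untied_x += 1
            if dy != 0:
                untied_y += 1
            product = dx * dy
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    denominator = math.sqrt(untied_x * untied_y)
    # tau-b is undefined when one variable is constant; 0.0 is returned here
    # as an assumed convention.
    return (concordant - discordant) / denominator if denominator else 0.0
```

Unlike balanced accuracy, this measure uses the ordering of the trait categories, for instance when comparing a predicted risk level with obesity classes coded 0 to 3.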
S5-4
Key genes for modulating
information flow play a temporal role as breast tumor coexpression networks are
dynamically rewired by letrozole
Nadia M. Penrod1,2 and
Jason H. Moore2,3
1Department of
Pharmacology and Toxicology, 2Department of Genetics, 3Institute
for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth
College, Hanover, NH, USA
Genes
do not act in isolation but instead as part of complex regulatory networks. To
understand how breast tumors react to the presence of the drug letrozole it is
necessary to understand how the entire gene network changes as it is perturbed
by the drug. Using transcriptomic data generated from sequential tumor biopsy
samples, taken at diagnosis and following 10-14 days and 90 days on letrozole,
we build temporal gene coexpression networks. Coexpression is determined by a
pairwise partial correlation statistic. We find that the breast tumor network
is in a continual state of flux maintaining few relationships between time
points. This means that the genes integral for maintaining network integrity
and controlling information flow are dynamically changing as the network is
rewired. By understanding how gene-gene relationships change in the presence of
the drug letrozole we can begin to understand causes of drug resistance.
S6-1
Diplotyper: Diplotype-based
Association Analysis
Sunshin Kim1, KyungChae
Park2, Chol Shin3, Nam H Cho4, Jeong-Jae Ko1,
InSong Koh5, KyuBum Kwack1
1Department of
Biomedical Science, College of Life Science, CHA University, Seongnam, Korea
2Department of Family Medicine, CHA Bundang Medical Center, CHA
University, Seongnam, Korea
3Division of Pulmonary and Critical Care Medicine, Department of
Internal Medicine, Korea University Ansan Hospital, Ansan, Korea
4Department of Preventive Medicine, Ajou University School of
Medicine, Suwon, Korea, 5Department of Physiology, College of Medicine, Hanyang
University, Seoul, Korea
Diplotyper
is a fully automated tool for performing association analysis based on
diplotypes in a population. Diplotyper combines a novel algorithm designed to
cluster haplotypes of interest from a given set of haplotypes with two existing
tools: Haploview, for analyses of linkage disequilibrium blocks and haplotypes
(with frequency threshold of 1%), and PLINK, to generate all possible
diplotypes from a given population sample and calculate linear or logistic
regression. In addition, procedures for generating all possible diplotype
groups from the haplotype groups and transforming these diplotypes into PLINK
formats were implemented. Diplotyper was tested through association analysis of
hepatic lipase (LIPC) gene polymorphisms or diplotypes and levels of
high-density lipoprotein (HDL) cholesterol. This analysis identified much more
significant signals than single-locus tests.
S6-2
Computational
Studies of Post-translational Modifications
Zexian Liu1, Jian Ren2,
Yu Xue3
1University of Science and Technology of China, China
2Sun Yat-sen University, China
3Huazhong University of Science and Technology, China
Background: By modifying proteins temporally
and spatially, post-translational modifications (PTMs)
greatly expand proteome diversity and play critical roles in regulating
biological processes. Identification of site-specific substrates is fundamental
for understanding the molecular mechanisms and biological functions of PTMs,
but it remains a great challenge under current technical limitations. To
date, the accumulation of experimental discoveries has made it possible to
develop computational tools for the prediction of PTMs.
Methods: To predict PTM sites, a previously developed GPS (Group-based Prediction
System) algorithm was adopted and improved. Weight training and k-means
clustering methods were introduced for the prediction of pupylation sites in
prokaryotic proteins and tyrosine nitration sites, respectively. Beyond PTMs,
the GPS algorithm was extended to predict I-Ag7 and HLA-DQ8 epitopes by
combining it with a Gibbs sampling approach. The CPLA database was constructed
from experimentally identified lysine acetylation sites manually collected from the
literature. The protein-protein interaction (PPI) information for constructing
the protein network was collected from five major PPI databases.
Results: The GPS algorithm was improved and used to implement a series of
software tools to predict PTMs, including GPS-CCD, GPS-PUP and GPS-YNO2 for the
prediction of calpain cleavage, pupylation and tyrosine nitration sites,
respectively. Furthermore, the GPS algorithm was extended to develop the predictors
GPS-MBA and GPS-ARM for the prediction of MHC class II epitopes and APC/C
recognition motifs, respectively. With the predictive tools and the pipeline, we
systematically compared the functional distribution and preference of
S-nitrosylation and nitration. The functional diversity of D-box and
KEN-box mediated APC/C recognition and degradation was also statistically
explored. In addition, by integrating existing protein acetylome data, the
human lysine acetylation network (HLAN) was modeled and demonstrated for the first time,
and the triplet relationship among HAT-substrate-HDAC was proposed as the
fundamental component of HLAN.
Conclusions:
Taken together, since the developed computational tools can conveniently provide helpful
information, we anticipate that the combination of
computational predictions and experimental verifications will become the
foundation for systematically understanding the mechanisms and dynamics of
PTMs.
S6-3
The Efficiency of Spatial Model in
Assigning Protein Sequences to Protein Families
Hamid
Pezeshk1,3, Vahid Rezaei2,3
1School of Mathematics,
Statistics and Computer Science, College of Science University of Tehran, Iran.
2Faculty of Mathematical Science, Tarbiat Modares University,
Tehran, Iran.
3Bioinformatics Research Group, School of Computer Science,
Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
In
this research we introduce a spatial model on a regular lattice based on
multiple sequence alignment (MSA) for assignment of a protein sequence to a
protein family. In this model, we assume that both the top and the bottom
residues of each amino acid, in a profile of aligned protein sequences, contain
useful information due to evolutionary relationships. For this purpose, we use
the top twenty profiles in the Pfam database to assess the performance of our
spatial model in assigning proteins to protein families. We then compare our
model with the profile hidden Markov model (PHMM). Results show that using the
spatial model increases the accuracy of protein sequence assignments considerably.
S6-4
Computational Approach for Protein
Structure Prediction
Amouda Nizam1,
G. Jeyakodi1, C. Manimozhi1
1Centre for
Bioinformatics, Pondicherry University, India
Genetic
algorithms (GAs) are used to solve difficult optimization problems with huge
search spaces about which little is known, in various domains, and biology is
no exception. Many variants of the standard GA (SGA) have been applied to
complex problems such as protein structure prediction (PSP), which is
recognized as an NP-hard problem in molecular biology. Unfortunately, SGA
requires non-domain experts to take special care in manually choosing
parameter values in order to reach a better solution. This research proposes a
novel algorithm (SOGA) that blends self-organizing concepts with GA in order to
automate the choice of appropriate parameter values. The proposed algorithm is
developed with full knowledge of the problem (PSP), and the selection of the
different parameters is based on the problem and on the fitness value acquired
in each generation. SOGAPSP is validated by comparing the native and predicted
structures of a protein. The minimal energy value of the predicted protein
structure indicates the stability of the molecule. The Rampage server results
imply that the psi and phi angles of the predicted protein structure are
feasible for the amino acid residues in the structure. The RMSD value indicates
a conformation similar to the native structure of the protein. The proposed
algorithm reduces the time required for optimizing the parameter values and
avoids premature convergence by self-organizing the genetic operators of the
GA. Applying this algorithm to protein structure prediction achieved better
results by self-organizing the crossover and mutation rates. Notably, no known
structure is required to predict the unknown structure.
S7-1
Revealing Molecular Mechanism of Rare
Mental Disorders
Zhe Zhang1,2, Shawn
Witham1, Margo Petukh1, Gautier Moroy2, Maria
Miteva2, Yoshihiko Ikeguchi3, Emil Alexov1
1Computational
Biophysics and Bioinformatics, Department of Physics, Clemson University,
Clemson, SC 29634, USA
2Universite Paris Diderot, Sorbonne Paris Cite, Molecules
Therapeutiques In Silico, Inserm UMR-S 973, 35 rue Helene Brion, 75013 Paris,
France
3Faculty of Pharmaceutical Sciences, Josai University, Japan
Intellectual
disability (ID) is a disease which is characterized by significant limitations
in cognitive abilities and social/behavioral adaptive skills. It is one of the
primary reasons for pediatric, neurologic, and genetic referrals. Particularly,
with respect to the protein-encoding genes on the X chromosome, it was shown
that approximately 10% of them have been implicated in ID, and the
corresponding ID is termed X-linked ID (XLID). Although the numbers of
mutations and reported families are small and XLID is a rare disease,
collectively the impact of XLID is significant, because the patients almost
always cannot fully participate in society. Here we report our findings on the
effects of missense mutations on the wild-type properties of proteins and protein
complexes involved in XLID. Using various in silico methods we reveal the
molecular mechanism of XLID for cases involving proteins with available 3D
structure. The 3D structures were used to predict the effect of disease-causing
missense mutations on the folding free energy, conformational dynamics,
hydrogen bond network and, if appropriate, on protein binding free energy. It
is shown that the vast majority of XLID mutation sites are outside the active
pocket and are accessible from the water phase, which provides the opportunity
to alter their effect by binding appropriate small molecules in the
vicinity of the mutation site. This observation is used to demonstrate,
computationally and experimentally, that a particular case, the Snyder-Robinson
Syndrome causing G56S spermine synthase mutation, can be rescued by small
molecule binding.
S7-2
Comparative Genomics Revealed
General Evolutionary Trends of Insulin
Elbashir Abbas1, Junbeom
Kim1, Yan Zhang2, Luonen Chen2, Ho-Jin Choi1
1Knowledge Engineering
and Collective Intelligence Lab. (KECI), Dept. of Computer Science, Korea
Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Korea, 2Key
Laboratory of Systems Biology, Shanghai Institutes for Biological
Sciences (SIBS), Chinese Academy of Sciences, Shanghai 200233, China
Since
its discovery, the hormone insulin has been associated with several diseases
that plague man. The most famous of these is diabetes mellitus. As of last year
346 million people worldwide have been diagnosed with diabetes. No permanent
treatment exists, and 80% of deaths are due to an inability to obtain
chronic treatment. Previous studies have not thoroughly attempted to identify
the origins of insulin, and with the recent discoveries and advances in
available data it is possible to perform such a study and determine the
evolution of this peptide. In addition, comparative studies have identified an
overlooked aspect of insulin that has not been thoroughly investigated,
namely the new properties attributed to C-peptide, a subunit of the precursor
of insulin. In this paper we present a comparative study between vertebrates
and invertebrates with regards to the insulin precursor and insulin receptor.
Our goal is to determine insulin origins and evolution across vertebrates and
invertebrates by performing a comparative study of the insulin precursor and
receptor in these species. Phylogenetic trees were constructed to visualize and
determine the level of conservation of proinsulin and c-peptide and their
respective distribution across different vertebrates. We have determined that
both vertebrates and invertebrates contain insulin or insulin-like proteins;
however, their number may differ, the coding patterns differ and the physical
composition of C-peptide differs. Also, the interacting insulin and insulin
receptor residues found in both species classes show that some are conserved
among both, but the majority are different. Further work is required to expand
on the results acquired and add to the insights gained.
S7-3
An Information-Gain Approach to Detecting
Three-Way Epistatic Interactions in Genetic Association Studies
Ting Hu1, Yuanzhu Chen1,2,
Jeff W. Kiralis1, Ryan L. Collins1, Christian Wejse3,
Giorgio Sirugo4, Scott M. Williams1,5 and Jason H. Moore1,5
1Department of
Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
2Department of Computer Science, Memorial University, St. John's,
NL, Canada
3Center for Global Health, School of Public Health, Aarhus
University, Skejby, Denmark
4Centro di Genetica, Centro di Ricerca Scientifica, Ospedale San
Pietro FBF, Rome, Italy
5Institute for Quantitative Biomedical Sciences, Dartmouth College,
Hanover, NH, USA
Epistasis has historically been used to describe the phenomenon that the effect of a
given gene on a phenotype can be dependent on one or more other genes, and is
an essential element for understanding the association between genetic and
phenotypic variations. Quantifying epistasis of orders higher than two is very
challenging due to both the computational complexity of enumerating all
possible combinations in genome-wide data and the lack of efficient and
effective methodologies. In this study, we propose a fast, non-parametric, and
model-free measure for three-way epistasis using information gain. It is able
to separate all lower-order effects from pure three-way epistasis. Our method
was verified on synthetic data and applied to real data from a candidate-gene
study of tuberculosis (TB) in a West African population. In the TB data, we
found a statistically significant pure three-way epistatic interaction effect
that was stronger than any lower-order associations. Our study provides a
methodological basis for detecting and characterizing high-order gene-gene
interactions in genetic association studies.
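The idea of separating lower-order effects from pure three-way epistasis can be illustrated with a small sketch. The abstract does not give the formula, so the definition used here is an assumption: the three-way information gain is taken as the mutual information between the three attributes jointly and the phenotype, minus all pairwise information gains and all single-attribute mutual informations. The function names and the toy XOR data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(columns):
    """Shannon entropy (bits) of the joint distribution of the given columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_info(xs, d):
    """I(X1..Xk; D) = H(X1..Xk) + H(D) - H(X1..Xk, D)."""
    return entropy(xs) + entropy([d]) - entropy(xs + [d])


def ig_pair(a, b, d):
    """Pairwise information gain: joint information minus the two main effects."""
    return mutual_info([a, b], d) - mutual_info([a], d) - mutual_info([b], d)


def ig_triple(a, b, c, d):
    """Assumed three-way gain: joint information minus all lower-order terms."""
    lower_order = (
        ig_pair(a, b, d) + ig_pair(a, c, d) + ig_pair(b, c, d)
        + mutual_info([a], d) + mutual_info([b], d) + mutual_info([c], d)
    )
    return mutual_info([a, b, c], d) - lower_order


# Toy data: the phenotype is the XOR of three binary attributes, so no single
# attribute or pair carries information about it on its own.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
d_xor = [x ^ y ^ z for x, y, z in zip(a, b, c)]

print(ig_triple(a, b, c, d_xor))
```

On the XOR toy data all main effects and pairwise gains vanish, so the whole bit of phenotype information is attributed to the three-way term; a phenotype that simply copies one attribute yields a three-way gain of zero.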
S7-4
Rare Variant Analysis Using Publicly
Available Biological Knowledge
Carrie B. Moore1,2, John
R. Wallace2, Alex T. Frase2, Sarah A. Pendergrass2,
Marylyn D. Ritchie2
1Center for Human
Genetics Research, Vanderbilt University, Nashville, TN 37232, USA,
2Center for Systems Genomics, Pennsylvania State University,
University Park, PA 16802, USA
With
the recent flood of genome sequence data, there has been increasing interest in
rare variants and methods to detect their association with disease. We developed
a flexible collapsing method inspired by biological knowledge called BioBin. We
also built the Library of Knowledge Integration (LOKI), a repository of data
assembled from public databases, which contains resources such as the National
Center for Biotechnology Information (NCBI) dbSNP and Entrez Gene databases,
Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Gene Ontology (GO),
Protein families database (Pfam), NetPath signal transduction pathways,
Molecular INTeraction database (MINT), Biological General Repository for
Interaction Datasets (BioGrid), Pharmacogenomics Knowledge Base (PharmGKB),
Open Regulatory Annotation Database (ORegAnno), and information from the UCSC
Genome Browser about evolutionarily conserved regions (ECRs). BioBin can apply
multiple levels of burden testing, including functional regions, evolutionarily
conserved regions, genes, and/or pathways. We tested BioBin using simulated
data as well as with low coverage data from the 1000 Genomes Project to
evaluate bins with simulated causative variants and conducted a pairwise
comparison of rare variant (MAF < 0.03) burden differences between Yoruba
individuals (YRI) and individuals of European descent (CEU). Lastly, we
analyzed the NHLBI GO Exome Sequencing Project Kabuki dataset, with sequence data
from individuals with Kabuki syndrome, a congenital disorder that affects multiple
organs and often involves intellectual disability, contrasted with 1000 Genomes
data as controls. BioBin is proving to be a very useful and flexible tool to analyze
sequence data and uncover novel associations with complex disease.
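A collapsing (burden) step of the kind described above can be sketched as follows. This is not BioBin's implementation: the data layout, the function name `collapse_into_bins` and the toy variants are assumptions made for illustration, and the bin assignment that BioBin would derive from LOKI is replaced by a hand-written dictionary. Only the rare-variant cutoff (MAF < 0.03) comes from the abstract.

```python
from collections import defaultdict

MAF_THRESHOLD = 0.03  # rare-variant cutoff quoted in the abstract


def collapse_into_bins(variants, genotypes, maf_threshold=MAF_THRESHOLD):
    """Sum rare minor-allele counts per bin and per sample.

    variants:  {variant_id: (bin_name, minor_allele_frequency)}
    genotypes: {sample_id: {variant_id: minor-allele count (0, 1 or 2)}}
    Returns    {bin_name: {sample_id: rare-allele burden}}
    """
    burden = defaultdict(lambda: defaultdict(int))
    for sample, calls in genotypes.items():
        for variant_id, allele_count in calls.items():
            bin_name, maf = variants[variant_id]
            if maf < maf_threshold:  # common variants do not contribute
                burden[bin_name][sample] += allele_count
    return {name: dict(per_sample) for name, per_sample in burden.items()}


# Hypothetical toy input: bins here are genes, but the same code works for
# pathways or conserved regions if the bin names are chosen accordingly.
variants = {
    "v1": ("GENE_A", 0.01),
    "v2": ("GENE_A", 0.20),  # common, ignored
    "v3": ("GENE_B", 0.02),
}
genotypes = {
    "s1": {"v1": 1, "v2": 2, "v3": 0},
    "s2": {"v1": 0, "v2": 1, "v3": 2},
}

print(collapse_into_bins(variants, genotypes))
```

The per-bin burdens could then be compared between groups (for example cases and controls) with whatever association test is appropriate for the study.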
S8-1
Personalized Chemotherapy for Ovarian
Cancer by Integrating Genomic Data with Clinical Data
Youngchul Kim1, Kian
Behbakht2, Jennifer R. Diamond2, Dan Theodorescu2,
Jae K. Lee1
1Department of Public
Health Sciences, University of Virginia, PO Box 800717, Charlottesville, VA
22908, USA 2University of Colorado Cancer Center, University of
Colorado Denver, Box 8117, Aurora, CO 80045, USA
Despite
multiple standard chemotherapy drugs and novel agents, the overall therapeutic
response of advanced Epithelial Ovarian Cancer (EOC) patients has been stagnant
over the last two decades. Aggressive tumors such as EOC are highly
heterogeneous in their therapeutic responses, so overall therapeutic responses
are not likely to be improved much if used without selection. Previous
biomarker studies of drug response were limited as it was difficult to develop
single drug predictors based on patients treated with multiple drugs.
Additionally, outcomes were often confounded with other factors beyond given
therapies. By directly combining patients' therapeutic outcome information with
the COXEN algorithm based on each drug's cell line activity data, we have
developed integrated predictors of three standard chemotherapy drugs in treating
EOC: paclitaxel, cyclophosphamide, and topotecan. Our integrated COXEN
predictors of the three drugs demonstrated high predictability simultaneously
on patients' short-term therapeutic responses and long-term survival outcomes.
In particular, had the three drug predictors been used for a
historical patient cohort, overall survival and progression-free survival of
the cohort would have been prolonged by more than one year and by more than
five months, respectively. When examined for patients with recurrent disease,
overall survival was improved by more than 21 months. While the current study
remains a demonstration of analytic potential, because sample sizes were
relatively small for rigorous evaluation of some of these predictors, it has
shown the possibility that overall therapeutic response and outcome can be
dramatically improved by optimally utilizing these integrated predictors for
individual patients with EOC.
S8-2
The Role of Genetic Heterogeneity
and Epistasis in Bladder Cancer Susceptibility and Outcome: A Learning
Classifier System Approach
Ryan J. Urbanowicz1,
Angeline S. Andrew1, Margaret R. Karagas1, Jason H. Moore1
1Geisel School of
Medicine, Dartmouth College, 1 Medical Center Dr., Lebanon, NH 03756
Detecting
complex patterns of association between genetic or environmental risk factors
and disease risk has become an important target for epidemiological research.
In particular, strategies that accommodate multifactor interactions or
heterogeneous patterns of association can offer new insights in association
studies wherein traditional analytic tools have had limited success. In an
effort to concurrently address these phenomena, previous work has successfully
considered the application of learning classifier systems (LCSs), a flexible
class of evolutionary algorithms that distributes learned associations over a
population of rules. Subsequent work addressed the inherent problems of
knowledge discovery and interpretation within these algorithms, allowing for
the characterization of heterogeneous patterns of association. While these
previous advancements were evaluated using complex simulation studies, this
study applied these collective works to a real world genetic epidemiology study
of bladder cancer susceptibility. Notably, we replicated the identification of
previously characterized factors that modify bladder cancer risk: i.e. single
nucleotide polymorphisms (SNPs) from a DNA repair gene, and smoking.
Furthermore, we identified potentially heterogeneous groups of subjects
characterized by distinct patterns of association. Cox proportional hazard
models comparing clinical outcome variables between the cases of the two
largest groups yielded a significant, meaningful difference in survivorship. A
marginally significant difference in time to recurrence was also noted. These
results support the hypothesis that an LCS approach can offer greater insight
into complex patterns of association. This methodology appears to be well
suited to the dissection of disease heterogeneity, a key component in the advancement
of personalized medicine.
S8-3
Multiclass cancer classification
using gene expression comparisons
Sitan Yang1 and Daniel
Q. Naiman2
1,2Applied Mathematics
and Statistics Department, Johns Hopkins University, Baltimore, Maryland 21218,
USA
As
our knowledge of cancer has grown, its heterogeneous nature has become
increasingly apparent, and there has been an accompanying tendency to identify
and differentiate various cancer subtypes. In this situation, microarray-based
cancer classification poses new methodological and computational challenges,
and the identification of novel and effective approaches to multiclass
classification deserves greater attention. While cancer classification has
achieved considerable success in binary problems, the situation for multiclass
problems is not as clear. In this paper, we introduce a new approach to
multiclass cancer diagnosis based on gene expression profiles. Our method
focuses on detecting a small set of genes whose expression levels have
significant changes relative to each other from class to class. For a k-class
problem, the decision rule only depends on the relative orderings of expression
values of k genes and is transparent enough to be immediately explored for
biological discoveries. We demonstrate on five cancer datasets that our method,
while simple, is as powerful as many popular but complex classifiers.
Furthermore, we show that the decision rules built on these datasets involve
some informative genes that are known to have biological relevance for some
cancer types, which may help us understand their potential mechanisms.
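A decision rule that depends only on the relative ordering of k genes can be sketched in a few lines. The abstract does not state the exact form of the rule, so the version below is an assumption chosen for illustration: each class is paired with one gene, and a sample is assigned to the class whose gene has the highest expression among the k genes. The gene and class names are hypothetical.

```python
def predict_by_ordering(expression, gene_for_class):
    """Assign the class whose paired gene is highest among the k genes.

    expression:     {gene_name: expression value} for one sample
    gene_for_class: {class_label: gene_name}, one gene per class
    """
    return max(gene_for_class, key=lambda label: expression[gene_for_class[label]])


# Hypothetical three-class example with one marker gene per class.
gene_for_class = {"subtype_A": "g1", "subtype_B": "g2", "subtype_C": "g3"}
sample = {"g1": 2.0, "g2": 7.5, "g3": 3.1, "g4": 9.9}  # g4 is not in the rule

print(predict_by_ordering(sample, gene_for_class))
```

Because only the ordering matters, any monotone increasing transformation of the expression values (for example a log transform or a rescaling) leaves the prediction unchanged, which is one reason such rules are easy to interpret and to carry across platforms.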
S8-4
Curation-Free Biomodules Mechanisms
in Prostate Cancer Predict Recurrent Disease
James L. Chen1,
Alexander Hsu1,2, Xinan Yang1, Jianrong Li2, Gurunadh
Parinandi2, Haiquan Li2, Yves A. Lussier1,2,3
1Ctr for Biomed.
Informatics and Dept. of Medicine, The University of Chicago, Chicago, IL
2Depts of Medicine & of Bioengineering, University of Illinois
at Chicago, Chicago, IL
3University of Illinois Hospital and Health Science System
Motivation:
Gene expression-based prostate cancer gene signatures of poor prognosis are
hampered by lack of gene feature reproducibility and a lack of
understandability of their function. Molecular pathway-level mechanisms are
intrinsically more stable and more robust than an individual gene. The
Functional Analysis of Individual Microarray Expression (FAIME) we developed
allows distinctive sample-level pathway measurements with utility for
correlation with continuous phenotypes (e.g. survival). Further, we and others
have previously demonstrated that pathway-level classifiers can be as accurate
as gene-level classifiers using curated genesets that may implicitly comprise
ascertainment biases (e.g. KEGG, GO). Here, we hypothesized that transformation
of individual prostate cancer patient gene expression to pathway-level
mechanisms derived from automated high throughput analyses of genomic datasets
may also permit personalized pathway analysis and improve prognosis of
recurrent disease.
Results:
Via FAIME, three independent prostate cancer gene expression arrays with both
normal and tumor samples were transformed into two distinct types of molecular
pathway mechanisms and then compared: (i) the curated Gene Ontology (GO) and
(ii) dynamic expression activity networks of cancer (Cancer Modules).
FAIME-derived mechanisms for tumorigenesis were then identified. Curated GO and
computationally generated "Cancer Module" mechanisms overlap significantly,
are enriched for known oncogenic deregulations and highlight potential areas of
investigation. We further show in two independent datasets that these
pathway-level tumorigenesis mechanisms can identify men who are more likely to
develop recurrent prostate cancer (log-rank p = 0.019 and 0.04, respectively).
S9-1
Comparison and Validation of
Genomic Predictors for Anticancer Drug Sensitivity
Simon Papillon-Cavanagh1,
Nicolas De Jay1, Nehme Hachem1, Catharina Olsen2,
Gianluca Bontempi2, Hugo Aerts3, John Quackenbush4,
Benjamin Haibe-Kains1
1Bioinformatics and
Computational Genomics Laboratory, Institut de recherches cliniques de
Montreal, University of Montreal, Montreal, Quebec, Canada
2Machine Learning Group, Universite Libre de Bruxelles, Bruxelles,
Belgium
3Department of Radiation Oncology and 4Department of Biostatistics
and Computational Biology, Dana-Farber Cancer Institute, Harvard
University, Boston, MA, USA
An
enduring challenge in personalized medicine lies in selecting the right drug
for each individual patient. While direct testing of drugs on patients is the
only way to assess their clinical efficacy and toxicity, we severely lack the
resources to test the hundreds of drugs that are currently under development.
Therefore the use of preclinical model systems has been intensively
investigated, as this approach makes it possible to test the response to
hundreds of drugs in multiple cell lines in parallel. Recently two large-scale pharmacogenomic
studies screened multiple anticancer drugs on more than 1000 cell lines. Here
we propose to combine these datasets to build and robustly validate genomic
predictors of drug response. We compared five different approaches for building
predictors of increasing complexity. We assessed their performance in
cross-validation and in two large validation sets, one containing the same cell
lines present in the training set and another dataset composed of cell lines
that have never been used during the training phase. Sixteen drugs were found
in common between the datasets. We were able to validate multivariate
predictors for four out of the sixteen tested drugs, namely Irinotecan,
PD-0325901, PLX4720 and Lapatinib. Moreover, we observed that response to
17-AAG, an inhibitor of Hsp90, could be efficiently predicted by the expression
level of a single gene, NQO1. Altogether these results suggest that predictors
could be robustly validated for specific drugs. If successfully validated in
patients' tumor cells, and subsequently in clinical trials, they could act as
companion tests for the corresponding drugs and play an important role in personalized
medicine.
S9-2
Improve Binding Affinity by Twin
Adhesive Drugs Mined in-between Docking Bio-mimicry Omega-shape Nona-peptide
Agretope on HLA-1 Pit
Chun-Fan Chang1,
Chen-Chieh Fan2,3, Hsueh-Ting Chu4 and Cheng-Yan Kao2
1Department of Animal
Science, Chinese Culture University, Taipei 11114, Taiwan;
2Department of Computer Science and Information Engineering,
National Taiwan University, Taipei 10617, Taiwan; and 3ENT Division,
National Taiwan University Hospital, Taipei 10002, Taiwan.
4Department of Computer Science and Information Engineering, Asia University,
Taichung 41354, Taiwan.
Motivation:
The oncogenesis process of nasopharyngeal carcinoma (NPC) may confer a
proliferation advantage and immune evasion in overcoming efficient host immune
clearance mechanisms against Epstein-Barr virus (EBV). The proliferation
advantage is likely from encoding EBV latent infection phase membrane protein 1
(LMP1) and the immune evasion is likely from mutating the EBV genome for poor
immune reactivity at AMI-antigen epitopes and CMI-antigen epitopes/agretopes of
LMP1/LMP2 and EBNA upon class I human leukocyte antigen (HLA-1). In this work,
we developed a structure-based immunoinformatic tool for designing EBV-LMP1
related omega-shape nona-peptides (LMP1np) that dock into the HLA-1 pit, with
the aim of mining twin adhesive drugs (TAD) with improved binding affinity (BAff).
Results:
Our implemented bio-mimicry peptide design algorithm tool (bmPDA tool) designs
nona-peptide structures with bulge-side epitope and anchor-side agretope from
LMP-1 and NLMP-1 segments for docking HLA-1 of A*0201 and A*0207. The design
efficiency of bio-mimicry peptide design by the bmPDA tool is demonstrated with a
preliminary reference nona-peptide structure of the vasopressin protein. The
binding affinity (BAff) between the putative agretope and the verified HLA-1 pit
shows notable weakening, consistent with likely immune evasion, in the cases of
A*0207 and NLMP1 at initial amino acid positions 32, 35, 86, 92, 125, 147, and
166. For these cases, our algorithm mines twin adhesive drugs (TAD) from the
FDA-approved list, exemplified by Nizatidine, Benzonatate, Entecavir,
Famotidine, and Alprostadil, for improving
BAff between A*0207 pit and weak agretope of NLMP1np structures.
S9-3
Altering Physiological Networks
using Drugs: Steps towards Personalized Physiology
Adam D Grossman, PhD1,
Mitchell J Cohen, MD2, Geoffrey T Manley, MD, PhD3, Atul
J Butte, MD, PhD4
1Department of
Bioengineering, Stanford University, Stanford, CA, USA
2Department of Surgery, University of California San Francisco, San
Francisco, CA, USA
3Department of Neurosurgery, University of California San Francisco,
San Francisco, CA, USA
4Department of Pediatrics and the Department of Medicine, Stanford
University School of Medicine, Stanford, CA, and Lucile Packard Children's
Hospital, Palo Alto, CA, USA.
The
rise of personalized medicine has reminded us that each patient must be treated
as an individual. One factor in making treatment decisions is the physiological
state of each patient, but definitions of relevant states and methods to
visualize state-related physiologic changes are scarce. We constructed
correlation networks from physiologic data to demonstrate changes associated
with pressor use in the intensive care unit. We collected 29 physiological
variables at one-minute intervals from nineteen trauma patients in the
intensive care unit of an academic hospital and grouped each minute of data as
receiving or not receiving pressors. For each group we constructed Spearman
correlation networks of pairs of physiologic variables. To visualize
drug-associated changes we split the networks into three components: an
unchanging network, a network of connections with changing correlation sign,
and a network of connections only present in one group. Out of a possible 406
connections between the 29 physiological measures, 64, 39, and 48 were present
in the three component networks, respectively. The static network confirms expected
physiological relationships while the network of associations with changed
correlation sign suggests putative changes due to the drugs. The network of
associations present only with pressors suggests new relationships that could
be worthy of study. We demonstrated that visualizing physiological
relationships using correlation networks provides insight into underlying
physiologic states while also showing that many of these relationships change
when the state is defined by the presence of drugs. This method applied to
targeted experiments could change the way critical care patients are monitored
and treated.
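The network construction and three-way split described in this abstract can be sketched as follows. The abstract does not say how an edge is defined, so the rule used here (keep a pair when the absolute Spearman correlation reaches a threshold) is an assumption, as are the function names, the threshold value and the toy measurements. The 406 possible connections quoted above are simply the number of unordered pairs of 29 variables.

```python
from itertools import combinations
from math import comb, sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def network(data, threshold):
    """Edges {(u, v): rho} for variable pairs with |rho| >= threshold (assumed rule)."""
    edges = {}
    for u, v in combinations(sorted(data), 2):
        rho = spearman(data[u], data[v])
        if abs(rho) >= threshold:
            edges[(u, v)] = rho
    return edges


def split_networks(net_a, net_b):
    """Split into unchanged, sign-changed and one-group-only edge sets."""
    shared = net_a.keys() & net_b.keys()
    unchanged = {e for e in shared if net_a[e] * net_b[e] > 0}
    sign_changed = shared - unchanged
    only_one = net_a.keys() ^ net_b.keys()
    return unchanged, sign_changed, only_one


# Hypothetical toy measurements for minutes without and with pressors.
no_pressor = {"hr": [1, 2, 3, 4, 5], "map": [2, 4, 6, 8, 10],
              "spo2": [5, 4, 3, 2, 1], "temp": [1, 3, 2, 5, 4]}
pressor = {"hr": [1, 2, 3, 4, 5], "map": [10, 8, 6, 4, 2],
           "spo2": [5, 4, 3, 2, 1], "temp": [1, 2, 3, 4, 5]}

print(comb(29, 2))  # number of possible connections among 29 variables
print(split_networks(network(no_pressor, 0.9), network(pressor, 0.9)))
```

In practice the edge rule would more likely be based on a significance test with multiple-testing correction than on a fixed correlation cutoff; the fixed threshold is used here only to keep the sketch short.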
S9-4
Compensating for Literature
Annotation Bias when Predicting Novel Drug-Disease Relationships through
Medical Subject Heading Over-representation Profile (MeSHOP) Similarity
Warren A. Cheung1,2, BF
Francis Ouellette3,4, Wyeth W. Wasserman1,5
1Centre for Molecular
Medicine and Therapeutics at the Child and Family Research Institute,
University of British Columbia, Vancouver, BC, Canada, 2Bioinformatics
Graduate Program, University of British Columbia, Vancouver, BC, Canada, 3Ontario
Institute for Cancer Research, Toronto, ON, Canada, 4Department of
Cells and Systems Biology, University of Toronto, Toronto, ON, Canada, 5Department
of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Medical
Subject Heading Overrepresentation Profiles (MeSHOPs) quantitatively summarise
the literature associated with biological entities such as diseases or drugs. A
profile is constructed by counting the number of times each MeSH term is
assigned to an entity-related research publication in the MEDLINE/PUBMED
database and calculating the significance of the count relative to a background
expectation. Based on the expectation that drugs suitable for treatment of a
disease (or disease symptom) will have similar annotation properties to the
disease, we successfully predict drug-disease associations by comparing MeSHOPs
of diseases and drugs. The MeSHOP comparison approach delivers an 11%
improvement over bibliometric baselines. However, novel drug-disease
associations are observed to be biased towards drugs and diseases with more
publications. To account for the annotation biases, a correction procedure is
introduced and evaluated. By explicitly accounting for the annotation bias,
unexpectedly similar drug-disease pairs are highlighted as candidates for drug
repositioning research.
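The profile construction described above (count each MeSH term over an entity's publications, then score the count against a background expectation) can be sketched as follows. The abstract does not name the statistical test, so the hypergeometric upper tail used here is an assumption, and the function names, term names and counts are hypothetical.

```python
from math import comb


def overrepresentation_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: publications in the background, K: background publications with the term,
    n: publications linked to the entity, k: of those, how many carry the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def build_profile(entity_term_counts, background_term_counts, n_entity, n_total):
    """Map each MeSH term seen for the entity to its over-representation p-value."""
    return {
        term: overrepresentation_p(k, n_entity, background_term_counts[term], n_total)
        for term, k in entity_term_counts.items()
    }


# Hypothetical toy numbers: 10 background papers, 3 of them about the entity.
background = {"Term A": 4, "Term B": 9}
entity = {"Term A": 2, "Term B": 3}
profile = build_profile(entity, background, n_entity=3, n_total=10)

print(profile)
```

Two such profiles (one for a drug, one for a disease) could then be compared with any vector similarity measure; which measure the authors used, and how their bias correction works, is not stated in the abstract.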
S10-1
Detection of Pleiotropy through a
Phenome-Wide Association Study (PheWAS) in the National Health and Nutrition
Examination Surveys (NHANES)
M.A. Hall1, A. Verma1,
K.D. Brown-Gentry2, R. Goodloe2, J. Boston2,
S. Wilson2, B. McClellan2, C. Sutcliffe2, H.H.
Dilks2,3, N.B. Gillani2, H. Jin2, P. Mayo2, M.
Allen2, N. Schnetz-Boutaud2, D.C. Crawford2,3,
M.D. Ritchie1, S.A. Pendergrass1
1Center for Systems
Genomics, Department of Biochemistry and Molecular Biology, The Huck Institutes
of the Life Sciences, The Pennsylvania State University, University Park, PA,
USA;
2Center for Human Genetics Research, 3Department of
Molecular Physiology and Biophysics, Vanderbilt University, Nashville TN, USA
Herein
we describe the results of a Phenome-wide association study (PheWAS) utilizing
the diverse genotypic and phenotypic data that exist for multiple
race-ethnicities in the National Health and Nutrition Examination Surveys
(NHANES), conducted by the Centers for Disease Control and Prevention (CDC) and
accessed by the Epidemiological Architecture for Genes Linked to Environment
(EAGLE) study. PheWAS is a novel approach for discovering the complex
mechanisms involved in human disease by testing SNPs for association with a
large and diverse set of phenotypes. Comprehensive unadjusted tests of
association were performed in NHANES III and NHANES 1999-2002 for 575 SNPs with
1009 phenotypes stratified by race-ethnicity. We identified 51 PheWAS
associations that were consistent between the two surveys for the same SNP,
phenotype-class, direction of effect, and race-ethnicity with p<0.01, allele
frequency > 0.01, and sample size > 200. Of these, 28 replicated
previously reported SNP-phenotype associations, 9 were related to previously
reported associations in the literature, and 14 were novel SNP-phenotype
associations. We also identified SNPs associated with multiple novel
phenotypes. These results demonstrate the utility of phenome-wide association
studies for exploring associations between genetic variation and phenotypic
variation in a high throughput and comprehensive manner using existing
epidemiologic study data. The results of PheWAS promise to expose more of the
genetic architecture underlying multiple traits and generate hypotheses about
pleiotropic interactions for future research.
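The replication criteria stated in this abstract (same SNP, phenotype class, direction of effect and race-ethnicity in both surveys, with p < 0.01, allele frequency > 0.01 and sample size > 200) translate directly into a filter. The record layout, field names, function names and toy results below are assumptions for illustration; only the thresholds come from the abstract.

```python
P_MAX = 0.01    # p-value must be below this
AF_MIN = 0.01   # allele frequency must exceed this
N_MIN = 200     # sample size must exceed this


def passes_filters(result):
    """Apply the per-survey thresholds to one association result."""
    return result["p"] < P_MAX and result["af"] > AF_MIN and result["n"] > N_MIN


def consistent_associations(survey_a, survey_b):
    """Keys present in both surveys that pass all filters with the same effect sign.

    Each survey maps (snp, phenotype_class, race_ethnicity) to a dict with
    the fields p, af, n and beta (effect estimate).
    """
    hits = []
    for key in sorted(survey_a.keys() & survey_b.keys()):
        a, b = survey_a[key], survey_b[key]
        same_direction = a["beta"] * b["beta"] > 0
        if passes_filters(a) and passes_filters(b) and same_direction:
            hits.append(key)
    return hits


# Hypothetical toy results for two surveys.
first_survey = {
    ("snpA", "lipids", "group1"): {"p": 0.001, "af": 0.20, "n": 900, "beta": 0.30},
    ("snpB", "glucose", "group1"): {"p": 0.004, "af": 0.10, "n": 500, "beta": -0.20},
    ("snpC", "bmi", "group2"): {"p": 0.002, "af": 0.30, "n": 150, "beta": 0.10},
}
second_survey = {
    ("snpA", "lipids", "group1"): {"p": 0.005, "af": 0.21, "n": 700, "beta": 0.25},
    ("snpB", "glucose", "group1"): {"p": 0.003, "af": 0.12, "n": 450, "beta": 0.15},
    ("snpC", "bmi", "group2"): {"p": 0.001, "af": 0.29, "n": 400, "beta": 0.20},
}

print(consistent_associations(first_survey, second_survey))
```

In the toy data only the first association survives: the second changes direction between surveys and the third falls below the sample-size cutoff in one survey. The reported breakdown of the 51 consistent associations (28 replications, 9 related, 14 novel) sums as expected.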
S10-2
Analysis of Type 2 Diabetes GWAS
Dataset using Expanded Gene Set Enrichment Analysis and Protein-Protein
Interaction Network
Chiyong Kang1, Hyeji Yu1, Gwan-Su Yi1
1Department of Bio and
Brain Engineering, KAIST, Daejeon 305701, Korea
Genome-wide
association studies (GWAS) have identified approximately 40 type 2
diabetes (T2D) associated SNPs. However, only a small fraction of the T2D genetic
risk is explained by the identified T2D associated SNPs. While pathway enrichment
analysis that considers multiple SNPs has been suggested as a way to reveal the
mechanisms of complex diseases, pathway gene sets cover only a small portion of
human genes. For a better understanding of the biological mechanisms of T2D and
for T2D causal gene detection, we propose enrichment analysis with expanded gene
sets and mapping of GWAS-based T2D associated genes onto a protein-protein
interaction (PPI) network. Gene set enrichment analysis (GSEA) is applied to the
WTCCC T2D GWAS dataset with expanded gene sets including pathway, function,
TF-target, miRNA-target and complex. From the expanded GSEA, 451 T2D associated
gene sets are detected with p-value < 0.05, and 441 of these 451 gene sets
contain known T2D genes. To find novel T2D gene candidates, 64 GWAS-based T2D
associated genes, derived from 2,960 SNPs with a p-value threshold of 0.05 in
the WTCCC T2D GWAS dataset, are mapped onto an integrated PPI network, and a
total of 24 novel T2D gene candidates are detected. Among the detected T2D gene
candidates, GBR2 is the gene most strongly associated with T2D. Expanded GSEA
and PPI mapping of GWAS-based T2D associated genes showed the possibility of
providing insights into T2D mechanisms
and detecting novel T2D gene candidates.
S10-3
Integrative Analysis of Congenital
Muscular Torticollis: from Gene Expression to Clinical Indication
Shin-Young Yim, MD, PhD1,
Dukyong Yoon, MD, MS2, Myong Chul Park, MD, PhD3, Il Jae
Lee, MD, PhD3, Jang-Hee Kim, MD, MS4, Myung Ae Lee, PhD5,
Kyu-Sung Kwack, MD, PhD6, Jan-Dee Lee, MD, PhD7,
Euy-Young Soh, MD, PhD8, Young-In Na, MS9, Rae Woong
Park, MD, PhD2, KiYoung Lee, PhD2, and Jae-Bum Jun, MD,
PhD9
1The Center for
Torticollis, Department of Physical Medicine and Rehabilitation, Ajou
University School of Medicine, Suwon, Republic of Korea
2Department of Biomedical Informatics, Ajou University School of
Medicine, Suwon, Republic of Korea
3Department of Plastic and Reconstructive Surgery, Ajou University
School of Medicine, Suwon, Republic of Korea
4Department of Pathology, Ajou University School of Medicine, Suwon,
Republic of Korea
5Brain Disease Research Center, Ajou University School of Medicine,
Suwon, Republic of Korea
6Department of Radiology, Ajou University School of Medicine, Suwon,
Republic of Korea
7Department of Surgery, Eulji General Hospital, Seoul, Republic of
Korea
8Department of Surgery, Ajou University School of Medicine, Suwon,
Republic of Korea
9Department of Rheumatology, The Hospital for Rheumatic Diseases,
Hanyang University College of Medicine, Seoul, Republic of Korea
Congenital
muscular torticollis (CMT) is characterized by thickening and/or tightness of
the unilateral sternocleidomastoid muscle (SCM), resulting in torticollis.
Our aim was to discover differentially expressed genes (DEGs) and novel protein
interaction network modules of CMT, and to examine the relationship between
gene expression and either the clinical severity of CMT or the expression of
proteins encoded by the DEGs. Twenty-three SCMs from CMT patients and 5 normal
SCMs were allocated for microarray, MRI, or immunohistochemical studies. We
identified 269 genes as DEGs in CMT. Gene ontology enrichment analysis
revealed that the DEGs mainly function in the extracellular region part
during developmental processes. Five CMT-related protein network modules were
identified, which showed that the important pathway is fibrosis related to
collagen and elastin fibrillogenesis, with evidence of a DNA repair mechanism.
The expression levels of some meaningful DEGs correlated well with the
preoperative MRI color intensities of CMT, indicating clinical severity.
Moreover, the expression of proteins encoded by the DEGs confirmed the
differential gene expression in CMT. We provide an integrative analysis of CMT
from gene expression to clinical indication, which showed good correlation with
the clinical severity of CMT. Furthermore, the CMT-related protein network
modules identified here provide a more in-depth understanding of the
pathophysiology of CMT.
S10-4
Detecting early-warning signals of
type 1 diabetes and its leading biomolecular networks by dynamical network
biomarkers
Xiaoping Liu1,2, Rui Liu3,4,
Xing-Ming Zhao2, Luonan Chen1,2,4
1Key Laboratory of
Systems Biology, SIBS-Novo Nordisk Translational Research Centre for
PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of
Sciences, Shanghai 200031, China;
2Institute of Systems Biology, Shanghai University, Shanghai 200444,
China;
3Department of Mathematics, South China University of Technology,
Guangzhou 510640, China;
4Collaborative Research Center for Innovative Mathematical
Modelling, Institute of Industrial Science, University of Tokyo, Tokyo
153-8505, Japan
Type
1 diabetes is a complex disease that is harmful to human health, and most
existing biomarkers measure the disease phenotype only after disease onset (or
drastic deterioration). Until now, there has been no effective biomarker that
can predict the upcoming disease (or pre-disease state) before disease onset or
deterioration. Further, the detailed molecular mechanism of such deterioration,
e.g., the driver genes or causal network of the disease, is still unclear. In
this study, we detected early-warning signals of type 1 diabetes and its leading
biomolecular networks based on serial gene expression profiles of NOD mice by
identifying a new type of biomarker, i.e., dynamical network biomarkers (DNBs),
which form a specific module marking the time period just before the drastic
deterioration of type 1 diabetes. Specifically, two dynamical network
biomarkers were obtained that signal the emergence of two critical
deteriorations of the disease and could be used to predict the upcoming sudden
changes during disease progression. We found that the two critical transitions
led to peri-insulitis and hyperglycemia in NOD mice, which is consistent with
the experimental results. Hence, the identified dynamical network biomarkers
can be used to detect the early-warning signals of type 1 diabetes and to
predict upcoming disease onset before the drastic deterioration. In addition,
we demonstrated that the leading biomolecular networks are causally related to
the initiation and progression of type 1 diabetes and provide biological
insight into its molecular mechanism. Experimental data and functional analysis
of the DNBs validated the computational results.
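The abstract does not give the criteria used to score a candidate module. As a hedged illustration only, the sketch below computes a composite index of the kind described in the published DNB literature: the average standard deviation of the module genes, multiplied by the average absolute Pearson correlation inside the module, divided by the average absolute correlation between module genes and all other genes. The function names, gene labels and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def dnb_index(expr, module):
    """Composite index for one time point.

    expr maps gene -> expression values across replicate samples;
    module is the set of candidate DNB genes (at least two).
    """
    inside = [g for g in expr if g in module]
    outside = [g for g in expr if g not in module]
    sd_in = mean([pstdev(expr[g]) for g in inside])
    pcc_in = mean([abs(pearson(expr[a], expr[b]))
                   for a, b in combinations(inside, 2)])
    pcc_out = mean([abs(pearson(expr[a], expr[b]))
                    for a in inside for b in outside])
    return sd_in * pcc_in / pcc_out


# Hypothetical replicate measurements at two time points.
normal = {"A": [4, 5, 4, 5], "B": [5, 5, 4, 4], "C": [5, 4, 6, 5]}
pre_transition = {"A": [1, 3, 5, 7], "B": [2, 4, 6, 8], "C": [5, 4, 6, 5]}

index_normal = dnb_index(normal, {"A", "B"})
index_pre = dnb_index(pre_transition, {"A", "B"})
print(index_normal, index_pre)
```

In this toy data the module genes A and B fluctuate strongly and in step at the second time point, so the index rises sharply there; a rise of this kind is what would be read as an early-warning signal.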
Creating subnetworks from
transcriptomic data on central nervous system conditions informed by a massive
transcriptomic network.
Yaping Feng1, Judith A.
Syrkin-Nikolau2, Eve S. Wurtele1
1Iowa State University,
Department of Genetics, Development and Cell Biology, Ames, IA 50011, USA
2Macalester College, MN 55105
We
use a human pairwise co-expression matrix derived from a large dataset
(>18,000 samples) of high-quality, publicly available transcriptomic data
representing relationships in gene expression across a diverse set of
biological conditions (1) as a context network to explore CNS transcriptomics.
In one approach, we derive a network from within the CNS samples, derive gene
clusters, and compare the significance of these to the clusters derived from
the larger network. In the second approach, we identify genes that characterize
individual subsets of samples from within a disease condition. Specifically,
differences in gene expression within and between two designations of glial
cancer, astrocytoma and glioblastoma, are evaluated in the context of the
broader network. Such related groups of genes, termed outlier-networks, tease
out abnormally expressed genes and the particular samples they are associated
with. This study identifies a set of 48 subnetworks of outlier genes belonging
to astrocytoma and glioblastoma. As a case study, we investigate the
relationships among the genes of a small astrocytoma-only subnetwork.
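As a rough illustration of what a pairwise co-expression network is, the sketch below links two genes when the absolute Pearson correlation of their expression profiles reaches a cutoff. The abstract does not state the similarity measure or cutoff actually used, so the Pearson measure, the 0.8 threshold, the function names and the toy profiles are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def coexpression_edges(expr, threshold=0.8):
    """Return {(gene_a, gene_b): r} for every pair with |r| >= threshold.

    expr maps gene -> expression values across the same ordered samples.
    """
    edges = {}
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical expression profiles across four samples.
profiles = {
    "G1": [1, 2, 3, 4],
    "G2": [2, 4, 6, 8],
    "G3": [4, 1, 3, 2],
}
edges = coexpression_edges(profiles)
print(edges)
```

Computing the same edges once from all samples and once from a sample subset (for example, only the CNS samples) gives two networks whose clusters can then be compared, which is the spirit of the first approach described above.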