Sequence analysis in bioinformatics: methodological and practical aspects
Abstract
My PhD research has focused on the development of new
computational methods for biological sequence analysis.
To overcome a problem intrinsic to protein sequence analysis, namely inferring
homologies in large protein databases from short queries, I
developed a BLAST-based statistical framework to detect distant homologies
conserved in the transmembrane domains of different bacterial membrane proteins.
Using this framework, transmembrane protein domains of all Salmonella spp. have
been screened and more than five thousand significant homologies have been
identified. My results show that the proposed framework detects distant homologies
that, because of their conservation in distinct bacterial membrane proteins, could
represent ancient signatures of primeval genetic elements (or mini-genes) that
coded for short polypeptides and that formed, through a primitive assembly
process, more complex genes. Furthermore, my statistical framework lays the foundation
for new bioinformatics tools for domain-oriented homology detection, that is,
tools able to find statistically significant homologies in specific target domains.
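To illustrate why short queries make significance hard to establish, the following minimal sketch computes the standard Karlin-Altschul expectation value that underlies BLAST statistics, E = K * m * n * exp(-lambda * S). It is not the framework developed in the thesis. The function names are hypothetical, the values of lambda and K are illustrative ungapped BLOSUM62 parameters assumed here, and raw sequence lengths are used in place of the effective lengths that BLAST applies.

```python
import math

# Illustrative ungapped BLOSUM62 parameters (assumed; not taken from the thesis).
LAMBDA = 0.318
K = 0.134


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lam * S).

    query_len (m) and db_len (n) are raw lengths here; BLAST itself
    uses edge-corrected effective lengths.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised bit score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def min_significant_score(query_len, db_len, e_cutoff=0.01, lam=LAMBDA, k=K):
    """Smallest integer raw score whose E-value is at or below e_cutoff."""
    return math.ceil(math.log(k * query_len * db_len / e_cutoff) / lam)


if __name__ == "__main__":
    # A query of about 20 residues, roughly one transmembrane helix,
    # searched against a hypothetical database of 10**8 residues.
    needed = min_significant_score(query_len=20, db_len=10**8)
    print("raw score needed:", needed)
    print("E-value at that score:", expect_value(needed, 20, 10**8))
```

Because a short query can only accumulate a limited raw score, the threshold returned by `min_significant_score` may be a large fraction of the best score the query can reach, which is one reason a dedicated statistical treatment is useful for short, domain-restricted searches.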
The second problem I addressed concerns the analysis of transcripts obtained
from RNA-Seq data. I developed a novel computational method that combines
transcript borders, obtained from mapped RNA-Seq reads, with operon predictions
based on sequence features to accurately infer operons in prokaryotic genomes. Since
the transcriptome of an organism is dynamic and condition-dependent, the mapped
RNA-Seq reads are used to determine a set of confirmed or predicted operons. From
this set, specific transcriptomic features are extracted and combined with standard
genomic features to train and validate three operon classification models: Random
Forests (RFs), Neural Networks (NNs), and Support Vector Machines (SVMs).
These classifiers have been used to refine the operon map annotated by DOOR,
one of the most widely used databases of prokaryotic operons. This method showed that the
integration of genomic and transcriptomic features improves the accuracy of operon
predictions, and that it is possible to predict the existence of potential new operons.
An inherent limitation of using RNA-Seq to improve operon structure predictions is
that it cannot be applied to genes that are not expressed under the conditions studied. I
evaluated my approach on different RNA-Seq-based transcriptome profiles of
Histophilus somni and Porphyromonas gingivalis. These transcriptome profiles were
obtained using either standard or strand-specific RNA-Seq. My
experimental results demonstrate that the three classifiers produced accurate operon
maps, including reliable predictions of new operons. [edited by author]
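The combination of genomic and transcriptomic features for adjacent gene pairs, as described above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the feature set used in the thesis: the `Gene` class, the function names, and the specific features (intergenic distance, strand agreement, and the ratio of intergenic to flanking read coverage) are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def mean_coverage(coverage, start, end):
    """Mean per-base read depth over the 1-based inclusive interval [start, end]."""
    if end < start:
        return 0.0  # empty interval, e.g. overlapping or abutting genes
    window = coverage[start - 1:end]
    return sum(window) / len(window) if window else 0.0


def pair_features(upstream, downstream, coverage):
    """Build a feature dictionary for one pair of adjacent genes.

    `coverage` is a list of per-base read depths for the replicon,
    as could be derived from mapped RNA-Seq reads.
    """
    cov_up = mean_coverage(coverage, upstream.start, upstream.end)
    cov_down = mean_coverage(coverage, downstream.start, downstream.end)
    cov_gap = mean_coverage(coverage, upstream.end + 1, downstream.start - 1)
    flank = min(cov_up, cov_down)
    return {
        # Genomic features
        "same_strand": upstream.strand == downstream.strand,
        "intergenic_distance": downstream.start - upstream.end - 1,
        # Transcriptomic features (hypothetical definitions)
        "gap_to_flank_coverage": cov_gap / flank if flank > 0 else 0.0,
        "both_expressed": cov_up > 0 and cov_down > 0,
    }


if __name__ == "__main__":
    # Toy replicon: two genes separated by a 5-base transcribed gap.
    depth = [10] * 10 + [8] * 5 + [10] * 10 + [0] * 5
    gene_a = Gene("geneA", 1, 10, "+")
    gene_b = Gene("geneB", 16, 25, "+")
    print(pair_features(gene_a, gene_b, depth))
```

Feature dictionaries of this kind could then be converted to numeric vectors and passed to any of the three classifier families. The `both_expressed` flag reflects the limitation noted above: when either gene is not expressed under the condition studied, the coverage-based features carry no information for that pair.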