From protein structure prediction to non-coding RNA identification
Van Du TRAN Thong
28 March 2013, 14:30 - 15:30, Room/Building: 465/PCRI-N
Abstract:
Proteins perform a variety of biological functions through their three-dimensional conformation. Non-coding RNAs, which are not translated into proteins, are also involved in several cellular processes. Determining protein structures and identifying non-coding RNAs in a genome with experimental techniques is usually difficult, expensive and time-consuming. In silico genome-wide prediction of protein structures and non-coding RNAs has therefore become an important task in the biological and medical sciences, especially with the arrival of new sequencing technologies.
For proteins, we introduce a novel graph-theoretic model for the classification and prediction of permuted super-secondary structures of a particular type of protein, namely transmembrane beta-barrel (TMB) proteins, from their amino acid sequence. The approach consists of finding the thermodynamically most stable structure, i.e. the structure of minimum energy. The structure prediction problem is thereby modeled as finding the longest cycle-attached path in a graph with respect to a given permutation. We study the NP-completeness of the problem of finding the optimal permuted super-secondary structure, then propose a dynamic programming-based algorithm to solve it in certain specific cases with Greek-key motifs.
For non-coding RNAs, we first present an ensemble method that uses boosting with weakened support vector machine (SVM) component classifiers to identify a special class of miRNA precursors, based on their intrinsic sequence and structure properties. The proposed learning-based algorithm is able to deal with imbalanced training data, which results from the much higher number of non-miRNAs than miRNAs among hairpin-like structures. We then develop two novel RNA-seq analysis methods to discover non-coding RNAs in genomes and their functions via differential expression. The first method is intended to determine clusters of co-regulated transcripts in time series experiments, while the second aims to identify variation in RNA processing between different conditions. We show an application of these methods to data sets from Arabidopsis thaliana RNA deep sequencing experiments performed under various stress conditions.
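To make the idea of boosting on imbalanced data more concrete, here is a minimal sketch in Python. It is not the method presented in the talk: decision stumps stand in for the weakened SVM component classifiers, and the class-balanced initial weights are only one assumed way of compensating for the excess of non-miRNA examples. The two features and all function names (`balanced_weights`, `train_stump`, `boost`, `predict`) are hypothetical and chosen purely for illustration.

```python
import math


def balanced_weights(y):
    """Initial weights giving each class half of the total mass (assumption)."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    return [0.5 / n_pos if label == 1 else 0.5 / n_neg for label in y]


def stump_predict(stump, x):
    """Predict +1 or -1 from a (feature index, threshold, sign) stump."""
    j, t, s = stump
    return s if x[j] >= t else -s


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for s in (1, -1):
                stump = (j, t, s)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump


def boost(X, y, rounds=10):
    """AdaBoost-style loop; returns a list of (alpha, stump) pairs."""
    w = balanced_weights(y)
    ensemble = []
    for _ in range(rounds):
        raw_err, stump = train_stump(X, y, w)
        if raw_err >= 0.5:          # weak learner no better than chance
            break
        err = max(raw_err, 1e-10)   # avoid division by zero below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        if raw_err == 0:            # training set already separated
            break
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: 2 positives (label +1) against 8 negatives (label -1).
X = [[0.90, 0.2], [0.80, 0.3], [0.10, 0.5], [0.20, 0.4], [0.30, 0.9],
     [0.15, 0.1], [0.25, 0.6], [0.05, 0.3], [0.35, 0.7], [0.30, 0.2]]
y = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
ensemble = boost(X, y)
print([predict(ensemble, x) for x in X])
```

With uniform initial weights, a classifier that labels everything as negative would already reach a weighted error of only 0.2 on this toy set; giving each class half of the initial weight raises that error to 0.5, so the weak learners are pushed to separate the rare class.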