Multi-biclustering solutions for classification and prediction problems

Nosova, Ekaterina

View/Open

doctoral thesis (6.356Mb)

Date

2011-03-19

Author

Nosova, Ekaterina

Abstract

The search for similarities in large data sets plays an important role in many scientific fields. It makes it possible to classify several types of data without explicit information about them. Unfortunately, experimental data contain noise and errors, so the main task of mathematicians is to find algorithms that analyze such data with maximal precision. In many cases researchers use methodologies such as clustering to classify data with respect to patterns or conditions. In the last few years, however, a new analysis tool, biclustering, has been proposed and applied to many specific problems. My choice of biclustering methods is motivated by the accuracy of the results and by the possibility of finding not only rows or columns that partition a dataset but also rows and columns together. In this work, two new biclustering algorithms are presented: the Combinatorial Biclustering Algorithm (CBA) and an improvement of the Possibilistic Biclustering Algorithm, called Biclustering by resampling. The first algorithm (which I call Combinatorial) is based on the direct definition of a bicluster, which makes it clear and very easy to understand. My algorithm makes it possible to control the error of the biclusters at each step, by specifying the accepted value of the error and defining the dimensions of the desired biclusters from the beginning. A comparison with other known biclustering algorithms is shown. The second algorithm is an improvement of the Possibilistic Biclustering Algorithm (PBC). The PBC algorithm, proposed by M. Filippone et al., is based on the Possibilistic Clustering paradigm and finds one bicluster at a time, assigning a membership to the bicluster for each gene and for each condition. PBC uses an objective function that maximizes the bicluster cardinality and minimizes the residual error. The biclustering problem is thus treated as the optimization of a suitable functional. This algorithm achieves fast convergence and good-quality solutions.
Unfortunately, PBC finds only one bicluster at a time. I propose an improved PBC algorithm based on data resampling, specifically Bootstrap aggregation, and Genetic algorithms. In this way I can find all the possible biclusters together and include overlapping solutions. I apply the algorithm to synthetic data and to the Yeast dataset and compare it with the original PBC method. [edited by Author]
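
The abstract refers to the "error" or "residual error" of a bicluster without stating its formula. As an illustration only, the sketch below assumes the mean squared residue commonly used in the biclustering literature; the function name `mean_squared_residue` and the toy matrices are hypothetical and are not taken from the thesis.

```python
import numpy as np


def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the bicluster data[rows, cols].

    Assumed error measure (not confirmed by the abstract): each entry is
    compared with its row mean, its column mean and the overall mean of
    the submatrix. A purely additive pattern gives a residue of zero.
    """
    sub = data[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall_mean = sub.mean()
    residue = sub - row_means - col_means + overall_mean
    return float((residue ** 2).mean())


# Toy matrix with an additive pattern a_ij = r_i + c_j (hypothetical data).
row_effects = np.array([0.0, 1.0, 2.0])
col_effects = np.array([0.0, 10.0, 20.0, 30.0])
data = row_effects[:, None] + col_effects[None, :]

# Perturb one entry so that biclusters containing it are no longer coherent.
data[2, 3] += 5.0

# Bicluster that avoids the perturbed entry versus the whole matrix.
clean = mean_squared_residue(data, [0, 1], [0, 1, 2])
noisy = mean_squared_residue(data, [0, 1, 2], [0, 1, 2, 3])
```

Under this assumed measure, a threshold on the returned value would play the role of the "accepted value of the error" mentioned above, with `clean` falling below any positive threshold and `noisy` exceeding small ones.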

URI

http://hdl.handle.net/10556/190

Collections

Matematica
