Multi-biclustering solutions for classification and prediction problems
Abstract
The search for similarities in large data sets plays an important role in many
scientific fields. It makes it possible to classify several types of data without
explicit information about them. Unfortunately, experimental data
contains noise and errors, and therefore the main task of mathematicians
is to find algorithms that analyze this data with maximal
precision. In many cases researchers use methodologies such as clustering
to classify data with respect to patterns or conditions. In the
last few years, however, new analysis tools such as biclustering have been proposed and
applied to many specific problems. My choice of biclustering methods is
motivated by the accuracy of the results they obtain and by the possibility of
finding not only rows or columns that provide a dataset partition but also
rows and columns together.
In this work, I present two new biclustering algorithms: the Combinatorial
Biclustering Algorithm (CBA) and an improvement of the Possibilistic Biclustering
Algorithm, called Biclustering by resampling. The
first algorithm (which I call Combinatorial) is based on the direct definition
of a bicluster, which makes it clear and very easy to understand. The
algorithm makes it possible to control the error of the biclusters at each step,
by specifying the accepted value of the error and defining the dimensions of the
desired biclusters from the beginning. A comparison with other known
biclustering algorithms is shown.
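
The idea of accepting a bicluster only when its error stays below a chosen value and its dimensions meet a chosen minimum can be illustrated with a minimal sketch. The abstract does not state which error measure CBA uses; the sketch assumes the mean squared residue, a common bicluster error measure, and the function names and parameters (`max_error`, `min_rows`, `min_cols`) are hypothetical. It is not the CBA search itself, only the acceptance check.

```python
def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix of `data` given by rows x cols.

    Assumption: this error measure is chosen for illustration only; the
    abstract does not say which error CBA controls.
    """
    n, m = len(rows), len(cols)
    sub = [[data[i][j] for j in cols] for i in rows]
    row_means = [sum(r) / m for r in sub]
    col_means = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(row_means) / n
    total = 0.0
    for i in range(n):
        for j in range(m):
            # Deviation of each entry from a purely additive row + column model.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n * m)


def is_acceptable_bicluster(data, rows, cols, max_error, min_rows, min_cols):
    """Hypothetical acceptance check: size constraints first, then error bound."""
    if len(rows) < min_rows or len(cols) < min_cols:
        return False
    return mean_squared_residue(data, rows, cols) <= max_error


# Toy matrix: rows 0-2 on columns 0-2 follow an additive pattern.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [0.0, 8.0, 2.0, 5.0],
]
good = is_acceptable_bicluster(data, [0, 1, 2], [0, 1, 2], 0.01, 2, 2)
whole = is_acceptable_bicluster(data, [0, 1, 2, 3], [0, 1, 2, 3], 0.01, 2, 2)
```

In this toy matrix the 3 x 3 block passes the check, while the full matrix does not, because row 3 and column 3 break the additive pattern.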
The second algorithm is an improvement of the Possibilistic Biclustering
Algorithm (PBC). The PBC algorithm, proposed by M. Filippone et al.,
is based on the Possibilistic Clustering paradigm and finds one bicluster
at a time, assigning to each gene and to each condition a membership in
the bicluster. PBC uses an objective function that maximizes the bicluster
cardinality and minimizes the residual error. The biclustering problem
is thus treated as the optimization of a suitable functional. The algorithm
converges quickly and yields solutions of good quality. Unfortunately,
PBC finds only one bicluster at a time. I propose an improved PBC algorithm
based on data resampling, specifically Bootstrap aggregation, and
Genetic Algorithms. In this way I can find all the possible biclusters
together and include overlapping solutions. I apply the algorithm to synthetic
data and to the Yeast dataset and compare it with the original PBC method. [edited by the author]
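
The resampling idea can be sketched as follows: a method that returns one bicluster per run is applied to many bootstrap resamples of the rows, and the results are pooled with a count of how often each bicluster appears. This is only an illustration under stated assumptions. The function `bootstrap_biclusters` and its parameters are hypothetical, the toy finder `largest_identical_row_group` is a stand-in and is not PBC, and the Genetic Algorithm component mentioned above is not reproduced.

```python
import random
from collections import Counter


def bootstrap_biclusters(data, find_one_bicluster, n_rounds, seed=0):
    """Run a single-bicluster finder on bootstrap resamples of the rows.

    `find_one_bicluster(matrix)` is assumed to return (row_positions, cols)
    for the resampled matrix, or None if nothing is found.
    Returns a Counter mapping (rows, cols) frozensets to occurrence counts.
    """
    rng = random.Random(seed)
    n_rows = len(data)
    counts = Counter()
    for _ in range(n_rounds):
        # Bootstrap sample: draw n_rows row indices with replacement.
        sample = rng.choices(range(n_rows), k=n_rows)
        resampled = [data[i] for i in sample]
        result = find_one_bicluster(resampled)
        if result is None:
            continue
        local_rows, cols = result
        # Map positions in the resample back to original row indices.
        rows = frozenset(sample[r] for r in local_rows)
        counts[(rows, frozenset(cols))] += 1
    return counts


def largest_identical_row_group(matrix):
    """Toy stand-in finder (not PBC): the largest group of identical rows."""
    groups = {}
    for pos, row in enumerate(matrix):
        groups.setdefault(tuple(row), []).append(pos)
    best = max(groups.values(), key=len)
    return best, list(range(len(matrix[0])))


# Toy data with two row groups: rows 0-2 and rows 3-4.
data = [[1, 1, 1], [1, 1, 1], [1, 1, 1], [5, 5, 5], [5, 5, 5]]
counts = bootstrap_biclusters(data, largest_identical_row_group, n_rounds=20, seed=1)
```

Because each resample contains a different mix of rows, separate runs can return different biclusters, and the counts indicate how consistently each one is recovered across resamples.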