Data Stream Profiling: Evolutionary and Incremental Algorithms for Dependency Discovery
Abstract
Data Profiling represents one of the most crucial processes in data quality
assessment. It includes a set of activities to efficiently analyze datasets
and provide insights from them. Such activities rely on the identification
of metadata to capture semantic relationships within data, and can be
exploited for several purposes, such as optimizing queries, cleaning data,
evaluating feasibility of machine learning models, and so forth. The
types of metadata range from simple counters of attribute values or null
values, to complex integrity constraints, such as functional dependencies
(fds), relaxed functional dependencies (rfds), and inclusion dependencies
(inds). However, the discovery of these metadata represents an important
challenge for data profiling tasks, since the number of possible metadata
can be exponential with respect to the number of attributes, and requires
analyzing a huge number of attribute combinations. To this end, several
discovery algorithms have been proposed in the literature, with the aim
of providing solutions in which the complexity of the search space is
reduced by exploiting some theoretical properties of the different types
of metadata. Although some of the discovery algorithms described in the
literature achieve good performances, most of them are not suitable in
dynamic scenarios, in which new data are frequently added and updated
into the datasets. This need is widely growning with the proliferation of
the Internet of Things (IoT) technologies, since it is necessary to define
new algorithms capable of dynamically analyzing the streams of data
they produce.
In this scenario, after reviewing basic data profiling tasks and applica-
tions, as well as basic notations for representing profiling metadata, this
thesis starts presenting an innovative tool that extracts metadata from
unstructured web data sources, aiming to derive a focused crawler. Then,
the thesis focuses on the discovery problem of fds and rfds in static and
dynamic scenarios, by analyzing their complexities and by introducing
several new incremental methodologies and algorithms for discovering
iii
fds and rfds, aiming to avoid the re-execution of the discovery process
from scratch upon update operations on datasets. In particular, a first
proposal is an evolutionary discovery algorithm for hybrid rfds named
REDEVO (RElaxed fD EVOlutionary discovery algorithm), which uses
naturally inspired operations to iteratively browse candidates over the
search space, few of which survive the evolution process. It identifies a
broad class of rfds, by evaluating each candidate by means of support
and confidence quality measures as a fitness function. [...] [edited by Author]