Data Stream Profiling: Evolutionary and Incremental Algorithms for Dependency Discovery

Cirillo, Stefano

Mostra/Apri

tesi di dottorato (11.12Mb)

abstract a cura dell’autore (versione italiana e inglese) (91.35Kb)

Data

2022-02-22

Autore

Cirillo, Stefano

Metadata

Mostra tutti i dati dell'item

Abstract

Data Profiling represents one of the most crucial processes in data quality assessment. It includes a set of activities to efficiently analyze datasets and provide insights from them. Such activities rely on the identification of metadata to capture semantic relationships within data, and can be exploited for several purposes, such as optimizing queries, cleaning data, evaluating feasibility of machine learning models, and so forth. The types of metadata range from simple counters of attribute values or null values, to complex integrity constraints, such as functional dependencies (fds), relaxed functional dependencies (rfds), and inclusion dependencies (inds). However, the discovery of these metadata represents an important challenge for data profiling tasks, since the number of possible metadata can be exponential with respect to the number of attributes, and requires analyzing a huge number of attribute combinations. To this end, several discovery algorithms have been proposed in the literature, with the aim of providing solutions in which the complexity of the search space is reduced by exploiting some theoretical properties of the different types of metadata. Although some of the discovery algorithms described in the literature achieve good performances, most of them are not suitable in dynamic scenarios, in which new data are frequently added and updated into the datasets. This need is widely growning with the proliferation of the Internet of Things (IoT) technologies, since it is necessary to define new algorithms capable of dynamically analyzing the streams of data they produce. In this scenario, after reviewing basic data profiling tasks and applica- tions, as well as basic notations for representing profiling metadata, this thesis starts presenting an innovative tool that extracts metadata from unstructured web data sources, aiming to derive a focused crawler. Then, the thesis focuses on the discovery problem of fds and rfds in static and dynamic scenarios, by analyzing their complexities and by introducing several new incremental methodologies and algorithms for discovering iii fds and rfds, aiming to avoid the re-execution of the discovery process from scratch upon update operations on datasets. In particular, a first proposal is an evolutionary discovery algorithm for hybrid rfds named REDEVO (RElaxed fD EVOlutionary discovery algorithm), which uses naturally inspired operations to iteratively browse candidates over the search space, few of which survive the evolution process. It identifies a broad class of rfds, by evaluating each candidate by means of support and confidence quality measures as a fitness function. .. [edited by Author]

URI

http://elea.unisa.it/xmlui/handle/10556/7139
http://dx.doi.org/10.14273/unisa-5191

Collections

Informatica

Find Full text