Data Stream Profiling: Evolutionary and Incremental Algorithms for Dependency Discovery

Cirillo, Stefano

dc.date.accessioned	2024-06-26T10:22:02Z
dc.date.available	2024-06-26T10:22:02Z
dc.description	2020 - 2021	it_IT
dc.description.abstract	Data Profiling represents one of the most crucial processes in data quality assessment. It includes a set of activities to efficiently analyze datasets and provide insights from them. Such activities rely on the identification of metadata to capture semantic relationships within data, and can be exploited for several purposes, such as optimizing queries, cleaning data, evaluating feasibility of machine learning models, and so forth. The types of metadata range from simple counters of attribute values or null values, to complex integrity constraints, such as functional dependencies (fds), relaxed functional dependencies (rfds), and inclusion dependencies (inds). However, the discovery of these metadata represents an important challenge for data profiling tasks, since the number of possible metadata can be exponential with respect to the number of attributes, and requires analyzing a huge number of attribute combinations. To this end, several discovery algorithms have been proposed in the literature, with the aim of providing solutions in which the complexity of the search space is reduced by exploiting some theoretical properties of the different types of metadata. Although some of the discovery algorithms described in the literature achieve good performances, most of them are not suitable in dynamic scenarios, in which new data are frequently added and updated into the datasets. This need is widely growning with the proliferation of the Internet of Things (IoT) technologies, since it is necessary to define new algorithms capable of dynamically analyzing the streams of data they produce. In this scenario, after reviewing basic data profiling tasks and applica- tions, as well as basic notations for representing profiling metadata, this thesis starts presenting an innovative tool that extracts metadata from unstructured web data sources, aiming to derive a focused crawler. Then, the thesis focuses on the discovery problem of fds and rfds in static and dynamic scenarios, by analyzing their complexities and by introducing several new incremental methodologies and algorithms for discovering iii fds and rfds, aiming to avoid the re-execution of the discovery process from scratch upon update operations on datasets. In particular, a first proposal is an evolutionary discovery algorithm for hybrid rfds named REDEVO (RElaxed fD EVOlutionary discovery algorithm), which uses naturally inspired operations to iteratively browse candidates over the search space, few of which survive the evolution process. It identifies a broad class of rfds, by evaluating each candidate by means of support and confidence quality measures as a fitness function. .. [edited by Author]	it_IT
dc.language.iso	en	it_IT
dc.subject.miur	INF/01 INFORMATICA	it_IT
dc.contributor.coordinatore	De Lucia, Andrea	it_IT
dc.description.ciclo	XXXIV ciclo	it_IT
dc.contributor.tutor	Polese, Giuseppe	it_IT
dc.title	Data Stream Profiling: Evolutionary and Incremental Algorithms for Dependency Discovery	it_IT
dc.contributor.author	Cirillo, Stefano
dc.date.issued	2022-02-22
dc.identifier.uri	http://elea.unisa.it/xmlui/handle/10556/7139
dc.identifier.uri	http://dx.doi.org/10.14273/unisa-5191
dc.type	Doctoral Thesis	it_IT
dc.subject	Incremental Data Profiling	it_IT
dc.subject	Functional Dependencies	it_IT
dc.subject	Data Stream	it_IT
dc.publisher.alternative	Universita degli studi di Salerno	it_IT

Find Full text

Files in questo item

Name:: tesi_di_dottorato_S_Cirillo.pdf
Dimensione:: 11.12Mb
Formato:: PDF
Description:: tesi di dottorato

Mostra/Apri

Name:: abstract in italiano e in inglese ...
Dimensione:: 91.35Kb
Formato:: PDF
Description:: abstract a cura dell’autore ...

Mostra/Apri

Questo item appare nelle seguenti collezioni

Informatica

Mostra i principali dati dell'item