Mostra i principali dati dell'item

dc.contributor.authorCirillo, Stefano
dc.date.accessioned2024-06-26T10:22:02Z
dc.date.available2024-06-26T10:22:02Z
dc.date.issued2022-02-22
dc.identifier.urihttp://elea.unisa.it/xmlui/handle/10556/7139
dc.description2020 - 2021it_IT
dc.description.abstractData Profiling represents one of the most crucial processes in data quality assessment. It includes a set of activities to efficiently analyze datasets and provide insights from them. Such activities rely on the identification of metadata to capture semantic relationships within data, and can be exploited for several purposes, such as optimizing queries, cleaning data, evaluating feasibility of machine learning models, and so forth. The types of metadata range from simple counters of attribute values or null values, to complex integrity constraints, such as functional dependencies (fds), relaxed functional dependencies (rfds), and inclusion dependencies (inds). However, the discovery of these metadata represents an important challenge for data profiling tasks, since the number of possible metadata can be exponential with respect to the number of attributes, and requires analyzing a huge number of attribute combinations. To this end, several discovery algorithms have been proposed in the literature, with the aim of providing solutions in which the complexity of the search space is reduced by exploiting some theoretical properties of the different types of metadata. Although some of the discovery algorithms described in the literature achieve good performances, most of them are not suitable in dynamic scenarios, in which new data are frequently added and updated into the datasets. This need is widely growning with the proliferation of the Internet of Things (IoT) technologies, since it is necessary to define new algorithms capable of dynamically analyzing the streams of data they produce. In this scenario, after reviewing basic data profiling tasks and applica- tions, as well as basic notations for representing profiling metadata, this thesis starts presenting an innovative tool that extracts metadata from unstructured web data sources, aiming to derive a focused crawler. Then, the thesis focuses on the discovery problem of fds and rfds in static and dynamic scenarios, by analyzing their complexities and by introducing several new incremental methodologies and algorithms for discovering iii fds and rfds, aiming to avoid the re-execution of the discovery process from scratch upon update operations on datasets. In particular, a first proposal is an evolutionary discovery algorithm for hybrid rfds named REDEVO (RElaxed fD EVOlutionary discovery algorithm), which uses naturally inspired operations to iteratively browse candidates over the search space, few of which survive the evolution process. It identifies a broad class of rfds, by evaluating each candidate by means of support and confidence quality measures as a fitness function. [...] [edited by Author]it_IT
dc.language.isoenit_IT
dc.publisherUniversita degli studi di Salernoit_IT
dc.subjectIncremental Data Profilingit_IT
dc.subjectFunctional Dependenciesit_IT
dc.subjectData Streamit_IT
dc.titleData Stream Profiling: Evolutionary and Incremental Algorithms for Dependency Discoveryit_IT
dc.typeDoctoral Thesisit_IT
dc.subject.miurINF/01 INFORMATICAit_IT
dc.contributor.coordinatoreDe Lucia, Andreait_IT
dc.description.cicloXXXIV cicloit_IT
dc.contributor.tutorPolese, Giuseppeit_IT
 Find Full text

Files in questo item

Thumbnail
Thumbnail

Questo item appare nelle seguenti collezioni

Mostra i principali dati dell'item