
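Seleção automática de exemplos de treino para um método de deduplicação de registros baseado em programação genética [Automatic selection of training examples for a record deduplication method based on genetic programming]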

Bibliographic details
Year of defense: 2010
Main author: Gabriel Silva Goncalves
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://hdl.handle.net/1843/BUBD-9JWQAQ
Abstract: The increasing volume of information available in digital media is becoming a challenge for administrators of large data repositories such as digital libraries and databases of large corporations. Nowadays, it is possible to say that the quality of the data used by an organization is proportional to its capacity to provide useful services to its users. Thus, companies and government institutions are investing heavily in developing efficient methods to identify and remove duplicates in large data repositories. Because record deduplication demands considerable time and processing power, the proposed methods should obtain good results as efficiently as possible. Recently, machine learning techniques have been used to deal with the record deduplication problem. However, these techniques require examples, usually generated manually, for the training phase in which duplication patterns are learned from existing data, which may restrict their use because of the cost of creating the training set. This MSc thesis proposes an approach that uses a deterministic technique to automatically suggest training examples for a record deduplication method based on genetic programming (GP). Experiments using synthetic data show that it is possible to use reduced training sets to generate deduplication functions faster without significantly reducing the quality of the solutions generated, even in data repositories with high levels of difficulty for deduplication. In addition, a factorial design was performed to measure how difficult data repositories are to deduplicate, identifying the characteristics that may affect the use of our approach to selecting training examples for the GP-based record deduplication method.
id UFMG_a5b2182dde99bc03d3c32d5d46735182
oai_identifier_str oai:repositorio.ufmg.br:1843/BUBD-9JWQAQ
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
dc.title.none.fl_str_mv Seleção automática de exemplos de treino para um método de deduplicação de registros baseado em programação genética
author Gabriel Silva Goncalves
author_facet Gabriel Silva Goncalves
author_role author
dc.contributor.author.fl_str_mv Gabriel Silva Goncalves
dc.subject.por.fl_str_mv Programação genética (Computação)
Computação
genética
Inteligência artificial
Identificação de duplicatas
Programação
publishDate 2010
dc.date.none.fl_str_mv 2010-04-30
2019-08-11T06:30:15Z
2019-08-11T06:30:15Z
2025-09-09T01:18:09Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1843/BUBD-9JWQAQ
url https://hdl.handle.net/1843/BUBD-9JWQAQ
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1856414079934005248