Seleção automática de exemplos de treino para um método de deduplicação de registros baseado em programação genética
| Year of defense: | 2010 |
|---|---|
| Lead author: | Gabriel Silva Goncalves |
| Advisor: | |
| Examination committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | por |
| Defending institution: | Universidade Federal de Minas Gerais |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://hdl.handle.net/1843/BUBD-9JWQAQ |
| Abstract: | The increasing volume of information available in digital media is becoming a challenge for administrators of large data repositories, such as digital libraries and the databases of large corporations. Today, the quality of the data used by an organization can be said to be proportional to its capacity to provide useful services to its users. Thus, companies and government institutions are investing heavily in developing efficient methods to identify and remove duplicates in large data repositories. Because record deduplication demands a lot of time and processing power, the proposed methods should achieve good results as efficiently as possible. Recently, machine learning techniques have been used to address the record deduplication problem. However, these techniques require examples, usually generated manually, for the training phase in which duplication patterns are learned from existing data, which may restrict their use because of the cost of creating the training set. This MSc thesis proposes an approach that uses a deterministic technique to automatically suggest training examples for a record deduplication method based on genetic programming (GP). Experiments using synthetic data show that reduced training sets can be used to generate deduplication functions faster without significantly reducing the quality of the solutions, even in data repositories with high levels of deduplication difficulty. In addition, a factorial design was performed to measure how difficult data repositories are to deduplicate, identifying the characteristics that may affect the use of our approach to selecting training examples for the GP-based record deduplication method. |
| id | UFMG_a5b2182dde99bc03d3c32d5d46735182 |
|---|---|
| oai_identifier_str | oai:repositorio.ufmg.br:1843/BUBD-9JWQAQ |
| network_acronym_str | UFMG |
| network_name_str | Repositório Institucional da UFMG |
| repository_id_str | |
| dc.title.none.fl_str_mv | Seleção automática de exemplos de treino para um método de deduplicação de registros baseado em programação genética |
| author | Gabriel Silva Goncalves |
| dc.subject.por.fl_str_mv | Programação genética (Computação) Computação genética Inteligência artificial Identificação de duplicatas Programação |
| publishDate | 2010 |
| dc.date.none.fl_str_mv | 2010-04-30 2019-08-11T06:30:15Z 2019-08-11T06:30:15Z 2025-09-09T01:18:09Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
| format | masterThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | https://hdl.handle.net/1843/BUBD-9JWQAQ |
| url | https://hdl.handle.net/1843/BUBD-9JWQAQ |
| dc.language.iso.fl_str_mv | por |
| language | por |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.publisher.none.fl_str_mv | Universidade Federal de Minas Gerais |
| publisher.none.fl_str_mv | Universidade Federal de Minas Gerais |
| dc.source.none.fl_str_mv | reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
| instname_str | Universidade Federal de Minas Gerais (UFMG) |
| instacron_str | UFMG |
| institution | UFMG |
| reponame_str | Repositório Institucional da UFMG |
| collection | Repositório Institucional da UFMG |
| repository.name.fl_str_mv | Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
| repository.mail.fl_str_mv | repositorio@ufmg.br |
| _version_ | 1856414079934005248 |