Levenshtein distance for information extraction in databases and for natural language processing.
Year of defense: | 2007 |
---|---|
Author: | Bruno Woltzenlogel Paleo |
Advisor: | |
Examining committee: | |
Document type: | Master's thesis |
Access type: | Open access |
Language: | eng |
Degree-granting institution: | Instituto Tecnológico de Aeronáutica |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | Processamento de textos; Linguagem natural (computadores); Rotinas de edição (computadores); Teoria da informação; Computação |
Access link: | http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529 |
Abstract: | While performing information extraction or natural language processing tasks, one usually encounters problems when working with data or texts containing noise, typing mistakes, or other kinds of errors. In this thesis we investigate the use of modified Levenshtein edit distances to deal with these problems in two specific tasks. The first is record linkage in databases, where distinct records may represent the same entity. For this task we used and extended the WEKA API for machine learning and showed that a modified Levenshtein distance yields good precision and recall in detecting records that represent the same entity. The second task is the search and annotation of occurrences of specified words in texts written in natural language. Our main result in this task was the implementation of an approximate Gazetteer for GATE, the General Architecture for Text Engineering. |
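
To make the record-linkage idea in the abstract concrete, here is a minimal Python sketch of the classic unit-cost Levenshtein distance together with a length-normalized similarity. It is not the modified distance developed in the thesis, and it does not use WEKA or GATE; the function names `levenshtein`, `similarity`, and `same_entity`, and the threshold of 0.8, are assumptions chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit costs for insert, delete, substitute."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical linkage rule: treat two field values as one entity
    when their case-insensitive similarity reaches the threshold."""
    return similarity(a.lower(), b.lower()) >= threshold
```

With these definitions, `same_entity("Jon Smith", "John Smith")` is true because the two names differ by a single inserted character, while clearly different names fall below the threshold. A real linkage system would compare several fields and tune the threshold against labeled data.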
id | ITA_beb9555441875fb680595c73e725cee5 |
---|---|
oai_identifier_str | oai:agregador.ibict.br.BDTD_ITA:oai:ita.br:529 |
network_acronym_str | ITA |
network_name_str | Biblioteca Digital de Teses e Dissertações do ITA |
author | Bruno Woltzenlogel Paleo |
author_role | author |
dc.contributor.none.fl_str_mv | Carlos Henrique Costa Ribeiro |
dc.subject.por.fl_str_mv | Processamento de textos; Linguagem natural (computadores); Rotinas de edição (computadores); Teoria da informação; Computação |
publishDate | 2007 |
dc.date.none.fl_str_mv | 2007-12-21 |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis |
dc.identifier.uri.fl_str_mv | http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529 |
dc.language.iso.fl_str_mv | eng |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | Instituto Tecnológico de Aeronáutica |
dc.source.none.fl_str_mv | reponame:Biblioteca Digital de Teses e Dissertações do ITA; instname:Instituto Tecnológico de Aeronáutica; instacron:ITA |
repository.name.fl_str_mv | Biblioteca Digital de Teses e Dissertações do ITA - Instituto Tecnológico de Aeronáutica |
_version_ | 1706804988235218944 |