Levenshtein distance for information extraction in databases and for natural language processing.

Bruno Woltzenlogel Paleo

Bibliographic details
Year of defense:	2007
Main author:	Bruno Woltzenlogel Paleo
Advisor:	Not provided by the institution
Examination committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defending institution:	Instituto Tecnológico de Aeronáutica
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Processamento de textos; Linguagem natural (computadores); Rotinas de edição (computadores); Teoria da informação; Computação
Access link:	http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529
Abstract:	Information extraction and natural language processing tasks often run into problems when the data or texts contain noise, typing mistakes, or other kinds of errors. In this thesis we investigate the use of modified Levenshtein edit distances to deal with these problems in two specific tasks. The first is record linkage in databases, where distinct records may represent the same entity. For this task we used and extended the WEKA API for machine learning, and we showed that a modified Levenshtein distance yields good precision and recall in detecting records that represent the same entity. The second task is the search for and annotation of occurrences of specified words in natural-language texts. Our main result for this task was the implementation of an approximate Gazetteer for GATE, the General Architecture for Text Engineering.
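
For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of the classic unit-cost Levenshtein distance, together with a normalized similarity score of the kind that could be thresholded when comparing two record fields. It is not the modified distance developed in the thesis, whose details this record does not give; the function names `levenshtein` and `similarity` and the normalization by the longer string's length are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic unit-cost edit distance: insertions, deletions, substitutions.

    Illustrative only; this is NOT the modified distance from the thesis.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Two field values could then be treated as candidate matches when `similarity` exceeds some chosen threshold; the threshold itself would have to be tuned on labeled data.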

Item metadata

id	ITA_beb9555441875fb680595c73e725cee5
oai_identifier_str	oai:agregador.ibict.br.BDTD_ITA:oai:ita.br:529
network_acronym_str	ITA
network_name_str	Biblioteca Digital de Teses e Dissertações do ITA
dc.title.none.fl_str_mv	Levenshtein distance for information extraction in databases and for natural language processing.
author	Bruno Woltzenlogel Paleo
author_facet	Bruno Woltzenlogel Paleo
author_role	author
dc.contributor.none.fl_str_mv	Carlos Henrique Costa Ribeiro
dc.contributor.author.fl_str_mv	Bruno Woltzenlogel Paleo
dc.subject.por.fl_str_mv	Processamento de textos Linguagem natural (computadores) Rotinas de edição (computadores) Teoria da informação Computação
topic	Processamento de textos Linguagem natural (computadores) Rotinas de edição (computadores) Teoria da informação Computação
publishDate	2007
dc.date.none.fl_str_mv	2007-12-21
dc.type.driver.fl_str_mv	info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis
status_str	publishedVersion
format	masterThesis
dc.identifier.uri.fl_str_mv	http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529
url	http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Instituto Tecnológico de Aeronáutica
publisher.none.fl_str_mv	Instituto Tecnológico de Aeronáutica
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações do ITA instname:Instituto Tecnológico de Aeronáutica instacron:ITA
reponame_str	Biblioteca Digital de Teses e Dissertações do ITA
collection	Biblioteca Digital de Teses e Dissertações do ITA
instname_str	Instituto Tecnológico de Aeronáutica
instacron_str	ITA
institution	ITA
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações do ITA - Instituto Tecnológico de Aeronáutica
repository.mail.fl_str_mv
subject_por_txtF_mv	Processamento de textos Linguagem natural (computadores) Rotinas de edição (computadores) Teoria da informação Computação
_version_	1706804988235218944
