Aprendizado sem-fim de paráfrases (Never-ending learning of paraphrases)
Year of defense: | 2016 |
---|---|
Main author: | Polastri, Paulo César |
Advisor: | Caseli, Helena de Medeiros |
Defense committee: | |
Document type: | Master's thesis (Dissertação) |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de São Carlos, Câmpus São Carlos |
Graduate program: | Programa de Pós-Graduação em Ciência da Computação - PPGCC |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
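Keywords in Portuguese: | Paráfrases; Reconhecimento automático de paráfrases; Aprendizado de máquina sem-fim; Processamento de língua natural; Português do Brasil |
Keywords in English: | Paraphrase lexicon; Automatic paraphrase recognition; Never-ending machine learning; Natural language processing; Brazilian Portuguese |
CNPq knowledge area: | CIENCIAS EXATAS E DA TERRA |
Access link: | https://repositorio.ufscar.br/handle/ufscar/7868 |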
Abstract: | Using different words to express the same message is a necessity in any natural language and, as such, deserves investigation in Natural Language Processing (NLP) research. When a single word is involved, the interchangeable words are called synonyms; the term paraphrase expresses a more general idea that may also involve more than one word. For example, the Portuguese sentences "o sinal está vermelho" and "o semáforo está fechado" (both meaning "the traffic light is red") are paraphrases, while "sinal" and "semáforo" are synonyms in this context. Proper treatment of paraphrases is important in several NLP applications: in Machine Translation, where paraphrases can increase the coverage of Statistical Machine Translation systems; in Multidocument Summarization, where paraphrase identification allows repeated information to be recognized; and in Natural Language Generation, where paraphrase generation allows more varied and fluent texts to be created. The project described in this document aims to verify whether it is possible to learn word-level paraphrases incrementally and automatically from a bilingual parallel corpus, using the Never-Ending Machine Learning (NEML) strategy and the Internet as a source of knowledge. NEML is a machine learning strategy based on how humans learn: what was learned previously can be used to learn new, and perhaps more complex, information in the future. To that end, NEML was applied together with the paraphrase extraction strategy proposed by Bannard and Callison-Burch (2005), in which paraphrases are extracted from a bilingual parallel corpus using a pivot language. In this context, NEPaL (Never-Ending Paraphrase Learner) was developed, a NEML system responsible for: (1) extracting texts from the Internet, (2) aligning the texts using a pivot language, (3) classifying the candidates according to a classification model, and (4) using the acquired knowledge to produce a new classifier model and thereby gain more knowledge, restarting the never-ending learning cycle. |
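
The pivot-language idea cited in the abstract can be illustrated with a small sketch. In the formulation of Bannard and Callison-Burch (2005), a candidate paraphrase e2 of a phrase e1 is scored by summing over pivot phrases f: p(e2 | e1) = Σ_f p(f | e1) · p(e2 | f). The code below is only an illustration of that formula, not the NEPaL implementation: the function name `pivot_paraphrases`, the table names `TO_PIVOT` and `FROM_PIVOT`, and all probability values are hypothetical, invented for this example.

```python
from collections import defaultdict


def pivot_paraphrases(to_pivot, from_pivot, source):
    """Score paraphrase candidates for `source` by pivoting.

    to_pivot[e][f]   ~ p(f | e): probability of pivot phrase f given phrase e
    from_pivot[f][e] ~ p(e | f): probability of phrase e given pivot phrase f
    Returns (candidate, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for pivot, p_pivot in to_pivot.get(source, {}).items():
        for candidate, p_back in from_pivot.get(pivot, {}).items():
            if candidate != source:  # a phrase is not its own paraphrase
                scores[candidate] += p_pivot * p_back
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up translation tables (Portuguese phrases, English pivot).
TO_PIVOT = {"sinal": {"traffic light": 0.6, "sign": 0.4}}
FROM_PIVOT = {
    "traffic light": {"sinal": 0.5, "semáforo": 0.5},
    "sign": {"sinal": 0.75, "placa": 0.25},
}

ranked = pivot_paraphrases(TO_PIVOT, FROM_PIVOT, "sinal")
for candidate, score in ranked:
    print(f"{candidate}\t{score:.2f}")
```

With these toy tables, "semáforo" scores higher than "placa" because it is reached through the more probable pivot. In a never-ending setting, such scored candidates would be passed to the classifier described in steps (3) and (4) of the abstract.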
id |
SCAR_38ac1ad1a0fdab0b6bd1994adbf7f39d |
---|---|
oai_identifier_str |
oai:repositorio.ufscar.br:ufscar/7868 |
network_acronym_str |
SCAR |
network_name_str |
Repositório Institucional da UFSCAR |
repository_id_str |
|
spelling |
Funding: none received (Não recebi financiamento). Bitstreams: ORIGINAL DissPCP.pdf (application/pdf, size 1921482); LICENSE license.txt (text/plain; charset=utf-8, size 1957), the UFSCar non-exclusive distribution license (Licença de Distribuição Não-Exclusiva); TEXT DissPCP.pdf.txt (extracted text, text/plain, size 272846); THUMBNAIL DissPCP.pdf.jpg (IM Thumbnail, image/jpeg, size 10034). Record timestamp: 2019-09-11 02:27:34.757. Repository: Repositório Institucional (PUB), https://repositorio.ufscar.br/oai/request, opendoar:4322, 2023-05-25T12:53:09.831390. |
dc.title.por.fl_str_mv |
Aprendizado sem-fim de paráfrases |
title |
Aprendizado sem-fim de paráfrases |
author |
Polastri, Paulo César |
author_facet |
Polastri, Paulo César |
author_role |
author |
dc.contributor.authorlattes.por.fl_str_mv |
http://lattes.cnpq.br/1341941141535178 |
dc.contributor.author.fl_str_mv |
Polastri, Paulo César |
dc.contributor.advisor1.fl_str_mv |
Caseli, Helena de Medeiros |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/6608582057810385 |
contributor_str_mv |
Caseli, Helena de Medeiros |
dc.subject.por.fl_str_mv |
Paráfrases; Reconhecimento automático de paráfrases; Aprendizado de máquina sem-fim; Processamento de língua natural; Português do Brasil |
topic |
Paráfrases; Reconhecimento automático de paráfrases; Aprendizado de máquina sem-fim; Processamento de língua natural; Português do Brasil; Paraphrase lexicon; Automatic paraphrase recognition; Never-ending machine learning; Natural language processing; Brazilian Portuguese; CIENCIAS EXATAS E DA TERRA |
dc.subject.eng.fl_str_mv |
Paraphrase lexicon; Automatic paraphrase recognition; Never-ending machine learning; Natural language processing; Brazilian Portuguese |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA |
publishDate |
2016 |
dc.date.accessioned.fl_str_mv |
2016-10-14T14:13:28Z |
dc.date.available.fl_str_mv |
2016-10-14T14:13:28Z |
dc.date.issued.fl_str_mv |
2016-03-04 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
POLASTRI, Paulo César. Aprendizado sem-fim de paráfrases. 2016. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2016. Disponível em: https://repositorio.ufscar.br/handle/ufscar/7868. |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufscar.br/handle/ufscar/7868 |
identifier_str_mv |
POLASTRI, Paulo César. Aprendizado sem-fim de paráfrases. 2016. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2016. Disponível em: https://repositorio.ufscar.br/handle/ufscar/7868. |
url |
https://repositorio.ufscar.br/handle/ufscar/7868 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de São Carlos; Câmpus São Carlos |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação - PPGCC |
dc.publisher.initials.fl_str_mv |
UFSCar |
publisher.none.fl_str_mv |
Universidade Federal de São Carlos; Câmpus São Carlos |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR |
instname_str |
Universidade Federal de São Carlos (UFSCAR) |
instacron_str |
UFSCAR |
institution |
UFSCAR |
reponame_str |
Repositório Institucional da UFSCAR |
collection |
Repositório Institucional da UFSCAR |
bitstream.url.fl_str_mv |
https://repositorio.ufscar.br/bitstream/ufscar/7868/1/DissPCP.pdf https://repositorio.ufscar.br/bitstream/ufscar/7868/2/license.txt https://repositorio.ufscar.br/bitstream/ufscar/7868/3/DissPCP.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/7868/4/DissPCP.pdf.jpg |
bitstream.checksum.fl_str_mv |
5298cc1a066e0cfe217b2b9c61076e65 ae0398b6f8b235e40ad82cba6c50031d 2adbb4720a9af279ee256ee584334773 40e84b04dce0c161336ec9d2206042aa |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR) |
repository.mail.fl_str_mv |
|
_version_ |
1767351112178860032 |