Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings (On the use of expert knowledge to aid the learning of Word Embeddings)
Year of defense: | 2018 |
---|---|
Main author: | Santos, Flávio Arthur Oliveira |
Advisor: | Macedo, Hendrik Teixeira |
Defense committee: | |
Document type: | Master's thesis (dissertação) |
Access type: | Open access |
Language: | Portuguese (por) |
Degree-granting institution: | Universidade Federal de Sergipe (UFS) |
Graduate program: | Pós-Graduação em Ciência da Computação (Graduate Program in Computer Science) |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords (Portuguese): | Computação; Processamento de linguagem natural (Computação); Conhecimento morfológico; Paráfrase |
Keywords (English): | Word embeddings; Natural language processing; Morphological knowledge; Paraphrase |
CNPq knowledge area: | CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
Access link: | http://ri.ufs.br/jspui/handle/riufs/11230 |
Citation: | SANTOS, Flávio Arthur Oliveira. Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings. 2018. 70 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Sergipe, São Cristóvão, SE, 2018. |
Abstract: | Word representations are important for many Natural Language Processing (NLP) tasks. Obtaining good representations is essential, since most machine learning methods that solve NLP tasks are mathematical models operating on these numerical representations, which are able to incorporate syntactic and semantic information about the words. The so-called Word Embeddings, vectors of real numbers produced by machine learning models, are a recent and popular example of such representations. GloVe and Word2Vec are widespread models in the literature that learn these representations. However, both assign a single vector representation to each word, so that: (i) the word's morphological information is ignored and (ii) word-level paraphrases are represented by different vectors. Ignoring morphological knowledge is a problem because that knowledge comprises very important information, such as the root (radical), gender and number endings, the thematic vowel, and affixes. Words sharing such features should have similar representations. Word-level paraphrase representations should be similar because they are words with different spellings that share the same meaning. The FastText model tries to solve problem (i) by representing a word as a bag of character n-grams; each n-gram is represented as a vector of real numbers, and a word is represented by the sum of the vectors of its n-grams. Nevertheless, using every possible character n-gram is a brute-force solution, without any scientific basis, that compromises (or makes unviable) training performance on most computing platforms available to research institutions, since it is computationally costly. Besides, some n-grams show no semantic relation to their reference words. To tackle this issue, this work proposes the Morphological Skip-Gram model. The research hypothesis is that exchanging the bag of character n-grams for a bag of the word's morphemes results in words with similar morphemes and contexts having similar representations. The model was evaluated on 12 different tasks, which measure how well the learned word embeddings incorporate syntactic and semantic information about the words. The results show that the Morphological Skip-Gram model is competitive with FastText while being 40% faster. To try to solve problem (ii), this work proposes the GloVe Paraphrase method, in which information from a word-level paraphrase dataset is used to reinforce the original GloVe method so that paraphrase vectors end up more similar. The experimental results show that GloVe Paraphrase requires fewer training epochs to obtain good vector representations. |
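The abstract's core contrast can be sketched in a few lines of code. This is an illustrative sketch, not the thesis implementation: `char_ngrams` follows FastText's scheme of adding `<` and `>` boundary markers before extracting n-grams, while the morpheme segmentation of "walking" is a hypothetical example (in practice it would come from a morphological analyzer). In both schemes a word vector is the sum of its subunit vectors; the point is how many subunits each scheme sums.

```python
from collections import defaultdict
import random

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, with FastText-style boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Toy embedding table: each subunit lazily gets a small random vector.
random.seed(0)
dim = 4
emb = defaultdict(lambda: [random.uniform(-1, 1) for _ in range(dim)])

def word_vector(subunits):
    """A word vector is the element-wise sum of its subunit vectors."""
    vec = [0.0] * dim
    for s in subunits:
        for i, x in enumerate(emb[s]):
            vec[i] += x
    return vec

# FastText-style: many character n-grams per word (22 for "walking" here).
ngrams = char_ngrams("walking")
v_fasttext = word_vector(ngrams)

# Morphological Skip-Gram idea: a handful of morphemes instead
# (segmentation below is a hypothetical example).
morphemes = ["walk", "ing"]
v_morph = word_vector(morphemes)
```

The gap between `len(ngrams)` and `len(morphemes)` is what motivates the 40% training speed-up claimed in the abstract: far fewer subunit vectors are looked up and summed per word.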
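The GloVe Paraphrase idea can likewise be sketched. The exact form of the reinforcement term is not given in this record, so the squared-distance penalty below is an assumption, and the word pair is a hypothetical paraphrase-dataset entry: adding such a penalty to the GloVe loss and minimizing it pulls the vectors of word-level paraphrases together.

```python
import random

random.seed(1)
dim = 4
vocab = ["car", "automobile", "dog"]
W = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def paraphrase_penalty(W, pairs):
    """Sum of squared distances between paraphrase-pair vectors.

    Minimizing this (added to the usual GloVe co-occurrence loss with some
    weight) makes paraphrase vectors more similar.
    """
    total = 0.0
    for a, b in pairs:
        total += sum((x - y) ** 2 for x, y in zip(W[a], W[b]))
    return total

pairs = [("car", "automobile")]  # hypothetical word-level paraphrase dataset
loss_extra = paraphrase_penalty(W, pairs)
# Gradient steps on (glove_loss + lam * loss_extra) drive the two vectors closer;
# the penalty is zero exactly when the paraphrase vectors coincide.
```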