Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings (On the use of expert knowledge to aid the learning of word embeddings)

Bibliographic details
Year of defense: 2018
Main author: Santos, Flávio Arthur Oliveira
Advisor: Macedo, Hendrik Teixeira
Defense committee: Not informed by the institution
Document type: Master's thesis (dissertação)
Access type: Open access
Language: Portuguese
Defense institution: Universidade Federal de Sergipe (UFS)
Graduate program: Pós-Graduação em Ciência da Computação (Graduate Program in Computer Science)
Department: Not informed by the institution
Country: Brazil
Keywords in Portuguese: Computação; Processamento de linguagem natural (Computação); Conhecimento morfológico; Paráfrase
Keywords in English: Word embeddings; Natural language processing; Morphological knowledge; Paraphrase
CNPq knowledge area: CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link: http://ri.ufs.br/jspui/handle/riufs/11230
Abstract: Word representations are important for many Natural Language Processing (NLP) tasks. Obtaining good representations is essential, since most machine learning methods that solve NLP tasks are mathematical models operating on these numerical representations, which are able to incorporate syntactic and semantic information about the words. The so-called word embeddings, vectors of real numbers produced by machine learning models, are a recent and popular example of such representations. GloVe and Word2Vec are widespread models in the literature for learning these representations. However, both assign a single vector representation to each word, so that: (i) morphological information is ignored and (ii) word-level paraphrases are represented by different vectors. Ignoring morphological knowledge is a problem because it carries very important information, such as the stem, gender and number endings, thematic vowel, and affixes; words sharing these features should have similar representations. Word-level paraphrase representations should be similar because they are words with different spellings that share the same meaning. The FastText model addresses problem (i) by representing a word as a bag of character n-grams: each n-gram is represented as a vector of real numbers, and a word is represented by the sum of its n-gram vectors. However, using every possible character n-gram is a brute-force solution, without scientific grounding, that compromises (or makes unviable) model training on most computing platforms available to research institutions, since it is computationally costly. Moreover, some n-grams bear no semantic relation to their reference words. To tackle this issue, this work proposes the Morphological Skip-Gram model.
The research hypothesis is that replacing the bag of character n-grams with the word's bag of morphemes causes words with similar morphemes and contexts to have similar representations. The model was evaluated on 12 different tasks, which measure how well the learned word embeddings incorporate syntactic and semantic information about the words. The results show that the Morphological Skip-Gram model is competitive with FastText while being 40% faster. To address problem (ii), this work proposes the GloVe Paraphrase method, in which a word-level paraphrase dataset is used to reinforce the original GloVe method so that paraphrase vectors become more similar. The experimental results show that GloVe Paraphrase requires fewer training epochs to obtain good vector representations.
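The contrast between FastText's subword bag and the proposed morpheme bag can be sketched as follows. This is an illustrative toy, not the dissertation's implementation: the n-gram range (3–6, FastText's default), the hand-written morpheme segmentation, and the vector dimension are all assumptions.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style bag: every character n-gram of the padded word."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# Morphological Skip-Gram idea: keep only linguistically meaningful
# subwords. The segmentation below is hand-written for illustration.
MORPHEMES = {"unhappiness": ["un", "happi", "ness"]}

def word_vector(subwords, table, rng, dim=4):
    """A word vector is the sum of its subword vectors."""
    for s in subwords:
        if s not in table:
            table[s] = rng.normal(size=dim)  # lazily initialize subword vectors
    return sum(table[s] for s in subwords)

rng = np.random.default_rng(0)
v = word_vector(char_ngrams("unhappiness"), {}, rng)
print(len(char_ngrams("unhappiness")))  # 38 character n-grams
print(len(MORPHEMES["unhappiness"]))    # 3 morphemes
```

The count gap (38 subword vectors versus 3) is the thesis's efficiency argument in miniature: the morpheme bag carries the morphological signal with far fewer parameters to store and update.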
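The dissertation's exact GloVe Paraphrase objective is not reproduced here; as a minimal sketch of the idea, one way to "reinforce" GloVe with a word-level paraphrase dataset is to add an attraction term lam * ||w_p - w_q||^2 to the loss for each paraphrase pair (p, q). The pair list, the weight `lam`, and the learning rate below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, lr = 8, 0.1, 0.5
vecs = {w: rng.normal(size=dim) for w in ["car", "automobile"]}
pairs = [("car", "automobile")]  # hypothetical word-level paraphrase pair

def penalty(vecs, pairs):
    # Term added to the usual GloVe co-occurrence loss:
    # lam * sum of ||w_p - w_q||^2 over paraphrase pairs (p, q).
    return lam * sum(np.sum((vecs[p] - vecs[q]) ** 2) for p, q in pairs)

before = penalty(vecs, pairs)
for _ in range(50):              # gradient descent on the penalty alone
    for p, q in pairs:
        g = 2 * lam * (vecs[p] - vecs[q])
        vecs[p] -= lr * g
        vecs[q] += lr * g
after = penalty(vecs, pairs)
print(after < before)            # True: the paraphrase vectors drew closer
```

In full training this term would be minimized jointly with the GloVe co-occurrence loss, so paraphrase pairs start from (and stay in) nearby regions of the embedding space, which is consistent with the reported reduction in training epochs.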
Citation: SANTOS, Flávio Arthur Oliveira. Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings. 2018. 70 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Sergipe, São Cristóvão, SE, 2018.
Defense date: 2018-07-31
Deposited in repository: 2019-05-28
Funding: Fundação de Apoio a Pesquisa e à Inovação Tecnológica do Estado de Sergipe (FAPITEC/SE)