Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings (On the use of expert knowledge to aid the learning of word embeddings)

Bibliographic details
Year of defense: 2018
Main author: Santos, Flávio Arthur Oliveira
Advisor: Macedo, Hendrik Teixeira
Defense committee: Not informed by the institution
Document type: Master's thesis (dissertação)
Access type: Open access
Language: Portuguese
Defense institution: Universidade Federal de Sergipe (UFS)
Graduate program: Pós-Graduação em Ciência da Computação (Graduate Program in Computer Science)
Department: Not informed by the institution
Country: Brazil
Keywords in Portuguese: Computação; Processamento de linguagem natural (Computação); Conhecimento morfológico; Paráfrase
Keywords in English: Word embeddings; Natural language processing; Morphological knowledge; Paraphrase
CNPq knowledge area: CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link: http://ri.ufs.br/jspui/handle/riufs/11230
Abstract: Word representations are important for many Natural Language Processing (NLP) tasks. Obtaining good representations is essential, since most machine learning methods that solve NLP tasks are mathematical models operating on these numerical representations, which are able to incorporate syntactic and semantic information about the words. The so-called word embeddings, vectors of real numbers produced by machine learning models, are a recent and popular example of such representations. GloVe and Word2Vec are widespread models in the literature for learning these representations. However, both assign a single vector representation to each word, so that: (i) morphological information is ignored and (ii) word-level paraphrases are represented by different vectors. Ignoring morphological knowledge is a problem because it carries very important information, such as the stem, gender and number endings, thematic vowel, and affixes; words sharing these features should have similar representations. Word-level paraphrase representations should be similar because they are words with different spellings that share the same meaning. The FastText model addresses problem (i) by representing a word as a bag of character n-grams: each n-gram is represented as a vector of real numbers, and a word is represented by the sum of its n-gram vectors. However, using every possible character n-gram is a brute-force solution, without scientific grounding, that compromises (or makes unviable) model training on most computing platforms available to research institutions, since it is computationally costly. Moreover, some n-grams bear no semantic relation to their reference words. To tackle this issue, this work proposes the Morphological Skip-Gram model.
The research hypothesis is that replacing the bag of character n-grams with the word's bag of morphemes causes words with similar morphemes and contexts to have similar representations. The model was evaluated on 12 different tasks, which measure how well the learned word embeddings incorporate syntactic and semantic information about the words. The results show that the Morphological Skip-Gram model is competitive with FastText while being 40% faster. To address problem (ii), this work proposes the GloVe Paraphrase method, in which a word-level paraphrase dataset is used to reinforce the original GloVe method so that paraphrase vectors become more similar. The experimental results show that GloVe Paraphrase requires fewer training epochs to obtain good vector representations.
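The contrast between FastText's subword bag and the proposed morpheme bag can be sketched as follows. This is an illustrative toy, not the dissertation's implementation: the n-gram range (3–6, FastText's default), the hand-written morpheme segmentation, and the vector dimension are all assumptions.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style bag: every character n-gram of the padded word."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# Morphological Skip-Gram idea: keep only linguistically meaningful
# subwords. The segmentation below is hand-written for illustration.
MORPHEMES = {"unhappiness": ["un", "happi", "ness"]}

def word_vector(subwords, table, rng, dim=4):
    """A word vector is the sum of its subword vectors."""
    for s in subwords:
        if s not in table:
            table[s] = rng.normal(size=dim)  # lazily initialize subword vectors
    return sum(table[s] for s in subwords)

rng = np.random.default_rng(0)
v = word_vector(char_ngrams("unhappiness"), {}, rng)
print(len(char_ngrams("unhappiness")))  # 38 character n-grams
print(len(MORPHEMES["unhappiness"]))    # 3 morphemes
```

The count gap (38 subword vectors versus 3) is the thesis's efficiency argument in miniature: the morpheme bag carries the morphological signal with far fewer parameters to store and update.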
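The dissertation's exact GloVe Paraphrase objective is not reproduced here; as a minimal sketch of the idea, one way to "reinforce" GloVe with a word-level paraphrase dataset is to add an attraction term lam * ||w_p - w_q||^2 to the loss for each paraphrase pair (p, q). The pair list, the weight `lam`, and the learning rate below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, lr = 8, 0.1, 0.5
vecs = {w: rng.normal(size=dim) for w in ["car", "automobile"]}
pairs = [("car", "automobile")]  # hypothetical word-level paraphrase pair

def penalty(vecs, pairs):
    # Term added to the usual GloVe co-occurrence loss:
    # lam * sum of ||w_p - w_q||^2 over paraphrase pairs (p, q).
    return lam * sum(np.sum((vecs[p] - vecs[q]) ** 2) for p, q in pairs)

before = penalty(vecs, pairs)
for _ in range(50):              # gradient descent on the penalty alone
    for p, q in pairs:
        g = 2 * lam * (vecs[p] - vecs[q])
        vecs[p] -= lr * g
        vecs[q] += lr * g
after = penalty(vecs, pairs)
print(after < before)            # True: the paraphrase vectors drew closer
```

In full training this term would be minimized jointly with the GloVe co-occurrence loss, so paraphrase pairs start from (and stay in) nearby regions of the embedding space, which is consistent with the reported reduction in training epochs.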
Citation: SANTOS, Flávio Arthur Oliveira. Sobre o uso de conhecimento especialista para auxiliar no aprendizado de Word Embeddings. 2018. 70 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal de Sergipe, São Cristóvão, SE, 2018.
Defense date: 2018-07-31
Deposited in repository: 2019-05-28
Funding: Fundação de Apoio a Pesquisa e à Inovação Tecnológica do Estado de Sergipe (FAPITEC/SE)