An analysis of hierarchical text classification using word embeddings

Detalhes bibliográficos
Ano de defesa: 2018
Autor(a) principal: Stein, Roger Alan
Orientador(a): Maillard, Patrícia Augustin Jaques
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade do Vale do Rio dos Sinos
Programa de Pós-Graduação: Programa de Pós-Graduação em Computação Aplicada
Departamento: Escola Politécnica
País: Brasil
Palavras-chave em Português:
Palavras-chave em Inglês:
Área do conhecimento CNPq:
Link de acesso: http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624
Resumo: Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not been assessed for the hierarchical text classification (HTC) yet. This study investigates application of those models and algorithms on this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations—fastText, XGBoost, and Keras’ CNN—and noticeable word embeddings generation methods—GloVe, word2vec, and fastText—with publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The results analysis indicates that using word embeddings is a very promising approach for HTC.
id USIN_1a4235ac37f32d68ace8ad8f9fb56fc3
oai_identifier_str oai:www.repositorio.jesuita.org.br:UNISINOS/7624
network_acronym_str USIN
network_name_str Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos)
repository_id_str
spelling 2019-03-07T14:41:05Z2019-03-07T14:41:05Z2018-03-28Submitted by JOSIANE SANTOS DE OLIVEIRA (josianeso) on 2019-03-07T14:41:05Z No. of bitstreams: 1 Roger Alan Stein_.pdf: 476239 bytes, checksum: a87a32ffe84d0e5d7a882e0db7b03847 (MD5)Made available in DSpace on 2019-03-07T14:41:05Z (GMT). No. of bitstreams: 1 Roger Alan Stein_.pdf: 476239 bytes, checksum: a87a32ffe84d0e5d7a882e0db7b03847 (MD5) Previous issue date: 2018-03-28Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not been assessed for the hierarchical text classification (HTC) yet. This study investigates application of those models and algorithms on this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations—fastText, XGBoost, and Keras’ CNN—and noticeable word embeddings generation methods—GloVe, word2vec, and fastText—with publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The results analysis indicates that using word embeddings is a very promising approach for HTC.Modelos eficientes de representação numérica textual (word embeddings) combinados com algoritmos modernos de aprendizado de máquina têm recentemente produzido uma melhoria considerável em tarefas de classificação automática de documentos. Contudo, a efetividade de tais técnicas ainda não foi avaliada com relação à classificação hierárquica de texto. Este estudo investiga a aplicação daqueles modelos e algoritmos neste problema em específico através de experimentação e análise. Modelos de classificação foram treinados usando implementações proeminentes de algoritmos de aprendizado de máquina—fastText, XGBoost e CNN (Keras)— e notórios métodos de geração de word embeddings—GloVe, word2vec e fastText—com dados disponíveis publicamente e avaliados usando métricas especificamente adequadas ao contexto hierárquico. Nesses experimentos, fastText alcançou um LCAF1 de 0,871 usando uma versão da base de dados RCV1 com apenas uma categoria por tupla. A análise dos resultados indica que a utilização de word embeddings é uma abordagem muito promissora para classificação hierárquica de texto.CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível SuperiorStein, Roger Alanhttp://lattes.cnpq.br/6303163503199490http://lattes.cnpq.br/5723385125570881Valiati, João Franciscohttp://lattes.cnpq.br/4658545839496086Maillard, Patrícia Augustin JaquesUniversidade do Vale do Rio dos SinosPrograma de Pós-Graduação em Computação AplicadaUnisinosBrasilEscola PolitécnicaAn analysis of hierarchical text classification using word embeddingsACCNPQ::Ciências Exatas e da Terra::Ciência da ComputaçãoClassificação hierárquicaClassificação textualRedes neurais (computação)FastTextHierarchical classificationText classificationWord embeddingsConvolutional neural networksFastTextinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesishttp://www.repositorio.jesuita.org.br/handle/UNISINOS/7624info:eu-repo/semantics/openAccessporreponame:Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos)instname:Universidade do Vale do Rio dos Sinos (UNISINOS)instacron:UNISINOSORIGINALRoger Alan Stein_.pdfRoger Alan Stein_.pdfapplication/pdf476239http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/1/Roger+Alan+Stein_.pdfa87a32ffe84d0e5d7a882e0db7b03847MD51LICENSElicense.txtlicense.txttext/plain; charset=utf-82175http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/2/license.txt320e21f23402402ac4988605e1edd177MD52UNISINOS/76242019-03-07 11:53:03.897oai:www.repositorio.jesuita.org.br:UNISINOS/7624Ck5PVEE6IENPTE9RVUUgQVFVSSBBIFNVQSBQUsOTUFJJQSBMSUNFTsOHQQoKRXN0YSBsaWNlbsOnYSBkZSBleGVtcGxvIMOpIGZvcm5lY2lkYSBhcGVuYXMgcGFyYSBmaW5zIGluZm9ybWF0aXZvcy4KCkxpY2Vuw6dhIERFIERJU1RSSUJVScOHw4NPIE7Dg08tRVhDTFVTSVZBCgpDb20gYSBhcHJlc2VudGHDp8OjbyBkZXN0YSBsaWNlbsOnYSwgdm9jw6ogKG8gYXV0b3IgKGVzKSBvdSBvIHRpdHVsYXIgZG9zIGRpcmVpdG9zIGRlIGF1dG9yKSBjb25jZWRlIMOgIApVbml2ZXJzaWRhZGUgZG8gVmFsZSBkbyBSaW8gZG9zIFNpbm9zIChVTklTSU5PUykgbyBkaXJlaXRvIG7Do28tZXhjbHVzaXZvIGRlIHJlcHJvZHV6aXIsICB0cmFkdXppciAoY29uZm9ybWUgZGVmaW5pZG8gYWJhaXhvKSwgZS9vdSAKZGlzdHJpYnVpciBhIHN1YSB0ZXNlIG91IGRpc3NlcnRhw6fDo28gKGluY2x1aW5kbyBvIHJlc3VtbykgcG9yIHRvZG8gbyBtdW5kbyBubyBmb3JtYXRvIGltcHJlc3NvIGUgZWxldHLDtG5pY28gZSAKZW0gcXVhbHF1ZXIgbWVpbywgaW5jbHVpbmRvIG9zIGZvcm1hdG9zIMOhdWRpbyBvdSB2w61kZW8uCgpWb2PDqiBjb25jb3JkYSBxdWUgYSBTaWdsYSBkZSBVbml2ZXJzaWRhZGUgcG9kZSwgc2VtIGFsdGVyYXIgbyBjb250ZcO6ZG8sIHRyYW5zcG9yIGEgc3VhIHRlc2Ugb3UgZGlzc2VydGHDp8OjbyAKcGFyYSBxdWFscXVlciBtZWlvIG91IGZvcm1hdG8gcGFyYSBmaW5zIGRlIHByZXNlcnZhw6fDo28uCgpWb2PDqiB0YW1iw6ltIGNvbmNvcmRhIHF1ZSBhIFNpZ2xhIGRlIFVuaXZlcnNpZGFkZSBwb2RlIG1hbnRlciBtYWlzIGRlIHVtYSBjw7NwaWEgYSBzdWEgdGVzZSBvdSAKZGlzc2VydGHDp8OjbyBwYXJhIGZpbnMgZGUgc2VndXJhbsOnYSwgYmFjay11cCBlIHByZXNlcnZhw6fDo28uCgpWb2PDqiBkZWNsYXJhIHF1ZSBhIHN1YSB0ZXNlIG91IGRpc3NlcnRhw6fDo28gw6kgb3JpZ2luYWwgZSBxdWUgdm9jw6ogdGVtIG8gcG9kZXIgZGUgY29uY2VkZXIgb3MgZGlyZWl0b3MgY29udGlkb3MgCm5lc3RhIGxpY2Vuw6dhLiBWb2PDqiB0YW1iw6ltIGRlY2xhcmEgcXVlIG8gZGVww7NzaXRvIGRhIHN1YSB0ZXNlIG91IGRpc3NlcnRhw6fDo28gbsOjbywgcXVlIHNlamEgZGUgc2V1IApjb25oZWNpbWVudG8sIGluZnJpbmdlIGRpcmVpdG9zIGF1dG9yYWlzIGRlIG5pbmd1w6ltLgoKQ2FzbyBhIHN1YSB0ZXNlIG91IGRpc3NlcnRhw6fDo28gY29udGVuaGEgbWF0ZXJpYWwgcXVlIHZvY8OqIG7Do28gcG9zc3VpIGEgdGl0dWxhcmlkYWRlIGRvcyBkaXJlaXRvcyBhdXRvcmFpcywgdm9jw6ogCmRlY2xhcmEgcXVlIG9idGV2ZSBhIHBlcm1pc3PDo28gaXJyZXN0cml0YSBkbyBkZXRlbnRvciBkb3MgZGlyZWl0b3MgYXV0b3JhaXMgcGFyYSBjb25jZWRlciDDoCBTaWdsYSBkZSBVbml2ZXJzaWRhZGUgCm9zIGRpcmVpdG9zIGFwcmVzZW50YWRvcyBuZXN0YSBsaWNlbsOnYSwgZSBxdWUgZXNzZSBtYXRlcmlhbCBkZSBwcm9wcmllZGFkZSBkZSB0ZXJjZWlyb3MgZXN0w6EgY2xhcmFtZW50ZSAKaWRlbnRpZmljYWRvIGUgcmVjb25oZWNpZG8gbm8gdGV4dG8gb3Ugbm8gY29udGXDumRvIGRhIHRlc2Ugb3UgZGlzc2VydGHDp8OjbyBvcmEgZGVwb3NpdGFkYS4KCkNBU08gQSBURVNFIE9VIERJU1NFUlRBw4fDg08gT1JBIERFUE9TSVRBREEgVEVOSEEgU0lETyBSRVNVTFRBRE8gREUgVU0gUEFUUk9Dw41OSU8gT1UgCkFQT0lPIERFIFVNQSBBR8OKTkNJQSBERSBGT01FTlRPIE9VIE9VVFJPIE9SR0FOSVNNTyBRVUUgTsODTyBTRUpBIEEgU0lHTEEgREUgClVOSVZFUlNJREFERSwgVk9Dw4ogREVDTEFSQSBRVUUgUkVTUEVJVE9VIFRPRE9TIEUgUVVBSVNRVUVSIERJUkVJVE9TIERFIFJFVklTw4NPIENPTU8gClRBTULDiU0gQVMgREVNQUlTIE9CUklHQcOHw5VFUyBFWElHSURBUyBQT1IgQ09OVFJBVE8gT1UgQUNPUkRPLgoKQSBTaWdsYSBkZSBVbml2ZXJzaWRhZGUgc2UgY29tcHJvbWV0ZSBhIGlkZW50aWZpY2FyIGNsYXJhbWVudGUgbyBzZXUgbm9tZSAocykgb3UgbyhzKSBub21lKHMpIGRvKHMpIApkZXRlbnRvcihlcykgZG9zIGRpcmVpdG9zIGF1dG9yYWlzIGRhIHRlc2Ugb3UgZGlzc2VydGHDp8OjbywgZSBuw6NvIGZhcsOhIHF1YWxxdWVyIGFsdGVyYcOnw6NvLCBhbMOpbSBkYXF1ZWxhcyAKY29uY2VkaWRhcyBwb3IgZXN0YSBsaWNlbsOnYS4KBiblioteca Digital de Teses e DissertaçõesPRIhttp://www.repositorio.jesuita.org.br/oai/requestmaicons@unisinos.br ||dspace@unisinos.bropendoar:2019-03-07T14:53:03Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) - Universidade do Vale do Rio dos Sinos (UNISINOS)false
dc.title.pt_BR.fl_str_mv An analysis of hierarchical text classification using word embeddings
title An analysis of hierarchical text classification using word embeddings
spellingShingle An analysis of hierarchical text classification using word embeddings
Stein, Roger Alan
ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação
Classificação hierárquica
Classificação textual
Redes neurais (computação)
FastText
Hierarchical classification
Text classification
Word embeddings
Convolutional neural networks
FastText
title_short An analysis of hierarchical text classification using word embeddings
title_full An analysis of hierarchical text classification using word embeddings
title_fullStr An analysis of hierarchical text classification using word embeddings
title_full_unstemmed An analysis of hierarchical text classification using word embeddings
title_sort An analysis of hierarchical text classification using word embeddings
author Stein, Roger Alan
author_facet Stein, Roger Alan
author_role author
dc.contributor.authorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/6303163503199490
dc.contributor.advisorLattes.pt_BR.fl_str_mv http://lattes.cnpq.br/5723385125570881
dc.contributor.author.fl_str_mv Stein, Roger Alan
dc.contributor.advisor-co1.fl_str_mv Valiati, João Francisco
dc.contributor.advisor-co1Lattes.fl_str_mv http://lattes.cnpq.br/4658545839496086
dc.contributor.advisor1.fl_str_mv Maillard, Patrícia Augustin Jaques
contributor_str_mv Valiati, João Francisco
Maillard, Patrícia Augustin Jaques
dc.subject.cnpq.fl_str_mv ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação
topic ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação
Classificação hierárquica
Classificação textual
Redes neurais (computação)
FastText
Hierarchical classification
Text classification
Word embeddings
Convolutional neural networks
FastText
dc.subject.por.fl_str_mv Classificação hierárquica
Classificação textual
Redes neurais (computação)
FastText
dc.subject.eng.fl_str_mv Hierarchical classification
Text classification
Word embeddings
Convolutional neural networks
FastText
description Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not been assessed for the hierarchical text classification (HTC) yet. This study investigates application of those models and algorithms on this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations—fastText, XGBoost, and Keras’ CNN—and noticeable word embeddings generation methods—GloVe, word2vec, and fastText—with publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The results analysis indicates that using word embeddings is a very promising approach for HTC.
publishDate 2018
dc.date.issued.fl_str_mv 2018-03-28
dc.date.accessioned.fl_str_mv 2019-03-07T14:41:05Z
dc.date.available.fl_str_mv 2019-03-07T14:41:05Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624
url http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade do Vale do Rio dos Sinos
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Computação Aplicada
dc.publisher.initials.fl_str_mv Unisinos
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Escola Politécnica
publisher.none.fl_str_mv Universidade do Vale do Rio dos Sinos
dc.source.none.fl_str_mv reponame:Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos)
instname:Universidade do Vale do Rio dos Sinos (UNISINOS)
instacron:UNISINOS
instname_str Universidade do Vale do Rio dos Sinos (UNISINOS)
instacron_str UNISINOS
institution UNISINOS
reponame_str Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos)
collection Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos)
bitstream.url.fl_str_mv http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/1/Roger+Alan+Stein_.pdf
http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/2/license.txt
bitstream.checksum.fl_str_mv a87a32ffe84d0e5d7a882e0db7b03847
320e21f23402402ac4988605e1edd177
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) - Universidade do Vale do Rio dos Sinos (UNISINOS)
repository.mail.fl_str_mv maicons@unisinos.br ||dspace@unisinos.br
_version_ 1853242072025268224