An analysis of hierarchical text classification using word embeddings
| Year of defense: | 2018 |
|---|---|
| Lead author: | Stein, Roger Alan |
| Advisor: | Maillard, Patrícia Augustin Jaques |
| Defense committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | por |
| Defending institution: | Universidade do Vale do Rio dos Sinos |
| Graduate program: | Programa de Pós-Graduação em Computação Aplicada |
| Department: | Escola Politécnica |
| Country: | Brazil |
| Keywords in Portuguese: | Classificação hierárquica; Classificação textual; Redes neurais (computação); FastText |
| Keywords in English: | Hierarchical classification; Text classification; Word embeddings; Convolutional neural networks; FastText |
| CNPq knowledge area: | Ciências Exatas e da Terra::Ciência da Computação |
| Access link: | http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624 |
Abstract: | Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvements on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations (fastText, XGBoost, and Keras’ CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and were evaluated with measures specifically suited to the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The analysis of the results indicates that using word embeddings is a very promising approach for HTC. |
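The abstract refers to evaluation measures suited to the hierarchical context. As a rough illustration of what such a measure does, the sketch below computes the simpler ancestor-augmented hierarchical precision, recall, and F1. It is not the LCAF1 measure reported in the abstract, and the thesis may compute its scores differently. The `PARENT` map is a hypothetical toy hierarchy loosely modeled on RCV1-style topic codes, and all function names are assumptions made for this example.

```python
from typing import Dict, Sequence, Set, Tuple

# Hypothetical toy hierarchy (child -> parent); not taken from the thesis.
PARENT: Dict[str, str] = {
    "C15": "CCAT",
    "C151": "C15",
    "C152": "C15",
    "E12": "ECAT",
}


def with_ancestors(label: str, parent: Dict[str, str]) -> Set[str]:
    """Return the label together with all of its ancestors in the hierarchy."""
    nodes = {label}
    while label in parent:
        label = parent[label]
        nodes.add(label)
    return nodes


def hierarchical_prf(
    true_labels: Sequence[str],
    pred_labels: Sequence[str],
    parent: Dict[str, str],
) -> Tuple[float, float, float]:
    """Ancestor-augmented hierarchical precision, recall, and F1
    for single-labeled documents (one true and one predicted label each)."""
    overlap = true_total = pred_total = 0
    for true_label, pred_label in zip(true_labels, pred_labels):
        true_set = with_ancestors(true_label, parent)
        pred_set = with_ancestors(pred_label, parent)
        overlap += len(true_set & pred_set)
        true_total += len(true_set)
        pred_total += len(pred_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One sibling confusion (C151 vs C152) and one exact match.
    print(hierarchical_prf(["C151", "E12"], ["C152", "E12"], PARENT))
```

In this toy case flat accuracy would count the sibling confusion as a complete miss, whereas the hierarchical score gives partial credit for the shared ancestors `C15` and `CCAT`.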
| id |
USIN_1a4235ac37f32d68ace8ad8f9fb56fc3 |
|---|---|
| oai_identifier_str |
oai:www.repositorio.jesuita.org.br:UNISINOS/7624 |
| network_acronym_str |
USIN |
| network_name_str |
Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) |
| repository_id_str |
|
| spelling |
Submitted by JOSIANE SANTOS DE OLIVEIRA (josianeso) on 2019-03-07T14:41:05Z. Made available in DSpace on 2019-03-07T14:41:05Z (GMT). Previous issue date: 2018-03-28. No. of bitstreams: 1. Roger Alan Stein_.pdf: 476239 bytes, checksum: a87a32ffe84d0e5d7a882e0db7b03847 (MD5). Agency listed: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior. License file: license.txt (text/plain; charset=utf-8, 2175 bytes), checksum: 320e21f23402402ac4988605e1edd177 (MD5). Source: Biblioteca Digital de Teses e Dissertações. OAI endpoint: http://www.repositorio.jesuita.org.br/oai/request |
| dc.title.pt_BR.fl_str_mv |
An analysis of hierarchical text classification using word embeddings |
| title |
An analysis of hierarchical text classification using word embeddings |
| spellingShingle |
An analysis of hierarchical text classification using word embeddings Stein, Roger Alan ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação Classificação hierárquica Classificação textual Redes neurais (computação) FastText Hierarchical classification Text classification Word embeddings Convolutional neural networks FastText |
| title_short |
An analysis of hierarchical text classification using word embeddings |
| title_full |
An analysis of hierarchical text classification using word embeddings |
| title_fullStr |
An analysis of hierarchical text classification using word embeddings |
| title_full_unstemmed |
An analysis of hierarchical text classification using word embeddings |
| title_sort |
An analysis of hierarchical text classification using word embeddings |
| author |
Stein, Roger Alan |
| author_facet |
Stein, Roger Alan |
| author_role |
author |
| dc.contributor.authorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/6303163503199490 |
| dc.contributor.advisorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/5723385125570881 |
| dc.contributor.author.fl_str_mv |
Stein, Roger Alan |
| dc.contributor.advisor-co1.fl_str_mv |
Valiati, João Francisco |
| dc.contributor.advisor-co1Lattes.fl_str_mv |
http://lattes.cnpq.br/4658545839496086 |
| dc.contributor.advisor1.fl_str_mv |
Maillard, Patrícia Augustin Jaques |
| contributor_str_mv |
Valiati, João Francisco Maillard, Patrícia Augustin Jaques |
| dc.subject.cnpq.fl_str_mv |
ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação |
| topic |
ACCNPQ::Ciências Exatas e da Terra::Ciência da Computação Classificação hierárquica Classificação textual Redes neurais (computação) FastText Hierarchical classification Text classification Word embeddings Convolutional neural networks FastText |
| dc.subject.por.fl_str_mv |
Classificação hierárquica Classificação textual Redes neurais (computação) FastText |
| dc.subject.eng.fl_str_mv |
Hierarchical classification Text classification Word embeddings Convolutional neural networks FastText |
| description |
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvements on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations (fastText, XGBoost, and Keras’ CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and were evaluated with measures specifically suited to the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The analysis of the results indicates that using word embeddings is a very promising approach for HTC. |
| publishDate |
2018 |
| dc.date.issued.fl_str_mv |
2018-03-28 |
| dc.date.accessioned.fl_str_mv |
2019-03-07T14:41:05Z |
| dc.date.available.fl_str_mv |
2019-03-07T14:41:05Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
| format |
masterThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624 |
| url |
http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624 |
| dc.language.iso.fl_str_mv |
por |
| language |
por |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.publisher.none.fl_str_mv |
Universidade do Vale do Rio dos Sinos |
| dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Computação Aplicada |
| dc.publisher.initials.fl_str_mv |
Unisinos |
| dc.publisher.country.fl_str_mv |
Brasil |
| dc.publisher.department.fl_str_mv |
Escola Politécnica |
| publisher.none.fl_str_mv |
Universidade do Vale do Rio dos Sinos |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) instname:Universidade do Vale do Rio dos Sinos (UNISINOS) instacron:UNISINOS |
| instname_str |
Universidade do Vale do Rio dos Sinos (UNISINOS) |
| instacron_str |
UNISINOS |
| institution |
UNISINOS |
| reponame_str |
Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) |
| collection |
Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) |
| bitstream.url.fl_str_mv |
http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/1/Roger+Alan+Stein_.pdf http://repositorio.jesuita.org.br/bitstream/UNISINOS/7624/2/license.txt |
| bitstream.checksum.fl_str_mv |
a87a32ffe84d0e5d7a882e0db7b03847 320e21f23402402ac4988605e1edd177 |
| bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
| repository.name.fl_str_mv |
Repositório Institucional da UNISINOS (RBDU Repositório Digital da Biblioteca da Unisinos) - Universidade do Vale do Rio dos Sinos (UNISINOS) |
| repository.mail.fl_str_mv |
maicons@unisinos.br ||dspace@unisinos.br |
| _version_ |
1853242072025268224 |
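
The record lists an MD5 checksum for each bitstream. A downloaded copy could be checked against those values along the following lines. This is a minimal sketch: the function names and the local file path in the usage note are assumptions for illustration, and MD5 is only suitable here for detecting accidental corruption, not tampering.

```python
import hashlib
from pathlib import Path

# Checksum of the thesis PDF as listed in the record's bitstream.checksum field.
EXPECTED_PDF_MD5 = "a87a32ffe84d0e5d7a882e0db7b03847"


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: Path, expected_hex: str) -> bool:
    """Return True if the file's MD5 digest equals the expected hex string."""
    return file_md5(path) == expected_hex.strip().lower()
```

For example, `matches_record(Path("Roger Alan Stein_.pdf"), EXPECTED_PDF_MD5)` would return `True` for an intact download saved under that (assumed) local name.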