Interactive keyterm-based document clustering and visualization via neural language models

Bibliographic details
Year of defense: 2020
Main author: Cabral, Eric Macedo
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defense institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
Abstract: Interactive data clustering techniques put the user in the clustering algorithm loop, which not only improves clustering quality but also supports knowledge discovery in large textual corpora. The keyterm-guided approach is arguably intuitive: the user interacts with representative words instead of a large volume of full-length documents or complex topic models. Beyond making the clustering algorithm adjustable with little user effort, the visual interactive clustering approach lets the user explore the corpus as an incremental task. After each interaction, the user can obtain new information about the corpus and express it as feedback to the clustering algorithm. The visual analytics system Vis-Kt is an interactive keyterm-based document clustering system embedding techniques that outperform state-of-the-art ones such as Latent Dirichlet Allocation and Non-negative Matrix Factorization. With a user-guided approach, Vis-Kt allows the user to express her insights about the corpus by describing document clusters with a small set of significant terms. However, Vis-Kt and its underlying clustering algorithms depend on the Bag-of-Words model, which has several limitations concerning the scalability of information extraction, the incrementality of the process, and the semantic representation of the data. To overcome the limitations inherent to the Bag-of-Words model, we propose updating the keyterm-based representation model to a machine learning approach based on neural language models. Such a model can extract semantic information and relationships from the words in the corpus. This project's main contribution is a novel interactive document clustering algorithm guided by keyterms and based on neural language models. This approach shows a significant improvement over the baseline algorithms, which are considered state-of-the-art.
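The keyterm-guided idea described above can be illustrated with a minimal sketch. It is not the algorithm from the dissertation: it assumes each cluster is described by a few user-chosen keyterms, that words map to dense vectors from some neural language model, and that a document goes to the cluster whose mean keyterm vector is closest by cosine similarity. The toy vectors, the `EMBEDDINGS` table, and the function names are all hypothetical.

```python
import math

# Invented 3-dimensional vectors standing in for a trained neural language
# model; a real system would load learned embeddings instead.
EMBEDDINGS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.3, 0.0],
    "vote": [0.1, 0.9, 0.1],
    "election": [0.0, 0.8, 0.2],
    "senate": [0.1, 0.9, 0.0],
}


def mean_vector(words):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign(documents, keyterms_by_cluster):
    """Label each document with the cluster whose keyterm centroid is closest."""
    centroids = {
        name: mean_vector(terms) for name, terms in keyterms_by_cluster.items()
    }
    labels = []
    for doc in documents:
        doc_vector = mean_vector(doc.lower().split())
        best = max(centroids, key=lambda name: cosine(doc_vector, centroids[name]))
        labels.append(best)
    return labels


documents = ["team match goal", "senate vote", "election vote team"]
keyterms = {"sports": ["goal", "match"], "politics": ["vote", "election"]}
labels = assign(documents, keyterms)
print(labels)
```

In this sketch, user feedback amounts to editing the `keyterms` dictionary and calling `assign` again; the word vectors themselves do not need to be recomputed.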
The proposed clustering algorithm allows Vis-Kt to work incrementally, without repeating the entire learning and clustering processes from the beginning, which makes the system suitable for analyzing text streams. To contribute to knowledge discovery and support its incremental aspect, a visual component based on the Sankey diagram was developed to depict the changes in cluster membership throughout the clustering loop after each interaction with the corpus. A set of quantitative experiments on publicly available text datasets was performed to evaluate the clustering results. The results reported in this work show that, in most of the tested cases, the proposed algorithm significantly improves clustering quality measures compared with the baseline algorithms. In all cases, the proposed algorithm showed a gain in processing time, especially on the largest datasets. We also report two usage scenarios to qualitatively evaluate the proposed visual component.
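A Sankey diagram of membership changes needs, for each pair of consecutive clustering rounds, the number of documents that moved from one cluster to another. The sketch below shows one way to count those flows. It is an illustration under assumed inputs (two equally long label lists, one label per document), not the component built in the dissertation, and `membership_flows` is a hypothetical name.

```python
from collections import Counter


def membership_flows(before, after):
    """Count documents per (old cluster, new cluster) pair.

    Each count can serve as the width of one Sankey link.
    """
    if len(before) != len(after):
        raise ValueError("both rounds must label the same documents")
    return Counter(zip(before, after))


# Cluster labels for five documents before and after one user interaction.
before = ["A", "A", "B", "B", "B"]
after = ["A", "B", "B", "B", "C"]

flows = membership_flows(before, after)
for (source, target), count in sorted(flows.items()):
    print(f"{source} -> {target}: {count}")
```

The resulting pairs and counts can be passed to any plotting library that draws Sankey links from source, target, and value triples.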
id USP_0fd9510daa6fe668a72ccb476aeaf900
oai_identifier_str oai:teses.usp.br:tde-20082020-093906
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv Interactive keyterm-based document clustering and visualization via neural language models
Agrupamento interativo e visualização de documentos baseado em termos-chave via modelos neurais de linguagem
title Interactive keyterm-based document clustering and visualization via neural language models
author Cabral, Eric Macedo
author_facet Cabral, Eric Macedo
author_role author
dc.contributor.none.fl_str_mv Minghim, Rosane
dc.contributor.author.fl_str_mv Cabral, Eric Macedo
dc.subject.por.fl_str_mv Agrupamento interativo de documentos
Interactive document clustering
Modelos neurais de linguagem
Neural language models
Visual analytics
Visualização analítica
topic Agrupamento interativo de documentos
Interactive document clustering
Modelos neurais de linguagem
Neural language models
Visual analytics
Visualização analítica
publishDate 2020
dc.date.none.fl_str_mv 2020-06-09
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1865492508642902016