Interactive keyterm-based document clustering and visualization via neural language models

Cabral, Eric Macedo

Bibliographic details
Year of defense:	2020
Main author:	Cabral, Eric Macedo
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defense institution:	Biblioteca Digitais de Teses e Dissertações da USP
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Agrupamento interativo de documentos; Interactive document clustering; Modelos neurais de linguagem; Neural language models; Visual analytics; Visualização analítica
Access link:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
Abstract:	Interactive data clustering techniques put the user in the clustering algorithm loop, which not only improves clustering quality but also supports knowledge discovery in large textual corpora. The keyterm-guided approach is arguably intuitive: the user interacts with representative words instead of a large volume of full-length documents or complex topic models. Beyond making the clustering algorithm adjustable with little user effort, the visual interactive clustering approach lets the user explore the corpus as an incremental task. After each interaction, the user can obtain new information about the corpus and express it as feedback to the clustering algorithm. The visual analytics system Vis-Kt is an interactive keyterm-based document clustering system whose embedded techniques outperform state-of-the-art ones such as Latent Dirichlet Allocation and Non-negative Matrix Factorization. With a user-guided approach, Vis-Kt allows the user to express their insights into the corpus by describing document clusters with a small set of significant terms. However, Vis-Kt and its underlying clustering algorithms depend on the Bag-of-Words model, which has several limitations concerning the scalability of information extraction, the incrementality of the process, and the semantic representation of the data. To overcome these limitations, we propose updating the keyterm-based representation model to a machine learning approach based on neural language models. Such a model can extract semantic information and relationships from the words in the corpus. This project's main contribution is a novel interactive document clustering algorithm guided by keyterms and based on neural language models. This approach shows a significant improvement over the baseline algorithms, which are considered state of the art.
The proposed clustering algorithm allows Vis-Kt to work incrementally, without repeating the entire learning and clustering process from the beginning, which makes the system suitable for analyzing text streams. To contribute to knowledge discovery and support its incremental aspect, a visual component based on the Sankey diagram was developed to depict changes in cluster membership throughout the clustering loop after each interaction with the corpus. A set of quantitative experiments on publicly available text datasets was performed to evaluate the clustering results. The results show that, in most of the cases tested, the proposed algorithm significantly improves clustering quality measures compared with the baseline algorithms. In all cases, the proposed algorithm reduced processing time, especially on the largest datasets. We also report two usage scenarios that qualitatively evaluate the proposed visual component.

Item metadata

id	USP_0fd9510daa6fe668a72ccb476aeaf900
oai_identifier_str	oai:teses.usp.br:tde-20082020-093906
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv	Interactive keyterm-based document clustering and visualization via neural language models Agrupamento interativo e visualização de documentos baseado em termos-chave via modelos neurais de linguagem
title	Interactive keyterm-based document clustering and visualization via neural language models
author	Cabral, Eric Macedo
author_facet	Cabral, Eric Macedo
author_role	author
dc.contributor.none.fl_str_mv	Minghim, Rosane
dc.contributor.author.fl_str_mv	Cabral, Eric Macedo
dc.subject.por.fl_str_mv	Agrupamento interativo de documentos Interactive document clustering Modelos neurais de linguagem Neural language models Visual analytics Visualização analítica
topic	Agrupamento interativo de documentos Interactive document clustering Modelos neurais de linguagem Neural language models Visual analytics Visualização analítica
publishDate	2020
dc.date.none.fl_str_mv	2020-06-09
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br || atendimento@aguia.usp.br || virginia@if.usp.br
_version_	1865492508642902016
