Modelos para análise de textos: um comparativo do número de tópicos

Coelho Filho, Edvaldo Capobiango

Bibliographic details
Year of defense:	2024
Main author:	Coelho Filho, Edvaldo Capobiango
Advisor:	Zuanetti, Daiane Aparecida
Examination committee:	Not provided by the institution
Document type:	Master's thesis (Dissertação)
Access type:	Open access
Language:	Portuguese (por)
Defending institution:	Universidade Federal de São Carlos, Câmpus São Carlos
Graduate program:	Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Inferência Bayesiana; Latent Dirichlet Allocation; Métricas de desempenho; Modelagem de tópicos; Modelo de mistura
Keywords in English:	Bayesian approach; Performance metrics; Topic modeling; Mixture model
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::ANALISE DE DADOS; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::INFERENCIA PARAMETRICA
Access link:	https://repositorio.ufscar.br/handle/20.500.14289/20846
Abstract:	Text modeling has gained visibility and popularity in recent years because of the large and ever-growing amount of information present in daily life and consumed in various ways. Data preprocessing is a critical prior step for the efficiency and applicability of these models, since it helps organize and treat the texts. One branch of text analysis is topic modeling, whose methods aim to uncover the topic (subject) structure of a document, segmenting multiple documents by their dominant topics and thereby simplifying the exploration of large volumes of textual data through the resulting dimensionality reduction. One of the pioneering methods in this context is the Mixture Model (MM), which assumes that all the words of a document come from a single topic. Given this limitation, Latent Dirichlet Allocation (LDA) has gained considerable visibility because of its greater flexibility: it allows each document to exhibit multiple topics. In both methodologies, inference is generally carried out via a Bayesian approach. However, both MM and LDA require the user to define the number of topics in advance. Performance metrics are therefore needed after the models are fitted, to help choose the best number of topics. In this work, in addition to contrasting the two text analysis methodologies, we compare the metrics that measure model quality and are used to choose the number of topics. To do this, we apply the models and the selection metrics to two real data sets.
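The contrast the abstract draws between MM and LDA can be made concrete with a small generative sketch. The code below is only an illustration under stated assumptions, not the dissertation's implementation: the function names, the toy sizes (K = 3 topics, a 6-word vocabulary, 20 words per document) and the Dirichlet hyperparameters of 0.5 are all hypothetical. It shows that a mixture-model document draws a single topic for all of its words, whereas an LDA document first draws its own topic proportions and then one topic per word.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_mm_document(topic_weights, topic_word_probs, n_words, rng):
    """Mixture model: one topic is drawn for the whole document."""
    z = sample_categorical(topic_weights, rng)
    words = [sample_categorical(topic_word_probs[z], rng) for _ in range(n_words)]
    return [z] * n_words, words


def generate_lda_document(alpha, topic_word_probs, n_words, rng):
    """LDA: document-level topic proportions, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)
    topics = [sample_categorical(theta, rng) for _ in range(n_words)]
    words = [sample_categorical(topic_word_probs[z], rng) for z in topics]
    return theta, topics, words


# Hypothetical toy setup: K topics over a vocabulary of V word ids.
rng = random.Random(0)
K, V = 3, 6
topic_word_probs = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]

mm_topics, mm_words = generate_mm_document([1 / K] * K, topic_word_probs, 20, rng)
theta, lda_topics, lda_words = generate_lda_document([0.5] * K, topic_word_probs, 20, rng)

print("MM topics used: ", sorted(set(mm_topics)))
print("LDA topics used:", sorted(set(lda_topics)))
print("LDA proportions:", [round(t, 3) for t in theta])
```

In both generators the number of topics K has to be fixed before anything is sampled, which is the requirement that motivates the comparison of selection metrics described in the abstract.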

Item metadata

id	SCAR_4f72c6dfe367de04571485a481d0165f
oai_identifier_str	oai:repositorio.ufscar.br:20.500.14289/20846
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str
spelling	Funding: none received (Não recebi financiamento). Files: Dissertação - Edvaldo Capobiango Coelho Filho.pdf (ORIGINAL, application/pdf, size 512617); Dissertação - Edvaldo Capobiango Coelho Filho.pdf.txt (TEXT, Extracted text, text/plain, size 98547); Dissertação - Edvaldo Capobiango Coelho Filho.pdf.jpg (THUMBNAIL, Generated Thumbnail, image/jpeg, size 6375); license_rdf (CC-LICENSE, application/rdf+xml; charset=utf-8, size 913). Record timestamps: 2025-02-06 03:38:32.995; 2025-02-06T06:38:32. OAI endpoint: https://repositorio.ufscar.br/oai/request. opendoar:4322.
dc.title.por.fl_str_mv	Modelos para análise de textos: um comparativo do número de tópicos
dc.title.alternative.eng.fl_str_mv	Models for text analysis: a comparison of the number of topics
title	Modelos para análise de textos: um comparativo do número de tópicos
spellingShingle	Modelos para análise de textos: um comparativo do número de tópicos; Coelho Filho, Edvaldo Capobiango; Inferência Bayesiana; Latent Dirichlet Allocation; Métricas de desempenho; Modelagem de tópicos; Modelo de mistura; Bayesian approach; Performance metrics; Topic modeling; Mixture model; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::ANALISE DE DADOS; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::INFERENCIA PARAMETRICA
title_short	Modelos para análise de textos: um comparativo do número de tópicos
title_full	Modelos para análise de textos: um comparativo do número de tópicos
title_fullStr	Modelos para análise de textos: um comparativo do número de tópicos
title_full_unstemmed	Modelos para análise de textos: um comparativo do número de tópicos
title_sort	Modelos para análise de textos: um comparativo do número de tópicos
author	Coelho Filho, Edvaldo Capobiango
author_facet	Coelho Filho, Edvaldo Capobiango
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/8395942173646672
dc.contributor.author.fl_str_mv	Coelho Filho, Edvaldo Capobiango
dc.contributor.advisor1.fl_str_mv	Zuanetti, Daiane Aparecida
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8352484284929824
contributor_str_mv	Zuanetti, Daiane Aparecida
dc.subject.por.fl_str_mv	Inferência Bayesiana; Latent Dirichlet Allocation; Métricas de desempenho; Modelagem de tópicos; Modelo de mistura
topic	Inferência Bayesiana; Latent Dirichlet Allocation; Métricas de desempenho; Modelagem de tópicos; Modelo de mistura; Bayesian approach; Performance metrics; Topic modeling; Mixture model; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::ANALISE DE DADOS; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::INFERENCIA PARAMETRICA
dc.subject.eng.fl_str_mv	Bayesian approach; Performance metrics; Topic modeling; Mixture model
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::ANALISE DE DADOS; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA::INFERENCIA PARAMETRICA
publishDate	2024
dc.date.accessioned.fl_str_mv	2024-10-22T11:04:23Z
dc.date.available.fl_str_mv	2024-10-22T11:04:23Z
dc.date.issued.fl_str_mv	2024-08-27
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	COELHO FILHO, Edvaldo Capobiango. Modelos para análise de textos: um comparativo do número de tópicos. 2024. Dissertação (Mestrado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2024. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/20846.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/20.500.14289/20846
identifier_str_mv	COELHO FILHO, Edvaldo Capobiango. Modelos para análise de textos: um comparativo do número de tópicos. 2024. Dissertação (Mestrado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2024. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/20846.
url	https://repositorio.ufscar.br/handle/20.500.14289/20846
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	Attribution 3.0 Brazil http://creativecommons.org/licenses/by/3.0/br/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution 3.0 Brazil http://creativecommons.org/licenses/by/3.0/br/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstreams/a9daaa6b-65eb-4133-8984-19b67879430c/download https://repositorio.ufscar.br/bitstreams/8e599358-7d12-49f7-a8a4-d139a10d7cb9/download https://repositorio.ufscar.br/bitstreams/b10df719-e3cb-4b64-bc6d-ba556b69138d/download https://repositorio.ufscar.br/bitstreams/9eb567c3-7ce2-4001-825d-a84358a6ff97/download
bitstream.checksum.fl_str_mv	711d59edc5f612d21d841245cbd43d8d 1065b61bcd652ab48081b9d6aa6bd2f4 8261c2f2c24f2425f85bc167bb19a63e 3185b4de2190c2d366d1d324db01f8b8
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv	repositorio.sibi@ufscar.br
_version_	1851688912998629376
