Proposta de uma abordagem para sumarização extrativa de textos científicos longos

Cinthia Mikaela de Souza

Bibliographic details
Year of defense:	2022
Main author:	Cinthia Mikaela de Souza
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Dissertation
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Minas Gerais
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Computação – Teses; Sumarização automática de textos – Teses; Aprendizado de máquina multivisão – Teses; Classificação – Teses
Access link:	https://hdl.handle.net/1843/51324
Abstract:	Automatic text summarization helps users identify the most relevant information in a textual document and thus reduces the time spent searching for information. The goal of the technique is to condense a text into a simple, descriptive summary that gives the reader a general idea of the text without having to read all of it. Most of the literature on automatic text summarization focuses on proposing and improving deep learning methods so that they can be applied to long-text summarization. Unfortunately, these models still limit the length of the input sequence, which may cause a loss of information that impairs the quality of the generated summaries. For this reason, this dissertation proposes a new approach to extractive summarization of long texts. We have two hypotheses: (1) subdividing the summarization problem into smaller problems, solving them separately, and later combining the solutions can benefit the summarization of long texts; (2) other attributes of the text can be useful in creating the summary. With this in mind, we model text summarization as a binary classification problem. We tested different algorithms and showed that multi-section summarization outperforms single-section summarization, with a BERTScore gain of approximately 14% and 5% on the Plos One and ArXiv datasets, respectively. We also evaluated the proposed summarizer with different representations of the text and showed that the single-view representation of attributes obtains the best results. This shows that, for extractive text summarization, the attributes selected to compose the attribute view make it easier to identify the importance of sentences. Finally, we compare the proposed method with different state-of-the-art extractive, abstractive, and hybrid summarization models and show that our approach outperforms them.

Item metadata

id	UFMG_70640e93d43fbc2fcec5ffaedad4d6f9
oai_identifier_str	oai:repositorio.ufmg.br:1843/51324
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	http://lattes.cnpq.br/5399985415079833; Renato Vimieiro; http://lattes.cnpq.br/5736183954752317; Magali Rezende Gouvêa Meireles; Rodrygo Luis Teodoro Santos; Adriano Alonso Veloso; Brasil; ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO; Programa de Pós-Graduação em Ciência da Computação; Cinthia Mikaela de Souza_final (1).pdf; application/pdf; 1037255; license.txt; text/plain; 2118
dc.title.none.fl_str_mv	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
title	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
spellingShingle	Proposta de uma abordagem para sumarização extrativa de textos científicos longos; Cinthia Mikaela de Souza; Computação – Teses; Sumarização automática de textos – Teses; Aprendizado de máquina multivisão – Teses; Classificação – Teses; Sumarização extrativa de textos; Aprendizado Multi-visão; Classificação
title_short	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
title_full	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
title_fullStr	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
title_full_unstemmed	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
title_sort	Proposta de uma abordagem para sumarização extrativa de textos científicos longos
author	Cinthia Mikaela de Souza
author_facet	Cinthia Mikaela de Souza
author_role	author
dc.contributor.author.fl_str_mv	Cinthia Mikaela de Souza
dc.subject.por.fl_str_mv	Computação – Teses; Sumarização automática de textos – Teses; Aprendizado de máquina multivisão – Teses; Classificação – Teses
topic	Computação – Teses; Sumarização automática de textos – Teses; Aprendizado de máquina multivisão – Teses; Classificação – Teses; Sumarização extrativa de textos; Aprendizado Multi-visão; Classificação
dc.subject.other.none.fl_str_mv	Sumarização extrativa de textos; Aprendizado Multi-visão; Classificação
description	Automatic text summarization helps users identify the most relevant information in a textual document and thus reduces the time spent searching for information. The goal of the technique is to condense a text into a simple, descriptive summary that gives the reader a general idea of the text without having to read all of it. Most of the literature on automatic text summarization focuses on proposing and improving deep learning methods so that they can be applied to long-text summarization. Unfortunately, these models still limit the length of the input sequence, which may cause a loss of information that impairs the quality of the generated summaries. For this reason, this dissertation proposes a new approach to extractive summarization of long texts. We have two hypotheses: (1) subdividing the summarization problem into smaller problems, solving them separately, and later combining the solutions can benefit the summarization of long texts; (2) other attributes of the text can be useful in creating the summary. With this in mind, we model text summarization as a binary classification problem. We tested different algorithms and showed that multi-section summarization outperforms single-section summarization, with a BERTScore gain of approximately 14% and 5% on the Plos One and ArXiv datasets, respectively. We also evaluated the proposed summarizer with different representations of the text and showed that the single-view representation of attributes obtains the best results. This shows that, for extractive text summarization, the attributes selected to compose the attribute view make it easier to identify the importance of sentences. Finally, we compare the proposed method with different state-of-the-art extractive, abstractive, and hybrid summarization models and show that our approach outperforms them.
publishDate	2022
dc.date.issued.fl_str_mv	2022-12-05
dc.date.accessioned.fl_str_mv	2023-03-29T14:51:16Z 2025-09-08T23:59:05Z
dc.date.available.fl_str_mv	2023-03-29T14:51:16Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1843/51324
url	https://hdl.handle.net/1843/51324
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br//bitstreams/30b03994-a4ff-469f-870d-bc9ffb27e4f0/download https://repositorio.ufmg.br//bitstreams/a49906a0-b124-4aa2-b52e-a0fe0e5b9e34/download
bitstream.checksum.fl_str_mv	c4325fbc553f4584d96e9b947021bf71 cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv	repositorio@ufmg.br
_version_	1862105903955705856
