Explorando a avaliação de sumários automáticos multidocumento multilíngues (Exploring the evaluation of multilingual multi-document automatic summaries)

Nascimento, Darlan Xavier

Bibliographic details
Year of defense:	2020
Main author:	Nascimento, Darlan Xavier
Advisor:	Di Felippo, Ariani
Defense committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	por
Degree-granting institution:	Universidade Federal de São Carlos, Câmpus São Carlos
Graduate program:	Programa de Pós-Graduação em Linguística - PPGL
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Sumarização automática; Linguística computacional; Avaliação de sumários
Keywords in English:	Automatic summarization; Computational linguistics; Summary evaluation
CNPq knowledge area:	LINGUISTICA, LETRAS E ARTES::LINGUISTICA
Access link:	https://repositorio.ufscar.br/handle/20.500.14289/12642
Abstract:	Multilingual Multi-Document Automatic Summarization (MMDS) is a computational task in which a summary is produced in a target language from a collection of at least two news stories on the same subject, one in the user’s language and the other(s) in foreign language(s). The literature shows that few studies address methods that generate summaries in Portuguese. Building on the CF and CFUL summarization methods, this thesis describes a study that aimed to refine the evaluation of summary quality by varying (i) the native language of the authors of the reference summaries, that is, summaries written by humans after reading the corresponding source texts and required for the automatic calculation of informativeness, and (ii) the compression rate (desired summary size). The thesis also describes the expansion of the corpus used to investigate these methods, which originally contained texts in Portuguese and English, through the addition of texts in German, and the production of four extracts for each of the twenty clusters. The results show that the reference summaries are slightly affected by their writer’s native language, although other factors may also play a role, such as the length of each source text and content compatibility. Regarding the summarization methods, extracts with a lower compression rate performed better in the automatic evaluation of informativeness but worse in the assessment of linguistic quality.

Item metadata

id	SCAR_fbbf8ed3ed34401fbc832b559bdc1e62
oai_identifier_str	oai:repositorio.ufscar.br:20.500.14289/12642
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str
spelling	Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), funding code 001. Files: Dissertação-Darlan Xavier Nascimento.pdf (application/pdf); Carta do orientador assinada.pdf (application/pdf).
dc.title.por.fl_str_mv	Explorando a avaliação de sumários automáticos multidocumento multilíngues
dc.title.alternative.eng.fl_str_mv	Exploring the evaluation of multilingual multi-document automatic summaries
author	Nascimento, Darlan Xavier
author_facet	Nascimento, Darlan Xavier
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/2532304328121248
dc.contributor.author.fl_str_mv	Nascimento, Darlan Xavier
dc.contributor.advisor1.fl_str_mv	Di Felippo, Ariani
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8648412103197455
dc.contributor.authorID.fl_str_mv	57c826a7-839d-474b-aa3a-59080fba2fba
contributor_str_mv	Di Felippo, Ariani
dc.subject.por.fl_str_mv	Sumarização automática; Linguística computacional; Avaliação de sumários
topic	Sumarização automática; Linguística computacional; Avaliação de sumários; Automatic summarization; Computational linguistics; Summary evaluation; LINGUISTICA, LETRAS E ARTES::LINGUISTICA
dc.subject.eng.fl_str_mv	Automatic summarization; Computational linguistics; Summary evaluation
dc.subject.cnpq.fl_str_mv	LINGUISTICA, LETRAS E ARTES::LINGUISTICA
publishDate	2020
dc.date.accessioned.fl_str_mv	2020-04-28T11:24:48Z
dc.date.available.fl_str_mv	2020-04-28T11:24:48Z
dc.date.issued.fl_str_mv	2020-03-12
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	NASCIMENTO, Darlan Xavier. Explorando a avaliação de sumários automáticos multidocumento multilíngues. 2020. Master's thesis (Linguistics) – Universidade Federal de São Carlos, São Carlos, 2020. Available at: https://repositorio.ufscar.br/handle/20.500.14289/12642.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/20.500.14289/12642
url	https://repositorio.ufscar.br/handle/20.500.14289/12642
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	600 600
dc.relation.authority.fl_str_mv	26c5db60-6612-41e6-a8f9-f94fb475ca58
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil; http://creativecommons.org/licenses/by-nc-nd/3.0/br/; info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil; http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos; Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Linguística - PPGL
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos; Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR; instname:Universidade Federal de São Carlos (UFSCAR); instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstreams/8990a847-bb13-40a9-a20c-a3a06c53a06b/download https://repositorio.ufscar.br/bitstreams/e8eb4da3-3327-49c8-afe0-44b1b29384e1/download https://repositorio.ufscar.br/bitstreams/464e1ebf-89a3-4e9a-9e6d-6b84c250547b/download https://repositorio.ufscar.br/bitstreams/0574fee2-51b4-4c38-aa1d-45946f914e89/download https://repositorio.ufscar.br/bitstreams/d7bdc001-f432-44fa-b3f2-72b9d8e523e4/download https://repositorio.ufscar.br/bitstreams/0ef7f8cc-00ae-4097-b97e-47c8275a71d3/download https://repositorio.ufscar.br/bitstreams/26c331c4-176e-4ec4-87f8-2f381b57bb23/download
bitstream.checksum.fl_str_mv	96022114e5ffdbc5ebd5f97106d8702b a3e8aaa160ba37fac88dc9bab96c1426 e39d27027a6cc9cb039ad269a5db8e34 007e8bed03a3c27597b5fa603a95e859 d784fa8b6d98d27699781bd9a7cf19f0 4f66037a2ab4a9e326a0382039ef781e 0e4062dd9db4c8731812f415d6319eb4
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv	repositorio.sibi@ufscar.br
_version_	1851688859225554944
