Machine Learning and Readability in Accounting: An Ensemble Learning Approach

Bibliographic details
Year of defense: 2025
Main author: COSTA NETO, Arlindo Menezes da
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Universidade Federal de Pernambuco
UFPE
Brasil
Programa de Pos Graduacao em Ciencias Contabeis
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufpe.br/handle/123456789/67273
Abstract: We extend the literature on the value relevance of accounting information by proposing a new metric for the informational value of financial text. We use a language model trained on Brazilian Portuguese (FinBERT-PT-BR) to build an Informativeness Index and assign scores to 26,804 quarterly financial statement notes from 1,152 Brazilian companies over 12 years. To check the model's ability to interpret textual data, we compute the usual readability metrics (Flesch-Kincaid reading ease, Fog index, SMOG index, Loughran-McDonald Index) for every note and use machine learning models to evaluate which readability metric best represents an informativeness index built on the dimensions of Boilerplateness, Completeness and Density. We expect the proposed metric to be only weakly related to the readability metrics. The comparison is based on feature importance, which indicates that the Loughran-McDonald Index is the best proxy for the readability of Portuguese financial text. It is the only readability metric with any relevance in the regressors, and because it is based on file size, we take this as evidence that our metric captures the informational value of text better than common readability metrics, while the Loughran-McDonald Index remains a reasonable proxy for the informational value of financial text. The research contributes a new method for quantifying the informational value of financial information, adding to the value-relevance literature and to the literature on machine learning in accounting research, and it does so in a little-explored setting (Portuguese-language financial information) with a reasonably large dataset. Further research may combine the proposed model with market-related metrics or human experiments to strengthen the validity of the metric.
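The readability baselines named in the abstract are simple surface statistics, which a short sketch can make concrete. The Python below computes a simplified Gunning Fog index and a file-size style proxy in the spirit of the Loughran-McDonald measure. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the vowel-group syllable heuristic, the naive sentence splitter and the use of the natural log of the UTF-8 byte count are all hypothetical choices, and the record does not say how the author adapted these English-calibrated formulas to Portuguese.

```python
import math
import re

# Assumption: a syllable is approximated by a run of vowels (rough for Portuguese).
VOWEL_GROUPS = re.compile(r"[aeiouyáéíóúâêôãõàü]+", re.IGNORECASE)
WORDS = re.compile(r"[^\W\d_]+")          # runs of letters, accents included
SENTENCE_END = re.compile(r"[.!?]+")      # naive: also splits on decimal points


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(VOWEL_GROUPS.findall(word)))


def fog_index(text: str) -> float:
    """Simplified Gunning Fog: 0.4 * (words/sentences + 100 * complex/words).

    A word counts as complex when the heuristic finds three or more syllables.
    """
    words = WORDS.findall(text)
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))


def log_file_size(text: str) -> float:
    """File-size style proxy: natural log of the UTF-8 size in bytes (assumed form)."""
    return math.log(len(text.encode("utf-8")))


if __name__ == "__main__":
    note = "A empresa apresentou lucro. O resultado foi bom."
    print(round(fog_index(note), 2), round(log_file_size(note), 2))
```

Neither function looks at what the text says, only at its length and word shape, which is the contrast the abstract draws with a language-model-based index.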
id UFPE_7bd2937bcb637c834d4b9e8e0f85d15d
oai_identifier_str oai:repositorio.ufpe.br:123456789/67273
network_acronym_str UFPE
network_name_str Repositório Institucional da UFPE
repository_id_str
Abstract (translated from the Portuguese original): This study employs FinBERT-PT-BR, a transformer-based language model trained on financial texts in Brazilian Portuguese, to develop an Informativeness Index designed to quantify the informational value of financial disclosures. The dataset comprises 26,804 annual explanatory notes from 1,152 Brazilian publicly traded companies, covering a 12-year period (2011–2023). In addition to the index, the traditional readability measures (Flesch-Kincaid Reading Ease, Fog Index, SMOG Index and Loughran-McDonald Index) are calculated for each note. Machine learning models (Random Forest and Gradient Boosting) are then applied to assess which of these readability metrics best represents the informativeness index derived from three fundamental dimensions: Boilerplateness (Padronização), Completeness and Density. Feature importance analyses across the models indicate that the Loughran-McDonald Index most closely tracks the variation in the informativeness index, suggesting that it is the most effective proxy for measuring the readability of financial texts in Portuguese. This empirically grounded result implies changes to the theoretical relationship between textual complexity and informational obfuscation from the perspective of agency theory. The research contributes to the literature by integrating language models and machine learning techniques into the study of the quality of financial disclosures in Portuguese, a linguistic and regulatory context still little explored, using an extensive database. Future research may extend this approach by incorporating multilingual models, human evaluations or hybrid embeddings, in order to refine and validate the concept of informativeness.
OAI request URL: https://repositorio.ufpe.br/oai/request (opendoar:2221)
dc.title.none.fl_str_mv Machine Learning and Readability in Accounting: An Ensemble Learning Approach
author COSTA NETO, Arlindo Menezes da
author_facet COSTA NETO, Arlindo Menezes da
author_role author
dc.contributor.none.fl_str_mv ANJOS, Luiz Carlos Marques dos
http://lattes.cnpq.br/2667949398304488
http://lattes.cnpq.br/2136400491380618
dc.contributor.author.fl_str_mv COSTA NETO, Arlindo Menezes da
dc.subject.por.fl_str_mv Informativeness
Machine Learning
Accounting information
publishDate 2025
dc.date.none.fl_str_mv 2025-12-18T16:37:37Z
2025-12-18T16:37:37Z
2025-11-26
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv COSTA NETO, Arlindo Menezes da. Machine Learning and Readability in Accounting: An Ensemble Learning Approach. 2025. Tese (Doutorado em Ciências Contábeis) - Universidade Federal de Pernambuco, Recife, 2025.
https://repositorio.ufpe.br/handle/123456789/67273
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv https://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Pernambuco
UFPE
Brasil
Programa de Pos Graduacao em Ciencias Contabeis
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFPE
instname:Universidade Federal de Pernambuco (UFPE)
instacron:UFPE
instname_str Universidade Federal de Pernambuco (UFPE)
instacron_str UFPE
institution UFPE
reponame_str Repositório Institucional da UFPE
collection Repositório Institucional da UFPE
repository.name.fl_str_mv Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv attena@ufpe.br
_version_ 1856042032783425536