Isolating variable effects in supervised machine learning illustrated in educational data mining

Bibliographic details
Year of defense: 2024
Main author: SILVA FILHO, Rogério Luiz Cardoso
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Universidade Federal de Pernambuco
UFPE
Brazil
Programa de Pos Graduacao em Ciencia da Computacao
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufpe.br/handle/123456789/57408
Abstract: This thesis investigates the application of Explainable Artificial Intelligence (XAI) to Supervised Machine Learning (SML) models. The motivation for this study stems from the development of Educational Data Mining (EDM), an area that frequently uses such models to analyze and extract insights from large datasets. A central issue of this work is the challenge of generating global explanations for SML models, particularly when independence among the data is not guaranteed. This is a recurring but still underexplored problem in EDM. Neglecting data interdependencies can lead to biased explanations, overestimating irrelevant variables or assigning disproportionate importance to predictors of similar relevance. To address these challenges, this work builds on Accumulated Local Effects (ALE), a recent method for post-hoc global explanation that visualizes the impact of features. ALE’s pseudo-orthogonality property allows for isolating individual variable effects, distinguishing it from methods widely used in EDM such as partial dependence plots and Shapley-based explanations. In a preliminary stage, ALE techniques are compared to other existing ones using a new methodology that evaluates how well these techniques approximate the true variable effects in various contexts of data dependency. Furthermore, based on the promising ALE results of this stage, this work proposes new ALE-based scores to measure the impact of variables in SML. The scores are model-agnostic and can report both the magnitude and direction of the individual impact of features. The scores prove to be efficient in various scenarios when compared to existing metrics on synthetic and real-world datasets. Moreover, an empirical study using data from Brazilian secondary schools not only confirms the usefulness of the new scores in a real-world scenario but also extends the contributions of this thesis by identifying and offering new perspectives on the determinants of Brazilian school success over more than a decade.
id UFPE_7d030b74aa35940ce47f172f9b56accd
oai_identifier_str oai:repositorio.ufpe.br:123456789/57408
network_acronym_str UFPE
network_name_str Repositório Institucional da UFPE
repository_id_str
dc.title.none.fl_str_mv Isolating variable effects in supervised machine learning illustrated in educational data mining
title Isolating variable effects in supervised machine learning illustrated in educational data mining
author SILVA FILHO, Rogério Luiz Cardoso
author_facet SILVA FILHO, Rogério Luiz Cardoso
author_role author
dc.contributor.none.fl_str_mv ADEODATO, Paulo Jorge Leitão
BRITO, Kellyton dos Santos
http://lattes.cnpq.br/9212443460705379
http://lattes.cnpq.br/3524590211304012
http://lattes.cnpq.br/8750956715158540
dc.contributor.author.fl_str_mv SILVA FILHO, Rogério Luiz Cardoso
dc.subject.por.fl_str_mv IA explicável
Aprendizagem de máquina interpretável
Explicadores globais
Mineração de dados educacionais
Importância de variáveis
topic Explainable AI
Interpretable machine learning
Global explainers
Educational data mining
Variable importance
publishDate 2024
dc.date.none.fl_str_mv 2024-08-16T13:36:32Z
2024-08-16T13:36:32Z
2024-04-18
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv SILVA FILHO, Rogério Luiz Cardoso. Isolating variable effects in supervised machine learning illustrated in educational data mining. 2024. Tese (Doutorado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2024.
https://repositorio.ufpe.br/handle/123456789/57408
identifier_str_mv SILVA FILHO, Rogério Luiz Cardoso. Isolating variable effects in supervised machine learning illustrated in educational data mining. 2024. Tese (Doutorado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2024.
url https://repositorio.ufpe.br/handle/123456789/57408
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Pernambuco
UFPE
Brasil
Programa de Pos Graduacao em Ciencia da Computacao
publisher.none.fl_str_mv Universidade Federal de Pernambuco
UFPE
Brasil
Programa de Pos Graduacao em Ciencia da Computacao
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFPE
instname:Universidade Federal de Pernambuco (UFPE)
instacron:UFPE
instname_str Universidade Federal de Pernambuco (UFPE)
instacron_str UFPE
institution UFPE
reponame_str Repositório Institucional da UFPE
collection Repositório Institucional da UFPE
repository.name.fl_str_mv Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv attena@ufpe.br
_version_ 1856041943965892608