Analyzing productivity in open source software development: an explainable AI approach using SHAP and causal discovery

Bibliographic details
Defense year: 2025
Main author: Gut, Christian Manfred Toni
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/
Abstract: As software becomes increasingly important, there is a growing focus on developing it effectively and efficiently. Open Source software is of particular interest to researchers because its development process is transparent. This thesis leverages that transparency by applying explainable AI methods to data mined from GitHub in order to understand which factors influence productivity in Open Source software development. To address the complexity of software development, a multidimensional approach based on the SPACE framework was adopted, using five productivity metrics: Developer Churn, Stars Added, LOC per Developer, Merge Times, and Issue Solution Times. Drawing on the Git commit history and the GitHub API, a variety of input variables were generated to describe team structure, source code structure, and software engineering practices. These variables were used to train machine learning models that predict each of the five productivity metrics. SHapley Additive exPlanations (SHAP) were then used to assess the importance of the input variables, indicating which factors may be most critical for each productivity metric. For additional insight, the most important variables were also used as input to the FCI algorithm, a computational method that helps uncover causal relationships. The analysis of the SHAP values and of the causal graphs produced by the FCI algorithm concluded that the traditional understanding of how to measure productivity in software development may not fully apply to Open Source, because sporadic contributions can lead to misinterpretation of productivity metrics. The thesis also identified File Age as a relevant factor, with a positive influence on multiple productivity metrics when new features are being developed, as compared with the maintenance of existing features. Lastly, Closeness Centrality, a network metric derived from team interactions and collaborations, was found to have a significant positive correlation with, or even an influence on, various productivity measures.
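Illustrative note: the abstract describes using SHAP to attribute a model's prediction to its input variables. The following is a minimal sketch of the underlying idea, computing exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. The model `toy_model`, its coefficients, the feature names, and the input values are hypothetical and chosen only for illustration; they are not the thesis's models, variables, or data.

```python
from itertools import permutations
from math import factorial

# Hypothetical feature names, loosely inspired by the abstract (assumption).
FEATURES = ("file_age", "closeness_centrality", "team_size")


def toy_model(x):
    """Hypothetical predictor with one interaction term; not from the thesis."""
    return (2.0 * x["file_age"]
            + 3.0 * x["closeness_centrality"]
            + x["file_age"] * x["team_size"])


def value(coalition, instance, baseline):
    """Model output when only the features in `coalition` take the instance's
    values and all other features stay at their baseline values."""
    mixed = {f: (instance[f] if f in coalition else baseline[f]) for f in FEATURES}
    return toy_model(mixed)


def shapley_values(instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in FEATURES}
    for order in permutations(FEATURES):
        seen = set()
        for f in order:
            before = value(seen, instance, baseline)
            seen.add(f)
            totals[f] += value(seen, instance, baseline) - before
    n_orders = factorial(len(FEATURES))
    return {f: total / n_orders for f, total in totals.items()}


# Hypothetical instance and all-zero baseline (assumptions for illustration).
instance = {"file_age": 1.0, "closeness_centrality": 0.5, "team_size": 4.0}
baseline = {"file_age": 0.0, "closeness_centrality": 0.0, "team_size": 0.0}
phi = shapley_values(instance, baseline)
print(phi)
```

In this sketch the attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. Exhaustive enumeration is only feasible for a handful of features; for real models, the `shap` Python package estimates such values efficiently.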
id USP_6c8f5d63236e1833b5758a6425e6a992
oai_identifier_str oai:teses.usp.br:tde-01092025-084702
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv Analyzing productivity in open source software development: an explainable AI approach using SHAP and causal discovery
Analisando a produtividade em desenvolvimento de software de open source: uma abordagem de IA explicável usando SHAP e descoberta causal
author Gut, Christian Manfred Toni
author_facet Gut, Christian Manfred Toni
author_role author
dc.contributor.none.fl_str_mv Lejbman, Alfredo Goldman Vel
dc.contributor.author.fl_str_mv Gut, Christian Manfred Toni
dc.subject.por.fl_str_mv Algoritmo FCI
Causal discovery
Causal inference
Descoberta causal
Empirical software engineering
Engenharia de software empírica
Explainable AI
FCI algorithm
Framework SPACE
GitHub
Inferência causal
Inteligência artificial explicável
Métricas de produtividade
Mineração de repositórios de software
Mining software repositories
Open source software
Productivity metrics
SHAP
Software de código aberto
SPACE framework
publishDate 2025
dc.date.none.fl_str_mv 2025-08-26
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/
url https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1848370469552521216