Analyzing productivity in open source software development: an explainable AI approach using SHAP and causal discovery
| Year of defense: | 2025 |
|---|---|
| Main author: | Gut, Christian Manfred Toni |
| Advisor: | |
| Defense committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Biblioteca Digital de Teses e Dissertações da USP |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/ |
Abstract: | As software becomes increasingly important, there is a growing focus on developing it effectively and efficiently. Open Source software is of particular interest to researchers because its development process is transparent. This thesis leverages that transparency by applying explainable AI methods to data mined from GitHub in order to understand which factors influence productivity in Open Source software development. To address the complexity of software development, a multidimensional approach based on the SPACE framework was adopted, using five productivity metrics: Developer Churn, Stars Added, LOC per Developer, Merge Times, and Issue Solution Times. Drawing on the wealth of data in the Git commit history and the GitHub API, a variety of input variables were generated to capture team structure, source code structure, and software engineering practices. These variables were used to train machine learning models to predict each of the five productivity metrics. SHapley Additive exPlanations (SHAP) were then used to assess the importance of the input variables, providing insight into which factors may be most critical for each productivity metric. For additional understanding, the most important variables were also used as input to the FCI algorithm, a computational method that helps uncover causal relationships. The analysis of the SHAP values and of the causal graphs produced by the FCI algorithm led to the conclusion that the traditional understanding of productivity measurement in software development may not fully apply to Open Source, because sporadic contributions may lead to misinterpretations of productivity metrics. The thesis also identified File Age as a relevant factor: it exerts a positive influence on multiple productivity metrics when new features are being developed, as opposed to when existing features are maintained. Lastly, Closeness Centrality, a network metric derived from team interactions and collaborations, was found to be significantly positively correlated with, or even to influence, various productivity measures. |
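The abstract names Closeness Centrality as a network metric derived from team interactions. As a rough illustration only, the sketch below computes the textbook form of that metric (reachable nodes divided by the sum of shortest-path distances) on a small, hypothetical collaboration graph. The graph, the developer names, and the function name are assumptions made for this example; the record does not state how the thesis built its network or which variant of the metric it used, and the sketch assumes a connected, unweighted graph.

```python
from collections import deque


def closeness_centrality(graph, node):
    """Closeness of `node`: reachable nodes / sum of shortest-path distances.

    `graph` maps each node to a list of its neighbors (undirected, unweighted).
    """
    # Breadth-first search yields shortest-path distances in an unweighted graph.
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)

    total_distance = sum(distances.values())
    reachable = len(distances) - 1  # exclude the node itself
    return reachable / total_distance if total_distance > 0 else 0.0


# Hypothetical collaboration graph: an edge means two developers interacted,
# for example by reviewing each other's pull requests.
collaboration = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["alice"],
    "carol": ["alice"],
    "dave": ["alice", "erin"],
    "erin": ["dave"],
}

scores = {dev: closeness_centrality(collaboration, dev) for dev in collaboration}
for dev, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{dev}: {score:.3f}")
```

In this toy graph the developer connected to most others scores highest, which is the intuition behind reading closeness as a measure of how central someone is to team communication. For disconnected graphs, libraries such as NetworkX apply an additional scaling that this sketch omits.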
| id |
USP_6c8f5d63236e1833b5758a6425e6a992 |
|---|---|
| oai_identifier_str |
oai:teses.usp.br:tde-01092025-084702 |
| network_acronym_str |
USP |
| network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Analyzing productivity in open source software development: an explainable AI approach using SHAP and causal discovery Analisando a produtividade em desenvolvimento de software de open source: uma abordagem de IA explicável usando SHAP e descoberta causal |
| title |
Analyzing productivity in open source software development: an explainable AI approach using SHAP and causal discovery |
| author |
Gut, Christian Manfred Toni |
| author_facet |
Gut, Christian Manfred Toni |
| author_role |
author |
| dc.contributor.none.fl_str_mv |
Lejbman, Alfredo Goldman Vel |
| dc.contributor.author.fl_str_mv |
Gut, Christian Manfred Toni |
| dc.subject.por.fl_str_mv |
Causal discovery; Causal inference; Empirical software engineering; Explainable AI; FCI algorithm; GitHub; Mining software repositories; Open source software; Productivity metrics; SHAP; SPACE framework |
| publishDate |
2025 |
| dc.date.none.fl_str_mv |
2025-08-26 |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
| format |
doctoralThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/ |
| url |
https://www.teses.usp.br/teses/disponiveis/45/45134/tde-01092025-084702/ |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
|
| dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
Release the content for public access. |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.coverage.none.fl_str_mv |
|
| dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
| instname_str |
Universidade de São Paulo (USP) |
| instacron_str |
USP |
| institution |
USP |
| reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
| collection |
Biblioteca Digital de Teses e Dissertações da USP |
| repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
| repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
| _version_ |
1848370469552521216 |