Towards automatic labeling of exception handling bugs: a case study of 10 years bug-fixing in apache hadoop

Silva, Antônio José Amâncio da

Bibliographic details
Year of defense:	2022
Main author:	Silva, Antônio José Amâncio da
Advisor:	Rocha, Lincoln Souza
Defense committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defending institution:	Not provided by the institution
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
CNPq knowledge area:	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	http://repositorio.ufc.br/handle/riufc/78339
Abstract:	Context: Exception handling (EH) bugs stem from incorrect usage of exception handling mechanisms (EHM) and often incur severe consequences (e.g., system downtime, data loss, and security risk). Tracking EH bugs is particularly relevant for contemporary systems (e.g., cloud- and artificial-intelligence-based systems), in which the software's sophisticated logic is an additional threat to the correct use of the EHM. On top of that, bug reporters can seldom tag EH bugs, since doing so may require an encompassing knowledge of the software's EH strategy. Surprisingly, to the best of our knowledge, there is no automated procedure to identify EH bugs from report descriptions. Objective: First, we aim to evaluate the extent to which Natural Language Processing (NLP) and Machine Learning (ML) can be used to reliably label EH bugs using the text fields of bug reports (e.g., summary, description, and comments). Second, we aim to provide a reliably labeled dataset that the community can use in future endeavors. Overall, we expect our work to raise the community's awareness of the importance of EH bugs. Method: We manually analyzed 4,516 bug reports from the four main components of the Apache Hadoop project, of which we labeled ≈ 20% (943) as EH bugs. Then, we used word embedding techniques (Bag-of-Words and Term Frequency-Inverse Document Frequency (TF-IDF)) to summarize the textual fields of the bug reports. Subsequently, we used these embeddings to fit four classes of ML methods and recorded their performance on unseen data. We also evaluated whether considering only EH keywords is enough to achieve high predictive performance. Results: Our results show that the combination of NLP and ML techniques can label EH bugs reasonably well, achieving Receiver Operating Characteristic Area Under the Curve (ROC-AUC) scores of up to 0.70 and recall ranging from 0.50 to 0.62. As a sanity check, we also evaluated methods using embeddings extracted solely from keywords. While keyword-based embeddings yield similar AUC, we observe a steep decrease in recall (0.53). This suggests that keywords alone are not sufficient to characterize reports of EH bugs, and that there is an avenue for more complex text analyses. Conclusions: To the best of our knowledge, this is the first study addressing the problem of automatic labeling of EH bugs. Based on our results, we conclude that the combination of NLP and ML techniques is promising for automating the task of labeling EH bugs. Overall, we hope (i) that our work will contribute towards raising awareness around EH bugs; and (ii) that our (publicly available) dataset will serve as a benchmarking dataset, paving the way for follow-up work. Additionally, our findings can be used to build tools that help maintainers identify EH bugs during the triage process.
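The pipeline outlined in the abstract (Bag-of-Words and TF-IDF features, a keyword-only variant, and evaluation by recall and ROC-AUC) can be illustrated with a minimal standard-library sketch. Everything below is an assumption made for illustration: the four report summaries, their labels, the keyword set, and the scoring rule are hypothetical and are not taken from the thesis, and the score is a simple sum of keyword weights rather than one of the four classes of ML methods the author fitted.

```python
import math
from collections import Counter

# Hypothetical toy bug-report summaries and labels (1 = EH bug, 0 = other).
reports = [
    "NullPointerException thrown when datanode restarts",
    "Exception swallowed in catch block hides disk failure",
    "Update documentation for balancer command",
    "Improve logging of block report latency",
]
labels = [1, 1, 0, 0]

# Hypothetical EH keyword list used for the keyword-only variant.
keywords = {"exception", "catch", "throw", "thrown"}


def tokenize(text):
    """Lowercase and split on whitespace; deliberately simple."""
    return text.lower().split()


def bag_of_words(docs, vocabulary=None):
    """Return one term-count Counter per document, optionally restricted to a vocabulary."""
    counts = []
    for doc in docs:
        tokens = tokenize(doc)
        if vocabulary is not None:
            tokens = [t for t in tokens if t in vocabulary]
        counts.append(Counter(tokens))
    return counts


def tf_idf(counts):
    """Weight each count by tf * log(N / df), the textbook TF-IDF variant."""
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return weighted


def recall(y_true, y_pred):
    """Fraction of true EH bugs that were predicted as EH bugs."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


full_features = tf_idf(bag_of_words(reports))
keyword_features = tf_idf(bag_of_words(reports, keywords))

# Illustrative scoring rule (not a trained model): sum of keyword TF-IDF weights.
scores = [sum(doc.values()) for doc in keyword_features]
predictions = [1 if s > 0 else 0 for s in scores]

print("recall:", recall(labels, predictions))
print("ROC-AUC:", roc_auc(labels, scores))
```

Note that the whitespace tokenizer keeps `nullpointerexception` as a single token, so the keyword `exception` does not match it. On real reports, misses of this kind are one plausible way a keyword-only representation could lose recall, although the abstract does not state the cause.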

Item metadata

id	UFC-7_61e7ffdf1f39324c96dfc3f535453388
oai_identifier_str	oai:repositorio.ufc.br:riufc/78339
network_acronym_str	UFC-7
network_name_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
dc.title.pt_BR.fl_str_mv	Towards automatic labeling of exception handling bugs: a case study of 10 years bug-fixing in apache hadoop
title	Towards automatic labeling of exception handling bugs: a case study of 10 years bug-fixing in apache hadoop
author	Silva, Antônio José Amâncio da
author_facet	Silva, Antônio José Amâncio da
author_role	author
dc.contributor.author.fl_str_mv	Silva, Antônio José Amâncio da
dc.contributor.advisor1.fl_str_mv	Rocha, Lincoln Souza
contributor_str_mv	Rocha, Lincoln Souza
dc.subject.cnpq.fl_str_mv	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
topic	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO Bug de tratamento de exceção Rotulagem automática de bugs Aprendizado de máquina Processamento de linguagem natural Exception handling bug Automatic bug labeling Machine learning Natural language processing
dc.subject.ptbr.pt_BR.fl_str_mv	Bug de tratamento de exceção Rotulagem automática de bugs Aprendizado de máquina Processamento de linguagem natural
dc.subject.en.pt_BR.fl_str_mv	Exception handling bug Automatic bug labeling Machine learning Natural language processing
publishDate	2022
dc.date.issued.fl_str_mv	2022
dc.date.accessioned.fl_str_mv	2024-10-01T13:45:24Z
dc.date.available.fl_str_mv	2024-10-01T13:45:24Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	SILVA, Antônio José Amâncio da. Towards automatic labeling of exception handling bugs: a case study of 10 years bug-fixing in apache hadoop. 2024. 53 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2022.
dc.identifier.uri.fl_str_mv	http://repositorio.ufc.br/handle/riufc/78339
identifier_str_mv	SILVA, Antônio José Amâncio da. Towards automatic labeling of exception handling bugs: a case study of 10 years bug-fixing in apache hadoop. 2024. 53 f. Dissertação (Mestrado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2022.
url	http://repositorio.ufc.br/handle/riufc/78339
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC
instname_str	Universidade Federal do Ceará (UFC)
instacron_str	UFC
institution	UFC
reponame_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
collection	Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv	http://repositorio.ufc.br/bitstream/riufc/78339/3/2022_dis_ajasilva.pdf http://repositorio.ufc.br/bitstream/riufc/78339/4/license.txt
bitstream.checksum.fl_str_mv	300a723edf4734bec688c0864a274383 8a4605be74aa9ea9d79846c1fba20a33
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv	bu@ufc.br \|\| repositorio@ufc.br
_version_	1847793119846727680
