Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks
| Year of defense: | 2025 |
|---|---|
| Lead author: | Pedroso, Diego Frazatto |
| Advisor: | |
| Defense committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Biblioteca Digitais de Teses e Dissertações da USP |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-27082025-141523/ |
| Abstract: | Cloud computing technologies offer significant advantages in scalability and performance, enabling rapid deployment of applications. The adoption of microservices-oriented architectures has introduced an ecosystem characterized by an increased number of applications, frameworks, abstraction layers, orchestrators, and hypervisors, all operating within distributed systems. This complexity generates vast quantities of logs from diverse sources, making the analysis of these events inherently challenging, particularly in the absence of automation. To address this issue, Machine Learning techniques leveraging Large Language Models (LLMs) offer a promising approach for dynamically identifying patterns within these events. In this study, we propose a novel anomaly detection framework utilizing a microservices architecture deployed on Kubernetes and Istio, enhanced by an LLM. The model was trained on various error scenarios, with Chaos Mesh employed as an error injection tool to simulate faults of different kinds, and Locust used as a load generator to create workload stress conditions. After the LLM detects an anomaly, we employ a dynamic Bayesian network to provide probabilistic inferences about the incident, proving the relationships between components and assessing the degree of impact among them. Additionally, a chatbot powered by the same LLM allows users to interact with the AI, ask questions about the detected incident, and gain deeper insights. The experimental results demonstrated the model's effectiveness: it reliably identified all error events across various test scenarios. While it did not miss any anomalies, it did produce some false positives, which remained within acceptable limits. |
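
The probabilistic inference step described in the abstract can be illustrated with a deliberately small example. The sketch below is not the thesis implementation: it assumes a static three-node chain (`db_fault` → `api_errors` → `frontend_errors`) rather than a dynamic Bayesian network, and every service name and probability in it is hypothetical. It computes, by exact enumeration, how likely a database fault is once frontend errors are observed.

```python
from itertools import product

# Hypothetical prior and conditional probabilities (illustrative values only,
# not taken from the thesis).
P_DB_FAULT = 0.05
P_API_ERR_GIVEN_DB = {True: 0.90, False: 0.10}  # P(api_errors | db_fault)
P_FE_ERR_GIVEN_API = {True: 0.80, False: 0.05}  # P(frontend_errors | api_errors)


def joint(db_fault: bool, api_err: bool, fe_err: bool) -> float:
    """Joint probability of one full assignment along the chain db -> api -> frontend."""
    p_db = P_DB_FAULT if db_fault else 1 - P_DB_FAULT
    p_api_true = P_API_ERR_GIVEN_DB[db_fault]
    p_api = p_api_true if api_err else 1 - p_api_true
    p_fe_true = P_FE_ERR_GIVEN_API[api_err]
    p_fe = p_fe_true if fe_err else 1 - p_fe_true
    return p_db * p_api * p_fe


def posterior_db_fault(fe_err: bool) -> float:
    """P(db_fault = True | frontend_errors = fe_err), summing out api_errors."""
    numerator = sum(joint(True, api, fe_err) for api in (True, False))
    evidence = sum(
        joint(db, api, fe_err) for db, api in product((True, False), repeat=2)
    )
    return numerator / evidence


if __name__ == "__main__":
    print(f"prior     P(db_fault)                   = {P_DB_FAULT:.3f}")
    print(f"posterior P(db_fault | frontend_errors) = {posterior_db_fault(True):.3f}")
```

With these illustrative numbers, observing frontend errors raises the probability of a database fault from the 0.05 prior to roughly 0.23, which is the kind of ranking signal a root cause analysis step can use.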
| id | USP_04c3bbbf206f44b1872332a16f5314e3 |
|---|---|
| oai_identifier_str | oai:teses.usp.br:tde-27082025-141523 |
| network_acronym_str | USP |
| network_name_str | Biblioteca Digital de Teses e Dissertações da USP |
| repository_id_str | |
| spelling | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks; Detecção de anomalias e análise de causa raiz em ambientes nativos da nuvem usando LLM's e redes bayesianas; Automated root cause analysis; Bayesian networks; Cloud computing; Computação em nuvem; Detecção de anomalias; LLM; LLM; Redes bayesianas; Cloud computing technologies offer significant advantages in scalability and performance, enabling rapid deployment of applications. However, the growing complexity of cloud-native systems introduces reliability risks. Handling these risks is an essential responsibility of IT service providers, as they play a critical role in maintaining system stability and ensuring reliable service delivery. This complexity results in the generation of large quantities of logs from diverse sources, making the analysis of these events an inherently challenging task, particularly in the absence of automation. To address this problem, Machine Learning techniques that use Large Language Models (LLMs) offer a promising approach for dynamically identifying patterns within these events. In this study, we propose a new anomaly detection framework using a microservices architecture deployed on Kubernetes and Istio, enhanced by an LLM. The model was trained on various error scenarios, with Chaos Mesh employed as an error injection tool to simulate faults of different kinds, and Locust used as a load generator to create workload stress conditions. After the LLM detects an anomaly, we employ a dynamic Bayesian network to provide probabilistic inferences about the incident, proving the relationships between components and assessing the degree of impact among them. In addition, a chatbot powered by the same LLM allows users to interact with the AI, ask questions about the detected incident, and gain deeper insights. The experimental results demonstrated the model's effectiveness, reliably identifying all error events across various test scenarios. Although it did not miss any anomalies, it produced some false positives, which remain within acceptable limits.; Biblioteca Digitais de Teses e Dissertações da USP; Bruschi, Sarita Mazzini; Pedroso, Diego Frazatto; 2025-04-30; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/doctoralThesis; application/pdf; https://www.teses.usp.br/teses/disponiveis/55/55134/tde-27082025-141523/; reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP; Release the content for public access.; info:eu-repo/semantics/openAccess; eng; 2025-08-27T17:21:02Z; oai:teses.usp.br:tde-27082025-141523; Biblioteca Digital de Teses e Dissertações; http://www.teses.usp.br/; PUB; http://www.teses.usp.br/cgi-bin/mtd2br.pl; virginia@if.usp.br; atendimento@aguia.usp.br; virginia@if.usp.br; opendoar:2721; 2025-08-27T17:21:02; Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP); false |
| dc.title.none.fl_str_mv | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks; Detecção de anomalias e análise de causa raiz em ambientes nativos da nuvem usando LLM's e redes bayesianas |
| title | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| spellingShingle | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks; Pedroso, Diego Frazatto; Automated root cause analysis; Bayesian networks; Cloud computing; Computação em nuvem; Detecção de anomalias; LLM; LLM; Redes bayesianas |
| title_short | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| title_full | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| title_fullStr | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| title_full_unstemmed | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| title_sort | Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks |
| author | Pedroso, Diego Frazatto |
| author_facet | Pedroso, Diego Frazatto |
| author_role | author |
| dc.contributor.none.fl_str_mv | Bruschi, Sarita Mazzini |
| dc.contributor.author.fl_str_mv | Pedroso, Diego Frazatto |
| dc.subject.por.fl_str_mv | Automated root cause analysis; Bayesian networks; Cloud computing; Computação em nuvem; Detecção de anomalias; LLM; LLM; Redes bayesianas |
| topic | Automated root cause analysis; Bayesian networks; Cloud computing; Computação em nuvem; Detecção de anomalias; LLM; LLM; Redes bayesianas |
| description | Cloud computing technologies offer significant advantages in scalability and performance, enabling rapid deployment of applications. The adoption of microservices-oriented architectures has introduced an ecosystem characterized by an increased number of applications, frameworks, abstraction layers, orchestrators, and hypervisors, all operating within distributed systems. This complexity generates vast quantities of logs from diverse sources, making the analysis of these events inherently challenging, particularly in the absence of automation. To address this issue, Machine Learning techniques leveraging Large Language Models (LLMs) offer a promising approach for dynamically identifying patterns within these events. In this study, we propose a novel anomaly detection framework utilizing a microservices architecture deployed on Kubernetes and Istio, enhanced by an LLM. The model was trained on various error scenarios, with Chaos Mesh employed as an error injection tool to simulate faults of different kinds, and Locust used as a load generator to create workload stress conditions. After the LLM detects an anomaly, we employ a dynamic Bayesian network to provide probabilistic inferences about the incident, proving the relationships between components and assessing the degree of impact among them. Additionally, a chatbot powered by the same LLM allows users to interact with the AI, ask questions about the detected incident, and gain deeper insights. The experimental results demonstrated the model's effectiveness: it reliably identified all error events across various test scenarios. While it did not miss any anomalies, it did produce some false positives, which remained within acceptable limits. |
| publishDate | 2025 |
| dc.date.none.fl_str_mv | 2025-04-30 |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/doctoralThesis |
| format | doctoralThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-27082025-141523/ |
| url | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-27082025-141523/ |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | |
| dc.rights.driver.fl_str_mv | Release the content for public access.; info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv | Release the content for public access. |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.coverage.none.fl_str_mv | |
| dc.publisher.none.fl_str_mv | Biblioteca Digitais de Teses e Dissertações da USP |
| publisher.none.fl_str_mv | Biblioteca Digitais de Teses e Dissertações da USP |
| dc.source.none.fl_str_mv | reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP |
| instname_str | Universidade de São Paulo (USP) |
| instacron_str | USP |
| institution | USP |
| reponame_str | Biblioteca Digital de Teses e Dissertações da USP |
| collection | Biblioteca Digital de Teses e Dissertações da USP |
| repository.name.fl_str_mv | Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
| repository.mail.fl_str_mv | virginia@if.usp.br; atendimento@aguia.usp.br; virginia@if.usp.br |
| _version_ | 1848370491251752960 |