A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities

Bibliographic details
Year of defense: 2024
Main author: Santana, Carla dos Santos
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Universidade Federal do Rio Grande do Norte
BR
UFRN
PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufrn.br/handle/123456789/63748
Abstract: High-performance computing, a dynamic field within computer science, provides the processing power that algorithms across diverse domains require. Large-scale supercomputers are indispensable for tackling complex problems; however, their size and complexity make them susceptible to failure. This makes fault tolerance techniques critical for mitigating the impact of interruptions or failures. These techniques address hardware and software malfunctions as well as preemption scenarios. We present new methodologies for improving fault tolerance in bulk synchronous programs, packaged as the Dependability Library for Iterative Applications. The library combines application-level data conservation, fault detection, and failover capabilities, and it simplifies the integration of fault tolerance into applications while remaining highly configurable. This thesis presents data conservation methodologies, including application-level checkpointing and process data replication, which ensure reliability by allowing a backup unit to take over in case of failure. It also presents fault detection methods, such as termination signal detection and heartbeat monitoring with inexpensive communication, that trigger data conservation only when a failure is possible; this keeps overhead low. The library is compatible with user-level failure mitigation, which enables failover: programs can continue operating after crashes, minimizing downtime. The proposal was successfully applied to the geophysical problem of full-waveform inversion, a standard algorithm in oil and gas exploration geophysics. This application serves as a practical high-performance scenario for analysis and demonstrates the library's real-world applicability. All methods were rigorously validated, and the overhead in this problem was thoroughly analyzed using more realistic examples. In our experiments, the application retained the data processed up to the moment of failure and continued execution even in the presence of node failure, with minimal overhead. This work also presents other case studies at an early stage of applying the library and discusses fault tolerance concepts and related work.
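The abstract's idea of triggering data conservation only when a failure looks possible can be illustrated with a minimal Python sketch. This is not the thesis library's API: `InterruptFlag`, `save_checkpoint`, `load_checkpoint`, `run`, and the `after_step` hook are hypothetical names, and the JSON checkpoint file and the toy accumulation loop are assumptions made only for illustration.

```python
import json
import os
import signal
import tempfile


class InterruptFlag:
    """Hypothetical helper: remembers that a termination signal arrived."""

    def __init__(self, signums=(signal.SIGTERM,)):
        self.raised = False
        for signum in signums:
            # Must be called from the main thread of the process.
            signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        # Only set a flag here; the iterative loop polls it at a safe point.
        self.raised = True


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"iteration": 0, "value": 0.0}
    with open(path) as handle:
        return json.load(handle)


def run(path, n_iterations, flag, after_step=None):
    """Iterate from the last checkpoint; save only if an interruption is flagged.

    Returns (state, finished). `after_step` is an optional hook that receives
    the number of completed iterations (used below to simulate preemption).
    """
    state = load_checkpoint(path)
    while state["iteration"] < n_iterations:
        state["value"] += state["iteration"]  # stand-in for real computation
        state["iteration"] += 1
        if after_step is not None:
            after_step(state["iteration"])
        if flag.raised:
            save_checkpoint(path, state)
            return state, False
    return state, True
```

A usage sketch: create `flag = InterruptFlag()`, call `run("ckpt.json", 10, flag)`, and, if the run returns `finished == False` after a SIGTERM, launch it again with the same path to resume from the saved iteration. No checkpoint is written during failure-free iterations, so the only cost in that case is polling the flag.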
id UFRN_187480d0f9d641d096e18f8962129eba
oai_identifier_str oai:repositorio.ufrn.br:123456789/63748
network_acronym_str UFRN
network_name_str Repositório Institucional da UFRN
repository_id_str
dc.title.none.fl_str_mv A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities
title A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities
author Santana, Carla dos Santos
author_role author
dc.contributor.none.fl_str_mv Souza, Samuel Xavier de
https://orcid.org/0000-0003-3328-0056
http://lattes.cnpq.br/4697610292983660
https://orcid.org/0000-0001-8747-4580
http://lattes.cnpq.br/9892239670106361
Bianchini, Calebe de Paula
Tadonki, Claude
Chauris, Herve
Taufer, Michela
Navaux, Philippe Olivier Alexandre
Barros, Tiago Tavares Leite
dc.contributor.author.fl_str_mv Santana, Carla dos Santos
dc.subject.por.fl_str_mv Fault tolerance
Interruption detection
Data conservation
Failover
High performance computing
ENGENHARIAS::ENGENHARIA ELETRICA
publishDate 2024
dc.date.none.fl_str_mv 2024-10-04
2025-05-29T22:29:53Z
2025-05-29T22:29:53Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv SANTANA, Carla dos Santos. A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities. Advisor: Dr. Samuel Xavier de Souza. 2024. 90 leaves. Thesis (Doctorate in Electrical and Computer Engineering) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2024.
https://repositorio.ufrn.br/handle/123456789/63748
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal do Rio Grande do Norte
BR
UFRN
PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFRN
instname:Universidade Federal do Rio Grande do Norte (UFRN)
instacron:UFRN
repository.name.fl_str_mv Repositório Institucional da UFRN - Universidade Federal do Rio Grande do Norte (UFRN)
repository.mail.fl_str_mv repositorio@bczm.ufrn.br
_version_ 1855758841940017152