A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities
| Year of defense: | 2024 |
|---|---|
| Main author: | Santana, Carla dos Santos |
| Advisor: | Souza, Samuel Xavier de |
| Defense committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Universidade Federal do Rio Grande do Norte BR UFRN PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://repositorio.ufrn.br/handle/123456789/63748 |
Abstract: | High-performance computing provides the processing power that algorithms in many domains require. Large-scale supercomputers are indispensable for tackling complex problems, but their size and complexity make them susceptible to failure. Fault tolerance techniques are therefore needed to mitigate the impact of interruptions caused by hardware malfunctions, software malfunctions, and preemption. This thesis presents new methodologies for improving fault tolerance in bulk synchronous programs, packaged as the Dependability Library for Iterative Applications. The library combines application-level data conservation, fault detection, and failover capabilities, and it is highly configurable, which simplifies adding fault tolerance to applications. The data conservation methods include application-level checkpointing and process data replication, which allow a backup unit to take over in case of failure. The fault detection methods include termination signal detection and heartbeat monitoring with inexpensive communication; they trigger data conservation only when a failure is possible, which keeps overhead low. The library is compatible with user-level failure mitigation, which provides failover: programs can continue operating after crashes, minimizing downtime. The proposal was successfully applied to full-waveform inversion, a standard algorithm in geophysical processing for oil and gas exploration, which serves as a practical high-performance scenario for demonstrating the library's real-world applicability. All methods were validated, and the overhead in this problem was analyzed using more realistic examples. In the experiments, the application did not lose all of the data processed up to the moment of failure, and it could continue executing even in the presence of a node failure, with minimal overhead. The thesis also presents other case studies at an early stage of applying the library and discusses fault tolerance concepts and related work. |
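
The abstract describes triggering data conservation only when a termination signal suggests that a failure may be imminent. The following minimal Python sketch illustrates that idea for an iterative loop. It is not the thesis library's API: every name here (`InterruptFlag`, `install_handler`, `save_checkpoint`, `load_checkpoint`, `run`, the `after_iteration` hook) and the JSON checkpoint format are assumptions made purely for illustration.

```python
import json
import os
import signal


class InterruptFlag:
    """Hypothetical signal handler: it only records that a signal arrived."""

    def __init__(self):
        self.raised = False

    def __call__(self, signum, frame):
        self.raised = True


def install_handler(flag):
    # SIGTERM is commonly sent by batch schedulers before a job is killed.
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, flag)


def save_checkpoint(path, state):
    # Write to a temporary file, then rename, so that a crash during the
    # write cannot leave a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the saved state if a checkpoint exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "value": 0.0}


def run(path, n_iters, flag, after_iteration=None):
    """Iterate; checkpoint and stop only if the interrupt flag was raised.

    Returns (state, finished). `after_iteration` is an illustrative hook
    called with the iteration number after each step.
    """
    state = load_checkpoint(path)
    while state["iteration"] < n_iters:
        state["value"] += 1.0  # stand-in for one iteration of real work
        state["iteration"] += 1
        if after_iteration is not None:
            after_iteration(state["iteration"])
        if flag.raised:
            # No checkpoint is written during normal operation; data is
            # conserved only once an interruption has been detected.
            save_checkpoint(path, state)
            return state, False
    return state, True
```

In a real job, `install_handler(flag)` would be called once at startup so that a SIGTERM from the scheduler sets the flag. Calling `run` again with the same path resumes from the saved iteration instead of starting over.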
| id | UFRN_187480d0f9d641d096e18f8962129eba |
|---|---|
| oai_identifier_str | oai:repositorio.ufrn.br:123456789/63748 |
| network_acronym_str | UFRN |
| network_name_str | Repositório Institucional da UFRN |
| repository_id_str | |
| dc.title.none.fl_str_mv | A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities |
| author | Santana, Carla dos Santos |
| author_facet | Santana, Carla dos Santos |
| author_role | author |
| dc.contributor.none.fl_str_mv | Souza, Samuel Xavier de; https://orcid.org/0000-0003-3328-0056; http://lattes.cnpq.br/4697610292983660; https://orcid.org/0000-0001-8747-4580; http://lattes.cnpq.br/9892239670106361; Bianchini, Calebe de Paula; Tadonki, Claude; Chauris, Herve; Taufer, Michela; Navaux, Philippe Olivier Alexandre; Barros, Tiago Tavares Leite |
| dc.contributor.author.fl_str_mv | Santana, Carla dos Santos |
| dc.subject.por.fl_str_mv | Fault tolerance; Interruption detection; Data conservation; Failover; High performance computing; ENGENHARIAS::ENGENHARIA ELETRICA |
| topic | Fault tolerance; Interruption detection; Data conservation; Failover; High performance computing; ENGENHARIAS::ENGENHARIA ELETRICA |
| publishDate | 2024 |
| dc.date.none.fl_str_mv | 2024-10-04; 2025-05-29T22:29:53Z; 2025-05-29T22:29:53Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/doctoralThesis |
| format | doctoralThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | SANTANA, Carla dos Santos. A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities. Advisor: Dr. Samuel Xavier de Souza. 2024. 90 f. Thesis (Doctorate in Electrical and Computer Engineering) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2024.; https://repositorio.ufrn.br/handle/123456789/63748 |
| identifier_str_mv | SANTANA, Carla dos Santos. A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities. Advisor: Dr. Samuel Xavier de Souza. 2024. 90 f. Thesis (Doctorate in Electrical and Computer Engineering) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2024. |
| url | https://repositorio.ufrn.br/handle/123456789/63748 |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.publisher.none.fl_str_mv | Universidade Federal do Rio Grande do Norte BR UFRN PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO |
| publisher.none.fl_str_mv | Universidade Federal do Rio Grande do Norte BR UFRN PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO |
| dc.source.none.fl_str_mv | reponame:Repositório Institucional da UFRN; instname:Universidade Federal do Rio Grande do Norte (UFRN); instacron:UFRN |
| instname_str | Universidade Federal do Rio Grande do Norte (UFRN) |
| instacron_str | UFRN |
| institution | UFRN |
| reponame_str | Repositório Institucional da UFRN |
| collection | Repositório Institucional da UFRN |
| repository.name.fl_str_mv | Repositório Institucional da UFRN - Universidade Federal do Rio Grande do Norte (UFRN) |
| repository.mail.fl_str_mv | repositorio@bczm.ufrn.br |
| _version_ | 1855758841940017152 |