Layered approach for runtime fault recovery in NoC-based MPSoCs

Bibliographic details
Year of defense: 2015
Main author: Wächter, Eduardo Weber
Advisor: Moraes, Fernando Gehm
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Pontifícia Universidade Católica do Rio Grande do Sul
Graduate program: Programa de Pós-Graduação em Ciência da Computação
Department: Faculdade de Informática
Country: Brazil
Keywords in Portuguese: INFORMÁTICA; ARQUITETURA DE COMPUTADOR; MICROPROCESSADORES
CNPq knowledge area: CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link: http://tede2.pucrs.br/tede2/handle/tede/6279
Abstract: Fault-tolerance mechanisms in MPSoCs are mandatory to cope with fabrication defects and with faults that occur during the product lifetime. For instance, permanent faults in the interconnect network can stall or crash applications even though the MPSoC's network still has alternative fault-free paths to a given destination. Runtime fault tolerance provides self-organization mechanisms that let the system continue delivering its processing services despite cores made defective by permanent and/or transient faults throughout its lifetime. This thesis presents a layered runtime approach to a fault-tolerant MPSoC, in which each layer is responsible for solving one part of the problem. The approach is built on top of a novel, small, specialized network used to search for fault-free paths. The first layer, the physical layer, is responsible for fault detection and for isolating defective routers. The second layer, the network layer, is responsible for replacing a faulty path with an alternative fault-free path: a fault-tolerant routing method executes a path search mechanism and reconfigures the network to use the fault-free path. The third layer, the transport layer, implements a fault-tolerant communication protocol that triggers the path search in the network layer when a packet does not reach its destination. The last layer, the application layer, is responsible for moving tasks from a defective processing element (PE) to a healthy PE, saving the task's internal state, and restoring it if a fault occurs while the task is executing. Results at the network layer show a fast path-finding method: the entire process of finding an alternative path typically takes less than 2000 clock cycles, or 20 microseconds. At the transport layer, different approaches capable of detecting a lost message and starting its retransmission were evaluated.
The results show that the overhead to retransmit a message is 2.46 times the time to transmit a message without a fault, and that all subsequent messages are transmitted with no overhead. For the DTW, MPEG, and synthetic applications, the average-case application execution overhead was 0.17%, 0.09%, and 0.42%, respectively; in the worst case, the overhead is less than 5% of the application execution time. At the application layer, the entire fault recovery protocol executes quickly, with a low execution time overhead both without faults (5.67%) and with faults (17.33% to 28.34%).
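Illustrative note (not part of the catalog record): the network-layer idea of replacing a faulty path with a fault-free one can be sketched in software as a breadth-first search over a 2D mesh that skips routers marked as faulty. This is only an analogy under stated assumptions. The abstract does not describe the search algorithm, the topology, or any data structures, and the thesis performs the search on a specialized hardware network. The function name `find_fault_free_path`, the mesh dimensions, and the `(x, y)` router coordinates are all hypothetical.

```python
from collections import deque


def find_fault_free_path(width, height, faulty, src, dst):
    """Breadth-first search on a width x height mesh, skipping faulty routers.

    Hypothetical helper for illustration only. Returns the list of (x, y)
    routers from src to dst, or None when no fault-free path exists.
    """
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}  # visited set and back-pointers in one dict
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the back-pointers to rebuild the path from src to dst.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    # 3x3 mesh with router (1, 0) faulty: the direct route along y = 0 is blocked.
    print(find_fault_free_path(3, 3, {(1, 0)}, (0, 0), (2, 0)))
```

In this sketch a return value of `None` corresponds to the case where no alternative path exists, which the abstract does not discuss.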
id P_RS_297b3d20a1901c8438d9c496d6904488
oai_identifier_str oai:tede2.pucrs.br:tede/6279
network_acronym_str P_RS
network_name_str Biblioteca Digital de Teses e Dissertações da PUC_RS
repository_id_str
spelling Moraes, Fernando Gehm (advisor); Amory, Alexandre de Morais (co-advisor); Wächter, Eduardo Weber (author). Submitted by Setor de Tratamento da Informação - BC/PUCRS (tede2@pucrs.br) on 2015-08-31T11:15:37Z. No. of bitstreams: 1. 474345 - Texto Completo.pdf: 3978955 bytes, checksum: aa0f35953c5bc355cef3bfc0576e2a44 (MD5). Made available in DSpace on 2015-08-31T11:15:38Z (GMT). Previous issue date: 2015-06-10. Sponsorship: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES. Record last updated: 2015-09-29 08:31:55.1.
dc.title.por.fl_str_mv Layered approach for runtime fault recovery in NoC-based MPSoCs
author Wächter, Eduardo Weber
author_facet Wächter, Eduardo Weber
author_role author
dc.contributor.advisor1.fl_str_mv Moraes, Fernando Gehm
dc.contributor.advisor1ID.fl_str_mv 477.763.820-00
dc.contributor.advisor1Lattes.fl_str_mv http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4782943Z2
dc.contributor.advisor-co1.fl_str_mv Amory, Alexandre de Morais
dc.contributor.advisor-co1Lattes.fl_str_mv http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4765980T0
dc.contributor.authorID.fl_str_mv 011.823.080-82
dc.contributor.authorLattes.fl_str_mv http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4236336A0
dc.contributor.author.fl_str_mv Wächter, Eduardo Weber
contributor_str_mv Moraes, Fernando Gehm
Amory, Alexandre de Morais
dc.subject.por.fl_str_mv INFORMÁTICA
ARQUITETURA DE COMPUTADOR
MICROPROCESSADORES
topic INFORMÁTICA
ARQUITETURA DE COMPUTADOR
MICROPROCESSADORES
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate 2015
dc.date.accessioned.fl_str_mv 2015-08-31T11:15:38Z
dc.date.issued.fl_str_mv 2015-06-10
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://tede2.pucrs.br/tede2/handle/tede/6279
url http://tede2.pucrs.br/tede2/handle/tede/6279
dc.language.iso.fl_str_mv eng
language eng
dc.relation.program.fl_str_mv 1974996533081274470
dc.relation.confidence.fl_str_mv 600
600
600
600
dc.relation.department.fl_str_mv -3008542510401149144
dc.relation.cnpq.fl_str_mv 3671711205811204509
dc.relation.sponsorship.fl_str_mv 2075167498588264571
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Pontifícia Universidade Católica do Rio Grande do Sul
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv PUCRS
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Faculdade de Informática
publisher.none.fl_str_mv Pontifícia Universidade Católica do Rio Grande do Sul
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS
instname:Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron:PUC_RS
instname_str Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
instacron_str PUC_RS
institution PUC_RS
reponame_str Biblioteca Digital de Teses e Dissertações da PUC_RS
collection Biblioteca Digital de Teses e Dissertações da PUC_RS
bitstream.url.fl_str_mv http://tede2.pucrs.br/tede2/bitstream/tede/6279/4/474345+-+Texto+Completo.pdf.jpg
http://tede2.pucrs.br/tede2/bitstream/tede/6279/3/474345+-+Texto+Completo.pdf.txt
http://tede2.pucrs.br/tede2/bitstream/tede/6279/2/474345+-+Texto+Completo.pdf
http://tede2.pucrs.br/tede2/bitstream/tede/6279/1/license.txt
bitstream.checksum.fl_str_mv d4b232619e31353322247acd0e7a0a37
92331325a155fd92593a91e97c142d63
aa0f35953c5bc355cef3bfc0576e2a44
5a9d6006225b368ef605ba16b4f6d1be
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da PUC_RS - Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
repository.mail.fl_str_mv biblioteca.central@pucrs.br
_version_ 1796793216120389632