Contributions to the study of the protein folding problem using deep learning and molecular dynamics

Bibliographic details
Year of defense: 2020
Main author: Hattori, Leandro Takeshi
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Universidade Tecnológica Federal do Paraná
Curitiba
Brazil
Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
UTFPR
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.utfpr.edu.br/jspui/handle/1/24963
Abstract: The Protein Folding Problem (PFP) is one of the main challenges in Computational Biology. Globular proteins are believed to evolve from random initial conformations through folding pathways, reaching, in almost all cases, a functional native structure. Studies of the folding process are related to several abnormal events, such as misfolding and protein aggregation. Therefore, several computational approaches to this problem have been proposed in the literature. Deep Learning (DL) methods have stood out in Proteomics studies, given their ability to extract feature vectors and their efficiency after the training process. Recurrent Neural Networks (RNNs) are cyclic DL methods that have achieved state-of-the-art performance on sequential and temporal problems. This thesis therefore presents contributions to the study of the spatial-temporal pathways of protein folding using RNN methods. To achieve these contributions, the experiments of this thesis were organized in three steps: develop a framework to generate a massive amount of protein folding data using purely sequential and parallel Molecular Dynamics (MD) methods in the canonical ensemble; propose a Neighbourhood List (NL) approach for the parallel MD method; and apply RNNs to the PFP. In the first step, we presented a package called PathMolD-AB to simulate and analyze folding trajectory data, using the 3D-AB off-lattice model to represent the protein structure. The datasets generated with PathMolD-AB correspond to the MD evolution of 3,500 folding pathways, encompassing 35×10⁶ states. The speedup analysis showed that the parallel approach yielded faster simulations for protein sequences with more than 99 amino acids.
In the second step, the NL approach with parallel MD achieved a higher speedup than the purely parallel MD version for protein sequences of 99 to 1,000 amino acids, a range that covers 80% of the entire Protein Data Bank (PDB). In the last step of this thesis, a comparative analysis of RNN architectures was carried out using the many-to-one model with datasets generated by PathMolD-AB. The results indicate that the Long Short-Term Memory (LSTM) network outperformed the other RNN architectures in terms of prediction error. The biological analysis indicated that the LSTM predicted structures with features similar to those of the target (MD) in terms of hydrophobic and polar compactness, as well as torsion and bond energies, suggesting that this approach is promising for the study of the PFP.
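The abstract names a Neighbourhood List (NL) approach without describing how it is implemented. As a rough illustration of the general idea only, the following Python sketch builds a generic Verlet-style neighbour list: each particle stores the indices of the particles within a cutoff radius, so that later force evaluations can skip distant pairs. The function names, the `cutoff` and `skin` parameters, and the toy coordinates are all assumptions made for this example; they are not taken from PathMolD-AB or from the thesis.

```python
from itertools import combinations


def build_neighbour_list(positions, cutoff, skin=0.0):
    """Return, for each particle, the indices of particles within cutoff + skin.

    Hypothetical helper for illustration; not the thesis implementation.
    """
    radius_sq = (cutoff + skin) ** 2
    neighbours = [[] for _ in positions]
    # Visit every unordered pair once and record it on both particles.
    for i, j in combinations(range(len(positions)), 2):
        dist_sq = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
        if dist_sq <= radius_sq:
            neighbours[i].append(j)
            neighbours[j].append(i)
    return neighbours


def pair_count(neighbours):
    """Number of distinct interacting pairs stored in the list."""
    return sum(len(n) for n in neighbours) // 2


# Toy example: four beads on a line, assumed cutoff of 1.5 length units.
beads = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
nl = build_neighbour_list(beads, cutoff=1.5)
print(nl)
print(pair_count(nl), "of", len(beads) * (len(beads) - 1) // 2, "pairs kept")
```

In a sketch like this, the list is rebuilt only every few steps (the `skin` margin allows for particle movement between rebuilds), and the savings grow with chain length because the number of nearby pairs grows far more slowly than the number of all pairs.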
id UTFPR-12_aacdb480caf97c2676f97c404da79c13
oai_identifier_str oai:repositorio.utfpr.edu.br:1/24963
network_acronym_str UTFPR-12
network_name_str Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository_id_str
dc.title.none.fl_str_mv Contributions to the study of the protein folding problem using deep learning and molecular dynamics
Contribuições para o estudo do problema de dobramento de proteínas usando métodos de aprendizado profundo e dinâmica molecular
author Hattori, Leandro Takeshi
author_facet Hattori, Leandro Takeshi
author_role author
dc.contributor.none.fl_str_mv Lopes, Heitor Silverio
https://orcid.org/0000-0003-3984-1432
http://lattes.cnpq.br/4045818083957064
Benitez, Cesar Manuel Vargas
https://orcid.org/0000-0002-5691-5432
http://lattes.cnpq.br/3930929146154435
Britto Junior, Alceu de Souza
https://orcid.org/0000-0002-3064-3563
http://lattes.cnpq.br/4251936710939364
Lopes, Fabricio Martins
http://orcid.org/0000-0002-8786-3313
http://lattes.cnpq.br/1660070580824436
Lopes, Heitor Silverio
https://orcid.org/0000-0003-3984-1432
http://lattes.cnpq.br/4045818083957064
Frigori, Rafael Bertolini
https://orcid.org/0000-0002-4861-7240
http://lattes.cnpq.br/5836878566801544
Parpinelli, Rafael Stubs
https://orcid.org/0000-0001-7326-5032
http://lattes.cnpq.br/4456007001373501
dc.contributor.author.fl_str_mv Hattori, Leandro Takeshi
dc.subject.por.fl_str_mv Proteínas
Dinâmica molecular
Biologia computacional
Computação de alto desempenho
Biologia Molecular Computacional
Proteômica - Processamento de dados
Simulação (Computadores)
Proteins
Molecular dynamics
Computational biology
High performance computing
Computational molecular biology
Proteomics - Data processing
Computer simulation
CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Engenharia Elétrica
publishDate 2020
dc.date.none.fl_str_mv 2020-11-30
2021-05-16T20:44:31Z
2021-05-16T20:44:31Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv HATTORI, Leandro Takeshi. Contributions to the study of the protein folding problem using deep learning and molecular dynamics. 2020. Tese (Doutorado em Engenharia Elétrica e Informática Industrial) - Universidade Tecnológica Federal do Paraná, Curitiba, 2020.
http://repositorio.utfpr.edu.br/jspui/handle/1/24963
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by/4.0/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Tecnológica Federal do Paraná
Curitiba
Brasil
Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
UTFPR
dc.source.none.fl_str_mv reponame:Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
instname:Universidade Tecnológica Federal do Paraná (UTFPR)
instacron:UTFPR
instname_str Universidade Tecnológica Federal do Paraná (UTFPR)
instacron_str UTFPR
institution UTFPR
reponame_str Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
collection Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository.name.fl_str_mv Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT)) - Universidade Tecnológica Federal do Paraná (UTFPR)
repository.mail.fl_str_mv riut@utfpr.edu.br || sibi@utfpr.edu.br
_version_ 1850498257406394368