SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável

Castro, Marcelo Rodrigo de

Bibliographic details
Year of defense:	2017
Lead author:	Castro, Marcelo Rodrigo de
Advisor:	Senger, Hermes
Defense committee:	Not provided by the institution
Document type:	Master's dissertation
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de São Carlos Câmpus São Carlos
Graduate program:	Programa de Pós-Graduação em Ciência da Computação - PPGCC
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords (Portuguese):	BLAST; Apache Spark; Nuvens computacionais; Sequenciamento genético
Keywords (English):	Cloud computing; Genetic sequencing; Hadoop
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	https://repositorio.ufscar.br/handle/20.500.14289/9114
Abstract:	With the evolution of next-generation sequencing devices, the cost of obtaining genomic data has fallen significantly. As sequencing costs have dropped, the amount of genomic data to be processed has increased exponentially. This data growth outpaces the rate at which computing power increases year after year through hardware and software evolution. The higher rate of data growth in bioinformatics therefore raises the need for more efficient and scalable techniques based on parallel and distributed processing, on platforms such as clusters and cloud computing. BLAST is a widely used tool for genomic sequence alignment with native support for multicore parallel processing. However, its scalability is limited to a single machine. Cloud computing, meanwhile, has emerged as an important technology for rapid and elastic provisioning of large amounts of resources. Current frameworks such as Apache Hadoop and Apache Spark support the execution of distributed applications. These environments provide mechanisms for embedding external applications in order to compose large distributed jobs that can be executed on clusters and cloud platforms. In this work, we used Spark to support highly scalable and efficient parallelization of BLAST (Basic Local Alignment Search Tool) on dozens to hundreds of processing cores on a cloud platform. As a result, our prototype demonstrated better performance and scalability than CloudBLAST, a Hadoop-based parallelization of BLAST.

Item metadata

id	SCAR_60ea872e8bb68d3ffd2acc9d1fa37ed5
oai_identifier_str	oai:repositorio.ufscar.br:20.500.14289/9114
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str
spelling	Abstract (translated from the Portuguese): With falling costs and the evolution of the mechanisms that perform genomic sequencing, there has been a large increase in the amount of data from genomics studies. This data has been growing faster than industry has been able to increase computing power each year. To better meet the need for data processing and analysis in bioinformatics, parallel and distributed systems are used, such as clusters, grids, and computational clouds. However, many tools, such as BLAST, which align sequences against databases, were not developed to run in a distributed and scalable manner. The current frameworks Apache Hadoop and Apache Spark allow applications to run in a distributed and parallel manner, provided the applications can be properly adapted and parallelized. Studies aimed at improving the performance of bioinformatics applications have become a continuous effort. Spark has proven to be a robust tool for massive data processing. In this master's research, Apache Spark was used to support the parallelization of BLAST (Basic Local Alignment Search Tool). Experiments carried out on the Google Cloud and Microsoft Azure clouds show that the performance (speedup) obtained was similar to or better than that of similar work previously developed in Hadoop. Funding agencies: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ).
dc.title.por.fl_str_mv	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
title	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
spellingShingle	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável Castro, Marcelo Rodrigo de BLAST Apache Spark Nuvens computacionais Sequenciamento genético Cloud computing Genetic sequencing Hadoop CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
title_short	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
title_full	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
title_fullStr	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
title_full_unstemmed	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
title_sort	SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável
author	Castro, Marcelo Rodrigo de
author_facet	Castro, Marcelo Rodrigo de
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/8688712033943534
dc.contributor.author.fl_str_mv	Castro, Marcelo Rodrigo de
dc.contributor.advisor1.fl_str_mv	Senger, Hermes
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/3691742159298316
dc.contributor.authorID.fl_str_mv	46b787fd-c8fd-4eb1-a39f-013b75d22365
contributor_str_mv	Senger, Hermes
dc.subject.por.fl_str_mv	BLAST Apache Spark Nuvens computacionais Sequenciamento genético
topic	BLAST Apache Spark Nuvens computacionais Sequenciamento genético Cloud computing Genetic sequencing Hadoop CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Cloud computing Genetic sequencing Hadoop
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
description	With the evolution of next-generation sequencing devices, the cost of obtaining genomic data has fallen significantly. As sequencing costs have dropped, the amount of genomic data to be processed has increased exponentially. This data growth outpaces the rate at which computing power increases year after year through hardware and software evolution. The higher rate of data growth in bioinformatics therefore raises the need for more efficient and scalable techniques based on parallel and distributed processing, on platforms such as clusters and cloud computing. BLAST is a widely used tool for genomic sequence alignment with native support for multicore parallel processing. However, its scalability is limited to a single machine. Cloud computing, meanwhile, has emerged as an important technology for rapid and elastic provisioning of large amounts of resources. Current frameworks such as Apache Hadoop and Apache Spark support the execution of distributed applications. These environments provide mechanisms for embedding external applications in order to compose large distributed jobs that can be executed on clusters and cloud platforms. In this work, we used Spark to support highly scalable and efficient parallelization of BLAST (Basic Local Alignment Search Tool) on dozens to hundreds of processing cores on a cloud platform. As a result, our prototype demonstrated better performance and scalability than CloudBLAST, a Hadoop-based parallelization of BLAST.
publishDate	2017
dc.date.accessioned.fl_str_mv	2017-09-25T17:05:03Z
dc.date.available.fl_str_mv	2017-09-25T17:05:03Z
dc.date.issued.fl_str_mv	2017-02-13
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	CASTRO, Marcelo Rodrigo de. SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável. 2017. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2017. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/9114.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/20.500.14289/9114
identifier_str_mv	CASTRO, Marcelo Rodrigo de. SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável. 2017. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2017. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/9114.
url	https://repositorio.ufscar.br/handle/20.500.14289/9114
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	600 600
dc.relation.authority.fl_str_mv	2947c428-30b1-4d14-8369-e5871a4d7acc
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstreams/eaa1b500-26a4-437f-bc40-9577b0191981/download https://repositorio.ufscar.br/bitstreams/2289cf90-cfab-475c-8cde-4d57f831b8e2/download https://repositorio.ufscar.br/bitstreams/bada32a2-c7e6-48ae-9e11-3212244da5a5/download https://repositorio.ufscar.br/bitstreams/07aeb01d-9c84-4a6a-92d2-8e3f18029a6f/download
bitstream.checksum.fl_str_mv	9921840ad67ef82d956e399ab96dd78c ae0398b6f8b235e40ad82cba6c50031d 60c9f768375bf8ec7d483feeac370974 482833920f3fa632038ec4bf214d7b64
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv	repositorio.sibi@ufscar.br
_version_	1851688752226762752
