Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank (System for the analysis of HIV nucleotide sequences available in GenBank)

Bibliographic details
Year of defense: 2015
Main author: Gonçalves, José Irahe Kasprzykowski
Advisor: Queiroz, Artur Trancoso Lopo de
Examination committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Estadual de Feira de Santana
Graduate program: Mestrado em Computação Aplicada
Department: DEPARTAMENTO DE CIÊNCIAS EXATAS
Country: Brazil
Keywords in Portuguese:
HIV
Keywords in English:
HIV
CNPq knowledge area:
Access link: http://localhost:8080/tede/handle/tede/327
Abstract: HIV infects over 40 million people worldwide and is considered a large-scale pandemic by the World Health Organization. The associated disease has no cure. New data and analyses can support the development of new treatments and vaccines. However, the dataset is vast, with over 500,000 sequences available in GenBank, and it still lacks essential information such as subtype and genome location. To help minimize these problems, we developed a system for automated analysis of GenBank data. The tool maps each sequence against the HXB2 reference genome and subtypes it by comparison with subtype reference sequences. These steps use the Needleman-Wunsch and Smith-Waterman algorithms, respectively. All 582,678 sequences were mapped in 5 days and 14 hours and subtyped in 1 day and 7 hours with our algorithm, while the original approach was estimated to take 36 and 97 years, respectively. Our tool was able to analyse this massive dataset in a reasonable time. No current subtyping tool can analyse data at this scale. Our results showed that pol and gag were the most prevalent genes in the dataset, which may be explained by the fact that treatment and subtyping are based on these genes. Moreover, the structural genes were the most prevalent, with 66.41%, which highlights the low representation of regulatory genes in the available data. The subtyping results showed that subtype B was the most frequent, with 45.96%. The recombinants together represent 43.37%. Furthermore, subtype C accounted for only 4.12% and the other pure subtypes for less than 4%. Geographical data were also recovered from the database; the USA had the highest frequency, with 24.50%, showing a significant country bias. Our results present a new HIV subtype distribution based on the most complete and recent dataset. Herein, we presented new user-friendly software for massive data analysis of viruses. This software is able to analyse data from highly mutable viruses, such as HCV and HIV, in a reasonable time. Further, the severe country bias raises questions regarding the worldwide subtype distribution. The analysis of all available HIV sequences provides new insights into the epidemic, including subtype and country distribution.
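The abstract names two classical dynamic-programming alignments: Needleman-Wunsch (global) for mapping against HXB2 and Smith-Waterman (local) for subtyping against reference sequences. The following is only a minimal textbook sketch of those two scoring recurrences, not the optimized algorithm described in the dissertation. The score values, the function names, and the toy reference sequences are all illustrative assumptions; the record does not state the parameters actually used.

```python
from typing import Dict

# Illustrative scoring scheme (assumption; not taken from the dissertation).
MATCH, MISMATCH, GAP = 2, -1, -2


def needleman_wunsch_score(a: str, b: str) -> int:
    """Global alignment score of a and b with a linear gap penalty."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str) -> int:
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # Local alignment never drops below zero: a cell can restart.
            cell = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def assign_subtype(query: str, references: Dict[str, str]) -> str:
    """Return the name of the (hypothetical) reference with the best local score."""
    return max(references, key=lambda name: smith_waterman_score(query, references[name]))


if __name__ == "__main__":
    # Toy sequences for illustration only; real references are full HIV genomes.
    refs = {"B": "ACGTACGA", "C": "TTTTACGT"}
    print(assign_subtype("ACGTACGT", refs))
```

Both functions run in time proportional to the product of the two sequence lengths, which gives a sense of why a naive all-against-all run over several hundred thousand sequences is slow and why the dissertation reports a faster approach.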
id UEFS_48fce772d82549fd4dc69cd8dcddc480
oai_identifier_str oai:tede2.uefs.br:8080:tede/327
network_acronym_str UEFS
network_name_str Biblioteca Digital de Teses e Dissertações da UEFS
repository_id_str
spelling Submitted by Luis Ricardo Andrade da Silva (lrasilva@uefs.br) on 2016-03-31T01:16:15Z. No. of bitstreams: 1. Dissertação Final.pdf: 2489318 bytes, checksum: 74b79aac96fa73b31d6e0dbb4272efe3 (MD5). Made available in DSpace on 2016-03-31T01:16:15Z (GMT). Previous issue date: 2015-12-15.
dc.title.por.fl_str_mv Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
title Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
spellingShingle Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
Gonçalves, José Irahe Kasprzykowski
HIV
Sequências nucleotídicas
Subtipo
Genótipo
Genética
HIV
Nucleotide sequences
Subtypes
Genotypes
Genetics
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
title_short Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
title_full Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
title_fullStr Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
title_full_unstemmed Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
title_sort Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
author Gonçalves, José Irahe Kasprzykowski
author_facet Gonçalves, José Irahe Kasprzykowski
author_role author
dc.contributor.advisor1.fl_str_mv Queiroz, Artur Trancoso Lopo de
dc.contributor.advisor1ID.fl_str_mv 83630643515
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/5222182427171497
dc.contributor.authorID.fl_str_mv 01845510569
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/6650527222516832
dc.contributor.author.fl_str_mv Gonçalves, José Irahe Kasprzykowski
contributor_str_mv Queiroz, Artur Trancoso Lopo de
dc.subject.por.fl_str_mv HIV
Sequências nucleotídicas
Subtipo
Genótipo
Genética
topic HIV
Sequências nucleotídicas
Subtipo
Genótipo
Genética
HIV
Nucleotide sequences
Subtypes
Genotypes
Genetics
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
dc.subject.eng.fl_str_mv HIV
Nucleotide sequences
Subtypes
Genotypes
Genetics
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
publishDate 2015
dc.date.issued.fl_str_mv 2015-12-15
dc.date.accessioned.fl_str_mv 2016-03-31T01:16:15Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv GONÇALVES, José Irahe Kasprzykowski. Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank. 2015. 60 f. Dissertação (Mestrado em Computação Aplicada) - Universidade Estadual de Feira de Santana, Feira de Santana, BA.
dc.identifier.uri.fl_str_mv http://localhost:8080/tede/handle/tede/327
identifier_str_mv GONÇALVES, José Irahe Kasprzykowski. Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank. 2015. 60 f. Dissertação (Mestrado em Computação Aplicada) - Universidade Estadual de Feira de Santana, Feira de Santana, BA.
url http://localhost:8080/tede/handle/tede/327
dc.language.iso.fl_str_mv por
language por
dc.relation.program.fl_str_mv 303317282311144204
dc.relation.confidence.fl_str_mv 600
600
600
600
dc.relation.department.fl_str_mv -5486832816611506211
dc.relation.cnpq.fl_str_mv 3671711205811204509
8930092515683771531
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Estadual de Feira de Santana
dc.publisher.program.fl_str_mv Mestrado em Computação Aplicada
dc.publisher.initials.fl_str_mv UEFS
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv DEPARTAMENTO DE CIÊNCIAS EXATAS
publisher.none.fl_str_mv Universidade Estadual de Feira de Santana
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UEFS
instname:Universidade Estadual de Feira de Santana (UEFS)
instacron:UEFS
instname_str Universidade Estadual de Feira de Santana (UEFS)
instacron_str UEFS
institution UEFS
reponame_str Biblioteca Digital de Teses e Dissertações da UEFS
collection Biblioteca Digital de Teses e Dissertações da UEFS
bitstream.url.fl_str_mv http://tede2.uefs.br:8080/bitstream/tede/327/2/Disserta%C3%A7%C3%A3o+Final.pdf
http://tede2.uefs.br:8080/bitstream/tede/327/1/license.txt
bitstream.checksum.fl_str_mv 74b79aac96fa73b31d6e0dbb4272efe3
7b5ba3d2445355f386edab96125d42b7
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UEFS - Universidade Estadual de Feira de Santana (UEFS)
repository.mail.fl_str_mv bcuefs@uefs.br|| bcref@uefs.br||bcuefs@uefs.br
_version_ 1796793632206880768