Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank
Year of defense: | 2015 |
---|---|
Main author: | Gonçalves, José Irahe Kasprzykowski |
Advisor: | Queiroz, Artur Trancoso Lopo de |
Defense committee: | |
Document type: | Master's thesis |
Access type: | Open access |
Language: | por |
Degree-granting institution: |
Universidade Estadual de Feira de Santana
|
Graduate program: |
Mestrado em Computação Aplicada
|
Department: |
DEPARTAMENTO DE CIÊNCIAS EXATAS
|
Country: |
Brazil
|
Keywords in Portuguese: | HIV; Sequências nucleotídicas; Subtipo; Genótipo; Genética |
Keywords in English: | HIV; Nucleotide sequences; Subtypes; Genotypes; Genetics |
CNPq knowledge area: | CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO; CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO |
Access link: | http://localhost:8080/tede/handle/tede/327 |
Abstract: | HIV infects over 40 million people worldwide, and the World Health Organization considers it a large-scale pandemic. The associated disease has no cure. New data and analyses can support the development of new treatments and vaccines. However, the dataset is vast, with over 500,000 sequences available in GenBank, and these records still lack essential information such as subtype and genome location. To help minimize these problems, we developed a system for automated analysis of GenBank data. The tool maps each sequence against the HXB2 reference genome and subtypes it by comparison with subtype reference sequences, using the Needleman-Wunsch and Smith-Waterman algorithms, respectively. All 582,678 sequences were mapped in 5 days and 14 hours and subtyped in 1 day and 7 hours with our algorithm, whereas the original approach was estimated to take 36 and 97 years, respectively. Our tool was thus able to analyse this massive dataset in a reasonable time; no current subtyping tool can analyse data at this scale. Our results showed that pol and gag were the most prevalent genes in the dataset, which could be explained by the fact that treatment and subtyping are based on these genes. Moreover, the structural genes were the most prevalent, with 66.41%, which highlights the low representation of regulatory genes in the available data. The subtyping results showed that subtype B was the most frequent, with 45.96%. The recombinants together represent 43.37%. Furthermore, subtype C accounted for only 4.12% and the other pure subtypes for less than 4%. Geographical data were also recovered from the database: the USA had the highest frequency, with 24.50%, showing a significant country bias. Our results present a new HIV subtype distribution based on the most complete and recent dataset. Herein, we present new user-friendly software for massive data analysis of viruses. This software is able to analyse data from highly mutable viruses, such as HCV and HIV, in a reasonable time. Further, the severe country bias raises questions regarding the worldwide subtype distribution. The analysis of all HIV sequences provides new epidemiological insights into subtype and country distribution. |
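The abstract states that subtyping compares each sequence with subtype reference sequences using Smith-Waterman. A minimal sketch of that idea follows, assuming a linear gap penalty and choosing the reference with the highest local alignment score. The scoring values, the function names `smith_waterman_score` and `assign_subtype`, and the toy reference strings are illustrative assumptions; this record does not describe the thesis's actual parameters or the optimizations behind its reported run times.

```python
def smith_waterman_score(query, ref, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of query vs. ref (linear gap penalty).

    Scoring values are assumptions for illustration only.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 in local alignment
        for j, r in enumerate(ref, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def assign_subtype(query, references):
    """Return the label of the reference with the highest local score."""
    return max(
        references,
        key=lambda label: smith_waterman_score(query, references[label]),
    )


# Toy references invented for this example; these are NOT real HIV sequences.
toy_references = {
    "B": "ACGTTGCAAGGCT",
    "C": "ACGAAGCTTGGAT",
}

print(assign_subtype("GTTGCAAGG", toy_references))  # prints B
```

The sketch keeps only two rows of the scoring matrix because only the best score is needed, not the alignment itself. A real pipeline would also need a rule for ties and for recombinant sequences, which this example does not attempt.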
id |
UEFS_48fce772d82549fd4dc69cd8dcddc480 |
---|---|
oai_identifier_str |
oai:tede2.uefs.br:8080:tede/327 |
network_acronym_str |
UEFS |
network_name_str |
Biblioteca Digital de Teses e Dissertações da UEFS |
repository_id_str |
|
spelling |
Submitted by Luis Ricardo Andrade da Silva (lrasilva@uefs.br) on 2016-03-31T01:16:15Z. No. of bitstreams: 1. Dissertação Final.pdf: 2489318 bytes, checksum: 74b79aac96fa73b31d6e0dbb4272efe3 (MD5). Made available in DSpace on 2016-03-31T01:16:15Z (GMT). Previous issue date: 2015-12-15. |
dc.title.por.fl_str_mv |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
title |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
spellingShingle |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank Gonçalves, José Irahe Kasprzykowski HIV Sequências nucleotídicas Subtipo Genótipo Genética HIV Nucleotide sequences Subtypes Genotypes Genetics CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO |
title_short |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
title_full |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
title_fullStr |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
title_full_unstemmed |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
title_sort |
Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank |
author |
Gonçalves, José Irahe Kasprzykowski |
author_facet |
Gonçalves, José Irahe Kasprzykowski |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Queiroz, Artur Trancoso Lopo de |
dc.contributor.advisor1ID.fl_str_mv |
83630643515 |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/5222182427171497 |
dc.contributor.authorID.fl_str_mv |
01845510569 |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/6650527222516832 |
dc.contributor.author.fl_str_mv |
Gonçalves, José Irahe Kasprzykowski |
contributor_str_mv |
Queiroz, Artur Trancoso Lopo de |
dc.subject.por.fl_str_mv |
HIV Sequências nucleotídicas Subtipo Genótipo Genética |
topic |
HIV Sequências nucleotídicas Subtipo Genótipo Genética HIV Nucleotide sequences Subtypes Genotypes Genetics CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO |
dc.subject.eng.fl_str_mv |
HIV Nucleotide sequences Subtypes Genotypes Genetics |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO |
description |
HIV infects over 40 million people worldwide, and the World Health Organization considers it a large-scale pandemic. The associated disease has no cure. New data and analyses can support the development of new treatments and vaccines. However, the dataset is vast, with over 500,000 sequences available in GenBank, and these records still lack essential information such as subtype and genome location. To help minimize these problems, we developed a system for automated analysis of GenBank data. The tool maps each sequence against the HXB2 reference genome and subtypes it by comparison with subtype reference sequences, using the Needleman-Wunsch and Smith-Waterman algorithms, respectively. All 582,678 sequences were mapped in 5 days and 14 hours and subtyped in 1 day and 7 hours with our algorithm, whereas the original approach was estimated to take 36 and 97 years, respectively. Our tool was thus able to analyse this massive dataset in a reasonable time; no current subtyping tool can analyse data at this scale. Our results showed that pol and gag were the most prevalent genes in the dataset, which could be explained by the fact that treatment and subtyping are based on these genes. Moreover, the structural genes were the most prevalent, with 66.41%, which highlights the low representation of regulatory genes in the available data. The subtyping results showed that subtype B was the most frequent, with 45.96%. The recombinants together represent 43.37%. Furthermore, subtype C accounted for only 4.12% and the other pure subtypes for less than 4%. Geographical data were also recovered from the database: the USA had the highest frequency, with 24.50%, showing a significant country bias. Our results present a new HIV subtype distribution based on the most complete and recent dataset. Herein, we present new user-friendly software for massive data analysis of viruses. This software is able to analyse data from highly mutable viruses, such as HCV and HIV, in a reasonable time. Further, the severe country bias raises questions regarding the worldwide subtype distribution. The analysis of all HIV sequences provides new epidemiological insights into subtype and country distribution. |
publishDate |
2015 |
dc.date.issued.fl_str_mv |
2015-12-15 |
dc.date.accessioned.fl_str_mv |
2016-03-31T01:16:15Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
GONÇALVES, José Irahe Kasprzykowski. Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank. 2015. 60 f. Dissertação (Mestrado em Computação Aplicada) - Universidade Estadual de Feira de Santana, Feira de Santana, BA. |
dc.identifier.uri.fl_str_mv |
http://localhost:8080/tede/handle/tede/327 |
identifier_str_mv |
GONÇALVES, José Irahe Kasprzykowski. Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank. 2015. 60 f. Dissertação (Mestrado em Computação Aplicada) - Universidade Estadual de Feira de Santana, Feira de Santana, BA. |
url |
http://localhost:8080/tede/handle/tede/327 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.program.fl_str_mv |
303317282311144204 |
dc.relation.confidence.fl_str_mv |
600 600 600 600 |
dc.relation.department.fl_str_mv |
-5486832816611506211 |
dc.relation.cnpq.fl_str_mv |
3671711205811204509 8930092515683771531 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Estadual de Feira de Santana |
dc.publisher.program.fl_str_mv |
Mestrado em Computação Aplicada |
dc.publisher.initials.fl_str_mv |
UEFS |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
DEPARTAMENTO DE CIÊNCIAS EXATAS |
publisher.none.fl_str_mv |
Universidade Estadual de Feira de Santana |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da UEFS instname:Universidade Estadual de Feira de Santana (UEFS) instacron:UEFS |
instname_str |
Universidade Estadual de Feira de Santana (UEFS) |
instacron_str |
UEFS |
institution |
UEFS |
reponame_str |
Biblioteca Digital de Teses e Dissertações da UEFS |
collection |
Biblioteca Digital de Teses e Dissertações da UEFS |
bitstream.url.fl_str_mv |
http://tede2.uefs.br:8080/bitstream/tede/327/2/Disserta%C3%A7%C3%A3o+Final.pdf http://tede2.uefs.br:8080/bitstream/tede/327/1/license.txt |
bitstream.checksum.fl_str_mv |
74b79aac96fa73b31d6e0dbb4272efe3 7b5ba3d2445355f386edab96125d42b7 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da UEFS - Universidade Estadual de Feira de Santana (UEFS) |
repository.mail.fl_str_mv |
bcuefs@uefs.br|| bcref@uefs.br||bcuefs@uefs.br |
_version_ |
1796793632206880768 |