Analysis and classification of human microbiomes: detection of bioindicators and optimization through machine learning
| Year of defense: | 2025 |
|---|---|
| Main author: | |
| Advisor: | |
| Defense committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Biblioteca Digital de Teses e Dissertações da USP |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/ |
| Abstract: | Our understanding of human-associated microorganisms has evolved substantially since Theodor Escherich first documented Escherichia coli in the gut flora of healthy children. Today, we recognise these microorganisms as crucial contributors to human homeostasis. Among the techniques that emerged to study them, metagenomics enabled the sequencing of all DNA within environmental samples without the need for culturing. In particular, the recovery of metagenome-assembled genomes (MAGs) allows direct genome investigation of previously uncultured organisms in their environmental context. Genome-resolved metagenomics connects functional potential to specific microorganisms, capturing subtle genomic adaptations to specific ecological contexts and providing a powerful approach to identifying precise bioindicators, with implications for fields such as epidemiology, drug discovery, and personalised medicine. However, despite years of progress, transforming metagenomic data into biological insights remains complex, leaving samples underused and relevant patterns overlooked. Metadata inconsistencies hinder systematic sample selection, genome recovery requires complex computational pipelines, and the high-dimensional nature of metagenomic data constrains comparative analysis. This dissertation addresses these challenges through integrated computational frameworks that bridge the critical gap between exponential data generation and actionable scientific insights while advancing our understanding of the composition and function of the human microbiome. First, we developed the HumanMetagenomeDB, a curated database of standardised metadata from 69,822 human metagenome samples collected from public repositories. This resource enabled systematic sample selection based on host characteristics, medical conditions, and technical parameters, with a focus on facilitating large-scale meta-studies. The database also revealed significant geographical biases in microbiome sampling and gaps in disease coverage, highlighting areas that require additional research effort. Building on this foundation, we created MuDoGeR (Multi-Domain Genome Recovery), a streamlined, reproducible pipeline for genome recovery at scale that lowers the technical barriers to recovering genomes from metagenomes for non-bioinformaticians. Afterwards, to ensure reliable results for genome-centric studies, we conducted a systematic evaluation using simulated communities that revealed factors influencing genome recovery success. The evaluation established an optimal sequencing depth of 60 million reads and demonstrated that current methods can achieve up to 92% recovery precision, while highlighting specific limitations in recovering closely related species. With the genomes recovered, we next focused on streamlining comparative genomic analysis by developing gSpreadComp, which concentrates on analysing antimicrobial resistance genes (ARGs) and virulence factors (VFs). We used human gut microbiome samples from subjects with different diets as a use case. gSpreadComp revealed that while overall resistance patterns were similar across diets, specific differences emerged, such as increased tetracycline resistance in omnivores and elevated bacitracin resistance in vegans. Notably, vegan and vegetarian diets showed significantly higher potential for plasmid-mediated horizontal gene transfer. gSpreadComp is intended to guide researchers in generating testable hypotheses. The culmination of this work is an approach for biologically interpretable feature selection that combines autoencoder-based dimensionality reduction with density-based clustering to encode the functional potential of MAGs. As a by-product of this process, we also assembled a comprehensive collection of 514,932 prokaryotic MAGs of the human microbiome, representing 6,763 bacterial and 31 archaeal species. The resulting interactive AutoML-driven platform allows researchers to explore microbiome patterns, generate hypotheses, and select relevant MAGs for further investigation. The frameworks and databases developed here contribute to transforming metagenomic data into biological insights. However, the evolution toward more integrative and translational applications requires continued interdisciplinary collaboration between microbiologists, computer scientists, and clinical researchers. |
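
The abstract describes grouping MAGs by density-based clustering of autoencoder-derived encodings, but this record does not say which clustering algorithm or parameters the thesis uses. As a minimal sketch only, the block below applies a plain DBSCAN-style procedure (an assumption, chosen as a common density-based method) to a handful of made-up 2-D vectors that stand in for autoencoder latent codes. The names `dbscan`, `latent`, `eps`, and `min_pts` are illustrative and do not come from the thesis.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical latent codes: two dense groups and one isolated point.
latent = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
labels = dbscan(latent, eps=0.3, min_pts=3)
print(labels)
```

In this toy setting the two dense groups receive separate cluster ids and the isolated vector is marked as noise. In a real analysis the inputs would be the learned encodings of each MAG's functional profile, and `eps` and `min_pts` would need tuning to the data.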
| id |
USP_12544ca96dc6561feb0762065d01e3e5 |
|---|---|
| oai_identifier_str |
oai:teses.usp.br:tde-13082025-104726 |
| network_acronym_str |
USP |
| network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Analysis and classification of human microbiomes: detection of bioindicators and optimization through machine learning Análise e classificação de microbiomas humanos: detecção de bioindicadores e otimização por meio de aprendizado de máquina |
| title |
Analysis and classification of human microbiomes: detection of bioindicators and optimization through machine learning |
| author |
Kasmanas, Jonas Coelho |
| author_facet |
Kasmanas, Jonas Coelho |
| author_role |
author |
| dc.contributor.none.fl_str_mv |
Carvalho, André Carlos Ponce de Leon Ferreira de; Rocha, Ulisses Nunes da |
| dc.contributor.author.fl_str_mv |
Kasmanas, Jonas Coelho |
| dc.subject.por.fl_str_mv |
Diet; Genome recovery; Human microbiome; MAGs; Metagenomic database; Metagenomics |
| topic |
Diet; Genome recovery; Human microbiome; MAGs; Metagenomic database; Metagenomics |
| publishDate |
2025 |
| dc.date.none.fl_str_mv |
2025-07-08 |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
| format |
doctoralThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/ |
| url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/ |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
|
| dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
Release the content for public access. |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.coverage.none.fl_str_mv |
|
| dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
| instname_str |
Universidade de São Paulo (USP) |
| instacron_str |
USP |
| institution |
USP |
| reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
| collection |
Biblioteca Digital de Teses e Dissertações da USP |
| repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
| repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
| _version_ |
1848370484069007360 |