Analysis and classification of human microbiomes: detection of bioindicators and optimization through machine learning

Bibliographic details
Year of defense: 2025
Main author: Kasmanas, Jonas Coelho
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/
Abstract: Our understanding of human-associated microorganisms has evolved substantially since Theodor Escherich first documented Escherichia coli in the gut flora of healthy children. Today, we recognise these microorganisms as crucial contributors to human homeostasis. Among the techniques that emerged to study them, metagenomics enabled the sequencing of all DNA within environmental samples without the need for culturing. In particular, the recovery of metagenome-assembled genomes (MAGs) allows direct genomic investigation of previously uncultured organisms in their environmental context. Genome-resolved metagenomics connects functional potential to specific microorganisms, capturing subtle genomic adaptations to specific ecological contexts and providing a powerful approach for identifying precise bioindicators, with implications for fields such as epidemiology, drug discovery, and personalised medicine. However, despite years of progress, transforming metagenomic data into biological insights remains complex, leaving samples underused and relevant patterns overlooked. Metadata inconsistencies hinder systematic sample selection, genome recovery requires complex computational pipelines, and the high-dimensional nature of metagenomic data constrains comparative analysis. This dissertation addresses these challenges through integrated computational frameworks that bridge the gap between exponential data generation and actionable scientific insights while advancing our understanding of the composition and function of the human microbiome. First, we developed HumanMetagenomeDB, a curated database of standardised metadata from 69,822 human metagenome samples collected from public repositories. This resource enabled systematic sample selection based on host characteristics, medical conditions, and technical parameters, with the aim of facilitating large-scale meta-studies.
The database also revealed significant geographical biases in microbiome sampling and gaps in disease coverage, highlighting areas that require additional research effort. Building on this foundation, we created MuDoGeR (Multi-Domain Genome Recovery), a streamlined, reproducible pipeline for genome recovery at scale that lowers the technical barriers to recovering genomes from metagenomes for non-bioinformaticians. Next, to ensure reliable results for genome-centric studies, we conducted a systematic evaluation using simulated communities, which revealed factors influencing genome recovery success. The evaluation established an optimal sequencing depth of 60 million reads and demonstrated that current methods can achieve up to 92% recovery precision, while highlighting specific limitations in recovering closely related species. With the genomes recovered, we then focused on streamlining comparative genomic analysis by developing gSpreadComp, which focuses on analysing antimicrobial resistance genes (ARGs) and virulence factors (VFs). As a use case, we analysed human gut microbiome samples from subjects with different diets. gSpreadComp revealed that, while overall resistance patterns were similar across diets, specific differences emerged, such as increased tetracycline resistance in omnivores and elevated bacitracin resistance in vegans. Notably, vegan and vegetarian diets showed significantly higher potential for plasmid-mediated horizontal gene transfer. gSpreadComp is intended to guide researchers in generating testable hypotheses. The culmination of this work is an approach for biologically interpretable feature selection that combines autoencoder-based dimensionality reduction with density-based clustering to encode the functional potential of MAGs.
In the process, we also assembled a comprehensive collection of 514,932 prokaryotic MAGs of the human microbiome, representing 6,763 bacterial and 31 archaeal species. The resulting interactive AutoML-driven platform allows researchers to explore microbiome patterns, generate hypotheses, and select relevant MAGs for further investigation. Together, the frameworks and databases developed here contribute to transforming metagenomic data into biological insights. However, the evolution toward more integrative and translational applications requires continued interdisciplinary collaboration between microbiologists, computer scientists, and clinical researchers.
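As a rough illustration of the density-based clustering step named in the abstract, the sketch below runs a minimal DBSCAN over made-up 2-D latent vectors standing in for autoencoder codes of MAG functional profiles. The abstract does not name the clustering algorithm or its parameters, so DBSCAN, `eps`, `min_pts`, and the sample points are all assumptions chosen for illustration, not the thesis's method.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Hypothetical latent codes: two dense groups and one isolated point.
latent = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
labels = dbscan(latent, eps=0.3, min_pts=3)
print(labels)
```

In a setting like the one the abstract describes, each resulting cluster would group MAGs with similar encoded functional potential, and points labelled as noise would be candidates for separate inspection.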
id USP_12544ca96dc6561feb0762065d01e3e5
oai_identifier_str oai:teses.usp.br:tde-13082025-104726
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv Analysis and classification of human microbiomes: detection of bioindicators and optimization through machine learning
Análise e classificação de microbiomas humanos: detecção de bioindicadores e otimização por meio de aprendizado de máquina
author Kasmanas, Jonas Coelho
author_facet Kasmanas, Jonas Coelho
author_role author
dc.contributor.none.fl_str_mv Carvalho, André Carlos Ponce de Leon Ferreira de
Rocha, Ulisses Nunes da
dc.contributor.author.fl_str_mv Kasmanas, Jonas Coelho
dc.subject.por.fl_str_mv Banco de dados metagenômico
Diet
Dieta
Genome recovery
Human microbiome
MAGs
MAGs
Metagenomic database
Metagenômica
Metagenomics
Microbioma humano
Recuperação do genoma
publishDate 2025
dc.date.none.fl_str_mv 2025-07-08
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082025-104726/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1848370484069007360