Data science for epidemiology: a case study of dengue in Brazil.

Bibliographic details
Year of defense: 2022
Main author: Kirstin Ingrid Oliveira Roster
Advisor: Francisco Aparecido Rodrigues
Defense committee: Colm Peter Connaughton, Jose Fernando Fontanari, Thomas Kauê Dal'Maso Peron
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Universidade de São Paulo
Graduate program: Ciências da Computação e Matemática Computacional
Department: Not provided by the institution
Country: BR
Access link: https://doi.org/10.11606/T.55.2022.tde-27022023-142607
Abstract: This thesis is a collection of studies on the application of data science to problems in dengue epidemiology. We leverage machine learning models together with methods from causal inference for two important public health objectives: (i) forecasting disease prevalence to anticipate outbreaks and allocate resources, and (ii) understanding disease drivers to develop effective interventions. Using diverse data on disease prevalence, climate, and human behavior, we demonstrate how machine learning can be applied in three different contexts: first, to develop accurate predictions of infections across Brazilian cities; second, to generalize predictions to new diseases; and finally, as an intermediate step for causal inference. In Chapter 2, we compare machine learning algorithms for dengue prediction and assess the value of causal feature selection. We find that the optimal predictors vary between national (domain-invariant) and single-city (domain-specific) settings. Decision tree ensemble models perform best at the national scale. Causal feature selection performs best according to one of four error metrics, though it is not the optimal method across all cities in single-city forecasts. This result helps us better understand the potential within-domain cost in predictive performance of causally informed models. In Chapter 3, we assess the generalizability of the dengue models developed in the prior chapter. Based on the hypothesis that diseases may share common time series characteristics, we test the effectiveness of knowledge transfer from endemic to novel diseases to improve predictions in low-data settings. We compare instance- and parameter-based transfer learning algorithms and evaluate performance on both synthetic and empirical data. Results suggest that transfer learning offers potential for early pandemic response and that the most predictive algorithm and transfer method depend on the similarity of the disease pairs.
In Chapter 4, we consider the contribution of machine learning to causal inference by examining the impact of the COVID-19 pandemic on dengue in Brazil. We estimate the gap between expected and observed dengue cases using an interrupted time series design. We also decompose the gap into the impacts of climate conditions, pandemic-induced changes in reporting, human susceptibility, and human mobility. We find that there is considerable variation across the country in both the overall pandemic impact on dengue and the relative importance of individual drivers. This analysis helps shed light on the data gaps caused by the COVID-19 pandemic and, more generally, on possible intervention targets to help control dengue in the future.
id USP_cde7091540997a1ae302d918464dfcc3
oai_identifier_str oai:teses.usp.br:tde-27022023-142607
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
spelling info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/doctoralThesis; Keywords: Causal inference; Dengue; Disease forecasting; Machine learning; 2023-12-21T20:15:54Z; http://www.teses.usp.br/; PUB; http://www.teses.usp.br/cgi-bin/mtd2br.pl; opendoar:2721; 2023-02-27T17:37:49; false
dc.title.en.fl_str_mv Data science for epidemiology: a case study of dengue in Brazil.
dc.title.alternative.pt.fl_str_mv Ciência de dados para epidemiologia: um estudo de caso da dengue no Brasil
title Data science for epidemiology: a case study of dengue in Brazil.
author Kirstin Ingrid Oliveira Roster
author_facet Kirstin Ingrid Oliveira Roster
author_role author
dc.contributor.advisor1.fl_str_mv Francisco Aparecido Rodrigues
dc.contributor.referee1.fl_str_mv Colm Peter Connaughton
dc.contributor.referee2.fl_str_mv Jose Fernando Fontanari
dc.contributor.referee3.fl_str_mv Thomas Kauê Dal'Maso Peron
dc.contributor.author.fl_str_mv Kirstin Ingrid Oliveira Roster
contributor_str_mv Francisco Aparecido Rodrigues
Colm Peter Connaughton
Jose Fernando Fontanari
Thomas Kauê Dal'Maso Peron
publishDate 2022
dc.date.issued.fl_str_mv 2022-12-19
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.11606/T.55.2022.tde-27022023-142607
url https://doi.org/10.11606/T.55.2022.tde-27022023-142607
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade de São Paulo
dc.publisher.program.fl_str_mv Ciências da Computação e Matemática Computacional
dc.publisher.initials.fl_str_mv USP
dc.publisher.country.fl_str_mv BR
publisher.none.fl_str_mv Universidade de São Paulo
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1786377176652709888