Data science for epidemiology: a case study of dengue in Brazil.
Year of defense: | 2022 |
---|---|
Lead author: | Kirstin Ingrid Oliveira Roster |
Advisor: | Francisco Aparecido Rodrigues |
Defense committee: | Colm Peter Connaughton, Jose Fernando Fontanari, Thomas Kauê Dal'Maso Peron |
Document type: | Thesis |
Access type: | Open access |
Language: | eng |
Defense institution: | Universidade de São Paulo |
Graduate program: | Ciências da Computação e Matemática Computacional |
Department: | Not provided by the institution |
Country: | BR |
Access link: | https://doi.org/10.11606/T.55.2022.tde-27022023-142607 |
Abstract: | This thesis is a collection of studies on the application of data science to problems in dengue epidemiology. We leverage machine learning models together with methods from causal inference for two important public health objectives: (i) forecasting disease prevalence to anticipate outbreaks and allocate resources, and (ii) understanding disease drivers to develop effective interventions. Using diverse data on disease prevalence, climate, and human behavior, we demonstrate how machine learning can be applied in three different contexts: first, to develop accurate predictions of infections across Brazilian cities; second, to generalize predictions to new diseases; and finally, as an intermediate step for causal inference. In Chapter 2, we compare machine learning algorithms for dengue prediction and assess the value of causal feature selection. We find variation in the optimal predictors in national (domain-invariant) and single-city (domain-specific) settings. Decision tree ensemble models perform best at national scale. Causal feature selection performs best according to one of four error metrics, though it is not the optimal method across all cities in single-city forecasts. This result helps us better understand the potential within-domain cost in predictive performance of causally informed models. In Chapter 3, we assess the generalizability of the dengue models developed in the prior chapter. Based on the hypothesis that diseases may share common time series characteristics, we test the effectiveness of knowledge transfer from endemic to novel diseases to improve predictions in low-data settings. We compare instance- and parameter-based transfer learning algorithms and evaluate performance on both synthetic and empirical data. Results suggest that transfer learning offers potential for early pandemic response and that the most predictive algorithm and transfer method depend on the similarity of the disease pairs. In Chapter 4, we consider the contribution of machine learning to causal inference by examining the impact of the COVID-19 pandemic on dengue in Brazil. We estimate the gap between expected and observed dengue cases using an interrupted time series design. We also decompose the gap into the impacts of climate conditions, pandemic-induced changes in reporting, human susceptibility, and human mobility. We find that there is considerable variation across the country in both overall pandemic impact on dengue and the relative importance of individual drivers. This analysis helps shed light on the data gaps caused by the COVID-19 pandemic and, more generally, on possible intervention targets to help control dengue in the future. |
id |
USP_cde7091540997a1ae302d918464dfcc3 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-27022023-142607 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
|
spelling |
info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/doctoralThesis; Data science for epidemiology: a case study of dengue in Brazil.; Ciência de dados para epidemiologia: um estudo de caso da dengue no Brasil; 2022-12-19; Francisco Aparecido Rodrigues; Colm Peter Connaughton; Jose Fernando Fontanari; Thomas Kauê Dal'Maso Peron; Kirstin Ingrid Oliveira Roster; Universidade de São Paulo; Ciências da Computação e Matemática Computacional; USP; BR; Machine learning (Aprendizado de máquina); Causal inference (Inferência causal); Dengue; Disease forecasting (Previsão de doenças); https://doi.org/10.11606/T.55.2022.tde-27022023-142607; info:eu-repo/semantics/openAccess; eng; reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP; 2023-12-21T20:15:54Z; oai:teses.usp.br:tde-27022023-142607; Biblioteca Digital de Teses e Dissertações; http://www.teses.usp.br/; PUB; http://www.teses.usp.br/cgi-bin/mtd2br.pl; virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br; opendoar:2721; 2023-02-27T17:37:49; Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP); false |
dc.title.en.fl_str_mv |
Data science for epidemiology: a case study of dengue in Brazil. |
dc.title.alternative.pt.fl_str_mv |
Ciência de dados para epidemiologia: um estudo de caso da dengue no Brasil |
title |
Data science for epidemiology: a case study of dengue in Brazil. |
spellingShingle |
Data science for epidemiology: a case study of dengue in Brazil. Kirstin Ingrid Oliveira Roster |
author |
Kirstin Ingrid Oliveira Roster |
author_facet |
Kirstin Ingrid Oliveira Roster |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Francisco Aparecido Rodrigues |
dc.contributor.referee1.fl_str_mv |
Colm Peter Connaughton |
dc.contributor.referee2.fl_str_mv |
Jose Fernando Fontanari |
dc.contributor.referee3.fl_str_mv |
Thomas Kauê Dal'Maso Peron |
dc.contributor.author.fl_str_mv |
Kirstin Ingrid Oliveira Roster |
contributor_str_mv |
Francisco Aparecido Rodrigues Colm Peter Connaughton Jose Fernando Fontanari Thomas Kauê Dal'Maso Peron |
publishDate |
2022 |
dc.date.issued.fl_str_mv |
2022-12-19 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://doi.org/10.11606/T.55.2022.tde-27022023-142607 |
url |
https://doi.org/10.11606/T.55.2022.tde-27022023-142607 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade de São Paulo |
dc.publisher.program.fl_str_mv |
Ciências da Computação e Matemática Computacional |
dc.publisher.initials.fl_str_mv |
USP |
dc.publisher.country.fl_str_mv |
BR |
publisher.none.fl_str_mv |
Universidade de São Paulo |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
_version_ |
1786377176652709888 |