Hierarchical classification on batch and streaming data with applications to entomology

Bibliographic details
Year of defense: 2022
Main author: Parmezan, Antonio Rafael Sabino
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03102022-171351/
Abstract: Traditional supervised machine learning algorithms classify data in a flat way, i.e., they seek to associate each example with a class drawn from a finite, usually small set of classes that has no structural dependencies. However, there are more challenging problems in which classes can be divided into subclasses or grouped into superclasses. This structural dependency between classes demands methods designed for hierarchical classification. An algorithm for hierarchical classification considers the structural information embedded in the class hierarchy and uses it to decompose the original problem's feature space into subproblems with fewer classes. Such decomposition reduces the complexity of the classification function as well as the prediction error. This thesis advances the state of the art by proposing novel algorithms for hierarchical classification under two learning paradigms: (i) batch, where learning takes place offline from a fixed-size sample of examples (ideally) drawn from a stationary probability distribution, with each observation in the sample independent and identically distributed; and (ii) streaming, in which learning is performed online from a usually uninterrupted, ordered sequence of examples made available by systems or devices at various update rates and without human intervention. The features that describe the streaming examples may drift over time because the environment in which they are embedded is non-stationary. In this context, the main contributions of this thesis include: (i) the most extensive and comprehensive study to date of the impact of climatic-environmental conditions on bee and wasp wing-beat frequencies. From a practical standpoint, the work builds base components for (online) (hierarchical) classification of flying insects; (ii) a method that combines local approaches to quickly and efficiently obtain a hierarchical decision model that faithfully represents the music genre identification scenario. The approach was also validated on hymenopteran data; (iii) a reference process that uses optical sensors and hierarchical classifiers to identify pollinating flying insects under natural field conditions. The results obtained provided answers to ten research questions; (iv) the first algorithm for hierarchical classification of data streams, which is based on nearest neighbors and works incrementally; (v) a framework and (vi) a collection of methods for hierarchical labeling of streaming data.
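The decomposition idea in the abstract can be illustrated with a small sketch. This is not the algorithm proposed in the thesis, whose details are not given in this record; it is a generic top-down scheme with one local nearest-neighbor classifier per parent class, updated one example at a time. The class names `LocalKNNNode` and `TopDownHierarchicalKNN`, the sliding-window size, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import deque


class LocalKNNNode:
    """k-NN over a sliding window; chooses among the children of one parent class."""

    def __init__(self, k=3, window=100):
        self.k = k
        # Bounded window (an assumption): the oldest examples are forgotten,
        # which is one simple way to cope with features that drift over time.
        self.window = deque(maxlen=window)

    def learn_one(self, x, child_label):
        self.window.append((tuple(x), child_label))

    def predict_one(self, x):
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: math.dist(x, item[0]))[: self.k]
        votes = {}
        for _, label in nearest:
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)


class TopDownHierarchicalKNN:
    """One local classifier per parent node; each solves a subproblem with fewer classes."""

    def __init__(self, k=3, window=100):
        self.k = k
        self.window = window
        self.nodes = {}  # parent path (tuple of labels) -> LocalKNNNode

    def learn_one(self, x, path):
        # `path` lists the labels from the top level down to the leaf class.
        for depth, child in enumerate(path):
            parent = tuple(path[:depth])
            if parent not in self.nodes:
                self.nodes[parent] = LocalKNNNode(self.k, self.window)
            self.nodes[parent].learn_one(x, child)

    def predict_one(self, x):
        # Descend from the root, letting each local classifier pick one child.
        path = ()
        while path in self.nodes:
            child = self.nodes[path].predict_one(x)
            if child is None:
                break
            path += (child,)
        return path


# Toy two-level hierarchy with made-up 2-D features (hypothetical data).
model = TopDownHierarchicalKNN(k=1)
toy_stream = [
    ((0.0, 0.0), ("A", "a1")),
    ((0.0, 1.0), ("A", "a2")),
    ((5.0, 5.0), ("B", "b1")),
    ((5.0, 6.0), ("B", "b2")),
]
for features, label_path in toy_stream:
    model.learn_one(features, label_path)

print(model.predict_one((0.1, 0.9)))
```

Each local classifier only ever sees the labels of its own children, so the root decides between the two top-level classes and a second classifier decides among the subclasses of the chosen one. A flat classifier would instead have to separate all four leaf classes at once.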
id USP_c925a43f14b1f29af475ad757f53fe66
oai_identifier_str oai:teses.usp.br:tde-03102022-171351
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv Hierarchical classification on batch and streaming data with applications to entomology
Classificação hierárquica de dados em lote e em fluxo contínuo com aplicações para entomologia
dc.contributor.none.fl_str_mv Batista, Gustavo Enrique de Almeida Prado Alves
dc.contributor.author.fl_str_mv Parmezan, Antonio Rafael Sabino
dc.subject.por.fl_str_mv Aprendizado de máquina
Aprendizado em lote
Batch learning
Classificação hierárquica
Concept drift
Data stream
Fluxo de dados
Hierarchical classification
Machine learning
Mudança de conceito
dc.date.none.fl_str_mv 2022-02-25
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03102022-171351/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03102022-171351/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815257888904445952