Matérn Kernel for incomplete data

Bibliographic details
Year of defense: 2023
Main author: Silva, Danilo Avilar
Advisor: Gomes, João Paulo Pordeus
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
CNPq knowledge area:
Access link: http://repositorio.ufc.br/handle/riufc/74505
Abstract: Machine learning problems with incomplete data arise in many real-world domains. Statistical methods for handling missing attributes typically rely on assumptions about the data distribution, expressed through a density function. In this context, approaches that enable the use of similarity-based methods become promising research topics, since these methods generally assume that data are fully observed and are not naturally equipped to handle incomplete data. In this work, methods are proposed to estimate the expected value of the Matérn Kernel in the presence of incomplete data vectors without any preprocessing step. The two methods, Expected Matérn Kernel via Monte Carlo Method (EMK-MC) and Expected Matérn Kernel via Unscented Transform (EMK-UT), address the kernel estimation problem directly: they estimate the transformation of interest instead of embedding it within a preprocessing framework. To obtain these estimates, incomplete vectors are treated as continuous random variables and, under the assumption that the Euclidean distance between points of interest follows a Nakagami distribution, sampling methods are used to generate points that depend only on the distribution of interest. The data distribution is approximated by a Gaussian mixture model fitted by maximum likelihood via the Expectation-Maximization algorithm, which iteratively estimates the missing values at the same time. This allows the model to be fitted to the observed data while accounting for the uncertainty of the missing values and the relationships between variables. The proposed methods are compared with three other methods on real and synthetic datasets. In terms of the root mean square error between the estimated kernel value and the true value, the proposed methods perform consistently in the majority of the scenarios evaluated on real-world datasets: EMK-MC and EMK-UT are the best methods in approximately 43% and 38% of the evaluated scenarios, respectively. On synthetic datasets, the proposed approaches are the best in all evaluated scenarios.
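The idea of estimating an expected kernel value directly, rather than imputing first, can be illustrated with a minimal sketch. This is not the thesis's EMK-MC method: the abstract describes sampling under a Nakagami assumption on the distance and a Gaussian mixture fitted by EM, whereas the sketch below makes simpler assumptions. It fixes the Matérn smoothness at 3/2 (which has a closed form), treats each missing entry as an independent Gaussian with a known mean and standard deviation, and averages the kernel over plain Monte Carlo draws of the missing entries. The names `matern32`, `expected_matern_mc`, `missing_mean` and `missing_std` are hypothetical.

```python
import math
import random


def matern32(r, length_scale=1.0):
    """Matérn kernel with smoothness 3/2, as a function of the distance r."""
    a = math.sqrt(3.0) * r / length_scale
    return (1.0 + a) * math.exp(-a)


def expected_matern_mc(x, y, missing_mean, missing_std,
                       n_samples=5000, length_scale=1.0, seed=0):
    """Monte Carlo estimate of E[k(x, y)] when some entries of x are missing.

    x            -- list of floats, with None at the missing positions
    y            -- fully observed list of floats, same length as x
    missing_mean -- dict: index of a missing entry -> assumed Gaussian mean
    missing_std  -- dict: index of a missing entry -> assumed Gaussian std
    Assumption: missing entries are independent Gaussians (illustrative only).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        squared_distance = 0.0
        for i, (xi, yi) in enumerate(zip(x, y)):
            if xi is None:
                # Draw a plausible value for the missing coordinate.
                xi = rng.gauss(missing_mean[i], missing_std[i])
            squared_distance += (xi - yi) ** 2
        total += matern32(math.sqrt(squared_distance), length_scale)
    return total / n_samples


if __name__ == "__main__":
    x = [None, 2.0]   # first attribute is missing
    y = [1.0, 4.0]
    estimate = expected_matern_mc(x, y, {0: 1.0}, {0: 1.0})
    print("expected kernel estimate:", estimate)
    print("kernel at the mean-imputed point:", matern32(2.0))
```

In this example the mean of the missing coordinate coincides with the corresponding entry of `y`, so every sampled distance is at least as large as the mean-imputed distance and the expected kernel value cannot exceed the mean-imputed one. This shows how the expected kernel accounts for the uncertainty of the missing value, which plugging in a single imputed value ignores.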
id UFC-7_6d8e11d4068f99e39a557a3ab5a9048a
oai_identifier_str oai:repositorio.ufc.br:riufc/74505
network_acronym_str UFC-7
network_name_str Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
spelling Silva, Danilo Avilar; Mattos, César Lincoln Cavalcante; Gomes, João Paulo Pordeus
http://lattes.cnpq.br/9055072082719616
http://lattes.cnpq.br/9553770402705512
http://lattes.cnpq.br/2445571161029337
2023-09-27
ORIGINAL 2023_tese_dasilva.pdf application/pdf 1700268
LICENSE license.txt text/plain; charset=utf-8 1748
riufc/74505 2023-10-04 13:42:42.564
Repositório Institucional PUB http://www.repositorio.ufc.br/ri-oai/request
opendoar:2023-10-04T16:42:42
dc.title.pt_BR.fl_str_mv Matérn Kernel for incomplete data
dc.title.en.pt_BR.fl_str_mv Matérn Kernel for incomplete data
title Matérn Kernel for incomplete data
author Silva, Danilo Avilar
author_facet Silva, Danilo Avilar
author_role author
dc.contributor.co-advisor.none.fl_str_mv Mattos, César Lincoln Cavalcante
dc.contributor.author.fl_str_mv Silva, Danilo Avilar
dc.contributor.advisor1.fl_str_mv Gomes, João Paulo Pordeus
contributor_str_mv Gomes, João Paulo Pordeus
dc.subject.cnpq.fl_str_mv CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.ptbr.pt_BR.fl_str_mv Dados ausentes
Modelo de mistura de gaussianas
Métodos de aproximação de funções
Kernel Matérn
dc.subject.en.pt_BR.fl_str_mv Missing data
Gaussian mixture model
Approximation methods for functions
Matérn Kernel
publishDate 2023
dc.date.accessioned.fl_str_mv 2023-09-27T19:33:35Z
dc.date.available.fl_str_mv 2023-09-27T19:33:35Z
dc.date.issued.fl_str_mv 2023
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023.
dc.identifier.uri.fl_str_mv http://repositorio.ufc.br/handle/riufc/74505
identifier_str_mv SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023.
url http://repositorio.ufc.br/handle/riufc/74505
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Institucional da Universidade Federal do Ceará (UFC)
instname:Universidade Federal do Ceará (UFC)
instacron:UFC
instname_str Universidade Federal do Ceará (UFC)
instacron_str UFC
institution UFC
reponame_str Repositório Institucional da Universidade Federal do Ceará (UFC)
collection Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv http://repositorio.ufc.br/bitstream/riufc/74505/3/2023_tese_dasilva.pdf
http://repositorio.ufc.br/bitstream/riufc/74505/4/license.txt
bitstream.checksum.fl_str_mv 947af8db39a2ec8e7e0c50ea3e15dec1
8a4605be74aa9ea9d79846c1fba20a33
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv bu@ufc.br || repositorio@ufc.br
_version_ 1847793338459095040