Matérn Kernel for incomplete data

Silva, Danilo Avilar

Bibliographic details
Year of defense:	2023
Main author:	Silva, Danilo Avilar
Advisor:	Gomes, João Paulo Pordeus
Defense committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	eng
Defense institution:	Not provided by the institution
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
CNPq knowledge area:	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	http://repositorio.ufc.br/handle/riufc/74505
Abstract:	Machine learning problems with incomplete data arise constantly in real-world domains. Statistical methods that deal with missing attributes are characterized by assumptions about the data distribution, expressed through a density function. In this context, adapting similarity-based methods to incomplete data is a promising research direction, since these methods generally assume that the data are fully observed and are not naturally equipped to handle incomplete data. This work proposes two methods for estimating the expected value of the Matérn kernel in the presence of incomplete data vectors without any preprocessing step: Expected Matérn Kernel via Monte Carlo Method (EMK-MC) and Expected Matérn Kernel via Unscented Transform (EMK-UT). Both address the kernel estimation problem directly, meaning they estimate the transformation of interest instead of embedding it within a preprocessing framework. To obtain these estimates, incomplete vectors are treated as continuous random variables and, under the assumption that the Euclidean distance between the points of interest follows a Nakagami distribution, sampling methods are used to generate points that depend only on the distribution of interest. The data distribution is approximated by a Gaussian mixture model fitted by maximum likelihood via the Expectation-Maximization algorithm, which iteratively estimates the missing values at the same time. This allows the model to be fitted to the observed data while accounting for the uncertainty of the missing values and the relationships between variables. The proposed methods are compared with three other methods on real and synthetic datasets. In terms of the root mean square error between the estimated and true kernel values, the proposed methods perform consistently in the majority of the scenarios evaluated on real-world datasets, where EMK-MC and EMK-UT are the best methods in approximately 43% and 38% of the scenarios, respectively. On synthetic datasets, the proposed approaches are the best in all evaluated scenarios.
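
Illustrative sketch (not taken from the thesis): the idea of estimating an expected kernel value over the missing entries can be illustrated with plain Monte Carlo. The Python code below assumes a single Gaussian with known mean and covariance in place of the fitted Gaussian mixture, draws the missing entries of x from their conditional distribution given the observed ones, and averages a Matérn kernel with smoothness 3/2. It does not reproduce EMK-MC or EMK-UT, which according to the abstract rely on a Nakagami assumption for the distance; the function names, the smoothness value and the length scale are assumptions made for this example.

```python
import numpy as np


def matern32(r, length_scale=1.0):
    """Matérn kernel with smoothness nu = 3/2, as a function of the distance r."""
    s = np.sqrt(3.0) * np.asarray(r) / length_scale
    return (1.0 + s) * np.exp(-s)


def conditional_gaussian(x, mean, cov):
    """Conditional mean and covariance of the missing (NaN) entries of x.

    Assumes x has at least one observed and at least one missing entry.
    """
    miss = np.isnan(x)
    obs = ~miss
    cov_oo = cov[np.ix_(obs, obs)]
    cov_mo = cov[np.ix_(miss, obs)]
    cov_mm = cov[np.ix_(miss, miss)]
    gain = cov_mo @ np.linalg.inv(cov_oo)
    cond_mean = mean[miss] + gain @ (x[obs] - mean[obs])
    cond_cov = cov_mm - gain @ cov_mo.T
    return cond_mean, cond_cov


def expected_matern_mc(x, y, mean, cov, n_samples=5000, seed=0):
    """Monte Carlo estimate of E[k(x, y)] when x has NaN entries and y is complete."""
    miss = np.isnan(x)
    if not miss.any():
        # Nothing to integrate over: return the ordinary kernel value.
        return float(matern32(np.linalg.norm(x - y)))
    rng = np.random.default_rng(seed)
    cond_mean, cond_cov = conditional_gaussian(x, mean, cov)
    draws = rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)
    completed = np.tile(x, (n_samples, 1))  # one copy of x per sample
    completed[:, miss] = draws              # fill in the sampled missing values
    dists = np.linalg.norm(completed - y, axis=1)
    return float(matern32(dists).mean())


if __name__ == "__main__":
    mean = np.zeros(2)
    cov = np.array([[1.0, 0.8], [0.8, 1.0]])
    x = np.array([0.5, np.nan])  # second attribute is missing
    y = np.array([0.0, 0.0])
    print(expected_matern_mc(x, y, mean, cov))
```

In a mixture setting, the same average would be taken per component and weighted by the component responsibilities; that extension is left out here to keep the sketch short.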

Item metadata

id	UFC-7_6d8e11d4068f99e39a557a3ab5a9048a
oai_identifier_str	oai:repositorio.ufc.br:riufc/74505
network_acronym_str	UFC-7
network_name_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
spelling	Silva, Danilo Avilar | Mattos, César Lincoln Cavalcante | Gomes, João Paulo Pordeus | http://lattes.cnpq.br/9055072082719616 | http://lattes.cnpq.br/9553770402705512 | http://lattes.cnpq.br/2445571161029337 | 2023-09-27 | ORIGINAL 2023_tese_dasilva.pdf application/pdf 1700268 | LICENSE license.txt text/plain; charset=utf-8 1748 | 2023-10-04 13:42:42.564 | http://www.repositorio.ufc.br/ri-oai/request | opendoar:2023-10-04T16:42:42
dc.title.pt_BR.fl_str_mv	Matérn Kernel for incomplete data
dc.title.en.pt_BR.fl_str_mv	Matérn Kernel for incomplete data
title	Matérn Kernel for incomplete data
spellingShingle	Matérn Kernel for incomplete data Silva, Danilo Avilar CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel
title_short	Matérn Kernel for incomplete data
title_full	Matérn Kernel for incomplete data
title_fullStr	Matérn Kernel for incomplete data
title_full_unstemmed	Matérn Kernel for incomplete data
title_sort	Matérn Kernel for incomplete data
author	Silva, Danilo Avilar
author_facet	Silva, Danilo Avilar
author_role	author
dc.contributor.co-advisor.none.fl_str_mv	Mattos, César Lincoln Cavalcante
dc.contributor.author.fl_str_mv	Silva, Danilo Avilar
dc.contributor.advisor1.fl_str_mv	Gomes, João Paulo Pordeus
contributor_str_mv	Gomes, João Paulo Pordeus
dc.subject.cnpq.fl_str_mv	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
topic	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel
dc.subject.ptbr.pt_BR.fl_str_mv	Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn
dc.subject.en.pt_BR.fl_str_mv	Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel
description	Machine learning problems with incomplete data arise constantly in real-world domains. Statistical methods that deal with missing attributes are characterized by assumptions about the data distribution, expressed through a density function. In this context, adapting similarity-based methods to incomplete data is a promising research direction, since these methods generally assume that the data are fully observed and are not naturally equipped to handle incomplete data. This work proposes two methods for estimating the expected value of the Matérn kernel in the presence of incomplete data vectors without any preprocessing step: Expected Matérn Kernel via Monte Carlo Method (EMK-MC) and Expected Matérn Kernel via Unscented Transform (EMK-UT). Both address the kernel estimation problem directly, meaning they estimate the transformation of interest instead of embedding it within a preprocessing framework. To obtain these estimates, incomplete vectors are treated as continuous random variables and, under the assumption that the Euclidean distance between the points of interest follows a Nakagami distribution, sampling methods are used to generate points that depend only on the distribution of interest. The data distribution is approximated by a Gaussian mixture model fitted by maximum likelihood via the Expectation-Maximization algorithm, which iteratively estimates the missing values at the same time. This allows the model to be fitted to the observed data while accounting for the uncertainty of the missing values and the relationships between variables. The proposed methods are compared with three other methods on real and synthetic datasets. In terms of the root mean square error between the estimated and true kernel values, the proposed methods perform consistently in the majority of the scenarios evaluated on real-world datasets, where EMK-MC and EMK-UT are the best methods in approximately 43% and 38% of the scenarios, respectively. On synthetic datasets, the proposed approaches are the best in all evaluated scenarios.
publishDate	2023
dc.date.accessioned.fl_str_mv	2023-09-27T19:33:35Z
dc.date.available.fl_str_mv	2023-09-27T19:33:35Z
dc.date.issued.fl_str_mv	2023
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023.
dc.identifier.uri.fl_str_mv	http://repositorio.ufc.br/handle/riufc/74505
identifier_str_mv	SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023.
url	http://repositorio.ufc.br/handle/riufc/74505
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC
instname_str	Universidade Federal do Ceará (UFC)
instacron_str	UFC
institution	UFC
reponame_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
collection	Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv	http://repositorio.ufc.br/bitstream/riufc/74505/3/2023_tese_dasilva.pdf http://repositorio.ufc.br/bitstream/riufc/74505/4/license.txt
bitstream.checksum.fl_str_mv	947af8db39a2ec8e7e0c50ea3e15dec1 8a4605be74aa9ea9d79846c1fba20a33
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv	bu@ufc.br || repositorio@ufc.br
_version_	1847793338459095040
