Matérn Kernel for incomplete data
| Year of defense: | 2023 |
|---|---|
| Main author: | |
| Advisor: | |
| Examination committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Not provided by the institution |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| CNPq knowledge area: | |
| Access link: | http://repositorio.ufc.br/handle/riufc/74505 |
| Abstract: | Machine learning problems with incomplete data arise in many real-world domains. Statistical methods for missing attributes typically rely on assumptions about the data distribution, expressed through a density function. Similarity-based methods are a promising research target in this context, because they generally assume fully observed data and are not naturally equipped to handle incomplete data. This work proposes methods to estimate the expected value of the Matérn kernel in the presence of incomplete data vectors, without any preprocessing step. The two methods, Expected Matérn Kernel via Monte Carlo Method (EMK-MC) and Expected Matérn Kernel via Unscented Transform (EMK-UT), address the kernel estimation problem directly: they estimate the transformation of interest instead of embedding it within a preprocessing framework. To obtain these estimates, incomplete vectors are treated as continuous random variables and, under the assumption that the Euclidean distance between points of interest follows a Nakagami distribution, sampling methods generate points that depend only on the distribution of interest. The data distribution is approximated by a Gaussian mixture model fitted by maximum likelihood via the Expectation-Maximization algorithm, which iteratively estimates the missing values at the same time. This allows the model to be fitted to the observed data while accounting for the uncertainty of the missing values and the relationships between variables. The proposed methods are compared with three other methods on real and synthetic datasets. In terms of the root mean square error between the estimated and the true kernel values, performance remains consistent in the majority of the scenarios evaluated on real-world datasets, where EMK-MC and EMK-UT are the best methods in approximately 43% and 38% of the scenarios, respectively. On synthetic datasets, the proposed approaches are the best in all evaluated scenarios. |
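
The idea of estimating an expected Matérn kernel by sampling can be illustrated with a minimal sketch. This is not the thesis's EMK-MC implementation: it assumes a single Gaussian data model (instead of a Gaussian mixture fitted by EM), samples the missing coordinates directly (instead of using the Nakagami assumption on the distance), and fixes the smoothness at ν = 3/2. The names `matern32`, `conditional_gaussian`, and `expected_matern_mc` are hypothetical, and missing entries are assumed to be encoded as `NaN`.

```python
import numpy as np


def matern32(r, length_scale=1.0):
    """Matérn kernel with smoothness nu = 3/2, as a function of distance r."""
    s = np.sqrt(3.0) * r / length_scale
    return (1.0 + s) * np.exp(-s)


def conditional_gaussian(x, mean, cov):
    """Distribution of the missing entries of x (NaN) given the observed ones.

    Assumes x ~ N(mean, cov); returns the missing-entry mask plus the
    conditional mean and covariance of the missing block.
    """
    m = np.isnan(x)
    o = ~m
    if not m.any():                      # nothing missing
        return m, np.empty(0), np.empty((0, 0))
    if not o.any():                      # everything missing: use the prior
        return m, mean[m], cov[np.ix_(m, m)]
    cov_oo = cov[np.ix_(o, o)]
    cov_om = cov[np.ix_(o, m)]
    gain = np.linalg.solve(cov_oo, cov_om).T      # cov_mo @ inv(cov_oo)
    cond_mean = mean[m] + gain @ (x[o] - mean[o])
    cond_cov = cov[np.ix_(m, m)] - gain @ cov_om
    cond_cov = 0.5 * (cond_cov + cond_cov.T)      # remove round-off asymmetry
    return m, cond_mean, cond_cov


def expected_matern_mc(x, y, mean, cov, n_samples=5000, length_scale=1.0, seed=0):
    """Monte Carlo estimate of E[k(x, y)] over the missing entries of x and y."""
    rng = np.random.default_rng(seed)

    def draw(v):
        # Repeat the vector and overwrite its missing entries with samples.
        mask, c_mean, c_cov = conditional_gaussian(v, mean, cov)
        out = np.tile(v, (n_samples, 1))
        if mask.any():
            out[:, mask] = rng.multivariate_normal(c_mean, c_cov, size=n_samples)
        return out

    dist = np.linalg.norm(draw(x) - draw(y), axis=1)
    return matern32(dist, length_scale).mean()


mean = np.zeros(2)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([0.0, np.nan])   # second attribute missing
y = np.array([1.0, 0.0])      # fully observed
estimate = expected_matern_mc(x, y, mean, cov)
```

When both vectors are fully observed the sketch reduces to the ordinary kernel value, which gives a simple sanity check. Averaging the kernel over samples, rather than imputing a single value and then evaluating the kernel, is what keeps the uncertainty of the missing entries in the estimate.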
| id |
UFC-7_6d8e11d4068f99e39a557a3ab5a9048a |
|---|---|
| oai_identifier_str |
oai:repositorio.ufc.br:riufc/74505 |
| network_acronym_str |
UFC-7 |
| network_name_str |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
| repository_id_str |
|
| dc.title.pt_BR.fl_str_mv |
Matérn Kernel for incomplete data |
| dc.title.en.pt_BR.fl_str_mv |
Matérn Kernel for incomplete data |
| title |
Matérn Kernel for incomplete data |
| spellingShingle |
Matérn Kernel for incomplete data Silva, Danilo Avilar CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel |
| title_short |
Matérn Kernel for incomplete data |
| title_full |
Matérn Kernel for incomplete data |
| title_fullStr |
Matérn Kernel for incomplete data |
| title_full_unstemmed |
Matérn Kernel for incomplete data |
| title_sort |
Matérn Kernel for incomplete data |
| author |
Silva, Danilo Avilar |
| author_facet |
Silva, Danilo Avilar |
| author_role |
author |
| dc.contributor.co-advisor.none.fl_str_mv |
Mattos, César Lincoln Cavalcante |
| dc.contributor.author.fl_str_mv |
Silva, Danilo Avilar |
| dc.contributor.advisor1.fl_str_mv |
Gomes, João Paulo Pordeus |
| contributor_str_mv |
Gomes, João Paulo Pordeus |
| dc.subject.cnpq.fl_str_mv |
CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
| topic |
CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel |
| dc.subject.ptbr.pt_BR.fl_str_mv |
Dados ausentes Modelo de mistura de gaussianas Métodos de aproximação de funções Kernel Matérn |
| dc.subject.en.pt_BR.fl_str_mv |
Missing data Gaussian mixture model Approximation methods for functions Matérn Kernel |
| publishDate |
2023 |
| dc.date.accessioned.fl_str_mv |
2023-09-27T19:33:35Z |
| dc.date.available.fl_str_mv |
2023-09-27T19:33:35Z |
| dc.date.issued.fl_str_mv |
2023 |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
| format |
doctoralThesis |
| status_str |
publishedVersion |
| dc.identifier.citation.fl_str_mv |
SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023. |
| dc.identifier.uri.fl_str_mv |
http://repositorio.ufc.br/handle/riufc/74505 |
| identifier_str_mv |
SILVA, Danilo Avilar. Matérn Kernel for incomplete data. 2023. 104 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2023. |
| url |
http://repositorio.ufc.br/handle/riufc/74505 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC |
| instname_str |
Universidade Federal do Ceará (UFC) |
| instacron_str |
UFC |
| institution |
UFC |
| reponame_str |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
| collection |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
| bitstream.url.fl_str_mv |
http://repositorio.ufc.br/bitstream/riufc/74505/3/2023_tese_dasilva.pdf http://repositorio.ufc.br/bitstream/riufc/74505/4/license.txt |
| bitstream.checksum.fl_str_mv |
947af8db39a2ec8e7e0c50ea3e15dec1 8a4605be74aa9ea9d79846c1fba20a33 |
| bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
| repository.name.fl_str_mv |
Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC) |
| repository.mail.fl_str_mv |
bu@ufc.br || repositorio@ufc.br |
| _version_ |
1847793338459095040 |