Evaluating similarity in DBMSs: Towards query optimization

Bibliographic details
Year of defense: 2024
Main author: Eleutério, Igor Alberte Rodrigues
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: eng
Defense institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-23072024-143549/
Abstract: Relational database management systems (RDBMSs) are ubiquitous systems that store and retrieve data in diverse scenarios. They handle scalar data well, such as numbers, short strings, and dates, for which the identity (=, ≠) and order (≤, ≥, <, >) relations are useful. However, they struggle with complex data such as images, videos, and audio tracks, for which identity and order relations are not meaningful. In this context, similarity queries stand out as a way to compare and evaluate complex objects. Two noteworthy similarity queries are Range and k-nearest-neighbor (k-NN). Many works in the literature implement systems that perform similarity queries, but they have limitations, such as not using RDBMS structures to support traditional queries, not implementing indexes, or requiring changes to SQL commands to run similarity queries. In this master's research, we implemented two systems, MIGUE-Sim and CoSIM-Gres, each with its own contributions to the literature. MIGUE-Sim implements similarity queries using only native resources of Postgres. With this system, we evaluated different ways to express a k-NN query in plain SQL, and our proposed query is up to 10% faster than our main competitor. We also used the native GiST R-tree index to perform k-NN queries, achieving a speed-up of up to 96% over our competitor. CoSIM-Gres implements three different access methods for similarity queries in an RDBMS: Sequential Access, the Slim-tree metric access method (MAM), and the GiST R-tree. To the best of our knowledge, this is the first in-depth discussion of the performance of similarity queries involving different access methods in an RDBMS.
We evaluated different cardinalities, dimensionalities, and distance functions, and our results indicate that: i) distance functions of the Minkowski family do not significantly affect the performance of the access methods; ii) when the expected number of elements retrieved is low compared with the total number of elements in the table (around 5%), the MAM is much better than Sequential Access; iii) when the expected number of elements retrieved by the query is up to 50% of the dataset, the MAM is better than Sequential Access; otherwise, Sequential Access is preferable; iv) when the GiST R-tree is available, it is better than the MAM Slim-tree and Sequential Access for retrieving up to 20% of the dataset. Our results are relevant to future work on optimizing similarity queries in RDBMSs.
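The Range and k-NN queries named in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the dissertation (MIGUE-Sim and CoSIM-Gres run inside Postgres); it only shows what the two queries compute when answered by a sequential scan under a Minkowski (L_p) distance. The function names, the in-memory list of feature vectors, and the sample data are assumptions made for illustration.

```python
import heapq
from typing import List, Sequence

Vector = Sequence[float]


def minkowski(a: Vector, b: Vector, p: float = 2.0) -> float:
    """L_p distance between two vectors (p=1 Manhattan, p=2 Euclidean)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def range_query(data: Sequence[Vector], center: Vector,
                radius: float, p: float = 2.0) -> List[int]:
    """Indexes of all elements within `radius` of `center` (sequential scan)."""
    return [i for i, v in enumerate(data) if minkowski(v, center, p) <= radius]


def knn_query(data: Sequence[Vector], center: Vector,
              k: int, p: float = 2.0) -> List[int]:
    """Indexes of the k elements closest to `center` (sequential scan).

    Ties in distance are broken by the lower index.
    """
    scored = ((minkowski(v, center, p), i) for i, v in enumerate(data))
    return [i for _, i in heapq.nsmallest(k, scored)]


# Hypothetical 2-D feature vectors standing in for extracted image features.
sample = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (10.0, 10.0)]
in_range = range_query(sample, (0.0, 0.0), radius=5.5)
nearest = knn_query(sample, (0.0, 0.0), k=2)
```

Both functions compute one distance per stored element, which is the Sequential Access baseline in the abstract; an index such as the Slim-tree or the GiST R-tree aims to return the same answer while evaluating fewer distances.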
id USP_b646ffb6b372ad1d7aaba391a00d448f
oai_identifier_str oai:teses.usp.br:tde-23072024-143549
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
spelling Traina Junior, Caetano || Eleutério, Igor Alberte Rodrigues || 2024-05-22 || 2024-07-23T18:14:02Z || oai:teses.usp.br:tde-23072024-143549 || Biblioteca Digital de Teses e Dissertações || http://www.teses.usp.br/ || PUB || http://www.teses.usp.br/cgi-bin/mtd2br.pl || opendoar:2721
dc.title.none.fl_str_mv Evaluating similarity in DBMSs: Towards query optimization
Avaliando similaridade em SGBDs: Rumo à otimização de consultas
author Eleutério, Igor Alberte Rodrigues
author_facet Eleutério, Igor Alberte Rodrigues
author_role author
dc.contributor.none.fl_str_mv Traina Junior, Caetano
dc.contributor.author.fl_str_mv Eleutério, Igor Alberte Rodrigues
dc.subject.por.fl_str_mv Consultas por similaridade
Gist R-tree
Métodos de Acesso Métricos
Metric access methods
Optimization
Otimização
Relational database management systems
Similarity queries
Sistemas Gerenciadores de Bases de Dados Relacionais
publishDate 2024
dc.date.none.fl_str_mv 2024-05-22
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-23072024-143549/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-23072024-143549/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br || atendimento@aguia.usp.br
_version_ 1815258472398192640