Junções por similaridade aproximadas em espaços vetoriais densos

Santana, Douglas Rolins de

Bibliographic details
Year of defense:	2023
Main author:	Santana, Douglas Rolins de
Advisor:	Ribeiro, Leonardo Andrade
Examination committee:	Ribeiro, Leonardo Andrade; Bedo, Marcos Vinicius Naves; Martins, Wellington Santos
Document type:	Master's thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Goiás
Graduate program:	Programa de Pós-graduação em Ciência da Computação (INF)
Department:	Instituto de Informática - INF (RG)
Country:	Brazil
Keywords in Portuguese:	Junção por similaridade; Word embeddings; Vetores densos; HNSW
Keywords in English:	Similarity join; Dense vectors
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
Access link:	http://repositorio.bc.ufg.br/tede/handle/tede/13058
Abstract:	Similarity join is an operation that returns pairs of objects whose similarity is greater than or equal to a specified threshold; it is essential for tasks such as data cleaning, data mining, and data integration. A common approach is to represent the data as vectors, for example with TF-IDF weighting, and to measure the similarity between vectors using the cosine function. However, computing the similarity for all pairs of vectors can be computationally prohibitive on large datasets. Traditional algorithms exploit the sparsity of the vectors and apply filters to reduce the comparison space. Recently, advances in natural language processing have produced semantically richer vectors, improving the quality of the results. However, these vectors differ from those generated by traditional methods in that they are dense and high-dimensional. Preliminary experiments demonstrated that L2AP, the best algorithm known for similarity join, is not efficient in dense vector spaces. Owing to the intrinsic characteristics of such vectors, approximate solutions based on specialized indexes predominate for large datasets. In this context, we investigate how to perform similarity joins using Hierarchical Navigable Small World (HNSW), a state-of-the-art graph-based index designed for approximate k-nearest neighbor (kNN) queries. We explore the design space of possible solutions, ranging from alternatives built on top of HNSW to a deeper integration of similarity join processing into this structure. The experiments demonstrated speedups of up to 2.48 and 3.47 orders of magnitude over the exact method and the baseline approach, respectively, while maintaining recall close to 100%.
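
Illustrative note: the operation defined in the abstract can be sketched as an exact, brute-force self-join under cosine similarity. This is only a minimal sketch of the quadratic all-pairs computation that the abstract calls prohibitive on large datasets; it is not the thesis's method, L2AP, or HNSW. The function names, the toy vectors, and the zero-vector convention are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_self_join(vectors, threshold):
    """Exact self-join: every pair (i, j), i < j, with cosine >= threshold.

    Compares all n * (n - 1) / 2 pairs, so the cost grows quadratically
    with the number of vectors.
    """
    result = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            result.append((i, j, sim))
    return result


# Hypothetical toy data: two tight groups of nearly parallel vectors.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
pairs = similarity_self_join(vectors, threshold=0.9)
for i, j, sim in pairs:
    print(i, j, round(sim, 4))
```

With a threshold of 0.9, only the pairs inside each group qualify. An approximate, index-based approach gives up the guarantee of finding every qualifying pair in exchange for avoiding the full pairwise scan.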

Item metadata

id	UFG-2_d391830d8dac21d2a9b69d3ddddd9e20
oai_identifier_str	oai:repositorio.bc.ufg.br:tede/13058
network_acronym_str	UFG-2
network_name_str	Repositório Institucional da UFG
repository_id_str
dc.title.none.fl_str_mv	Junções por similaridade aproximadas em espaços vetoriais densos
dc.title.alternative.eng.fl_str_mv	Approximate similarity joins over dense vector embeddings
title	Junções por similaridade aproximadas em espaços vetoriais densos
author	Santana, Douglas Rolins de
author_facet	Santana, Douglas Rolins de
author_role	author
dc.contributor.advisor1.fl_str_mv	Ribeiro, Leonardo Andrade
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/4036932351063584
dc.contributor.advisor-co1.fl_str_mv	Santana
dc.contributor.referee1.fl_str_mv	Ribeiro, Leonardo Andrade
dc.contributor.referee2.fl_str_mv	Bedo, Marcos Vinicius Naves
dc.contributor.referee3.fl_str_mv	Martins, Wellington Santos
dc.contributor.authorLattes.fl_str_mv	https://lattes.cnpq.br/6843698978977791
dc.contributor.author.fl_str_mv	Santana, Douglas Rolins de
contributor_str_mv	Ribeiro, Leonardo Andrade Santana Ribeiro, Leonardo Andrade Bedo, Marcos Vinicius Naves Martins, Wellington Santos
dc.subject.por.fl_str_mv	Junção por similaridade; Word embeddings; Vetores densos; HNSW
topic	Junção por similaridade; Word embeddings; Vetores densos; HNSW; Similarity join; Dense vectors; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Similarity join; Dense vectors
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO
publishDate	2023
dc.date.accessioned.fl_str_mv	2023-10-16T14:48:15Z
dc.date.available.fl_str_mv	2023-10-16T14:48:15Z
dc.date.issued.fl_str_mv	2023-08-24
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	SANTANA, Douglas R. Junções por similaridade aproximadas em espaços vetoriais densos. 2023. 101 p. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, 2023.
dc.identifier.uri.fl_str_mv	http://repositorio.bc.ufg.br/tede/handle/tede/13058
identifier_str_mv	SANTANA, Douglas R. Junções por similaridade aproximadas em espaços vetoriais densos. 2023. 101 p. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, 2023.
url	http://repositorio.bc.ufg.br/tede/handle/tede/13058
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Goiás
dc.publisher.program.fl_str_mv	Programa de Pós-graduação em Ciência da Computação (INF)
dc.publisher.initials.fl_str_mv	UFG
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	Instituto de Informática - INF (RG)
publisher.none.fl_str_mv	Universidade Federal de Goiás
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFG instname:Universidade Federal de Goiás (UFG) instacron:UFG
instname_str	Universidade Federal de Goiás (UFG)
instacron_str	UFG
institution	UFG
reponame_str	Repositório Institucional da UFG
collection	Repositório Institucional da UFG
bitstream.url.fl_str_mv	http://repositorio.bc.ufg.br/tede/bitstreams/27f069db-637b-4cff-8245-7943d47c7495/download http://repositorio.bc.ufg.br/tede/bitstreams/ecad7e20-e16a-443f-ab56-d7db11953a0e/download http://repositorio.bc.ufg.br/tede/bitstreams/2c901be8-735a-4bb3-84c1-5254c289cbd6/download
bitstream.checksum.fl_str_mv	8a4605be74aa9ea9d79846c1fba20a33 4460e5956bc1d1639be9ae6146a50347 85470543a8c6995a8b36bf5676fa176f
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFG - Universidade Federal de Goiás (UFG)
repository.mail.fl_str_mv	grt.bc@ufg.br
_version_	1861293768003551232
