Using Similarity Self-Join Techniques

Cabral, Eugenio Ferreira

Using Similarity Self-Join Techniques

Detalhes bibliográficos
Ano de defesa:	2021
Autor(a) principal:	Cabral, Eugenio Ferreira
Orientador(a):	Não Informado pela instituição
Banca de defesa:	Não Informado pela instituição
Tipo de documento:	Dissertação
Tipo de acesso:	Acesso aberto
Idioma:	eng
Instituição de defesa:	Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação:	Não Informado pela instituição
Departamento:	Não Informado pela instituição
País:	Não Informado pela instituição
Palavras-chave em Português:	Detecção de casos de exceção Epsilon join Hypercube ordering Junção epsilon Ordenação de hipercubos Outlier detection
Link de acesso:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/
Resumo:	The democratization of electronic devices over the years encouraged individuals and industries to produce data at a cheap price. As a consequence of this phenomenon, the data production increased globally at a fast pace. With this ever-growing data production, industries demanded better tools to find patterns in the large volume of data available and improve their decisionmaking processes. Some particular events might not fit in any existing pattern and yet bring important business insights. They are usually rare events that do not conform with the majority of the data, often classified as anomalies, exceptions or outliers. They can represent failures, frauds, scamming, invasions or abnormal conditions in systems. Detecting this type of event as soon as possible is crucial for real-world applications such as in finance, healthcare, social networks and quality control. Several algorithms have been introduced in the literature providing outstanding results in terms of effectivity, but, in practice, the existing alternatives are still very much expensive in terms of runtime. The most efficient approaches assume that an outlier can be identified by searching for similar instances, also known as neighbors due to their close proximity in the feature space. Data structures are used to store the instances and perform successive neighborhood search operations as a way to take advantage of neighborhood properties and detect outliers. Such type of operation has been strongly researched in the community of similarity search over the years. It is well-known by this community that successive neighborhood searches can be replaced by a single similarity join operation, but this observation does not seem obvious in the outlier detection literature because virtually all algorithms develop their own strategies to search for similar instances. Similarity join is a fundamental operation in database systems; given two datasets and a similarity threshold, the goal is to find all pairs of similar instances. When only one dataset is given, this operation is named similarity self-join; it returns a set of similar pairs for each instance. In this context, since outliers are rare events and diverge from the majority, the instances with few similar pairs are potential outliers. Many join-based algorithms aim to improve the efficiency of the operation in a diverse range of applications. In this work, we investigate how this overlap of concepts can improve the runtime and scalability of outlier detection algorithms. We propose two novel outlier detection algorithms that use join-based techniques - ODSSJ and HySortOD. Our experimental results suggests that our solutions are 3 orders of magnitude faster than existing state-of-the-art algorithms.

Metadados do item

id	USP_5c9f7249d9f2c7ed86fe7ef4db21fe79
oai_identifier_str	oai:teses.usp.br:tde-29042021-111846
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
spelling	Using Similarity Self-Join TechniquesDetecção Rápida de Casos de Exceção Utilizando Técnicas de Auto-Junção por SimilaridadeDetecção de casos de exceçãoEpsilon joinHypercube orderingJunção epsilonOrdenação de hipercubosOutlier detectionThe democratization of electronic devices over the years encouraged individuals and industries to produce data at a cheap price. As a consequence of this phenomenon, the data production increased globally at a fast pace. With this ever-growing data production, industries demanded better tools to find patterns in the large volume of data available and improve their decisionmaking processes. Some particular events might not fit in any existing pattern and yet bring important business insights. They are usually rare events that do not conform with the majority of the data, often classified as anomalies, exceptions or outliers. They can represent failures, frauds, scamming, invasions or abnormal conditions in systems. Detecting this type of event as soon as possible is crucial for real-world applications such as in finance, healthcare, social networks and quality control. Several algorithms have been introduced in the literature providing outstanding results in terms of effectivity, but, in practice, the existing alternatives are still very much expensive in terms of runtime. The most efficient approaches assume that an outlier can be identified by searching for similar instances, also known as neighbors due to their close proximity in the feature space. Data structures are used to store the instances and perform successive neighborhood search operations as a way to take advantage of neighborhood properties and detect outliers. Such type of operation has been strongly researched in the community of similarity search over the years. It is well-known by this community that successive neighborhood searches can be replaced by a single similarity join operation, but this observation does not seem obvious in the outlier detection literature because virtually all algorithms develop their own strategies to search for similar instances. Similarity join is a fundamental operation in database systems; given two datasets and a similarity threshold, the goal is to find all pairs of similar instances. When only one dataset is given, this operation is named similarity self-join; it returns a set of similar pairs for each instance. In this context, since outliers are rare events and diverge from the majority, the instances with few similar pairs are potential outliers. Many join-based algorithms aim to improve the efficiency of the operation in a diverse range of applications. In this work, we investigate how this overlap of concepts can improve the runtime and scalability of outlier detection algorithms. We propose two novel outlier detection algorithms that use join-based techniques - ODSSJ and HySortOD. Our experimental results suggests that our solutions are 3 orders of magnitude faster than existing state-of-the-art algorithms.A democratização dos dispositivos eletrônicos ao longo dos anos incentivou indivíduos e indústrias a produzirem dados a um baixo custo. Como consequência, a produção de dados aumentou globalmente em ritmo acelerado. Com essa produção de dados cada vez maior, as indústrias exigiram melhores ferramentas para encontrar padrões e melhorar seus processos de tomada de decisão. Alguns eventos em particular podem não encaixar em nenhum padrão e ainda assim trazerem informações importantes. São usualmente eventos raros que não correspondem à maioria dos dados, também conhecidos como anomalias, exceções ou outliers. Eles podem representar falhas, fraudes, invasões ou condições anormais em sistemas. Detectar esses eventos o quanto antes é crucial em aplicações reais, como finanças, redes sociais e controle de qualidade. Vários algoritmos fornecem excelentes resultados em termos de qualidade, porém na prática, se mostram ineficientes para lidar com dados volumosos. Abordagens mais eficientes pressupõem que uma exceção pode ser identificada buscando por instâncias similares, também conhecidas como vizinhas devido à proximidade espacial entre as instâncias. As estruturas de dados armazenam dados e realizam sucessivas operações de busca por vizinhança para obter informações sobre a densidade da vizinhança, a qual é usada na detecção de exceções. Essa operação tem sido muito pesquisada na comunidade de busca por similaridade ao longo dos anos. Nessa comunidade, é sabido que essas sucessivas operações podem ser substituídas por uma junção por similaridade, mas essa observação não parece óbvia na literatura de detecção de casos de exceção porque praticamente todos algoritmos criam suas próprias estratégias de busca por similaridade. A junção por similaridade é uma operação que, dado dois conjuntos de dados e um limite de similaridade, o objetivo é encontrar todos os pares de instâncias similares. Porém, quando apenas um conjunto de dados é fornecido, essa operação é denominada auto-junção por similaridade. Os algoritmos para essa operação visam melhorar a eficiência em uma ampla gama de aplicações. Como casos de exceção são eventos raros e divergentes da maioria, instâncias com poucos pares podem ser uma exceção. Neste trabalho, propomos investigar como essa sobreposição de conceitos pode ser benéfica para melhorar o desempenho e a escalabilidade de algoritmos de detecção de exceção. Propomos dois novos algoritmos baseados em técnicas de junção por similaridade - ODSSJ e HySortOD. Os resultados experimentais sugerem que as soluções são 3 ordens de magnitude mais rápida que os algoritmos estado da arte existentes.Biblioteca Digitais de Teses e Dissertações da USPCordeiro, Robson Leonardo FerreiraCabral, Eugenio Ferreira2021-03-01info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2021-04-29T17:25:02Zoai:teses.usp.br:tde-29042021-111846Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.bropendoar:27212021-04-29T17:25:02Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	Using Similarity Self-Join Techniques Detecção Rápida de Casos de Exceção Utilizando Técnicas de Auto-Junção por Similaridade
title	Using Similarity Self-Join Techniques
spellingShingle	Using Similarity Self-Join Techniques Cabral, Eugenio Ferreira Detecção de casos de exceção Epsilon join Hypercube ordering Junção epsilon Ordenação de hipercubos Outlier detection
title_short	Using Similarity Self-Join Techniques
title_full	Using Similarity Self-Join Techniques
title_fullStr	Using Similarity Self-Join Techniques
title_full_unstemmed	Using Similarity Self-Join Techniques
title_sort	Using Similarity Self-Join Techniques
author	Cabral, Eugenio Ferreira
author_facet	Cabral, Eugenio Ferreira
author_role	author
dc.contributor.none.fl_str_mv	Cordeiro, Robson Leonardo Ferreira
dc.contributor.author.fl_str_mv	Cabral, Eugenio Ferreira
dc.subject.por.fl_str_mv	Detecção de casos de exceção Epsilon join Hypercube ordering Junção epsilon Ordenação de hipercubos Outlier detection
topic	Detecção de casos de exceção Epsilon join Hypercube ordering Junção epsilon Ordenação de hipercubos Outlier detection
description	The democratization of electronic devices over the years encouraged individuals and industries to produce data at a cheap price. As a consequence of this phenomenon, the data production increased globally at a fast pace. With this ever-growing data production, industries demanded better tools to find patterns in the large volume of data available and improve their decisionmaking processes. Some particular events might not fit in any existing pattern and yet bring important business insights. They are usually rare events that do not conform with the majority of the data, often classified as anomalies, exceptions or outliers. They can represent failures, frauds, scamming, invasions or abnormal conditions in systems. Detecting this type of event as soon as possible is crucial for real-world applications such as in finance, healthcare, social networks and quality control. Several algorithms have been introduced in the literature providing outstanding results in terms of effectivity, but, in practice, the existing alternatives are still very much expensive in terms of runtime. The most efficient approaches assume that an outlier can be identified by searching for similar instances, also known as neighbors due to their close proximity in the feature space. Data structures are used to store the instances and perform successive neighborhood search operations as a way to take advantage of neighborhood properties and detect outliers. Such type of operation has been strongly researched in the community of similarity search over the years. It is well-known by this community that successive neighborhood searches can be replaced by a single similarity join operation, but this observation does not seem obvious in the outlier detection literature because virtually all algorithms develop their own strategies to search for similar instances. Similarity join is a fundamental operation in database systems; given two datasets and a similarity threshold, the goal is to find all pairs of similar instances. When only one dataset is given, this operation is named similarity self-join; it returns a set of similar pairs for each instance. In this context, since outliers are rare events and diverge from the majority, the instances with few similar pairs are potential outliers. Many join-based algorithms aim to improve the efficiency of the operation in a diverse range of applications. In this work, we investigate how this overlap of concepts can improve the runtime and scalability of outlier detection algorithms. We propose two novel outlier detection algorithms that use join-based techniques - ODSSJ and HySortOD. Our experimental results suggests that our solutions are 3 orders of magnitude faster than existing state-of-the-art algorithms.
publishDate	2021
dc.date.none.fl_str_mv	2021-03-01
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Liberar o conteúdo para acesso público.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.br
_version_	1865491511568760832

Using Similarity Self-Join Techniques

Registros relacionados