Using Similarity Self-Join Techniques
| Year of defense: | 2021 |
|---|---|
| Main author: | |
| Advisor: | |
| Examination committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Biblioteca Digital de Teses e Dissertações da USP |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/ |
Abstract: | The democratization of electronic devices over the years has encouraged individuals and industries to produce data cheaply. As a consequence, data production has increased globally at a fast pace. With this ever-growing data production, industries have demanded better tools to find patterns in the large volume of data available and improve their decision-making processes. Some particular events might not fit any existing pattern and yet bring important business insights. They are usually rare events that do not conform to the majority of the data, and are often classified as anomalies, exceptions or outliers. They can represent failures, fraud, scams, intrusions or abnormal conditions in systems. Detecting this type of event as soon as possible is crucial for real-world applications in areas such as finance, healthcare, social networks and quality control. Several algorithms introduced in the literature provide outstanding results in terms of effectiveness, but, in practice, the existing alternatives are still very expensive in terms of runtime. The most efficient approaches assume that an outlier can be identified by searching for similar instances, also known as neighbors because of their close proximity in the feature space. Data structures are used to store the instances and perform successive neighborhood search operations as a way to take advantage of neighborhood properties and detect outliers. This type of operation has been studied extensively by the similarity search community over the years. It is well known in this community that successive neighborhood searches can be replaced by a single similarity join operation, but this observation does not seem obvious in the outlier detection literature, because virtually all algorithms develop their own strategies to search for similar instances.
Similarity join is a fundamental operation in database systems: given two datasets and a similarity threshold, the goal is to find all pairs of similar instances. When only one dataset is given, the operation is called a similarity self-join; it returns a set of similar pairs for each instance. In this context, since outliers are rare events that diverge from the majority, the instances with few similar pairs are potential outliers. Many join-based algorithms aim to improve the efficiency of the operation in a diverse range of applications. In this work, we investigate how this overlap of concepts can improve the runtime and scalability of outlier detection algorithms. We propose two novel outlier detection algorithms that use join-based techniques: ODSSJ and HySortOD. Our experimental results suggest that our solutions are three orders of magnitude faster than existing state-of-the-art algorithms. |
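The idea described in the abstract, that one similarity self-join can stand in for repeated neighborhood searches, can be illustrated with a minimal sketch. This is not ODSSJ or HySortOD: it is a brute-force self-join written only to show the concept, and the names `similarity_self_join`, `flag_outliers`, `eps` (the similarity threshold) and `min_pairs` (the cutoff for "few similar pairs") are assumptions introduced here, as is the choice of Euclidean distance.

```python
from itertools import combinations
from math import dist


def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Brute force (quadratic); real join algorithms prune most comparisons.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= eps
    ]


def flag_outliers(points, eps, min_pairs):
    """Flag instances that take part in fewer than `min_pairs` similar pairs."""
    pair_count = [0] * len(points)
    # A single join produces every similar pair; no per-instance search is needed.
    for i, j in similarity_self_join(points, eps):
        pair_count[i] += 1
        pair_count[j] += 1
    return [i for i, count in enumerate(pair_count) if count < min_pairs]


# Hypothetical toy data: four nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = flag_outliers(points, eps=0.5, min_pairs=2)
print(outliers)
```

On this toy data the isolated point at index 4 is the only one flagged, because it appears in no similar pair. The quadratic join is the part that an efficient join-based technique would replace.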
| id |
USP_5c9f7249d9f2c7ed86fe7ef4db21fe79 |
|---|---|
| oai_identifier_str |
oai:teses.usp.br:tde-29042021-111846 |
| network_acronym_str |
USP |
| network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Using Similarity Self-Join Techniques; Detecção Rápida de Casos de Exceção Utilizando Técnicas de Auto-Junção por Similaridade (Fast Outlier Detection Using Similarity Self-Join Techniques) |
| title |
Using Similarity Self-Join Techniques |
| author |
Cabral, Eugenio Ferreira |
| author_facet |
Cabral, Eugenio Ferreira |
| author_role |
author |
| dc.contributor.none.fl_str_mv |
Cordeiro, Robson Leonardo Ferreira |
| dc.contributor.author.fl_str_mv |
Cabral, Eugenio Ferreira |
| dc.subject.por.fl_str_mv |
Detecção de casos de exceção; Epsilon join; Hypercube ordering; Junção epsilon; Ordenação de hipercubos; Outlier detection |
| topic |
Detecção de casos de exceção; Epsilon join; Hypercube ordering; Junção epsilon; Ordenação de hipercubos; Outlier detection |
| publishDate |
2021 |
| dc.date.none.fl_str_mv |
2021-03-01 |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
| format |
masterThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/ |
| url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042021-111846/ |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
|
| dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
Release the content for public access. |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.coverage.none.fl_str_mv |
|
| dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
| dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
| instname_str |
Universidade de São Paulo (USP) |
| instacron_str |
USP |
| institution |
USP |
| reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
| collection |
Biblioteca Digital de Teses e Dissertações da USP |
| repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
| repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
| _version_ |
1815258546052268032 |