Towards terabyte-scale outlier detection using GPUs

Bibliographic details
Year of defense: 2017
Main author: Fernando Augusto Freitas Da Silva Da Nova Mussel
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: eng
Defending institution: Universidade Federal de Minas Gerais
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://hdl.handle.net/1843/32269
Abstract: Outlier detection is an important data mining task for finding unusual data records in datasets. These anomalies often carry useful information that can be employed in a wide range of practical applications, such as network intrusion detection and fraud discovery in credit card or insurance databases. The outlier detection problem poses several challenges, and its computational cost is a major one. Significant research has been done to improve the runtime complexity of these methods through the use of data partitioning, ordering and pruning rules. Though these advancements allow outlier detection to be performed in near-linear time, they are not enough to enable processing large-scale datasets in a reasonable time. Even state-of-the-art methods are limited to processing small-scale datasets and/or to finding just a tiny fraction of the true top-n outliers. Recently, GPU-based implementations have emerged as an alternative to address the computational bottleneck. They have shown promising results but, to the best of our knowledge, all distance-based GPU algorithms currently available are designed for in-memory detection: they require the dataset to fit and be loaded into the GPU's memory. Consequently, their applicability is limited because they cannot be used in the scenarios where the GPU's computational power would be the most useful: processing large-scale datasets. The goal of this work is to use GPUs to accelerate the outlier detection process in terabyte-scale, disk-resident datasets. To achieve it, we had to develop algorithms and strategies to overcome the massive reductions in the GPU's computation throughput caused by disk accesses.
We made two main contributions in this work. First, we developed a set of tools and abstractions for out-of-core distance-based outlier detection on GPUs, such as an effective parallelization strategy; algorithms and high-performance GPU kernels for essential operations in distance-based outlier detection; and an I/O subsystem that reduces data transfer overhead while allowing I/O and computation to overlap. The second main contribution is the development of a novel distance-based outlier detection algorithm for GPUs, DROIDg, capable of processing large-scale, disk-resident datasets in reasonable time. It leverages a new ranking heuristic, which we propose, to improve the efficiency of its pruning rule, thereby massively reducing the amount of computation required by the detection. Our experimental analysis focused on assessing the performance benefits of using GPUs for outlier detection in large-scale datasets. Thus, we compared DROIDg against some of the best out-of-core outlier detection algorithms available for CPUs: Orca, Diskaware and Dolphin. DROIDg achieved speedups between 10X and 137X over the best sequential algorithm. Moreover, it displayed far superior scalability with regard to the dataset size and the number of outliers being detected. These results show that GPUs enable outlier detection to be performed at scales far beyond what even state-of-the-art CPU algorithms are capable of.
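The abstract refers to distance-based top-n outlier detection with a pruning rule. As a rough illustration of that general idea, the sketch below scores each point by the distance to its k-th nearest neighbor and abandons a candidate as soon as its running score falls below the weakest outlier found so far. This is a generic, in-memory, CPU-only nested loop in the style of cutoff-based methods such as Orca; it is not DROIDg, and the function name `top_n_outliers`, its parameters, and the choice of the k-th-nearest-neighbor score are assumptions made for the example.

```python
import heapq
import math


def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th-nearest-neighbor distance.

    Generic nested-loop detection with a cutoff-based pruning rule;
    an illustrative assumption, NOT the DROIDg algorithm from the thesis.
    """
    top = []       # min-heap of (score, index); top[0] is the weakest outlier
    cutoff = 0.0   # score of the weakest outlier once n have been found
    for i, p in enumerate(points):
        knn = []   # negated distances: a max-heap of the k smallest so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # -knn[0] is an upper bound on p's final score; once it drops
            # below the cutoff, p can no longer enter the top n.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned or len(knn) < k:
            continue
        score = -knn[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

The pruning only pays off once the cutoff is high, which is why the order in which candidates are examined matters; the ranking heuristic mentioned in the abstract targets that aspect, but its details are not given in this record.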
id UFMG_e25b5b92d28b517e98bce7c00a2ec734
oai_identifier_str oai:repositorio.ufmg.br:1843/32269
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
spelling http://lattes.cnpq.br/7064221120033977
Wagner Meira Júnior
http://lattes.cnpq.br/9092587237114334
Adriano Alonso Veloso
George Luiz Medeiros Teodoro
Renato Antônio Celso Ferreira
Brasil
ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO
Programa de Pós-Graduação em Ciência da Computação
ORIGINAL FeranndoAugustoSNMussel_substituicaofinal.pdf application/pdf 1996688
CC-LICENSE license_rdf application/octet-stream 914
LICENSE license.txt text/plain 2119
TEXT Dissertacao_FernandoMussel_versao_final.pdf.txt text/plain 223296
2025-09-08 22:20:33.644
https://repositorio.ufmg.br/oai
dc.title.none.fl_str_mv Towards terabyte-scale outlier detection using GPUs
dc.title.alternative.none.fl_str_mv Detecção de exceções em bases de dados massivas usando GPUs
author Fernando Augusto Freitas Da Silva Da Nova Mussel
author_facet Fernando Augusto Freitas Da Silva Da Nova Mussel
author_role author
dc.contributor.author.fl_str_mv Fernando Augusto Freitas Da Silva Da Nova Mussel
dc.subject.por.fl_str_mv Computação – Teses.
Mineração de dados (Computação) - Teses.
Detecção de anomalias (Computação) - Teses.
dc.subject.other.none.fl_str_mv Mineração de Dados
Detecção de anomalias
GPUs
publishDate 2017
dc.date.issued.fl_str_mv 2017-03-29
dc.date.accessioned.fl_str_mv 2020-01-28T15:06:16Z
2025-09-09T01:20:33Z
dc.date.available.fl_str_mv 2020-01-28T15:06:16Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1843/32269
url https://hdl.handle.net/1843/32269
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by/3.0/pt/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by/3.0/pt/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br//bitstreams/6b3157e4-76d5-47d6-af12-51c61be75647/download
https://repositorio.ufmg.br//bitstreams/6e788c1f-3992-43db-a54f-0d627656a933/download
https://repositorio.ufmg.br//bitstreams/9547b44b-a251-4ec6-873e-606d950a026c/download
https://repositorio.ufmg.br//bitstreams/79b1a97c-b16b-46aa-a223-e9ce582b5094/download
bitstream.checksum.fl_str_mv 5495ca36fab11c36748dffecec5372c5
f9944a358a0c32770bd9bed185bb5395
34badce4be7e31e3adb4575ae96af679
3841b4348dc86323924dfd8ffb6ea964
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1862105695692783616