Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations

Fraideinberze, Antonio Canabrava

Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations

Detalhes bibliográficos
Ano de defesa:	2017
Autor(a) principal:	Fraideinberze, Antonio Canabrava
Orientador(a):	Não Informado pela instituição
Banca de defesa:	Não Informado pela instituição
Tipo de documento:	Dissertação
Tipo de acesso:	Acesso aberto
Idioma:	eng
Instituição de defesa:	Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação:	Não Informado pela instituição
Departamento:	Não Informado pela instituição
País:	Não Informado pela instituição
Palavras-chave em Português:	Big data Correlações não-lineares entre atributos Feature selection Fractal theory Massive parallel processing Non-linear attribute correlations Processamento paralelo em massa Seleção de atributos Teoria de fractais
Link de acesso:	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-17112017-154451/
Resumo:	Given a very large dataset of moderate-to-high dimensionality, how to mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the well-known curse of dimensionality. Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify/eliminate non-linear correlations that may occur between the attributes. This MSc work tackles the problem by exploring concepts of the Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: (a) Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; (b) it is unsupervised and suits for analytical tasks in general not only classification; (c) it presents linear scale-up on both the data size and the number of machines used; (d) it does not require the user to guess the number of attributes to be removed, and; (e) it preserves the attributes semantics by performing feature selection, not feature extraction. We executed experiments on synthetic and real data spanning up to 1.1 billion points, and report that our proposed Curl-Remover outperformed two PCA-based algorithms from the state-of-the-art, being in average up to 8% more accurate.

Metadados do item

id	USP_b3aceda86897bc6df332013c81a0623d
oai_identifier_str	oai:teses.usp.br:tde-17112017-154451
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
spelling	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlationsSeleção de atributos efetiva e não-supervisionada em grandes bases de dados: aplicando a Teoria de Fractais para remover correlações lineares e não-linearesBig dataBig dataCorrelações não-lineares entre atributosFeature selectionFractal theoryMassive parallel processingNon-linear attribute correlationsProcessamento paralelo em massaSeleção de atributosTeoria de fractaisGiven a very large dataset of moderate-to-high dimensionality, how to mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the well-known curse of dimensionality. Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify/eliminate non-linear correlations that may occur between the attributes. This MSc work tackles the problem by exploring concepts of the Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: (a) Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; (b) it is unsupervised and suits for analytical tasks in general not only classification; (c) it presents linear scale-up on both the data size and the number of machines used; (d) it does not require the user to guess the number of attributes to be removed, and; (e) it preserves the attributes semantics by performing feature selection, not feature extraction. We executed experiments on synthetic and real data spanning up to 1.1 billion points, and report that our proposed Curl-Remover outperformed two PCA-based algorithms from the state-of-the-art, being in average up to 8% more accurate.Dada uma grande base de dados de dimensionalidade moderada a alta, como identificar padrões úteis nos objetos de dados? Nesses casos, a redução de dimensionalidade é essencial para superar um fenômeno conhecido na literatura como a maldição da alta dimensionalidade. Embora existam algoritmos capazes de reduzir a dimensionalidade de conjuntos de dados na escala de Terabytes, infelizmente, todos falham em relação à identificação/eliminação de correlações não lineares entre os atributos. Este trabalho de Mestrado trata o problema explorando conceitos da Teoria de Fractais e processamento paralelo em massa para apresentar Curl-Remover, uma nova técnica de redução de dimensionalidade bem adequada ao pré-processamento de Big Data. Suas principais contribuições são: (a) Curl-Remover elimina correlações lineares e não lineares entre atributos, bem como atributos irrelevantes; (b) não depende de supervisão do usuário e é útil para tarefas analíticas em geral não apenas para a classificação; (c) apresenta escalabilidade linear tanto em relação ao número de objetos de dados quanto ao número de máquinas utilizadas; (d) não requer que o usuário sugira um número de atributos para serem removidos, e; (e) mantêm a semântica dos atributos por ser uma técnica de seleção de atributos, não de extração de atributos. Experimentos foram executados em conjuntos de dados sintéticos e reais contendo até 1,1 bilhões de pontos, e a nova técnica Curl-Remover apresentou desempenho superior comparada a dois algoritmos do estado da arte baseados em PCA, obtendo em média até 8% a mais em acurácia de resultados.Biblioteca Digitais de Teses e Dissertações da USPCordeiro, Robson Leonardo FerreiraFraideinberze, Antonio Canabrava2017-09-04info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://www.teses.usp.br/teses/disponiveis/55/55134/tde-17112017-154451/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2018-07-17T16:38:18Zoai:teses.usp.br:tde-17112017-154451Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.bropendoar:27212018-07-17T16:38:18Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations Seleção de atributos efetiva e não-supervisionada em grandes bases de dados: aplicando a Teoria de Fractais para remover correlações lineares e não-lineares
title	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
spellingShingle	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations Fraideinberze, Antonio Canabrava Big data Big data Correlações não-lineares entre atributos Feature selection Fractal theory Massive parallel processing Non-linear attribute correlations Processamento paralelo em massa Seleção de atributos Teoria de fractais
title_short	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
title_full	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
title_fullStr	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
title_full_unstemmed	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
title_sort	Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations
author	Fraideinberze, Antonio Canabrava
author_facet	Fraideinberze, Antonio Canabrava
author_role	author
dc.contributor.none.fl_str_mv	Cordeiro, Robson Leonardo Ferreira
dc.contributor.author.fl_str_mv	Fraideinberze, Antonio Canabrava
dc.subject.por.fl_str_mv	Big data Big data Correlações não-lineares entre atributos Feature selection Fractal theory Massive parallel processing Non-linear attribute correlations Processamento paralelo em massa Seleção de atributos Teoria de fractais
topic	Big data Big data Correlações não-lineares entre atributos Feature selection Fractal theory Massive parallel processing Non-linear attribute correlations Processamento paralelo em massa Seleção de atributos Teoria de fractais
description	Given a very large dataset of moderate-to-high dimensionality, how to mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the well-known curse of dimensionality. Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify/eliminate non-linear correlations that may occur between the attributes. This MSc work tackles the problem by exploring concepts of the Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: (a) Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; (b) it is unsupervised and suits for analytical tasks in general not only classification; (c) it presents linear scale-up on both the data size and the number of machines used; (d) it does not require the user to guess the number of attributes to be removed, and; (e) it preserves the attributes semantics by performing feature selection, not feature extraction. We executed experiments on synthetic and real data spanning up to 1.1 billion points, and report that our proposed Curl-Remover outperformed two PCA-based algorithms from the state-of-the-art, being in average up to 8% more accurate.
publishDate	2017
dc.date.none.fl_str_mv	2017-09-04
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-17112017-154451/
url	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-17112017-154451/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Liberar o conteúdo para acesso público.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.br
_version_	1815258201998753792

Effective and unsupervised fractal-based feature selection for very large datasets: removing linear and non-linear attribute correlations

Registros relacionados