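Supervised Learning Algorithm - Based on Support Vector Machines - A Contribution to the Recognition of Unbalanced Data (original title: Algoritmo de aprendizado supervisionado - baseado em máquinas de vetores de suporte - uma contribuição para o reconhecimento de dados desbalanceados)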

Rufino, Hugo Leonardo Pereira


Bibliographic details
Defense year:	2011
Main author:	Rufino, Hugo Leonardo Pereira
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	por
Defense institution:	Universidade Federal de Uberlândia; BR; Programa de Pós-graduação em Engenharia Elétrica; Engenharias; UFU
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords:	Machine learning; Artificial intelligence; Computer algorithms; Supervised learning; Support vector machines; Unbalanced datasets; CNPQ::ENGENHARIAS::ENGENHARIA ELETRICA
Access link:	https://repositorio.ufu.br/handle/123456789/14284
Abstract:	Machine learning on datasets with unbalanced classes has received considerable attention in the scientific community because traditional classification algorithms do not perform satisfactorily on them. This poor performance can be explained by the fact that traditional machine learning techniques assume that each class in the dataset has an approximately equal number of instances. However, most real datasets have an unbalanced class distribution, in which one class is overrepresented compared with the others. The result is classifiers with high accuracy on the majority class and low accuracy on the minority class, so the minority class is effectively ignored. This bias toward the majority class arises because classifiers are designed to maximize accuracy on the training dataset. Training also assumes that unseen data will follow the same distribution as the training data, which limits the classifier's ability to recognize minority-class examples. Several improvements to traditional classification algorithms have been proposed in the literature, at the data level and at the algorithm level. Data-level approaches use various forms of resampling, such as oversampling the minority class, undersampling the majority class, or a combination of both. Algorithm-level approaches adapt existing classification algorithms (by assigning different costs to minority- and majority-class examples, changing kernels, and other techniques) to improve performance on the minority class. Several ensemble algorithms are also reported as meta-techniques for working with unbalanced classes.
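The gap between overall accuracy and minority-class performance described above can be illustrated with a small worked example. The sketch below is not taken from the thesis: the 990/10 class split, the always-predict-majority classifier, and the helper names `accuracy` and `recall` are assumptions chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of class `label` that were recognized."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical dataset: 990 majority-class (0) and 10 minority-class (1) examples.
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))   # 0.99
print(recall(y_true, y_pred, 1))  # 0.0
```

Under these assumptions the classifier reaches 99% accuracy while recognizing none of the minority examples, which is why accuracy alone is a poor training objective on unbalanced data.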
This thesis studies the main algorithms that deal with unbalanced classes, highlighting their main features: the generation of new synthetic examples, instead of random replication of data, during oversampling; the use of different penalties for misclassifying minority- and majority-class examples; and the use of ensembles so that the resulting classifiers generalize better. After surveying the contribution of each algorithm, the thesis investigates whether more could be obtained from the characteristics of each one. The algorithm that generates new synthetic examples was modified to reduce the chance of generating new elements in the incorrect region. Because generating synthetic elements is not enough to balance highly unbalanced datasets, a new algorithm was developed to undersample the majority class. To improve the generalization ability of the resulting classifier, an ensemble algorithm was also modified. Combining these three steps yields a composite algorithm whose classification hit rate is better than that of the algorithms on which it is based.
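The abstract does not name the synthetic-example algorithm it modifies, nor does it describe the modification. As a rough illustration of the general idea only (generating new minority examples by interpolation rather than replication), the sketch below follows the widely known SMOTE-style scheme. The function name `synthetic_oversample`, the parameter `k`, and the sample points are assumptions, and this is not the thesis's modified algorithm.

```python
import math
import random


def synthetic_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, nearest first (Euclidean distance).
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-dimensional minority-class examples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = synthetic_oversample(minority, n_new=6)
print(len(new_points))
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the region spanned by the minority class; it can still land among majority examples when the classes overlap, which is the kind of incorrect-region placement the abstract says the thesis aims to reduce.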

Item metadata

id	UFU_5824826fe5256f846207ec488de4ebfd
oai_identifier_str	oai:repositorio.ufu.br:123456789/14284
network_acronym_str	UFU
network_name_str	Repositório Institucional da UFU
repository_id_str
dc.title.none.fl_str_mv	Algoritmo de aprendizado supervisionado - baseado em máquinas de vetores de suporte - uma contribuição para o reconhecimento de dados desbalanceados / Supervised learning Algorithm - Based on Support Vector Machines - A Contribution to the Recognition of Unbalanced Data
title	Algoritmo de aprendizado supervisionado - baseado em máquinas de vetores de suporte - uma contribuição para o reconhecimento de dados desbalanceados
author	Rufino, Hugo Leonardo Pereira
author_facet	Rufino, Hugo Leonardo Pereira
author_role	author
dc.contributor.none.fl_str_mv	Veiga, Antônio Cláudio Paschoarelli (http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4782222Y6); Camilo Júnior, Celso Gonçalves (http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4736184D1); Carrijo, Gilberto Arantes (http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4781864Y0); Yamanaka, Keiji (http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4798494D8); Vellasco, Marley Maria Bernardes Rebuzzi (http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4781818T3)
dc.contributor.author.fl_str_mv	Rufino, Hugo Leonardo Pereira
dc.subject.por.fl_str_mv	Machine learning; Artificial intelligence; Computer algorithms; Supervised learning; Support vector machines; Unbalanced datasets; CNPQ::ENGENHARIAS::ENGENHARIA ELETRICA
publishDate	2011
dc.date.none.fl_str_mv	2011-11-07 2011-09-26 2016-06-22T18:37:51Z 2016-06-22T18:37:51Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	RUFINO, Hugo Leonardo Pereira. Supervised learning Algorithm - Based on Support Vector Machines - A Contribution to the Recognition of Unbalanced Data. 2011. 107 f. Thesis (Doctorate in Engineering) - Universidade Federal de Uberlândia, Uberlândia, 2011. https://repositorio.ufu.br/handle/123456789/14284
identifier_str_mv	RUFINO, Hugo Leonardo Pereira. Supervised learning Algorithm - Based on Support Vector Machines - A Contribution to the Recognition of Unbalanced Data. 2011. 107 f. Thesis (Doctorate in Engineering) - Universidade Federal de Uberlândia, Uberlândia, 2011.
url	https://repositorio.ufu.br/handle/123456789/14284
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de Uberlândia; BR; Programa de Pós-graduação em Engenharia Elétrica; Engenharias; UFU
publisher.none.fl_str_mv	Universidade Federal de Uberlândia; BR; Programa de Pós-graduação em Engenharia Elétrica; Engenharias; UFU
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFU instname:Universidade Federal de Uberlândia (UFU) instacron:UFU
instname_str	Universidade Federal de Uberlândia (UFU)
instacron_str	UFU
institution	UFU
reponame_str	Repositório Institucional da UFU
collection	Repositório Institucional da UFU
repository.name.fl_str_mv	Repositório Institucional da UFU - Universidade Federal de Uberlândia (UFU)
repository.mail.fl_str_mv	diinf@dirbi.ufu.br
_version_	1827843479950065664
