A comparative study of text classification techniques for hate speech detection

Silva, Rodolfo Costa Cezar da

Bibliographic details
Year of defense:	2022
Main author:	Silva, Rodolfo Costa Cezar da
Advisor:	Rosa, Thierson Couto
Examination committee:	Rosa, Thierson Couto; Moura, Edleno Silva de; Silva, Nádia Félix Felipe da
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defending institution:	Universidade Federal de Goiás
Graduate program:	Programa de Pós-graduação em Ciência da Computação (INF)
Department:	Instituto de Informática - INF (RMG)
Country:	Brazil
Keywords in Portuguese:	Classificação de texto; Desbalanceamento de classes; Detecção de discurso de ódio; Aprendizado de máquina
Keywords in English:	Text classification; Class imbalance; Hate speech detection; Machine learning
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	http://repositorio.bc.ufg.br/tede/handle/tede/13276
Abstract:	The dissemination of hate speech on the Internet, especially on social media platforms, has been a serious and recurrent problem. In the present study, we compare eleven methods for classifying hate speech, including traditional machine learning methods, neural network-based approaches, and transformers, as well as their combination with eight techniques to address the class imbalance problem, which is a recurrent issue in hate speech classification. The data transformation techniques we investigated include data resampling techniques and a modification of a technique based on compound features (c_features). All models were tested on seven datasets with varying specificity, following a rigorous experimentation protocol that includes cross-validation and the use of appropriate evaluation metrics, as well as validation of the results through appropriate statistical tests for multiple comparisons. To our knowledge, there is no broader comparative study of data-enhancing techniques for hate speech detection, nor any work that combines data resampling techniques with transformers. Our extensive experimentation, based on over 2,900 measurements, reveals that most data resampling techniques are ineffective at improving the effectiveness of classifiers, with the exception of Random Oversampling (ROS), which improves most classification methods, including the transformers. For the smallest dataset, ROS provided gains of 60.43% and 33.47% for BERT and RoBERTa, respectively. The experiments revealed that c_features improved all classification methods with which it could be combined. The compound features technique provided satisfactory gains of up to 7.8% for SVM. Finally, we investigate cost-effectiveness for a few of the best classification methods. This analysis confirmed that the traditional method Logistic Regression (LR), combined with c_features, can provide great effectiveness with low overhead on all datasets considered.

Item metadata

id	UFG-2_be6e618c034482878074acf9a062c001
oai_identifier_str	oai:repositorio.bc.ufg.br:tede/13276
network_acronym_str	UFG-2
network_name_str	Repositório Institucional da UFG
repository_id_str
spelling	Submitted by Marlene Santos (marlene.bc.ufg@gmail.com) on 2024-02-23T17:17:52Z. No. of bitstreams: 2. Dissertação - Rodolfo Costa Cezar da Silva - 2022.pdf: 4201587 bytes, checksum: b3a294341a032cfe63503a0de32e5fc6 (MD5); license_rdf: 805 bytes, checksum: 4460e5956bc1d1639be9ae6146a50347 (MD5). Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2024-02-27T15:03:38Z (GMT). Made available in DSpace on 2024-02-27T15:03:38Z (GMT). Previous issue date: 2022-01-27. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES. Bitstreams: license.txt (text/plain; charset=utf-8, 1748 bytes); Dissertação - Rodolfo Costa Cezar da Silva - 2022.pdf (application/pdf, 4201587 bytes); license_rdf (application/rdf+xml; charset=utf-8, 805 bytes).
dc.title.none.fl_str_mv	A comparative study of text classification techniques for hate speech detection
dc.title.alternative.por.fl_str_mv	Um estudo comparativo de técnicas de classificação de texto para detecção de discurso de ódio
title	A comparative study of text classification techniques for hate speech detection
spellingShingle	A comparative study of text classification techniques for hate speech detection; Silva, Rodolfo Costa Cezar da; Classificação de texto; Desbalanceamento de classes; Detecção de discurso de ódio; Aprendizado de máquina; Text classification; Class imbalance; Hate speech detection; Machine learning; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
title_short	A comparative study of text classification techniques for hate speech detection
title_full	A comparative study of text classification techniques for hate speech detection
title_fullStr	A comparative study of text classification techniques for hate speech detection
title_full_unstemmed	A comparative study of text classification techniques for hate speech detection
title_sort	A comparative study of text classification techniques for hate speech detection
author	Silva, Rodolfo Costa Cezar da
author_facet	Silva, Rodolfo Costa Cezar da
author_role	author
dc.contributor.advisor1.fl_str_mv	Rosa, Thierson Couto
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/4414718560764818
dc.contributor.referee1.fl_str_mv	Rosa, Thierson Couto
dc.contributor.referee2.fl_str_mv	Moura, Edleno Silva de
dc.contributor.referee3.fl_str_mv	Silva, Nádia Félix Felipe da
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/3093346314417983
dc.contributor.author.fl_str_mv	Silva, Rodolfo Costa Cezar da
contributor_str_mv	Rosa, Thierson Couto; Rosa, Thierson Couto; Moura, Edleno Silva de; Silva, Nádia Félix Felipe da
dc.subject.por.fl_str_mv	Classificação de texto; Desbalanceamento de classes; Detecção de discurso de ódio; Aprendizado de máquina
topic	Classificação de texto; Desbalanceamento de classes; Detecção de discurso de ódio; Aprendizado de máquina; Text classification; Class imbalance; Hate speech detection; Machine learning; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Text classification; Class imbalance; Hate speech detection; Machine learning
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
description	The dissemination of hate speech on the Internet, especially on social media platforms, has been a serious and recurrent problem. In the present study, we compare eleven methods for classifying hate speech, including traditional machine learning methods, neural network-based approaches, and transformers, as well as their combination with eight techniques to address the class imbalance problem, which is a recurrent issue in hate speech classification. The data transformation techniques we investigated include data resampling techniques and a modification of a technique based on compound features (c_features). All models were tested on seven datasets with varying specificity, following a rigorous experimentation protocol that includes cross-validation and the use of appropriate evaluation metrics, as well as validation of the results through appropriate statistical tests for multiple comparisons. To our knowledge, there is no broader comparative study of data-enhancing techniques for hate speech detection, nor any work that combines data resampling techniques with transformers. Our extensive experimentation, based on over 2,900 measurements, reveals that most data resampling techniques are ineffective at improving the effectiveness of classifiers, with the exception of Random Oversampling (ROS), which improves most classification methods, including the transformers. For the smallest dataset, ROS provided gains of 60.43% and 33.47% for BERT and RoBERTa, respectively. The experiments revealed that c_features improved all classification methods with which it could be combined. The compound features technique provided satisfactory gains of up to 7.8% for SVM. Finally, we investigate cost-effectiveness for a few of the best classification methods. This analysis confirmed that the traditional method Logistic Regression (LR), combined with c_features, can provide great effectiveness with low overhead on all datasets considered.
publishDate	2022
dc.date.issued.fl_str_mv	2022-01-27
dc.date.accessioned.fl_str_mv	2024-02-27T15:03:38Z
dc.date.available.fl_str_mv	2024-02-27T15:03:38Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	SILVA, R. C. C. A comparative study of text classification techniques for hate speech detection. 2022. 72 f. Dissertação (Mestrado em Ciências Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2022.
dc.identifier.uri.fl_str_mv	http://repositorio.bc.ufg.br/tede/handle/tede/13276
identifier_str_mv	SILVA, R. C. C. A comparative study of text classification techniques for hate speech detection. 2022. 72 f. Dissertação (Mestrado em Ciências Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2022.
url	http://repositorio.bc.ufg.br/tede/handle/tede/13276
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Goiás
dc.publisher.program.fl_str_mv	Programa de Pós-graduação em Ciência da Computação (INF)
dc.publisher.initials.fl_str_mv	UFG
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	Instituto de Informática - INF (RMG)
publisher.none.fl_str_mv	Universidade Federal de Goiás
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFG instname:Universidade Federal de Goiás (UFG) instacron:UFG
instname_str	Universidade Federal de Goiás (UFG)
instacron_str	UFG
institution	UFG
reponame_str	Repositório Institucional da UFG
collection	Repositório Institucional da UFG
bitstream.url.fl_str_mv	http://repositorio.bc.ufg.br/tede/bitstreams/93f61b88-00f8-45c6-bef8-1ff2dc88392b/download http://repositorio.bc.ufg.br/tede/bitstreams/7de1363e-8489-4608-9032-fbc04410d4d8/download http://repositorio.bc.ufg.br/tede/bitstreams/5cd4a14f-97c8-427a-b99e-056d56c77f2e/download
bitstream.checksum.fl_str_mv	8a4605be74aa9ea9d79846c1fba20a33 b3a294341a032cfe63503a0de32e5fc6 4460e5956bc1d1639be9ae6146a50347
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFG - Universidade Federal de Goiás (UFG)
repository.mail.fl_str_mv	tasesdissertacoes.bc@ufg.br
_version_	1798044944661741568
