A comprehensive exploitation of instance selection methods for automatic text classification

Bibliographic details
Year of defense: 2024
Main author: Washington Luiz Miranda da Cunha
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: eng
Institution of defense: Universidade Federal de Minas Gerais
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://hdl.handle.net/1843/76441
Abstract: Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. Firstly, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer models, IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains. Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first, E2SC, is a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers.
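E2SC itself is only summarized in the abstract above, so the snippet below is not the dissertation's algorithm. It is a minimal sketch of the general idea it names: score training instances with a cheap, reasonably calibrated weak classifier and treat the instances the weak model already classifies correctly with high confidence as redundancy candidates. The synthetic dataset, the choice of logistic regression, and the 30% removal rate are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a vectorized text training set (e.g., TF-IDF features).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A weak, fast classifier; logistic regression yields reasonably
# calibrated class probabilities out of the box.
weak = LogisticRegression(max_iter=1000).fit(X, y)
proba = weak.predict_proba(X)

# Confidence assigned to the *true* class of each instance. Instances the
# weak model gets right with high confidence are redundancy candidates.
true_class_conf = proba[np.arange(len(y)), y]
correct = weak.predict(X) == y
redundancy_score = np.where(correct, true_class_conf, 0.0)

# Drop the 30% most redundant instances, keep the harder 70%.
threshold = np.quantile(redundancy_score, 0.70)
keep = redundancy_score < threshold
X_reduced, y_reduced = X[keep], y[keep]
print(f"kept {keep.sum()} of {len(y)} training instances")
```

The stronger (and more expensive) model would then be trained only on `X_reduced`, which is the tradeoff the dissertation measures: reduction versus effectiveness versus training speedup.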
We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, as mentioned, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon the previous one and aimed at simultaneously removing redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training reduction, effectiveness, and speedup.
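The entropy-based noise step of biO-IS is likewise only named in the abstract. A generic filter along the same lines could flag as noisy the instances on which a weak model is both highly uncertain (high predictive entropy) and wrong. Every concrete choice below is an assumption made for illustration: the synthetic data with injected label noise, the top-decile entropy cutoff, and the rule combining uncertainty with misclassification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# flip_y injects ~10% label noise into the toy training set.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=1)

weak = LogisticRegression(max_iter=1000).fit(X, y)
proba = weak.predict_proba(X)

# Shannon entropy of each predicted class distribution: high entropy
# means the weak model is uncertain about that instance.
H = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Flag instances that are both in the top decile of uncertainty and
# misclassified by the weak model as noise candidates.
uncertain = H > np.quantile(H, 0.90)
wrong = weak.predict(X) != y
noisy = uncertain & wrong
X_clean, y_clean = X[~noisy], y[~noisy]
print(f"flagged {int(noisy.sum())} instances as noisy")
```

In the dissertation's framing, such a noise filter would run alongside the redundancy filter, with the reduction rate of each step tuned iteratively rather than fixed in advance as it is here.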
id UFMG_55052938cee50220b39aa5fd5bd16221
oai_identifier_str oai:repositorio.ufmg.br:1843/76441
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
spelling A comprehensive exploitation of instance selection methods for automatic text classification
Uma exploração abrangente de métodos de seleção de instâncias para classificação automática de texto
Computação – Teses
Aprendizado do computador – Teses
Classificação (Computadores) – Teses
Processamento de linguagem natural – Teses
Seleção de Instâncias – Teses
Instance Selection
Automatic Text Classification
Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. Firstly, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer models, IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains. Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first, E2SC, is a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers.
We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, as mentioned, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon the previous one and aimed at simultaneously removing redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training reduction, effectiveness, and speedup.
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Universidade Federal de Minas Gerais
2024-09-13T16:35:19Z
2025-09-08T23:34:51Z
2024-09-13T16:35:19Z
2024-08-26
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/doctoralThesis
application/pdf
https://hdl.handle.net/1843/76441
eng
Programa Institucional de Internacionalização – CAPES - PrInt
http://creativecommons.org/licenses/by/3.0/pt/
info:eu-repo/semantics/openAccess
Washington Luiz Miranda da Cunha
reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
2025-09-08T23:34:51Z
oai:repositorio.ufmg.br:1843/76441
Repositório Institucional
PUB
https://repositorio.ufmg.br/oai
repositorio@ufmg.br
opendoar:2025-09-08T23:34:51
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
false
dc.title.none.fl_str_mv A comprehensive exploitation of instance selection methods for automatic text classification
Uma exploração abrangente de métodos de seleção de instâncias para classificação automática de texto
title A comprehensive exploitation of instance selection methods for automatic text classification
spellingShingle A comprehensive exploitation of instance selection methods for automatic text classification
Washington Luiz Miranda da Cunha
Computação – Teses
Aprendizado do computador – Teses
Classificação (Computadores) – Teses
Processamento de linguagem natural – Teses
Seleção de Instâncias – Teses
Instance Selection
Automatic Text Classification
title_short A comprehensive exploitation of instance selection methods for automatic text classification
title_full A comprehensive exploitation of instance selection methods for automatic text classification
title_fullStr A comprehensive exploitation of instance selection methods for automatic text classification
title_full_unstemmed A comprehensive exploitation of instance selection methods for automatic text classification
title_sort A comprehensive exploitation of instance selection methods for automatic text classification
author Washington Luiz Miranda da Cunha
author_facet Washington Luiz Miranda da Cunha
author_role author
dc.contributor.author.fl_str_mv Washington Luiz Miranda da Cunha
dc.subject.por.fl_str_mv Computação – Teses
Aprendizado do computador – Teses
Classificação (Computadores) – Teses
Processamento de linguagem natural – Teses
Seleção de Instâncias – Teses
Instance Selection
Automatic Text Classification
topic Computação – Teses
Aprendizado do computador – Teses
Classificação (Computadores) – Teses
Processamento de linguagem natural – Teses
Seleção de Instâncias – Teses
Instance Selection
Automatic Text Classification
description Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. Firstly, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer models, IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains. Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first, E2SC, is a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers.
We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, as mentioned, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon the previous one and aimed at simultaneously removing redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training reduction, effectiveness, and speedup.
publishDate 2024
dc.date.none.fl_str_mv 2024-09-13T16:35:19Z
2024-09-13T16:35:19Z
2024-08-26
2025-09-08T23:34:51Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1843/76441
url https://hdl.handle.net/1843/76441
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Programa Institucional de Internacionalização – CAPES - PrInt
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by/3.0/pt/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by/3.0/pt/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1856413972917387264