A comprehensive exploitation of instance selection methods for automatic text classification
| Year of defense: | 2024 |
|---|---|
| Main author: | |
| Advisor: | |
| Defense committee: | |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defense institution: | Universidade Federal de Minas Gerais |
| Graduate program: | Not informed by the institution |
| Department: | Not informed by the institution |
| Country: | Not informed by the institution |
| Keywords in Portuguese: | |
| Access link: | https://hdl.handle.net/1843/76441 |
| Abstract: | Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. First, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer methods, IS reduces the amount of data needed without losing effectiveness and with considerable training-time gains. Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first is E2SC, a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon E2SC that simultaneously removes redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution reduces the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training-set reduction, effectiveness, and speedup. |
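The dissertation's E2SC and biO-IS implementations are not reproduced in this record. As a rough, hypothetical illustration of the general idea the abstract describes (scoring training instances with a fast, calibrated weak classifier, treating confidently correct predictions as redundancy candidates and high-entropy or misclassified ones as noise candidates), a minimal sketch using scikit-learn and synthetic data might look like this; all thresholds and the logistic-regression stand-in are this sketch's assumptions, not the thesis's method:

```python
# Illustrative sketch only (not the E2SC/biO-IS code): rank training instances
# by the prediction entropy of a cheap "weak" classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Out-of-fold probabilities from a fast classifier stand in for the
# "scalable, fast, and calibrated weak classifiers" of the abstract.
weak = LogisticRegression(max_iter=1000)
proba = cross_val_predict(weak, X, y, cv=5, method="predict_proba")

entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # prediction uncertainty
correct = proba.argmax(axis=1) == y

# Heuristic selection: drop the most redundant instances (correctly classified
# with the lowest entropy) and the likely-noisy ones (misclassified with
# above-median entropy); keep the rest as the reduced training set.
redundant = np.argsort(np.where(correct, entropy, np.inf))[: int(0.3 * len(y))]
noisy = np.where(~correct & (entropy > np.median(entropy)))[0]
keep = np.setdiff1d(np.arange(len(y)), np.union1d(redundant, noisy))
print(f"kept {len(keep)} of {len(y)} instances")
```

In the actual frameworks the reduction rates are not fixed constants like the 30% above; the abstract describes an iterative process that estimates near-optimum rates for both the redundancy and noise steps.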
| id |
UFMG_55052938cee50220b39aa5fd5bd16221 |
|---|---|
| oai_identifier_str |
oai:repositorio.ufmg.br:1843/76441 |
| network_acronym_str |
UFMG |
| network_name_str |
Repositório Institucional da UFMG |
| repository_id_str |
|
| spelling |
A comprehensive exploitation of instance selection methods for automatic text classification Uma exploração abrangente de métodos de seleção de instâncias para classificação automática de texto Computação – Teses Aprendizado do computador – Teses Classificação (Computadores) – Teses Processamento de linguagem natural – Teses Seleção de Instâncias – Teses Instance Selection Automatic Text Classification Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. First, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer methods, IS reduces the amount of data needed without losing effectiveness and with considerable training-time gains. 
Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first is E2SC, a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon E2SC that simultaneously removes redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution reduces the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). 
No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training-set reduction, effectiveness, and speedup. CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior Universidade Federal de Minas Gerais 2024-09-13T16:35:19Z 2025-09-08T23:34:51Z 2024-09-13T16:35:19Z 2024-08-26 info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/doctoralThesis application/pdf https://hdl.handle.net/1843/76441 eng Programa Institucional de Internacionalização – CAPES - PrInt http://creativecommons.org/licenses/by/3.0/pt/ info:eu-repo/semantics/openAccess Washington Luiz Miranda da Cunha reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG 2025-09-08T23:34:51Z oai:repositorio.ufmg.br:1843/76441 Repositório Institucional PUB https://repositorio.ufmg.br/oai repositorio@ufmg.br opendoar:2025-09-08T23:34:51 Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) false |
| dc.title.none.fl_str_mv |
A comprehensive exploitation of instance selection methods for automatic text classification Uma exploração abrangente de métodos de seleção de instâncias para classificação automática de texto |
| title |
A comprehensive exploitation of instance selection methods for automatic text classification |
| spellingShingle |
A comprehensive exploitation of instance selection methods for automatic text classification Washington Luiz Miranda da Cunha Computação – Teses Aprendizado do computador – Teses Classificação (Computadores) – Teses Processamento de linguagem natural – Teses Seleção de Instâncias – Teses Instance Selection Automatic Text Classification |
| title_short |
A comprehensive exploitation of instance selection methods for automatic text classification |
| title_full |
A comprehensive exploitation of instance selection methods for automatic text classification |
| title_fullStr |
A comprehensive exploitation of instance selection methods for automatic text classification |
| title_full_unstemmed |
A comprehensive exploitation of instance selection methods for automatic text classification |
| title_sort |
A comprehensive exploitation of instance selection methods for automatic text classification |
| author |
Washington Luiz Miranda da Cunha |
| author_facet |
Washington Luiz Miranda da Cunha |
| author_role |
author |
| dc.contributor.author.fl_str_mv |
Washington Luiz Miranda da Cunha |
| dc.subject.por.fl_str_mv |
Computação – Teses Aprendizado do computador – Teses Classificação (Computadores) – Teses Processamento de linguagem natural – Teses Seleção de Instâncias – Teses Instance Selection Automatic Text Classification |
| topic |
Computação – Teses Aprendizado do computador – Teses Classificação (Computadores) – Teses Processamento de linguagem natural – Teses Seleção de Instâncias – Teses Instance Selection Automatic Text Classification |
| description |
Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Large Language Models (LLMs). However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. Our focus here is an under-investigated data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. In this sense, the main contribution of this Ph.D. dissertation is twofold. First, we survey classical and recent IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task: Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art (SOTA) ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod: effectiveness, efficiency, and reduction. Our answers reveal an enormous unfulfilled potential for IS solutions. Furthermore, when fine-tuning transformer methods, IS reduces the amount of data needed without losing effectiveness and with considerable training-time gains. Considering the issues revealed by the traditional IS approaches, the second main contribution is the proposal of two IS solutions. The first is E2SC, a novel redundancy-oriented two-step framework aimed at large datasets, with a particular focus on transformers. E2SC estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. 
We hypothesize that it is possible to estimate the effectiveness of a strong classifier (a Transformer) with a weaker one. However, E2SC focuses solely on the removal of redundant instances, leaving untouched other aspects, such as noise, that may help to further reduce the training set. Therefore, we also propose biO-IS, an extended framework built upon E2SC that simultaneously removes redundant and noisy instances from the training set. biO-IS estimates redundancy based on E2SC and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our final solution reduces the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline was capable of scaling to datasets with hundreds of thousands of documents while achieving results of this quality, considering the tradeoff among training-set reduction, effectiveness, and speedup. |
| publishDate |
2024 |
| dc.date.none.fl_str_mv |
2024-09-13T16:35:19Z 2024-09-13T16:35:19Z 2024-08-26 2025-09-08T23:34:51Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
| format |
doctoralThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1843/76441 |
| url |
https://hdl.handle.net/1843/76441 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
Programa Institucional de Internacionalização – CAPES - PrInt |
| dc.rights.driver.fl_str_mv |
http://creativecommons.org/licenses/by/3.0/pt/ info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
http://creativecommons.org/licenses/by/3.0/pt/ |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
| publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
| instname_str |
Universidade Federal de Minas Gerais (UFMG) |
| instacron_str |
UFMG |
| institution |
UFMG |
| reponame_str |
Repositório Institucional da UFMG |
| collection |
Repositório Institucional da UFMG |
| repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
| repository.mail.fl_str_mv |
repositorio@ufmg.br |
| _version_ |
1856413972917387264 |