Data Augmentation methods in natural language processing.

Taynan Maier Ferreira

Bibliographic details
Year of defense:	2021
Main author:	Taynan Maier Ferreira
Advisor:	Anna Helena Reali Costa
Examination committee:	Aline Marins Paes Carvalho, Thiago Alexandre Salgueiro Pardo
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defending institution:	Universidade de São Paulo
Graduate program:	Engenharia Elétrica (Electrical Engineering)
Department:	Not provided by the institution
Country:	BR
Access link:	https://doi.org/10.11606/D.3.2021.tde-04112021-162156
Abstract:	Data Augmentation (DA) methods, a family of techniques designed for the synthetic generation of training data, have shown remarkable results in various Deep Learning and Machine Learning tasks. Despite the widespread and successful adoption of DA within the computer vision community, DA techniques designed for natural language processing (NLP) tasks have exhibited much slower advances and limited success in achieving performance gains. As a consequence, with the exception of applications of back-translation to machine translation tasks, these techniques have not been as thoroughly explored by the wider NLP community. There is no unified view or comparative analysis of the various DA methods available. Furthermore, there is still no proper practical understanding of the relationship between DA and several important aspects of model design, such as training data and regularization parameters. In this work, we perform a comprehensive study of NLP DA techniques, comparing their relative performance under different settings in Sentiment Analysis tasks. We also propose Deep Back-Translation, a novel NLP DA technique. We perform qualitative and quantitative analysis of the generated synthetic data, evaluate its performance gains, and compare all of these aspects to previously existing DA procedures.

Item metadata

id	USP_6e38ba7050072a84d53a10660452c9d4
oai_identifier_str	oai:teses.usp.br:tde-04112021-162156
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
spelling	Keywords: Aprendizado computacional; Aumento de dados; Back-translation; Data Augmentation; Machine learning; Natural language processing; Processamento de linguagem natural. Record timestamps: 2023-12-21T18:13:44Z; 2021-11-05T17:28:02. Repository URLs: http://www.teses.usp.br/; http://www.teses.usp.br/cgi-bin/mtd2br.pl. opendoar:2721
dc.title.en.fl_str_mv	Data Augmentation methods in natural language processing.
dc.title.alternative.pt.fl_str_mv	Métodos de aumento de dados em processamento de linguagem natural.
title	Data Augmentation methods in natural language processing.
spellingShingle	Data Augmentation methods in natural language processing. Taynan Maier Ferreira
title_short	Data Augmentation methods in natural language processing.
title_full	Data Augmentation methods in natural language processing.
title_fullStr	Data Augmentation methods in natural language processing.
title_full_unstemmed	Data Augmentation methods in natural language processing.
title_sort	Data Augmentation methods in natural language processing.
author	Taynan Maier Ferreira
author_facet	Taynan Maier Ferreira
author_role	author
dc.contributor.advisor1.fl_str_mv	Anna Helena Reali Costa
dc.contributor.referee1.fl_str_mv	Aline Marins Paes Carvalho
dc.contributor.referee2.fl_str_mv	Thiago Alexandre Salgueiro Pardo
dc.contributor.author.fl_str_mv	Taynan Maier Ferreira
contributor_str_mv	Anna Helena Reali Costa Aline Marins Paes Carvalho Thiago Alexandre Salgueiro Pardo
description	Data Augmentation (DA) methods, a family of techniques designed for the synthetic generation of training data, have shown remarkable results in various Deep Learning and Machine Learning tasks. Despite the widespread and successful adoption of DA within the computer vision community, DA techniques designed for natural language processing (NLP) tasks have exhibited much slower advances and limited success in achieving performance gains. As a consequence, with the exception of applications of back-translation to machine translation tasks, these techniques have not been as thoroughly explored by the wider NLP community. There is no unified view or comparative analysis of the various DA methods available. Furthermore, there is still no proper practical understanding of the relationship between DA and several important aspects of model design, such as training data and regularization parameters. In this work, we perform a comprehensive study of NLP DA techniques, comparing their relative performance under different settings in Sentiment Analysis tasks. We also propose Deep Back-Translation, a novel NLP DA technique. We perform qualitative and quantitative analysis of the generated synthetic data, evaluate its performance gains, and compare all of these aspects to previously existing DA procedures.
publishDate	2021
dc.date.issued.fl_str_mv	2021-07-20
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://doi.org/10.11606/D.3.2021.tde-04112021-162156
url	https://doi.org/10.11606/D.3.2021.tde-04112021-162156
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade de São Paulo
dc.publisher.program.fl_str_mv	Engenharia Elétrica
dc.publisher.initials.fl_str_mv	USP
dc.publisher.country.fl_str_mv	BR
publisher.none.fl_str_mv	Universidade de São Paulo
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br || atendimento@aguia.usp.br || virginia@if.usp.br
_version_	1786376563055394816