Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain

Noguti, Mariana Yukari, 1987-

Bibliographic details
Year of defense:	2022
Main author:	Noguti, Mariana Yukari, 1987-
Advisor:	Oliveira, Luiz Eduardo Soares de, 1971-
Defense committee:	Not provided by the institution
Document type:	Master's dissertation
Access type:	Open access
Language:	eng
Defending institution:	Not provided by the institution
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Algorítmos; Processamento da linguagem natural (Computação); Inteligência artificial; Ciência da Computação
Access link:	https://hdl.handle.net/1884/78717
Abstract:	Advisor: Luiz Eduardo S. Oliveira

Item metadata

id	UFPR_1948b0937d70ab7bdecc16a96e21030c
oai_identifier_str	oai:acervodigital.ufpr.br:1884/78717
network_acronym_str	UFPR
network_name_str	Repositório Institucional da UFPR
repository_id_str
spelling	Noguti, Mariana Yukari, 1987-; Vellasques, Eduardo, 1979-; Universidade Federal do Paraná. Setor de Ciências Exatas. Programa de Pós-Graduação em Informática; Oliveira, Luiz Eduardo Soares de, 1971-; 2022-10-24T15:31:28Z; 2022-10-24T15:31:28Z; 2022; https://hdl.handle.net/1884/78717
Advisor: Luiz Eduardo S. Oliveira. Co-advisor: Eduardo Vellasques. Dissertation (master's) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 5 August 2022. Includes references. Area of concentration: Computer Science.
Resumo (translated from the Portuguese): This work investigates the performance of transfer learning techniques combined with data augmentation and different supervised and semi-supervised learning algorithms in classifying legal texts into predefined topics. The aim is to identify the techniques that best optimize performance on this task using a relatively small labeled dataset and large amounts of unlabeled data. More specifically, records of public assistance provided by the Public Prosecutor's Office of the State of Paraná (Ministério Público do Estado do Paraná, MPPR) are used as test data, with the goal of classifying the description of each record into one of the subjects listed by the institution and automating that task in the records system. Because the institution's members have many demands on their time, they cannot evaluate a large volume of data, so optimizing classifiers with little data is relevant to developing the final product. In addition, given the particular vocabulary used day to day by the MPPR, the work assesses the impact of fine-tuning existing Portuguese language models on classifier performance.
For this research, a labeled dataset of 6,500 observations was obtained, with the goal of classifying short texts into 50 subjects related to the MPPR's areas of activity. Large volumes of unlabeled observations were also made available to compose a semi-supervised dataset, along with a dataset of more than one million internal records used to train different language models. The results show that, for supervised learning with linear classifiers such as Logistic Regression and SVM and ensembles such as Gradient Boosting and Random Forest, the best performance is obtained with embeddings extracted by the word2vec technique rather than by the BERT model. BERT performs better when its own architecture is used as the classifier, in which case it surpasses the previous models. The best result indicates that combining a BERT language model adapted to legal vocabulary, specific semi-supervised learning techniques, and data augmentation outperforms the other models, reaching an accuracy of 80.7% in predicting 50 classes.
Abstract: This research investigates the performance of transfer learning techniques in conjunction with data augmentation and different supervised and semi-supervised learning algorithms in classifying legal texts into predefined topics. The intention is to investigate how recent advances in Natural Language Processing (NLP) can help tackle this type of problem, where the amount of labeled data is low but there is a large volume of unlabeled/domain-specific data.
More specifically, we use records of demands made to the Public Prosecutor's Office of the State of Paraná (MPPR) to classify each description into one of the subjects listed by the institution and to automate the task in the records system. Because the members of the institution have many demands on their time, they cannot evaluate a large volume of data, so optimizing classifiers in a low-data regime is relevant to the development of the final product. In addition, given the specificity of the vocabulary used by the MPPR, we assess the impact of fine-tuning pre-existing Portuguese language models on the classifier's performance. For this investigation, a labeled dataset of 6,500 observations was obtained in order to classify texts into 50 categories related to the MPPR's areas of activity. Large volumes of unlabeled observations were also made available to compose a semi-supervised dataset, as well as a dataset of more than one million internal records, used to train different language models. Our results show that, for supervised learning with linear classifiers such as Logistic Regression and SVM and tree ensembles such as Gradient Boosting and Random Forest, embeddings extracted with the word2vec technique perform better than those from the BERT model. BERT performs better when the architecture of the model itself is used as the classifier, in which case it surpasses the previous models.
The best result indicates that the joint use of the BERT language model fine-tuned to the legal vocabulary, specific semi-supervised learning techniques, and data augmentation outperforms all previous models, reaching an accuracy of 80.7% in the prediction of 50 classes.
1 online resource: PDF; application/pdf; Algorítmos; Processamento da linguagem natural (Computação); Inteligência artificial; Ciência da Computação; Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; eng; reponame:Repositório Institucional da UFPR; instname:Universidade Federal do Paraná (UFPR); instacron:UFPR; info:eu-repo/semantics/openAccess; ORIGINAL; R - D - MARIANA YUKARI NOGUTI.pdf; application/pdf; 1657928; https://acervodigital.ufpr.br/bitstream/1884/78717/1/R%20-%20D%20-%20MARIANA%20YUKARI%20NOGUTI.pdf; f660a21f11eb488f99c3809ae11c1150; MD5; 1; open access; 1884/78717; 2022-10-24 12:31:28.486; open access; oai:acervodigital.ufpr.br:1884/78717; Repositório Institucional; PUB; http://acervodigital.ufpr.br/oai/request; informacaodigital@ufpr.br; opendoar:308; 2022-10-24T15:31:28; Repositório Institucional da UFPR - Universidade Federal do Paraná (UFPR); false
dc.title.pt_BR.fl_str_mv	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
title	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
spellingShingle	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain Noguti, Mariana Yukari, 1987- Algorítmos Processamento da linguagem natural (Computação) Inteligência artificial Ciência da Computação
title_short	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
title_full	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
title_fullStr	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
title_full_unstemmed	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
title_sort	Comparison of natural language processing algorithms applied to small supervised datasets in the legal domain
author	Noguti, Mariana Yukari, 1987-
author_facet	Noguti, Mariana Yukari, 1987-
author_role	author
dc.contributor.other.pt_BR.fl_str_mv	Vellasques, Eduardo, 1979- Universidade Federal do Paraná. Setor de Ciências Exatas. Programa de Pós-Graduação em Informática
dc.contributor.author.fl_str_mv	Noguti, Mariana Yukari, 1987-
dc.contributor.advisor1.fl_str_mv	Oliveira, Luiz Eduardo Soares de, 1971-
contributor_str_mv	Oliveira, Luiz Eduardo Soares de, 1971-
dc.subject.por.fl_str_mv	Algorítmos Processamento da linguagem natural (Computação) Inteligência artificial Ciência da Computação
topic	Algorítmos Processamento da linguagem natural (Computação) Inteligência artificial Ciência da Computação
description	Advisor: Luiz Eduardo S. Oliveira
publishDate	2022
dc.date.accessioned.fl_str_mv	2022-10-24T15:31:28Z
dc.date.available.fl_str_mv	2022-10-24T15:31:28Z
dc.date.issued.fl_str_mv	2022
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1884/78717
url	https://hdl.handle.net/1884/78717
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	1 online resource: PDF. application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFPR instname:Universidade Federal do Paraná (UFPR) instacron:UFPR
instname_str	Universidade Federal do Paraná (UFPR)
instacron_str	UFPR
institution	UFPR
reponame_str	Repositório Institucional da UFPR
collection	Repositório Institucional da UFPR
bitstream.url.fl_str_mv	https://acervodigital.ufpr.br/bitstream/1884/78717/1/R%20-%20D%20-%20MARIANA%20YUKARI%20NOGUTI.pdf
bitstream.checksum.fl_str_mv	f660a21f11eb488f99c3809ae11c1150
bitstream.checksumAlgorithm.fl_str_mv	MD5
repository.name.fl_str_mv	Repositório Institucional da UFPR - Universidade Federal do Paraná (UFPR)
repository.mail.fl_str_mv	informacaodigital@ufpr.br
_version_	1847526171437170688
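
The identifiers in this record can be used programmatically. The following is a minimal Python sketch, not an official client: it builds an OAI-PMH GetRecord request URL from the OAI identifier and the OAI request endpoint listed in the record, and checks a downloaded copy of the PDF against the recorded MD5 checksum. The function names (oai_get_record_url, md5_of_file, matches_record_checksum) are hypothetical, and the sketch assumes the endpoint is still reachable and serves the oai_dc metadata format, which the OAI-PMH specification requires of every repository.

```python
import hashlib
from urllib.parse import urlencode

# Values copied from the record above.
OAI_BASE_URL = "http://acervodigital.ufpr.br/oai/request"
OAI_IDENTIFIER = "oai:acervodigital.ufpr.br:1884/78717"
EXPECTED_MD5 = "f660a21f11eb488f99c3809ae11c1150"


def oai_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL for a single item."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record_checksum(path, expected_md5=EXPECTED_MD5):
    """Compare a local file's MD5 digest with the checksum in the record."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Prints the request URL; fetching it is left to the reader.
    print(oai_get_record_url(OAI_BASE_URL, OAI_IDENTIFIER))
```

After downloading the file from the bitstream URL, calling matches_record_checksum with the local path should return True if the copy is intact. MD5 is used here only because it is the algorithm the record names; it detects accidental corruption but is not a safeguard against deliberate tampering.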
