Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora

Bibliographic details
Defense year: 2024
Main author: Garcia, Eduardo Augusto Santos
Advisor: Silva, Nádia Félix Felipe da
Examination committee: Silva, Nádia Félix Felipe da; Lima, Eliomar Araújo de; Soares, Anderson da Silva; Placca, José Avelino
Document type: Master's dissertation
Access type: Open access
dARK ID: ark:/38995/001300000fk84
Language: Portuguese (por)
Institution: Universidade Federal de Goiás
Graduate program: Programa de Pós-graduação em Ciência da Computação (INF)
Department: Instituto de Informática - INF (RMG)
Country: Brazil
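Keywords in Portuguese: Processamento de linguagem natural; Modelo de linguagem; Domínio legal; Benchmark Jurídico
Keywords in English: Natural language processing; Language model; Legal Domain; Legal Benchmark
CNPq knowledge area: CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link: http://repositorio.bc.ufg.br/tede/handle/tede/13781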
Abstract: This research investigates the application of Natural Language Processing (NLP) to the legal domain in Portuguese, emphasizing the importance of domain adaptation for pre-trained language models, such as RoBERTa, using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the high rate of near-duplicate documents in legal corpora and comparing it with generic web-scraped corpora. Experiments with these corpora showed that pre-training on a combined dataset of legal and general data produced a more effective model for legal tasks. Our model, called RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, as well as other legal models from similar works. To evaluate these models, this Master's dissertation proposes a legal benchmark composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIAEntidades, and CEIA-Frases. This study contributes to improving NLP solutions in the Brazilian legal context by openly providing enhanced models, a specialized corpus, and a rigorous benchmark suite.
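The abstract mentions a high rate of near-duplicate documents in legal corpora, but this record does not say how they were detected. As a minimal sketch of one common approach (an assumption, not necessarily the dissertation's method), near-duplicates can be flagged by comparing sets of word shingles with Jaccard similarity. The function names, the shingle size of 3, and the 0.8 threshold are all illustrative choices.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(
    docs: list[str], threshold: float = 0.8, k: int = 3
) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold."""
    sets = [shingles(doc, k) for doc in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

This exhaustive pairwise comparison grows quadratically with the number of documents, so it only suits small samples; large corpora typically rely on approximate techniques such as MinHash with locality-sensitive hashing.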
id UFG-2_e8046093d9567bde166990aebd280b04
oai_identifier_str oai:repositorio.bc.ufg.br:tede/13781
network_acronym_str UFG-2
network_name_str Repositório Institucional da UFG
repository_id_str
dc.title.none.fl_str_mv Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora
dc.title.alternative.eng.fl_str_mv Adaptação de domínio Legal em Modelos de Linguagens em português - Desenvolvimento e avaliação de modelos baseados em RoBERTa em corpora legais
author Garcia, Eduardo Augusto Santos
author_facet Garcia, Eduardo Augusto Santos
author_role author
dc.contributor.advisor1.fl_str_mv Silva, Nádia Félix Felipe da
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/7864834001694765
dc.contributor.advisor-co1.fl_str_mv Lima, Eliomar Araújo de
dc.contributor.advisor-co1Lattes.fl_str_mv http://lattes.cnpq.br/1362170231777201
dc.contributor.referee1.fl_str_mv Silva, Nádia Félix Felipe da
dc.contributor.referee2.fl_str_mv Lima, Eliomar Araújo de
dc.contributor.referee3.fl_str_mv Soares, Anderson da Silva
dc.contributor.referee4.fl_str_mv Placca, José Avelino
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/4332449817645365
dc.contributor.author.fl_str_mv Garcia, Eduardo Augusto Santos
dc.subject.por.fl_str_mv Processamento de linguagem natural
Modelo de linguagem
Domínio legal
Benchmark Jurídico
dc.subject.eng.fl_str_mv Natural language processing
Language model
Legal Domain
Legal Benchmark
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate 2024
dc.date.issued.fl_str_mv 2024-05-28
dc.date.accessioned.fl_str_mv 2025-01-15T14:46:17Z
dc.date.available.fl_str_mv 2025-01-15T14:46:17Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024.
dc.identifier.uri.fl_str_mv http://repositorio.bc.ufg.br/tede/handle/tede/13781
dc.identifier.dark.fl_str_mv ark:/38995/001300000fk84
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Attribution-NonCommercial-NoDerivatives 4.0 International
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Goiás
dc.publisher.program.fl_str_mv Programa de Pós-graduação em Ciência da Computação (INF)
dc.publisher.initials.fl_str_mv UFG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Instituto de Informática - INF (RMG)
publisher.none.fl_str_mv Universidade Federal de Goiás
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFG
instname:Universidade Federal de Goiás (UFG)
instacron:UFG
instname_str Universidade Federal de Goiás (UFG)
instacron_str UFG
institution UFG
reponame_str Repositório Institucional da UFG
collection Repositório Institucional da UFG
bitstream.url.fl_str_mv http://repositorio.bc.ufg.br/tede/bitstreams/1e03366c-f56e-40d2-9b60-ff618f25f8fc/download
http://repositorio.bc.ufg.br/tede/bitstreams/bce30191-f27b-4ecf-aa41-48caeb92cdda/download
http://repositorio.bc.ufg.br/tede/bitstreams/6614cb62-c4bf-4535-bf39-daa6aa409998/download
bitstream.checksum.fl_str_mv 5bbcc4c1f25c94ad4ff69b49f82ca181
8a4605be74aa9ea9d79846c1fba20a33
4460e5956bc1d1639be9ae6146a50347
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
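The record lists an MD5 checksum for each bitstream. A minimal sketch of how a downloaded file could be checked against such a value, using only Python's standard library, is shown here. The helper names `md5_hex` and `md5_matches` are hypothetical and not part of any repository tooling.

```python
import hashlib
from pathlib import Path


def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: Path, expected_hex: str) -> bool:
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_hex(path) == expected_hex.strip().lower()
```

For instance, after saving the dissertation PDF locally, `md5_matches(Path("dissertacao.pdf"), "5bbcc4c1f25c94ad4ff69b49f82ca181")` would compare it against the first checksum listed above; the local filename is an assumption. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.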
repository.name.fl_str_mv Repositório Institucional da UFG - Universidade Federal de Goiás (UFG)
repository.mail.fl_str_mv grt.bc@ufg.br
_version_ 1846536642943254528