Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora
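| Year of defense: | 2024 |
|---|---|
| Main author: | Garcia, Eduardo Augusto Santos |
| Advisor: | Silva, Nádia Félix Felipe da |
| Defense committee: | Silva, Nádia Félix Felipe da; Lima, Eliomar Araújo de; Soares, Anderson da Silva; Placca, José Avelino |
| Document type: | Master's thesis |
| Access type: | Open access |
| dARK ID: | ark:/38995/001300000fk84 |
| Language: | por |
| Defending institution: | Universidade Federal de Goiás |
| Graduate program: | Programa de Pós-graduação em Ciência da Computação (INF) |
| Department: | Instituto de Informática - INF (RMG) |
| Country: | Brazil |
| Keywords in Portuguese: | Processamento de linguagem natural; Modelo de linguagem; Domínio legal; Benchmark Jurídico |
| Keywords in English: | Natural language processing; Language model; Legal Domain; Legal Benchmark |
| CNPq knowledge area: | CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
| Access link: | http://repositorio.bc.ufg.br/tede/handle/tede/13781 |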
| Abstract: | This research investigates the application of Natural Language Processing (NLP) within the legal domain for the Portuguese language, emphasizing the importance of domain adaptation for pre-trained language models, such as RoBERTa, using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the challenges of high near-duplicate document rates in legal corpora and conducting a comparison with generic web-scraped corpora. Experiments with these corpora revealed that pre-training on a combined dataset of legal and general data resulted in a more effective model for legal tasks. Our model, called RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, and other legal models from similar works. For evaluating the performance of these models, we propose in this Master’s dissertation a legal benchmark composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIAEntidades, and CEIA-Frases. This study contributes to the improvement of NLP solutions in the Brazilian legal context by openly providing enhanced models, a specialized corpus, and a rigorous benchmark suite. |
| id |
UFG-2_e8046093d9567bde166990aebd280b04 |
|---|---|
| oai_identifier_str |
oai:repositorio.bc.ufg.br:tede/13781 |
| network_acronym_str |
UFG-2 |
| network_name_str |
Repositório Institucional da UFG |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora |
| dc.title.alternative.eng.fl_str_mv |
Adaptação de domínio Legal em Modelos de Linguagens em português - Desenvolvimento e avaliação de modelos baseados em RoBERTa em corpora legais |
| author |
Garcia, Eduardo Augusto Santos |
| author_facet |
Garcia, Eduardo Augusto Santos |
| author_role |
author |
| dc.contributor.advisor1.fl_str_mv |
Silva, Nádia Félix Felipe da |
| dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/7864834001694765 |
| dc.contributor.advisor-co1.fl_str_mv |
Lima, Eliomar Araújo de |
| dc.contributor.advisor-co1Lattes.fl_str_mv |
http://lattes.cnpq.br/1362170231777201 |
| dc.contributor.referee1.fl_str_mv |
Silva, Nádia Félix Felipe da |
| dc.contributor.referee2.fl_str_mv |
Lima, Eliomar Araújo de |
| dc.contributor.referee3.fl_str_mv |
Soares, Anderson da Silva |
| dc.contributor.referee4.fl_str_mv |
Placca, José Avelino |
| dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/4332449817645365 |
| dc.contributor.author.fl_str_mv |
Garcia, Eduardo Augusto Santos |
| contributor_str_mv |
Silva, Nádia Félix Felipe da Lima, Eliomar Araújo de Silva, Nádia Félix Felipe da Lima, Eliomar Araújo de Soares, Anderson da Silva Placca, José Avelino |
| dc.subject.por.fl_str_mv |
Processamento de linguagem natural Modelo de linguagem Domínio legal Benchmark Jurídico |
| topic |
Processamento de linguagem natural Modelo de linguagem Domínio legal Benchmark Jurídico Natural language processing Language model, Legal Domain Legal Benchmark CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
| dc.subject.eng.fl_str_mv |
Natural language processing Language model, Legal Domain Legal Benchmark |
| dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
| publishDate |
2024 |
| dc.date.issued.fl_str_mv |
2024-05-28 |
| dc.date.accessioned.fl_str_mv |
2025-01-15T14:46:17Z |
| dc.date.available.fl_str_mv |
2025-01-15T14:46:17Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
| format |
masterThesis |
| status_str |
publishedVersion |
| dc.identifier.citation.fl_str_mv |
GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024. |
| dc.identifier.uri.fl_str_mv |
http://repositorio.bc.ufg.br/tede/handle/tede/13781 |
| dc.identifier.dark.fl_str_mv |
ark:/38995/001300000fk84 |
| identifier_str_mv |
GARCIA, E. A. S. Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora. 2024. 82 f. Dissertação (Mestrado em Ciência da Computação) - Instituto de Informática, Universidade Federal de Goiás, Goiânia, 2024. ark:/38995/001300000fk84 |
| url |
http://repositorio.bc.ufg.br/tede/handle/tede/13781 |
| dc.language.iso.fl_str_mv |
por |
| language |
por |
| dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International |
| eu_rights_str_mv |
openAccess |
| dc.publisher.none.fl_str_mv |
Universidade Federal de Goiás |
| dc.publisher.program.fl_str_mv |
Programa de Pós-graduação em Ciência da Computação (INF) |
| dc.publisher.initials.fl_str_mv |
UFG |
| dc.publisher.country.fl_str_mv |
Brasil |
| dc.publisher.department.fl_str_mv |
Instituto de Informática - INF (RMG) |
| publisher.none.fl_str_mv |
Universidade Federal de Goiás |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFG instname:Universidade Federal de Goiás (UFG) instacron:UFG |
| instname_str |
Universidade Federal de Goiás (UFG) |
| instacron_str |
UFG |
| institution |
UFG |
| reponame_str |
Repositório Institucional da UFG |
| collection |
Repositório Institucional da UFG |
| bitstream.url.fl_str_mv |
http://repositorio.bc.ufg.br/tede/bitstreams/1e03366c-f56e-40d2-9b60-ff618f25f8fc/download http://repositorio.bc.ufg.br/tede/bitstreams/bce30191-f27b-4ecf-aa41-48caeb92cdda/download http://repositorio.bc.ufg.br/tede/bitstreams/6614cb62-c4bf-4535-bf39-daa6aa409998/download |
| bitstream.checksum.fl_str_mv |
5bbcc4c1f25c94ad4ff69b49f82ca181 8a4605be74aa9ea9d79846c1fba20a33 4460e5956bc1d1639be9ae6146a50347 |
| bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
| repository.name.fl_str_mv |
Repositório Institucional da UFG - Universidade Federal de Goiás (UFG) |
| repository.mail.fl_str_mv |
grt.bc@ufg.br |
| _version_ |
1846536642943254528 |