Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles
| Year of defense: | 2024 |
|---|---|
| Main author: | Fernandes, Rafael Macário |
| Advisor: | |
| Examination committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | eng |
| Defending institution: | Biblioteca Digital de Teses e Dissertações da USP |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in Portuguese: | |
| Access link: | https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/ |
| Abstract: | This master's thesis investigates the challenges of translating spatial language with open-source Large Language Models (LLMs) compared to traditional Neural Machine Translation (NMT) systems. It addresses the problem of accurately translating the semantics of spatial prepositions such as ACROSS, INTO, ONTO, and THROUGH, which are often translated into similar verbal or prepositional forms from English to Brazilian Portuguese (EN-PT-BR). Translating these prepositions correctly is crucial for maintaining the semantic integrity of the source content while ensuring fluency and adherence to the lexicalization patterns of the target language (House 2018; Talmy 2000b; Slobin 2005). The research begins by contextualizing the challenges of spatial language translation, highlighting the limitations of current NMT systems and the potential advantages of LLMs. A comprehensive literature review traces the evolution of translation theories, the development of NMT, and the rise of LLMs, while also describing the potential limitations of the current approach. The methodology employs a corpus-based analysis, assembling a bilingual dataset centered on spatial prepositions from TED Talks subtitles obtained through the OPUS platform. This dataset was pre-processed to support both automated metrics and manual error analysis. The automated metrics are BLEU, METEOR, BERTScore, COMET, and TER, while the manual error analysis identifies and categorizes the types of errors each system makes. The findings reveal that moderate-sized LLMs such as LLaMa-3-8B and Mixtral-8x7B achieve accuracy close to that of NMT systems such as Google, although this relationship is not always linear, as models like Gemma-7B presented similar performance in human reviews. However, LLMs generally presented other serious mistranslation errors, including interlanguage/code-switching (in) and anglicism (an) errors, failing to convey idiomaticity in the target language. Conversely, NMT systems achieved better general fluency and precision on machine translation tasks. The manual error analysis nevertheless underscores the ongoing challenges that both LLMs and NMT systems face in translating the nuances of spatial language: both groups present consistent numbers of polysemy (po) and syntactic projection (sp) errors, in which they either fail to translate a preposition's appropriate meaning or copy the lexicalization patterns of the source text into the target text (Fernandes et al. 2024; Oliveira and Fernandes 2022). The thesis concludes that, despite the advancements in LLMs, significant hurdles remain in translating spatial language accurately. It suggests that future research should focus on enhancing training datasets, refining model architectures, and developing more sophisticated evaluation metrics that better capture the semantic subtleties of spatial language. This study contributes to the field by providing a detailed comparison of model performance in spatial language translation from EN-PT-BR and by proposing directions for future improvements. |
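
The abstract describes a bilingual dataset centered on spatial prepositions but does not spell out how sentence pairs were selected. As a minimal sketch only, assuming selection by whole-word matching on the English side, the step might look like the following. The function names, the sample sentences, and the restriction to the four prepositions named in the abstract are all hypothetical and are not taken from the thesis.

```python
import re

# Assumption: only the four prepositions named in the abstract are targeted.
SPATIAL_PREPOSITIONS = ("across", "into", "onto", "through")

# Whole-word, case-insensitive match, so "throughout" does not count as "through".
_PATTERN = re.compile(
    r"\b(" + "|".join(SPATIAL_PREPOSITIONS) + r")\b", re.IGNORECASE
)


def find_prepositions(sentence):
    """Return the target prepositions found in an English sentence, lowercased."""
    return [match.group(1).lower() for match in _PATTERN.finditer(sentence)]


def filter_pairs(pairs):
    """Keep (english, portuguese) pairs whose English side has a target preposition."""
    kept = []
    for en, pt in pairs:
        hits = find_prepositions(en)
        if hits:
            kept.append({"en": en, "pt": pt, "prepositions": hits})
    return kept


# Hypothetical sample pairs, invented for illustration (not from the corpus).
sample = [
    ("She walked across the street.", "Ela atravessou a rua."),
    ("He smiled.", "Ele sorriu."),
    ("The cat jumped onto the table.", "O gato pulou em cima da mesa."),
    ("It rained throughout the day.", "Choveu o dia todo."),
]

selected = filter_pairs(sample)
for row in selected:
    print(row["prepositions"], "|", row["en"], "|", row["pt"])
```

A surface match like this also keeps non-spatial uses of the same words (for example, "look into the matter"), which is one reason a manual error analysis of polysemy, as the abstract describes, would still be needed after any automatic filter.
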
| id | USP_a1a7b88ed6cdb9b3fab31e70e5db9c5a |
|---|---|
| oai_identifier_str | oai:teses.usp.br:tde-10122024-105745 |
| network_acronym_str | USP |
| network_name_str | Biblioteca Digital de Teses e Dissertações da USP |
| repository_id_str | |
| dc.title.none.fl_str_mv | Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles; Decodificando a semântica espacial: uma análise comparativa da performance de LLMs de código aberto em comparação com sistemas NMT na tradução de legendas do EN-PT-BR |
| author | Fernandes, Rafael Macário |
| author_facet | Fernandes, Rafael Macário |
| author_role | author |
| dc.contributor.none.fl_str_mv | Lopes, Marcos Fernando |
| dc.contributor.author.fl_str_mv | Fernandes, Rafael Macário |
| dc.subject.por.fl_str_mv | Avaliação da tradução automática; Language typology; Machine translation evaluation; Modelos de Linguagem (LLMs); Natural Language Processing (NLP); Neural Machine Translation (NMT); Open-source Large Language Models (LLMs); Polissemia das preposições; Preposition polysemy; Processamento de Linguagem Natural (NLP); Semântica espacial; Spatial semantics; Tipologia linguística; Tradução Automática Neural (NMT) |
| publishDate | 2024 |
| dc.date.none.fl_str_mv | 2024-08-06 |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
| format | masterThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/ |
| url | https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/ |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | |
| dc.rights.driver.fl_str_mv | Release the content for public access; info:eu-repo/semantics/openAccess |
| rights_invalid_str_mv | Release the content for public access. |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.coverage.none.fl_str_mv | |
| dc.publisher.none.fl_str_mv | Biblioteca Digital de Teses e Dissertações da USP |
| publisher.none.fl_str_mv | Biblioteca Digital de Teses e Dissertações da USP |
| dc.source.none.fl_str_mv | reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP |
| instname_str | Universidade de São Paulo (USP) |
| instacron_str | USP |
| institution | USP |
| reponame_str | Biblioteca Digital de Teses e Dissertações da USP |
| collection | Biblioteca Digital de Teses e Dissertações da USP |
| repository.name.fl_str_mv | Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
| repository.mail.fl_str_mv | virginia@if.usp.br; atendimento@aguia.usp.br; virginia@if.usp.br |
| _version_ | 1818598504058060800 |