Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles

Bibliographic details
Year of defense: 2024
Main author: Fernandes, Rafael Macário
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
Abstract: This master's thesis investigates the challenges of translating spatial language using open-source Large Language Models (LLMs) compared to traditional Neural Machine Translation (NMT) systems. It addresses the problem of accurately translating the semantics of spatial prepositions such as ACROSS, INTO, ONTO, and THROUGH, which are often translated into similar verbal or prepositional forms from English to Portuguese (EN-PT-BR). Correctly translating these prepositions is crucial for maintaining the semantic integrity of the source content while ensuring fluency and adherence to the lexicalization patterns of the target language (House 2018; Talmy 2000b; Slobin 2005). The research begins by contextualizing the challenges of spatial language translation, highlighting the limitations of current NMT systems and the potential advantages of LLMs. A comprehensive literature review traces the evolution of translation theories, the development of NMT, and the rise of LLMs, while also describing the potential limitations of the current approach. The methodology employs a corpus-based analysis, assembling a bilingual dataset centered on spatial prepositions and composed of TED Talks subtitles from the OPUS platform. This dataset was pre-processed to support both automated metrics and manual error analysis. The evaluation metrics used include BLEU, METEOR, BERTScore, COMET, and TER, while the manual error analysis identifies and categorizes the types of errors each system makes. The findings reveal that moderate-sized LLMs such as LLaMa-3-8B and Mixtral-8x7B achieve accuracy close to NMT systems such as Google, although this relationship is not always linear, as models like Gemma-7B presented similar performance in human reviews. However, LLMs generally presented other serious mistranslation errors, including interlanguage/code-switching (in) and anglicism (an) errors, failing to convey idiomaticity in the target language. Conversely, NMT systems achieved better general fluency and precision for machine translation tasks. Manual error analysis, on the other hand, underscores the ongoing challenges both LLMs and NMT systems face in translating the nuances of spatial language, with both groups presenting consistent numbers of errors like polysemy (po) and syntactic projection (sp) errors, where they either fail to translate a preposition's appropriate meaning or copy the lexicalization patterns from the source text into the target text (Fernandes et al. 2024; Oliveira and Fernandes 2022). The master's thesis concludes that despite the advancements in LLMs, significant hurdles remain in translating spatial language accurately. It suggests that future research should focus on enhancing training datasets, refining model architectures, and developing more sophisticated evaluation metrics that better capture the semantic subtleties of spatial language. This study contributes to the field by providing a detailed comparison of model performance in spatial language translation from EN-PT-BR and proposing directions for future improvements.
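Illustrative note (not part of the thesis record): the corpus-assembly step described in the abstract, selecting subtitle pairs whose English side contains one of the four spatial prepositions, might be sketched as below. This is a minimal sketch under stated assumptions, not the author's actual pipeline: the function names, the whole-word matching rule, and the sample sentence pairs are all hypothetical, and the record does not say how the OPUS data were loaded or filtered.

```python
import re

# The four prepositions named in the abstract.
PREPOSITIONS = ("across", "into", "onto", "through")

# Assumption: whole-word, case-insensitive matching, so that
# "throughout" or "intonation" do not count as hits.
_PATTERN = re.compile(r"\b(" + "|".join(PREPOSITIONS) + r")\b", re.IGNORECASE)


def find_prepositions(sentence):
    """Return the target prepositions found in an English sentence,
    lowercased, in order of appearance."""
    return [match.group(1).lower() for match in _PATTERN.finditer(sentence)]


def filter_pairs(pairs):
    """Keep (english, portuguese) pairs whose English side contains
    at least one target preposition."""
    kept = []
    for english, portuguese in pairs:
        found = find_prepositions(english)
        if found:
            kept.append({"en": english, "pt": portuguese, "prepositions": found})
    return kept


if __name__ == "__main__":
    # Hypothetical sentence pairs, invented for illustration only.
    sample = [
        ("She walked across the street.", "Ela atravessou a rua."),
        ("Thank you very much.", "Muito obrigado."),
    ]
    for row in filter_pairs(sample):
        print(row["prepositions"], "|", row["en"], "|", row["pt"])
```

A surface match like this cannot tell spatial from non-spatial uses of a preposition (for example "look into the matter"), so a filter of this kind would presumably need a manual or syntactic review pass afterwards.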
id USP_a1a7b88ed6cdb9b3fab31e70e5db9c5a
oai_identifier_str oai:teses.usp.br:tde-10122024-105745
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles
Decodificando a semântica espacial: uma análise comparativa da performance de LLMs de código aberto em comparação com sistemas NMT na tradução de legendas do EN-PT-BR
author Fernandes, Rafael Macário
author_facet Fernandes, Rafael Macário
author_role author
dc.contributor.none.fl_str_mv Lopes, Marcos Fernando
dc.contributor.author.fl_str_mv Fernandes, Rafael Macário
dc.subject.por.fl_str_mv Avaliação da tradução automática
Language typology
Machine translation evaluation
Modelos de Linguagem (LLMs)
Natural Language Processing (NLP)
Neural Machine Translation (NMT)
Open-source Large Language Models (LLMs)
Polissemia das preposições
Preposition polysemy
Processamento de Linguagem Natural (NLP)
Semântica espacial
Spatial semantics
Tipologia linguística
Tradução Automática Neural (NMT)
publishDate 2024
dc.date.none.fl_str_mv 2024-08-06
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
url https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br || atendimento@aguia.usp.br
_version_ 1818598504058060800