Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles

Fernandes, Rafael Macário

Bibliographic details
Year of defense:	2024
Main author:	Fernandes, Rafael Macário
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defense institution:	Biblioteca Digital de Teses e Dissertações da USP
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Avaliação da tradução automática; Language typology; Machine translation evaluation; Modelos de Linguagem (LLMs); Natural Language Processing (NLP); Neural Machine Translation (NMT); Open-source Large Language Models (LLMs); Polissemia das preposições; Preposition polysemy; Processamento de Linguagem Natural (NLP); Semântica espacial; Spatial semantics; Tipologia linguística; Tradução Automática Neural (NMT)
Access link:	https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
Abstract:	This master's thesis investigates the challenges of translating spatial language using open-source Large Language Models (LLMs) compared to traditional Neural Machine Translation (NMT) systems. It addresses the problem of accurately translating the semantics of spatial prepositions such as ACROSS, INTO, ONTO, and THROUGH, which are often translated into similar verbal or prepositional forms from English to Portuguese (EN-PT-BR). Correctly translating these prepositions is crucial for maintaining the semantic integrity of the source content while ensuring fluency and adherence to the lexicalization patterns of the target language (House 2018; Talmy 2000b; Slobin 2005). The research begins by contextualizing the challenges of spatial language translation, highlighting the limitations of current NMT systems and the potential advantages of LLMs. A comprehensive literature review traces the evolution of translation theories, the development of NMT, and the rise of LLMs, while also describing the potential limitations of the current approach. The methodology employs a corpus-based analysis, assembling a bilingual dataset centered on spatial prepositions and comprising TED Talks subtitles from the OPUS platform. This dataset was pre-processed to support both automated metrics and manual error analysis. The evaluation metrics used include BLEU, METEOR, BERTScore, COMET, and TER, while the manual error analysis identifies and categorizes the types of errors each system makes. The findings reveal that moderate-sized LLMs such as LLaMa-3-8B and Mixtral-8x7B achieve accuracy close to NMT systems such as Google, although this relationship is not always linear, as models like Gemma-7B presented similar performance in human reviews. However, LLMs generally presented other serious mistranslation errors, including interlanguage/code-switching (in) and anglicism (an) errors, failing to convey idiomaticity in the target language. Conversely, NMT systems achieved better general fluency and precision for machine translation tasks. Manual error analysis, on the other hand, underscores the ongoing challenges both LLMs and NMT systems face in translating the nuances of spatial language, with both groups presenting consistent numbers of errors such as polysemy (po) and syntactic projection (sp) errors, where they either fail to translate a preposition's appropriate meaning or copy the lexicalization patterns from the source text into the target text (Fernandes et al. 2024; Oliveira and Fernandes 2022). The thesis concludes that, despite the advancements in LLMs, significant hurdles remain in translating spatial language accurately. It suggests that future research should focus on enhancing training datasets, refining model architectures, and developing more sophisticated evaluation metrics that better capture the semantic subtleties of spatial language. This study contributes to the field by providing a detailed comparison of model performance in EN-PT-BR spatial language translation and proposing directions for future improvements.

Item metadata

id	USP_a1a7b88ed6cdb9b3fab31e70e5db9c5a
oai_identifier_str	oai:teses.usp.br:tde-10122024-105745
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv	Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles / Decodificando a semântica espacial: uma análise comparativa da performance de LLMs de código aberto em comparação com sistemas NMT na tradução de legendas do EN-PT-BR
title	Decoding spatial semantics: a comparative analysis of the performance of open-source LLMs against NMT systems in translating EN-PT-BR subtitles
author	Fernandes, Rafael Macário
author_facet	Fernandes, Rafael Macário
author_role	author
dc.contributor.none.fl_str_mv	Lopes, Marcos Fernando
dc.contributor.author.fl_str_mv	Fernandes, Rafael Macário
dc.subject.por.fl_str_mv	Avaliação da tradução automática; Language typology; Machine translation evaluation; Modelos de Linguagem (LLMs); Natural Language Processing (NLP); Neural Machine Translation (NMT); Open-source Large Language Models (LLMs); Polissemia das preposições; Preposition polysemy; Processamento de Linguagem Natural (NLP); Semântica espacial; Spatial semantics; Tipologia linguística; Tradução Automática Neural (NMT)
topic	Avaliação da tradução automática; Language typology; Machine translation evaluation; Modelos de Linguagem (LLMs); Natural Language Processing (NLP); Neural Machine Translation (NMT); Open-source Large Language Models (LLMs); Polissemia das preposições; Preposition polysemy; Processamento de Linguagem Natural (NLP); Semântica espacial; Spatial semantics; Tipologia linguística; Tradução Automática Neural (NMT)
publishDate	2024
dc.date.none.fl_str_mv	2024-08-06
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
url	https://www.teses.usp.br/teses/disponiveis/8/8139/tde-10122024-105745/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br || atendimento@aguia.usp.br
_version_	1818598504058060800
