Processing a learner corpus to identify differences: the influence of task, genre and student background

Bibliographic details
Year of defense: 2016
Main author: Andressa Rodrigues Gomide
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://hdl.handle.net/1843/MGSS-A9KGY5
Abstract: This master's thesis deals with the technical and methodological aspects of creating, cleaning and processing a Brazilian university-level learner corpus, the Corpus do Inglês sem Fronteiras (CorIsF) v 1.0. The two main goals of the study are to make the processing of CorIsF replicable and to investigate and describe the variation of some linguistic characteristics across different learner groups, tasks and genres. The procedure was carried out in R, a free software environment for statistical computing and graphics, and was divided into four parts: dataset compilation and preprocessing; dataset processing; extraction of the key features; and data visualization. The first step deals with the method used to collect the data and to perform the initial cleaning, that is, eliminating unwanted data and keeping the relevant data. In the second step, CorIsF was subset into five small corpora covering different learner profiles, two different tasks, and one genre, and annotated with a part-of-speech (POS) tagger. In the third step, the variability of POS within the subcorpora, the frequency of types and tokens, and the usage of n-grams were investigated. In the final step, some exploratory data visualization was performed through the creation and analysis of plots and word clouds. After the preparation of the data, the language used in each subcorpus was contrasted and analysed; the results suggest that task, genre and student background are likely to influence learners' written production.
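Illustrative note: the third step described in the abstract (type and token frequencies and n-gram usage per subcorpus) can be sketched roughly as below. This is not the thesis code, which was written in R. It is a minimal Python sketch using only the standard library. The subcorpus names, the sample sentences, the tokenization rule and the function names are all assumptions made for illustration, and POS tagging is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_counts(tokens):
    """Return a pair: (number of distinct types, total number of tokens)."""
    return len(set(tokens)), len(tokens)


def ngram_counts(tokens, n):
    """Count contiguous n-grams; each n-gram is a tuple of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical subcorpora: invented sentences, not data from CorIsF.
subcorpora = {
    "task_a": "I think that the internet is important. I think that it helps students.",
    "task_b": "The graph shows that the number of students increased.",
}

for name, text in subcorpora.items():
    tokens = tokenize(text)
    types, total = type_token_counts(tokens)
    top_bigrams = ngram_counts(tokens, 2).most_common(2)
    print(name, "types:", types, "tokens:", total, "top bigrams:", top_bigrams)
```

Comparing these counts across subcorpora is the kind of contrast the abstract describes. A real replication would apply the same steps to the cleaned CorIsF texts and add the POS annotation.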
id UFMG_b3c3b13ee5f5900facf6bcf756bd18dd
oai_identifier_str oai:repositorio.ufmg.br:1843/MGSS-A9KGY5
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
dc.subject.por.fl_str_mv English language Study and teaching Portuguese speakers Brazil
English language Study and teaching Foreign speakers
Text linguistics
Second language acquisition
English language Grammar
Corpus linguistics
English for academic purposes
Learner corpus
Corpus design
dc.date.none.fl_str_mv 2016-03-21
2019-08-14T21:55:50Z
2019-08-14T21:55:50Z
2025-09-08T23:39:36Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br