Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado

Edilson Rosa da Rocha

Bibliographic details
Year of defense:	2024
Main author:	Edilson Rosa da Rocha
Advisor:	Not provided by the institution
Examination committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Minas Gerais
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords:	English language – Study and teaching; Corpus linguistics; English language – Lexicology
Access link:	https://hdl.handle.net/1843/75902
Abstract:	Linguistic studies of phraseology have been gaining credibility, especially regarding the formation and analysis of formulaic language (Hunston & Francis, 1999; Wray, 2002; Biber, 2009). In this research, we aimed to identify, analyze, and classify phrasal frames (p-frames) directly, from a corpus-driven perspective (Biber, 2012). We tested the hypothesis that p-frames can be identified independently of lexical bundles (LBs) by analyzing the internal structures of lexical units (ULs). We used the specialized Corpus of Articles of Applied Linguistics (CorAAL), compiled from 6 high-impact English-language journals in the field of Applied Linguistics and totaling 973,844 words from 150 articles published between 2014 and 2018. The list of ULs was generated with the N-gram tool of AntConc (Anthony, 2022). We investigated 5-word lexical sequences with one variable gap, a minimum frequency of 20 occurrences per million words, and a minimum dispersion of 10, which resulted in a final list of 66 ULs. We identified 11 ULs that are not associated with the LBs in the study by Biber et al. (1999), but their absence from that study does not automatically classify them as p-frames. For that classification, the parameters of variability and predictability (Tan & Römer, 2022) were combined with the frequency criterion. We employed agglomerative hierarchical clustering and R scripts to compare the frequency, variability, and internal entropies of the ULs, observing low variability (0.02 - 0.05) and predictability (0.0 - 0.0). For instance, the lexical units ((at, in) the + of the [end, beginning, time]), (english as a + language [foreign, second]), and (it is + to note [important]) exhibit characteristics of p-frames: they are discontinuous, and their gaps can be filled flexibly with function and content words. Thus, identifying p-frames only from continuous ULs excludes those with low variability, as highlighted in the analysis.
Furthermore, the internal slots of the 11 identified ULs (1*345, 12*45, 123*5) are filled with content words with a nominal base (Nb), verbal base (Vb), and adjectival base (Ab), as exemplified by the expression: the + of the [purpose(s), validity, teaching, use, majority, etc.]. These clusters show high levels of internal variability (from .11 to .74) and predictability (from .58 to .97) and are divided into subgroups as the similarity of the clusters being merged decreases. The second grouping presents distinct subdivisions in the dendrogram. The results show that, as internal variability increases, p-frames that are filled with content words and differ from each other tend to form distinct groups. Thus, statistical analysis using internal variability and entropy allowed the identification of p-frames not derived from LBs.
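
As a rough illustration of the measures named in the abstract, the sketch below converts the 20-per-million threshold into a raw count for a 973,844-word corpus and computes two slot statistics for a single frame. It is only a sketch under stated assumptions: variability is taken here to be the type-token ratio of the slot fillers and entropy to be Shannon entropy in bits, the function names are hypothetical, and the filler counts are invented. The thesis itself used R scripts, and its exact formulas are not given in this record.

```python
import math
from collections import Counter

CORPUS_SIZE = 973_844        # words in CorAAL, as reported in the abstract
MIN_FREQ_PER_MILLION = 20    # frequency threshold reported in the abstract


def raw_threshold(per_million, corpus_size):
    """Convert a per-million-words threshold into a raw count for this corpus."""
    return per_million * corpus_size / 1_000_000


def slot_variability(fillers):
    """Type-token ratio of the slot fillers (assumed definition of variability)."""
    return len(set(fillers)) / len(fillers)


def slot_entropy(fillers):
    """Shannon entropy in bits of the filler distribution (assumed definition)."""
    counts = Counter(fillers)
    total = len(fillers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Invented filler tokens for the frame "it is * to note" (illustration only).
fillers = ["important"] * 6 + ["interesting"] * 2

print(round(raw_threshold(MIN_FREQ_PER_MILLION, CORPUS_SIZE), 2))  # 19.48
print(slot_variability(fillers))                                   # 0.25
print(round(slot_entropy(fillers), 3))                             # 0.811
```

Under these assumptions, a frame whose slot always takes the same word has an entropy of 0.0, and both values rise as the slot admits more distinct fillers.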

Item metadata

id	UFMG_aabba0a03063eda4dacb3e2dab7fdffc
oai_identifier_str	oai:repositorio.ufmg.br:1843/75902
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	Edilson Rosa da Rocha; http://lattes.cnpq.br/6046442925731876; Deise Prina Dutra; http://lattes.cnpq.br/3000229202863164; Ana Eliza Pereira Bocorny; Heliana Ribeiro de Mello; Brasil; FALE - FACULDADE DE LETRAS; Programa de Pós-Graduação em Estudos Linguísticos; UFMG
dc.title.none.fl_str_mv	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
dc.title.alternative.none.fl_str_mv	Corpus-driven study: phrase frame in a specialized corpus
title	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
spellingShingle	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado; Edilson Rosa da Rocha; Língua inglesa – Estudo e ensino; Linguística de corpus; Língua inglesa – Lexicologia; Direcionado por corpus; Estruturas Lexicais; Pacotes Lexicais; Análise Multivariada de Dados; Clusters
title_short	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
title_full	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
title_fullStr	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
title_full_unstemmed	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
title_sort	Um estudo direcionado por corpora: estruturas lexicais em um corpus especializado
author	Edilson Rosa da Rocha
author_facet	Edilson Rosa da Rocha
author_role	author
dc.contributor.author.fl_str_mv	Edilson Rosa da Rocha
dc.subject.por.fl_str_mv	Língua inglesa – Estudo e ensino; Linguística de corpus; Língua inglesa – Lexicologia
topic	Língua inglesa – Estudo e ensino; Linguística de corpus; Língua inglesa – Lexicologia; Direcionado por corpus; Estruturas Lexicais; Pacotes Lexicais; Análise Multivariada de Dados; Clusters
dc.subject.other.none.fl_str_mv	Direcionado por corpus; Estruturas Lexicais; Pacotes Lexicais; Análise Multivariada de Dados; Clusters
publishDate	2024
dc.date.accessioned.fl_str_mv	2024-09-03T14:30:57Z 2025-09-08T23:50:43Z
dc.date.available.fl_str_mv	2024-09-03T14:30:57Z
dc.date.issued.fl_str_mv	2024-08-02
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1843/75902
url	https://hdl.handle.net/1843/75902
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	Programa Institucional de Internacionalização – CAPES - PrInt
dc.rights.driver.fl_str_mv	http://creativecommons.org/licenses/by-nd/3.0/pt/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nd/3.0/pt/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br//bitstreams/de7fa31b-41a5-40e3-b717-356588bed75d/download https://repositorio.ufmg.br//bitstreams/c3c76ec9-cda5-435e-9a82-34f8ebef195a/download https://repositorio.ufmg.br//bitstreams/5ff7fc05-2e29-4023-a760-eea54c7ccfcf/download
bitstream.checksum.fl_str_mv	00e5e6a57d5512d202d12cb48704dfd6 deb029ff96d9de1e0cff8ee00f963024 cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv	repositorio@ufmg.br
_version_	1862105857142030336
