Identificação e extração semiautomática de contextos definitórios em corpus com vistas à redação da definição terminológica: proposta de sistematização linguística para a língua portuguesa

Kamikawachi, Dayse Simon Landim


Bibliographic details
Year of defense:	2014
Main author:	Kamikawachi, Dayse Simon Landim
Advisor:	Almeida, Gladis Maria de Barcellos
Examination committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de São Carlos, Câmpus São Carlos
Graduate program:	Programa de Pós-Graduação em Linguística - PPGL
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Contexto definitório; Definição terminológica; Processamento de linguagem natural; Terminologia
Keywords in English:	Definitory context; Terminological definition; Natural language processing; Terminology
CNPq knowledge area:	LINGUISTICA, LETRAS E ARTES::LINGUISTICA::TEORIA E ANALISE LINGUISTICA
Access link:	https://hdl.handle.net/20.500.14289/22229
Abstract:	Computational tools from Natural Language Processing (NLP) are essential for handling electronic texts. Commonly used resources include frequency counters, word lists, keyword lists, and concordancers. Terminologists rely on concordancers to view and extract definitory contexts for a given term, which are then useful when writing the terminological definition. Depending on the term and the size of the corpus, however, the concordance list can run to several hundred lines, which makes the task of defining a term extremely time-consuming. Although concordancers facilitate this task, studies in terminology (ALARCÓN, 2009) and NLP (KLAVANS; MURESAN, 2001) have shown that it is possible to develop a linguistic formalism that supports the creation or enrichment of a system capable of detecting such contexts automatically. Research in this direction has been carried out for English, Spanish, German, and French, among other languages, but Portuguese still lacks a more detailed linguistic description of how definitory contexts are constituted, one that could serve as the basis for building similar systems for Portuguese. The goals of this research are therefore: 1) to investigate the patterns of definitory contexts found in specialized corpora in Brazilian Portuguese; 2) to provide linguistic knowledge that can be formalized computationally and integrated into a system for the semiautomatic extraction of candidate definitory contexts; and 3) to evaluate the results generated. The study corpus consisted of scientific articles from the Bank of Portuguese (LAEL-PUC/SP), and six verbs were chosen for analysis: nomear ‘to name’, conceber ‘to conceive’, chamar ‘to call’, entender ‘to understand’, conhecer ‘to know’, and denominar ‘to denominate’.
The work produced: 1) a quantitative and qualitative description of each verbal definitory pattern; 2) a local grammar for the six verbs, to aid the semiautomatic retrieval of definitory contexts; 3) an exclusion grammar that serves as a stoplist for the local grammars; and 4) a set of heuristics for a semiautomatic classifier of definitory contexts. The evaluation of the lexical-syntactic rules for these six verbs showed a global average of 64% precision and 92% recall, an encouraging result compared with previous studies. As a result, it was possible to 1) validate the methodology, so that it can be extended to other lexical-syntactic patterns, and 2) obtain linguistic knowledge that can be integrated into a computational system for the semiautomatic extraction of candidate definitory contexts for Portuguese.
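The pipeline the abstract outlines (verb-based patterns, an exclusion list acting as a stoplist, and a precision/recall evaluation) can be illustrated with a minimal sketch. It is not the thesis's local grammar or classifier: the verb forms, the exclusion phrases, the sample sentences, and the names `VERB_FORMS`, `EXCLUSIONS`, `candidate_contexts`, and `precision_recall` are all assumptions made up for illustration.

```python
import re

# Hypothetical surface forms for the six verbs named in the abstract.
# A real local grammar would cover far more forms and syntactic context.
VERB_FORMS = {
    "nomear": ["nomeia", "nomeiam", "nomeado", "nomeada", "nomeados", "nomeadas"],
    "conceber": ["concebe", "concebem", "concebido", "concebida", "concebidos", "concebidas"],
    "chamar": ["chama", "chamam", "chamado", "chamada", "chamados", "chamadas"],
    "entender": ["entende", "entendem", "entendido", "entendida", "entendidos", "entendidas"],
    "conhecer": ["conhece", "conhecem", "conhecido", "conhecida", "conhecidos", "conhecidas"],
    "denominar": ["denomina", "denominam", "denominado", "denominada", "denominados", "denominadas"],
}

# Hypothetical exclusion phrases standing in for the exclusion grammar (stoplist).
EXCLUSIONS = ["chama a atenção", "chamam a atenção"]

# Map each surface form back to its verb and build one case-insensitive regex.
FORM_TO_VERB = {form: verb for verb, forms in VERB_FORMS.items() for form in forms}
VERB_RE = re.compile(
    r"\b(" + "|".join(re.escape(form) for form in FORM_TO_VERB) + r")\b",
    re.IGNORECASE,
)


def candidate_contexts(sentences):
    """Return (verb, sentence) pairs for sentences that contain a listed
    verb form and none of the exclusion phrases."""
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in EXCLUSIONS):
            continue  # filtered out by the stoplist
        match = VERB_RE.search(sentence)
        if match:
            hits.append((FORM_TO_VERB[match.group(1).lower()], sentence))
    return hits


def precision_recall(retrieved, relevant):
    """Precision = TP / retrieved, recall = TP / relevant (0.0 if undefined)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented sample sentences, not taken from the thesis corpus.
SENTENCES = [
    "O processo é denominado sinterização.",
    "Esse resultado chama a atenção dos pesquisadores.",
    "A amostra foi aquecida a 900 graus.",
    "Entende-se por corpus um conjunto de textos.",
]
```

Calling `candidate_contexts(SENTENCES)` keeps the first and fourth sentences, drops the second through the exclusion list, and ignores the third because no verb form matches. A human reviewer would then judge which candidates are true definitory contexts, and `precision_recall` compares the retrieved set against that judged set.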

Item metadata

id	SCAR_64cb74955d41a595189d89db5b262666
oai_identifier_str	oai:repositorio.ufscar.br:20.500.14289/22229
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str
spelling	Funding agency: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). Bitstreams: ORIGINAL TeseFinal-Dayse_Simon_Landim_Kamikawachi.pdf (application/pdf, 3437595); CC-LICENSE license_rdf (application/rdf+xml; charset=utf-8, 905); TEXT TeseFinal-Dayse_Simon_Landim_Kamikawachi.pdf.txt (Extracted text, text/plain, 103879); THUMBNAIL TeseFinal-Dayse_Simon_Landim_Kamikawachi.pdf.jpg (Generated Thumbnail, image/jpeg, 6968). OAI request URL: https://repositorio.ufscar.br/oai/request. OpenDOAR: 4322.
dc.title.por.fl_str_mv	Identificação e extração semiautomática de contextos definitórios em corpus com vistas à redação da definição terminológica: proposta de sistematização linguística para a língua portuguesa
dc.title.alternative.eng.fl_str_mv	Semiautomatic identification and extraction of definitory contexts in corpora to support the writing of terminological definitions: a proposal for the linguistic systematization of the Portuguese language
dc.title.alternative.spa.fl_str_mv	Identificación y extracción semiautomática de contextos definitorios en corpus para apoyar la redacción de definiciones terminológicas: una propuesta de sistematización lingüística de la lengua portuguesa
author	Kamikawachi, Dayse Simon Landim
author_facet	Kamikawachi, Dayse Simon Landim
author_role	author
dc.contributor.authorlattes.none.fl_str_mv	http://lattes.cnpq.br/7846039422136199
dc.contributor.author.fl_str_mv	Kamikawachi, Dayse Simon Landim
dc.contributor.advisor1.fl_str_mv	Almeida, Gladis Maria de Barcellos
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/4046789388750478
contributor_str_mv	Almeida, Gladis Maria de Barcellos
dc.subject.eng.fl_str_mv	Definitory context; Terminological definition; Natural language processing; Terminology
topic	Definitory context; Terminological definition; Natural language processing; Terminology; Contexto definitório; Definição terminológica; Processamento de linguagem natural; Terminologia; LINGUISTICA, LETRAS E ARTES::LINGUISTICA::TEORIA E ANALISE LINGUISTICA
dc.subject.por.fl_str_mv	Contexto definitório; Definição terminológica; Processamento de linguagem natural; Terminologia
dc.subject.cnpq.fl_str_mv	LINGUISTICA, LETRAS E ARTES::LINGUISTICA::TEORIA E ANALISE LINGUISTICA
publishDate	2014
dc.date.issued.fl_str_mv	2014-08-18
dc.date.accessioned.fl_str_mv	2025-06-17T12:16:06Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	KAMIKAWACHI, Dayse Simon Landim. Identificação e extração semiautomática de contextos definitórios em corpus com vistas à redação da definição terminológica: proposta de sistematização linguística para a língua portuguesa. 2014. Thesis (Doctorate in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2014. Available at: https://repositorio.ufscar.br/handle/20.500.14289/22229.
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/20.500.14289/22229
url	https://hdl.handle.net/20.500.14289/22229
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Linguística - PPGL
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstreams/d27aff78-cedf-4232-ad6d-f48672e63802/download https://repositorio.ufscar.br/bitstreams/99c2a2f3-0e4d-4315-89e7-98266f348afb/download https://repositorio.ufscar.br/bitstreams/bc9e6764-bb34-4f19-9f5d-22dd197ba97e/download https://repositorio.ufscar.br/bitstreams/786a5ba6-6943-45d8-ba36-95da232a92cc/download
bitstream.checksum.fl_str_mv	769df4736bba35228196576da7bc87b0 57e258e544f104f04afb1d5e5b4e53c0 95e13f6841aa5ba297869cfdb6cc8013 e9edd66e7ff73791ef1b9691fa4d0e6c
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv	repositorio.sibi@ufscar.br
_version_	1851688800718159872


Related records