Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.

Bibliographic details
Year of defense: 2024
Main author: Cordeiro, Fábio Corrêa
Advisor: Coelho, Flávio Codeço
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in English:
Access link: https://hdl.handle.net/10438/35868
Abstract: Numerous companies are interested in gathering strategic information from their document repositories. This is especially relevant for the oil and gas industry, which has large repositories of geoscientific reports from several decades of production. Making this information available in a structured format can unlock valuable information buried in these mountains of data, which is crucial for supporting a wide range of industrial and academic applications. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English. This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and populating a knowledge graph, the Petro KGraph. We also developed a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. Throughout the text, we describe these resources and the process used to train machine learning models, and we review the relevant literature. Finally, we evaluate each model and the overall methodology. We developed an innovative Entity Linking approach that makes it possible to find new entities beyond those already known. Another crucial contribution is that the new resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it produced slightly better results than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing the results obtained with different graph embedding algorithms, and testing language models released after BERT.
id FGV_a3bbcd3349c4f423624eca7a6fe06552
oai_identifier_str oai:repositorio.fgv.br:10438/35868
network_acronym_str FGV
network_name_str Repositório Institucional do FGV (FGV Repositório Digital)
repository_id_str
spelling Cordeiro, Fábio Corrêa
Escolas::EMAp
Silva, Moacyr Alvim Horta Barbosa da
Souza, Renato Rocha
Coelho, Flávio Codeço
2024-09-16T13:42:13Z
2024-09-16T13:42:13Z
2024-08-13
https://hdl.handle.net/10438/35868
Petrobras
eng
Natural language processing
Information extraction
Knowledge graph
Ontology population
Graph embedding
Information retrieval
Matemática
Processamento da linguagem natural (Computação)
Ontologias (Recuperação da informação)
Recuperação da informação
Mineração de dados (Computação)
Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/doctoralThesis
info:eu-repo/semantics/openAccess
reponame:Repositório Institucional do FGV (FGV Repositório Digital)
instname:Fundação Getulio Vargas (FGV)
instacron:FGV
ORIGINAL; PetroKGraph - Fabio_Cordeiro.pdf; PDF; application/pdf; 7561652
LICENSE; license.txt; text/plain; charset=utf-8; 5112
TEXT; PetroKGraph - Fabio_Cordeiro.pdf.txt; Extracted text; text/plain; 100833
THUMBNAIL; PetroKGraph - Fabio_Cordeiro.pdf.jpg; Generated Thumbnail; image/jpeg; 2939
10438/35868
2024-09-17 12:02:35.75
open.access
oai:repositorio.fgv.br:10438/35868
https://repositorio.fgv.br
Repositório Institucional
PRI
http://bibliotecadigital.fgv.br/dspace-oai/request
opendoar:3974
2024-09-17T12:02:35
Repositório Institucional do FGV (FGV Repositório Digital) - Fundação Getulio Vargas (FGV)
false
dc.title.eng.fl_str_mv Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
title Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
spellingShingle Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
Cordeiro, Fábio Corrêa
Natural language processing
Information extraction
Knowledge graph
Ontology population
Graph embedding
Information retrieval
Matemática
Processamento da linguagem natural (Computação)
Ontologias (Recuperação da informação)
Recuperação da informação
Mineração de dados (Computação)
title_short Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
title_full Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
title_fullStr Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
title_full_unstemmed Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
title_sort Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
author Cordeiro, Fábio Corrêa
author_facet Cordeiro, Fábio Corrêa
author_role author
dc.contributor.unidadefgv.por.fl_str_mv Escolas::EMAp
dc.contributor.member.none.fl_str_mv Silva, Moacyr Alvim Horta Barbosa da
Souza, Renato Rocha
dc.contributor.author.fl_str_mv Cordeiro, Fábio Corrêa
dc.contributor.advisor1.fl_str_mv Coelho, Flávio Codeço
contributor_str_mv Coelho, Flávio Codeço
dc.subject.eng.fl_str_mv Natural language processing
Information extraction
Knowledge graph
Ontology population
Graph embedding
Information retrieval
topic Natural language processing
Information extraction
Knowledge graph
Ontology population
Graph embedding
Information retrieval
Matemática
Processamento da linguagem natural (Computação)
Ontologias (Recuperação da informação)
Recuperação da informação
Mineração de dados (Computação)
dc.subject.area.por.fl_str_mv Matemática
dc.subject.bibliodata.por.fl_str_mv Processamento da linguagem natural (Computação)
Ontologias (Recuperação da informação)
Recuperação da informação
Mineração de dados (Computação)
description Numerous companies are interested in gathering strategic information from their document repositories. This is especially relevant for the oil and gas industry, which has large repositories of geoscientific reports from several decades of production. Making this information available in a structured format can unlock valuable information buried in these mountains of data, which is crucial for supporting a wide range of industrial and academic applications. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English. This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and populating a knowledge graph, the Petro KGraph. We also developed a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. Throughout the text, we describe these resources and the process used to train machine learning models, and we review the relevant literature. Finally, we evaluate each model and the overall methodology. We developed an innovative Entity Linking approach that makes it possible to find new entities beyond those already known. Another crucial contribution is that the new resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it produced slightly better results than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing the results obtained with different graph embedding algorithms, and testing language models released after BERT.
publishDate 2024
dc.date.accessioned.fl_str_mv 2024-09-16T13:42:13Z
dc.date.available.fl_str_mv 2024-09-16T13:42:13Z
dc.date.issued.fl_str_mv 2024-08-13
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/10438/35868
url https://hdl.handle.net/10438/35868
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Institucional do FGV (FGV Repositório Digital)
instname:Fundação Getulio Vargas (FGV)
instacron:FGV
instname_str Fundação Getulio Vargas (FGV)
instacron_str FGV
institution FGV
reponame_str Repositório Institucional do FGV (FGV Repositório Digital)
collection Repositório Institucional do FGV (FGV Repositório Digital)
bitstream.url.fl_str_mv https://repositorio.fgv.br/bitstreams/c6c6620f-2c89-468a-8f54-46b4884cdaac/download
https://repositorio.fgv.br/bitstreams/489c6f97-2c72-41b3-a4ee-4e507aaf61eb/download
https://repositorio.fgv.br/bitstreams/fc18947a-dd8c-49ce-a219-9a2d9de38f08/download
https://repositorio.fgv.br/bitstreams/d28c07ab-fd91-44e2-98f3-31354bc9c7b9/download
bitstream.checksum.fl_str_mv 5ad8a5970253eaa2eb2fbab6d5e5a506
2a4b67231f701c416a809246e7a10077
c22ff48b1aff1c4a664f6a60edb378e4
f9dd05e73fd207e2946782988d8c742d
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional do FGV (FGV Repositório Digital) - Fundação Getulio Vargas (FGV)
repository.mail.fl_str_mv
_version_ 1827842584025759744