Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.
| Year of defense: | 2024 |
|---|---|
| Main author: | Cordeiro, Fábio Corrêa |
| Advisor: | Coelho, Flávio Codeço |
| Defense committee: | Silva, Moacyr Alvim Horta Barbosa da; Souza, Renato Rocha |
| Document type: | Thesis |
| Access type: | Open access |
| Language: | eng |
| Defense institution: | Not provided by the institution |
| Graduate program: | Not provided by the institution |
| Department: | Not provided by the institution |
| Country: | Not provided by the institution |
| Keywords in English: | Natural language processing; Information extraction; Knowledge graph; Ontology population; Graph embedding; Information retrieval |
| Access link: | https://hdl.handle.net/10438/35868 |
| Abstract: | Numerous companies are interested in gathering strategic information from their document repositories. This is especially relevant for the oil and gas industry, which has large repositories of geoscientific reports from several decades of production. Making this information available in a structured format can unlock valuable insights buried in these mountains of data, which is crucial for supporting a wide range of industrial and academic applications. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English. This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and populating a knowledge graph, the Petro KGraph. We also developed a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. Throughout the text, we describe these resources and the process used to train the machine learning models, and we review the relevant literature. Finally, we evaluate each model and the overall methodology. We developed an innovative Entity Linking approach that can find new entities beyond those already known. Another crucial contribution is that the new resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it produced slightly better results than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing results across different graph embedding algorithms, and testing language models released after the BERT models. |
| id |
FGV_a3bbcd3349c4f423624eca7a6fe06552 |
|---|---|
| oai_identifier_str |
oai:repositorio.fgv.br:10438/35868 |
| network_acronym_str |
FGV |
| network_name_str |
Repositório Institucional do FGV (FGV Repositório Digital) |
| repository_id_str |
|
| spelling |
Cordeiro, Fábio Corrêa; Escolas::EMAp; Silva, Moacyr Alvim Horta Barbosa da; Souza, Renato Rocha; Coelho, Flávio Codeço; 2024-09-16T13:42:13Z; 2024-08-13; https://hdl.handle.net/10438/35868; Petrobras; eng; PetroKGraph - Fabio_Cordeiro.pdf (application/pdf); oai:repositorio.fgv.br:10438/35868; opendoar:3974; 2024-09-17T12:02:35 |
| dc.title.eng.fl_str_mv |
Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry. |
| title |
Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry. |
| spellingShingle |
Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.; Cordeiro, Fábio Corrêa; Natural language processing; Information extraction; Knowledge graph; Ontology population; Graph embedding; Information retrieval; Matemática; Processamento da linguagem natural (Computação); Ontologias (Recuperação da informação); Recuperação da informação; Mineração de dados (Computação) |
| author |
Cordeiro, Fábio Corrêa |
| author_facet |
Cordeiro, Fábio Corrêa |
| author_role |
author |
| dc.contributor.unidadefgv.por.fl_str_mv |
Escolas::EMAp |
| dc.contributor.member.none.fl_str_mv |
Silva, Moacyr Alvim Horta Barbosa da; Souza, Renato Rocha |
| dc.contributor.author.fl_str_mv |
Cordeiro, Fábio Corrêa |
| dc.contributor.advisor1.fl_str_mv |
Coelho, Flávio Codeço |
| contributor_str_mv |
Coelho, Flávio Codeço |
| dc.subject.eng.fl_str_mv |
Natural language processing; Information extraction; Knowledge graph; Ontology population; Graph embedding; Information retrieval |
| topic |
Natural language processing; Information extraction; Knowledge graph; Ontology population; Graph embedding; Information retrieval; Matemática; Processamento da linguagem natural (Computação); Ontologias (Recuperação da informação); Recuperação da informação; Mineração de dados (Computação) |
| dc.subject.area.por.fl_str_mv |
Matemática |
| dc.subject.bibliodata.por.fl_str_mv |
Processamento da linguagem natural (Computação); Ontologias (Recuperação da informação); Recuperação da informação; Mineração de dados (Computação) |
| publishDate |
2024 |
| dc.date.accessioned.fl_str_mv |
2024-09-16T13:42:13Z |
| dc.date.available.fl_str_mv |
2024-09-16T13:42:13Z |
| dc.date.issued.fl_str_mv |
2024-08-13 |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
| format |
doctoralThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/10438/35868 |
| url |
https://hdl.handle.net/10438/35868 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional do FGV (FGV Repositório Digital); instname:Fundação Getulio Vargas (FGV); instacron:FGV |
| instname_str |
Fundação Getulio Vargas (FGV) |
| instacron_str |
FGV |
| institution |
FGV |
| reponame_str |
Repositório Institucional do FGV (FGV Repositório Digital) |
| collection |
Repositório Institucional do FGV (FGV Repositório Digital) |
| bitstream.url.fl_str_mv |
https://repositorio.fgv.br/bitstreams/c6c6620f-2c89-468a-8f54-46b4884cdaac/download https://repositorio.fgv.br/bitstreams/489c6f97-2c72-41b3-a4ee-4e507aaf61eb/download https://repositorio.fgv.br/bitstreams/fc18947a-dd8c-49ce-a219-9a2d9de38f08/download https://repositorio.fgv.br/bitstreams/d28c07ab-fd91-44e2-98f3-31354bc9c7b9/download |
| bitstream.checksum.fl_str_mv |
5ad8a5970253eaa2eb2fbab6d5e5a506 2a4b67231f701c416a809246e7a10077 c22ff48b1aff1c4a664f6a60edb378e4 f9dd05e73fd207e2946782988d8c742d |
| bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 |
| repository.name.fl_str_mv |
Repositório Institucional do FGV (FGV Repositório Digital) - Fundação Getulio Vargas (FGV) |
| repository.mail.fl_str_mv |
|
| _version_ |
1827842584025759744 |