MorphoMap: mapeamento automático de narrativas clínicas para uma terminologia médica

Pacheco, Edson José

Bibliographic details
Year of defense:	2009
Main author:	Pacheco, Edson José
Advisor:	Not provided by the institution
Examination committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Tecnológica Federal do Paraná; Curitiba; Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords (Portuguese and English):	Ontologia; Sistemas de processamento da fala; Linguística - Processamento de dados; Sistemas de recuperação da informação; Informática médica; Medicina - Processamento de dados; Registros médicos; Engenharia elétrica; Ontology; Speech processing systems; Computational linguistics; Information storage and retrieval systems; Medical informatics; Medical - Data processing; Medical records; Electric engineering
Access link:	http://repositorio.utfpr.edu.br/jspui/handle/1/124
Abstract:	Clinical documentation requires the representation of fine-grained descriptions of patients' history, evolution, and treatment. These descriptions are recorded in findings reports, medical orders, and evolution and discharge summaries. In most clinical environments, natural language is the main carrier of documentation. Written clinical jargon is commonly characterized by idiosyncratic terminology and a high frequency of highly context-dependent, ambiguous expressions (especially acronyms and abbreviations). Violations of spelling and grammar rules are common. The purpose of this work is to map free text from clinical narratives to a domain ontology (SNOMED CT). To this end, natural language processing (NLP) tools are combined with a semantic mapping heuristic. The study uses discharge summaries from the Hospital das Clínicas de Porto Alegre, RS, Brazil. Parts of these texts were used to create a training corpus through manual annotation supported by active learning. The corpus was used to train the NLP tools that identify parts of speech and cleanse "dirty" text passages. This made it possible to obtain relatively well-formed and unambiguous noun phrases. Heuristics were then implemented for the semantic mapping between these noun phrases (in Portuguese) and the terms describing SNOMED CT concepts (in English and Spanish). The mapping relies on morphosemantic indexing with a multilingual subword thesaurus provided by the MorphoSaurus system, on the resolution of acronyms, and on the identification of named entities (e.g. numbers). In this study, 80% of the summaries were analyzed and manually annotated, resulting in a domain corpus that supports the specialization of the OpenNLP system, which mainly follows the paradigm of statistical natural language processing (the resulting tagger reached an accuracy of 93.67%). In parallel, several techniques were used to validate and improve the subword thesaurus. To this end, the semantic representations of comparable test corpora from the medical domain in English, Spanish, and Portuguese were compared with regard to the relative frequency of semantic identifiers, improving corpus coverage (by 2% for Portuguese and 50% for Spanish). The result was used as input by a team of lexicon curators, who continuously fix errors and fill gaps in the trilingual thesaurus underlying the MorphoSaurus system. The progress of this work could be measured objectively using OHSUMED, a standard medical information retrieval benchmark. The mapping of text-encoded clinical information to a domain ontology is an area of high scientific and practical interest, because analysis requires structured data whereas clinical information is routinely recorded in a largely unstructured way. In this work the ontology used was SNOMED CT. The evaluation of the mapping methodology indicates an accuracy of 83.9%.

Item metadata

id	UTFPR-12_44834766cb9eebde3c7c5e13558b347b
oai_identifier_str	oai:repositorio.utfpr.edu.br:1/124
network_acronym_str	UTFPR-12
network_name_str	Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository_id_str
spelling	Clinical documentation requires the representation of complex situations such as clinical opinions, images and test results, and treatment plans, among others. Among health professionals, natural language is the main means of documentation. This kind of language is characterized by high syntactic and lexical flexibility, and ambiguity in sentences and terms is common. The objective of this work is to map information encoded in clinical narratives to a domain ontology (SNOMED CT). To achieve this, natural language processing (NLP) tools were applied and heuristics were adopted for mapping texts to ontologies. For the research, a sample of discharge summaries was obtained from the Hospital das Clínicas de Porto Alegre, RS, Brazil. Part of the summaries was manually annotated, using an active learning strategy, in order to prepare a corpus for training NLP tools. In parallel, algorithms were developed for preprocessing "dirty" texts (containing many errors, acronyms, abbreviations, etc.). Once the noun phrases had been identified by the NLP tools, several heuristics (acronym identification, spelling correction, suppression of numeric values, and conceptual distance) were applied for the mapping to SNOMED CT. The current version of SNOMED CT is not available in Portuguese, which requires tools for multilingual processing. For this reason, the research is part of the MorphoSaurus project, which develops and makes available a multilingual thesaurus (Portuguese, German, English, Spanish, Swedish, French) as well as software components that enable cross-lingual processing. To carry out the research, 80% of the summary collection was analyzed and manually annotated, resulting in a domain corpus (medical texts in Portuguese) that allowed the specialization of the OpenNLP software (based on the statistical model for NLP and selected after evaluating other available solutions). The tagger reached an accuracy of 93.67%. The multilingual MorphoSaurus thesaurus was extended, restructured, and evaluated (automatically, by comparison using comparable texts, that is, "translations of the same text into different languages") and was revised to correct existing imperfections. This improved linguistic coverage by 2% for Portuguese and by 50% for Spanish, as measured by precision and recall curves on the OHSUMED collection. Finally, encoding information from clinical narratives into a domain ontology is an area of high scientific and clinical interest, since a large part of the data produced during medical care is stored as free text rather than in structured fields. SNOMED CT was adopted for this purpose. The feasibility of the mapping methodology was demonstrated by evaluating the results of the automatic mapping against a manually developed gold standard, indicating an accuracy of 83.9%.
dc.title.none.fl_str_mv	MorphoMap: mapeamento automático de narrativas clínicas para uma terminologia médica
title	MorphoMap: mapeamento automático de narrativas clínicas para uma terminologia médica
author	Pacheco, Edson José
author_facet	Pacheco, Edson José
author_role	author
dc.contributor.none.fl_str_mv	Nohama, Percy Schulz, Stefan
dc.contributor.author.fl_str_mv	Pacheco, Edson José
dc.subject.por.fl_str_mv	Ontologia; Sistemas de processamento da fala; Linguística - Processamento de dados; Sistemas de recuperação da informação; Informática médica; Medicina - Processamento de dados; Registros médicos; Engenharia elétrica; Ontology; Speech processing systems; Computational linguistics; Information storage and retrieval systems; Medical informatics; Medical - Data processing; Medical records; Electric engineering
topic	Ontologia; Sistemas de processamento da fala; Linguística - Processamento de dados; Sistemas de recuperação da informação; Informática médica; Medicina - Processamento de dados; Registros médicos; Engenharia elétrica; Ontology; Speech processing systems; Computational linguistics; Information storage and retrieval systems; Medical informatics; Medical - Data processing; Medical records; Electric engineering
publishDate	2009
dc.date.none.fl_str_mv	14/10/2010 2009 2010-10-14T18:49:57Z 2010-10-14T18:49:57Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	PACHECO, Edson José. MorphoMap: mapeamento automático de narrativas clínicas para uma terminologia médica. 2009. 155 f. Tese (Doutorado em Engenharia Elétrica e Informática Industrial) – Universidade Tecnológica Federal do Paraná, Curitiba, 2009. http://repositorio.utfpr.edu.br/jspui/handle/1/124
identifier_str_mv	PACHECO, Edson José. MorphoMap: mapeamento automático de narrativas clínicas para uma terminologia médica. 2009. 155 f. Tese (Doutorado em Engenharia Elétrica e Informática Industrial) – Universidade Tecnológica Federal do Paraná, Curitiba, 2009.
url	http://repositorio.utfpr.edu.br/jspui/handle/1/124
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	5.76 MB; application/pdf
dc.publisher.none.fl_str_mv	Universidade Tecnológica Federal do Paraná; Curitiba; Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
publisher.none.fl_str_mv	Universidade Tecnológica Federal do Paraná; Curitiba; Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT)) instname:Universidade Tecnológica Federal do Paraná (UTFPR) instacron:UTFPR
instname_str	Universidade Tecnológica Federal do Paraná (UTFPR)
instacron_str	UTFPR
institution	UTFPR
reponame_str	Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
collection	Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository.name.fl_str_mv	Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT)) - Universidade Tecnológica Federal do Paraná (UTFPR)
repository.mail.fl_str_mv	riut@utfpr.edu.br || sibi@utfpr.edu.br
_version_	1850498253209993216
