Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database

Detalhes bibliográficos
Ano de defesa: 2008
Autor(a) principal: Adriano Barbosa da Silva
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Tese
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade Federal de Minas Gerais
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: https://hdl.handle.net/1843/BUOS-8S4JGC
Resumo: This work aims to describe the technologies used for the Plant Defense Mechnaisms Database development, a database about the defense mechanisms against biotic and abiotic types of stresses in plants. For this purpose we have developed the program LAITOR, this is used in order to identify in the scientific literature the protein terms and names of biotic and abiotic stimuli (bioentities) along with terms indicating of a biological action (bioaction), nevertheless, validating those occurrences in the same sentence only. The tool NLPROT has been used for the initial bioentities tagging which were validated a posteriori by LAITOR. Later, for those protein terms which belong to the NCBI Gene database and with a corresponding record in the UniProtKB database, it was performed the clustering of sequences belonging to other organisms deposited in the same UniProtKB database, to achieve this aim we developed the Seed Linkage software. This software exploits direct and indirect multiple links from the sequences of these organisms to the initially determined seed. We found that the raw and relative scores of 400 and 0.3, respectively, are those which maximizes the inclusion of correct sequences in the rebuilding of a manually inspected clusters dataset. After the identification of 780 protein terms from the analysis of 7,306 scientific abstracts using the program LAITOR, 1,390 unique UniProtKB identifiers were used to cluster 15,669 sequences in the 611 clusters of the PubMed database. We have developed a software library, named SRS.php, to acquire the information referring to each of these proteins, using for this purpose the SRS server installed at the EMBL using the Web Services technology. With the usage of this library, a SOAP client accesses the server and retrieve, in a programmatic manner, the available data. After to perform the text mining analysis with the program LAITOR, the sequence clustering using the Seed Linkage software, and the subsequent data acquisition using the SOAP protocol, all these information were made available by a HTML server at http://www.biodados.icb.ufmg.br/pdm. In this website, users are able to perform a search using keywords or a BLAST-based similarity search. After the visualization of the retrieved records, a link is created for the co-occurrence of the protein terms in the text mining analysis, as well as for the phylogenetic tree of the proteins grouped in each PDM cluster. Furthermore, we have implemented the PDM SOAP server, which enables the distribution of PDM data through Web Services. We have created a method, named query_pdm, where any record deposited in this database can be accessed using SOAP. Summarizing, we present a set of methods implemented as software components, or programs in fact, which can be used in similar applications to PDM, being, therefore, freely available for the scientific community interested in such techniques
id UFMG_5bb143cf30903eba8b3ccfa6c421e797
oai_identifier_str oai:repositorio.ufmg.br:1843/BUOS-8S4JGC
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
spelling Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms DatabaseHomologia (Biologia)Banco de dadosBioinformáticaProteínasMineração de dados (Computação)BioinformáticaThis work aims to describe the technologies used for the Plant Defense Mechnaisms Database development, a database about the defense mechanisms against biotic and abiotic types of stresses in plants. For this purpose we have developed the program LAITOR, this is used in order to identify in the scientific literature the protein terms and names of biotic and abiotic stimuli (bioentities) along with terms indicating of a biological action (bioaction), nevertheless, validating those occurrences in the same sentence only. The tool NLPROT has been used for the initial bioentities tagging which were validated a posteriori by LAITOR. Later, for those protein terms which belong to the NCBI Gene database and with a corresponding record in the UniProtKB database, it was performed the clustering of sequences belonging to other organisms deposited in the same UniProtKB database, to achieve this aim we developed the Seed Linkage software. This software exploits direct and indirect multiple links from the sequences of these organisms to the initially determined seed. We found that the raw and relative scores of 400 and 0.3, respectively, are those which maximizes the inclusion of correct sequences in the rebuilding of a manually inspected clusters dataset. After the identification of 780 protein terms from the analysis of 7,306 scientific abstracts using the program LAITOR, 1,390 unique UniProtKB identifiers were used to cluster 15,669 sequences in the 611 clusters of the PubMed database. We have developed a software library, named SRS.php, to acquire the information referring to each of these proteins, using for this purpose the SRS server installed at the EMBL using the Web Services technology. With the usage of this library, a SOAP client accesses the server and retrieve, in a programmatic manner, the available data. After to perform the text mining analysis with the program LAITOR, the sequence clustering using the Seed Linkage software, and the subsequent data acquisition using the SOAP protocol, all these information were made available by a HTML server at http://www.biodados.icb.ufmg.br/pdm. In this website, users are able to perform a search using keywords or a BLAST-based similarity search. After the visualization of the retrieved records, a link is created for the co-occurrence of the protein terms in the text mining analysis, as well as for the phylogenetic tree of the proteins grouped in each PDM cluster. Furthermore, we have implemented the PDM SOAP server, which enables the distribution of PDM data through Web Services. We have created a method, named query_pdm, where any record deposited in this database can be accessed using SOAP. Summarizing, we present a set of methods implemented as software components, or programs in fact, which can be used in similar applications to PDM, being, therefore, freely available for the scientific community interested in such techniquesUniversidade Federal de Minas Gerais2019-08-14T01:21:44Z2025-09-09T01:21:26Z2019-08-14T01:21:44Z2008-05-26info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisapplication/pdfhttps://hdl.handle.net/1843/BUOS-8S4JGCAdriano Barbosa da Silvainfo:eu-repo/semantics/openAccessporreponame:Repositório Institucional da UFMGinstname:Universidade Federal de Minas Gerais (UFMG)instacron:UFMG2025-09-09T01:21:26Zoai:repositorio.ufmg.br:1843/BUOS-8S4JGCRepositório InstitucionalPUBhttps://repositorio.ufmg.br/oairepositorio@ufmg.bropendoar:2025-09-09T01:21:26Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)false
dc.title.none.fl_str_mv Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
title Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
spellingShingle Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
Adriano Barbosa da Silva
Homologia (Biologia)
Banco de dados
Bioinformática
Proteínas
Mineração de dados (Computação)
Bioinformática
title_short Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
title_full Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
title_fullStr Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
title_full_unstemmed Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
title_sort Mineração de texto, agrupamento de seqüências e integração de dados para o desenvolvimento da Plant Defense Mechanisms Database
author Adriano Barbosa da Silva
author_facet Adriano Barbosa da Silva
author_role author
dc.contributor.author.fl_str_mv Adriano Barbosa da Silva
dc.subject.por.fl_str_mv Homologia (Biologia)
Banco de dados
Bioinformática
Proteínas
Mineração de dados (Computação)
Bioinformática
topic Homologia (Biologia)
Banco de dados
Bioinformática
Proteínas
Mineração de dados (Computação)
Bioinformática
description This work aims to describe the technologies used for the Plant Defense Mechnaisms Database development, a database about the defense mechanisms against biotic and abiotic types of stresses in plants. For this purpose we have developed the program LAITOR, this is used in order to identify in the scientific literature the protein terms and names of biotic and abiotic stimuli (bioentities) along with terms indicating of a biological action (bioaction), nevertheless, validating those occurrences in the same sentence only. The tool NLPROT has been used for the initial bioentities tagging which were validated a posteriori by LAITOR. Later, for those protein terms which belong to the NCBI Gene database and with a corresponding record in the UniProtKB database, it was performed the clustering of sequences belonging to other organisms deposited in the same UniProtKB database, to achieve this aim we developed the Seed Linkage software. This software exploits direct and indirect multiple links from the sequences of these organisms to the initially determined seed. We found that the raw and relative scores of 400 and 0.3, respectively, are those which maximizes the inclusion of correct sequences in the rebuilding of a manually inspected clusters dataset. After the identification of 780 protein terms from the analysis of 7,306 scientific abstracts using the program LAITOR, 1,390 unique UniProtKB identifiers were used to cluster 15,669 sequences in the 611 clusters of the PubMed database. We have developed a software library, named SRS.php, to acquire the information referring to each of these proteins, using for this purpose the SRS server installed at the EMBL using the Web Services technology. With the usage of this library, a SOAP client accesses the server and retrieve, in a programmatic manner, the available data. After to perform the text mining analysis with the program LAITOR, the sequence clustering using the Seed Linkage software, and the subsequent data acquisition using the SOAP protocol, all these information were made available by a HTML server at http://www.biodados.icb.ufmg.br/pdm. In this website, users are able to perform a search using keywords or a BLAST-based similarity search. After the visualization of the retrieved records, a link is created for the co-occurrence of the protein terms in the text mining analysis, as well as for the phylogenetic tree of the proteins grouped in each PDM cluster. Furthermore, we have implemented the PDM SOAP server, which enables the distribution of PDM data through Web Services. We have created a method, named query_pdm, where any record deposited in this database can be accessed using SOAP. Summarizing, we present a set of methods implemented as software components, or programs in fact, which can be used in similar applications to PDM, being, therefore, freely available for the scientific community interested in such techniques
publishDate 2008
dc.date.none.fl_str_mv 2008-05-26
2019-08-14T01:21:44Z
2019-08-14T01:21:44Z
2025-09-09T01:21:26Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1843/BUOS-8S4JGC
url https://hdl.handle.net/1843/BUOS-8S4JGC
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1856414107457028096