Using a provenance model and spatiotemporal information to integrate heterogeneous biodiversity semantic data.

Amanqui, Flor Karina Mamani

Bibliographic details
Year of defense:	2017
Main author:	Amanqui, Flor Karina Mamani
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	eng
Defense institution:	Biblioteca Digital de Teses e Dissertações da USP
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Biodiversidade; Dados abertos vinculados; Proveniência; Web semântica
Access link:	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-30012018-093704/
Abstract:	In the last few years, the Web of Data has been rapidly populated with biodiversity data. However, when researchers need to retrieve, integrate, and visualize these data, they must rely on semi-manual approaches, because biodiversity repositories such as GBIF offer data only as plain strings in CSV spreadsheets. There is no machine-readable metadata to add meaning (semantics) to the data. Without this metadata, automatic solutions are impossible, and labor-intensive semi-manual approaches to data integration and visualization are unavoidable. To reduce this problem, we present a novel architecture, called STBioData, that automatically links spatiotemporal biodiversity data from heterogeneous data sources to make searching, visualizing, and downloading relevant data easier. It supports the generation of interactive maps and the mapping between biodiversity data and the ontologies that describe them (such as Darwin Core, DBpedia, GeoSPARQL, Time, and PROV-O). A new biodiversity provenance model (BioProv), extending the W3C PROV Data Model, is proposed. BioProv enables applications that deal with biodiversity data to incorporate provenance data in their information. A web-based prototype built on this architecture was implemented. It supports biodiversity domain experts in tasks such as identifying a species' conservation status by automating most of the necessary steps. It uses collection data from important Brazilian biodiversity research institutions, and species geographic distributions and conservation status from the IUCN Red List of Threatened Species. These data are converted to linked data, enriched, and saved as RDF triples. Users can access the system through a web interface and search for collection and species distribution records by species name, time range, and geographic location. After a data set is retrieved, it can be displayed on an interactive map. The contents of the records are also shown (including provenance data), together with links to the original records at GBIF and IUCN. Users can export data sets as CSV or RDF files, or print them to PDF (including the visualizations). By choosing different time ranges, users can, for instance, follow the evolution of a species' distribution. The STBioData prototype was tested with use cases. For the tests, 46,211 collection records from SpeciesLink and 38,589 conservation status records (including maps) from IUCN, all for marine mammals, were converted to 2,233,782 RDF triples and linked using well-known ontologies. 90% of the biodiversity experts who used the tool to determine conservation status were able to find information about dolphin species with a satisfactory retrieval time and were able to understand the interactive map. In an information retrieval experiment, the prototype's semantic search performed, on average, 24% better in precision and 22% better in recall than SpeciesLink's keyword-based search. This does not take into account cases where only the prototype returned search results. These results demonstrate the value of publicly available linked biodiversity data with semantics.
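The conversion step described in the abstract (CSV records turned into RDF triples that reuse Darwin Core and PROV-O terms) can be illustrated with a minimal sketch. This is not the STBioData implementation and does not use BioProv terms, which the abstract does not list. The CSV column names, the `http://example.org/` IRIs, and the sample row are hypothetical and exist only for illustration. The Darwin Core, PROV-O, RDF, and XSD term IRIs are the published ones.

```python
import csv
import io

# Published vocabulary namespaces.
DWC = "http://rs.tdwg.org/dwc/terms/"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical base IRI for minted record identifiers (an assumption, not from the thesis).
BASE = "http://example.org/occurrence/"

# Hypothetical CSV layout and made-up sample row; real exports use different columns.
SAMPLE_CSV = """id,scientific_name,latitude,longitude,event_date,source_url
1,Sotalia guianensis,-23.95,-46.33,2010-05-14,http://example.org/source/1
"""


def literal(value, datatype=None):
    """Serialize a value as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype:
        return f'"{escaped}"^^<{XSD}{datatype}>'
    return f'"{escaped}"'


def record_to_triples(row):
    """Map one CSV row (a dict) to a list of N-Triples lines."""
    subject = f"<{BASE}{row['id']}>"
    statements = [
        (f"<{RDF_TYPE}>", f"<{DWC}Occurrence>"),
        (f"<{DWC}scientificName>", literal(row["scientific_name"])),
        (f"<{DWC}decimalLatitude>", literal(row["latitude"], "decimal")),
        (f"<{DWC}decimalLongitude>", literal(row["longitude"], "decimal")),
        (f"<{DWC}eventDate>", literal(row["event_date"], "date")),
        # Provenance: link the generated resource back to the original record.
        (f"<{PROV}wasDerivedFrom>", f"<{row['source_url']}>"),
    ]
    return [f"{subject} {predicate} {obj} ." for predicate, obj in statements]


def csv_to_ntriples(text):
    """Convert CSV text to a flat list of N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(record_to_triples(row))
    return triples


if __name__ == "__main__":
    print("\n".join(csv_to_ntriples(SAMPLE_CSV)))
```

Each row yields six triples in this sketch: one type statement, four Darwin Core properties, and one PROV-O link to the source record. Typed literals for coordinates and dates are what would make the spatial and temporal filtering mentioned in the abstract possible in a triple store.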

Item metadata

id	USP_dd82fc13b3181d1f3244232131aa31f3
oai_identifier_str	oai:teses.usp.br:tde-30012018-093704
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str
dc.title.none.fl_str_mv	Using a provenance model and spatiotemporal information to integrate heterogeneous biodiversity semantic data. Usando um modelo de proveniência e informações espaço-temporais para integrar dados semânticos heterogêneos sobre biodiversidade
title	Using a provenance model and spatiotemporal information to integrate heterogeneous biodiversity semantic data.
author	Amanqui, Flor Karina Mamani
author_facet	Amanqui, Flor Karina Mamani
author_role	author
dc.contributor.none.fl_str_mv	Moreira, Dilvan de Abreu
dc.contributor.author.fl_str_mv	Amanqui, Flor Karina Mamani
dc.subject.por.fl_str_mv	Biodiversidade Dados abertos vinculados Proveniência Web semântica
topic	Biodiversidade Dados abertos vinculados Proveniência Web semântica
publishDate	2017
dc.date.none.fl_str_mv	2017-11-20
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-30012018-093704/
url	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-30012018-093704/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br || atendimento@aguia.usp.br
_version_	1865491809455570944
