Automated data extraction from PDF documents: application to large sets of educational tests (Extração automatizada de dados de documentos em formato PDF: aplicação a grandes conjuntos de exames educacionais)

Wiechork, Karina

Bibliographic details
Defense year:	2021
Main author:	Wiechork, Karina
Advisor:	Not provided by the institution
Examining committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
dARK ID:	ark:/26339/001300000nb65
Language:	por
Defending institution:	Universidade Federal de Santa Maria; Brasil; Ciência da Computação; UFSM; Programa de Pós-Graduação em Ciência da Computação; Centro de Tecnologia
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords:	PDF; Automated extraction; Evaluation; Educational tests; Ground truth; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	http://repositorio.ufsm.br/handle/1/23130
Abstract:	The massive production of PDF documents has motivated research on the automated extraction of the data these files contain. Many educational exams publish their tests in PDF format, and these files serve as study and research material. Segmenting, identifying, and automatically extracting the content of a test in PDF is challenging because the layout of this type of document varies widely. Research in document analysis and recognition, computer vision, and information retrieval has produced algorithms and tools that can be applied to this task, but determining their effectiveness for a given set of documents is not trivial. This work proposes an approach for evaluating tools that extract data from natively digital PDFs available in large educational test repositories. The evaluation uses the tests applied in Enade from 2004 to 2019. The files comprise 343 tests with 11,196 objective and discursive questions, plus all 396 answer keys, with 14,475 alternatives extracted from the objective questions. The ground truth for the tests was built with the Aletheia tool, which was used to define the regions of interest in each question. The extractions were performed with existing PDF data extraction tools, grouped into three categories: extraction of tabular data, extraction of textual content, and extraction of regions of interest. The results reveal limitations related to the diversity of layouts across the years in which Enade was applied and to the difficulty of identifying and extracting questions arranged in two or more columns on the same page. The extracted data provide useful information that can help students preparing for other tests, teachers who want to use these questions for classroom exercises, and course coordinators mapping students' difficulties from the questions in reports.

Item metadata

id	UFSM_694c535d1656b33a666ae3971e831f19
oai_identifier_str	oai:repositorio.ufsm.br:1/23130
network_acronym_str	UFSM
network_name_str	Manancial - Repositório Digital da UFSM
repository_id_str
dc.title.none.fl_str_mv	Extração automatizada de dados de documentos em formato PDF: aplicação a grandes conjuntos de exames educacionais; Automated data extraction from PDF documents: application to large sets of educational tests
title	Extração automatizada de dados de documentos em formato PDF: aplicação a grandes conjuntos de exames educacionais
author	Wiechork, Karina
author_facet	Wiechork, Karina
author_role	author
dc.contributor.none.fl_str_mv	Charao, Andrea Schwertner (http://lattes.cnpq.br/8251676116103188); Trois, Celio; Fabro, Marcos Didonet
dc.contributor.author.fl_str_mv	Wiechork, Karina
dc.subject.por.fl_str_mv	PDF; Extração automatizada; Avaliação; Exames educacionais; Ground truth; Automated extraction; Evaluation; Educational tests; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
topic	PDF; Extração automatizada; Avaliação; Exames educacionais; Ground truth; Automated extraction; Evaluation; Educational tests; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate	2021
dc.date.none.fl_str_mv	2021-12-02T19:14:59Z 2021-12-02T19:14:59Z 2021-04-16
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://repositorio.ufsm.br/handle/1/23130
dc.identifier.dark.fl_str_mv	ark:/26339/001300000nb65
url	http://repositorio.ufsm.br/handle/1/23130
identifier_str_mv	ark:/26339/001300000nb65
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International; info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivatives 4.0 International
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal de Santa Maria; Brasil; Ciência da Computação; UFSM; Programa de Pós-Graduação em Ciência da Computação; Centro de Tecnologia
publisher.none.fl_str_mv	Universidade Federal de Santa Maria; Brasil; Ciência da Computação; UFSM; Programa de Pós-Graduação em Ciência da Computação; Centro de Tecnologia
dc.source.none.fl_str_mv	reponame:Manancial - Repositório Digital da UFSM; instname:Universidade Federal de Santa Maria (UFSM); instacron:UFSM
instname_str	Universidade Federal de Santa Maria (UFSM)
instacron_str	UFSM
institution	UFSM
reponame_str	Manancial - Repositório Digital da UFSM
collection	Manancial - Repositório Digital da UFSM
repository.name.fl_str_mv	Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM)
repository.mail.fl_str_mv	atendimento.sib@ufsm.br; tedebc@gmail.com; manancial@ufsm.br
_version_	1847153423282077696
