Towards completely automatized HTML form discovery on the web

Moraes, Maurício Coutinho

Bibliographic details
Year of defense:	2013
Main author:	Moraes, Maurício Coutinho
Advisor:	Heuser, Carlos Alberto
Examination committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	eng
Defending institution:	Not provided by the institution
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Recuperacao : Informacao; HTML (Linguagem de marcação); Serviços Web; Banco : Dados
Keywords in English:	Deep web; Hidden web; Crawling; Domain-specific search; Query form discovery
Access link:	http://hdl.handle.net/10183/70194
Abstract:	The discovery of HTML forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. We offer an up-to-date review of 19 methods for the discovery of domain-specific query forms that do not involve form submission. This thesis details these methods and discusses how form discovery has become increasingly automated over time, providing the context in which we propose a novel method to advance the current state-of-the-art in domain-specific structured HTML form discovery. The current state-of-the-art in domain-specific structured HTML form discovery consists mainly of methods that directly or indirectly depend heavily on human intervention. This thesis proposes and evaluates a method capable of discovering domain-specific structured HTML forms on the Web with very little effort from a human expert, who is required only to define the name of the domain of interest (i.e., the domain for which the discovery should be made). The forms discovered by our proposal can be directly used as training data by some form classifiers.
Our experimental validation used thousands of real Web forms, divided into six domains, including a representative subset of the publicly available DeepPeep form base (DEEPPEEP, 2010; DEEPPEEP REPOSITORY, 2011). Our results show that it is feasible to mitigate the demanding manual work required by two cutting-edge form classifiers (i.e., GFC and DSFC (BARBOSA; FREIRE, 2007a)), at the cost of a relatively small loss in effectiveness.
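
The two tasks named in the abstract (locating forms, then deciding which of them are query forms) can be illustrated with a minimal sketch. This is not the method proposed in the thesis, nor GFC or DSFC: the names `FormCollector`, `extract_forms` and `looks_like_query_form` are hypothetical, and the filtering rule (no password field, at least one text, search or select field) is a naive assumption chosen only to show a check that needs no form submission.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect every <form> in a page together with its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> found
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # <input> fields are described by their type; others by tag name.
            if tag == "input":
                field_type = (attrs.get("type") or "text").lower()
            else:
                field_type = tag
            self._current["fields"].append((field_type, attrs.get("name") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_forms(html):
    """Task 1 (locating): return all forms found in one HTML document."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_like_query_form(form):
    """Task 2 (identifying): hypothetical heuristic, not the thesis's classifier.

    Assumption: forms with a password field are login/registration forms,
    and a query form exposes at least one text, search or select field.
    """
    types = [field_type for field_type, _ in form["fields"]]
    if "password" in types:
        return False
    return any(t in ("text", "search", "select") for t in types)


if __name__ == "__main__":
    page = """
    <form action="/login" method="post">
      <input type="text" name="user"><input type="password" name="pw">
    </form>
    <form action="/search">
      <input type="text" name="title">
      <select name="genre"><option>any</option></select>
      <input type="submit">
    </form>
    """
    for form in extract_forms(page):
        print(form["action"], looks_like_query_form(form))
```

A real system would replace the heuristic with a trained classifier and would also assign each form to a domain, which this sketch does not attempt.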

Item metadata

id	URGS_950eb5f8a3ea0435bf6703a6db560118
oai_identifier_str	oai:www.lume.ufrgs.br:10183/70194
network_acronym_str	URGS
network_name_str	Biblioteca Digital de Teses e Dissertações da UFRGS
repository_id_str
spelling	Moraes, Maurício Coutinho; Heuser, Carlos Alberto; Moreira, Viviane Pereira; 2013-04-11T01:47:42Z; 2013; http://hdl.handle.net/10183/70194; 000875012; application/pdf; eng; Recuperacao : Informacao; HTML (Linguagem de marcação); Serviços Web; Banco : Dados; Deep web; Hidden web; Crawling; Domain-specific search; Query form discovery; Towards completely automatized HTML form discovery on the web; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/doctoralThesis; Universidade Federal do Rio Grande do Sul; Instituto de Informática; Programa de Pós-Graduação em Computação; Porto Alegre, BR-RS; 2013; doutorado; info:eu-repo/semantics/openAccess; reponame:Biblioteca Digital de Teses e Dissertações da UFRGS; instname:Universidade Federal do Rio Grande do Sul (UFRGS); instacron:UFRGS; TEXT; 000875012.pdf.txt; 000875012.pdf.txt; Extracted Text; text/plain; 218494; http://www.lume.ufrgs.br/bitstream/10183/70194/2/000875012.pdf.txt; 660d24a4b40ced34d21191d3934e9a10; MD5; 2; ORIGINAL; 000875012.pdf; Full text (English); application/pdf; 875587; http://www.lume.ufrgs.br/bitstream/10183/70194/1/000875012.pdf; 0110f5e494b6e56973dd63e1fbbd7a2b; MD5; 1; 10183/70194; 2021-05-26 04:45:46.862724; oai:www.lume.ufrgs.br:10183/70194; Biblioteca Digital de Teses e Dissertações; https://lume.ufrgs.br/handle/10183/2; PUB; https://lume.ufrgs.br/oai/request; lume@ufrgs.br||lume@ufrgs.br; opendoar:1853; 2021-05-26T07:45:46; Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS); false
dc.title.pt_BR.fl_str_mv	Towards completely automatized HTML form discovery on the web
title	Towards completely automatized HTML form discovery on the web
spellingShingle	Towards completely automatized HTML form discovery on the web Moraes, Maurício Coutinho Recuperacao : Informacao HTML (Linguagem de marcação) Serviços Web Banco : Dados Deep web Hidden web Crawling Domain-specific search Query form discovery
title_short	Towards completely automatized HTML form discovery on the web
title_full	Towards completely automatized HTML form discovery on the web
title_fullStr	Towards completely automatized HTML form discovery on the web
title_full_unstemmed	Towards completely automatized HTML form discovery on the web
title_sort	Towards completely automatized HTML form discovery on the web
author	Moraes, Maurício Coutinho
author_facet	Moraes, Maurício Coutinho
author_role	author
dc.contributor.author.fl_str_mv	Moraes, Maurício Coutinho
dc.contributor.advisor1.fl_str_mv	Heuser, Carlos Alberto
dc.contributor.advisor-co1.fl_str_mv	Moreira, Viviane Pereira
contributor_str_mv	Heuser, Carlos Alberto Moreira, Viviane Pereira
dc.subject.por.fl_str_mv	Recuperacao : Informacao HTML (Linguagem de marcação) Serviços Web Banco : Dados
topic	Recuperacao : Informacao HTML (Linguagem de marcação) Serviços Web Banco : Dados Deep web Hidden web Crawling Domain-specific search Query form discovery
dc.subject.eng.fl_str_mv	Deep web Hidden web Crawling Domain-specific search Query form discovery
description	The discovery of HTML forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. We offer an up-to-date review of 19 methods for the discovery of domain-specific query forms that do not involve form submission. This thesis details these methods and discusses how form discovery has become increasingly automated over time, providing the context in which we propose a novel method to advance the current state-of-the-art in domain-specific structured HTML form discovery. The current state-of-the-art in domain-specific structured HTML form discovery consists mainly of methods that directly or indirectly depend heavily on human intervention. This thesis proposes and evaluates a method capable of discovering domain-specific structured HTML forms on the Web with very little effort from a human expert, who is required only to define the name of the domain of interest (i.e., the domain for which the discovery should be made). The forms discovered by our proposal can be directly used as training data by some form classifiers.
Our experimental validation used thousands of real Web forms, divided into six domains, including a representative subset of the publicly available DeepPeep form base (DEEPPEEP, 2010; DEEPPEEP REPOSITORY, 2011). Our results show that it is feasible to mitigate the demanding manual work required by two cutting-edge form classifiers (i.e., GFC and DSFC (BARBOSA; FREIRE, 2007a)), at the cost of a relatively small loss in effectiveness.
publishDate	2013
dc.date.accessioned.fl_str_mv	2013-04-11T01:47:42Z
dc.date.issued.fl_str_mv	2013
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10183/70194
dc.identifier.nrb.pt_BR.fl_str_mv	000875012
url	http://hdl.handle.net/10183/70194
identifier_str_mv	000875012
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da UFRGS instname:Universidade Federal do Rio Grande do Sul (UFRGS) instacron:UFRGS
instname_str	Universidade Federal do Rio Grande do Sul (UFRGS)
instacron_str	UFRGS
institution	UFRGS
reponame_str	Biblioteca Digital de Teses e Dissertações da UFRGS
collection	Biblioteca Digital de Teses e Dissertações da UFRGS
bitstream.url.fl_str_mv	http://www.lume.ufrgs.br/bitstream/10183/70194/2/000875012.pdf.txt http://www.lume.ufrgs.br/bitstream/10183/70194/1/000875012.pdf
bitstream.checksum.fl_str_mv	660d24a4b40ced34d21191d3934e9a10 0110f5e494b6e56973dd63e1fbbd7a2b
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
repository.mail.fl_str_mv	lume@ufrgs.br||lume@ufrgs.br
_version_	1831315925081522176
