Learning to schedule web page updates using genetic programming

Bibliographic details
Year of defense: 2013
Main author: Aécio Solano Rodrigues Santos
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://hdl.handle.net/1843/ESBF-97GJSQ
Abstract: One of the main challenges in designing a freshness-oriented scheduling policy is estimating the likelihood that a previously crawled web page has been modified on the web, so that the scheduler can use this estimate to determine the order in which pages should be revisited. A good estimate of which pages are more likely to have been modified allows the system to reduce the overall cost of monitoring its crawled pages to keep their stored versions up to date. In this work we present a novel approach that uses machine learning to generate score functions that produce accurate rankings of pages by their probability of having been modified on the web relative to their previously crawled versions. We propose a flexible framework that uses Genetic Programming to evolve score functions that estimate the likelihood that a web page has been modified. We present a thorough experimental evaluation of the benefits of using the framework over five state-of-the-art baselines. Considering the Change Ratio metric, the values produced by our best evolved function show an improvement from 0.52 to 0.71 on average over the baselines.
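The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's method: the `score` function below is a hand-written stand-in for an evolved score function, the `Page` fields are hypothetical features, and the Change Ratio is assumed here to mean the fraction of crawled pages that turned out to have changed.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    visits: int          # hypothetical feature: number of past visits
    changes_seen: int    # hypothetical feature: visits on which a change was detected
    changed_now: bool    # ground truth for this cycle, known only after crawling


def score(page: Page) -> float:
    """Hand-written stand-in for an evolved score function (an assumption)."""
    # Smoothed fraction of past visits on which the page had changed.
    return (page.changes_seen + 1) / (page.visits + 2)


def schedule(pages, budget):
    """Return the `budget` highest-scoring pages, best first."""
    return sorted(pages, key=score, reverse=True)[:budget]


def change_ratio(crawled):
    """Fraction of crawled pages that had actually changed (assumed definition)."""
    if not crawled:
        return 0.0
    return sum(p.changed_now for p in crawled) / len(crawled)


pages = [
    Page("a", visits=10, changes_seen=9, changed_now=True),
    Page("b", visits=10, changes_seen=1, changed_now=False),
    Page("c", visits=10, changes_seen=5, changed_now=True),
    Page("d", visits=10, changes_seen=0, changed_now=False),
]
selected = schedule(pages, budget=2)
ratio = change_ratio(selected)
```

With a crawl budget of two pages, a better ranking places more of the pages that really changed near the top, which raises the ratio. In a Genetic Programming setting, a quantity of this kind could serve as the fitness used to compare candidate score functions.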
id UFMG_e81a2f242dc9f7c14dcb8337f2dae633
oai_identifier_str oai:repositorio.ufmg.br:1843/ESBF-97GJSQ
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
dc.title.none.fl_str_mv Learning to schedule web page updates using genetic programming
title Learning to schedule web page updates using genetic programming
author Aécio Solano Rodrigues Santos
author_facet Aécio Solano Rodrigues Santos
author_role author
dc.contributor.author.fl_str_mv Aécio Solano Rodrigues Santos
dc.subject.por.fl_str_mv Genetic programming (Computer science)
Computing
Information retrieval systems
Incremental web page crawling
Genetic programming
Scheduling policies
publishDate 2013
dc.date.none.fl_str_mv 2013-03-11
2019-08-12T08:06:19Z
2019-08-12T08:06:19Z
2025-09-08T23:11:37Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1843/ESBF-97GJSQ
url https://hdl.handle.net/1843/ESBF-97GJSQ
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1856414088162181120