Learning to schedule web page updates using genetic programming
| Defense year: | 2013 |
|---|---|
| Main author: | Aécio Solano Rodrigues Santos |
| Advisor: | |
| Defense committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | por |
| Defending institution: | Universidade Federal de Minas Gerais |
| Graduate program: | Not reported by the institution |
| Department: | Not reported by the institution |
| Country: | Not reported by the institution |
| Keywords in Portuguese: | |
| Access link: | https://hdl.handle.net/1843/ESBF-97GJSQ |
| Abstract: | One of the main challenges in designing a freshness-oriented scheduling policy is estimating the likelihood that a previously crawled web page has been modified on the web, so that the scheduler can use this estimate to determine the order in which pages should be revisited. A good estimate of which pages are more likely to have been modified allows the system to reduce the overall cost of monitoring its crawled pages to keep their stored versions up to date. In this work we present a novel approach that uses machine learning to generate score functions that produce accurate rankings of pages by their probability of having been modified on the web since they were last crawled. We propose a flexible framework that uses Genetic Programming to evolve score functions that estimate the likelihood that a web page has been modified. We present a thorough experimental evaluation of the benefits of using the framework over five state-of-the-art baselines. On the Change Ratio metric, the values produced by our best evolved function show an improvement from 0.52 to 0.71 on average over the baselines. |
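
The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the `Page` features, the hand-written `example_score` (a stand-in for a function that Genetic Programming would evolve), and the definition of Change Ratio used here (the fraction of downloaded pages that had actually changed) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    url: str
    visits: int        # hypothetical feature: times the page was crawled so far
    changes: int       # hypothetical feature: times a crawl found it modified
    changed_now: bool  # ground truth for this cycle, known only after download


def example_score(page: Page) -> float:
    # Hand-written stand-in for an evolved score function:
    # smoothed frequency of past changes (an assumption, not the thesis's function).
    return (page.changes + 1) / (page.visits + 2)


def schedule(pages: List[Page], score: Callable[[Page], float], k: int) -> List[Page]:
    # Revisit the k pages with the highest estimated likelihood of change.
    return sorted(pages, key=score, reverse=True)[:k]


def change_ratio(crawled: List[Page]) -> float:
    # Assumed definition: fraction of downloaded pages that had actually changed.
    if not crawled:
        return 0.0
    return sum(p.changed_now for p in crawled) / len(crawled)


pages = [
    Page("a", visits=10, changes=9, changed_now=True),
    Page("b", visits=10, changes=1, changed_now=False),
    Page("c", visits=4, changes=3, changed_now=True),
    Page("d", visits=8, changes=0, changed_now=False),
]
top = schedule(pages, example_score, k=2)
print([p.url for p in top], change_ratio(top))
```

Under these assumptions, a better score function places more of the truly modified pages in the top `k`, which raises the Change Ratio for the same download budget.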
| id |
UFMG_e81a2f242dc9f7c14dcb8337f2dae633 |
|---|---|
| oai_identifier_str |
oai:repositorio.ufmg.br:1843/ESBF-97GJSQ |
| network_acronym_str |
UFMG |
| network_name_str |
Repositório Institucional da UFMG |
| repository_id_str |
|
| dc.title.none.fl_str_mv |
Learning to schedule web page updates using genetic programming |
| author |
Aécio Solano Rodrigues Santos |
| author_facet |
Aécio Solano Rodrigues Santos |
| author_role |
author |
| dc.contributor.author.fl_str_mv |
Aécio Solano Rodrigues Santos |
| dc.subject.por.fl_str_mv |
Genetic programming (Computing); Computing; Information retrieval systems; Incremental web page crawling; Genetic programming; Scheduling policies |
| publishDate |
2013 |
| dc.date.none.fl_str_mv |
2013-03-11 2019-08-12T08:06:19Z 2019-08-12T08:06:19Z 2025-09-08T23:11:37Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
| format |
masterThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1843/ESBF-97GJSQ |
| url |
https://hdl.handle.net/1843/ESBF-97GJSQ |
| dc.language.iso.fl_str_mv |
por |
| language |
por |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
| publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
| dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
| instname_str |
Universidade Federal de Minas Gerais (UFMG) |
| instacron_str |
UFMG |
| institution |
UFMG |
| reponame_str |
Repositório Institucional da UFMG |
| collection |
Repositório Institucional da UFMG |
| repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
| repository.mail.fl_str_mv |
repositorio@ufmg.br |
| _version_ |
1856414088162181120 |