Aplicação e comparação de métodos policy gradient em problema de cadeias de suprimentos multiestágio com incertezas

Julio César Alves

Bibliographic details
Defense year:	2021
Main author:	Julio César Alves
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Minas Gerais
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords (translated from Portuguese):	Computing - Theses. Markov processes. Reinforcement learning - Theses. Deep learning - Theses.
Access link:	https://hdl.handle.net/1843/38570
Abstract:	Deep Reinforcement Learning (DRL) methods have been increasingly used in several areas of knowledge and, recently, interest in them has also grown in the Optimization community. In this work, we apply and compare Policy Gradient methods on the problem of planning the production and distribution of products in a multi-stage supply chain. Most previous works that use similar methods consider only serial supply chains or chains with up to two echelons, which generally limits the solution possibilities, and none of them considers stochastic lead times. We consider a chain with four echelons and two nodes per echelon, with uncertainty in the seasonal demands of end customers and in the lead times of production at suppliers and of transport along the chain. To our knowledge, this work is the first to apply DRL methods to such a chain configuration using a centralized approach, in which all decisions are taken by a single agent based on the uncertain demands of the end customers. We propose a Markov Decision Process (MDP) formulation and a Linear Programming (LP) model with uncertain parameters. The MDP formulation is adapted so that Policy Gradient methods obtain good results. In the first phase, after an initial case study, we applied the Proximal Policy Optimization (PPO) algorithm in 17 experimental scenarios, considering seasonal and regular uncertain demands (with different levels of uncertainty) and constant and stochastic lead times. In this phase, an agent built from the solution of an LP model (obtained by considering expected demands and average lead times) is used as a baseline. In the second phase, we compared five algorithms, Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), PPO, Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3), in 8 of the 17 previous scenarios, using statistical tools for a proper comparison of the algorithms. PPO and SAC achieved the best performance in the experiments, with PPO having the better execution time. The experimental results indicate that Policy Gradient methods, especially PPO, are suitable and competitive tools for the proposed problem. In the third phase, we moved to a multi-product version of the problem, generalizing the MDP formulation and the LP model. Experiments were carried out with the PPO algorithm in 16 multi-product scenarios, considering two and three products and different cost and demand configurations. The results indicate that, as in the original problem, PPO performs better than the baseline in scenarios with stochastic lead times.

Item metadata

id	UFMG_f56a2b37d7e7d549b7956ce2b1ca6d4d
oai_identifier_str	oai:repositorio.ufmg.br:1843/38570
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	Names listed in the record after the author (roles not labeled): Geraldo Robson Mateus; André Carlos Ponce de Leon Ferreira de Carvalho; Adriano Alonso Veloso; Cristiano Arbex Valle; Dilson Lucas Pereira. Lattes: http://lattes.cnpq.br/2547158184816891; http://lattes.cnpq.br/6289602045034353. ORCID: https://orcid.org/0000-0002-4848-9453. Country: Brazil. Program: Programa de Pós-Graduação em Ciência da Computação. Files: license_rdf (application/octet-stream, 811); Tese_versao_final.pdf (application/pdf, 4479522); license.txt (text/plain, 2118). Embedded license text (base64-encoded in the record): non-exclusive distribution license of the Repositório Institucional da UFMG.
dc.title.none.fl_str_mv	Aplicação e comparação de métodos policy gradient em problema de cadeias de suprimentos multiestágio com incertezas
dc.title.alternative.none.fl_str_mv	Applying and comparing policy gradient methods to multi-echelon supply chain problem with uncertainty
author	Julio César Alves
author_facet	Julio César Alves
author_role	author
dc.contributor.author.fl_str_mv	Julio César Alves
dc.subject.por.fl_str_mv	Computing - Theses. Markov processes. Reinforcement learning - Theses. Deep learning - Theses.
dc.subject.other.none.fl_str_mv	Multi-echelon supply chains; Sequential decision-making under uncertainty; Reinforcement learning; Deep learning; Policy gradient methods
publishDate	2021
dc.date.accessioned.fl_str_mv	2021-10-30T19:47:03Z 2025-09-09T01:15:32Z
dc.date.available.fl_str_mv	2021-10-30T19:47:03Z
dc.date.issued.fl_str_mv	2021-10-06
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1843/38570
url	https://hdl.handle.net/1843/38570
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	http://creativecommons.org/licenses/by-nc-nd/3.0/pt/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-nd/3.0/pt/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br//bitstreams/0a9dd8b5-f0cb-4059-a4aa-16dbf6a81352/download https://repositorio.ufmg.br//bitstreams/188e0a37-d829-4b2d-8e90-8f5b35ae1306/download https://repositorio.ufmg.br//bitstreams/ec7b2d15-03f0-48aa-8acc-cde43452b96e/download
bitstream.checksum.fl_str_mv	cfd6801dba008cb6adbd9838b81582ab 18b2f3afb9f413f9c2f69dbf4058b03a cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv	repositorio@ufmg.br
_version_	1862105707612995584
