Content-based video retrieval from natural language

Bibliographic details
Year of defense: 2022
Lead author: Jorge, Oliver Cabral
Advisor: Lopes, Heitor Silvério (https://orcid.org/0000-0003-3984-1432)
Defense committee: Lazzaretti, André Eugênio (https://orcid.org/0000-0003-1861-3369); Gomes, David Menotti (https://orcid.org/0000-0003-2430-2030); Gomes Junior, Luiz Celso (https://orcid.org/0000-0002-1534-9032); Bugatti, Pedro Henrique (https://orcid.org/0000-0001-9421-9254)
Document type: Master's dissertation
Defense date: 2022-09-12
Access type: Open access (CC BY 4.0, http://creativecommons.org/licenses/by/4.0/)
Language: English
Defense institution: Universidade Tecnológica Federal do Paraná (UTFPR), Curitiba, Brasil
Graduate program: Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
Alternative title (Portuguese): Recuperação de vídeos baseada em conteúdo a partir de linguagem natural
Keywords: Internet videos; Natural language processing (Computer science); Neural networks (Computer science); Computer vision; Online social networks; Information retrieval; Machine learning
Subject classification: CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO; Engenharia Elétrica
Citation: JORGE, Oliver Cabral. Content-based video retrieval from natural language. 2022. Dissertação (Mestrado em Engenharia Elétrica e Informática Industrial) - Universidade Tecnológica Federal do Paraná, Curitiba, 2022.
Access link: http://repositorio.utfpr.edu.br/jspui/handle/1/29964
Abstract: Videos are increasingly becoming the most common means of communication, driven by the popularization of affordable recording devices and of social networks such as TikTok and Instagram. The usual ways of searching for videos on these networks, as well as on search portals, rely on metadata attached to the videos through keywords and prior classification. Keyword search, however, requires knowing exactly what one is looking for, and it can be inefficient when the goal is to find a particular video from a description, superficial or not, of a particular scene, often leading to frustrating results. The objective of this work is to find a particular video within a list of available videos from a natural-language textual description, based only on the content of its scenes and without relying on previously cataloged metadata. Starting from a dataset of videos, each with a fixed number of scene descriptions, a Siamese network with a triplet loss function was trained to capture, in a shared embedding space, the similarity between two modalities: features extracted from a video and features extracted from a natural-language text. The final architecture and its parameter values were chosen empirically, following the configurations that yielded the best results. Because the videos are not labeled with groups or classes, and because the triplet loss relies on an anchor text plus two video examples, one positive and one negative, selecting the negative examples needed for training proved difficult. Two negative-selection strategies were therefore tested: random choice, and a directed choice based on the distances between the video descriptions available during training; the random strategy proved more effective. In the final tests, the target video was ranked first in 10.67% of queries and appeared among the top 10 in 49.80%. Beyond the numerical results, a qualitative analysis was conducted. It showed that the model does not behave satisfactorily for single-word queries, performing better on more elaborate descriptions. Good results are mainly associated with verbs and nouns, and less so with adjectives and adverbs. Moreover, the returned videos tend to share scenes or topics with the query text, indicating that the network captured the meaning of the original query. Overall, the results are promising and encourage continued research. Future work will include new models for extracting information from videos and texts, as well as deeper study of the controlled choice of negative video examples to strengthen training.
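The dissertation's code is not part of this record; the sketch below is a minimal, hypothetical illustration (in PyTorch, with invented names and dimensions) of the kind of two-branch Siamese model and triplet loss the abstract describes: each branch projects pre-extracted video or text features into a shared embedding space, and the loss pulls the anchor text toward the positive video and away from the negative one.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    # Projects pre-extracted video and text features into one shared space.
    # All dimensions here are illustrative assumptions, not the thesis values.
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, video_feats, text_feats):
        # L2-normalize so Euclidean distance behaves like cosine distance.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

model = TwoTowerModel()
loss_fn = nn.TripletMarginLoss(margin=0.2)  # margin value is an assumption

anchor_text = torch.randn(32, 768)   # dummy batch of text features
pos_video = torch.randn(32, 2048)    # videos matching each text
neg_video = torch.randn(32, 2048)    # videos that do not match

v_pos, t = model(pos_video, anchor_text)
v_neg, _ = model(neg_video, anchor_text)
loss = loss_fn(t, v_pos, v_neg)      # anchor, positive, negative
loss.backward()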
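The abstract also compares two ways of picking the negative video for each triplet. The functions below are a hypothetical sketch of both strategies, assuming videos are indexed and that precomputed embeddings of their descriptions are available for the directed criterion; the dissertation's actual criterion may differ.

import random
import numpy as np

def random_negative(anchor_idx, num_videos):
    # Random strategy: any video other than the anchor's own (positive) video.
    neg = random.randrange(num_videos)
    while neg == anchor_idx:
        neg = random.randrange(num_videos)
    return neg

def directed_negative(anchor_idx, desc_embeddings):
    # Directed strategy (one plausible reading): use distances between the
    # videos' description embeddings and pick the closest non-matching video,
    # i.e. a "hard" negative. desc_embeddings is an (N, d) array.
    dists = np.linalg.norm(desc_embeddings - desc_embeddings[anchor_idx], axis=1)
    dists[anchor_idx] = np.inf  # never pick the positive itself
    return int(np.argmin(dists))

That the random strategy won out here is plausible: distance-directed "hard" negatives are known to destabilize triplet training in some settings.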
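The reported 10.67% (top-1) and 49.80% (top-10) figures correspond to recall@k over a ranked retrieval list. A straightforward way to compute that metric is sketched below, assuming each query text at index i matches the video at the same index; that pairing convention is an assumption for illustration.

import numpy as np

def recall_at_k(text_embs, video_embs, k):
    # Fraction of queries whose matching video lands in the top-k by distance.
    # Assumes text i matches video i.
    hits = 0
    for i, t in enumerate(text_embs):
        dists = np.linalg.norm(video_embs - t, axis=1)
        topk = np.argsort(dists)[:k]
        hits += int(i in topk)
    return hits / len(text_embs)

# With real embeddings this would yield figures like the reported 10.67%
# (k=1) and 49.80% (k=10); the arrays below are random placeholders.
texts = np.random.randn(100, 256)
videos = np.random.randn(100, 256)
print(recall_at_k(texts, videos, 1), recall_at_k(texts, videos, 10))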