On the design of similarity functions for binary data

Veras, Marcelo Bruno de Almeida

Bibliographic details
Year of defense:	2022
Main author:	Veras, Marcelo Bruno de Almeida
Advisor:	Gomes, João Paulo Pordeus
Defense committee:	Not provided by the institution
Document type:	Thesis
Access type:	Open access
Language:	eng
Defending institution:	Not provided by the institution
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Similarity measure; Similarity function; Genetic programming; Protein function annotation; Protein structure; Sparse data
Access link:	http://www.repositorio.ufc.br/handle/riufc/66593
Abstract:	Binary feature vectors are a widely used representation in many areas of knowledge. They indicate the presence or absence of certain characteristics, so functions that operate on these representations, such as similarity functions, are important for recognizing how similar objects are to each other and for performing tasks such as classification, clustering, and outlier detection. A similarity function is a measure that quantifies this similarity and directly influences the performance of a proposed solution. Because of this, solving a problem properly depends on using a good similarity function. Two approaches are commonly used to choose one: searching and analyzing existing functions to find the one that best fits the problem, or creating a new function with the help of a specialist. In this work, both approaches are examined, and a new proposal is made for each, outlining its advantages and disadvantages. For the first approach, we present a methodology for designing similarity functions and a new function for dealing with sparse data, and we evaluate the proposed function through a series of experiments. For the second, we propose an automated framework that learns from data to generate similarity functions appropriate to a given task. This framework was developed to generate functions with the theoretical properties required of a similarity function. Again, a series of experiments is conducted to assess its importance. We evaluated the performance of both studies against 63 other similarity functions. Based on the results, we can state that in both cases our proposals outperformed classical functions in most of the tested cases.
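
To make the abstract's notion of a similarity function for binary data concrete, the sketch below computes two classical measures, the Jaccard index and the simple matching coefficient, from the usual four agreement counts. It is only an illustration: these are textbook functions, not the function or the framework proposed in the thesis, and the helper names (`contingency_counts`, `jaccard`, `simple_matching`) and the convention of returning 1.0 for empty input are assumptions made here.

```python
def contingency_counts(x, y):
    """Count agreements and disagreements between two binary vectors.

    Returns (a, b, c, d): a = both 1, b = only x is 1,
    c = only y is 1, d = both 0.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(x, y):
    """Jaccard index: a / (a + b + c); shared absences are ignored."""
    a, b, c, _ = contingency_counts(x, y)
    denom = a + b + c
    # Assumed convention: two all-zero vectors are treated as identical.
    return a / denom if denom else 1.0


def simple_matching(x, y):
    """Simple matching coefficient: (a + d) / (a + b + c + d)."""
    a, b, c, d = contingency_counts(x, y)
    n = a + b + c + d
    # Assumed convention: two empty vectors are treated as identical.
    return (a + d) / n if n else 1.0


# Two sparse vectors that share one of their two active features.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(f"Jaccard:         {jaccard(x, y):.3f}")          # 0.333
print(f"Simple matching: {simple_matching(x, y):.3f}")  # 0.800
```

On this sparse pair the simple matching coefficient is dominated by the shared zeros, while the Jaccard index ignores them. This is one reason the choice of function matters for sparse data.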

Item metadata

id	UFC-7_3b0c0c3fc997d14d0e4c7ae78afa9744
oai_identifier_str	oai:repositorio.ufc.br:riufc/66593
network_acronym_str	UFC-7
network_name_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
dc.title.pt_BR.fl_str_mv	On the design of similarity functions for binary data
author	Veras, Marcelo Bruno de Almeida
author_facet	Veras, Marcelo Bruno de Almeida
author_role	author
dc.contributor.author.fl_str_mv	Veras, Marcelo Bruno de Almeida
dc.contributor.advisor1.fl_str_mv	Gomes, João Paulo Pordeus
contributor_str_mv	Gomes, João Paulo Pordeus
dc.subject.por.fl_str_mv	Similarity measure; Similarity function; Genetic programming; Protein function annotation; Protein structure; Sparse data
topic	Similarity measure; Similarity function; Genetic programming; Protein function annotation; Protein structure; Sparse data
publishDate	2022
dc.date.accessioned.fl_str_mv	2022-06-22T14:04:24Z
dc.date.available.fl_str_mv	2022-06-22T14:04:24Z
dc.date.issued.fl_str_mv	2022
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	VERAS, Marcelo Bruno de Almeida. On the design of similarity functions for binary data. 2022. 62 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2022.
dc.identifier.uri.fl_str_mv	http://www.repositorio.ufc.br/handle/riufc/66593
identifier_str_mv	VERAS, Marcelo Bruno de Almeida. On the design of similarity functions for binary data. 2022. 62 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2022.
url	http://www.repositorio.ufc.br/handle/riufc/66593
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC
instname_str	Universidade Federal do Ceará (UFC)
instacron_str	UFC
institution	UFC
reponame_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
collection	Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv	http://repositorio.ufc.br/bitstream/riufc/66593/3/2022_tese_mbaveras.pdf http://repositorio.ufc.br/bitstream/riufc/66593/4/license.txt
bitstream.checksum.fl_str_mv	730dec77b0a884afb9eb0cb8addccea7 fb3ad2d23d9790966439580114baefaf
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv	bu@ufc.br || repositorio@ufc.br
_version_	1847793029311627264
