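Feature extraction from complex networks: a case study in the classification of genomic sequences (original title: Extração de características a partir de redes complexas: um estudo de caso na classificação de sequências genômicas)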

Main author: Conque, Bruno Mendes Moro
Format: Undergraduate final-year thesis (Trabalho de Conclusão de Curso, Graduação)
Language: Portuguese
Published: Universidade Tecnológica Federal do Paraná, 2022
Subjects:
Online access: http://repositorio.utfpr.edu.br/jspui/handle/1/28364
Abstract: Within the scope of bioinformatics, pattern recognition in genomic sequences can be used to classify regions of DNA (gene, promoter, non-coding). If good classification is achieved, a model can be generated to infer the class of unknown sequences. To that end, measures that represent the characteristics of these sequences must be identified. This work proposes two methods for characterizing genomic sequences, one based on the theory of complex networks and one based on information theory. The information-theory method uses the frequencies of nucleotides, dinucleotides and trinucleotides within a sequence to compute entropy, sum entropy and maximum entropy, which together make up the features. The complex-network method, in turn, represents each sequence as a network built from the occurrences of nucleotides, dinucleotides and trinucleotides within it. The measures from both methods are used as input to the classifiers SVM, MultiLayerPerceptron, J48, IBK, NaiveBayes and RandomForest. The two methods gave similar results, with a small difference in favor of complex networks. RandomForest showed the best results with approximately 86% accuracy, followed by J48 with 84% and MultiLayerPerceptron with 82%. The results indicate that this feature extraction approach can achieve good classification levels given the simplicity of the methods, since they use only the genomic sequences without any further knowledge about them.
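To illustrate the two kinds of features the abstract describes, here is a minimal Python sketch. It is not the thesis's implementation. The Shannon entropy of the k-mer frequency distribution (k = 1, 2, 3 for nucleotides, dinucleotides and trinucleotides) stands in for the entropy measure; the abstract does not define sum entropy or maximum entropy, so they are left out. The network construction (directed edges between consecutive overlapping k-mers, with out-degree as the example measure) is an assumption, since the abstract does not say how edges are formed. All function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq (k=1, 2, 3 for mono-, di-, trinucleotides)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shannon_entropy(seq, k):
    """Shannon entropy, in bits, of the k-mer frequency distribution of seq."""
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def kmer_network(seq, k):
    """Assumed construction: weighted directed edges between consecutive k-mers.

    Returns a Counter mapping (source k-mer, target k-mer) to the number of
    times that transition occurs in seq.
    """
    words = kmers(seq, k)
    return Counter(zip(words, words[1:]))


def out_degrees(edges):
    """Number of distinct successors of each node in the edge Counter."""
    deg = Counter()
    for (src, _dst) in edges:
        deg[src] += 1
    return deg


seq = "ACGTACGT"  # toy sequence, not data from the thesis
features = [shannon_entropy(seq, k) for k in (1, 2, 3)]  # one entropy per k
edges = kmer_network(seq, 1)
degrees = out_degrees(edges)
```

A feature vector for one sequence could then combine the entropy values with summary statistics of the network (for example mean degree) before being passed to a classifier; which network measures the thesis actually uses is not stated in the abstract.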