Reconhecimento de vídeos de ações humanas em um mundo aberto: contribuições teóricas e metodológicas

Human Action Recognition (HAR) is a widely studied subject in the current Computer Vision, Machine Learning, and Deep Learning research community. However, HAR is usually performed in a closed-world scenario, where all classes are known in advance. In real-world scenarios, the environment tends to c...

ver descrição completa

Autor principal: Gutoski, Matheus
Formato: Tese
Idioma: Português
Publicado em: Universidade Tecnológica Federal do Paraná 2022
Assuntos:
Acesso em linha: http://repositorio.utfpr.edu.br/jspui/handle/1/29245
Tags: Adicionar Tag
Sem tags, seja o primeiro a adicionar uma tag!
Resumo: Human Action Recognition (HAR) is a widely studied subject in the current Computer Vision, Machine Learning, and Deep Learning research community. However, HAR is usually performed in a closed-world scenario, where all classes are known in advance. In real-world scenarios, the environment tends to change, and new classes may appear. Traditional closed-world models are ill-equipped to deal with evolving environments and require retraining with large amounts of labeled data to recognize new categories. This work approaches HAR from the Unsupervised Open-World setting. In Unsupervised Open-World Recognition, the model needs to differentiate between known and unknown classes, automatically label the unknown classes, and incrementally learn them using minimal computational time and resources. Initially, this work tackles each of these tasks separately and, finally, as a combined framework that performs Unsupervised Open-World HAR. A metric learning solution is proposed for feature learning, with a model named Triplet Inflated 3D Convolutional Neural Network (TI3D). A method that automatically estimates the number of clusters was presented using a Hierarchical Agglomerative Clustering algorithm for automatically labeling unknown classes. For Incremental Learning (IL), this work proposed the Dual-Memory Extreme Value Machine (DM-EVM). The DM-EVM can perform IL under dynamical feature representations. The proposed framework was evaluated on publicly available video datasets and presented superior performance to other state-of-the-art methods.Overall, this work offers an interesting solution to the problem posed and contributed to the goal of developing models capable of operating in real-world dynamical environments.