Identificação e verificação de locutores em português e inglês utilizando transfer learning (Speaker identification and verification in Portuguese and English using transfer learning)
Main author: | Souza, Michel Gomes de |
---|---|
Format: | Undergraduate thesis (Trabalho de Conclusão de Curso) |
Language: | Portuguese |
Published: | Universidade Tecnológica Federal do Paraná, 2023 |
Online access: | http://repositorio.utfpr.edu.br/jspui/handle/1/30727 |
Abstract: |
Speech is one of the biometric modalities that can be used to recognize an individual. Speaker identification systems are therefore applicable to authentication problems such as automatic surveillance and forensic activities. The recognition process is divided into speaker identification and speaker verification. Most databases for automatic speaker recognition, such as VoxCeleb and Common Voice, are in foreign languages. For this reason, a database of Brazilian speakers, the Brazilian Speech Database, was selected. This is the first work to use this database, applying speaker identification and verification methods to evaluate the features extracted from it by transfer learning. A Common Voice subset was then subjected to the same methods for comparison. On the identification task, the best result for the Brazilian database was 0.70 ± 0.10, obtained with 10 patches using early fusion of the handcrafted features; for the English database, it was 0.68 ± 0.05, obtained with 10 patches using early fusion of all transfer learning extractors. On the verification task, the Brazilian Speech Database reached a rate of 0.97 ± 0.00 using 10 patches with MobileNet, and Common Voice reached a rate of 0.98 ± 0.00 with 10 patches for all descriptors applied. The complementarity of features combined through early fusion helped to obtain better results in some cases. Feature extraction by transfer learning, despite being more robust and sophisticated, produced results statistically equal to those of the handcrafted techniques. One factor that may have influenced the experiments is that the Brazilian Speech Database is text-dependent, while Common Voice is text-independent. |
---|
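The abstract does not say how early fusion was implemented. A common reading is feature-level concatenation: the vectors produced by each extractor for the same audio patch are joined into one longer vector before classification. The sketch below illustrates only that idea. The function name `early_fusion`, the feature dimensions, and the random data are assumptions for illustration and do not come from the thesis.

```python
import numpy as np


def early_fusion(feature_sets):
    """Concatenate per-extractor feature matrices along the feature axis.

    Each matrix has shape (n_patches, n_features_of_that_extractor).
    """
    n_rows = {f.shape[0] for f in feature_sets}
    if len(n_rows) != 1:
        raise ValueError("all feature sets must describe the same patches")
    return np.concatenate(feature_sets, axis=1)


# Hypothetical data: 4 audio patches described by two extractors.
rng = np.random.default_rng(0)
handcrafted = rng.normal(size=(4, 13))  # assumed: 13 handcrafted values per patch
deep = rng.normal(size=(4, 1280))       # assumed: embedding from a pretrained CNN

fused = early_fusion([handcrafted, deep])
print(fused.shape)  # (4, 1293)
```

The fused matrix can then be passed to a single classifier, which is what distinguishes early fusion from late fusion, where separate classifiers are trained per extractor and their outputs are combined.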