Shahid, M., Beyan, C., & Murino, V. (2019). Voice activity detection by upper body motion analysis and unsupervised domain adaptation. Retrieved from https://hdl.handle.net/10446/260632

Voice activity detection by upper body motion analysis and unsupervised domain adaptation

Shahid, Muhammad; Beyan, Cigdem; Murino, Vittorio
2019-01-01

Abstract

We present a novel vision-based voice activity detection (VAD) method that relies only on automatic upper body motion (UBM) analysis. Traditionally, VAD is performed using audio features only, but the use of visual cues instead of audio can be desirable, especially when audio is unavailable due to technical, ethical, or legal issues. The psychology literature confirms that the way people move while speaking differs from the way they move while not speaking. This motivates us to claim that an effective representation of UBM can be used to detect "Who is Speaking and When". On the other hand, the way people move during speech varies considerably from culture to culture, and even from person to person within the same culture. This results in dissimilar UBM representations, such that the distributions of training and test data become disparate. To overcome this, we combine stacked sparse autoencoders and a simple subspace alignment method while a classifier is jointly learned using the VAD labels of the training data only. This yields new domain-invariant feature representations for training and test data, producing improved VAD results. Our approach is applicable to any person without requiring re-training. Tests on a publicly available real-life VAD dataset show better results compared to state-of-the-art video-only VAD methods. Moreover, an ablation study justifies the superiority of the proposed method and demonstrates the positive contribution of each component.
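The key technical step the abstract describes is making UBM feature distributions from different people comparable before classification. As a rough illustration, below is a minimal sketch of classical subspace alignment (Fernando et al., 2013), one of the two components the paper combines (the other being stacked sparse autoencoders). The variable names (Xs, ys, Xt), the PCA dimensionality, the random stand-in data, and the linear SVM classifier are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def subspace_alignment(Xs, Xt, d=50):
    """Project source and target features into aligned subspaces.

    Classical subspace alignment: learn d-dimensional PCA bases for
    source and target, then map the source basis onto the target one
    with the alignment matrix M = Ps^T Pt. Illustrative sketch only.
    """
    Ps = PCA(n_components=d).fit(Xs).components_.T   # (D, d) source basis
    Pt = PCA(n_components=d).fit(Xt).components_.T   # (D, d) target basis
    M = Ps.T @ Pt                                    # (d, d) alignment matrix
    Xs_aligned = Xs @ (Ps @ M)   # source features mapped toward target subspace
    Xt_projected = Xt @ Pt       # target features in their own subspace
    return Xs_aligned, Xt_projected

# Hypothetical demo: random data standing in for UBM descriptors.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 120))           # labeled source (training) people
ys = rng.integers(0, 2, size=500)          # speaking / not-speaking labels
Xt = rng.normal(size=(300, 120)) + 0.5     # new (test) person, shifted distribution

Xs_a, Xt_p = subspace_alignment(Xs, Xt, d=50)
clf = LinearSVC().fit(Xs_a, ys)            # trained on source labels only
speaking = clf.predict(Xt_p)               # per-sample VAD decisions
```

In the paper, the alignment is combined with features from stacked sparse autoencoders and the classifier is learned jointly, again using source labels only; the sketch separates these steps purely for readability.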
File attached to this record:
IC17_Voice Activity Detection by Upper Body Motion Analysis and Unsupervised.pdf (Adobe PDF, 1.11 MB)
Access: open access
Version: postprint - refereed version / accepted without refereeing
License: Aisberg default license

Use this identifier to cite or link to this document: https://hdl.handle.net/10446/260632
Citations
  • Scopus: 11
  • Web of Science (ISI): 3