Multimodal Speaker Segmentation and Identification in Presence of Overlapped Speech Segments
Abstract
This paper addresses multimodal speaker segmentation and identification, with two main contributions:
First, we propose a hidden Markov model architecture
that performs fusion of three information sources: a multicamera system for participant localization, a microphone
array for speaker localization, and a speaker identification
system. Second, we present a novel likelihood model for the
microphone array observations for dealing with overlapped
speech. We propose a modification of the Steered Power
Response Generalized Cross Correlation Phase Transform
(SPR-GCC-PHAT) function that accounts for
possible microphone occlusions, and we use its local maxima
as the microphone array observations. The likelihood of the
extracted local maxima given positions of active speakers
is modeled using the Joint Probabilistic Data Association
(JPDA) framework.
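To make the data-association idea concrete, the sketch below sums over all feasible assignments of observed local maxima to active speakers. It is only an illustration under stated assumptions, not the likelihood model of the paper: positions are one-dimensional, each speaker produces at most one peak with a Gaussian spread, spurious peaks follow a Poisson clutter model with constant density, and the result is unnormalized. The function name `association_likelihood` and the parameters `p_detect`, `clutter_density`, and `sigma` are hypothetical.

```python
import math

def gaussian(z, mean, sigma):
    """Gaussian density of an observed peak position around a speaker position."""
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def association_likelihood(peaks, speakers, p_detect=0.9,
                           clutter_density=0.01, sigma=5.0):
    """Unnormalized likelihood of the peaks given the active speaker positions.

    Sums over every association event in which each speaker is either
    undetected or matched to one distinct peak; unmatched peaks are clutter.
    All parameter values are illustrative assumptions.
    """
    def recurse(i, used):
        if i == len(speakers):
            return 1.0
        # Speaker i produced no peak.
        total = (1.0 - p_detect) * recurse(i + 1, used)
        # Speaker i produced peak j (each peak is used at most once).
        for j, z in enumerate(peaks):
            if j not in used:
                weight = p_detect * gaussian(z, speakers[i], sigma) / clutter_density
                total += weight * recurse(i + 1, used | {j})
        return total

    return recurse(0, frozenset())

# Two peaks are better explained by two active speakers than by one.
two_speakers = association_likelihood([30.0, 120.0], [30.0, 120.0])
one_speaker = association_likelihood([30.0, 120.0], [30.0])
```

Comparing such values across hypotheses about which participants are active is what lets overlapped speech be scored directly, since a hypothesis with two active speakers can claim two peaks at once.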
The state in the proposed hidden Markov model is a vector
of the speaker activity indicators of present participants,
and the unknown parameter is the mapping of participants’
locations to the set of all possible participants’ identities. We
present and compare two approaches to the joint estimation of
the states and the unknown parameter: a forward
Bayesian filter that updates the estimates sequentially as
new observations arrive, and batch decoding
using the Viterbi algorithm.
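The difference between the two decoding strategies can be sketched on a toy model. The example below is a simplified illustration, not the model of the paper: it assumes two participants, a state that is the tuple of their activity indicators, independent per-participant transitions with a hypothetical `stay` probability, and hand-made per-frame likelihoods. It also leaves out the unknown mapping from locations to identities.

```python
import math
from itertools import product

# Hypothetical toy setup: two participants; a state is a tuple of activity indicators.
STATES = list(product((0, 1), repeat=2))

def transition(prev, cur, stay=0.9):
    """Each participant keeps their activity with probability `stay` (an assumption)."""
    p = 1.0
    for a, b in zip(prev, cur):
        p *= stay if a == b else 1.0 - stay
    return p

def forward_filter(likelihoods, prior=None):
    """Sequential filtering: posterior over states after each new observation."""
    belief = dict(prior) if prior else {s: 1.0 / len(STATES) for s in STATES}
    posteriors = []
    for t, lik in enumerate(likelihoods):
        if t > 0:  # predict step through the transition model
            belief = {s: sum(belief[r] * transition(r, s) for r in STATES)
                      for s in STATES}
        unnorm = {s: belief[s] * lik[s] for s in STATES}  # update step
        z = sum(unnorm.values())
        belief = {s: v / z for s, v in unnorm.items()}
        posteriors.append(belief)
    return posteriors

def viterbi(likelihoods, prior=None):
    """Batch decoding: the single most probable state sequence (log domain)."""
    prior = prior or {s: 1.0 / len(STATES) for s in STATES}
    score = {s: math.log(prior[s]) + math.log(likelihoods[0][s]) for s in STATES}
    backpointers = []
    for lik in likelihoods[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: score[r] + math.log(transition(r, s)))
            ptr[s] = best
            new_score[s] = (score[best] + math.log(transition(best, s))
                            + math.log(lik[s]))
        backpointers.append(ptr)
        score = new_score
    path = [max(STATES, key=score.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

def toy_likelihood(favored):
    """Hand-made observation likelihood that favors one state (illustrative only)."""
    return {s: (0.7 if s == favored else 0.1) for s in STATES}

# Three frames with only participant 1 active, then three frames of overlap.
frames = [toy_likelihood((1, 0))] * 3 + [toy_likelihood((1, 1))] * 3
filtered = forward_filter(frames)
path = viterbi(frames)
```

The filter commits to an estimate at each frame using only past observations, whereas Viterbi decoding waits for the whole sequence and can therefore place the change point using later evidence.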
Results show that, for both decoding algorithms, the
proposed method outperforms standard speaker segmentation
systems based on (a) speaker identification and (b)
microphone array processing on a dataset with a significant
portion (27.4%) of overlapped speech, scoring as high as
94.4% on the F-measure.


