Contributors:  Dawei Liang, Guanyu Zhang



Background and Motivation

Human speech collected from commercial sensors can be insightful for applications such as automated social diaries [1] or the analysis of people's social interaction behaviors [2]. Reliable modeling of such speech data relies on solid ground-truth annotation. However, a common problem in real-world speech annotation is that speech quality varies considerably due to far-field recording and a lack of prior knowledge about the speakers. The problem is relatively mild in voice classification studies with large-scale online datasets, where labels can be obtained from clean speech [3], but it becomes more significant in naturalistic studies where annotators can only infer the events by listening back to the recordings. When speech instances are highly uncertain, for example containing two similar-sounding speakers or mixtures of far-field conversations and virtual vocal activities, annotators have to 'guess' the ground-truth labels. In such cases, a tool that helps researchers identify these instances would allow them to take extra steps toward better annotation or to account for such uncertainty in their studies. Hence, our project aims to explore a tool that can automatically discover speech instances of potentially higher uncertainty from real-world collected datasets.

Generally speaking, speech segments are relatively easy to discriminate if they contain only a single speaker or purely non-vocal background such as the sounds of home appliances. However, annotation becomes much more difficult for human listeners when 1) speech from different speakers overlaps, 2) different speakers have similar voice patterns, or 3) the conversation happens far from the sensing device, so that the actual voice is mixed with the background. Hence, our project aims to build a model that identifies speech instances of these types in the dataset. Note that there have been prior efforts on overlapped speech detection [4, 5]. We differ from these approaches in that we consider the target speaker in our modeling. In wearable sensing, for example, we typically care most about the wearer's speech, so when we detect speech mixtures we mean mixtures between the target speaker and the vocal background rather than overlapped segments of arbitrary speakers. In other words, prior work aims to discover general types of overlapped speech, while our goal is to discover only overlapped speech that contains the target speaker.



Contributions

In general, our contributions are summarized as follows:
1) We study pipelines that automatically discover speech segments likely to be uncertain to human listeners when annotating real-world collected sound data.
2) We show that by combining end-to-end classification with thresholding of the output confidence distribution of a voice classifier, we can discriminate speech instances consisting of mixed and uncertain speakers in the dataset.



Related Work

Prior work on target speech detection typically pays little attention to outliers and mixed speech segments in the data [6]. However, annotation errors due to uncertain instances can degrade model performance and reduce the reliability of reported results, especially in the case of intensive conversations. There has been prior work on sound classification with mixed backgrounds or components. For example, in [7] the researchers classified sound under different environmental backgrounds. Neural networks have also been shown to work well for counting speakers or components within sound segments [8]. There has also been prior work on detecting general types of mixed speech using statistical or deep learning methods [4, 5]. However, to the best of our knowledge, no prior work addresses the identification of uncertain sound instances in real-world audio recordings to support better annotation and characterization of the collected data. Hence, we hope to explore a feasible solution to this task.



Descriptions and Dataset

The dataset we use was originally obtained from a real-world data collection effort using a commercial wearable smartwatch. The data was collected in naturalistic home environments, where participants were instructed to perform several types of interactions. In total, we collected 41,973 seconds (11.7 hours) of audio recordings from 7 groups of participants. For privacy reasons, we only have access to deep embedding features extracted from a pre-trained neural network. Also, due to resource limits, we used only a subset of the dataset for model development and validation.

Figure 1.  The Fossil smartwatch used for our data collection

The data was labeled by human labelers prior to this study and categorized into 4 classes. The first type is pure speech from the smartwatch wearers, whom we refer to as the target speakers. The second type is background voice, i.e., voice from speakers other than the target speakers. The third and fourth types are what we aim to filter from the dataset: the third type consists of segments with mixed voices of the target and non-target speakers, and the fourth type contains speech instances that are hard for human listeners to identify (ambiguous speech). Typical reasons for 'ambiguous' recordings are that the target and non-target speakers sound too similar, or that the conversation was captured too far from the microphone to be clearly discriminated by listening back to the recording.

Figure 2.  Label distribution



Model Descriptions

We tried two types of methods to identify the uncertain speech instances. The first is straightforward end-to-end classification, for which we tried a random forest and a more advanced VGG-like neural network classifier [6]. The second method is different: rather than taking the predicted labels from the classifier, we directly examine its predicted probabilities.

The VGG-like classifier has the following structure:

Input => Conv1D[64] => Conv1D[128] => Conv1D[256] => Conv1D[512] => Dense[1024] => Dense[128] => Dense[1]
*The values in the brackets are the number of filters for the convolutional layers / the size of outputs for the dense (fully connected) layers.
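To make the architecture concrete, below is a minimal sketch of such a VGG-like 1-D network in Keras. Only the layer widths (64/128/256/512 filters, 1024/128/1 dense units) come from the description above; the kernel sizes, pooling layer, activations, sigmoid output, and the embedding dimensionality are assumptions made for illustration.

```python
# A minimal sketch of the VGG-like 1-D network described above (Keras).
# Kernel sizes, pooling, activations, and feature dimensionality are assumed.
from tensorflow.keras import layers, models

def build_vgg_s(num_frames=3, feat_dim=128):
    """num_frames: feature vectors per instance (e.g., 3 x 1-sec frames).
    feat_dim: dimensionality of the pre-trained embeddings (assumed)."""
    model = models.Sequential([
        layers.Input(shape=(num_frames, feat_dim)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
        layers.Conv1D(256, kernel_size=3, padding="same", activation="relu"),
        layers.Conv1D(512, kernel_size=3, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),  # collapse the temporal axis before the dense head
        layers.Dense(1024, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary output, e.g., background voice vs mixed
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```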

Our second method is not direct classification. The process is shown in Figure 3. In the training phase, everything is the same as for building a typical classifier: the classifier is fed with training data and labels. In the test phase, however, we test the model on classes it has never seen during training. We identify instances of such classes by thresholding the predicted probabilities, or confidence levels, output by the classifier.

Figure 3.  Model usage: training phase and test phase
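A minimal sketch of this training/testing pattern is given below, under assumed names: the classifier (e.g., the Keras model sketched earlier) is trained only on the two annotated classes (target speech = 1, background voice = 0); at test time it is applied to data that may also contain the unseen 'uncertain' class, and the raw output probabilities are kept for later thresholding.

```python
# Train on the two known classes, then score data from all classes.
def train_and_score(model, x_train, y_train, x_test, epochs=20, batch_size=64):
    """x_train, y_train: instances of the two known classes only.
    x_test: instances from all classes, including ones never seen in training.
    Returns the predicted probability of the 'target speaker' class per instance."""
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
    probs = model.predict(x_test).ravel()  # near 1 -> target, near 0 -> background voice
    return probs                           # mid-range values are candidates for 'uncertain'
```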

In addition, we apply some enhancement to the input features. The first approach is to take a running average over every few seconds of frames (a mean texture window). The second is to stack the frames into image-like inputs. With this enhancement, the input instances can incorporate the temporal shape of the speech.

Figure 4.  Data processing: mean texture window and stacked image-like feature
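Below is a minimal sketch of the two enhancement schemes, assuming the input is a (num_frames, feat_dim) array of per-second embedding vectors; the window size of 3 frames is illustrative.

```python
# Feature enhancement: non-overlapping running mean, or stacking into 2-D inputs.
import numpy as np

def mean_texture_window(frames, win=3):
    """Average every `win` consecutive frames (mean texture window)."""
    n = (len(frames) // win) * win
    return frames[:n].reshape(-1, win, frames.shape[1]).mean(axis=1)

def stack_frames(frames, win=3):
    """Stack every `win` consecutive frames into one image-like 2-D input."""
    n = (len(frames) // win) * win
    return frames[:n].reshape(-1, win, frames.shape[1])

# Example with 12 one-second frames of 128-d embeddings:
# feats = np.random.rand(12, 128)
# mean_texture_window(feats).shape  -> (4, 128)
# stack_frames(feats).shape         -> (4, 3, 128)
```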



Results

We first examine how well we can identify the speech mixtures in the dataset. As shown in Table 1, a neural network classifier can discover the mixed conversational segments directly, in an end-to-end classification manner. The neural network also generally performs better than the random forest for mixed speech discovery.

Classifier   | Balanced Acc (%) | Macro F1 (%)
RF (n=100)   | 67.66            | 68.85
RF (n=500)   | 67.89            | 69.78
VGG-S        | 71.97            | 74.87
Table 1.  Binary classification, background voice vs mixed (3-sec instances)

We further examine how different types of input enhancement change the performance. As shown in Table 2, all types of feature enhancement yield similar performance. Mean and variance features here refer to the running average / variance of the feature vectors across time. Interestingly, stacking the frames into image-like inputs does not help the network better capture temporal information in our task.

Input Feature Type | RF (n=100)             | VGG-S
Mean feature       | F1 68.85%; Acc 67.66%  | F1 74.87%; Acc 71.97%
Variance feature   | F1 69.44%; Acc 67.35%  | F1 72.23%; Acc 69.37%
Stacked 2D image   | -                      | F1 73.73%; Acc 72.20%
Table 2.  Binary classification with different input types, background vs mixed (3-sec instances)

The third part of the results shows how the input length of the speech segments affects performance. As shown in Figure 5, the segment size is varied in units of 1 second. Increasing the size of the input speech segments generally improves classification performance for both classifiers. This is expected, since the classifiers can capture more distinct patterns between sound types by incorporating information at a larger temporal scale.

Figure 5.  The relationship between temporal size and F1 score

Finally, we examined how the models can identify instances from uncertain sources. This is more challenging, since we need to separate the uncertain speaker data from the well-labeled conversations, and the uncertain speech in fact contains a large proportion of components from the labeled speakers (they simply cannot be discriminated by human listeners). We first tried an end-to-end approach using the neural network classifier. However, the model did not perform well: we obtained a macro F1 score of around 58% and a balanced accuracy of around 60% for classifying the 3 classes (target speakers, non-target speakers, and uncertain speakers). Hence, we proceeded with the second approach and trained the model only on target speech and background voice. We then tested the model on all 3 classes, including the new class of uncertain speech. We ran the test 5 times and report the mean output confidence levels of the classifier for the 3 classes. As shown in Table 3, the output confidence levels for the target speaker data are mostly close to 1, while those for the background voice instances are close to 0. The confidence levels for the uncertain speaker instances generally fall in the middle (0.683 on average over the 5 tests). This means that speech instances from uncertain speakers may not be classified end-to-end by our neural network classifier, but they can be identified in the dataset by thresholding the classifier's output confidence levels.

Class            | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Avg
Target           | 0.958  | 0.828  | 0.943  | 0.908  | 0.957  | 0.919
Background voice | 0.149  | 0.281  | 0.123  | 0.242  | 0.088  | 0.177
Uncertain        | 0.692  | 0.651  | 0.696  | 0.718  | 0.658  | 0.683
Table 3.  Confidence levels for each category (results of 5 test runs using VGG-S)
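For illustration only, one simple way to turn the per-class averages in Table 3 into a decision rule is to place band boundaries midway between the class means and flag any test instance whose confidence falls inside the band. Both the midpoint rule and the resulting thresholds are our assumptions rather than part of the original pipeline.

```python
# Flag likely 'uncertain' instances using a confidence band derived from Table 3.
import numpy as np

mean_background, mean_uncertain, mean_target = 0.177, 0.683, 0.919  # averages from Table 3

lower = (mean_background + mean_uncertain) / 2  # ~0.43 (assumed boundary)
upper = (mean_uncertain + mean_target) / 2      # ~0.80 (assumed boundary)

def flag_uncertain(probs, lo=lower, hi=upper):
    """probs: classifier output probabilities for the 'target speaker' class."""
    probs = np.asarray(probs)
    return (probs > lo) & (probs < hi)  # True -> likely an uncertain-speaker instance

# Example: flag_uncertain([0.95, 0.15, 0.70]) -> array([False, False, True])
```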


Discussions

This project was inspired by our observation that annotating real-world speech can be quite time-consuming, and less reliable under complicated recording conditions. There has been prior work to aid human annotation of audio, for example by visualizing the data. However, we have not yet seen tools that help discover speech data whose source components are uncertain to human listeners. Such uncertainty challenges the reliability of the ground truth and should therefore be handled with extra steps or considerations in speech analysis. In this project, we explored hybrid methods to discover such instances, but the performance is still far from perfect. Since our goal is to enable more accurate annotation, the pipeline itself must also be as accurate as possible, and there is still room for improvement toward that goal.



Conclusions and Future Directions

In this work, we explore the automatic identification of speech instances that are commonly uncertain to human annotators in real-world collected audio. To this end, we first examined end-to-end classification methods to directly distinguish speech instances in which the target speech is mixed with background voice. Our results show that such mixed speech segments can be discovered directly with a neural network classifier. Next, we examined the identification of speech instances from uncertain speakers. We found that such instances may not be classified directly by the neural network classifier, but they can be discovered based on the classifier's output confidence values. In the future, we will build a more systematic and integrated pipeline to identify such uncertain speech instances, and we will try to improve model performance by testing varying architectures. We hope that our work can help facilitate human annotation of real-world audio and improve the reliability of ground-truth labels.



References

[1] Wyatt, Danny, et al. "Towards the automated social analysis of situated speech data." Proceedings of the 10th international conference on Ubiquitous computing. 2008.

[2] Schmid Mast, Marianne, et al. "Social sensing for psychology: Automated interpersonal behavior assessment." Current Directions in Psychological Science 24.2 (2015): 154-160.

[3] Lukic, Yanick X., et al. "Learning embeddings for speaker clustering based on voice equality." 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2017.

[4] Boakye, Kofi, et al. "Overlapped speech detection for improved speaker diarization in multiparty meetings." 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008.

[5] Chowdhury, Shammur Absar, Morena Danieli, and Giuseppe Riccardi. "Annotating and categorizing competition in overlap speech." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

[6] Nadarajan, Amrutha, Krishna Somandepalli, and Shrikanth S. Narayanan. "Speaker agnostic foreground speech detection from audio recordings in workplace settings from wearable recorders." ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[7] Haubrick, Peter, and Juan Ye. "Robust Audio Sensing with Multi-Sound Classification." 2019 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2019.

[8] Andrei, Valentin, Horia Cucu, and Corneliu Burileanu. "Overlapped Speech Detection and Competing Speaker Counting – Humans Versus Deep Learning." IEEE Journal of Selected Topics in Signal Processing 13.4 (2019): 850-862.

Contact Us

Dawei Liang     Website: https://github.com/dawei-liang
Guanyu Zhang   Website: https://github.com/guanyu-zhang