Volume 45 Issue 11
Nov. 2023
Citation: LI Ping, GAO Qingyuan, XIA Yu, ZHANG Xiaoyong, CAO Yi. Voiceprint recognition method based on SE-DR-Res2Block[J]. Chinese Journal of Engineering, 2023, 45(11): 1962-1969. doi: 10.13374/j.issn2095-9389.2022.09.19.001

Voiceprint recognition method based on SE-DR-Res2Block

doi: 10.13374/j.issn2095-9389.2022.09.19.001
More Information
  • Corresponding author: E-mail: caoyi@jiangnan.edu.cn
  • Received Date: 2022-09-19
  • Available Online: 2023-03-06
  • Publish Date: 2023-11-01
  • To address the insufficient feature expression ability and weak generalization ability of the traditional Res2Net model in voiceprint recognition, this paper proposes a feature extraction module, SE-DR-Res2Block, that combines dense connections and residual connections. Combining low-semantic features, which carry spatial information and focus on detail, with high-semantic features, which concentrate on global information and abstract representations, compensates for the detailed information lost through abstraction. First, each layer in the dense connection structure takes the feature outputs of all previous layers as its input, realizing feature reuse. Second, the structure and working principle of the ECAPA-TDNN network built on the traditional Res2Block are introduced, and dense connections are applied to it to mine features more fully and extract them more efficiently. On this basis, the SE-Block is combined with the residual and dense connections to form the more efficient SE-DR-Res2Block. Compared with the traditional SE-Block structure, convolutional layers are used here instead of fully connected layers; they not only reduce the number of parameters to be trained but also allow weight sharing, thereby reducing overfitting. Effectively extracting feature information from different layers is therefore essential for obtaining a multiscale representation while maximizing feature reuse. However, collecting feature information at more scales with a large number of dense structures leads to a dramatic increase in parameters and computational complexity; replacing some dense structures with residual structures effectively prevents this increase while largely maintaining performance. Finally, to verify the effectiveness of the module, SE-Res2Block, Full-SE-Res2Block, SE-DR-Res2Block, and Full-SE-DR-Res2Block are adopted in different network models, and experiments are conducted on the Voxceleb1 and SITW (speakers in the wild) datasets. A performance comparison of Res2Net-50 models with different modules on the Voxceleb1 dataset shows that SE-DR-Res2Net-50 achieves the best equal error rate of 3.51%, which validates the adaptability of the module to different networks. Experiments with different modules on different networks and datasets were also compared and analyzed: the ECAPA-TDNN model using SE-DR-Res2Block reached optimal equal error rates of 2.24% and 3.65% on Voxceleb1 and SITW, respectively. These results verify the feature expression ability of the module, and the corresponding results on the different test datasets confirm its excellent generalization ability.
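To make the module described above concrete, here is a minimal PyTorch sketch of an SE-DR-Res2Block-style layer. It is an illustrative reading of the abstract only, not the authors' published implementation: the class names, the channel and scale sizes, and the exact wiring of the dense and residual paths are assumptions.

```python
# Illustrative sketch only: an SE-DR-Res2Block-style layer as read from the
# abstract. Class names, sizes, and exact wiring are assumptions, not the
# authors' reference implementation.
import torch
import torch.nn as nn


class SEBlock1d(nn.Module):
    """Squeeze-and-excitation that uses 1x1 convolutions in place of the
    fully connected layers of the original SE-Block, as the abstract
    describes, reducing trainable parameters and sharing weights."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool1d(1)  # global average over time
        self.excite = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        return x * self.excite(self.squeeze(x))  # re-weight channels


class SEDRRes2Block(nn.Module):
    """Res2Net-style multiscale block whose internal branches are densely
    connected (each branch sees the outputs of all previous branches),
    wrapped in an outer residual connection and followed by SE channel
    re-weighting."""

    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # Branch i consumes its own split plus the outputs of all earlier
        # convolutional branches, hence width * (i + 1) input channels.
        self.convs = nn.ModuleList([
            nn.Conv1d(width * (i + 1), width, kernel_size,
                      padding=kernel_size // 2)
            for i in range(scale - 1)
        ])
        self.se = SEBlock1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, self.scale, dim=1)  # scale splits, C/s each
        outs = [splits[0]]  # first split passes through unchanged
        for i, conv in enumerate(self.convs):
            # Dense connection: concatenate this branch's split with every
            # previous convolutional branch output (feature reuse).
            dense_in = torch.cat([splits[i + 1]] + outs[1:], dim=1)
            outs.append(torch.relu(conv(dense_in)))
        y = self.se(torch.cat(outs, dim=1))
        return x + y  # outer residual connection


if __name__ == "__main__":
    x = torch.randn(2, 512, 200)  # (batch, channels, frames)
    print(SEDRRes2Block(512)(x).shape)  # torch.Size([2, 512, 200])
```

Under this reading, such a block would slot into the frame-level layers of ECAPA-TDNN in place of its standard SE-Res2Block, which is how the comparison experiments summarized in the abstract are framed.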

