Sound Context Classification based on Joint Learning Model and Multi-Spectrogram Features

Authors

  • Dat Ngo
  • Lam Pham
  • Anh Nguyen
  • Tien Ly
  • Khoa Pham
  • Thanh Ngo

DOI:

https://doi.org/10.47839/ijc.21.2.2595

Keywords:

Acoustic scene classification, Spectrogram, Convolutional neural network, Recurrent neural network, Joint learning architecture, Feature extraction

Abstract

This article presents a deep learning framework for Acoustic Scene Classification (ASC), the task of classifying environments from the sounds they produce. To develop the framework, we first carry out a comprehensive analysis of the spectrogram representations extracted from the sound scene input and then propose the best multi-spectrogram combination for front-end feature extraction. For back-end classification, we propose a novel joint learning model with a parallel architecture of a Convolutional Neural Network (CNN) and a Convolutional Recurrent Neural Network (C-RNN), which efficiently learns both the spatial features and the temporal sequences of a spectrogram input. The experimental results show that our proposed framework is general and robust for ASC tasks through three main contributions. First, we identify the most effective spectrogram combination for specific datasets, which no previous publication has analyzed. Second, our joint learning architecture of CNN and C-RNN outperforms the CNN-only model proposed as the baseline in this paper. Finally, our framework achieves competitive performance compared with state-of-the-art systems on various benchmark datasets of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B, and LITIS Rouen.
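The framework thus has two stages: a front end that turns each recording into several complementary spectrograms, and a back end that classifies them with parallel CNN and C-RNN branches. Since this page does not give the exact spectrogram settings or layer configuration, the sketch below is an illustrative reading only, not the authors' published method: it extracts two plausible time-frequency representations with librosa (log-mel and constant-Q, both of which appear in the references; a gammatone-like spectrogram could be added with the toolbox by Ellis cited below) and pairs a small CNN branch with a GRU-based C-RNN branch in PyTorch. Every hyperparameter here (bin count, hop length, channel widths, GRU size) is an assumption.

```python
# Hedged sketch: all hyperparameters and layer sizes are assumptions,
# not the configuration published in the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn

def multi_spectrogram(path, sr=44100, n_fft=2048, hop=1024, n_bins=128):
    """Two candidate time-frequency representations per recording."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_bins))
    log_cqt = librosa.power_to_db(
        np.abs(librosa.cqt(y=y, sr=sr, hop_length=hop,
                           n_bins=n_bins, bins_per_octave=24)) ** 2)
    return log_mel, log_cqt                # each: (n_bins, n_frames)

class JointCnnCrnn(nn.Module):
    """Parallel back-end: a CNN branch pools spatial time-frequency
    patterns while a C-RNN branch runs a bidirectional GRU over the
    time axis; both embeddings are concatenated for classification."""
    def __init__(self, n_classes=10, n_bins=128):
        super().__init__()
        self.cnn = nn.Sequential(                   # spatial (CNN) branch
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.front = nn.Sequential(                 # C-RNN conv front-end
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 1)))
        self.gru = nn.GRU(input_size=32 * (n_bins // 4), hidden_size=64,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(64 + 2 * 64, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_bins, n_frames)
        spatial = self.cnn(x).flatten(1)             # (batch, 64)
        z = self.front(x)                            # (batch, 32, n_bins/4, T)
        z = z.permute(0, 3, 1, 2).flatten(2)         # (batch, T, features)
        _, h = self.gru(z)                           # h: (2, batch, 64)
        temporal = torch.cat([h[0], h[1]], dim=1)    # (batch, 128)
        return self.classifier(torch.cat([spatial, temporal], dim=1))
```

One way to realize the multi-spectrogram combination described above would be to run each spectrogram type through the network separately and fuse the resulting class probabilities, e.g., by averaging.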

References

P. Zwan, “Automatic sound recognition for security purposes,” Proceedings of the Audio Engineering Society Convention 124, Audio Engineering Society, 2008, paper 7387.

X. Valero, F. Alias, “Gammatone wavelet features for sound classification in surveillance applications,” Proceedings of the EUSIPCO, 2012, pp. 1658–1662.

B. N. Schilit, N. Adams, R. Want, Context-aware Computing Applications, Xerox Corporation, Palo Alto Research Center, 1994. https://doi.org/10.1109/WMCSA.1994.16.

T. Heittola, A. Mesaros, A. Eronen, T. Virtanen, “Context-dependent sound event detection,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, no. 1, pp. 1, 2013. https://doi.org/10.1186/1687-4722-2013-1.

Y. Xu, W. J. Li, K. K. Caramon Lee, Intelligent Wearable Interfaces, Wiley Online Library, 2008. https://doi.org/10.1002/9780470222867.

L. Ma, D. J. Smith, B. P. Milner, “Context awareness using environmental noise classification,” Proceedings of the EUROSPEECH, 2003, pp. 2237-2240.

I. V. McLoughlin, Speech and Audio Processing: a MATLAB-based Approach, Cambridge University Press, 2016. https://doi.org/10.1017/CBO9781316084205.

E. Marchi, D. Tonelli, X. Xu, F. Ringeval, J. Deng, S. Squartini, B. Schuller, “Pairwise decomposition with deep neural networks and multiscale kernel subspace learning for acoustic scene classification,” DCASE Technical Report, 2016, pp. 65–69.

E. Marchi, D. Tonelli, X. Xu, F. Ringeval, J. Deng, B. Schuller, “The UP system for the 2016 DCASE challenge using deep recurrent neural network and multiscale kernel subspace learning,” DCASE Technical Report, 2016.

A. Vafeiadis, D. Kalatzis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, R. Hamzaoui, “Acoustic scene classification: From a hybrid classifier to deep learning,” DCASE Technical Report, 2017.

S. Park, S. Mun, Y. Lee, H. Ko, “Score fusion of classification systems for acoustic scene classification,” DCASE Technical Report, 2016.

J. T. Geiger, M. A. Lakhal, B. Schuller, G. Rigoll, “Learning new acoustic events in an hmm-based system using map adaptation,” Proceedings of the INTERSPEECH, 2011, pp. 293-296. https://doi.org/10.21437/Interspeech.2011-113.

J. T. Geiger, B. Schuller, G. Rigoll, “Large-scale audio feature extraction and SVM for acoustic scene classification,” Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 2013, pp. 1–4. https://doi.org/10.1109/WASPAA.2013.6701857.

L. Vuegen, B. V. D. Broeck, P. Karsmakers, J. F. Gemmeke, B. Vanrumste, H. V. Hamme, “An MFCC-GMM approach for event detection and classification,” Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 2013, pp. 1–3.

H. Phan, L. Hertel, M. Maass, P. Koch, R. Mazur, A. Mertins, “Improved audio scene classification based on label-tree embeddings and convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1278–1290, 2017. https://doi.org/10.1109/TASLP.2017.2690564.

H. Phan, O. Y. Chén, L. Pham, P. Koch, M. De Vos, I. McLoughlin, A. Mertins, “Spatio-temporal attention pooling for audio scene classification,” arXiv preprint arXiv:1904.03543, 2019. https://doi.org/10.21437/Interspeech.2019-3040.

H. Phan, O. Y. Chén, P. Koch, L. Pham, I. McLoughlin, A. Mertins, M. De Vos, “Beyond equal-length snippets: How long is sufficient to recognize an audio scene?,” arXiv preprint arXiv:1811.01095, 2018.

Z. Ren, Q. Kong, J. Han, M. D Plumbley, B. W. Schuller, “Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes,” Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2019, pp. 56–60. https://doi.org/10.1109/ICASSP.2019.8683434.

S. S. R. Phaye, E. Benetos, Y. Wang, “Subspectralnet – using sub-spectrogram based convolutional neural networks for acoustic scene classification,” Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2019, pp. 825–829. https://doi.org/10.1109/ICASSP.2019.8683288.

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” DCASE Technical Report, 2017.

T. Lidy, A. Schindler, “CQT-based convolutional neural networks for audio scene classification,” DCASE Technical Report, 2016.

H. Phan, P. Koch, F. Katzberg, M. Maass, R. Mazur, A. Mertins, “Audio scene classification with deep recurrent neural networks,” arXiv preprint arXiv:1703.04770, 2017. https://doi.org/10.21437/Interspeech.2017-101.

T. Nguyen, F. Pernkopf, “Acoustic scene classification using a convolutional neural network ensemble and nearest neighbor filters,” DCASE Technical Report, 2018. https://doi.org/10.1109/ICMLA.2019.00151.

T. Nguyen, F. Pernkopf, “Acoustic scene classification with mismatched recording devices using mixture of experts layer,” Proceedings of the 2019 IEEE International Conference on Multimedia and Expo ICME, 2019, pp. 1666–1671. https://doi.org/10.1109/ICME.2019.00287.

H. Zeinali, L. Burget, J. Cernocky, “Convolutional neural networks and x-vector embedding for dcase2018 acoustic scene classification challenge,” arXiv preprint arXiv:1810.04273, 2018.

H. Phan, L. Hertel, M. Maass, P. Koch, A. Mertins, “Label tree embeddings for acoustic scene classification,” Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 486–490. https://doi.org/10.1145/2964284.2967268.

L. Yang, X. Chen, L. Tao, “Acoustic scene classification using multi-scale features,” DCASE Technical Report, 2018.

T. Zhang, K. Zhang, J. Wu, “Temporal transformer networks for acoustic scene classification,” Proceedings of the Interspeech, 2018, pp. 1349–1353. https://doi.org/10.21437/Interspeech.2018-1152.

T. Zhang, K. Zhang, J. Wu, “Data independent sequence augmentation method for acoustic scene classification,” Proceedings of the Interspeech, 2018, pp. 3289–3293.

T. Zhang, K. Zhang, J. Wu, “Multi-modal attention mechanisms in LSTM and its application to acoustic scene classification,” Proceedings of the Interspeech, 2018, pp. 3328–3332. https://doi.org/10.21437/Interspeech.2018-1138.

C. Gousseau, “VGG CNN for urban sound tagging,” DCASE Technical Report, 2019.

Z. Li, L. Zhang, S. Du, W. Liu, “Acoustic scene classification based on binaural deep scattering spectra with CNN and LSTM,” DCASE Technical Report, 2018.

A. Rakotomamonjy, G. Gasso, “Histogram of gradients of time–frequency representations for audio scene classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 142–153, 2014. https://doi.org/10.1109/TASLP.2014.2375575.

A. Mesaros, T. Heittola, T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” Proceedings of the 2016 24th European Signal Processing Conference EUSIPCO, 2016, pp. 1128-1132. https://doi.org/10.1109/EUSIPCO.2016.7760424.

A. Mesaros, T. Heittola, T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” DCASE Technical Report, 2018.

J. Salamon, J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017. https://doi.org/10.1109/LSP.2017.2657381.

B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, O. Nieto, “librosa: Audio and music signal analysis in python,” Proceedings of the 14th Python in Science Conference, 2015, pp. 18–25. https://doi.org/10.25080/Majora-7b98e3ed-003.

R. D. Patterson, “Auditory filters and excitation patterns as representations of frequency resolution,” Frequency selectivity in hearing, 1986.

B. R. Glasberg, B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990. https://doi.org/10.1016/0378-5955(90)90170-T.

D. P. W. Ellis, “Gammatone-like spectrograms,” 2009. [Online]. Available at: http://www.ee.columbia.edu/dpwe/resources/matlab/gammatonegram.

D. P. Kingma, J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

S. Kullback, R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. https://doi.org/10.1214/aoms/1177729694.

A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, M. D. Plumbley, “Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 379–393, 2018. https://doi.org/10.1109/TASLP.2017.2778423.

Y. Li, X. Li, Y. Zhang, “The SEIE-SCUT systems for challenge on DCASE 2018: Deep learning techniques for audio representation and classification,” DCASE Technical Report, 2018.

S. Zhao, T. N. T. Nguyen, W.-S. Gan, D. L. Jones, “Acoustic scene classification using deep residual convolutional neural networks,” in DCASE Technical Report, 2017.

D. Wei, J. Li, P. Pham, S. Das, S. Qu, “Acoustic scene recognition with deep neural networks (DCASE challenge 2016),” DCASE Technical Report, 2016.

V. Bisot, S. Essid, G. Richard, “Hog and subband power distribution image features for acoustic scene classification,” Proceedings of the 2015 23rd European Signal Processing Conference EUSIPCO, 2015, pp. 719–723. https://doi.org/10.1109/EUSIPCO.2015.7362477.

J. Tchorz, “Combination of amplitude modulation spectrogram features and MFCCs for acoustic scene classification,” DCASE Technical Report, 2018.

J.-W. Jung, H.-S. Heo, I. H. Yang, S.-H. Yoon, H.-J. Shim, H.-J. Yu, “DNN-based audio scene classification for DCASE 2017: dual input features, balancing cost, and stochastic data duplication,” DCASE Technical Report, 2017.

S. H. Bae, I. Choi, N. S. Kim, “Acoustic scene classification using parallel combination of LSTM and CNN,” DCASE Technical Report, 2016, pp. 11–15.

J. Ye, T. Kobayashi, M. Murakawa, T. Higuchi, “Acoustic scene classification based on sound textures and events,” Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 1291–1294. https://doi.org/10.1145/2733373.2806389.

Q. Kong, T. Iqbal, Y. Xu, W. Wang, M. D. Plumbley, “DCASE 2018 challenge surrey cross-task convolutional neural network baseline,” arXiv preprint arXiv:1808.00773, 2018.

J. Wang, “DCASE 2018 task 1A: Acoustic scene classification by bi-LSTM-CNN-net multichannel fusion,” DCASE Technical Report, 2018.

K. J. Piczak, “The details that matter: Frequency resolution of spectrograms in acoustic scene classification,” DCASE Technical Report, 2017.

J. Kim, K. Lee, “Empirical study on ensemble method of deep neural networks for acoustic scene classification,” DCASE Technical Report, 2016.

J. Wang, S. Li, “Self-attention mechanism based system for DCASE 2018 challenge task1 and task4,” DCASE Technical Report, 2018, pp. 1–5.

C. Roletscheck, T. Watzka, A. Seiderer, D. Schiller, E. Andre, “Using an evolutionary approach to explore convolutional neural networks for acoustic scene classification,” DCASE Technical Report, 2018.

I. Kukanov, V. Hautamaki, K. A. Lee, “Recurrent neural network and maximal figure of merit for acoustic event detection,” DCASE Technical Report, 2017.

G. Takahashi, T. Yamada, S. Makino, N. Ono, “Acoustic scene classification using deep neural network and frame-concatenated acoustic feature,” DCASE Technical Report, 2016. https://doi.org/10.1109/APSIPA.2017.8282314.

Y. Yin, R. R. Shah, R. Zimmermann, “Learning and fusing multimodal deep features for acoustic scene categorization,” Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1892–1900. https://doi.org/10.1145/3240508.3240631.

S. Waldekar, G. Saha, “Wavelet-based audio features for acoustic scene classification,” DCASE Technical Report, 2018. https://doi.org/10.21437/Interspeech.2018-2083.

L. Zhang, J. Han, “Acoustic scene classification using multi-layered temporal pooling based on deep convolutional neural network,” DCASE Technical Report, 2018.

S. Park, S. Mun, Y. Lee, H. Ko, “Acoustic scene classification based on convolutional neural network using double image features,” DCASE Technical Report, 2017, pp. 1–5.

B. Elizalde, A. Kumar, A. Shah, R. Badlani, E. Vincent, B. Raj, I. Lane, “Experiments on the DCASE challenge 2016: Acoustic scene classification and sound event detection in real life recording,” arXiv preprint arXiv:1607.06706, 2016.

B. Lehner, H. Eghbal-Zadeh, M. Dorfer, F. Korzeniowski, K. Koutini, G. Widmer, “Classifying short acoustic scenes with i-vectors and CNNs: Challenges and optimisations for the 2017 DCASE ASC task,” DCASE Technical Report, 2017.

M. Valenti, A. Diment, G. Parascandolo, S. Squartini, T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” DCASE Technical Report, pp. 95–99, 2016.

J. Ye, T. Kobayashi, N. Toyama, H. Tsuda, M. Murakawa, “Acoustic scene classification using efficient summary statistics and multiple spectro-temporal descriptor fusion,” Applied Sciences, vol. 8, no. 8, p. 1363, 2018. https://doi.org/10.3390/app8081363.

A. Dang, T. Vu, J.-C. Wang, “Acoustic scene classification using ensemble of convnets,” DCASE Technical Report, 2018.

R. Hyder, S. Ghaffarzadegan, Z. Feng, T. Hasan, “Buet Bosch consortium (b2c) acoustic scene classification systems for DCASE 2017,” DCASE Technical Report, 2017.

O. Mariotti, M. Cord, O. Schwander, “Exploring deep vision models for acoustic scene classification,” DCASE Technical Report, 2018.

Z. Weiping, Y. Jiantao, X. Xiaotao, L. Xiangtao, P. Shaohu, “Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion,” DCASE Technical Report, 2017.

Y. Han, J. Park, K. Lee, “Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification,” DCASE Technical Report, pp. 1–5, 2017.

V. Bisot, R. Serizel, S. Essid, G. Richard, “Supervised nonnegative matrix factorization for acoustic scene classification,” DCASE Technical Report, pp. 62–69, 2016.

A. Golubkov, A. Lavrentyev, “Acoustic scene classification using convolutional neural networks and different channels representations and its fusion,” DCASE Technical Report, 2018.

S. Mun, S. Park, D. K. Han, H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” DCASE Technical Report, pp. 93–97, 2017.

H. Eghbal-Zadeh, B. Lehner, M. Dorfer, G. Widmer, “CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks,” DCASE Technical Report, 2016.

X. Bai, J. Du, Z.-R. Wang, C.-H. Lee, “A hybrid approach to acoustic scene classification based on universal acoustic models,” Proceedings of the Interspeech 2019, pp. 3619–3623, 2019. https://doi.org/10.21437/Interspeech.2019-2171.

Z. Ren, K. Qian, Y. Wang, Z. Zhang, V. Pandit, A. Baird, B. Schuller, “Deep scalogram representations for acoustic scene classification,” IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 3, pp. 662–669, 2018. https://doi.org/10.1109/JAS.2018.7511066.

S. Mun, S. Shon, W. Kim, D. K. Han, H. Ko, “Deep neural network based learning and transferring mid-level audio features for acoustic scene classification,” Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2017, pp. 796–800. https://doi.org/10.1109/ICASSP.2017.7952265.

L. Gao, H. Mi, B. Zhu, D. Feng, Y. Li, Y. Peng, “An adversarial feature distillation method for audio classification,” IEEE Access, vol. 7, pp. 105319–105330, 2019. https://doi.org/10.1109/ACCESS.2019.2931656.

Y. Yang, H. Zhang, W. Tu, H. Ai, L. Cai, R. Hu, F. Xiang, “Kullback–Leibler divergence frequency warping scale for acoustic scene classification using convolutional neural network,” Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2019, pp. 840–844. https://doi.org/10.1109/ICASSP.2019.8683000.

J. Li, W. Dai, F. Metze, S. Qu, S. Das, “A comparison of deep learning methods for environmental sound detection,” Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2017, pp. 126–130. https://doi.org/10.1109/ICASSP.2017.7952131.

R. Hyder, S. Ghaffarzadegan, Z. Feng, J. H. L. Hansen, T. Hasan, “Acoustic scene classification using a CNN-supervector system trained with auditory and spectrogram image features,” Proceedings of the Interspeech, 2017, pp. 3073–3077. https://doi.org/10.21437/Interspeech.2017-431.

L. Yang, L. Tao, X. Chen, X. Gu, “Multi-scale semantic feature fusion and data augmentation for acoustic scene classification,” Applied Acoustics, vol. 163, p. 107238, 2020. https://doi.org/10.1016/j.apacoust.2020.107238.

Y. Wu, T. Lee, “Enhancing sound texture in CNN-based acoustic scene classification,” Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2019, pp. 815–819. https://doi.org/10.1109/ICASSP.2019.8683490.

H. Song, J. Han, S. Deng, “A compact and discriminative feature based on auditory summary statistics for acoustic scene classification,” arXiv preprint arXiv:1904.05243, 2019. https://doi.org/10.21437/Interspeech.2018-1299.

H.-S. Heo, J.-W. Jung, H.-J. Shim, H.-J. Yu, “Acoustic scene classification using teacher-student learning with soft-labels,” arXiv preprint arXiv:1904.10135, 2019. https://doi.org/10.21437/Interspeech.2019-1989.

H. Chen, P. Zhang, Y. Yan, “An audio scene classification framework with embedded filters and a DCT-based temporal module,” Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 2019, pp. 835–839. https://doi.org/10.1109/ICASSP.2019.8683636.

Published

2022-06-30

How to Cite

Ngo, D., Pham, L., Nguyen, A., Ly, T., Pham, K., & Ngo, T. (2022). Sound Context Classification based on Joint Learning Model and Multi-Spectrogram Features. International Journal of Computing, 21(2), 258-270. https://doi.org/10.47839/ijc.21.2.2595
