Investigating Methods of Searching for Key Frames in Video Flow with the Use of Neural Networks for Search Systems


  • Natalya Shakhovska
  • Natalia Melnykova
  • Petro Pobereiko
  • Maryana Zakharchuk



key frames, neural networks, unsupervised learning, SIFT, CNN, IndRNN, Leaky ReLU


This paper presents, compares, and evaluates methods for analyzing video content. The analysis shows that the most effective strategies for processing video data rely on searching for key frames within the video stream. The examined methods fall into three categories: sequential comparison, clustering-based global comparison, and event/object-based methods. Sequence search, classification, frame decoding, and anomaly detection are identified as particularly valuable techniques for comparison and matching tasks. The research further shows that artificial intelligence and machine learning methods dominate this domain, with deep learning approaches outperforming traditional techniques. Of particular note is the use of convolutional neural networks and attention mechanisms to capture temporal dependencies over varying ranges. In addition, applying the Actor-Critic model within a Generative Adversarial Network framework has shown encouraging results. The main contribution of the study is a proposed approach that combines modified Independent Recurrent Neural Networks (IndRNN) with an attention mechanism. Refining key frame detection with mathematical tools, notably the standard deviation, illustrates how analytical instruments can improve the system's precision. These results support substantial improvements to information systems for video content analysis and source identification.
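The abstract does not specify how the standard deviation enters key frame detection. As an illustration only, the following minimal sketch assumes one common reading: a frame is marked as a key frame when its mean absolute difference from the previous frame exceeds the mean of all such differences by more than k standard deviations. The function name `select_key_frames`, the parameter `k`, and the choice to always keep the first frame are hypothetical and are not taken from the paper.

```python
import numpy as np


def select_key_frames(frames, k=1.0):
    """Return indices of frames that differ unusually strongly from their predecessor.

    Illustrative assumption, not the authors' method: the threshold is
    mean + k * standard deviation of the consecutive-frame differences.
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = len(frames)
    if n < 2:
        # Nothing to compare against; keep whatever frames exist.
        return list(range(n))

    # Mean absolute pixel difference between each pair of consecutive frames.
    diffs = np.abs(frames[1:] - frames[:-1]).reshape(n - 1, -1).mean(axis=1)

    # Statistical threshold built from the mean and standard deviation.
    threshold = diffs.mean() + k * diffs.std()

    # diffs[i] compares frame i with frame i + 1, so the key frame is i + 1.
    # Frame 0 is always kept as the start of the first shot (an assumption).
    return [0] + [int(i) + 1 for i in np.flatnonzero(diffs > threshold)]


# Synthetic clip: five dark frames followed by five bright frames.
video = np.zeros((10, 4, 4))
video[5:] = 100.0
key_frames = select_key_frames(video)
```

In this synthetic clip the only large change occurs between frames 4 and 5, so the sketch keeps frame 0 and frame 5. A larger `k` makes the selection stricter; in the proposed approach such a statistic would presumably operate on network outputs rather than raw pixel differences.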






How to Cite

Shakhovska, N., Melnykova, N., Pobereiko, P., & Zakharchuk, M. (2023). Investigating Methods of Searching for Key Frames in Video Flow with the Use of Neural Networks for Search Systems. International Journal of Computing, 22(4), 455-461.