Attr4Vis: Revisiting Importance of Attribute Classification in Vision-Language Models for Video Recognition

Authors

  • Alexander Zarichkovyi
  • Inna V. Stetsenko

DOI:

https://doi.org/10.47839/ijc.23.1.3440

Keywords:

computer vision, video recognition, cross-modal exploration, vision-language models, lexicon enrichment algorithm

Abstract

Vision-language models (VLMs), pretrained on expansive datasets of image-text pairs, have exhibited remarkable transferability across a diverse spectrum of visual tasks. Leveraging the knowledge encoded in these powerful VLMs holds significant promise for building effective video recognition models. A fundamental strength of pretrained VLMs is their ability to bridge the visual and textual domains. In this work, we introduce the Attr4Vis framework, dedicated to exploring knowledge transfer between the video and text modalities to bolster video recognition performance. Central to our contributions is a comprehensive revisitation of Text-to-Video classifier initialization, a critical step that refines the initialization process and streamlines the integration of our framework into existing VLMs. Furthermore, we emphasize the adoption of dense attribute generation techniques and show their importance for video analysis: by encoding attribute changes over time, they significantly enhance event representation and recognition in videos. In addition, we introduce an Attribute Enrichment Algorithm that enriches the set of attributes using large language models (LLMs) such as ChatGPT. Through the seamless integration of these components, Attr4Vis attains a state-of-the-art accuracy of 91.5% on the challenging Kinetics-400 dataset using the InternVideo model.
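
To make the classifier-initialization step concrete, the sketch below is a minimal illustration (not the authors' implementation): it builds the weight matrix of a video-classification head from text embeddings of class names combined with attribute lists, using Hugging Face's CLIP text encoder as a stand-in for a VLM text tower. The prompt template, attribute lists, and checkpoint name are illustrative assumptions; Attr4Vis itself operates on an InternVideo backbone and derives its attribute sets from an LLM-based enrichment step.

# Minimal sketch: a text-initialized classifier head for video recognition.
# Assumptions (not from the paper): the CLIP checkpoint, the prompt template,
# and the toy attribute lists below are illustrative only.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

class_names = ["riding a bike", "playing guitar", "swimming"]   # e.g. Kinetics-400 labels
attributes = {                                                  # hypothetical LLM-enriched attributes
    "riding a bike": ["helmet", "pedaling", "handlebars"],
    "playing guitar": ["strumming", "fretboard", "chords"],
    "swimming": ["pool lanes", "goggles", "freestyle stroke"],
}

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode(prompts):
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    embeds = text_encoder(**tokens).text_embeds         # (N, D) projected text features
    return torch.nn.functional.normalize(embeds, dim=-1)

# One prompt per class: the class name plus its attribute list.
prompts = [f"a video of {c}, with {', '.join(attributes[c])}" for c in class_names]
classifier_weights = encode(prompts)                     # (num_classes, D) text-derived head

# At inference, normalized video features from the vision tower are scored
# against these weights instead of a randomly initialized linear layer.
video_features = torch.nn.functional.normalize(
    torch.randn(2, classifier_weights.shape[1]), dim=-1)    # stand-in for real video features
logits = 100.0 * video_features @ classifier_weights.T      # (2, num_classes)
print(logits.argmax(dim=-1))

Scoring video features against text-derived weights in this way is the general idea behind initializing the classifier from the text encoder rather than training a head from scratch; the attribute strings in the prompt are where an enrichment step can inject additional, time-varying cues.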

References

J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, June 2019, pp. 4171–4186.

T. Brown, B. Mann, N. Ryder, M. Subbiah, et al., “Language models are few-shot learners,” in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan and H. Lin (Eds.), Advances in Neural Information Processing Systems (NeurIPS 2020), vol. 33, 2020, pp. 1877–1901.

Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “Ernie: Enhanced language representation with informative entities,” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July – 2 August 2019, pp. 1441–1451. https://doi.org/10.18653/v1/P19-1139.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, et al., “Learning transferable visual models from natural language supervision,” Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18-24 July 2021, pp. 8748–8763.

C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, et al., “Scaling up visual and vision-language representation learning with noisy text supervision,” Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18-24 July 2021, pp. 4904–4916.

J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini and Y. Wu, CoCa: Contrastive captioners are image-text foundation models, 2022, [Online]. Available at: https://arxiv.org/abs/2205.01917.

L. Yuan, D. Chen, Y.-L. Chen, N. Codella, et al., Florence: A new foundation model for computer vision, 2021, [Online]. Available at: https://arxiv.org/abs/2111.11432.

Z. Lin, S. Geng, R. Zhang, P. Gao, et al., “Frozen CLIP models are efficient video learners,” Lecture Notes in Computer Science, vol. 13695, pp. 388–404, 2022. https://doi.org/10.1007/978-3-031-19833-5_23.

J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li, “St-adapter: Parameter-efficient image-to-video transfer learning,” in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho and A. Oh (Eds.), Advances in Neural Information Processing Systems (NeurIPS 2022), vol. 35, 2022, pp. 26462-26477.

C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, “Prompting visual-language models for efficient video understanding,” Lecture Notes in Computer Science, vol. 13695, 2022, pp. 105–124. https://doi.org/10.1007/978-3-031-19833-5_7.

B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, “Expanding language-image pretrained models for general video recognition,” Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, October 23–27, 2022, Part IV, Springer, 2022, pp. 1–18. https://doi.org/10.1007/978-3-031-19772-7_1.

M. Wang, J. Xing, and Y. Liu, ActionCLIP: A new paradigm for video action recognition, 2021, [Online]. Available at: https://arxiv.org/abs/2109.08472.

W. Wu, Z. Sun, and W. Ouyang, “Revisiting classifier: Transferring vision-language models for video recognition,” Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, D.C., USA, February 7–14, 2023, pp. 2847-2855. https://doi.org/10.1609/aaai.v37i3.25386.

W. Wu, X. Wang, H. Luo, J. Wang, Y. Yang, and W. Ouyang, “Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 6620-6630. https://doi.org/10.1109/CVPR52729.2023.00640.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems (NeurIPS 2012), 2012, pp. 1-9.

I. Paliy, A. Sachenko, V. Koval and Y. Kurylyak, “Approach to face recognition using neural networks,” Proceedings of the 2005 IEEE Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS’2005, Sofia, Bulgaria, 2005, pp. 112-115, https://doi.org/10.1109/IDAACS.2005.282951.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, December 4-9, 2017, pp. 5998–6008.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, [Online]. Available at: https://arxiv.org/abs/2010.11929.

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training Data-efficient Image Transformers & Distillation Through Attention, 2020, [Online]. Available at: https://arxiv.org/abs/2012.12877.

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 9992-10002. https://doi.org/10.1109/ICCV48922.2021.00986.

G. Bertasius, H. Wang, and L. Torresani, “Is Space-Time Attention All You Need for Video Understanding?” Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18-24 July, 2021, pp. 813–824.

A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, “ViViT: A video vision transformer,” Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 6816-6826. https://doi.org/10.1109/ICCV48922.2021.00676.

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin and H. Hu, “Video swin transformer,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 3202–3211. https://doi.org/10.1109/CVPR52688.2022.00320.

H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, et al., “Multiscale vision transformers,” Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 6804-6815. https://doi.org/10.1109/ICCV48922.2021.00675.

A. Van den Oord, Y. Li, and O. Vinyals, Representation Learning with Contrastive Predictive Coding, 2018, [Online]. Available at: https://arxiv.org/abs/1807.

J. Yang, C. Li, P. Zhang, B. Xiao, C. Liu, L. Yuan, and J. Gao, “Unified contrastive learning in image-text-label space,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 19141-19151. https://doi.org/10.1109/CVPR52688.2022.01857.

L. Wang, Z. Tong, B. Ji, and G. Wu, “TDN: Temporal difference networks for efficient action recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19-25, 2021, pp. 1895–1904. https://doi.org/10.1109/CVPR46437.2021.00193.

H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, “CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022. https://doi.org/10.1016/j.neucom.2022.07.028.

Y. Wang, K. Li, Y. Li, Y. He, B. Huang, et al., InternVideo: General Video Foundation Models via Generative and Discriminative Learning, 2022, [Online]. Available at: https://arxiv.org/abs/2212.03191.

L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, et al., Text Embeddings by Weakly-Supervised Contrastive Pre-training, 2022, [Online]. Available at: https://arxiv.org/abs/2212.03533.

Z. Wang, J. Yu, A.W. Yu, Z. Dai, Y. Tsvetkov and Y. Cao, “SimVLM: Simple visual language model pretraining with weak supervision,” Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

P. Wang, A. Yang, R. Men, J. Lin, S. Bai, et al., “OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” Proceedings of the 39th International Conference on Machine Learning, ICML 2022, Baltimore, Maryland, USA, July 17-23, 2022, pp. 23318-23340.

W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, et al., Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, 2022. https://doi.org/10.1109/CVPR52729.2023.01838.

H. Bao, L. Dong, S. Piao, F. Wei, “BEiT: BERT pre-training of image transformers,” Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

T.-J. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, et al., VIOLET: End-to-end video-language transformers with masked visual-token modeling, 2021, [Online]. Available at: https://arxiv.org/abs/2111.12681.

A. J. Wang, Y. Ge, R. Yan, Y. Ge, X. Lin, et al., All in one: Exploring unified video-language pre-training, 2022. https://doi.org/10.1109/CVPR52729.2023.00638.

L. Li, Z. Gan, K. Lin, C.-C. Lin, Z. Liu, C. Liu and L. Wang, Lavender: Unifying video-language understanding as masked language modeling, 2022. https://doi.org/10.1109/CVPR52729.2023.02214.

R. Zellers, J. Lu, X. Lu, Y. Yu, Y. Zhao, et al., “MERLOT RESERVE: Neural script knowledge through vision and language and sound,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 16354-16366. https://doi.org/10.1109/CVPR52688.2022.01589.

A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman, “End-to-end learning of visual representations from uncurated instructional videos,” Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9876-9886. https://doi.org/10.1109/CVPR42600.2020.00990.

H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer, Videoclip: Contrastive Pre-training for Zero-shot Video-text Understanding, 2021. https://doi.org/10.18653/v1/2021.emnlp-main.544.

C. Sun, A. Myers, C. Vondrick, K. P. Murphy, and C. Schmid. “Videobert: A joint model for video and language representation learning,” Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 7463-7472. https://doi.org/10.1109/ICCV.2019.00756.

L. Zhu, Y. Yang, “ActBERT: Learning global-local video-text representations,” Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 8743-8752. https://doi.org/10.1109/CVPR42600.2020.00877.

J. Lei, L. Li, L. Zhou, Z. Gan, T.L. Berg, M. Bansal and J. Liu, “Less is more: ClipBERT for video-and-language learning via sparse sampling,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19-25, 2021, pp. 7331-7341. https://doi.org/10.1109/CVPR46437.2021.00725.

M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 1708-1718. https://doi.org/10.1109/ICCV48922.2021.00175.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, et al. The Kinetics Human Action Video Dataset, 2017, [Online]. Available at: https://arxiv.org/abs/1705.06950.

J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, A Short Note about Kinetics-600, 2018, [Online]. Available at: https://arxiv.org/abs/1808.01340.

F. C. Heilbron, V. Escorcia, B. Ghanem, J. C. Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 961-970. https://doi.org/10.1109/CVPR.2015.7298698.

G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18-24 July, 2021, pp. 813–824.

M. Ryoo, A.J. Piergiovanni, A. Arnab, M. Dehghani, A. Angelova. “TokenLearner: adaptive space-time tokenization for videos,” Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 12786–12797.

S. Yan, X. Xiong, A. Arnab, Z. Lu, M. Zhang, C. Sun and C. Schmid, “Multiview transformers for video recognition,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 3333–3343. https://doi.org/10.1109/CVPR52688.2022.00333.

B. Zhang, J. Yu, C. Fifty, W. Han, A.M. Dai, R. Pang, and F. Sha, Co-training Transformer with Videos and Images Improves Action Recognition, 2021, [Online]. Available at: https://arxiv.org/abs/2112.07175.

Published

2024-04-01

How to Cite

Zarichkovyi, A., & Stetsenko, I. V. (2024). Attr4Vis: Revisiting Importance of Attribute Classification in Vision-Language Models for Video Recognition. International Journal of Computing, 23(1), 94-100. https://doi.org/10.47839/ijc.23.1.3440

Issue

Section

Articles