:::info Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
:::
:::tip Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
:::
Supplementary Material

For audio transcript extraction, the Whisper base model is used. Our grounding module is based on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model we use the RAM Swin-Large variant (with input size 384). The DEVA tracker is applied in the online setting in our experiments.
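Below is a minimal sketch of how two of these components can be wired together, assuming the public openai-whisper and OpenAI CLIP packages. The GroundingDINO-T detector itself is not invoked here (the paper does not give its exact invocation); the region crops it would produce are simply taken as input to the CLIP scoring step.

```python
# Sketch of the pipeline components described above. The whisper and clip
# calls follow the public openai-whisper and OpenAI CLIP APIs; the
# GroundingDINO-T detector is assumed to supply the candidate region crops.
import whisper
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Audio transcript extraction with the Whisper base model.
asr = whisper.load_model("base", device=device)
transcript = asr.transcribe("video.mp4")["text"]

# 2) CLIP ViT-B/32 for scoring candidate regions against a phrase.
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def rank_regions(crops: list[Image.Image], phrase: str) -> int:
    """Return the index of the region crop that best matches the phrase."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    tokens = clip.tokenize([phrase]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(images)
        txt_feat = clip_model.encode_text(tokens)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)
    return int(scores.argmax())
```

In the described design, the detector proposes boxes, CLIP ranks them against the target phrase, and the selected box is handed to the DEVA tracker for propagation across frames.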
The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, zero-shot question-answering evaluation, and extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Further, Vicuna-13b-v1.5 was used to implement entity matching as in [49].
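A hedged sketch of these LLM-assisted steps follows. Here `vicuna_generate` is a hypothetical stand-in for whatever stack serves Vicuna-13b-v1.5 (e.g. a FastChat endpoint), and the prompt wording is illustrative rather than the paper's actual template.

```python
# Sketch of the LLM-assisted evaluation steps described above.
# vicuna_generate() is a hypothetical hook for a Vicuna-13b-v1.5 endpoint;
# the prompts are illustrative, not the paper's exact templates.
def vicuna_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your Vicuna-13b-v1.5 endpoint here")

def extract_key_noun(model_output: str) -> str:
    """Ask the LLM for the key noun / referring expression in a response."""
    prompt = (
        "Extract the single key noun phrase being referred to in the "
        f"following answer. Reply with the phrase only.\n\nAnswer: {model_output}"
    )
    return vicuna_generate(prompt).strip()

def match_entity(phrase: str, candidate_tags: list[str]) -> str:
    """Entity matching in the spirit of [49]: pick the candidate tag that
    names the same object as the phrase, or 'none' if no tag matches."""
    prompt = (
        f"Phrase: {phrase}\nCandidate tags: {', '.join(candidate_tags)}\n"
        "Which tag refers to the same entity as the phrase? "
        "Reply with one tag, or 'none'."
    )
    return vicuna_generate(prompt).strip()
```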
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::