:::info Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
:::
:::tip Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
:::
Supplementary Material

For audio transcript extraction, the Whisper base model is used. Our grounding module is based on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model we use the RAM Swin-Large variant (with input size 384). The DEVA tracker is applied in the online setting in our experiments.
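Below is a minimal sketch of how two of these components can be wired together, assuming the public openai-whisper and OpenAI CLIP packages. The GroundingDINO-T detector itself is not invoked here (the paper does not give its exact invocation); the region crops it would produce are simply taken as input to the CLIP scoring step.

```python
# Sketch of the pipeline components described above. The whisper and clip
# calls follow the public openai-whisper and OpenAI CLIP APIs; the
# GroundingDINO-T detector is assumed to supply the candidate region crops.
import whisper
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Audio transcript extraction with the Whisper base model.
asr = whisper.load_model("base", device=device)
transcript = asr.transcribe("video.mp4")["text"]

# 2) CLIP ViT-B/32 for scoring candidate regions against a phrase.
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def rank_regions(crops: list[Image.Image], phrase: str) -> int:
    """Return the index of the region crop that best matches the phrase."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    tokens = clip.tokenize([phrase]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(images)
        txt_feat = clip_model.encode_text(tokens)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)
    return int(scores.argmax())
```

In the described design, the detector proposes boxes, CLIP ranks them against the target phrase, and the selected box is handed to the DEVA tracker for propagation across frames.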
The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, zero-shot question-answering evaluation, and extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Further, Vicuna-13b-v1.5 was used to implement entity matching as in [49].
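A hedged sketch of these LLM-assisted steps follows. Here `vicuna_generate` is a hypothetical stand-in for whatever stack serves Vicuna-13b-v1.5 (e.g. a FastChat endpoint), and the prompt wording is illustrative rather than the paper's actual template.

```python
# Sketch of the LLM-assisted evaluation steps described above.
# vicuna_generate() is a hypothetical hook for a Vicuna-13b-v1.5 endpoint;
# the prompts are illustrative, not the paper's exact templates.
def vicuna_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your Vicuna-13b-v1.5 endpoint here")

def extract_key_noun(model_output: str) -> str:
    """Ask the LLM for the key noun / referring expression in a response."""
    prompt = (
        "Extract the single key noun phrase being referred to in the "
        f"following answer. Reply with the phrase only.\n\nAnswer: {model_output}"
    )
    return vicuna_generate(prompt).strip()

def match_entity(phrase: str, candidate_tags: list[str]) -> str:
    """Entity matching in the spirit of [49]: pick the candidate tag that
    names the same object as the phrase, or 'none' if no tag matches."""
    prompt = (
        f"Phrase: {phrase}\nCandidate tags: {', '.join(candidate_tags)}\n"
        "Which tag refers to the same entity as the phrase? "
        "Reply with one tag, or 'none'."
    )
    return vicuna_generate(prompt).strip()
```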
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::