
A foundational visual encoder for video understanding

An astounding number of videos are available on the Web, covering a wide variety of content from everyday moments people share to historical moments to scientific observations, each of which contains a unique record of the world. The right tools could help researchers analyze these videos, transforming how we understand the world around us.

Videos offer dynamic visual content far richer than static images, capturing movement, changes, and dynamic relationships between entities. Analyzing this complexity, along with the immense diversity of publicly available video data, demands models that go beyond traditional image understanding. Consequently, many of the approaches that perform best on video understanding still rely on specialized models tailored for particular tasks. Recently, there has been exciting progress in this area using video foundation models (ViFMs), such as VideoCLIP, InternVideo, VideoCoCa, and UMT. However, building a ViFM that handles the sheer diversity of video data remains a challenge.

With the goal of building a single model for general-purpose video understanding, we introduce "VideoPrism: A Foundational Visual Encoder for Video Understanding". VideoPrism is a ViFM designed to handle a wide spectrum of video understanding tasks, including classification, localization, retrieval, captioning, and question answering (QA). We propose innovations in both the pre-training data and the modeling strategy. We pre-train VideoPrism on a massive and diverse dataset: 36 million high-quality video-text pairs and 582 million video clips with noisy or machine-generated parallel text. Our pre-training approach is designed for this hybrid data, to learn both from video-text pairs and from the videos themselves. VideoPrism is easy to adapt to new video understanding challenges, and achieves state-of-the-art performance using a single frozen model.
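To make the "single frozen model" idea concrete, here is a minimal sketch (in PyTorch) of how a frozen video encoder can be adapted to a new task by training only a lightweight head on top of its representations. The encoder interface and pooled-feature output are assumptions for illustration, not the released API.

```python
# Minimal sketch: adapt a frozen video encoder with a small trainable head.
# The encoder is a hypothetical stand-in for the pre-trained backbone.
import torch
import torch.nn as nn


class FrozenEncoderClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the backbone
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)  # only this is trained

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        with torch.no_grad():
            feats = self.encoder(video)       # assumed: (batch, embed_dim) pooled features
        return self.head(feats)
```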

 


VideoPrism is a general-purpose video encoder that enables state-of-the-art results over a wide spectrum of video understanding tasks, including classification, localization, retrieval, captioning, and question answering, by producing video representations from a single frozen model.

Pre-training data

A powerful ViFM needs a very large collection of videos on which to train, similar to other foundation models (FMs), such as those behind large language models (LLMs). Ideally, we would want the pre-training data to be a representative sample of all the videos in the world. While naturally most of these videos do not have perfect captions or descriptions, even imperfect text can provide useful information about the semantic content of the video.

To give our model the best possible starting point, we put together a massive pre-training corpus consisting of several public and private datasets, including YT-Temporal-180M, InternVid, VideoCC, WTS-70M, etc. This includes 36 million carefully selected videos with high-quality captions, along with an additional 582 million clips with varying levels of noisy text (like auto-generated transcripts). To our knowledge, this is the largest and most diverse video training corpus of its kind.





Statistics on the video-text pre-training data. The large variations in CLIP similarity scores (the higher, the better) demonstrate the diverse caption quality of our pre-training data, which is a byproduct of the various methods used to harvest the text.
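As a rough illustration of how a CLIP similarity score can serve as a caption-quality proxy (the exact scoring pipeline used for the corpus is not described here), one could score a sampled frame against its caption with an off-the-shelf CLIP model:

```python
# Sketch: cosine similarity between one sampled video frame and its caption,
# using a publicly available CLIP checkpoint as a caption-quality proxy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def caption_similarity(frame: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())  # cosine similarity in [-1, 1]
```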

 

Two-stage training

The VideoPrism model architecture stems from the standard vision transformer (ViT), with a factorized design that sequentially encodes spatial and temporal information following ViViT. Our training approach leverages both the high-quality video-text data and the video data with noisy text mentioned above. To start, we use contrastive learning (an approach that minimizes the distance between positive video-text pairs while maximizing the distance between negative video-text pairs) to teach our model to match videos with their own text descriptions, including imperfect ones. This builds a foundation for matching semantic language content to visual content.
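This first-stage objective can be written as a standard symmetric contrastive (InfoNCE-style) loss over a batch of video and text embeddings. The sketch below is illustrative; the shapes and temperature value are assumptions, not the published configuration.

```python
# Sketch: symmetric video-text contrastive loss over a batch.
import torch
import torch.nn.functional as F


def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # video_emb, text_emb: (batch, dim); row i of each comes from the same pair
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # each video should match its own caption, and vice versa
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```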

After video-text contrastive training, we leverage the collection of videos without text descriptions. Here, we build on the masked video modeling framework to predict masked patches in a video, with a few improvements. We train the model to predict both the video-level global embedding and the token-wise embeddings from the first-stage model, to effectively leverage the knowledge acquired in that stage. We then randomly shuffle the predicted tokens to prevent the model from learning shortcuts.
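One simplified reading of this second stage is a distillation-style objective: a student encoder sees a masked video and is trained to match the frozen first-stage model's global and token-wise embeddings. The sketch below captures only that loss; the token-shuffling trick and the masked-decoder architecture mentioned above are details it does not model.

```python
# Simplified sketch of a second-stage objective: match the frozen first-stage
# model's global and token-wise embeddings (distillation on embeddings only).
import torch.nn.functional as F


def stage2_distillation_loss(student_tokens, student_global,
                             teacher_tokens, teacher_global):
    # student_tokens / teacher_tokens: (batch, num_tokens, dim)
    # student_global / teacher_global: (batch, dim)
    token_loss = F.mse_loss(student_tokens, teacher_tokens.detach())
    global_loss = F.mse_loss(student_global, teacher_global.detach())
    return token_loss + global_loss
```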

What is unique about VideoPrism's setup is that we use two complementary pre-training signals: text descriptions and the visual content within a video. Text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics. This enables VideoPrism to excel in tasks that demand an understanding of both appearance and motion.

 

Results

We conduct extensive evaluation of VideoPrism across four broad categories of video understanding tasks: video classification and localization, video-text retrieval, video captioning and question answering, and scientific video understanding. VideoPrism achieves state-of-the-art performance on 30 out of 33 video understanding benchmarks, all with minimal adaptation of a single, frozen model.





VideoPrism compared to the previous best-performing FMs.

 

Classification and localization

We evaluate VideoPrism on an existing large-scale video understanding benchmark (VideoGLUE) covering classification and localization tasks. We find that (1) VideoPrism outperforms all of the other state-of-the-art FMs, and (2) no other single model consistently came in second place. This tells us that VideoPrism has learned to effectively pack a variety of video signals into one encoder, from semantics at different granularities to appearance and motion cues, and that it works well across a variety of video sources.

 

Combining with LLMs

We further explore combining VideoPrism with LLMs to unlock its ability to handle various video-language tasks. In particular, when paired with a text encoder (following LiT) or a language decoder (such as PaLM-2), VideoPrism can be utilized for video-text retrieval, video captioning, and video QA tasks. We compare the combined models on a broad and challenging set of vision-language benchmarks. VideoPrism sets the new state of the art on most benchmarks. From the visual results, we find that VideoPrism is capable of understanding complex motions and appearances in videos (e.g., the model can recognize the different colors of spinning objects on the window in the visual examples below). These results demonstrate that VideoPrism is strongly compatible with language models.
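For the retrieval setting, the essential mechanic is ranking candidate videos by the similarity between frozen video embeddings and a text-query embedding produced by the tuned text tower. A minimal sketch with placeholder tensors, standing in for the actual LiT and PaLM-2 pipelines:

```python
# Sketch: rank videos against a text query by cosine similarity of embeddings.
import torch
import torch.nn.functional as F


def retrieve(video_embs: torch.Tensor, query_emb: torch.Tensor, k: int = 5):
    # video_embs: (num_videos, dim) from the frozen video encoder
    # query_emb:  (dim,) from the text tower
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(query_emb, dim=0)
    return torch.topk(sims, k=k)  # top-k most similar videos for the query
```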

 


We show qualitative results using VideoPrism with a text encoder for video-text retrieval (first row) and adapted to a language decoder for video QA (second and third rows). For the video-text retrieval examples, the blue bars indicate the embedding similarities between the videos and the text queries.

 

Scientific applications

Finally, we test VideoPrism on datasets used by scientists across domains, including fields such as ethology, behavioral neuroscience, and ecology. These datasets typically require domain expertise to annotate, for which we leverage existing scientific datasets open-sourced by the community, including Fly vs. Fly, CalMS21, ChimpACT, and KABR. VideoPrism not only performs exceptionally well, but actually surpasses models designed specifically for those tasks. This suggests that tools like VideoPrism have the potential to transform how scientists analyze video data across different fields.





VideoPrism outperforms the domain experts on various scientific benchmarks. We show the absolute score differences to highlight the relative improvements of VideoPrism. We report mean average precision (mAP) for all datasets, except for KABR, which uses class-averaged top-1 accuracy.
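For reference, class-averaged top-1 accuracy computes per-class accuracy first and then averages over classes, so rare behaviors weigh as much as common ones. A small sketch:

```python
# Sketch: class-averaged (balanced) top-1 accuracy.
import numpy as np


def class_averaged_top1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    accs = []
    for c in np.unique(y_true):
        mask = y_true == c
        accs.append((y_pred[mask] == c).mean())  # accuracy for class c
    return float(np.mean(accs))                  # average over classes
```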

 

Conclusion

With VideoPrism, we introduce a powerful and versatile video encoder that sets a new standard for general-purpose video understanding. Our emphasis on both a massive and varied pre-training dataset and innovative modeling techniques has been validated through our extensive evaluations. Not only does VideoPrism consistently outperform strong baselines, but its unique ability to generalize positions it well for tackling an array of real-world applications. Because of its potentially broad use, we are committed to continuing further responsible research in this space, guided by our AI Principles. We hope VideoPrism paves the way for future breakthroughs at the intersection of AI and video analysis, helping to realize the potential of ViFMs across domains such as scientific discovery, education, and healthcare.

 

Acknowledgements

This blog post is made on behalf of all the VideoPrism authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. We sincerely thank David Hendon for their product management efforts, and Alex Siegman, Ramya Ganeshan, and Victor Gomes for their program and resource management efforts. We also thank Hassan Akbari, Sherry Ben, Yoni Ben-Meshulam, Chun-Te Chu, Sam Clearwater, Yin Cui, Ilya Figotin, Anja Hauth, Sergey Ioffe, Xuhui Jia, Yeqing Li, Lu Jiang, Zu Kim, Dan Kondratyuk, Bill Mark, Arsha Nagrani, Caroline Pantofaru, Sushant Prakash, Cordelia Schmid, Bryan Seybold, Mojtaba Seyedhosseini, Amanda Sadler, Rif A. Saurous, Rachel Stigler, Paul Voigtlaender, Pingmei Xu, Chaochao Yan, Xuan Yang, and Yukun Zhu for the discussions, support, and feedback that greatly contributed to this work. We are grateful to Jay Yagnik, Rahul Sukthankar, and Tomas Izo for their enthusiastic support of this project. Lastly, we thank Tom Small, Jennifer J. Sun, Hao Zhou, Nitesh B. Gundavarapu, Luke Friedman, and Mikhail Sirotenko for their tremendous help with making this blog post.

 
