
Microsoft researchers develop new tech for video AI agents

Tuesday, September 2, 2025, 12:15, by ComputerWorld
Microsoft researchers are developing technologies for a new class of video AI agents to explore three-dimensional spaces before making decisions.

The technology framework, called MindJourney, uses a range of AI technologies to understand and analyze 3D spaces, reason about the surroundings, and predict movement, the researchers wrote in a blog entry late last month.


MindJourney includes video-generation systems, vision language models (VLMs), and reasoning techniques that can predict surroundings, patterns, and movement. These technologies are packaged around “world models” that simulate real-world surroundings.

Vision language models analyze the pixels in a visual scene to identify and reason about objects and their surroundings. For example, Nvidia’s recent work on its Cosmos VLMs helps robots move and take action in their surroundings.

MindJourney explores spaces by combining real-world images with scenes generated by the world model. For example, the framework’s reasoning capabilities generate multiple visual scenarios that agents may see when moving in different directions. This is much like how text-based AI generators work.

“This enhancement could enable agents to more accurately interpret spatial relationships and physical dynamics, helping them to operate effectively in changing environments,” the researchers wrote in the blog entry.

VLMs excel at 2D surroundings, but the visual world is 3D. MindJourney provides better viewpoints of real-world scenarios and ultimately aims to forecast how scenes change over time, according to the Microsoft researchers.

MindJourney “sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration,” the researchers wrote in a paper.
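The loop described in that quote can be illustrated with a minimal Python sketch. It is not Microsoft's code or API: every name here (`View`, `explore`, `toy_world_model`, `toy_vlm`) is hypothetical, and both models are stand-in stubs. The sketch assumes only what the quote states: a camera trajectory is sketched, a world model synthesizes a view at each step, and a VLM reasons over the collected views.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class View:
    """One observation: the starting image plus the camera moves that led to it."""
    source_image: str
    moves: Tuple[str, ...]


def explore(
    image: str,
    trajectories: Sequence[Sequence[str]],
    world_model: Callable[[str, Tuple[str, ...]], View],
) -> List[View]:
    """Collect the real view plus one imagined view per step of each trajectory."""
    views = [View(image, ())]  # the real observation, with no camera moves
    seen = {()}                # avoid re-synthesizing shared trajectory prefixes
    for trajectory in trajectories:
        for step in range(1, len(trajectory) + 1):
            prefix = tuple(trajectory[:step])
            if prefix not in seen:
                seen.add(prefix)
                views.append(world_model(image, prefix))
    return views


def toy_world_model(image: str, moves: Tuple[str, ...]) -> View:
    """Stub (assumption): a real world model would generate new pixels here."""
    return View(image, moves)


def toy_vlm(question: str, views: Sequence[View]) -> str:
    """Stub (assumption): a real VLM would reason over the images themselves."""
    return f"{question} -> answered using {len(views)} views"


# Hypothetical candidate camera trajectories for a single input image.
candidate_trajectories = [
    ("forward", "forward"),
    ("forward", "turn_left"),
    ("turn_right",),
]
views = explore("kitchen.jpg", candidate_trajectories, toy_world_model)
answer = toy_vlm("Is the mug left of the sink?", views)
```

In a real system the two stubs would be replaced by a video-generation model and a vision language model. The sketch only shows how imagined views accumulate into the multi-view evidence that the VLM then reasons over.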

MindJourney’s technologies could improve assistive robots and remote inspection, and enrich virtual and augmented reality experiences, the researchers wrote in the paper.

But there are also concerns.

“More capable spatial reasoning can enhance autonomous surveillance systems or military platforms; and greater autonomy could displace certain manual-labor jobs,” the researchers wrote.

Early AI research, such as Google’s milestone cat detector, focused on identifying objects in still images through vision models.

Video AI is the next frontier, with Nvidia leading the charge through its focus on giving robots strong vision capabilities. In late August, the company announced Jetson Thor, a new computer for robots that is capable of running VLMs locally.

Most of the popular large language models can now handle images, video, and text, but they remain limited in scope when it comes to visual AI.
https://www.computerworld.com/article/4049703/microsoft-researchers-develop-new-tech-for-video-ai-ag...
