Robohub.org
 

Artificial intelligence in action


by
11 April 2018



share this:

Aude Oliva (right), a principal research scientist at the Computer Science and Artificial Intelligence Laboratory and Dan Gutfreund (left), a principal investigator at the MIT–IBM Watson AI Laboratory and a staff member at IBM Research, are the principal investigators for the Moments in Time Dataset, one of the projects related to AI algorithms funded by the MIT–IBM Watson AI Laboratory.
Photo: John Mottern/Feature Photo Service for IBM

By Meg Murphy
A person watching videos that show things opening — a door, a book, curtains, a blooming flower, a yawning dog — easily understands the same type of action is depicted in each clip.

“Computer models fail miserably to identify these things. How do humans do it so effortlessly?” asks Dan Gutfreund, a principal investigator at the MIT-IBM Watson AI Laboratory and a staff member at IBM Research. “We process information as it happens in space and time. How can we teach computer models to do that?”

Such are the big questions behind one of the new projects underway at the MIT-IBM Watson AI Laboratory, a collaboration for research on the frontiers of artificial intelligence. Launched last fall, the lab connects MIT and IBM researchers together to work on AI algorithms, the application of AI to industries, the physics of AI, and ways to use AI to advance shared prosperity.

The Moments in Time dataset is one of the projects related to AI algorithms that is funded by the lab. It pairs Gutfreund with Aude Oliva, a principal research scientist at the MIT Computer Science and Artificial Intelligence Laboratory, as the project’s principal investigators. Moments in Time is built on a collection of 1 million annotated videos of dynamic events unfolding within three seconds. Gutfreund and Oliva, who is also the MIT executive director at the MIT-IBM Watson AI Lab, are using these clips to address one of the next big steps for AI: teaching machines to recognize actions.

Learning from dynamic scenes

The goal is to provide deep-learning algorithms with large coverage of an ecosystem of visual and auditory moments that may enable models to learn information that isn’t necessarily taught in a supervised manner and to generalize to novel situations and tasks, say the researchers.

“As we grow up, we look around, we see people and objects moving, we hear sounds that people and object make. We have a lot of visual and auditory experiences. An AI system needs to learn the same way and be fed with videos and dynamic information,” Oliva says.

For every action category in the dataset, such as cooking, running, or opening, there are more than 2,000 videos. The short clips enable computer models to better learn the diversity of meaning around specific actions and events.

“This dataset can serve as a new challenge to develop AI models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis,” Oliva adds, describing the factors involved. Events can include people, objects, animals, and nature. They may be symmetrical in time — for example, opening means closing in reverse order. And they can be transient or sustained.

Oliva and Gutfreund, along with additional researchers from MIT and IBM, met weekly for more than a year to tackle technical issues, such as how to choose the action categories for annotations, where to find the videos, and how to put together a wide array so the AI system learns without bias. The team also developed machine-learning models, which were then used to scale the data collection. “We aligned very well because we have the same enthusiasm and the same goal,” says Oliva.

Augmenting human intelligence

One key goal at the lab is the development of AI systems that move beyond specialized tasks to tackle more complex problems and benefit from robust and continuous learning. “We are seeking new algorithms that not only leverage big data when available, but also learn from limited data to augment human intelligence,” says Sophie V. Vandebroek, chief operating officer of IBM Research, about the collaboration.

In addition to pairing the unique technical and scientific strengths of each organization, IBM is also bringing MIT researchers an influx of resources, signaled by its $240 million investment in AI efforts over the next 10 years, dedicated to the MIT-IBM Watson AI Lab. And the alignment of MIT-IBM interest in AI is proving beneficial, according to Oliva.

“IBM came to MIT with an interest in developing new ideas for an artificial intelligence system based on vision. I proposed a project where we build data sets to feed the model about the world. It had not been done before at this level. It was a novel undertaking. Now we have reached the milestone of 1 million videos for visual AI training, and people can go to our website, download the dataset and our deep-learning computer models, which have been taught to recognize actions.”

Qualitative results so far have shown models can recognize moments well when the action is well-framed and close up, but they misfire when the category is fine-grained or there is background clutter, among other things. Oliva says that MIT and IBM researchers have submitted an article describing the performance of neural network models trained on the dataset, which itself was deepened by shared viewpoints. “IBM researchers gave us ideas to add action categories to have more richness in areas like health care and sports. They broadened our view. They gave us ideas about how AI can make an impact from the perspective of business and the needs of the world,” she says.

This first version of the Moments in Time dataset is one of the largest human-annotated video datasets capturing visual and audible short events, all of which are tagged with an action or activity label among 339 different classes that include a wide range of common verbs. The researchers intend to produce more datasets with a variety of levels of abstraction to serve as stepping stones toward the development of learning algorithms that can build analogies between things, imagine and synthesize novel events, and interpret scenarios.

In other words, they are just getting started, says Gutfreund. “We expect the Moments in Time dataset to enable models to richly understand actions and dynamics in videos.”




MIT News

            AUAI is supported by:



Subscribe to Robohub newsletter on substack



Related posts :

Robot Talk Episode 153 – Origami-inspired robots, with Chenying Liu

  24 Apr 2026
In the latest episode of the Robot Talk podcast, Claire chatted to Chenying Liu from University of Oxford about how a robot's physical form can actively contribute to sensing, processing, decision-making, and movement.

Sony AI table tennis robot outplays elite human players

  22 Apr 2026
New robot and AI system has beaten professional and elite table tennis players.

AI system learns to keep warehouse robot traffic running smoothly

  20 Apr 2026
This new approach adapts to decide which robots should get the right of way at every moment, avoiding congestion and increasing throughput.

Robot Talk Episode 152 – Dexterous robot hands, with Rich Walker

  17 Apr 2026
In the latest episode of the Robot Talk podcast, Claire chatted to Rich Walker from Shadow Robot Company about their advanced robotic hands for research and industry.

What I’ve learned from 25 years of automated science, and what the future holds: an interview with Ross King

and   14 Apr 2026
Ross King created the first robot scientist back in 2009. He spoke to us about the nature of scientific discovery, the role AI has to play, and his recent work in DNA computing.

Robot Talk Episode 151 – Robots to study the ocean, with Simona Aracri

  10 Apr 2026
In the latest episode of the Robot Talk podcast, Claire chatted to Simona Aracri from National Research Council of Italy about innovative robot designs for oceanography and environmental monitoring.

Generative AI improves a wireless vision system that sees through obstructions

  08 Apr 2026
With this new technique, a robot could more accurately detect hidden objects or understand an indoor scene using reflected Wi-Fi signals.

Resource-constrained image generation and visual understanding: an interview with Aniket Roy

  07 Apr 2026
Aniket tells us about his research exploring how modern generative models can be adapted to operate efficiently while maintaining strong performance.



AUAI is supported by:







Subscribe to Robohub newsletter on substack




 















©2026.02 - Association for the Understanding of Artificial Intelligence