Google, which owns YouTube, announced on Oct. 19 a new dataset of film clips, designed to teach machines how humans move in the world. The dataset is called AVA, for “atomic visual actions,” and its videos aren’t anything special to human eyes: they’re three-second clips, curated from YouTube, of people doing things like drinking water and cooking. But each clip is bundled with a file that outlines the person a machine learning algorithm should watch, as well as a description of their pose and whether they’re interacting with another human or an object. It’s the digital version of pointing at a dog and coaching a child by saying, “dog.”
This technology could help Google analyze the years of video it processes on YouTube every day. It could be applied to better target advertising based on whether you’re watching a video of people talking or fighting, or to content moderation. The eventual goal is to teach computers social visual intelligence, the authors write in an accompanying research paper, which means “understanding what humans are doing, what might they do next, and what they are trying to achieve.”
Google’s video dataset is free.
In 2015, I speculated on Twitter:
I wonder if @google already has enough @youtube videos to create a video version of Wikipedia (and if they already are machine learning it)