A team of New York University researchers that includes Facebook AI Lab Director Yann LeCun recently published a paper explaining how they built a deep learning model capable of predicting the position of human limbs in images. That field of computer vision, called human pose estimation, doesn’t get as much attention as things like facial recognition or object recognition, but it’s actually quite difficult and potentially very important in fields such as human-computer interaction and computer animation.
Computers that can accurately identify the positions of people’s arms, legs, joints and general body alignment could lead to better gesture-based controls for interactive displays, more accurate markerless motion-capture systems (i.e., no sensors stuck to people’s bodies), and robots (or other computers) that can infer actions as well as identify objects. Even when part of somebody’s body, or even an entire side, is hard or impossible to see, pose-estimation systems should be smart enough to predict how those limbs are positioned.
It’s no wonder that Facebook, which stores billions of images and videos and now owns the Oculus technology, would be interested in this field, assuming LeCun’s involvement in this recent research is indeed an indication of Facebook’s interest. Other companies such as Microsoft, Disney and Google certainly are.
The NYU research, which is dubbed MoDeep, isn’t the first attempt to apply deep learning to pose estimation. Two other NYU teams (each featuring some combination of researchers LeCun, Jonathan Tompson, Arjun Jain and/or Christoph Bregler) have published papers on it, and a team of Google researchers published a paper in December 2013 highlighting a system called DeepPose (not to be confused with Facebook’s DeepFace facial-recognition system). The featured image on this post is the result of DeepPose’s work on a set of sports images.
Google claimed state-of-the-art performance against a test dataset called FLIC (short for Frames Labeled In Cinema), and the recent NYU research claims to have significantly improved upon those results. All of the research has relied on convolutional neural networks, which are the deep learning technique of choice for computer vision tasks, and all of it took advantage of the algorithms’ ability to learn context from the entirety of an image (e.g., the position of someone’s eyes, ears and nose) rather than just predetermined features relating to the joints.
The most recent research out of NYU, however, used a slightly different architecture for its network and also created a new training dataset that includes information about the motion of the joints in the images (thus the MoDeep moniker). Essentially, the researchers took the FLIC images, paired them with neighboring frames from the associated movies, and averaged the images to calculate the flow of body parts between them.
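To make the idea of pairing a frame with its neighbor a little more concrete, here is a minimal Python sketch. It is not the MoDeep pipeline. It swaps in plain frame differencing as a crude motion cue, and the function names, the nested-list image format and the one-frame offset are all assumptions made for illustration.

```python
# Illustrative only: a crude motion cue from two neighboring frames.
# The helper names, the list-of-lists grayscale format and the use of
# absolute differencing are assumptions, not details from the paper.

def pair_with_neighbor(frames, index, offset=1):
    """Pair a frame with a nearby frame from the same clip.

    The neighbor index is clamped so it never falls outside the clip.
    """
    neighbor_index = min(max(index + offset, 0), len(frames) - 1)
    return frames[index], frames[neighbor_index]


def frame_difference(frame, neighbor):
    """Return the per-pixel absolute difference between two frames."""
    if len(frame) != len(neighbor) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame, neighbor)
    ):
        raise ValueError("frames must have the same dimensions")
    return [
        [abs(a - b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame, neighbor)
    ]


# Toy 3x3 "clip": a single bright pixel moves one step to the right.
clip = [
    [[0, 0, 0], [255, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 255, 0], [0, 0, 0]],
]

frame, neighbor = pair_with_neighbor(clip, 0)
motion = frame_difference(frame, neighbor)
```

In this toy clip, the nonzero values in `motion` mark the pixels where something changed between the two frames, which is the kind of extra signal a network could receive alongside the still image. A real system would more likely use a proper optical flow estimate than a raw difference.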
Pose estimation, it seems, is the natural next step after proving the accuracy of deep learning for object recognition, a task that it has come to dominate over the past few years. Whereas object recognition is more holistic (i.e., what is that object?), pose estimation is more local (i.e., what is the position of the elbow joint on that object?). Aside from computer vision, deep learning has also proven especially adept at things such as speech recognition, machine listening and natural language processing.