Principal Investigator
Omar Medjaouri
Inclusive Dates 
11/27/2023 to 03/27/2024

Background

In recent years, computer vision researchers have proposed several solutions to the problem of estimating object pose, using both hand-crafted visual features and features learned by deep neural networks. However, these approaches can be severely limited: they either require objects to have a sufficient number of opportunistic features of interest, or they require the development of specialized neural networks specific to the object of interest.


Figure 1. Left: color image from one of the motion capture system cameras. Right: a corresponding silhouette image produced by a machine learning algorithm.

Approach

SwRI created a processing pipeline that employs a digital twin of the object of interest and the set of cameras that comprise the motion capture system. This digital twin model allows us to generate virtual views of the object (for each camera) as a function of the object’s pose. We use this capability to generate silhouette images, masks indicating which pixels in a frame belong to the object. Importantly, this rendering process is implemented in an algorithmic differentiation framework that allows the output masks to be differentiated with respect to the object pose.
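The idea of a silhouette rendered as a smooth function of pose can be sketched in two dimensions. The toy below (function names and parameters are illustrative, not from the actual pipeline) renders a sphere's silhouette as a disk whose pixel values are sigmoids of the signed distance to the boundary, so the mask varies smoothly with the pose parameter; a central-difference gradient stands in for the algorithmic-differentiation framework used in the real system, which renders the full 3-D digital twin in six degrees of freedom.

```python
import numpy as np

def soft_silhouette(center, radius=10.0, height=64, width=64, sharpness=4.0):
    """Render a soft silhouette of a sphere (a disk in image space).

    Each pixel is a sigmoid of the signed distance to the disk boundary,
    so the mask is a smooth function of the pose parameter `center` and
    can be differentiated with respect to it.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    signed_dist = radius - np.hypot(xs - center[0], ys - center[1])
    return 1.0 / (1.0 + np.exp(-sharpness * signed_dist))

def silhouette_loss_grad(center, observed, eps=1e-3):
    """Pixel-wise squared-error loss against an observed mask, plus a
    central-difference gradient standing in for true autodiff."""
    center = np.asarray(center, dtype=float)
    def loss(c):
        return np.mean((soft_silhouette(c) - observed) ** 2)
    grad = np.array([(loss(center + eps * e) - loss(center - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
    return loss(center), grad
```

Because the rendered mask is smooth in the pose, the gradient of the pixel loss indicates which direction to move the pose estimate to improve alignment with the observation.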

Parallel to the digital twin, our process pairs real camera images with a machine learning algorithm that performs segmentation of the object, producing silhouette images. With these silhouette images providing observations, the next stage of the pipeline applies a convex optimization algorithm that aligns the digital twin silhouette images to the observation images by refining the pose of the digital twin.
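The alignment stage can likewise be sketched with the same 2-D toy renderer (names and step sizes are illustrative assumptions, not the pipeline's actual solver): plain gradient descent on the pixel-wise squared error between the rendered silhouette and the observed segmentation mask, refining a 2-D pose until the two masks overlap.

```python
import numpy as np

def soft_silhouette(center, radius=10.0, height=64, width=64, sharpness=4.0):
    # Differentiable disk silhouette: sigmoid of signed distance to the boundary.
    ys, xs = np.mgrid[0:height, 0:width]
    signed_dist = radius - np.hypot(xs - center[0], ys - center[1])
    return 1.0 / (1.0 + np.exp(-sharpness * signed_dist))

def refine_pose(observed, center0, lr=50.0, steps=150, eps=1e-3):
    """Refine a 2-D pose by gradient descent on the pixel-wise squared
    error between the rendered silhouette and the observed mask."""
    c = np.asarray(center0, dtype=float)
    def loss(c):
        return np.mean((soft_silhouette(c) - observed) ** 2)
    for _ in range(steps):
        # Central-difference gradient; the real pipeline gets this from autodiff.
        grad = np.array([(loss(c + eps * e) - loss(c - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
        c -= lr * grad
    return c
```

Starting from a pose estimate a few pixels off, the loop drives the rendered silhouette onto the observed one, recovering the true center; the real system performs the analogous refinement over a full 6-DoF pose per camera view.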


Figure 2. Tracking with Differentiable Rendering. The pixels highlighted in green represent rendered silhouettes (model predictions), the pixels highlighted in red represent segmentation masks (observations), and the pixels that are yellow (i.e., both red and green) represent alignment between the model and observations.

Accomplishments

  • Demonstrated object tracking of several objects of interest, including balls, baseball bats, and golf clubs.
  • Demonstrated tracking through occlusions caused by humans in the capture volume.
  • Validated the system against a marker-based motion capture system, showing a root mean square position error of just 1.5% of the length of the tracked object.