Computer Vision and other Fun

Jamil Draréni

3D Head Tracking in Video

How a simple model can go a long way.

In this post, I will describe my own implementation of a head tracker. 3D Head Tracking (HT) consists of inferring the 3D orientation and displacement of the head, often from a (single) video source. Here, the video source will be a Logitech C910 webcam; of course, any webcam will do. Video grabbing and image processing will be done using the OpenCV library.

The outline of the algorithm is as follows:

  1. Grab a frame and detect 2D features.
  2. Initialize the head pose.
  3. Compute 3D features → FTold.
  4. Grab a frame and detect 2D features.
  5. Compute 3D features → FTnew.
  6. Compute motion that registers FTnew → FTold.
  7. Update head pose.
  8. FTold = FTnew and go to 4.

At first glance, the toughest step in this outline seems to be the 2D→3D feature conversion. It turns out to be among the easiest tasks thanks to a simple idea: the cylindrical head model. In a nutshell, 2D features are unprojected from the camera reference frame onto a virtual cylinder; the intersection points provide the sought 3D positions of the image features. But first things first…

Grabbing an image is easy with OpenCV. Boilerplate code for that is a loop that looks like:

#include <opencv2/opencv.hpp>
#include <iostream>

using namespace cv;
using namespace std;

int main()
{
    Mat frame, image;
    VideoCapture capture;
    int dev_id = 1; //Device number.
    const char* window_name = "Head Tracker";

    capture.open(dev_id);
    if (!capture.isOpened()){
        cerr << "Failed to open video device "
             << dev_id << endl;
        return 1;
    }

    for (;;){
        capture >> frame; //Grab the next frame.
        if (frame.empty())
            continue;

        frame.copyTo(image);
        imshow(window_name, image);
        char key = (char) waitKey(5);

        if (key == ' ') //Quit on space bar.
            break;
    }
    return 0;
}

In each input frame, 2D features are detected. Among the myriad of feature types, KLT features are probably the best suited to our real-time needs. Indeed, KLT features are easy and fast to compute because no descriptor computation is needed and no scale-space analysis is involved (at least not as in SIFT). Using OpenCV, KLT features are retrieved as follows:

const int MAX_COUNT = 100;
TermCriteria termcrit(CV_TERMCRIT_ITER |
                      CV_TERMCRIT_EPS,
                      20, 0.3);
// We use two sets of points in order to swap
// pointers.
vector<Point2f> points[2];
Size subPixWinSize(10,10), winSize(21,21);
Mat gray;

//Convert the image to gray scale (OpenCV stores frames as BGR).
cvtColor(image, gray, CV_BGR2GRAY);

//Feature detection is performed here...
goodFeaturesToTrack(gray, points[1], MAX_COUNT,
                    0.01, 10, Mat(), 3, 0, 0.04);
//Refine corner locations to sub-pixel accuracy.
cornerSubPix(gray, points[1], subPixWinSize,
             Size(-1,-1), termcrit);

Now that features are detected, they are unprojected and intersected with the virtual cylinder. An exact solution to this ray-cylinder intersection can easily be found on the net.
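For illustration, here is one way this intersection could be computed. This is only a sketch under my own assumptions: the cylinder axis is taken parallel to the camera y-axis and passes through a point center (in camera coordinates) at radius r, K is assumed to be a CV_64F camera matrix, and the function name and parameters are hypothetical, not part of the original code:

//Sketch: unproject a pixel and intersect the resulting ray with a
//vertical cylinder of radius r whose axis passes through 'center'.
bool intersectCylinder(const Point2f& px, const Mat& K,
                       const Point3f& center, double r,
                       Point3f& p3d)
{
    //Ray direction d = K^-1 * (u, v, 1)^T, camera center as origin.
    Mat uv = (Mat_<double>(3,1) << px.x, px.y, 1.0);
    Mat d = K.inv() * uv;
    double dx = d.at<double>(0);
    double dy = d.at<double>(1);
    double dz = d.at<double>(2);

    //For a vertical cylinder only x and z matter:
    //(t*dx - cx)^2 + (t*dz - cz)^2 = r^2  =>  a*t^2 + b*t + c = 0
    double a = dx*dx + dz*dz;
    double b = -2.0*(dx*center.x + dz*center.z);
    double c = center.x*center.x + center.z*center.z - r*r;

    double disc = b*b - 4.0*a*c;
    if (disc < 0.0) //The ray misses the cylinder.
        return false;

    //The smaller positive root is the surface facing the camera.
    double t = (-b - std::sqrt(disc))/(2.0*a);
    if (t <= 0.0)
        return false;

    p3d = Point3f((float)(t*dx), (float)(t*dy), (float)(t*dz));
    return true;
}

Now that we have the 3D positions of the features at time Tt-1, the same features are tracked in the upcoming frame using the optical flow routine from OpenCV: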

vector<uchar> status;
vector<float> err;
//Track the previous features in the current frame.
calcOpticalFlowPyrLK(prev_gray, gray,
                     points[0], points[1],
                     status, err);
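
Not every feature survives tracking: calcOpticalFlowPyrLK sets status to zero for points it lost. Before the registration step, both the 2D points and their 3D counterparts should be pruned accordingly. A minimal sketch, where obj_pts is a hypothetical vector<Point3f> holding the cylinder intersections in sync with the 2D features:

//Discard features that failed to track, keeping the 2D
//points and their 3D cylinder points in sync.
size_t k = 0;
for (size_t i = 0; i < points[1].size(); i++){
    if (!status[i])
        continue;
    points[1][k] = points[1][i];
    obj_pts[k] = obj_pts[i]; //Hypothetical 3D container.
    k++;
}
points[1].resize(k);
obj_pts.resize(k);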

The result of this tracking is a set of features at time Tt. To get the change in head pose, we register the 3D features at time Tt-1 with the 2D features at time Tt. This is performed using a PnP algorithm. Because the virtual cylinder represents the head (a rough estimate!), it must be updated with the incremental pose just computed. In a sense, the cylinder is a state object of the tracked head.
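
In OpenCV this registration can be done with solvePnP. Below is a minimal sketch of the pose update, assuming obj_pts holds the 3D points from time Tt-1, K is the camera matrix, and (R_head, t_head) is the accumulated head pose stored as CV_64F matrices; these names are mine, not part of the original code:

//Estimate the incremental motion that maps the 3D points (time Tt-1)
//to their tracked 2D projections (time Tt).
Mat rvec, tvec;
solvePnP(obj_pts, points[1], K, Mat(), rvec, tvec);

Mat R_inc;
Rodrigues(rvec, R_inc); //Axis-angle vector -> 3x3 rotation matrix.

//Fold the increment into the running head pose (the cylinder state).
R_head = R_inc * R_head;
t_head = R_inc * t_head + tvec;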

The head pose algorithm runs comfortably on a 2.4 GHz laptop using a Logitech C910 webcam, as the following video shows:

[Video: real-time head tracking demo]