Here is a university project I particularly enjoyed working on, and I hope some of the ideas here are useful to somebody else.
- The task: Computer Vision project with something to do with Face Detection.
- The project: Simplified Viola-Jones implementation (in MATLAB), from the ground up.
Following is an overview. I won’t publish any code; you’d miss out on all the fun!
First, reading and sources:
Of course, the original paper: Robust Real-Time Face Detection. Viola P.; Jones M.J. (2001)
And the material from CSE 455: Computer Vision, Shapiro L. (U. Washington) (2017): image datasets, more down-to-earth theory, and implementation suggestions.
Quick overview:
Some “features” made up of rectangles are generated randomly within some bounds. One can compute the value of a feature for a particular image by adding the pixel values under its white rectangles and subtracting those under its black rectangles. What’s called an Integral Image speeds up these computations significantly.
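I won’t publish the project code, but the integral-image trick is simple enough to sketch. Here is a minimal Python version (the project itself was in MATLAB): each entry of the integral image holds the sum of all pixels above and to its left, so any rectangle sum needs only four lookups.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of all pixels at or above y and at or left of x
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    # Sum of the h-by-w rectangle with top-left corner (top, left),
    # computed with at most 4 lookups into the integral image.
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

A feature value is then just `rect_sum` over the white rectangles minus `rect_sum` over the black ones, so its cost no longer depends on rectangle size.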
The name of the game is identifying which features best tell faces apart from non-faces. This is the result of training over an image dataset.
Training: Overview:
A schematic of the training process is shown below. Thousands (or tens of thousands) of images are processed, computing a number of features on each. Then this data, together with labels indicating which images are faces and which are not, is passed to the AdaBoost learning algorithm, which chooses the best features and assigns each a weight as a function of its error.
The best classifiers are saved for later use in detection.
Training: AdaBoost:
The following diagrams describe the workings of AdaBoost. Four images of faces and four backgrounds are used as training data in this example, and we are calculating only two features.
The colored discs hold the feature value of each image: blue means face and red means background. As seen in the diagram, features are ordered by numeric value, keeping track of which values correspond to faces and which don’t. Disc size represents the “weight” of each image for error calculation. At the start, all image weights are equal. We’ll come back to this later.
Once the features are ordered, the first iteration begins:
- First, for each feature, identify the best threshold (i.e. the one with minimum error) that separates both categories, as well as a polarity (i.e. whether faces lie above or below the threshold).
- Then choose the best feature for the current iteration: the one with the smallest error.
- We may now turn this feature into a weak classifier by assigning it a weight; the better it classifies, the bigger the weight. In the case of perfect classification (zero error), quite unlikely with a big dataset, the weight is capped, as it would otherwise be infinite.
- Finally, image weights are adjusted so that incorrectly classified images get a higher weight.
The next iteration commences with the updated weights, and the process continues until the desired number of classifiers is reached.
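The steps above can be sketched in a few lines of Python (again, an illustrative sketch rather than the project’s MATLAB code; the `eps` constant is the cap mentioned earlier for the zero-error case):

```python
import numpy as np

def best_stump(values, labels, weights):
    # Find the threshold and polarity minimising the weighted error for
    # one feature. values: feature value per image; labels: +1 face, -1 not.
    best = (np.inf, None, None)  # (error, threshold, polarity)
    for thr in np.unique(values):
        for polarity in (+1, -1):
            pred = np.where(polarity * values < polarity * thr, 1, -1)
            err = weights[pred != labels].sum()
            if err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost_round(features, labels, weights):
    # One iteration: pick the best feature, compute its classifier weight
    # alpha, then boost the weights of the misclassified images.
    stumps = [best_stump(f, labels, weights) for f in features]
    j = int(np.argmin([s[0] for s in stumps]))
    err, thr, pol = stumps[j]
    eps = 1e-10  # caps alpha when the error is exactly zero
    alpha = 0.5 * np.log((1 - err + eps) / (err + eps))
    pred = np.where(pol * features[j] < pol * thr, 1, -1)
    weights = weights * np.exp(-alpha * labels * pred)  # misclassified grow
    weights /= weights.sum()  # renormalise for the next iteration
    return j, thr, pol, alpha, weights
```

In practice, sorting each feature’s values once (as in the diagrams) lets you find the best threshold in a single pass instead of the brute-force double loop shown here.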
The final result is the strong classifier, shown below. Its threshold can be varied as a trade-off between sensitivity and false-positive rate.
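Evaluating the strong classifier is just a weighted vote of the chosen weak classifiers against that adjustable threshold; a Python sketch (hypothetical names, not the project code):

```python
def strong_classify(feature_values, stumps, theta=0.0):
    # stumps: list of (alpha, threshold, polarity), one per chosen feature.
    # feature_values: the corresponding feature values for one window.
    # Lowering theta raises sensitivity at the cost of more false positives.
    score = 0.0
    for v, (alpha, thr, pol) in zip(feature_values, stumps):
        h = 1 if pol * v < pol * thr else -1  # weak classifier vote
        score += alpha * h
    return 1 if score >= theta else -1
```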
Detection:
Finally, the detection process is outlined below. It is pretty straightforward once the training is done. The image is scanned, and each window is evaluated; those classified as faces are framed in red. Then comes non-maximum suppression, which consolidates multiple detections of the same face into one. Example images from here and here.
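Non-maximum suppression itself is short enough to sketch (a generic greedy version in Python, not the project’s MATLAB code): keep the highest-scoring window, drop everything that overlaps it too much, repeat.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x, y, w, h).
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, thresh=0.3):
    # Greedily keep the best-scoring box, suppress overlapping ones, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```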
Note that this implementation has a single layer. One of the core ideas of Viola-Jones is the cascaded detector, in which many such layers are chained, tweaking each strong classifier’s threshold so that windows unlikely to be faces get promptly discarded while the rest progress down the cascade; those that make it through all layers are then classified as faces. This gif is pretty illustrative.
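The cascade idea reduces to a short early-exit loop. A Python sketch under assumed names (each stage is a strong classifier with its own stumps, a recall-tuned threshold, and indices into the window’s feature values):

```python
def stage_passes(feature_values, stumps, theta):
    # One strong-classifier stage: weighted vote of its weak classifiers
    # against the stage threshold theta. Each stump is (alpha, thr, pol, idx),
    # where idx selects which feature value the stump looks at.
    score = 0.0
    for alpha, thr, pol, idx in stumps:
        h = 1 if pol * feature_values[idx] < pol * thr else -1
        score += alpha * h
    return score >= theta

def cascade_classify(feature_values, stages):
    # stages: list of (stumps, theta), cheapest stages first. Thresholds are
    # tuned for very high recall, so real faces rarely exit early while most
    # background windows are rejected by the first, cheapest stages.
    for stumps, theta in stages:
        if not stage_passes(feature_values, stumps, theta):
            return False  # rejected early; remaining stages never run
    return True  # survived every stage: classify as face
```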
Finally, some numbers: how good is it? Not bad, but it is pretty slow. The single-layer approach and MATLAB are to blame here, with the images above taking about 60 s each.
Subtleties:
There are many, many things we haven’t looked at: feature calculation, image normalization, data structures… As always, the devil is in the details.