Google researchers have introduced a new face detection framework called BlazeFace. BlazeFace is adapted from the Single Shot MultiBox Detector (SSD) framework and optimized for inference on mobile GPUs. As the name suggests, it is aimed at people who are too impatient to wait for their novelty images to pop up on their smartphone.
BlazeFace is meant to meet that expectation by keeping delay to a minimum. The face detector runs at 200-1000+ FPS on flagship smartphones. The researchers first proposed a compact feature-extractor convolutional neural network inspired by MobileNet V1/V2. MobileNet builds its layers from 3×3 depthwise convolution kernels followed by pointwise (1×1) convolutions.
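For BlazeFace, the researchers used 5×5 kernels in the bottlenecks of the model architecture, which makes the receptive field grow faster than it does with 3×3 kernels. They also placed an additional depthwise layer between the two pointwise convolutions of a bottleneck, which speeds up receptive field growth further. This block is the basis of BlazeFace's higher abstraction level layers.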
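To see why a larger kernel can be affordable in this kind of architecture, the sketch below counts multiply-accumulate operations for a depthwise separable convolution, that is, a k×k depthwise convolution followed by a 1×1 pointwise convolution. The tensor shape and the function name are illustrative assumptions, not values taken from the BlazeFace work, and the count ignores memory traffic and other GPU effects.

```python
def depthwise_separable_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate counts for one depthwise separable convolution.

    Assumes stride 1 and 'same' padding, so the spatial size is unchanged.
    Returns (depthwise_macs, pointwise_macs).
    """
    # Depthwise part: one kernel x kernel filter per input channel.
    depthwise = height * width * c_in * kernel * kernel
    # Pointwise part: a 1x1 convolution that mixes channels.
    pointwise = height * width * c_in * c_out
    return depthwise, pointwise


# Hypothetical tensor shape, chosen only for illustration.
H, W, C = 56, 56, 128

dw3, pw = depthwise_separable_macs(H, W, C, C, kernel=3)
dw5, _ = depthwise_separable_macs(H, W, C, C, kernel=5)

total3 = dw3 + pw
total5 = dw5 + pw

print(f"3x3 depthwise share of the block: {dw3 / total3:.1%}")
print(f"5x5 depthwise share of the block: {dw5 / total5:.1%}")
print(f"total increase from 3x3 to 5x5:   {total5 / total3 - 1:.1%}")
```

For this shape the pointwise convolution dominates the cost, so moving from a 3×3 to a 5×5 depthwise kernel raises the total operation count by roughly 12% while covering a noticeably larger neighborhood per layer.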
Apart from developing super-fast abstraction level layers, the researchers also developed a new GPU-friendly anchor scheme modified from SSD. Other obstacles may arise over time for researchers and developers to tackle, but for now the focus is firmly on making face detection through smartphone cameras as efficient as possible.
To ensure seamless face detection, they added six facial keypoint coordinates to estimate face rotation for the video processing pipeline and built separate models for the front and rear cameras. BlazeFace was trained on a dataset of 66,000 images, and performance was evaluated on a geographically diverse dataset of 2,000 images. The front camera model showed 98.61% average precision with an inference time of 0.6 ms.
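The article does not spell out how keypoints are turned into a rotation estimate. As a minimal sketch, assuming two of the keypoints mark the eye centers (an assumption made here for illustration), the in-plane rotation of a face can be approximated by the angle of the line joining the eyes. The function name and the sample coordinates are hypothetical.

```python
import math


def roll_angle_degrees(eye_a, eye_b):
    """Angle of the line from eye_a to eye_b relative to the image x-axis.

    eye_a is the eye that appears further left in the image, eye_b the one
    further right. Points are (x, y) in image coordinates, where y grows
    downward, so a positive result means eye_b sits lower than eye_a.
    """
    dx = eye_b[0] - eye_a[0]
    dy = eye_b[1] - eye_a[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical keypoints: an upright face and a tilted one.
upright = roll_angle_degrees((40.0, 60.0), (80.0, 60.0))
tilted = roll_angle_degrees((40.0, 40.0), (80.0, 80.0))

print(f"upright face: {upright:.1f} degrees")
print(f"tilted face:  {tilted:.1f} degrees")
```

A downstream stage in a video pipeline could use such an angle to rotate the face crop to an upright position before running a heavier model on it.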
A framework such as BlazeFace is likely to give rise to a range of new applications. The model can be deployed in any face-related computer vision application, including 2D/3D facial keypoint, contour, or surface geometry estimation, facial feature or expression classification, and face region segmentation.