It is reported today that NetEase Interactive Entertainment AI Lab has proposed a single-image, real-time, high-resolution face reenactment algorithm that can generate 1440×1440 results at real-time frame rates on desktop GPUs and 256×256 results at real-time frame rates on mobile CPUs. The main idea behind the method is to decouple and encode the appearance and motion information of the face, and then to learn prior knowledge about both from a large number of videos through self-supervised learning.
Depending on how the motion information is encoded, related work can be divided into two categories: warp-based (also called deformation-based) methods and direct synthesis methods. A warp-based method represents the motion information explicitly as a motion field, whereas a direct synthesis method encodes the appearance and motion information of the face in a low-dimensional latent space and then decodes it to obtain the synthesis result.
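To make the warp-based idea concrete, the sketch below applies a dense motion field to a tiny grayscale image using backward warping with bilinear sampling. It is only an illustration under stated assumptions: the function names `bilinear_sample` and `warp_image`, and the convention that the field stores a per-pixel `(dx, dy)` offset into the source image, are hypothetical and are not taken from the NetEase method.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a 2D list-of-lists image at real-valued (x, y), clamping to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def warp_image(source, motion_field):
    """Backward warp: output[y][x] is the source sampled at (x + dx, y + dy).

    Assumed convention: motion_field[y][x] == (dx, dy), in pixels.
    """
    h, w = len(source), len(source[0])
    return [
        [
            bilinear_sample(
                source,
                x + motion_field[y][x][0],
                y + motion_field[y][x][1],
            )
            for x in range(w)
        ]
        for y in range(h)
    ]

source = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
# Every output pixel looks one pixel to the right in the source image.
look_right = [[(1.0, 0.0)] * 3 for _ in range(3)]
warped = warp_image(source, look_right)
```

Because only a motion field has to be predicted and the pixels themselves are copied from the source image, the network does not have to regenerate the appearance from scratch, which is the property the warp-based family relies on.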
The method incorporates the core idea of direct synthesis into a warp-based pipeline and consists primarily of two modules. First, because a warp-based algorithm does not need to regenerate all of the face information, it lends itself to a network structure that is light enough for real-time applications.
As a result, this scheme takes the warp-based framework as its foundation and proposes a lightweight U-shaped warping network. At the same time, it borrows the pose encoding idea from direct synthesis methods: the three-dimensional head pose of the driving face is encoded and injected into the network to improve generation quality under large poses.
Second, to improve the algorithm's efficiency even further, this scheme proposes a hierarchical motion field prediction network that estimates the pixel motion from the source face to the driving face. Unlike existing single-scale motion field estimation algorithms, it predicts the motion field from coarse to fine based on feature point images at multiple scales, which reduces the algorithm's complexity while preserving calculation accuracy.
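One common way to realize coarse-to-fine prediction is to upsample the motion field from the previous scale and add a small correction predicted at the current scale. The sketch below shows only that bookkeeping. The residual formulation, the factor-of-two pyramid, and the names `upsample_field`, `refine_field`, and `coarse_to_fine` are assumptions made for illustration; the article does not say how the NetEase network combines scales, and the hard-coded residuals stand in for what a network would predict.

```python
def upsample_field(field, factor=2):
    """Nearest-neighbour upsample of a motion field.

    Offsets are multiplied by `factor` because one coarse pixel
    spans `factor` pixels at the finer scale.
    """
    h, w = len(field), len(field[0])
    return [
        [
            (
                field[y // factor][x // factor][0] * factor,
                field[y // factor][x // factor][1] * factor,
            )
            for x in range(w * factor)
        ]
        for y in range(h * factor)
    ]

def refine_field(coarse_field, residual):
    """Add a residual predicted at the finer scale to the upsampled coarse field."""
    upsampled = upsample_field(coarse_field)
    return [
        [(u[0] + r[0], u[1] + r[1]) for u, r in zip(up_row, res_row)]
        for up_row, res_row in zip(upsampled, residual)
    ]

def coarse_to_fine(coarsest_field, residuals):
    """Refine a motion field through a list of residuals, coarsest scale first."""
    field = coarsest_field
    for residual in residuals:
        field = refine_field(field, residual)
    return field

# A 1x1 coarse field refined by a 2x2 residual and then a 4x4 residual.
coarsest = [[(1.0, -0.5)]]
res_2x2 = [
    [(0.25, 0.0), (0.0, 0.0)],
    [(0.0, 0.0), (0.0, 0.25)],
]
res_4x4 = [[(0.0, 0.0)] * 4 for _ in range(4)]
fine = coarse_to_fine(coarsest, [res_2x2, res_4x4])
```

Under this formulation most of the motion is resolved on small, cheap feature maps, and the higher-resolution stages only have to predict small corrections.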
In the training phase, the method takes a pair of source and driving images as input. First, a 3DMM (3D Morphable Model) fitting algorithm estimates the shape, expression, and head pose parameters of the face in the image, and the corresponding feature point image is then computed from these parameters.
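As a rough illustration of how fitted parameters can be turned into a feature point image, the sketch below evaluates a linear 3DMM at a few landmark positions, applies a head rotation, projects the points to 2D, and marks them in a small binary image. Everything concrete here is an assumption: the three-landmark toy model, the yaw-only rotation, the weak-perspective projection, and the names `landmarks_3d`, `rotate_yaw`, and `landmark_image` are hypothetical, and the fitting step itself (estimating the parameters from a photo) is not shown.

```python
import math

def landmarks_3d(mean, shape_basis, expr_basis, alpha, beta):
    """Linear 3DMM at landmark positions:
    mean + sum_i alpha[i] * shape_basis[i] + sum_j beta[j] * expr_basis[j]."""
    points = []
    for k, mean_point in enumerate(mean):
        p = list(mean_point)
        for coeff, basis in list(zip(alpha, shape_basis)) + list(zip(beta, expr_basis)):
            for d in range(3):
                p[d] += coeff * basis[k][d]
        points.append(tuple(p))
    return points

def rotate_yaw(point, yaw):
    """Rotate a 3D point about the vertical (y) axis by `yaw` radians."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)

def landmark_image(points, yaw, scale, size):
    """Weak-perspective projection; each landmark becomes one pixel set to 1."""
    image = [[0] * size for _ in range(size)]
    center = size // 2
    for p in points:
        x, y, _ = rotate_yaw(p, yaw)
        col = int(round(center + scale * x))
        row = int(round(center - scale * y))  # image rows grow downwards
        if 0 <= row < size and 0 <= col < size:
            image[row][col] = 1
    return image

# Toy model with three landmarks: left eye, right eye, mouth.
mean = [(-1.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
shape_basis = [[(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]  # eyes further apart
expr_basis = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]]   # mouth moves down

points = landmarks_3d(mean, shape_basis, expr_basis, alpha=[0.0], beta=[1.0])
image = landmark_image(points, yaw=0.0, scale=2.0, size=9)
```

Changing `beta` moves only the mouth landmark and changing `yaw` moves all landmarks together, which shows how expression and head pose stay separated in such a representation.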