Highlights
- Real-time style transfer is now possible with on-device AI, which improves mobile photographs without the need for cloud services.
- Model5-like lightweight models provide a good balance between expressiveness, efficiency, and privacy for deployment on mobile devices.
- Future research is directed towards low-end devices, video style transfer, and optimization across different platforms.
The integration of AI into photography has been a significant shift, and at times a controversial one. It has changed how images are captured and processed, particularly through real-time style transfer and the underlying camera technology that enables on-the-fly photo editing. These advances allow mobile devices to transform the aesthetic of a photograph as it is being taken, opening new possibilities for mobile photography through real-time augmentation.
On-Device AI for Real-Time Style Transfer
Conventionally, neural style transfer (NST) models have depended on cloud servers for computation, which are costly to operate and compromise user privacy. Recently, there has been a move towards embedding AI models in mobile hardware to enable on-device, real-time style transfer. This approach removes the dependence on external servers, cutting costs and greatly improving user privacy.
The objective is to have real-time operation of sophisticated style transfer models on mobile computing platforms like smartphones, tablet computers, and embedded systems, which are typified by scarce computing resources and memory.

Challenges and Solutions in Mobile Deployment
Designing deep learning models for mobile use involves a fundamental trade-off between computational efficiency and visual quality. Reducing model size and parameter count tends to degrade output quality because the network has less capacity to process and represent the data. To address this, researchers have introduced lightweight NST models with various optimization methods based on architectures such as MobileNet and ResNet.
Key Architectural Optimizations
Depthwise Separable Convolutions: Introduced by MobileNet, this technique significantly reduces the computational cost of CNN models by decomposing standard convolution operations into depthwise and pointwise convolutions. This decomposition reduces parameters and floating-point operations while aiming to maintain performance.
The computational cost of a standard convolution grows with the product of the input and output channel counts and the square of the kernel size: a k x k convolution with C_in input and C_out output channels over an H x W feature map requires roughly k^2 * C_in * C_out * H * W multiply-accumulate operations. Depthwise separable convolution breaks this into a k x k kernel applied independently to each input channel (depthwise) followed by a 1 x 1 kernel that mixes information across channels (pointwise), costing about (k^2 * C_in + C_in * C_out) * H * W operations, a reduction by a factor of roughly 1/C_out + 1/k^2.
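As an illustration, a minimal PyTorch sketch of such a layer (not the exact layer definitions of the models discussed here) could look like the following, where `groups=in_channels` produces the per-channel depthwise pass:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """k x k depthwise convolution per channel, then a 1 x 1 pointwise convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: groups == in_channels applies one k x k filter to each input channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # Pointwise: a 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

For a 3 x 3 kernel with, say, 128 output channels, the ratio 1/C_out + 1/k^2 works out to roughly an eightfold reduction in multiply-accumulates.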
Residual Bottleneck Structure: Inspired by ResNet, this structure addresses the vanishing gradient problem in deep networks and reduces computational complexity by decreasing the number of parameters while maintaining network depth. MobileNetV2 further improved this by introducing Inverted Bottleneck and Linear Bottleneck concepts.
The Linear Bottleneck omits non-linear activation functions in reduced-dimensional spaces to prevent information loss, while the Inverted Bottleneck expands the number of channels initially before using depthwise convolution and then reducing dimensions, enhancing feature representation while reducing complexity.
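A hedged sketch of such a block, combining the inverted expansion, the depthwise convolution, and the linear (activation-free) projection, might look as follows. The expansion factor of 4 and the use of instance normalization (which the architecture described below applies inside residual blocks) are illustrative choices, not the exact configuration of the published models:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style block: expand channels, depthwise conv, linear 1x1 projection."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1x1 expansion: widen the representation before the cheap depthwise conv.
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            # 3x3 depthwise convolution operates in the expanded space.
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            # Linear bottleneck: project back down with NO activation, so information
            # is not destroyed in the low-dimensional space.
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # Residual (skip) connection keeps gradients flowing in deeper stacks.
        return x + self.block(x)
```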

Optimized Upsampling Techniques: Instead of computationally expensive transposed convolutions, methods like nearest neighbor interpolation followed by depthwise separable convolution are used in the decoder to reduce checkerboard artifacts and improve visual quality and efficiency. Model5 refined this stage further by using PyTorch’s ConvTranspose2d for upsampling, which improved computational cost and memory usage.
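A rough sketch of this decoder stage, with illustrative channel counts and normalization, is shown below: nearest-neighbor upsampling followed by a depthwise separable convolution, with the ConvTranspose2d alternative used by Model5 noted in a comment:

```python
import torch.nn as nn

def upsample_block(in_channels, out_channels):
    """Nearest-neighbor upsampling followed by a depthwise separable convolution.
    Avoids the checkerboard artifacts that transposed convolutions can introduce."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels, bias=False),
        nn.Conv2d(in_channels, out_channels, 1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

# Model5's alternative: a transposed convolution that doubles spatial resolution, e.g.
# nn.ConvTranspose2d(in_channels, out_channels, kernel_size=3, stride=2,
#                    padding=1, output_padding=1)
```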
All models developed in this context are based on an autoencoder architecture consisting of an encoder, residual blocks, and a decoder. The encoder compresses the input image for feature extraction, and the decoder reconstructs the transformed image.
Reflection padding is used to minimize edge distortions, and stride adjustments are employed for downsampling instead of pooling operations to improve efficiency. To balance efficiency and stability, batch normalization is applied to the encoder and decoder, while instance normalization is selectively used in residual blocks to enhance style transfer performance.
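Putting these pieces together, a skeleton of the described autoencoder might be assembled as below, reusing the `InvertedResidual` and `upsample_block` sketches above; the channel widths, kernel sizes, and block counts are placeholders rather than the exact values of Model1-5:

```python
import torch.nn as nn

class StyleTransferNet(nn.Module):
    """Illustrative encoder / residual-blocks / decoder layout described above."""
    def __init__(self, base=32, n_residual=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.ReflectionPad2d(4),                                 # reflection padding limits edge distortion
            nn.Conv2d(3, base, 9),
            nn.BatchNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),     # strided conv instead of pooling
            nn.BatchNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
            nn.BatchNorm2d(base * 4), nn.ReLU(inplace=True),
        )
        # Residual blocks use instance normalization (see InvertedResidual above).
        self.residuals = nn.Sequential(*[InvertedResidual(base * 4) for _ in range(n_residual)])
        self.decoder = nn.Sequential(
            upsample_block(base * 4, base * 2),
            upsample_block(base * 2, base),
            nn.ReflectionPad2d(4),
            nn.Conv2d(base, 3, 9),
        )

    def forward(self, x):
        return self.decoder(self.residuals(self.encoder(x)))
```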
Model Variations and Performance
Five model variations (Model1-5) were designed and evaluated based on parameters, floating-point operations (GFLOPs), memory usage, and image transformation quality.

• Model1, Model2, and Model3 shared the same encoder and decoder, differing in their residual block structures (standard, depthwise separable, and ResNet-style bottleneck, respectively).
• Model4 was a lightweight model obtained by simply reducing input filter sizes and output channels, resulting in only 9,331 parameters. While lightweight, it showed some limitations in expressiveness compared to Model2 and Model3.
• Model5 adopted the Inverted Bottleneck and Linear Bottleneck concepts from MobileNetV2, prioritizing channel expansion in residual blocks for enhanced expressiveness. Despite having more than twice the parameters of Model4, Model5 demonstrated superior efficiency in memory usage and computational cost, as well as excellent visual quality. It was able to perform real-time inference at 512×512 resolution on mobile CPUs and at 1024×1024 resolution with Android GPU acceleration (NNAPI).
For training, approximately 4,800 images from the COCO2017 dataset were used as content images, and OpenAI’s DALL-E model generated diverse artistic-style images. The VGG16 network, pre-trained on ImageNet, served as the feature extractor, with its weights remaining fixed during training. The total loss function was a weighted sum of content loss (measured by MSE of ReLU2_2 features from VGG16) and style loss (computed using the Gram matrix from ReLU1_2, ReLU2_2, ReLU3_3, and ReLU4_3 layers of VGG16). The weight ratio of style to content loss was set to 2.5 x 10^4 for comparative experiments.
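A sketch of this loss setup follows; the VGG16 layer indices for relu1_2/relu2_2/relu3_3/relu4_3 are the standard torchvision positions, and the default weights simply mirror the stated 2.5 x 10^4 style-to-content ratio, so treat this as an approximation of the training objective rather than the authors' exact code:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def gram_matrix(features):
    """Channel-wise Gram matrix used for the style loss."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

# Frozen VGG16 feature extractor (ImageNet weights); ImageNet input normalization
# is omitted here for brevity.
vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
STYLE_LAYERS = {3: "relu1_2", 8: "relu2_2", 15: "relu3_3", 22: "relu4_3"}

def extract_features(x):
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            feats[STYLE_LAYERS[i]] = x
    return feats

def total_loss(output, content, style_grams, style_weight=2.5e4, content_weight=1.0):
    """style_grams: precomputed {layer_name: Gram matrix} of the style image."""
    out_feats = extract_features(output)
    content_feats = extract_features(content)
    content_loss = F.mse_loss(out_feats["relu2_2"], content_feats["relu2_2"])
    style_loss = sum(F.mse_loss(gram_matrix(out_feats[name]), style_grams[name])
                     for name in STYLE_LAYERS.values())
    return content_weight * content_loss + style_weight * style_loss
```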

Camera Technology and Mobile Device Integration
To enable real-time style transfer on mobile devices, PyTorch-trained models are converted to optimized formats such as ONNX (Open Neural Network Exchange) for cross-platform deployment, with ONNX Runtime used for execution on Android devices. For Apple devices, CoreML is used instead; it is optimized for Apple hardware and leverages the GPU and Neural Engine.
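Exporting a trained PyTorch model to ONNX typically follows the standard `torch.onnx.export` path; the snippet below is a minimal sketch using the hypothetical `StyleTransferNet` from earlier, with an arbitrary opset version and file name:

```python
import torch

# Export the trained model to ONNX for cross-platform deployment.
model = StyleTransferNet().eval()
dummy = torch.randn(1, 3, 512, 512)  # example input; spatial dims marked dynamic below
torch.onnx.export(
    model, dummy, "style_transfer.onnx",
    input_names=["image"], output_names=["stylized"],
    opset_version=17,
    dynamic_axes={"image": {2: "height", 3: "width"},
                  "stylized": {2: "height", 3: "width"}},
)
```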
A lightweight Android application demonstrated the integration of an ONNX-converted style transfer model, performing direct inference on a Samsung Galaxy S21. The application resized photos (e.g., to 1152×1536) before style transfer and used post-processing techniques like color enhancement with the OpenCV library to further improve visual output quality. Proper memory management is critical in Android to prevent memory leaks and out-of-memory errors during repeated inference processes.
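The Android app itself runs inference through ONNX Runtime's Java API, but the same pipeline can be sketched on the desktop with the Python bindings; the file names and the CLAHE-based color enhancement below are illustrative stand-ins for the app's OpenCV post-processing:

```python
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("style_transfer.onnx")

# Pre-processing: load, resize to the demo resolution (OpenCV uses width x height),
# and convert to a normalized NCHW float tensor (assuming the model expects [0, 1]).
image = cv2.imread("photo.jpg")
image = cv2.resize(image, (1152, 1536))
x = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
x = np.transpose(x, (2, 0, 1))[None]

# Inference, then back to an 8-bit BGR image.
y = session.run(None, {"image": x})[0][0]
y = np.clip(np.transpose(y, (1, 2, 0)) * 255.0, 0, 255).astype(np.uint8)
y = cv2.cvtColor(y, cv2.COLOR_RGB2BGR)

# Illustrative color enhancement: CLAHE on the L channel in LAB space.
lab = cv2.cvtColor(y, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
y = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("stylized.jpg", y)
```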
Model5, for instance, achieved real-time inference at 512×512 resolution on mobile CPUs of devices like the Samsung Galaxy S21, Google Pixel 6, and Pixel 7 virtual devices. With Android GPU acceleration via the Neural Networks API (NNAPI), real-time inference was achieved at 1024×1024 resolution. NNAPI is available on Android 8.1 (API level 27) or higher, supporting efficient execution of machine learning models.
Impact and Future Directions
The study demonstrates the practicality of real-time style transfer on mobile phones, moving beyond earlier cloud-based or desktop-GPU approaches. This focus on efficiency is a key benefit for real-world applications, enabling effortless artistic transformations in mobile photography, augmented reality, and creative software without the need for external processing resources.

Future research in this area involves improving efficiency on older, low-end hardware by investigating further model pruning and other ways to reduce computational requirements, for example by breaking high-resolution images into smaller blocks that can be processed sequentially.
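The sources describe this tiling idea only at a high level; a minimal sketch of sequential block processing, assuming a generic `stylize_fn` and ignoring seam handling between tiles, could look like this:

```python
import numpy as np

def stylize_in_tiles(image, stylize_fn, tile=512):
    """Hypothetical sequential tiling: split a high-resolution HWC image into
    tile x tile blocks, stylize each block independently, and stitch the results.
    Overlap/blending to hide seams between tiles is deliberately omitted."""
    h, w, _ = image.shape
    out = np.zeros_like(image)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            block = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = stylize_fn(block)  # must preserve block shape
    return out
```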
Expanding the research to iOS platforms and optimizing for Apple’s CoreML framework would offer insights into cross-platform performance. In addition, the progress presented herein opens up opportunities for applying real-time video style transfer using smartphone cameras, which would necessitate additional investigation of video processing methods that reduce frame processing time.
While the sources extensively cover real-time style transfer, they do not explicitly discuss “smart composition” as a distinct feature or technology in the context of camera editing; the focus is on applying artistic styles to images and videos.