An implementation of the paper: A Neural Algorithm of Artistic Style
By: Amol Budhiraja @amolbudhiraja
- Title: A Neural Algorithm of Artistic Style
- Authors: Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
- Link: https://arxiv.org/abs/1508.06576
For faster processing, it is highly recommended to run this notebook with a GPU. Online services like Google Colab make it easy to use a GPU when running this code.
If you are trying to run the code locally, please refer to the following steps:
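If you are unsure whether a GPU is visible in your environment, a quick check like the following works (this assumes the code runs on PyTorch; adapt it if your setup differs):

```python
# Minimal GPU availability check (assumes a PyTorch-based environment).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
```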
- Clone this repository.
- Create a virtual environment:
python3 -m venv venv
- Activate the virtual environment:
source venv/bin/activate
- Install the dependencies:
pip install -r requirements.txt
- Run the code in main.py
To transfer the artistic style of one image to another, I split this problem into two main parts:
- Detect the artistic style of the first image.
- Apply the artistic style to the second image.
To accomplish this task, I leveraged key properties of Convolutional Neural Networks (CNNs) that allowed me to capture the artistic style contained in an image. Specifically, the lower layers of the network capture low-level features tied directly to the pixel values (e.g., brightness and intensity), while the higher layers capture higher-level features such as shape, color, and content. Additionally, I used a feature space intended to capture texture information, which yields a much more representative style space than a standard model architecture on its own.
I leveraged a VGG-Network architecture to extract the style information. VGGNet (from the Visual Geometry Group) is a very deep convolutional network optimized for object recognition tasks.
I used the feature space provided by the 16 convolutional and 5 pooling layers of the VGG-19 model; I did not use any of the fully connected layers. I also used mean (average) pooling instead of max-pooling, given its improved robustness to outliers and artifacts in the image.
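A minimal sketch of this setup, assuming a PyTorch/torchvision implementation (the exact loading code in main.py may differ):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load the pretrained VGG-19 feature extractor (convolutional + pooling layers only;
# the fully connected classifier head is not used).
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

# Swap every max-pooling layer for average (mean) pooling.
for i, layer in enumerate(vgg):
    if isinstance(layer, nn.MaxPool2d):
        vgg[i] = nn.AvgPool2d(kernel_size=2, stride=2)

# Freeze the network: only the generated image is optimized, never the weights.
for p in vgg.parameters():
    p.requires_grad_(False)
```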
Specifically, following the approach used in the paper, I reconstruct the input image using the following layers from the original VGG-Network (a sketch of collecting these activations follows the list).
conv1_1
conv2_1
conv3_1
conv4_1
conv5_1
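As a sketch of how these activations might be collected (the layer indices assume torchvision's standard VGG-19 `features` layout, and the helper name is illustrative):

```python
# Indices of the style layers within torchvision's VGG-19 `features` module.
STYLE_LAYERS = {
    "conv1_1": 0,
    "conv2_1": 5,
    "conv3_1": 10,
    "conv4_1": 19,
    "conv5_1": 28,
}

def extract_features(image, model, layers=STYLE_LAYERS):
    """Run `image` through `model` and collect activations at the named layers."""
    idx_to_name = {idx: name for name, idx in layers.items()}
    features = {}
    x = image
    for i, layer in enumerate(model):
        x = layer(x)
        if i in idx_to_name:
            features[idx_to_name[i]] = x
    return features
```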
According to the paper, the lower layers yield an almost perfect reconstruction, while in the higher layers detailed pixel information is lost but the high-level content of the image is preserved.
Based on the paper, I used the following content loss function, a squared-error loss between the two feature representations:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}$$

where $\vec{p}$ is the original (content) image, $\vec{x}$ is the generated image, and $P^{l}$ and $F^{l}$ are their respective feature representations in layer $l$.
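A minimal sketch of this loss (assuming PyTorch; the helper name is illustrative):

```python
import torch

def content_loss(gen_features, content_features):
    """Half the squared error between the generated and content feature maps
    at a single layer (F^l and P^l in the formula above)."""
    return 0.5 * torch.sum((gen_features - content_features) ** 2)
```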
On top of the CNN, I built a style representation that computes the correlations between the different filter responses. Specifically, I computed a Gram Matrix:
$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}$$

where $G^{l}_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$.
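A sketch of the Gram matrix computation (assuming PyTorch tensors of shape (batch, channels, height, width); the helper name is illustrative):

```python
import torch

def gram_matrix(features):
    """Gram matrix G^l: inner products between the vectorised feature maps."""
    b, c, h, w = features.size()
    f = features.view(b * c, h * w)   # vectorise each feature map
    return f @ f.t()                  # (b*c) x (b*c) matrix of inner products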
To generate an image that matches the style representation of the original image, I performed gradient descent from a white-noise image, minimising the following loss function, which assesses the mean-squared distance between the entries of the Gram matrices of the original image and the image being generated:

$$E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j}\left(G^{l}_{ij} - A^{l}_{ij}\right)^{2}, \qquad \mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_{l} E_{l}$$

where $\vec{a}$ is the original (style) image, $\vec{x}$ is the generated image, $A^{l}$ and $G^{l}$ are their Gram matrices in layer $l$, $N_{l}$ is the number of feature maps in that layer, $M_{l}$ is the size (height × width) of each feature map, and $w_{l}$ are weighting factors for each layer's contribution.
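A rough sketch of this style loss, reusing the `gram_matrix` helper above (the layer names and weights are illustrative assumptions):

```python
import torch

def style_loss(gen_features, style_grams, layer_weights):
    """Weighted sum of the per-layer terms E_l defined above."""
    loss = 0.0
    for name, weight in layer_weights.items():
        f = gen_features[name]              # activations of the generated image
        _, c, h, w = f.size()
        G = gram_matrix(f)                  # Gram matrix of the generated image
        A = style_grams[name]               # precomputed Gram matrix of the style image
        loss = loss + weight * torch.sum((G - A) ** 2) / (4 * c**2 * (h * w)**2)
    return loss
```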
I then combined these loss functions and defined the total loss as a weighted sum of the content and style terms:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for the content and style reconstruction, respectively.
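Putting the pieces together, here is a minimal optimization-loop sketch built on the helpers above (the content layer index, weights, learning rate, and step count are illustrative assumptions, not the exact values used in main.py):

```python
import torch

alpha, beta = 1.0, 1e6                                  # illustrative content/style weights
CONTENT_LAYER = {"conv4_2": 21}                         # paper's content layer; index assumes torchvision's VGG-19 layout
ALL_LAYERS = {**STYLE_LAYERS, **CONTENT_LAYER}
layer_weights = {name: 0.2 for name in STYLE_LAYERS}    # equal style-layer weights w_l

# Precompute the targets (content_img and style_img are preprocessed (1, 3, H, W) tensors).
content_targets = extract_features(content_img, vgg, layers=CONTENT_LAYER)
style_grams = {name: gram_matrix(f)
               for name, f in extract_features(style_img, vgg).items()}

# Start from a white-noise image and optimize its pixels directly.
generated = torch.randn_like(content_img, requires_grad=True)
optimizer = torch.optim.Adam([generated], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    gen_feats = extract_features(generated, vgg, layers=ALL_LAYERS)
    c_loss = content_loss(gen_feats["conv4_2"], content_targets["conv4_2"])
    s_loss = style_loss(gen_feats, style_grams, layer_weights)
    loss = alpha * c_loss + beta * s_loss               # L_total = alpha * L_content + beta * L_style
    loss.backward()
    optimizer.step()
```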
I explored various values of these weighting factors and of the number of optimizer steps; the results are discussed below.
The final algorithm turned out great! It is able to successfully transfer the style of the style image to the content image. Interestingly, it can also do the reverse by swapping the weighting and the input order.
To measure the impact of the style weight and the number of optimizer steps on the results of the model, I tried various values for each.
To test the model, I used two pictures: (1) my favorite basketball player, LeBron James, and (2) a design pattern I found online. Here are the original images.
As expected, an increase in style weight corresponded to a greater emphasis on style in the resulting image. A similar trend was observed when changing the content weight, so it is omitted from this presentation.
Interestingly, fewer gradient steps corresponded to a better result. This appears to occur because more gradient steps bias the result further towards the more heavily weighted term, which for the results presented below was style. Hence, some of the core aspects of the content tended to be overshadowed in the runs with the larger step count.