Project Proposal
Introduction/Background
For our project, we want to create an algorithm that takes in a real-life picture and changes it into a cartoon-style image. We find this application of machine learning to be very interesting and want to learn more about it. A lot of work has already been done on transferring specific art styles onto real-life pictures [1], but not a lot has been done with cartoons [2]. We want to focus on "cartoonifying" backgrounds and landscapes. With the limited time we have, we will initially focus on backgrounds and add additional complexity as time and resources permit.
Problem Definition
Animation is a long and grueling process that requires a lot of artistic ability. Creating the landscapes and backgrounds that give life to our favorite cartoons is done either by hand or with drawing software, meaning it is not automated [3]. Our idea is to make an algorithm that takes an image of real-life scenery and alters it to look like an animated background. Such a tool could drastically speed up the process of making a movie, reduce costs, and make cartoon background creation readily available to anyone; all one needs is a camera and a setting!
Our project's goal is to make cartoon background creation easy and accessible. Being able to transfer a desired style onto a real-life picture would make our algorithm very versatile and useful for any user. A novice can take existing stills from media in the style they want and transform their pictures into that style, while a more advanced or professional user can create their own set of style images and reuse them for all future projects. The main goal is for the algorithm to require no knowledge of machine learning to use: the user just feeds in style images and real-life pictures to create cartoon backgrounds.
Data Collection
We wanted to test our algorithms with datasets that varied in both style and size. Finding datasets of images from specific cartoons proved to be surprisingly difficult, so for some of them we had to create our own. We used a total of 4 datasets: Simpsons, Family Guy, Shinkai, and AllStyles. All of these datasets can be found on our Google Drive [10].
The Simpsons and Family Guy datasets are ones we made ourselves by compiling Google Images results for each cartoon. The Simpsons dataset was our lowest-quality dataset: it consists of only 8 images of varying quality across different settings (outdoors, indoors, water, etc.). The Family Guy dataset consists of 83 high-quality images in varying settings and from different seasons of the show. Having the images come from different seasons makes the art style more varied, since the style evolves over the years of the show.
The Shinkai dataset is one we found of images from Makoto Shinkai's movies Your Name and Weathering With You. It consists of about 2,500 high-quality stills from the movies, plus about 6,500 real-world images for training. All the pictures are varied in setting, making it our strongest dataset.
To build the AllStyles dataset, we used the collection of images released with AnimeGANv2 [9], a TensorFlow implementation that uses the GAN framework to quickly convert photos into an anime style. The input images are real-world photos of different landscape settings as well as buildings and other structures. The style images comprise four smaller datasets of about 1,000 style images each, in different anime styles: Paprika, Shinkai, SummerWar, and Hayao. We combined them to try to train a "generic anime" model that does not have a single defined style.
Since the datasets were already in good shape, we did not need to do any cleaning. We did, however, shrink them by removing random images to make them more manageable given our limited computing power. Training models on the full-size datasets would have taken several days, so we experimented with dataset size to keep the computing requirements reasonable; a sketch of this subsampling step is shown below.
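The thinning we describe above amounts to copying a random subset of each dataset folder. The sketch below is illustrative rather than our actual script; the function name, file pattern, and counts are hypothetical.

```python
import random
import shutil
from pathlib import Path

def subsample_dataset(src_dir, dst_dir, n_images, seed=0):
    """Copy a random subset of n_images files from src_dir into dst_dir."""
    files = sorted(Path(src_dir).glob("*.jpg"))   # collect candidate images
    random.Random(seed).shuffle(files)            # deterministic shuffle so runs are repeatable
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for f in files[:n_images]:                    # keep only the first n_images
        shutil.copy(f, Path(dst_dir) / f.name)

# e.g. thin a style folder down to 500 pictures (hypothetical paths)
# subsample_dataset("shinkai/style", "shinkai_small/style", 500)
```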
Methods
For the first half of our project, we wanted to familiarize ourselves with the various image style transfer algorithms available. We became interested in style transfer algorithms due to their versatility. The most approachable style transfer tool we could find was the Neural Style Transfer algorithm by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, and we used TensorFlow's implementation of it [5]. TensorFlow provides tutorials and Google Colab notebooks to get started with style transfer, so we began running tests with handpicked test images to see what kind of results we would get.
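For reference, the quickest way to reproduce the kind of test we ran is the pre-trained arbitrary-stylization model that the TensorFlow tutorial [5] points to on TF Hub. The sketch below is illustrative rather than our exact notebook code; the file names are hypothetical.

```python
import tensorflow as tf
import tensorflow_hub as hub

def load_image(path, max_dim=512):
    """Load an image file, scale pixels to [0, 1], and add a batch dimension."""
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
    img = tf.image.convert_image_dtype(img, tf.float32)
    scale = max_dim / tf.cast(tf.reduce_max(tf.shape(img)[:2]), tf.float32)
    new_shape = tf.cast(tf.cast(tf.shape(img)[:2], tf.float32) * scale, tf.int32)
    return tf.image.resize(img, new_shape)[tf.newaxis, ...]

# Pre-trained fast arbitrary style transfer model from TensorFlow Hub
hub_model = hub.load("https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")

content = load_image("real_photo.jpg")     # hypothetical content image
style = load_image("cartoon_still.jpg")    # hypothetical style image
stylized = hub_model(tf.constant(content), tf.constant(style))[0]

tf.keras.utils.save_img("stylized.png", stylized[0].numpy())
```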
Although TensorFlow's tutorials were very helpful in visualizing how these types of algorithms work, they fell short of what we wanted to accomplish. The resulting images were nowhere near the results we were looking for, as shown by the images in our midterm report. We conducted more research and came across two GitHub repositories, AnimeGANv2 and CartoonGAN, that tackled our defined problem and explained how they solved it.
CartoonGAN is an implementation of the ideas presented in "CartoonGAN: Generative Adversarial Networks for Photo Cartoonization" [2]. The idea behind it is to use Generative Adversarial Networks, GANs for short, to create the images. A GAN is a pair of CNNs, one of which is trained to generate images that try to fool the other. The discriminator network, D, is the CNN trained to identify whether an input image is a generated image or one from the dataset. The generator network, G, is the CNN trained to fool D. The image below, taken from [2], visualizes the makeup of the generator and discriminator networks.
The game of cat and mouse these two CNNs play, where G trains to generate images that fool D while D trains to distinguish dataset images from generated ones, eventually leads G to create images that resemble the dataset's images closely enough to be indistinguishable to the discriminator. If G and D are designed correctly, the generated images will also be indistinguishable in style to people.
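To make the adversarial setup concrete, a minimal training step for a generic GAN of this kind might look like the sketch below. This shows only the standard adversarial loss, not CartoonGAN's full objective (which also includes a content loss and an edge-promoting term), and the `generator` and `discriminator` models are assumed to be defined elsewhere.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(2e-4)
d_opt = tf.keras.optimizers.Adam(2e-4)

@tf.function
def train_step(real_cartoons, photos, generator, discriminator):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(photos, training=True)                  # G turns photos into "cartoons"
        real_logits = discriminator(real_cartoons, training=True)
        fake_logits = discriminator(fakes, training=True)

        # D wants to label dataset cartoons as 1 and generated images as 0
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # G wants D to label its outputs as real
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)

    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return g_loss, d_loss
```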
AnimeGANv2 has the same premise as CartoonGAN, but its generator G is different to account for the difference in style between a traditional western cartoon and anime; the discriminator D is also slightly different. Below is the makeup of the generator and discriminator networks for AnimeGAN.
We trained several CartoonGAN models using different datasets. Before feeding any data into a model for training, we ran the style images through a smoothing algorithm to remove unnecessary details that would distract the algorithm from the style; a sketch of this step follows this paragraph. We made a model with the Simpsons dataset and one with the Family Guy dataset, and we attempted to train a model with the Shinkai dataset but had to stop about a third of the way through because training was taking too long. Since the Simpsons dataset was very simple it did not require any tricks and we did not experiment with it; we just fed it to the model and let it run.
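The smoothing step mentioned above follows the spirit of the edge-smoothing preprocessing described in the CartoonGAN paper [2]: detect edges, dilate them, and blur only those regions so hard outlines do not dominate the style images. The OpenCV sketch below is illustrative; the kernel size and Canny thresholds are assumed values, not the exact ones we used.

```python
import cv2
import numpy as np

def smooth_style_image(img_bgr, kernel_size=5, canny_low=100, canny_high=200):
    """Blur only the regions around detected edges, leaving flat areas untouched."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_low, canny_high)            # thin edge map
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    edge_region = cv2.dilate(edges, kernel)                   # widen the edge regions
    blurred = cv2.GaussianBlur(img_bgr, (kernel_size, kernel_size), 0)
    out = img_bgr.copy()
    out[edge_region > 0] = blurred[edge_region > 0]           # replace edge pixels with blurred ones
    return out

# e.g. smoothed = smooth_style_image(cv2.imread("style/frame_001.jpg"))  # hypothetical path
```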
The Family Guy dataset let us experiment little by little. Since we were creating it from scratch, we explored how the number of style images affected the final output, training with 8, 30, 50, and 83 images. We did not go higher than 83 style images because it became very difficult to find new ones. As we expected, the model performed better when we had more images, although its performance was still far from exemplary. It helped us see that the model performs better with more images; it also takes much longer to train.
Most of the experimenting, and the trouble, came when we tried to use the Shinkai dataset. It is a huge dataset and we had significant computational limitations. We started training a model on the complete dataset, but after doing the math we realized it would have taken over 400 hours to finish training, which was far too much time. We started shrinking the dataset and reducing the learning rate, batch size, and other parameters to get the model to train faster. In the end we settled for training for fewer epochs, with half the batch size and learning rate, to see what results were produced.
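The 400-hour figure comes from a back-of-envelope estimate along the lines of the one below. The dataset size and 100-epoch schedule are taken from this report; the batch size and per-step time are hypothetical values for illustration.

```python
# Rough estimate of total training time from a few timed steps (assumed numbers).
n_content_images = 6500     # real-world training images in the Shinkai dataset
epochs = 100                # the standard training schedule we were following
batch_size = 8              # assumed batch size
seconds_per_step = 18       # hypothetical time measured for one training step on our hardware

steps_per_epoch = n_content_images // batch_size
total_hours = steps_per_epoch * epochs * seconds_per_step / 3600
print(f"estimated training time: {total_hours:.0f} hours")   # roughly 400+ hours on these assumptions
```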
For AnimeGANv2, we used pretrained weights to produce output images. The weights were pretrained on roughly 1,000-1,500 images for each artist. The training images were 256x256 pixels and had the smoothing algorithm run over them; smoothing removes the unimportant fine details so the GAN model focuses on the style and content of the images. Pretrained weights exist for the styles of the artists Hayao and Shinkai, as well as for the anime Paprika. Using the pretrained weights on our dataset, we converted 790 256x256 real-life images into the style of each respective artist.
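For reference, feeding images to a pretrained generator mostly comes down to resizing them to the training resolution and rescaling pixel values. The sketch below assumes a generator that takes 256x256 inputs scaled to [-1, 1], which is a common GAN convention; the repository's own test script [7] is the authoritative version of this step.

```python
import tensorflow as tf

def preprocess_for_generator(path, size=256):
    """Read an image, resize it to the generator's training resolution, and scale to [-1, 1]."""
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
    img = tf.image.resize(img, (size, size))
    img = img / 127.5 - 1.0                # [0, 255] -> [-1, 1]
    return img[tf.newaxis, ...]            # add batch dimension

def postprocess(output):
    """Map a generator output in [-1, 1] back to displayable uint8 pixels."""
    pixels = tf.clip_by_value((output[0] + 1.0) * 127.5, 0.0, 255.0)
    return tf.cast(pixels, tf.uint8)
```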
We also attempted to train the AnimeGANv2 architecture on the AllStyles dataset as well as the Simpsons and Family Guy datasets, but the computational power of Google Colab was insufficient and training would have taken an excessive amount of time; we did the math and it came out to around 500 hours. We tried training a simpler model with a smaller learning rate for the generator and discriminator networks, fewer epochs, a smaller training dataset, and larger batch sizes, but it did not decrease the training time enough for this approach to be viable.
We used both algorithms to output images in the style of Makoto Shinkai, the creator of Your Name and Weathering With You. For AnimeGANv2 we used the dataset described in the Data Collection section of this report. CartoonGAN already had a pre-built model in this style, so to save time we used it, with the images from our midterm report as input images to test the results.
Ideal Results
We want our resulting images to be distinctively cartoon-like. The following examples are comparisons between real-life settings and stills from Makoto Shinkai's movie Your Name [4].
Results
Original Images
These are some of the original images we applied style transfer to:
Sample Cartoon Images:
Images from The Simpsons, Family Guy, and Shinkai’s work for style reference.
CartoonGAN with Simpsons Model:
About the Simpsons Results:
The resulting images we got from training CartoonGAN with the Simpsons dataset were not exemplary, but they do show that the Simpsons' style was transferred. Compared to the original images, we can see that the results have been altered in the right direction. The first thing we notice is that the vibrant greens of the originals have been turned into the dark shade of green present in The Simpsons. Everything has also gained a yellow hue, and all the colors have been darkened and shifted toward the Simpsons' color palette. Edges are not clearly defined and seem to exhibit some chromatic aberration, which may be a result of the low-resolution training images. We believe the results were limited by the quality of the dataset.
CartoonGAN with Family Guy Dataset
The resulting image we got for the Family Guy dataset was very poor, and we do not know exactly where we went wrong with this model. The possible errors are not having enough content images, the style images being too character-focused (since nearly all stills from the show center on characters), or a mistake somewhere in our GAN architecture. The bottom line is that the training went wrong and we did not have time to go back and fix it. The output image is shown below despite its quality.
Original Image:
Resulting Image:
CartoonGAN with Our Shinkai Model
About our Shinkai Results:
Our Shinkai dataset was huge. We tried training the model with the full dataset, but even pre-training was taking far too long. We cut the number of style and content images down drastically, to about 500 style images and about 1,000 content images. Even with these reduced numbers, training was slow: we trained the model for about 15 hours, which got us through 27 of the usual 100 epochs. The results are better than we expected. There is not much difference from the original images, but we can tell the training was heading in the right direction: features within objects are starting to blend, boundaries between objects are not yet being blurred, and the colors are shifting slightly to better fit Shinkai's color palette. We are proud of these results, and we wish we had known beforehand just how computationally expensive this process would be so we could have planned around our computational limitations.
CartoonGAN with Shinkai Pretrained Model:
For these images we used a pretrained model in the Shinkai style to see how our images would have looked if we had the chance to finish our training.
About the Shinkai Pretrained Model Results:
As we can see, the results improve drastically when the model is fully trained. Everything that was good about our partially trained model is amplified, as expected, with no noticeable flaws.
AnimeGAN with Pretrained Models:
Again, we used pretrained models of the styles mentioned below, including Shinkai, to compare the performance of AnimeGANv2 to CartoonGAN.
About the Results In General
Our results were severely affected by our computational limitations. Working with images is very computationally expensive, and none of the tools we had available were enough to fully train our models. Given that, we were happy with our results, except for the Family Guy test. With our Simpsons dataset we experimented with different types of content images, different training values, and so on; since the dataset was small it did not take long to train, so we could make these alterations fairly freely. We settled on the images that we thought captured the style transfer best.
For our heavier work with the Shinkai model, we were happy with what we were able to produce; it clearly shows that our training was on the right track. Although we liked the results from the models we trained, we did not think they were the best proxies for comparing CartoonGAN and AnimeGANv2. To compare the algorithms, we took pretrained models we found for each and ran them to produce images in the Shinkai style.
For our comparison metric, we asked family and friends to choose which set of images looked more like the intended style. We showed them stills from Your Name and Weathering With You and then showed them our resulting images. We chose this method because our project's idea is, at its core, a consumer product, so we wanted to test the results with possible consumers.
The feedback we got was not surprising. Of the 20 people we asked, 17 chose CartoonGAN as the method that produces the better images, and we agree with them. Although CartoonGAN's outputs came out better, the resulting images from both methods share the same attributes: both algorithms lose detail and blend areas belonging to the same objects together to create a cartoon style. Where CartoonGAN shines is that when it blends areas, it blends areas that belong to the same object, so a window does not get blended into a wall. This is where AnimeGANv2 fell short and where we will need to do some work in the future. Neither algorithm handled human subjects in pictures correctly: everything got blended together, removing features such as eyes and mouths. Since we only want to focus on backgrounds for now, we did not give this much thought, but it is worth noting for future work.
Discussion
We found GitHub repositories containing the AnimeGANv2 and CartoonGAN models. Although the code existed, we encountered many technical difficulties and had to do a lot of debugging to get the models working; none of them were as simple as just pulling the repository and running the model. We faced issues with incorrect dependencies, architecture code that was written incorrectly, and general errors that we had to read and understand the code to fix.
When we finally fixed the code and started testing with our datasets, we realized just how computationally expensive GANs on images are. We were severely limited in how much we could train our models. As mentioned before, we tried smaller models to get decent output without running our machines for literal days, but we realized that training a good style transfer model like the pretrained ones we found would require very large datasets, and that our machines, and even Google Colab, would take over 400 hours to finish. This is why we trained models with small datasets and used the pretrained ones to output the best possible images, testing the upper ends of both algorithms. For future reference, we know that if we want to work with image processing we need to find a service that removes the computational handicap we were working under, so we can train our models on the data we want rather than on what our computers can handle.
Having said that, we still wanted to compare the two algorithms we used. We compared them through the pretrained models because those show what the algorithms are capable of. We were surprised that CartoonGAN seemed like the better performer: AnimeGAN was built as an improvement on CartoonGAN, but our results suggest otherwise. This might hint at a discrepancy between what makes a good cartoon or anime and what results we want from these models. Art is very subjective, so what is aesthetically pleasing to the eye might not be the "best" image. Some differences we noticed were that in CartoonGAN the boundaries between objects were more defined, the blending of features was smoother, and the overall style was closer to what we desired. AnimeGAN was good too, but it smudged many more object boundaries, which is not ideal for cartoons.
There is still plenty we can do with this project if we decide to come back to it in the future. First, we need to find or create better datasets. We also need to change how we do our training to improve our results, mainly which platform or machine we train on. We would still go the route of using GANs for style transfer because we have seen what good results look like and we like them. Once we better understand how to create an effective model, we can move on to our other plans. One thing we wanted to do but ran out of time for was training models for more styles and testing them on both CartoonGAN and AnimeGANv2; by more styles we mean popular shows like Gravity Falls, Teen Titans, and The Boondocks, and even Pixar and Disney movies, to see how versatile our algorithm is and how much tweaking each new style takes. The next big step after completing the background aspect of the project would be to create style transfer algorithms for people, that is, turning a picture of a person into a specific style. We could also use Google Forms to survey more people on their preferred images. There is still a lot of work that can be done on this project, and if we have time in the future we will definitely come back to it, because we all find this application of machine learning techniques to be very interesting.
Conclusion
This project helped us all familiarize ourselves with, and get first-hand experience in, machine learning techniques applied to image processing. We were all completely new to this at the beginning of the semester, but with the help of the course and each other we found and created datasets, trained models, and produced output images that resemble what we set out to do in our problem definition. The resulting images were not as good as those from existing models, but that was expected for our first attempt at something like this. We also became aware of important considerations when training models, like the quality and size of your datasets, the importance of batch size and learning rate, and how much computational power matters: if your machine or platform is not powerful enough, it becomes almost impossible to train your models with optimal parameters. We are proud of our work so far but realize there is still a lot more we can do to learn and improve.
References
[1] Li, X., Liu, S., Kautz, J. and Yang, M., 2018. Learning Linear Transformations For Fast Arbitrary Style Transfer. [online] Arxiv.org. Available at: https://arxiv.org/pdf/1808.04537.pdf [Accessed 1 October 2020].
[2] Y. Chen, Y. Lai and Y. Liu, "CartoonGAN: Generative Adversarial Networks for Photo Cartoonization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 9465-9474, doi: 10.1109/CVPR.2018.00986. Available at: https://openaccess.thecvf.com/content_cvpr_2018/papers/Chen_CartoonGAN_Generative_Adversarial_CVPR_2018_paper.pdf [Accessed 1 October 2020].
[3] YouTube. 2016. What Goes Into The Background Art?. [online] Available at: https://www.youtube.com/watch?v=8_HWs0ro3PY [Accessed 1 October 2020].
[4] Nakamine, K., 2020. A Pilgrimage To The Real Life Locations Of Your Name. [online] Tofugu. Available at: https://www.tofugu.com/japan/your-name-locations/ [Accessed 1 October 2020].
[5] TensorFlow: https://www.tensorflow.org/tutorials/generative/style_transfer
[6] J. Chen, G. Liu, and X. Chen, "AnimeGAN: A Novel Lightweight GAN for Photo Animation." [online] Available at: https://file.groupme.com/v1/61386024/files/7b4ec2fd-593e-4f2f-b219-201029c85a68 [Accessed 7 December 2020].
[7] AnimeGANv2 GitHub: https://github.com/TachibanaYoshino/AnimeGANv2
[8] CartoonGAN GitHub: https://github.com/mnicnc404/CartoonGan-tensorflow
[9] AllStyles Dataset: https://github.com/TachibanaYoshino/AnimeGAN/releases/tag/dataset-1
[10] Our Google Drive: https://drive.google.com/drive/folders/1d5R816jvcUjPJkMEu4G48oxCjJ2ATg_a?usp=sharing