
In today's rapidly evolving technological landscape, the role of artists is being fundamentally challenged by advancements in AI. With just a simple text prompt, AI can generate stunningly realistic images in seconds, including depictions of things that don’t even exist in reality.
And this is just the beginning. Over the last few weeks I have read a lot about image AI: how you can animate faces with LivePortrait, and how you can scale image creation in the cloud with the power of serverless GPUs.
Imagine what it will be like in five years. In this newsletter, I want to tell you a bit more about image diffusion. I'm not claiming to be an expert, but I've been experimenting with it over the past 4-6 weeks, some weeks more frequently than others.
Image diffusion in the current landscape
The funny thing is, a month ago I tried to connect my wireless printer to my desktop. I had to install software. It was quite a hassle. But thinking that AI will soon have a solution for that too makes life, functionality, and technology much more accessible for everyone.
Let’s start a bit more in depth. What is image diffusion?
It employs algorithms that mimic the physical process of diffusion, where particles spread from areas of high concentration to low concentration. It relies heavily on neural networks, with many ways neurons can connect. The simplest neural network involves fully connected layers, where every neuron connects to every neuron in the next layer.
However, this structure isn’t efficient for images due to the sheer number of pixels involved. For instance, a 100x100 pixel image would necessitate 100 million connections, rendering fully connected layers impractical for image processing.
Instead, convolutional layers are employed, where each output pixel is determined by a small grid of surrounding input pixels using a kernel. This significantly reduces the number of parameters, making the process more manageable.
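To make that concrete, here is a small PyTorch sketch (assuming a 100x100 grayscale image) comparing the parameter count of a fully connected layer with that of a tiny convolutional layer:

```python
import torch.nn as nn

# A 100x100 grayscale image flattened into a vector of 10,000 values.
fully_connected = nn.Linear(100 * 100, 100 * 100)  # every pixel connects to every pixel
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # each output pixel sees a 3x3 neighbourhood

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fully_connected))  # 100,010,000 parameters (weights + biases)
print(count(conv))             # 10 parameters (3x3 kernel + bias)
```

The convolution gets away with a handful of weights because the same small kernel is reused across the whole image.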
What are layers?
I don’t want to get too technical. It took me some time to understand it. And it is still very complex to comprehend exactly how it is constructed.
Convolutional layers are pivotal in computer vision, a field that focuses on recognizing and classifying objects within images. This area advances through different stages, from basic image classification to complex tasks like instance segmentation, where every pixel is classified and multiple instances of the same object are identified.
Imagine you have this image of a room with a plant, a chair, and some decor items. Convolutional layers in computer vision help the computer recognize and classify the objects in the image, such as the plant, the chair, and the decor items.
At the simplest level, the computer might just recognize the overall image as a room. But with more advanced techniques, it can identify and classify every object in the image. For example, it can distinguish between the plant, the chair, the cushion, and the decorative basket.
A major advancement in this technology came from the field of biomedical imaging, specifically through a neural network called U-Net. This network was originally designed to analyze images of cells but has proven very effective for other types of images too.
The U-Net works by shrinking the image to get a broad overview and then expanding it back to its original size while carefully identifying and segmenting each part. For instance, if U-Net were applied to this image, it would help the computer accurately segment and identify the plant, the chair, and the decor items, even if there were only a few examples to learn from.
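To give a rough feel for that shape, here is a heavily simplified U-Net-style sketch in PyTorch. It only illustrates the shrink-then-expand idea with a single skip connection, not the original biomedical architecture:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Conv2d(3, 16, 3, padding=1)   # features at full resolution
        self.down = nn.MaxPool2d(2)                    # shrink: broad overview
        self.bottleneck = nn.Conv2d(16, 16, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2)          # expand back to original size
        self.decode = nn.Conv2d(32, 3, 3, padding=1)   # 32 = upsampled + skip features

    def forward(self, x):
        skip = torch.relu(self.encode(x))
        x = torch.relu(self.bottleneck(self.down(skip)))
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)                # skip connection keeps fine detail
        return self.decode(x)

out = TinyUNet()(torch.randn(1, 3, 64, 64))            # output keeps the input resolution
```

The concatenation is the important part: the fine detail captured before downsampling is reattached, so the expanded output can be segmented pixel by pixel.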
This technology makes it easier for computers to understand and work with images, which can be incredibly useful for non-technical marketers looking to leverage visual content in their strategies.
This is where diffusion comes back in: in image diffusion models, a U-Net-style network is trained to predict the noise in an image. By training the network to identify and subtract noise incrementally, clear and high-quality images can be generated from noisy inputs.
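As a rough illustration of that idea (a toy sketch, not a faithful implementation of any real diffusion sampler; `model` stands in for the trained noise-prediction network):

```python
import torch

def generate(model, steps=50, shape=(1, 3, 64, 64)):
    """Toy denoising loop: model(x, t) is assumed to predict the noise present in x at step t."""
    x = torch.randn(shape)                  # start from pure noise
    for t in reversed(range(steps)):
        predicted_noise = model(x, t)       # the network estimates what the noise looks like
        x = x - predicted_noise / steps     # remove a small portion of it each step
    return x                                # after enough steps, a clean image remains
```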
Now I want to talk about ComfyUI, ControlNet and Runpod.
ComfyUI
This UI lets you design and execute advanced Stable Diffusion pipelines using a graph/nodes/flowchart-based interface. It can be useful for various products through its core concepts:
- Text 2 Image / Image 2 Image Tasks
- Inpainting Tasks
- Outpainting Tasks
- ControlNet Tasks
- Background Removal
- Style Transfer
- Image Upscaling
It’s quite easy to install: it takes 10-15 minutes from your terminal. The installation instructions are on GitHub.
ControlNet
This is the most important feature of Stable Diffusion and ComfyUI: it helps you convert sketches, poses, depth maps, line art, and so on into images. By using a canny, openpose, depth, lineart, or MLSD ControlNet together with Stable Diffusion models, you can turn these sketches and poses into realistic pictures.
The ControlNet process works like this: you start with an input image, and depending on the ControlNet you use, certain details (such as edges or a pose) are extracted from it; the model then generates a completely new image that follows those details and your prompt.
For example, given an input image, we extract the edges from the character and then generate a new image from those edges and a prompt.
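Outside of ComfyUI, the same canny-edge idea can be sketched with Hugging Face's diffusers library. This is only a minimal illustration; the input file name and the prompt are placeholders:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Step 1: extract edges from the input image with the Canny detector.
image = np.array(Image.open("character.png").convert("RGB"))   # placeholder input
edges = cv2.Canny(image, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))        # pipeline expects 3 channels

# Step 2: generate a new image that follows those edges and a text prompt.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
result = pipe("a realistic photo of the character", image=edges).images[0]
result.save("output.png")
```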
In ComfyUI, I work with nodes and groups to design and structure my workflows effectively. Nodes are the building blocks of the pipeline: each one represents a specific step, such as loading a model checkpoint, encoding a text prompt, sampling, or decoding the result back into an image. When I create a workflow, I wire these nodes together to determine how the image gets generated.
Groups in ComfyUI are essential for organizing these nodes. They act as containers where I can group related nodes together, making it easier to manage complex workflows. For instance, I use groups to keep the prompt nodes and the sampler settings together, ensuring everything stays organized and intuitive during the design process. Together, nodes and groups allow me to maintain a clear and structured approach to building workflows.
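Under the hood, such a workflow can also be exported as JSON and queued through ComfyUI's local API. Here is a minimal sketch of what that looks like; the checkpoint filename, prompt text, and parameter values are placeholders, so treat it as an illustration of the node/graph idea rather than a ready-made workflow:

```python
import json
import urllib.request

# Minimal workflow in ComfyUI's API format: each node has a class_type and inputs,
# and an input can reference another node's output as ["node_id", output_index].
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_v1-5.safetensors"}},        # placeholder checkpoint
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a plant next to a couch", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                     "latent_image": ["4", 0], "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode", "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage", "inputs": {"images": ["6", 0], "filename_prefix": "newsletter"}},
}

# Queue the workflow on a locally running ComfyUI instance (default port 8188).
req = urllib.request.Request("http://127.0.0.1:8188/prompt",
                             data=json.dumps({"prompt": workflow}).encode())
urllib.request.urlopen(req)
```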
At the end it looks something like this:
If I were setting up a plant near the couch, I'd consider a few factors to ensure it fits well in the space. First, I'd choose a plant that complements the aesthetics of the room and the couch itself. This might involve considering the color and texture of the plant's leaves or flowers to harmonize with the couch fabric and overall decor.
Next, I'd think about practical considerations like the size of the plant. It should be proportionate to the couch and not overwhelm the seating area. A medium-sized plant in a decorative pot could add a touch of greenery without cluttering the space.
Placement is key too. I'd position the plant strategically, perhaps in a corner near the couch or on a side table next to it. This ensures it enhances the ambiance without obstructing movement or conversation around the couch.
Now think about the image below. This is not something I made myself; it's an example from Hugging Face. Imagine you own a shoe brand and have different shoes, photographed as plain images on a white background:
As the example above shows, it's now possible to create these kinds of lifestyle images with AI.
Background Removal
Rembg is also ideal for removing image backgrounds. However, it is important to consider the shadows. I have used this for my products as well. It enhances the image, not so much in terms of the photo's realism, but in the overall atmospheric impression. Also, use good-quality source images, because the result otherwise gets less sharp.
Check their GitHub page for more information.
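For reference, here is a minimal Rembg sketch in Python (the file names are placeholders):

```python
from rembg import remove
from PIL import Image

product = Image.open("shoe.jpg")          # use a sharp, good-quality source image
cutout = remove(product)                  # returns the product on a transparent background
cutout.save("shoe_no_background.png")     # PNG keeps the transparency
```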
Why Runpod?
Runpod offers pre-built template pods for image generation tools like ComfyUI, with GPU access at much cheaper rates and servers with faster processing and more RAM.
- AWS charges us $0.54/hr for a 16GB GPU with 16GB of RAM.
- Runpod charges us $0.36/hr for a 20GB GPU (a slightly older generation) with 40-50GB of RAM and more cores.
This comparison makes Runpod both easier and cheaper to use for quickly building a solution on top of ComfyUI and using a GPU for my applications.
The basic purpose of Runpod is to host trained models and provide a platform to run inference on them. Once the solution is developed and dockerized for deployment in the cloud, Runpod Serverless lets us deploy our pods serverlessly, helping us scale our application from 0 to hundreds of users within a short span.
Or as they put it: "Develop, train, and scale AI models. All in the cloud with the power of serverless GPUs."
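To give an idea of what that looks like in practice, a Runpod serverless worker is basically a small handler function registered through their Python SDK. A minimal sketch, with the actual generation step left as a placeholder:

```python
import runpod

def handler(job):
    """Runs once per request; job["input"] carries whatever the client sends along."""
    prompt = job["input"].get("prompt", "a plant next to a couch")
    # ... run the dockerized ComfyUI / diffusion pipeline here and collect the result ...
    return {"status": "done", "prompt": prompt}

# Runpod spins workers up and down based on traffic; this call just registers the handler.
runpod.serverless.start({"handler": handler})
```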
It can deploy LLMs and image generation models, and it also provides templates for deploying development environments, be it PyTorch, ComfyUI, Automatic1111, Fooocus, or other well-known tools, so that server creation and startup time stays very low.