The Guide to Fine-tuning Stable Diffusion with Your Own Images

This article was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

Have you ever wished you were able to try out a new hairstyle before finally committing to it? How about fulfilling your childhood dream of being a superhero? Maybe having your own digital Funko Pop to use as your profile picture? All of these are possible with DreamBooth, a new tool developed by researchers at Google that takes recent progress in text-conditional image synthesis to the next level.

In our previous post, we discussed text-to-image generation models and the massive impact that models like DALL·E and Stable Diffusion are having throughout the Machine Learning community.

Now, in this blog post, we will guide you through implementing DreamBooth so that you can generate images like the ones you see below. To do so, we’ll implant ourselves into a pre-trained Stable Diffusion model’s vocabulary. Be warned, generating images of yourself (or your friends) is highly addictive. Don’t say we didn’t warn you!

Also, if you know part of our team, you may recognize some faces in the following images.

DreamBooth motivation

Feel free to skip this section if you’re not particularly interested in the theory behind the approach and prefer to dive straight into the implementation.

The first step towards creating images of ourselves using DreamBooth is to teach the model how we look. To do so, we’ll follow a special procedure to implant ourselves into the output space of an already trained image synthesis model.

You may be wondering why we need to follow such a special procedure. After all, these new generation image synthesis models have unprecedented expressive power. Can’t we just feed the model an extremely detailed description of the person and be done with it? The short answer is no. It’s still very hard for these models to reconstruct the key visual features that characterize a specific person. Instead, the model must learn what we look like down to the last detail so that it can later reproduce us in the most fictional scenarios.

To achieve this, we’ll fine-tune this model with a set of images, binding them to a unique identifier that references us.

But wait a minute… How many of these images will we need? Deep Learning models usually require large amounts of data to produce meaningful results (even more so these large image synthesis models). Does this mean that we need thousands of pictures of ourselves for the model to reproduce us faithfully?

Fortunately, the answer is no. The technique we’re about to show you achieves results like you have seen above with no more than a dozen images of your face. Still, these images must exhibit some variation in terms of different perspectives of your face (e.g., front, profile, angles in between), facial expressions (e.g., neutral, smiling, frowning), and backgrounds. Here are examples from the three victims we chose for this blog post: Fernando, Giuls, and Luna (from left to right).

Once you’ve collected these images, the next step is to label them with a text prompt. Following the instructions in DreamBooth’s paper, we’ll use the prompt A [token name] [class noun] where [token name] is an identifier that will reference us, and [class noun] is an already existing class in the model’s vocabulary which describes us at a high level. For instance, for Fernando Bernuy (co-writer and one of the victims of our experiment), a possible prompt would be A fbernuy man. Other examples of class nouns include woman, child, teenager, dog, or sunglasses. Yes, this approach works with animals and other objects too!

The motivation behind linking our unique identifier with a class noun during training is to leverage the model’s strong visual prior of the subject’s class. In other words, it will be much easier for the model to learn what we look like if we tell it that we are a person and not a refrigerator. The authors of DreamBooth found that including a relevant class noun in the training prompts increased training speed and the visual fidelity of the subject’s reproduced features.

However, there are still two issues we must address before we can fine-tune the model:

The first one is overfitting: these extremely large generative models will inevitably overfit such a small set of images, no matter how varied it may be. This means that the model will learn to reproduce the subject with high fidelity, but mostly in the poses and contexts present in the training images.


Prior-preservation loss acts as a regularizer that alleviates overfitting, allowing pose variability and appearance diversity in a given context. Image and caption from DreamBooth’s paper.

The second is language drift: since the training prompts contain an existing class noun, the model forgets how to generate different instances of the class in question. Instead, when prompted for a [class noun], the model returns images resembling the subject on which it was fine-tuned. Essentially, it replaces the visual prior it had for the class with the specific subject that we introduced into its output space. And although Fernando is a handsome man, not all men look like him!


Language drift. Without prior-preservation loss, the fine-tuned model cannot generate dogs other than the fine-tuned one. Image taken from DreamBooth’s paper.

To solve both issues, the authors of DreamBooth propose a class-specific prior-preservation loss. Simply put, the idea is to supervise the fine-tuning process with the model’s own generated samples of the class noun. In practice, this means having the model fit our images and images sampled from the visual prior of the non-fine-tuned class simultaneously. These prior-preserving images are sampled and labeled using the [class noun] prompt. This helps the model remember what a generic member of the subject’s class looks like. The authors recommend sampling 200×N [class noun] images, where N is the number of images of the subject.
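To make the 200×N recommendation concrete, here is a rough sketch of how these prior-preservation images could be sampled with the diffusers library. The checkpoint id, prompt, and output folder are illustrative assumptions, and the training notebook used later in this post handles this step automatically.

import torch
from pathlib import Path
from diffusers import StableDiffusionPipeline

instance_images = 12                      # N: photos of the subject
num_class_images = 200 * instance_images  # authors' recommendation: 200 x N
class_prompt = "a photo of a man"         # the [class noun] prompt

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

Path("class_images").mkdir(exist_ok=True)
for i in range(num_class_images):
    # Sample one prior-preservation image per iteration and save it to disk.
    image = pipe(class_prompt, num_inference_steps=30).images[0]
    image.save(f"class_images/{i:04d}.png")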


Training approach. The subject’s images are fitted alongside images from the subject’s class, which are first generated using the same Stable Diffusion model. The super resolution component of the model (which upsamples the output images from 64 x 64 up to 1024 x 1024) is also fine-tuned, using the subject’s images exclusively. Image taken from DreamBooth’s paper.

Now that we’ve covered all the relevant pieces of the theory, all that’s left is to fine-tune the image synthesis model. Let’s do it!

Fine-tuning stable diffusion with your photos

Three important elements are needed before fine-tuning our model: hardware, photos, and the pre-trained stable diffusion model.

The original implementation requires a large amount of GPU resources to train, making it difficult for common Machine Learning practitioners to reproduce. However, a community on Discord has developed an unofficial implementation that requires fewer computing resources. If you happen to have access to a machine with a GPU with at least 16 GB of VRAM, you can easily train your model by following Hugging Face’s DreamBooth training example instructions. If you don’t, we’ve got you covered! In this post, we’ll show you how to train and run inference on a free-tier Google Colab. Yes, you’ve read that right, a free-tier Google Colab!

Note that the notebook may be outdated due to the rapid advancement of the libraries it uses, but it has been tested and confirmed to still be functional as of January 2023.

The second element is the subject’s photos. In this tutorial, we’re going to use pictures of members of the TryoGang and one of our pets. In any case, there are some rules we need to follow to get the best possible results.

As mentioned in the motivation section, Stable Diffusion tends to overfit the training images. To prevent this, make sure that the training subset contains the subject in different poses and locations. Even though the original paper recommends using 4 to 6 images, the community on Discord has found that using 10 to 12 images leads to better results. As a rule of thumb, we’ll use 2 images that include the torso and 10 of the face, with different backgrounds, styles, expressions, looking and not looking at the camera, etc.

If you’re looking at the camera and smiling in every photo, don’t expect the model to generate you looking sideways or with a neutral face, so avoid using selfies only!

In addition, make sure to crop the training images to a square ratio since Stable Diffusion scales them down to 64 x 64 to use them for training.
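As a small convenience, here is a rough Pillow sketch (not from the original post) for center-cropping photos to a square; the 512 x 512 target size and the folder names are assumptions, and the notebook may handle resizing for you.

from pathlib import Path
from PIL import Image

Path("training_photos").mkdir(exist_ok=True)
for path in Path("raw_photos").glob("*.jpg"):
    im = Image.open(path)
    side = min(im.size)                      # side length of the largest centered square
    left = (im.width - side) // 2
    top = (im.height - side) // 2
    im = im.crop((left, top, left + side, top + side)).resize((512, 512))
    im.save(Path("training_photos") / path.name)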

And last but not least, we’ll need the pre-trained Stable Diffusion model’s weights. These can be downloaded from Hugging Face, for which we’ll need to create an account, read the model card and accept the terms and conditions. Don’t download the model manually because the training script will do it automatically.

Now that we’ve got everything set up, let’s fine-tune the model!

Training

We will use this implementation that includes a notebook ready to use in Google Colab. You can open the notebook by clicking on this link.

Before running it, let’s modify it for our use case (we’ll use Fernando as the subject to illustrate the instructions). We need to define four parameters for the training process:

  1. TOKEN NAME: corresponds to the unique identifier which will reference the subject we want to add. This name should be unique, so we don’t have to compete with an existing representation. Here we can use a simple first initial + last name token name, such as fbernuy.
  2. CLASS NAME: This is the class name we introduced in the motivation section. The original DreamBooth paper recommends using generic classes such as man, woman, or child (if the subject is a person) or cat or dog (if the subject is a pet). However, the Discord community implementing the approach on Stable Diffusion has found that using celebrities who are similar to the subject produces better results. In our case, we used George Clooney when the subject is a man and Jennifer Aniston when it’s a woman. We still used the “cat” class for Luna, as we couldn’t think of a suitable famous cat other than Garfield.
  3. NUMBER OF REGULARIZATION IMAGES: As mentioned in the motivation section, we need the class-specific prior-preservation loss to prevent overfitting and language drift issues. We followed the original authors’ recommendation of using 200 images per training image. Remember that using more regularization images may lead to better results.
  4. TRAINING ITERATIONS: This parameter defines the number of iterations the model will run during the fine-tuning process. If this number is too low, the model will underfit the subject’s images and won’t be able to reproduce it accurately during inference. If it’s too high, the model will overfit instead, making it unable to reproduce the subject with expressions, poses, or contexts outside of those in the training subset. A rule of thumb that has shown good results in our experiments is to use between 100 and 200 iterations per training image. Since we have 12 images of Fernando, let’s use 2400 iterations.

Now let’s modify the notebook with these parameters as follows:

  • Settings and run: we’ll modify the CLASS_NAME to georgeclooney. Also, we’ll replace the default sks token name with fbernuy in the INSTANCE_DIR and OUTPUT_DIR. This will make it easier to identify the directory in which the model and the data will be saved.
  • Start Training:

# replace the instance_prompt parameter with our token name:
--instance_prompt="photo of fbernuy george clooney"
# check that the class_prompt is set as:
--class_prompt="photo of {CLASS_NAME}"
# set:
--num_class_images=200
--max_train_steps=2400
--gradient_accumulation_steps=2
--learning_rate=1e-6

Now we are ready to run the notebook and fine-tune our model. The first few cells will install the required dependencies. After this, we’ll be prompted to log in to HuggingFace using our access token.

Then, we’ll be asked to upload the subject’s photos. Here, we can use the Choose Files button and select the images from our computer, or upload them directly to the subject’s directory inside the data folder in the Colab instance. The next cell is where the magic happens: we finally get to fine-tune the model! The script will download the pretrained model’s weights, generate the regularization images, and then run the specified number of training iterations. The entire process should take about an hour and a half, so be patient. Remember to keep an eye on the notebook!

Once training is over, we’ll be prompted to convert the model to a ckpt file. This is highly recommended since it’s a requirement for an extremely useful web interface that we’ll introduce further down in this blog post. Once we’ve saved the ckpt file in the notebook instance, we’ll download it to our local machine or save it to our drive folder.

We can test our fine-tuned model by running the cells below the “Inference” section of the notebook. The first cell loads the model we just trained and creates a new Stable Diffusion pipeline from which to sample images. We can set a seed to control random effects in the second cell. And now, the moment you’ve been anticipating since you started reading this blog post: generating our custom images!

The cell titled “Run for generating images” controls the image-generation process. There are a total of 7 parameters that we can modify to customize our image (a minimal inference sketch using them follows the list):

  • prompt: the text prompt that will guide the image’s generation. Here’s where we should include the token name that references our subject.
  • negative_prompt: serves to specify what we don’t want to see in the image. For instance, if we want to generate an image with a cloudy sky, we enter clear sky as the negative prompt.
  • num_samples: the number of images the model will generate in a single batch.
  • guidance_scale: also known as CFG Scale, is a float that controls how much importance is given to the input text prompt. Lower values of this parameter will allow the model to take more artistic liberties when generating the images.
  • num_inference_steps: the number of denoising steps that the model will run. A higher number of steps will usually lead to more detailed images at the cost of an increased inference time. Be careful with this parameter, though, since too many steps may lead to visual artifacts in the images.
  • height: the height of the generated image in pixels.
  • width: the width of the generated image in pixels.
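Putting these parameters together, here is a minimal sketch of what the inference cell boils down to, assuming the fine-tuned weights were exported in diffusers format; the path, prompts, seed, and values below are only examples.

import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned model (hypothetical output directory from the training step).
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/OUTPUT_DIR", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(52362)  # fixed seed for reproducibility

images = pipe(
    prompt="a photo of fbernuy, epic haircut, hairstyling photography",
    negative_prompt="blurry, deformed",
    num_images_per_prompt=4,     # num_samples
    guidance_scale=7.5,          # CFG scale
    num_inference_steps=50,
    height=512,
    width=512,
    generator=generator,
).images

for i, image in enumerate(images):
    image.save(f"sample_{i}.png")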

There’s no magic formula to generate the perfect image, so you’ll probably have to play around with these parameters for a while before achieving the results you want. If you’re having trouble generating cool images, don’t get discouraged! Some of the most common issues have pretty straightforward solutions, according to Joe Penna (one of the managers at the Stable Diffusion Discord channel).

  • If they don’t look like the subject: Check to see if the prompt is right and if the images follow the tips we gave before. Try including the class name in the prompt and the token name (i.e., a photo of TOKEN_NAME georgeclooney). We may also need to train for more iterations.
  • If they look too much like the training images: we might have trained for too long, used too few images, or our images may be too similar. We can fix this by moving the token name towards the end of the prompt, for instance: an exquisite portrait photograph, 85mm medium format photo of TOKEN_NAME with a classic haircut.
  • If using a complex prompt doesn’t give us the desired results: we might have trained for too few iterations. We can try repeating the token name in the prompt, for instance: TOKEN_NAME in a portrait photograph, TOKEN_NAME in an 85mm medium format photo of TOKEN_NAME.

Although the notebook is extremely useful for training the model, it’s far from being the best platform to generate images. In the following section, we’ll introduce an incredibly powerful tool to enhance the image generation process further.

In practice: generating cool images

Creating great images requires both practice and patience. However, this process can be alleviated by using the right tools. The one we’re about to show you is truly mind-blowing; it’s so versatile that we can’t recommend it enough! It’s a WebUI that makes the entire process more interactive and fun.

To use it, we must run a web server and follow the Install instructions available for Linux, Windows, or Apple Silicon. Alternatively, we can run the server on another Colab using this link. Beware that time flies when generating images, and Colab’s free tier is limited!

Once installed, we’ll copy our model’s ckpt file in the web server folder, stable-diffusion-webui/models/Stable-diffusion, and then run the web server script (webui.sh or webui.bat). This gives us the UI’s address and port so we can open it using our favorite browser.


WebUI tool for Stable Diffusion, from AUTOMATIC1111

The UI has many different features. We highly recommend exploring the project’s wiki. The development of Stable Diffusion and this UI are moving fast, so be aware that this may change!

The first thing we need to do is to select our fine-tuned Stable Diffusion model. At the top of the WebUI page, we’ll find a drop-down menu with all the available ckpt files. If you don’t see yours in the list, verify that you copied the ckpt file to the correct directory.

For this tutorial, we’ll focus on explaining the UI’s main three functionalities: text to image, image to image, and inpainting.

Text to Image (txt2img)

Text to image is the most straightforward way to use our model: write a prompt, set some parameters, and voilà! The model generates an image that matches the prompt according to the chosen parameters.

This might sound easy at first glance. However, we might need to try several parameter combinations before hitting the spot. Based on our experience, these are the steps we recommend following to generate the coolest images:

  • Pick a style from lexica.art and add your subject to its prompt. For instance, let’s see what Fernando would look like with a new haircut: fbernuy. epic haircut. hairstyling photography.
  • Use a random seed until you get something similar to what you have in mind. It might not look exactly like the subject, but we can fix that later.
  • Copy the seed from the image description and use it to generate the same image with different parameters. The best way to do this is to use the X/Y plot script: select a list of steps (10, 15, 20, 30) and a list of CFG Scales (2.0, 2.5, 3.0, 3.5, 4.0). The tool will plot a matrix with one image for each input step and scale combination. We can also use other parameters as the X and Y variables.
  • Then, pick the one you like the most, copy its corresponding parameter values, and remove the script to generate the selected image alone. If you don’t like any of the images, try with different parameters, a different seed, or a different prompt!


Selected random image


Parameters exploration


Final result

Image to Image (img2img)

The second alternative is to generate a new image based on an existing image and a prompt. The model will modify the entire image, so we can apply new styles or make a small retouch.

Let’s start with a txt2img prompt: very very intricate photorealistic photo of a fbernuy funko pop, detailed studio lighting, award - winning crisp details. Following the strategy explained above, we use txt2img and generate an undoubtedly cool-looking Funko Pop. However, we’d like to improve the beard so that it’s closer to our subject, and lighten the nose color.

To do this, we’ll click on the Send to img2img button and manually draw the beard style and nose we want using the MS Paint-like tool of the WebUI (center). We can reduce the denoising strength parameter to have a result as similar as possible to the original and experiment with the rest of the usual parameters until we get the result we are looking for (right).

txt2img generated image (left); simple image modifications (center); img2img result (right).

Following the same img2img strategy, we slightly improved Luna’s fur colors in this epic picture and added some smile lines to the anime version of Giuls.


txt2img generated images


img2img improved image

Inpainting

The third alternative allows us to specify a region in the image for our model to fill, maintaining the rest of the image intact (unlike the img2img method, which modifies the entire input image). This can be useful for swapping a face in an existing photo (if the subject is a person) or generating an image of the subject in a different scenario or lighting condition while preserving the background and context. Keep in mind that using this method is a bit more challenging because there are more parameters to explore.

For example, let’s generate an image of Fernando as Ironman. Since the armor has a lot of important details, we’ll use an original image from the movie poster as the source and swap Ironman’s face using the Inpainting tool.

The first thing we’ll do is select the Inpainting tool inside the img2img tab. After uploading our reference image, we’ll select the area around the head with the brush tool and input a photo of fbernuy as the prompt since we don’t want the model to fill this region with anything else but Fernando’s face.

Before generating the image, let’s take a look at the most relevant parameters added in inpaint.

  • Masked content: defines what to fill the masked region with. We can select original (the default) if the original content is similar to what we want to achieve, experiment with fill to help us keep the surrounding information, or latent noise to use noise. Regardless of the option we pick, random noise will be added based on the Denoising strength parameter.
  • Denoising strength: defines the standard deviation of the random noise added to the masked region. The higher this parameter, the lower the similarity with the content in the unmasked portion of the image.
  • Inpaint at full resolution: inpainting resizes the whole image to the specified target resolution by default. With this parameter enabled, only the masked region is resized, and the result is pasted back into the original picture. This helps get better results for small masks as the inpainted region is rendered at a much larger resolution.

For this example, we’ll use original masked content (since the masked region is already a face) with a denoising strength of 0.50 and enable inpainting at full resolution. Then, we’ll set the seed to -1 (random) and repeat the process from before: patiently generate images until we get one similar to what we want. Finally, we’ll fix the seed and use the X/Y plot script to explore different Sampling Steps and CFG Scale combinations.

Original image; intermediate inpaint results.

Pretty awesome, right? At this point, we’ve generated a great image that kept all the details of the original picture but with Fernando’s face instead of Robert Downey Jr.’s. Still, there’s one small detail we want to fix in the beard.

The best way to fix this is by using inpainting again, but using the already inpainted image instead of the original (didn’t see that one coming, did you?). This way, we can instruct the model to modify the region around the beard exclusively and input a more specific prompt, such as a photo of fbernuy with a beard.


Final inpaint result with beard details

We have shown you how to create cool images of yourself, your friends, your pets, or any particular item you want, starting from just an idea, a sketch, or an existing image!

Now you are ready to generate cool images on your own! Here are some images we generated from our subjects that can be useful for you to get some inspiration. Have fun!

  • Giuls in Game of Thrones
  • Luna with a birthday hat
  • Fernando, oil canvas
  • Fernando’s business portrait
  • Luna with sunglasses
  • Luna with pearl earrings

Final thoughts

Stable Diffusion signified one of the biggest leaps toward democratizing large image synthesis models. Techniques such as DreamBooth (and their community-driven implementations) allow us to reap the benefits of these models even further, with imagination being our only limit. We are extremely excited to know where this new democratic AI paradigm will lead us and the various ways in which the world will benefit from it.

Fernando Bernuy
Lead Machine Learning Engineer, Tryolabs

Guillermo Etchebarne
Lead Machine Learning Engineer, Tryolabs

Automatically Measuring Soccer Ball Possession with AI and Video Analytics

This article was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

The World Cup is just around the corner, and at Tryolabs everybody is excited to see their national team compete. As the teams prepare for the event, they rely more than ever on AI-assisted sports analytics to inspect their performance, based on both recordings of previous games and real-time information delivered to the coach during matches.

AI is being used to identify patterns that lead to success and to compute indicators that give coaches objective numbers over which to optimize to maximize their team’s chances of scoring. One such indicator is ball possession.

Leveraging Norfair, our open-source tracking library, we put our video analytics skills to the test and wrote some code to showcase how AI can automatically compute ball possession per team from a video of a match. Here is how it looks in a snippet of a match between Chelsea and Manchester City:

We also made it open source on GitHub. You can inspect every single part of the code, and use it as a base to build something else!

In this post, we are going to dig a bit deeper into what exactly ball possession is, and explain how this system was built from simpler components and how they work.

Defining ball possession

Ball possession is one of the most important statistics in a soccer game. According to a 2020 study conducted over 625 UEFA Champions League matches, teams with more ball possession won 49.2%, drew 22.0%, and lost 28.7% of the matches overall, exceeding the winning rates of their rivals. This effect was even greater when the gap of ball possession percentage between two teams in a match was higher.

There is definitely something in this number: the team with higher possession is more likely to be controlling the match.

Season    | Matches | Won         | Draw        | Lost
2014/2015 | 124     | 65 (52.4%)  | 28 (22.6%)  | 31 (25.0%)
2015/2016 | 117     | 53 (45.3%)  | 21 (17.9%)  | 43 (36.8%)
2016/2017 | 119     | 58 (48.7%)  | 32 (26.9%)  | 29 (24.4%)
2017/2018 | 120     | 59 (49.2%)  | 25 (20.8%)  | 36 (30.0%)
2018/2019 | 119     | 60 (50.4%)  | 26 (21.8%)  | 33 (27.7%)
Total     | 599     | 295 (49.2%) | 132 (22.0%) | 172 (28.7%)

We now know how important ball possession is to soccer analytics. But what exactly is it? How is it computed? There are actually two methods that yield different results.

Method 1: based on calculating passes

The first method is to consider ball possession in relation to the number of passes. The possession metric is the number of passes made by each team during the match, divided by the total number of passes in the game.

While straightforward to compute if counting passes manually (with a clicker), this method has a big drawback: it doesn’t account for the total time that players have control of the ball.
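As a toy illustration of method 1, the possession share is just each team’s fraction of the total pass count (the numbers below are made up):

def possession_by_passes(passes_a: int, passes_b: int) -> tuple[float, float]:
    # Each team's share of the total passes in the game.
    total = passes_a + passes_b
    return passes_a / total, passes_b / total

print(possession_by_passes(450, 300))  # (0.6, 0.4) -> 60% vs 40%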

Method 2: based on time

Another method that is widely used consists of controlling a clock manually for each team. A person has to be in charge of starting a timer when a team gets a hold of the ball and stopping it when they lose it. The timers also need to be paused when the game is stopped. While accurate, this approach needs the scorekeeper to be very focused for the duration of the game. It’s also susceptible to human errors. Forgetting to turn on or off the clock can mess up the metrics (and you don’t want that!).

Using AI to compute ball possession

Do we really need to burden someone with this task in this day and age?

It turns out that, with deep learning and computer vision techniques, it should be very feasible to automate this. For the remainder of this post, we will understand ball possession as the percentage of time that each team has the ball (method 2).

In a nutshell, we will build a clock powered by AI instead of a human.

Divide & conquer: the steps

Let’s start by breaking down the problem. We are going to need to:

  1. Get a few videos to test.
  2. Detect important objects.
    • Detect the players.
    • Detect the ball.
  3. Identify each player’s team.
  4. Determine which player has the ball at all times.

There are also some “nice to haves”:

  1. Drawing the trailing path of the ball during the match (for better visual inspection, remember we are doing sports analytics here!).
  2. Detecting and drawing passes completed by each team.

Now that we know what we need to do, let’s see how we can develop each step.

Step 1: Getting a few videos to test

There is no better way to test our algorithm than on live soccer footage. However, we can’t just use any video: we have to make sure that the footage does not switch between different cameras. Fortunately, this can easily be accomplished by trimming a live soccer match from YouTube.

We searched for a few matches with different teams to test our project in different situations. We are going to test our project in the following matches:

  1. Chelsea vs Manchester City.
  2. Real Madrid vs Barcelona.
  3. France vs Croatia — FIFA World Cup Final 2018.

Step 2: Detecting objects

What is a soccer match without players and a ball? Just a referee running from one side of the field to the other 🥁.

It’s crucial to know the location of the players and the ball before ever thinking about calculating the ball possession. To do this, we need two deep learning models: one for detecting the players and another for detecting the ball.

Detecting players

What would have been a challenging problem a decade ago is nothing but a few lines of code in 2022. There are multiple pre-trained models we can use to detect people. In this case, we will use YOLOv5 trained on the COCO dataset. This will get us bounding boxes around each player, along with confidence scores. Here are the results:

Detecting players using YOLOv5.
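For reference, here is a hedged sketch of this detection step using a COCO-pretrained YOLOv5 model loaded through torch.hub; the exact weights and settings used in our demo may differ, and the frame path is only an example.

import cv2
import torch

# Load a small COCO-pretrained YOLOv5 model from the official hub entry point.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

frame = cv2.imread("frame.jpg")          # BGR frame taken from the match video
results = model(frame[..., ::-1])        # YOLOv5 expects RGB input
detections = results.pandas().xyxy[0]    # columns: xmin, ymin, xmax, ymax, confidence, class, name
players = detections[detections["name"] == "person"]
print(players[["xmin", "ymin", "xmax", "ymax", "confidence"]])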

Detecting the ball

The COCO dataset has a specific label for sports balls, but the results we got with a model trained on it were not good enough for this type of live soccer footage, so we had to come up with a different approach.

In order to improve the results for our specific task, we fine-tuned the COCO-pretrained YOLOv5 detector on a dataset of balls containing only footage from live soccer matches.

Ball detection using YOLOv5 and a custom-built dataset.

The dataset was created with footage of soccer videos with a similar camera view, and the labeling was done with an open-source labeling tool named LabelImg. The fine-tuning process followed the instructions in the official YOLOv5 repository.

Please note that this is by no means a fully robust ball detection model, the development of which is outside the scope of this blog post. Our resulting model mostly works on the videos we used for these demos. If you want to run it on other videos, the model will need more finetuning over labeled data.

Putting things together

Great! We now have two models that we can use as if they were a single one by combining their detections. The result is a model well suited for soccer video analytics.

Detection of players and ball.

Step 3: Identifying player’s teams

How can we know which team has the ball if we don’t know the team of each player? We need a stage that takes the player detections as input and outputs those detections with their respective team classifications.

Here, there are several approaches we could follow, each one with its pros and cons. In this section, we will restrict ourselves to two simple ones and leave out more complex techniques such as clustering, siamese networks, or contrastive learning (we encourage you to try these, though!).

Approach 1: Neural network based on jersey

One approach is to train a neural network for image classification based on the team’s jersey. The dataset can be generated by running a video through a player detection model and saving the crops of the detections as a training dataset. The labeling can be easily done from a Jupyter Notebook with a tool like pigeonXT.

A neural network could be a good approach when there are complex scenarios to classify, for example, distinguishing between jerseys with similar colors or handling player occlusions. However, this advantage comes at a cost: this approach requires creating a dataset and training a model for every match you want to analyze. This can be daunting if you want to analyze ball possession across many different soccer matches.

Approach 2: Color filtering with HSV

As in most sports, different teams are expected to use easily distinguishable jerseys for a match, so what if we can leverage that information using some classical computer vision?

Let’s take an example of a match between Chelsea and Manchester City. Here we have four distinctive colors:

Classification         | Jersey Color
Chelsea player         | Blue
Manchester City player | Sky Blue
Chelsea Goalkeeper     | Green
Referee                | Black

Note that there is no color for the Manchester City Goalkeeper. This is because his jersey is black, and would therefore have the same color as the referee’s. We didn’t claim this approach could cover every single case 🙂

For each color, we created an HSV filter that will tell us how many pixels of that color the image has.

The reason we chose to filter with HSV values instead of RGB is that HSV filtering is more robust to lighting changes. By adjusting only the hue value, you are choosing what type of color you want to keep (blue, red, yellow, etc), independent of how dark or light the color is.

Before filtering the colors, the player image is cropped in order to keep just the jersey. The crop is a specific percentage of the image. This helps to reduce unnecessary information for the classification algorithm. For example, in this image taken from a bounding box, the cropped image removes most of the pixels of the unwanted player, but still keeps most of the information of the desired one.


Cropping to the player’s jersey.

For each color range, we created a filter that keeps the pixels in the color range and sets the rest to black.

The cropped image is then fed to the 4 color filters specified before. The output of the 4 filters is then passed through a median blur filter that will be in charge of removing unwanted noise.

For each output, we count the number of pixels that passed through the filter (i.e. the non-black ones). The filter with the highest count will give us the team that represents the player!

The following animations show the described process:


HSV Classifier classifying a Manchester City player.

HSV Classifier classifying a Chelsea player.

HSV Classifier classifying the Chelsea Goalkeeper.

If a team has more than one color, the sum of the non-black pixels of all the corresponding colors will be taken into account for the comparison.

For more details on how the classifier is implemented, please take a look at this file.
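If you don’t have the repository at hand, here is a rough sketch of the idea with OpenCV; the HSV ranges and crop fractions below are illustrative guesses, not the actual values used in the demo.

import cv2
import numpy as np

# Illustrative HSV ranges only (hue, saturation, value); the real values live in the repo.
HSV_RANGES = {
    "Chelsea": [((100, 100, 50), (120, 255, 255))],   # blue
    "Man City": [((85, 60, 80), (100, 255, 255))],    # sky blue
    "Chelsea GK": [((40, 60, 50), (80, 255, 255))],   # green
    "Referee": [((0, 0, 0), (180, 255, 40))],         # black (low value)
}

def classify_team(player_crop_bgr: np.ndarray) -> str:
    # Keep roughly the jersey region of the bounding box (upper-middle part).
    h, w, _ = player_crop_bgr.shape
    jersey = player_crop_bgr[int(0.15 * h):int(0.55 * h), int(0.2 * w):int(0.8 * w)]
    hsv = cv2.cvtColor(jersey, cv2.COLOR_BGR2HSV)

    counts = {}
    for team, ranges in HSV_RANGES.items():
        total = 0
        for lower, upper in ranges:                    # a team may have several colors
            mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
            mask = cv2.medianBlur(mask, 5)             # remove isolated noisy pixels
            total += cv2.countNonZero(mask)
        counts[team] = total
    # The filter that lets the most pixels through decides the team.
    return max(counts, key=counts.get)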

Improving classification with inertia

Our HSV classifier works well… most of the time. Occlusions and imprecisions of the bounding boxes sometimes make the predictions unstable. In order to stabilize them, we need tracking. By introducing Norfair, we can link players’ bounding boxes from one frame to the next, allowing us to look not just at the current team prediction but past ones for the same player.

Let’s see Norfair in action:

Tracking players in a soccer match using Norfair.

It’s common sense that a player shouldn’t be able to change teams during the video. We can use this fact to our advantage. The principle of inertia for classification states that a player’s classification shouldn’t be based only on the current frame, but on the history of the n previous classifications of the same tracked object.

For example, if the classifier has inertia equal to 10, the player’s classification on frame i will be decided by the mode of the classification results from frames i−10 to i. This ensures that a subtle change in the classification due to noise or occlusion will not necessarily change the player’s team.

Ideally, infinite inertia would be great. However, the tracker can occasionally mix up player IDs. In that case, if the inertia is too large, it can take too long for the classifier to start predicting the player’s correct team. An ideal inertia value is neither too big nor too small.

Here are examples comparing the HSV classifier with and without inertia:

HSV classifier without inertia.

HSV classifier with inertia 20 on a 25 FPS video.
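A minimal sketch of this inertia scheme, assuming the tracker provides a stable id per player; the raw per-frame prediction comes from the HSV classifier above.

from collections import Counter, defaultdict, deque

class InertiaClassifier:
    def __init__(self, inertia: int = 20):
        # Keep only the last `inertia` raw predictions per tracked player.
        self.history = defaultdict(lambda: deque(maxlen=inertia))

    def update(self, track_id: int, raw_prediction: str) -> str:
        self.history[track_id].append(raw_prediction)
        # The reported team is the mode of the recent predictions.
        return Counter(self.history[track_id]).most_common(1)[0][0]

clf = InertiaClassifier(inertia=20)
print(clf.update(7, "Chelsea"))  # a single noisy frame later on won't flip this label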

Step 4: Determining the player with the ball

This is the final piece to our computation of ball possession. We need to decide who has the ball at all times.

A simple method works by determining the distance from all the players to the ball. The closest player will be the one who has the ball. This is not infallible of course but mostly works fine.

For our demo, we defined the distance as the distance from the closest foot of the player to the center of the ball. For simplicity, we will consider the bottom left corner of the bounding box as the left foot of the player, and the bottom right corner of the bounding box as the right foot.

Distance from the right foot to the center of the ball. Distance from the left foot to the center of the ball.

Once we know who the closest player to the ball is, we have to define a threshold to know if the player is near enough to the ball to be considered in possession of it.

No player with the ball, the closest player is too far from the ball. A player has the ball, the closest player is near the ball.
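A compact sketch of this heuristic, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples and the proximity threshold is an arbitrary pixel value:

import math

def foot_distance(box, ball_center):
    # Feet are approximated by the bottom corners of the bounding box.
    xmin, ymin, xmax, ymax = box
    left_foot, right_foot = (xmin, ymax), (xmax, ymax)
    return min(math.dist(left_foot, ball_center), math.dist(right_foot, ball_center))

def player_with_ball(player_boxes, ball_center, threshold=45):
    # Returns the index of the player in possession, or None if nobody is close enough.
    if not player_boxes or ball_center is None:
        return None
    closest = min(range(len(player_boxes)),
                  key=lambda i: foot_distance(player_boxes[i], ball_center))
    return closest if foot_distance(player_boxes[closest], ball_center) < threshold else None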

From models and heuristics to possession

We now have everything that we needed in order to calculate the ball possession metrics. However, determining which team is in possession of the ball is not as simple as stating that it’s the team of the player that has the ball in each frame. As with our team classification process, the algorithm for ball possession should also have some inertia.

In this case, given a team in possession, inertia states for how many consecutive frames the players of the other team need to have the ball before we consider that possession has changed. It’s very important to count consecutive frames: if there is an interruption, the inertia restarts.

Without inertia, the possession will change unnecessarily in events such as a rebound, or when the ball passes near a player that didn’t have the ball.

No inertia: ball possession mistakenly changes during a rebound.

Inertia: ball possession doesn’t change during the rebound.
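A simplified sketch of a possession clock with inertia, where the challenging team needs `inertia` consecutive frames with the ball to take over; the bookkeeping in the actual demo may differ.

from collections import defaultdict

class PossessionClock:
    def __init__(self, fps: int = 25, inertia: int = 20):
        self.fps, self.inertia = fps, inertia
        self.current_team = None
        self.challenger_frames = 0
        self.frames = defaultdict(int)

    def update(self, team_with_ball):
        """Call once per frame with the team of the player holding the ball (or None)."""
        if team_with_ball is None:
            self.challenger_frames = 0          # interruption: inertia restarts
        elif self.current_team is None:
            self.current_team = team_with_ball  # first team to control the ball
        elif team_with_ball != self.current_team:
            self.challenger_frames += 1         # consecutive frames for the other team
            if self.challenger_frames >= self.inertia:
                self.current_team = team_with_ball
                self.challenger_frames = 0
        else:
            self.challenger_frames = 0
        if self.current_team is not None:
            self.frames[self.current_team] += 1

    def possession(self):
        total = sum(self.frames.values()) or 1
        return {team: count / total for team, count in self.frames.items()}

    def seconds(self):
        return {team: count / self.fps for team, count in self.frames.items()}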

Making our demo pretty

Thanks to the new release of Norfair, we can understand how the camera moves in a specific video. This information allowed us to draw the path of the ball in the exact location of the field even when the camera moved from one side of the field to the other.

We also developed an easily customizable scoreboard that can be used with any pair of teams to keep track of the metrics in a stylish manner.

Our final result with every step together and our custom design looks like this:

Final result with the Tryolabs AI possession scoreboard.

Freebie: detecting passes!

Our approach so far can also give us more useful information to analyze a soccer match.

We can define a passing event as when the ball changes from one player to another one from the same team. As we have seen before, we have everything in place for this. Let’s draw the passes and see how it looks!

AI automatically marks passes in a soccer match.
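A rough sketch of how pass events could be derived from the per-frame ball-holder information; this is a simplification of what the repository does.

def detect_passes(ball_holders):
    """ball_holders: chronological list with (player_id, team) per frame, or None."""
    passes = []
    last = None
    for holder in ball_holders:
        if holder is None:
            continue
        # A pass: the ball moves to a different player of the same team.
        if last is not None and holder[0] != last[0] and holder[1] == last[1]:
            passes.append((last[0], holder[0], holder[1]))  # (from, to, team)
        last = holder
    return passes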

Trying it on different matches

Here we have results over other video snippets:

Real Madrid vs Barcelona.

2018 FIFA World Cup Final — France vs Croatia.

Show me the code!

The entire repository is available on GitHub.

Apart from the computation of possession, we also include code for drawing the passes of each team with arrows.

Caveats

It’s important to keep in mind that this project is only a demo with the purpose of showing what can be done with AI and video analytics in a short amount of time. It does not attempt to be a robust system and cannot be used on real matches. Much work remains to be done for that!

Some weak points of the current system:

  1. The ball detection model that we used is not robust. This can be easily improved by fine-tuning an object detection model on a better dataset.
  2. The player detection model might not work very well when many players are near each other, especially when there is a significant occlusion.
  3. The team jersey detection method can be improved.
  4. Our system breaks if the camera vantage point changes. Most live TV footage usually takes close-ups of players with different cameras.
  5. Our system doesn’t detect special events in the match such as corners, free kicks, and injuries. During these events, ball possession will still be calculated, so there will be a discrepancy between the real ball possession and the one computed by our algorithm. In order to perfect our system, we would need to detect these events and stop the clock accordingly.

Conclusion

Video analytics is a lot of fun, but it’s no magic. With this blog post, we tried to shed some light on how some of these solutions are implemented behind the scenes and release the code for everyone to play with. We hope that you found it useful!

Again, by no means is this a perfect system. Professional development of these tools for sports analytics will likely require several high-precision cameras, larger custom datasets, and possibly recognition of the 3D positions of objects for improved accuracy.

When faced with the development of a large system, we might feel daunted. We must remember that every complex system starts from simple components.

Alan Descoins
CTO & Partner, Tryolabs

Diego Marvid
Machine Learning Engineer, Tryolabs

Monitoring Protection Gear in Hazardous Working Spaces Using DeepView ModelPack & VisionPack

This article was originally published at Au-Zone Technologies’ website. It is reprinted here with the permission of Au-Zone Technologies.

Working in a hazardous environment always requires protection to prevent injuries. In most fatal accidents, the workers are not wearing the right protection or are not using it properly. Due to the dynamic nature of some work, danger is always present, so it is necessary to monitor how well protected workers are during working hours. In this article we explain how DeepView ModelPack [1] and VisionPack [2] can be used to build a higher-end application capable of monitoring, and notifying in real time, how safe workers are when executing hazardous tasks. For this experiment, a hard hat dataset was used to train ModelPack using eIQ Portal 2.5. After 25 epochs, the fully integer-quantized model achieved over 60% mAP (mean average precision) and takes only 16 ms (60 fps) to run the entire pipeline (image loading, inference, decoding, and NMS (non-max suppression)) when integrated with VisionPack on the i.MX 8M Plus EVK or Au-Zone’s Maivin AI Vision Starter Kit.

Introduction

Despite advances in safety equipment, technology, and training for hazardous working environments, some industries like construction continue to face high rates of fatal and non-fatal injuries and accidents among their workers. According to [3], more than 950 people die every day and over 720,000 workers get hurt because of occupational accidents. In India, for instance, over 48,000 workers die annually because of occupational accidents and more than 37 million are out of work for at least 4 days due to injuries [4]. All these accidents have a big impact on the gross national product worldwide. Annually, more than 3,023,718 million USD is paid to workers as compensation for accidents. Beyond the economic impact, there are additional consequences that affect the physical and psychological well-being of millions of families every year.

The term “accident” was popularized in the 20th century, partly through media framing, to describe undesired events around the automobile industry. More recently, the concept has been refined to mean an unintended, normally unwanted event that was not directly caused by humans [5], implying that nobody should be blamed, although the event may have been caused by unrecognized or unaddressed risks (https://en.wikipedia.org/wiki/Accident). With the aim of addressing this issue, protection equipment has been developed to keep workers as safe as possible while performing dangerous tasks. Hard hats, reflective vests, specialized boots and gloves, and safety glasses are among the most commonly used items. Most industries require the right protective equipment to be worn, but they often lack the supervision needed to ensure compliance.

In this article we present a framework for using computer vision at the edge to monitor hazardous workplaces. Our framework runs in real time (30 fps to 60 fps) on an embedded target device (EVK, Maivin) without sacrificing accuracy. It is based on two ready-to-use components from Au-Zone Technologies: DeepView ModelPack and VisionPack.

We will describe all the steps needed to create a highly efficient pipeline that solves a real-world problem using edge computing and proprietary software. Our solution is a real-time monitoring system, aided by object detection and vision pipelines, capable of sending a notification every time a worker is not wearing the complete set of safety equipment required in the workplace. The following diagram shows the main building blocks used by our system.


Architecture diagram using the Maivin platform.

This application takes the outputs from two ModelPack instances running on the same target and combines them. One instance of ModelPack is trained on a pedestrian detection dataset and is used to detect people. The other ModelPack instance is trained on the hard hat dataset, and that model is executed only on frames in which a person has been detected. Finally, our routines compute some heuristics to infer whether each person is wearing the full protection equipment. With this in mind, we use the following visualization scheme (see the cover graphic):

  • Safe (green): The person is wearing the entire protection equipment
  • Warning (orange): The person is only wearing 1 piece of equipment (Vest or Hard Hat)
  • Danger (red): The person is not wearing any protection equipment

The rest of the paper is organized as follows. Section 2 (Related Work) reviews current research in object detection and how this problem has been approached in computer vision so far. Section 3 explains in more detail how every piece of software is connected, while Section 4 covers the transformations we applied to the dataset, the training configuration, quantization, results, and our main contributions. Section 5 presents conclusions, and finally, Section 6 covers future research and use cases that can be showcased when combining DeepView ModelPack with VisionPack.

Related Work

Object detection algorithms are involved in most real-world vision-based problems. People detection has been one of the hottest topics in the object detection field due to its high impact in industries such as surveillance, monitoring, recreation, robotics, and autonomous driving [7,8]. The field has also been extended to the 3D case [9], with high impact in autonomous driving. Knowing where people are is an attractive capability that supports decision-making in different environments.

With the power of embedded computing, multiple areas of application have emerged. Monitoring safe working areas is an interesting use case, capable of measuring the degree of protection each worker has while performing a task. According to [14], safety performance is generally measured by reactive (after the event) and proactive indicators. For the sake of simplicity and application scope, our work is focused on proactive monitoring. In [10], a solution based on internal traffic planning and object detection was proposed to evaluate risk from aerial images. A more practical idea was presented in [11], where the authors used YOLOv4, trained with the Darknet framework, to detect personal protective equipment (PPE). R-CNN has also been used to detect missing hard hats in videos [12]. Another attempt at detecting hard hat usage on construction sites is made in [13], this time using YOLOv3.

The main disadvantage of the solutions above is that they lack practical applicability and granularity. The vast majority of solutions reported in this field rely on a heavy model that produces very good accuracy while sacrificing the scalability and viability of the product. It is practically impossible to run these algorithms in real time at the edge since they are considerably large. For that reason, we propose combining technologies specialized in edge computing to solve the problem.

Safety Assessment with DeepView ModelPack and VisionPack

With the aim of solving the problem of monitoring whether workers are wearing protection equipment, we built an application by combining ModelPack and VisionPack. Our application uses ModelPack trained on two different datasets, resulting in ModelPack-People-Detection and ModelPack-Safety-Gear. These two models are loaded into VisionPack and executed on each input frame. Finally, the results are combined with a simple strategy that matches each person with their safety gear and produces a bounding box for the entire person, colored according to the scheme explained above (green, orange, red). In this section we explain in detail what ModelPack and VisionPack are, and we dedicate a subsection to showing how the pieces are connected.

DeepView ModelPack

ModelPack for Detection provides a state-of-the-art detection algorithm [6] featuring real-time performance on i.MX 8M Plus platforms. It is especially well optimized for the NPU and GPU found on this platform and has been fine-tuned to support full per-tensor quantization while preserving accuracy close to that of the float model. DeepView ModelPack is fully compatible with DeepView Creator and eIQ Toolkit, which provide an easy-to-use graphical interface for creating datasets, training, model validation, and deploying optimized models to i.MX 8M Plus platforms.

A key feature of ModelPack is that it ships with well-optimized weights that make training sessions successful from the start, showing higher accuracy and faster convergence. Since ModelPack is based on the YOLOv4 algorithm, anchor generation happens automatically, and the anchors are refined with a genetic algorithm in a second stage. Programming languages, complicated loss functions, data augmentation, and training frameworks are transparent to the user, since the integration with the GUI is straightforward and easy to use.

Architecture diagram using the Maivin AI Vision Starter Kit.

The above figure shows different working scenarios related to construction. It can be observed that some of the workers are using all the protection equipment while others are using none of it. These images were created by running one of the resulting ModelPack checkpoints on VisionPack, which is in charge of controlling the execution pipeline at the edge.

DeepView VisionPack

DeepView VisionPack provides vision pipeline solutions for edge computing and embedded machine learning applications. Whether targeting NXP i.MX application processors or i.MX RT crossover MCUs, Au-Zone tunes these pipelines for maximum efficiency. VisionPack provides end-to-end optimizations, including sensor support, pre-processing, and full integration with Au-Zone’s DeepViewRT runtime engine. At the edge, VisionPack can detect your model and establish the right communication with the decoder, so the boxes can be extracted from the raw inference automatically. Highly optimized non-max-suppression algorithms are included, as well as a ready-to-use Python API that lets you connect with the DeepViewRT runtime engine and get inferences in three lines of code.

To solve this problem, we decided to use VisionPack since it acts as a higher-end application library that connects your model with the production environment in a few lines of code. Using VisionPack enabled us to focus on solving the vision problem rather than on implementing the vision pipeline, allowing a full prototype to be developed in only 2 days.


DeepView VisionPack Overview

Pipeline Integration

The safety framework starts by connecting VisionPack to the input source. VisionPack loads two models: ModelPack-People-Detection and ModelPack-Safety-Gear. The first one is executed on every frame, while the second one is conditioned on the outcome of the first: if no person is detected, the second model is not executed. If the first model does detect people, the second model is executed and all the objects in the frame are retrieved with their respective bounding boxes (hardhat, no_hardhat, no_vest, and vest). Finally, the following heuristic is applied to check the safety status of each worker.

Here, S is the heuristic function, x is the outcome of ModelPack-People-Detection, and y is the outcome of ModelPack-Safety-Gear. The final application draws boxes around people using the following color scheme:

  • Red: Danger
  • Orange: Warning
  • Green: Safe


Visualization of the monitoring.

People wearing both a vest and a hard hat are drawn in green. People wearing only one piece of protective equipment (hard hat or vest) are drawn in orange. Otherwise, people are considered unsafe and are drawn in red, because they are not wearing any protective equipment.
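The heuristic itself is simple enough to express directly in code. The following is a minimal sketch based only on the rules described above (both items present means safe, exactly one means warning, none means danger). The function name, the box format, and the rule used to associate gear boxes with a person (the fraction of the gear box lying inside the person box) are assumptions of this sketch, not VisionPack’s or ModelPack’s actual implementation.

```python
def safety_status(person_box, gear_detections, inside_threshold=0.5):
    """Classify one detected person as 'safe', 'warning' or 'danger'.

    person_box: [x1, y1, x2, y2] from the people-detection model.
    gear_detections: list of (label, [x1, y1, x2, y2]) from the safety-gear model,
    with labels among: hardhat, no_hardhat, vest, no_vest.
    """
    def inside_fraction(gear, person):
        # Fraction of the gear box area that falls inside the person box.
        ix = min(gear[2], person[2]) - max(gear[0], person[0])
        iy = min(gear[3], person[3]) - max(gear[1], person[1])
        if ix <= 0 or iy <= 0:
            return 0.0
        inter = ix * iy
        gear_area = (gear[2] - gear[0]) * (gear[3] - gear[1])
        return inter / gear_area

    labels = {lbl for lbl, box in gear_detections
              if inside_fraction(box, person_box) > inside_threshold}
    has_hardhat = "hardhat" in labels
    has_vest = "vest" in labels
    if has_hardhat and has_vest:
        return "safe"      # drawn in green
    if has_hardhat or has_vest:
        return "warning"   # drawn in orange
    return "danger"        # drawn in red

# Toy usage: a person wearing a hard hat but no vest.
print(safety_status([0, 0, 100, 200],
                    [("hardhat", [20, 0, 60, 40]), ("no_vest", [10, 80, 90, 160])]))  # 'warning'
```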

Notifications will be sent according to the requirements of each working area.

 

The figure above describes the entire workflow of our application; the application logic is implemented in node S (the heuristic).

Experiments and Contributions

In developing this system, the first difficult task was dataset cleaning. The dataset is in the public domain and was taken from this resource (https://drive.google.com/drive/u/2/folders/14zw_X1ImyOo71jEGyxT7ARxVV8t1rYDU), excluding the planer_images as the author recommends on his YouTube channel. The dataset is in YOLO format and contains four classes (hard hat, boots, mask, vest). We removed the boots and mask classes, since they are not relevant for this demo. The dataset was then reviewed, bad annotations were corrected, and all images were re-annotated with two additional classes (no-vest and no-hardhat). All dataset manipulation was completed using eIQ Portal (class removal, dataset review and correction, and annotation of the new classes).

Dataset

The resulting dataset is heavily imbalanced, since the two new classes (no_hardhat and no_vest) were included specifically to reduce false positives. Initially, a model trained on the two original classes detected vests accurately, but its predictions were also full of false positives. Adding these two classes mitigated those false detections, and doing so was easy using the eIQ GUI.


Dataset Curator

Finally, the dataset and class distribution are shown in the following chart.


Train and Test annotations distribution across all the classes in the dataset.

DeepView ModelPack Customization

To make ModelPack accurate on both problems, people detection and safety gear detection, we used a ModelPack from the DeepView Zoo trained on the COCO 2017 dataset for the first problem (people detection). For the second problem (safety gear detection), we trained ModelPack on the safety-gear dataset.

To do that, we used the following hyperparameters:

  • Input Resolution: 416 × 416
  • Learning-Rate: Linear decay, starting at 1e-3 with a decay factor of 0.9 executed after 5 epochs.
  • Batch size: 10
  • Number of Epochs: 25
  • Augmentation: Default-Oriented Augmentation


ModelPack training graph after 25 epochs

Once the training process ended, we validated the model and obtained 52.42% mAP@50. This validation was performed on the per-tensor quantized model, using uint8 input and int8 output. As a benchmark, we also tested the free SSD model included in eIQ Portal, which reached only 37% mAP@50.


Inference time vs mAP@50. SSD vs ModelPack.

For benchmarking purposes, we trained both models for 25 epochs only. Note that ModelPack’s mAP is roughly 15 percentage points higher than SSD’s.

From the chart above we can see that SSD is faster but less accurate than ModelPack. In our case we decided to move forward with ModelPack, since it runs in 14 ms on the NPU, which translates into roughly 60 FPS. Note that our camera is limited to 30 FPS, so that extra speed would not be observed in the results anyway. Instead, we chose the more accurate model and ran two instances of the same architecture (people detection and safety gear detection) on the same device while keeping a detection rate of 30 FPS.

Conclusions

This article presents a use case based on real-time monitoring of whether workers are wearing protective equipment in the workplace. The system operates at a full 30 FPS and raises an alert every time a worker is not wearing the appropriate protective equipment. Using Au-Zone’s middleware, ModelPack and VisionPack, a solution was developed in less than three days. Both software packages are highly specialized for edge computing tasks, which makes it possible to build a new vision-based solution in record time. During experimentation we observed that the free model from eIQ Portal (SSD) is less accurate on this dataset, and its speed advantage is not a requirement in this case, since we are limited by the camera rate (30 FPS). Finally, we tested our software both indoors and outdoors, and the results can be seen in the following videos.

Bibliography

  1. ModelPack: Au-Zone Technologies solution for Object Detection at the Edge. https://www.embeddedml.com/deepview-model-pack (May, 2022).
  2. VisionPack: Higher end vision pipeline from Au-Zone Technologies for solving Computer Vision problems at the edge. https://www.embeddedml.com/deepview-vision-pack (May, 2022).
  3. Patel, Dilipkumar Arvindkumar, and Kumar Neeraj Jha. “An estimate of fatal accidents in Indian construction.” Proceedings of the 32nd annual ARCOM conference. Vol. 1. 2016.
  4. Hämäläinen, Päivi. “Global estimates of occupational accidents and fatal work-related diseases.” (2010).
  5. Woodward, Gary C. The Rhetoric of Intention in Human Affairs. Lexington Books, 2013.
  6. Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. “Yolov4: Optimal speed and accuracy of object detection.” arXiv preprint arXiv:2004.10934 (2020).
  7. Tian, Di, et al. “A review of intelligent driving pedestrian detection based on deep learning.” Computational intelligence and neuroscience 2021 (2021).
  8. Linder, Timm, et al. “Cross-Modal Analysis of Human Detection for Robotics: An Industrial Case Study.” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
  9. Wang, Yingjie, et al. “Multi-modal 3d object detection in autonomous driving: a survey.” arXiv preprint arXiv:2106.12735 (2021).
  10. Kim, Kyungki, Sungjin Kim, and Daniel Shchur. “A UAS-based work zone safety monitoring system by integrating internal traffic control plan (ITCP) and automated object detection in game engine environment.” Automation in Construction 128 (2021): 103736.
  11. Kumar, Saurav, et al. “YOLOv4 algorithm for the real-time detection of fire and personal protective equipments at construction sites.” Multimedia Tools and Applications (2021): 1-21.
  12. Qi Fang, Heng Li, Xiaochun Luo, Lieyun Ding, Hanbin Luo, Timothy M. Rose, Wangpeng An, Detecting non-hardhat-use by a deep learning method from far-field surveillance videos, Automation in Construction, 2018.
  13. Hu, Jing, et al. “Detection of workers without the helments in videos based on YOLO V3.” 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 2019.
  14. Hinze, Jimmie, Samuel Thurman, and Andrew Wehle. “Leading indicators of construction safety performance.” Safety science 51.1 (2013): 23-28.

Reinier Oves Garcia
Senior Machine Learning Developer, Au-Zone Technologies

The post Monitoring Protection Gear in Hazardous Working Spaces Using DeepView ModelPack & VisionPack appeared first on Edge AI and Vision Alliance.

Access the Latest in Vision AI Model Development Workflows with NVIDIA TAO Toolkit 5.0 https://www.edge-ai-vision.com/2023/04/access-the-latest-in-vision-ai-model-development-workflows-with-nvidia-tao-toolkit-5-0/ Thu, 06 Apr 2023 08:00:14 +0000

This article was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.

NVIDIA TAO Toolkit provides a low-code AI framework to accelerate vision AI model development suitable for all skill levels, from novice beginners to expert data scientists. With NVIDIA TAO (Train, Adapt, Optimize) Toolkit, developers can use the power and efficiency of transfer learning to achieve state-of-the-art accuracy and production-class throughput in record time with adaptation and optimization.

At NVIDIA GTC 2023, NVIDIA announced NVIDIA TAO Toolkit 5.0, bringing groundbreaking features to enhance any AI model development. The new features include source-open architecture, transformer-based pretrained models, AI-assisted data annotation, and the capability to deploy models on any platform.


Figure 1. NVIDIA TAO Toolkit workflow diagram

Deploy NVIDIA TAO models on any platform, anywhere

NVIDIA TAO Toolkit 5.0 supports model export in ONNX. This makes it possible to deploy a model trained with NVIDIA TAO Toolkit on any computing platform—GPUs, CPUs, MCUs, DLAs, FPGAs—at the edge or in the cloud. NVIDIA TAO Toolkit simplifies the model training process and optimizes the model for inference throughput, powering AI across hundreds of billions of devices.


Figure 2. NVIDIA TAO Toolkit architecture

STMicroelectronics, a global leader in embedded microcontrollers, integrated NVIDIA TAO Toolkit into its STM32Cube AI developer workflow. This puts the latest AI capabilities into the hands of millions of STMicroelectronics developers. It provides, for the first time, the ability to integrate sophisticated AI into widespread IoT and edge use cases powered by the STM32Cube.

Now with NVIDIA TAO Toolkit, even the most novice AI developers can optimize and quantize AI models to run on STM32 MCU within the microcontroller’s compute and memory budget. Developers can also bring their own models and fine-tune using TAO Toolkit. More information about this work is captured in the demo below from STMicroelectronics.

Video 1. Learn how to deploy a model optimized with TAO Toolkit on STM microcontroller

While TAO Toolkit models can run on any platform, these models achieve the highest throughput on NVIDIA GPUs using TensorRT for inference. On CPUs, these models use ONNX-RT for inference. The script and recipe to replicate these numbers will be provided once the software becomes available.
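As a rough illustration of the CPU path, the snippet below runs an exported ONNX detector with ONNX Runtime. The model filename, input resolution, and preprocessing are placeholders; the exact requirements depend on the network exported from TAO Toolkit.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for a model exported from TAO Toolkit.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_meta = session.get_inputs()[0]
print(input_meta.name, input_meta.shape)       # inspect the expected input layout

# Dummy NCHW float input; real code would resize and normalize an actual image
# to whatever the exported network expects.
dummy = np.random.rand(1, 3, 544, 960).astype(np.float32)
outputs = session.run(None, {input_meta.name: dummy})
for out, meta in zip(outputs, session.get_outputs()):
    print(meta.name, out.shape)
```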

Model                   NVIDIA Jetson Orin Nano 8 GB   NVIDIA Jetson AGX Orin 64 GB   T4      A2      A100     L4      H100
PeopleNet               112                            679                            429     242     3,264    797     7,062
DINO – FAN-S            3                              11.4                           29.9    16.5    174      52.7    292
SegFormer – MiT         1.3                            4.7                            6.2     4       40.6     10.4    70
OCRNet                  981                            3,921                          3,903   2,089   27,885   7,241   53,809
EfficientDet            61                             227                            303     184     1,521    522     2,428
2D Body Pose            136                            557                            593     295     4,140    1,010   7,812
3D Action Recognition   52                             212                            269     148     1,658    529     2,708
Table 1. Performance comparison (in FPS) of several NVIDIA TAO Toolkit vision models, including new vision transformer models, on NVIDIA GPUs

AI-assisted data annotation and management

Data annotation remains an expensive and time-consuming process for all AI projects. This is especially true for CV tasks like segmentation, which require generating a pixel-level segmentation mask around each object. Generally, segmentation masks cost about 10x more to annotate than object detection or classification labels.

It is faster and less expensive to annotate segmentation masks with new AI-assisted annotation capabilities using TAO Toolkit 5.0. Now you can use the weakly supervised segmentation architecture, Mask Auto Labeler (MAL) to assist in segmentation annotation and in fixing and tightening bounding boxes for object detection. Loose bounding boxes around an object in ground truth data can lead to suboptimal detection results, but with the AI-assisted annotation, you can tighten your bounding boxes over objects, leading to more accurate models.


Figure 3. NVIDIA TAO Toolkit auto labeling workflow

MAL is a transformer-based, mask auto labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates the mask pseudo-labels. It uses COCO annotation format for both input and output labels.

MAL significantly reduces the gap between auto labeling and human annotation for mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of the fully supervised counterparts, retaining up to 97.4% performance of fully supervised models.


Figure 4. Mask Auto Labeler (MAL) network architecture

When training the MAL network, a task network and a teacher network (sharing the same transformer structure) work together to achieve class-agnostic self-training. This enables refining the prediction masks with conditional random field (CRF) loss and multi-instance learning (MIL) loss.

TAO Toolkit uses MAL in both the auto labeling pipeline and data augmentation pipeline. Specifically, users can generate pseudo-masks on the spatially augmented images (sheared or rotated, for example), and refine and tighten the corresponding bounding boxes using the generated masks.

State-of-the-art vision transformers

Transformers have become the standard architecture in NLP, largely because of self-attention. They have also gained popularity for a range of vision AI tasks. In general, transformer-based models can outperform traditional CNN-based models due to their robustness, generalizability, and ability to perform parallelized processing of large-scale inputs. All of this increases training efficiency, provides better robustness against image corruption and noise, and generalizes better on unseen objects.

TAO Toolkit 5.0 features several state-of-the-art (SOTA) vision transformers for popular CV tasks, as detailed below.

Fully Attentional Network

Fully Attentional Network (FAN) is a transformer-based family of backbones from NVIDIA Research that achieves SOTA in robustness against various corruptions. This family of backbones can easily generalize to new domains and be more robust to noise, blur, and more.

A key design behind the FAN block is the attentional channel processing module that leads to robust representation learning. FAN can be used for image classification tasks as well as downstream tasks such as object detection and segmentation.


Figure 5. Activation heat map on a corrupted image for ResNet50 (center) compared to FAN-Small (right)

The FAN family supports four backbones, as shown in Table 2.

Model # of parameters/FLOPs Accuracy
FAN-Tiny 7 M/3.5 G 71.7
FAN-Small 26 M/6.7 G 77.5
FAN-Base 50 M/11.3 G 79.1
FAN-Large 77 M/16.9 G 81.0
Table 2. FAN backbones with size and accuracy

Global Context Vision Transformer

Global Context Vision Transformer (GC-ViT) is a novel architecture from NVIDIA Research that achieves very high accuracy and compute efficiency. GC-ViT addresses the lack of inductive bias in vision transformers. It achieves better results on ImageNet with a smaller number of parameters through the use of local self-attention.

Local self-attention paired with global context self-attention can effectively and efficiently model both long and short-range spatial interactions. Figure 6 shows the GC-ViT model architecture. For more details, see Global Context Vision Transformers.


Figure 6. GC-ViT model architecture

As shown in Table 3, the GC-ViT family contains six backbones, ranging from GC-ViT-xxTiny (compute efficient) to GC-ViT-Large (very accurate). GC-ViT-Large models can achieve Top-1 accuracy of 85.6 on the ImageNet-1K dataset for image classification tasks. This architecture can also be used as backbone for other CV tasks like object detection and semantic and instance segmentation.

Model # of parameters/FLOPs Accuracy
GC-ViT-xxTiny 12 M/2.1 G 79.6
GC-ViT-xTiny 20 M/2.6 G 81.9
GC-ViT-Tiny 28 M/4.7 G 83.2
GC-ViT-Small 51 M/8.5 G 83.9
GC-ViT-Base 90 M/14.8 G 84.4
GC-ViT-Large 201 M/32.6 G 85.6
Table 3. GC-ViT backbones with size and accuracy

DINO

DINO (detection transformer with improved denoising anchor) is the newest generation of detection transformers (DETR). It achieves faster training convergence time than its predecessor. Deformable-DETR (D-DETR) requires at least 50 epochs to converge, while DINO can converge in 12 epochs on the COCO dataset. It also achieves higher accuracy when compared with D-DETR.

DINO achieves faster convergence through the use of denoising during training, which helps the bipartite matching process at the proposal generation stage. The training convergence of DETR-like models is slow due to instability of bipartite matching. Bipartite matching removed the need for handcrafted and compute-heavy NMS operations. However, it often required much more training because incorrect ground truths were matched to the predictions during bipartite matching.

To remedy such a problem, DINO introduced noised positive ground-truth boxes and negative ground-truth boxes to handle “no object” scenarios. As a result, training converges very quickly for DINO. For more information, see DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.


Figure 7. DINO architecture

DINO in TAO Toolkit is flexible and can be combined with various backbones, from traditional CNNs such as ResNets to transformer-based backbones like FAN and GC-ViT. Table 4 compares the accuracy on the COCO dataset of various versions of DINO against the popular YOLOv7. For more details, see YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors.

Model Backbone AP AP50 AP75 APS APM APL Param
YOLOv7 N/A 51.2 69.7 55.5 35.2 56.0 66.7 36.9M
DINO ResNet50 48.8 66.9 53.4 31.8 51.8 63.4 46.7M
DINO FAN-Small 53.1 71.6 57.8 35.2 56.4 68.9 48.3M
DINO GCViT-Tiny 50.7 68.9 55.3 33.2 54.1 65.8 46.9M
Table 4. DINO and D-DETR accuracy on the COCO dataset

SegFormer

SegFormer is a lightweight transformer-based semantic segmentation model whose decoder is made of lightweight MLP layers. It avoids positional encoding (used by most transformers), which keeps inference efficient at different resolutions.

Adding FAN backbone to SegFormer MLP decoder results in a highly robust and efficient semantic segmentation model. FAN base hybrid + SegFormer was the winning architecture at the Robust Vision Challenge 2022 for semantic segmentation.


Figure 8. SegFormer with FAN prediction (right) on noisy input image (left)

Model Dataset Mean IOU (%) Retention rate (robustness) (%)
PSPNet Cityscapes Validation 78.8 43.8
SegFormer – FAN-S-Hybrid Cityscapes validation 81.5 81.5
Table 5. Robustness of SegFormer compared to PSPNet

CV tasks beyond object detection and segmentation

NVIDIA TAO Toolkit accelerates a wide range of CV tasks beyond traditional object detection and segmentation. The new character detection and recognition models in TAO Toolkit 5.0 enable developers to extract text from images and documents. This automates document conversion and accelerates use cases in industries like insurance and finance.

Detecting anomalies in images is useful when the object being classified varies greatly, such that training with all the variation is impossible. In industrial inspection, for example, a defect can come in any form. Using a simple classifier could result in many missed defects if the defect has not been previously seen by the training data.

For such use cases, comparing the test object directly against a golden reference would result in better accuracy. TAO Toolkit 5.0 features a Siamese neural network in which the model calculates the difference between the object under test and a golden reference to classify if the object is defective.

Automate training using AutoML for hyperparameter optimization

Automated machine learning (autoML) automates the manual task of finding the best models and hyperparameters for the desired KPI on a given dataset. It can algorithmically derive the best model and abstract away much of the complexity of AI model creation and optimization.

AutoML in TAO Toolkit is fully configurable for automatically optimizing the hyperparameters of a model. It caters to both AI experts and nonexperts. For nonexperts, the guided Jupyter notebook provides a simple, efficient way to create an accurate AI model.

For experts, TAO Toolkit gives you full control of which hyperparameters to tune and which algorithm to use for sweeps. TAO Toolkit currently supports two optimization algorithms: Bayesian and Hyperband optimization. These algorithms can sweep across a range of hyperparameters to find the best combination for a given dataset.

AutoML is supported for a wide range of CV tasks, including several new vision transformers such as DINO, D-DETR, SegFormer, and more. Table 6 shows the full list of supported networks (bold items are new to TAO Toolkit 5.0).

Image classification: FAN, GC-ViT, ResNet, EfficientNet, DarkNet, MobileNet
Object detection: DINO, D-DETR, YoloV3/V4/V4-Tiny, EfficientDet, RetinaNet, FasterRCNN, DetectNet_v2, SSD/DSSD
Segmentation: SegFormer, UNET, MaskRCNN
Other: LPRNet
Table 6. Models supported by AutoML in TAO Toolkit, including several new vision transformer models (bold items are new to TAO Toolkit 5.0)

REST APIs for workflow integration

TAO Toolkit is modular and cloud-native, meaning it is available as containers and can be deployed and managed using Kubernetes. TAO Toolkit can be deployed as a self-managed service on any public or private cloud, DGX, or workstations. TAO Toolkit provides well-defined REST APIs, making it easy to integrate in your development workflow. Developers can call the API endpoints for all training and optimization tasks. These API endpoints can be called from any applications or user interface, which can trigger training jobs remotely.


Figure 9. TAO Toolkit architecture for cloud native deployment

Better inference optimization

To simplify productization and increase inference throughput, TAO Toolkit provides several turnkey performance optimization techniques. These include model pruning, lower precision quantization, and TensorRT optimization, which can combine to deliver 4x to 8x performance boost, compared to a comparable model from public model zoos.


Figure 10. Performance comparison between TAO Toolkit optimized and public models on a wide range of GPUs

Open and flexible, with better support

An AI model predicts output based on complex algorithms. This can make it difficult to understand how the system arrived at its decision and challenging to debug, diagnose, and fix errors. Explainable AI (XAI) aims to address these challenges by providing insights into how AI models arrive at their decisions. This helps humans understand the reasoning behind the AI output and makes it easier to diagnose and fix errors. This transparency can help to build trust in AI systems.

To help with transparency and explainability, TAO Toolkit will now be available as source-open. Developers will be able to view feature maps from internal layers, as well as plot activation heat maps to better understand the reasoning behind AI predictions. In addition, having access to the source code will give developers the flexibility to create customized AI, improve debug capability, and increase trust in their models.

NVIDIA TAO Toolkit is enterprise-ready and available through NVIDIA AI Enterprise (NVAIE). NVAIE provides companies with business-critical support, access to NVIDIA AI experts, and priority security fixes. Join NVAIE to get support from AI experts.

Integration with cloud services

NVIDIA TAO Toolkit 5.0 is integrated into various AI services that you might already use, such as Google Vertex AI, AzureML, Azure Kubernetes service, and Amazon EKS.


Figure 11. TAO Toolkit 5.0 is integrated with various AI services

Summary

TAO Toolkit offers a platform for any developer, in any service, and on any device to easily transfer-learn their custom models, perform quantization and pruning, manage complex training workflows, and perform AI-assisted annotation with no coding requirements. At GTC 2023, NVIDIA is announcing TAO Toolkit 5.0. Sign up to be notified about the latest updates to TAO Toolkit.

Download NVIDIA TAO Toolkit and get started creating custom AI models. You can also experience NVIDIA TAO Toolkit on LaunchPad.

Chintan Shah
Product Manager, NVIDIA

Debraj Sinha
Product Marketing Manager for Metropolis, NVIDIA

Yu Wang
Senior Engineer, Intelligent Video Analytics Team, NVIDIA

Sean Cha
Deep Learning Software Engineer, Intelligent Video Analytics Team, NVIDIA

Subhashree Radhakrishnan
Senior Deep Learning Engineer, Intelligent Video Analytics Team, NVIDIA

From DALL·E to Stable Diffusion: How Do Text-to-Image Generation Models Work? https://www.edge-ai-vision.com/2023/01/from-dall%c2%b7e-to-stable-diffusion-how-do-text-to-image-generation-models-work/ Thu, 12 Jan 2023 09:00:49 +0000

This article was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

The machine learning community lost its mind when OpenAI released DALL·E in early 2021. Previous years had seen a lot of progress in models that could generate increasingly better (and more realistic) images given a written caption, but with an unprecedented level of range and flexibility in the content and style of the images it could generate, DALL·E was the pinnacle of what one could expect from a text-to-image generation model… One year later, DALL·E is but a distant memory, and a new breed of generative models has absolutely shattered the state-of-the-art of image generation.


DALL·E results for the caption “An armchair in the shape of an avocado”. Source: OpenAI’s DALL·E blogpost.


DALL·E 2 results for the caption “An armchair in the shape of an avocado”. Generated with OpenAI’s DALL·E 2 beta.


Stable Diffusion results for the caption “An armchair in the shape of an avocado”. Generated locally in Tryolabs’ hardware.

So what’s changed? We at Tryolabs are in full awareness that one year in Machine Learning research is equivalent to about a decade in other fields — but even for us, this is a massive leap in performance for such a short period. The answer isn’t hard to come by when analyzing the similarities between the latest models: a completely novel approach to text-to-image generation through the use of diffusion models has surfaced — and it’s here to stay.

If you haven’t been living under a rock for the last couple of months, the names DALL·E 2, Imagen, and Stable Diffusion should ring a bell. Each generated a new weeks-long wave of Twitter threads, discussing architectural details and showcasing results that escaped the very boundaries of the AI communities that managed to create them.

In this blog post, we will take a peek at how diffusion works for generating images, explain exactly where the differences between these three models lie, and analyze what real-world tasks these models might aid companies and individuals with in the long run. Don’t hesitate to read along if math isn’t your strongest suit — we’ll make sure to keep it to a minimum!

The DALL·E way

To better understand what has changed, let’s first dive into how OpenAI’s original DALL·E worked.

Released in January 2021, a few months after GPT-3, DALL·E made use of two components: a Transformer, the deep learning architecture that surfaced in 2017 and has since been the de facto choice for text encoding and for processing sequential input, and a variational autoencoder (VAE), a model trained to encode an image into a low-dimensional probability distribution and decode it back, which can generate new images by sampling from the intermediate distribution and passing the sample through the decoder.

The Transformer was trained autoregressively (i.e., making it try to predict future values based on past ones) on the concatenation of a sequence of text tokens and a sequence of image tokens, the second of which were obtained with the VAE trained beforehand. During sampling, the Transformer is given the whole text prompt and generates each token in the image sequentially, to be later decoded by the VAE to obtain the full output image.

At the time of its release, DALL·E showed truly mind-blowing results, despite having some limitations, the most blatant being its difficulty generating photorealistic images rather than cartoonish or artistic-looking ones.


DALL·E results for the caption “the exact same cat on top as a sketch on the bottom”. Source: OpenAI’s DALL·E blogpost.

2021 went by without much news in the image generation space, except for a single paper out of OpenAI titled GLIDE. GLIDE, which admittedly flew under our radar at the time, proposed using diffusion models for the problem of text-conditional image synthesis, guiding the diffusion process towards whatever researchers wanted the final image to look like. This approach, or variants of it, was then used by DALL·E 2, Imagen and Stable Diffusion in their respective models. Before diving into how this guidance works, let’s do a quick recap on what diffusion models are.

How does diffusion work?

Diffusion models are generative models able to synthesize high-quality images from a latent variable. Wait, isn’t that what GANs do? GANs and diffusion models (and VAEs and flow-based models, while we’re at it) are similar in that they aim to produce an image from randomness, but different in every other way.

The GAN approach has been the standard for generating images for several years now, especially when needing to generate images belonging to a tight distribution, such as faces of people or breeds of dogs. We have written about GANs in the past, but in a nutshell, their training consists of spawning two models, called the generator and the discriminator, and having the generator try to generate image samples that trick the discriminator into thinking they come from the real data distribution being trained with. Although the paradigm of having two models train each other is quite amusing, GANs are also famous for being especially hard to train, with generators that straight up don’t learn or fall into mode collapse (i.e. that learn to generate the same image every time) being extremely common. For the interested reader, there are some hints as to why this is in this Twitter thread.

On the other hand, diffusion models consist of generating a chain of N increasingly-noisy images by gradually adding Gaussian noise to an image, and then training a model to predict the noise that was added to the image from one step to the following one. If the steps are small enough, one can ensure that the image obtained at the end of the sequence can be approximated by the same Gaussian the noise is being sampled with — which allows us to generate a completely new image by sampling from that same distribution and then passing it N times through our trained model. We won’t get into the math behind them here, but we recommend checking out Lilian Weng’s blog post if that’s your jam!
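To make the forward (noising) process concrete, here is a minimal NumPy sketch using the standard closed-form expression for q(x_t | x_0) with a simple linear noise schedule; the schedule values and the toy image are illustrative, not the settings used by any of the models discussed here.

```python
import numpy as np

def forward_diffusion(x0, n_steps=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Produce the chain of increasingly noisy versions of image x0.

    Uses the closed form q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0,
    (1 - alpha_bar_t) * I), so each step can be sampled directly.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, n_steps)   # linear noise schedule
    alpha_bars = np.cumprod(1.0 - betas)                  # cumulative products
    chain = []
    for t in range(n_steps):
        noise = rng.standard_normal(x0.shape)
        xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
        chain.append(xt)
    return chain

# Toy usage: a fake 8x8 grayscale "image" scaled to [-1, 1].
x0 = np.random.default_rng(1).uniform(-1, 1, size=(8, 8))
noisy = forward_diffusion(x0)
print(len(noisy), noisy[-1].std())  # the final sample is essentially pure Gaussian noise
```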


Illustration of the diffusion process. Source: NVIDIA Developer.

Not only do all the approaches we will dive into in this post use diffusion, but they also all use variants of the same model to predict the noise added to the image in each step of the chain: a U-Net. Surfaced in 2015 and initially proposed as a new way to tackle the problem of biomedical image segmentation, U-Nets possess the characteristic of having the same input and output shapes, which makes them ideal for inputting an image and obtaining how much noise was added to each of its pixels. They consist of a stack of residual layers and downsampling convolutions followed by a stack of residual layers with upsampling convolutions, with skip connections linking the layers with the same spatial size in the two symmetric halves of the network. The contracting path allows for capturing the general context of the input image, while the skip connections provide the upsampling layers with the detailed information needed at each step.
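For intuition, a toy PyTorch U-Net with two resolution levels is sketched below. It is not the exact variant used by these diffusion models (which add residual blocks, attention, and timestep conditioning), but it shows the contracting path, the expanding path, and the skip connections that give the output the same shape as the input.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # Two 3x3 convolutions with ReLU, keeping the spatial size (padding=1).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.enc1 = block(channels, 32)
        self.enc2 = block(32, 64)
        self.bottleneck = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)              # 128 = 64 (skip) + 64 (upsampled)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.out = nn.Conv2d(32, channels, 1)   # same shape as the input image
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        e1 = self.enc1(x)                       # full resolution features
        e2 = self.enc2(self.pool(e1))           # 1/2 resolution
        b = self.bottleneck(self.pool(e2))      # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)                     # e.g. the predicted noise per pixel

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 64, 64]) -- matches the input
```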


Overview of U-Net’s architecture. Source: U-Net’s research paper.

How can we guide the diffusion process?

We’ve learned how diffusion can help us to generate an image from random noise, but if that were as much as there was to it we would end up with a model that is only able to generate random images. How can we make use of this model to synthesize images that correspond with a class name in our training data, a piece of text, or another image? This is where GLIDE, the paper we mentioned earlier, comes in. It builds on two important concepts proposed by papers that precede it:

  • Conditioned diffusion, which comprises feeding conditioning data (such as a class label or a text embedding) to the diffusion model through its input layer and through attention at its inner layers, to aid it in producing a sample that corresponds to that data.
  • Classifier guidance, which consists of adding the gradient with respect to a target class predicted by a classifier to the noise predicted by the diffusion model, thus forcing the diffusion process towards the expected class in every step of the way.

GLIDE proposes two novel ways of guiding a diffusion model towards a caption of our liking:

  • CLIP guidance works in a very similar way to classifier guidance except it sums the gradient with respect to the CLIP loss between the partially-generated image and the caption (which measures how much the two correspond with each other — if you don’t know how CLIP works, we strongly recommend checking it out in OpenAI’s blog!)
  • Classifier-free guidance works by training a conditioned model in which the caption is randomly dropped during training (yielding a model that can operate both conditioned and unconditioned). During inference, the image is passed through the model both with and without the conditioning caption, and the final prediction is extrapolated from the unconditioned prediction in the direction of the conditioned one.

Note that a diffusion model can use both conditioning and guidance at the same time, and the authors found that a combo of text conditioning plus classifier-free guidance worked best in their experiments, both in terms of photorealism and caption similarity.
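The classifier-free guidance combination itself is essentially a one-liner: the final noise prediction is the unconditioned one pushed in the direction of the conditioned one by a guidance scale. A minimal sketch follows; the scale value is purely illustrative, and conventions for how the scale is defined vary slightly between implementations.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditioned and conditioned noise predictions.

    guidance_scale = 1.0 recovers the purely conditioned prediction;
    larger values push the sample further toward the caption.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with fake noise predictions from the two forward passes.
eps_u = np.zeros((4, 4))
eps_c = np.ones((4, 4))
print(classifier_free_guidance(eps_u, eps_c, guidance_scale=3.0)[0, 0])  # 3.0
```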

DALL·E 2

Good news — if you followed this far and understood how guided diffusion works, you already know how DALL·E 2, Imagen, and Stable Diffusion work! Each of these uses conditioned diffusion models to attain the mind-shattering results we’ve grown accustomed to. The devil’s in the details though, so let’s dive into what makes each approach unique.

Released by OpenAI in April 2022, a little more than a year after its predecessor, this model should be called GLIDE 2 by the looks of it — but that doesn’t sound as catchy, does it? DALL·E 2 builds on the foundation established by GLIDE and takes it a step further by conditioning the diffusion process with CLIP image embeddings, instead of with raw text embeddings as proposed in GLIDE.


Overview of DALL·E 2’s architecture. Source: DALL·E 2’s research paper.

To obtain this image embedding, a CLIP model is trained on CLIP and DALL·E’s combined datasets, and an auxiliary model called the prior (for which they try both autoregressive and diffusion variants) is trained to produce the image embeddings conditioned on both the encoded captions and their CLIP embeddings. This image embedding (and optionally the encoded caption too) is then used to condition the diffusion model used to generate the final image, called the decoder. Classifier-free guidance is enabled too on the same conditioning information, which according to the paper “improves sample quality a lot”.

Using the generated CLIP image embeddings as conditioning not only improves sample diversity compared to GLIDE but also enables some cool byproducts, such as creating variations of an input image by encoding it and decoding it or generating variations between a pair of images or captions by interpolating their embeddings.


Variations of an image generated by DALL·E 2, original image on top. Source: DALL·E 2’s research paper.


Variations between two images generated by DALL·E 2, original images on each side. Source: DALL·E 2’s research paper.


Variations between two captions generated by DALL·E 2, original images on each side. Source: DALL·E 2’s research paper.

Even though DALL·E 2 clearly solves its predecessor’s issues with generating photorealistic pictures, it has some limitations. It struggles with some common issues in multimodal learning, such as compositionality and variable binding — and amusingly, with generating written text.


Showcasing of some of DALL·E 2’s limitations. Source: DALL·E 2’s research paper.

Imagen

If you felt DALL·E 2’s approach seemed overly complicated, Google is here to tell you they agree. Released only a month after its competitor in May 2022, and claiming “an unprecedented degree of photorealism and a deep level of language understanding”, Imagen improves on GLIDE by simply swapping its custom-trained text encoder for a generic large language model pre-trained on text-only corpora.


Overview of Imagen’s architecture. Source: Imagen’s research paper.

Instead of the 1.2B-parameter Transformer that GLIDE’s authors train on DALL·E’s dataset as part of their training regime, Imagen uses a frozen T5-XXL model, a huge 4.6B-parameter Transformer trained on a much larger dataset. In the published paper, the authors state that they found generic large language models to be surprisingly effective text encoders for text-to-image generation, and that scaling the size of the frozen text encoder improves sample quality significantly more than scaling the size of the image diffusion model. Note that apart from the core text encoder and text-to-image models, two more diffusion models are used to scale the output image up from 64×64 to a whopping 1024×1024 px; this is common practice, and is done in a similar way by the other approaches.


Examples of Imagen results. Source: Imagen’s research paper.

Imagen additionally introduces:

  • Dynamic thresholding, a neat trick to allow performing stronger guidance during sampling to improve photorealism and text-alignment without generating overly-saturated images.
  • Efficient U-Net, a new U-Net which is simpler, more memory-efficient and converges faster.
  • DrawBench, a proposed set of prompts to be used to standardize the evaluation of text-to-image models, according to which Imagen outperformed all other existing alternatives at that time.

As if the image generation landscape wasn’t competitive enough as-is, a second team at Google Research, working in parallel with Imagen’s, decided to put some more wood in the fire by releasing Parti, yet another text-to-image model. Parti takes the autoregressive approach of the original DALL·E (but scales it to a mind-blowing 20B parameters!) instead of the diffusion one, so we won’t explore it in detail in this post; still, it’s just as good as, if not better than, its diffusion-based counterparts, especially at generating images from longer input descriptions!


Examples of Parti results. Source: Parti’s research paper.

Stable Diffusion

Although DALL·E 2, Imagen and Parti produce astonishing results, the first is currently in a beta that only select users have limited free access to, and the latter two have not been released to the public at all. While seeing these huge advancements being made in the field is a feat in itself, at the moment it is impossible for external organizations or individuals to do follow-up research on them or to build AI-powered products that make use of such technology. Well, that was true up until a few days ago, when Stability AI open-sourced Stable Diffusion to the world.

Not only is Stable Diffusion’s model public (and with public we really do mean public — both code and weights have been released and the model can be set up in minutes through HuggingFace’s diffusers library!), but it is also small enough to fit inside consumer GPUs — which is definitely not the case for the massive models used by the previous two approaches. We obviously had to give it a try on our own and didn’t miss the opportunity to generate some cool dinos that might just end up decorating our office’s walls.
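For example, a minimal text-to-image script with the diffusers library looks roughly like the following. The checkpoint ID shown is one commonly used set of public Stable Diffusion weights and is an assumption of this sketch, as are the sampling arguments; exact defaults may differ across library and model versions.

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # placeholder: any Stable Diffusion checkpoint works
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a consumer GPU with a few GB of VRAM is enough for 512x512

prompt = "Triceratops programming on a MacBook in a startup office, oil canvas"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("triceratops.png")
```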


Stable Diffusion results for the caption “Triceratops programming on a MacBook in a startup office with other dinosaurs on the back, oil canvas”. Generated locally in Tryolabs’ hardware.

Still not sold on what a big deal open-sourcing months of research in this manner can mean for the whole AI community? Since its release users have obviously used the model to create true masterpieces — even in video format, giving birth to what could even be called a new kind of art. But Stability’s license for their models doesn’t just allow using it for free for personal purposes: it can be baked into new or existing products to expose its whole array of capabilities to less tech-savvy users! This post provides a comprehensive list of the myriad of services, user interfaces and integrations based on Stable Diffusion that have already emerged and that will most surely keep sprouting and evolving in the upcoming months.

So how exactly does Stable Diffusion manage to compress a model comparable to DALL·E 2 into such a tiny footprint? The secret lies in latent diffusion models, a paper published earlier this year, which found that running diffusion directly in an image’s pixel space is not only super slow and computationally expensive, but also unnecessary. Instead, the authors propose applying diffusion models in the lower-dimensional latent space of a powerful pre-trained autoencoder (similar to the one we mentioned when explaining DALL·E’s approach), and then using the autoencoder itself to decode the final image from it. This separates the compressive learning phase from the generative one: diffusion models shine at what they do best by generating a sample in a space that is perceptually equivalent to the image space but offers significantly reduced computational complexity, while a tried and tested autoencoder does the heavy lifting to obtain the full-sized final image.

Commercial applications

We get it — we’ve gotten really, really good at generating cool images off of a short description… but what for? Are there any real-world applications for this tech? Or is it just for show?

According to a recent article in TechCrunch some businesses are already experimenting with DALL·E 2’s beta, testing out possible use cases for when it becomes stable enough to be knit into their products. StitchFix has experimented using the model to generate images of clothing based on customers’ descriptions, which a stylist could then match with an item in their inventory. Klaviyo, Cosmopolitan, and Heinz have all given DALL·E 2 a spin for marketing purposes, having it generate product visualizations, magazine covers, and brand art respectively, with mixed results. BTW, we have been partnering with companies in the retail industry for a while now, and we curated a guide full of use cases that go further than text-to-image generation models.

The general consensus between these companies seems to be that the model provides value to the people using it, but when used as a tool in the creative process to generate the final image, rather than to generate the final image itself. The ability to produce several different variants for one same prompt can improve creativity or provide an original idea of what the final product should look like — even if none of them is good enough in itself to be used commercially without alterations. It is worth mentioning that prompt engineering, a.k.a. the art of crafting the correct prompts to get a generative model to output exactly what you want, is almost as important as the models’ capabilities themselves, with even books being written to help users make the most out of them!

Diffusion models might have a future in the world of gaming, with some designers testing them out for generating video game assets to then be animated. Although this does raise the age-old question of whether these systems could be used to replace their human counterparts, it seems collaboration between the two is the direction both parts seem to be moving in. It looks like if anyone were to panic because of advancements in this field, it should be the people running stock photo websites, rather than graphic designers and digital artists.

Another possible application sprouts out of these models’ inpainting capabilities, which allow modification of an image on command given a text prompt and a mask. This could not only enable automatic photo editing but also iteratively creating complex scenes from scratch. One can’t fail to imagine such a tool being commercialized in the same way as DALL·E — or even making its way into a new feature in your favorite photo editing software.


Text-conditional image inpainting performed by GLIDE, original image + erased region on the left, result on the right. Source: GLIDE’s research paper.

Final thoughts

This year has been quite a journey for generative AI. The increase in capabilities these models have experienced in such a short time is truly mind-boggling — and the fact that you are now able to run one for free in consumer GPUs even more so. Having several organizations and the hundreds of brilliant individuals that work at them competing to outdo each other is perfect proof that competition drives innovation.

For several years, it looked like AI could be a tool for solving some challenging problems, but would never help much with those soft ones that require human skills or creativity. Now, it looks like AI progress has surprised us, once again acting as a reminder about how hard it is to predict the direction that things will take — even to field experts.

We hope this new breed of powerful AI models can empower more individuals to be creative, giving them tools that — just a few months back — would have only been a dream.

Ian Spektor
Lead Machine Learning Engineer, Tryolabs

DNN-Based Object Detectors https://www.edge-ai-vision.com/2022/11/dnn-based-object-detectors/ Wed, 30 Nov 2022 09:00:25 +0000

This article was originally published at Au-Zone Technologies’ website. It is reprinted here with the permission of Au-Zone Technologies.

Unlike image classifiers, which simply report on the most important objects within an image, object detectors determine where objects of interest are located, their sizes and class labels within an image. Consequently, object detectors are central to numerous computer vision applications.

In this article, we provide a technical introduction to deep-neural-network-based object detectors. We explain how these algorithms work, and how they have evolved in recent years, utilizing examples of popular object detectors. We will discuss some of the trade-offs to consider when selecting an object detector for an application, and touch on accuracy measurement. We also discuss performance comparison among the models discussed in this article.

Early Approaches

The seminal work in this regard was a 2014 technical report from UC Berkeley by Ross Girshick et al., entitled “Rich feature hierarchies for accurate object detection and semantic segmentation”. It is popularly known as “R-CNN” (Regions with CNN features). The authors approached the problem in a methodical way, with three distinct algorithmic stages, as shown in Figure 1 below.


Figure 1: R-CNN [“Rich feature hierarchies for accurate object detection and semantic segmentation”]

As seen above, there are three stages:

  • Region Proposal: Generate and extract category-independent region proposals using selective search.
  • Feature Extractor: Extract features from each candidate region using a deep CNN.
  • Classifier: Classify the features as one of the known classes using a linear SVM classifier.

The main issue was that this approach produces many overlapping bounding boxes, so post-processing and filtering were needed to obtain reliable results. Consequently, it was very slow and required three different algorithms to work in tandem. A faster version of this approach, entitled “Fast R-CNN”, was published in 2015; its authors combined the second and third stages into one network with two outputs – one for classification and the other for bounding-box regression.

Shortly afterwards, they published “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, in which the region proposal stage was also implemented with a separate neural network.

Recent Approaches

YOLO

In 2016 a marked departure was proposed by Joseph Redmon et al., entitled “You Only Look Once: Unified, Real-Time Object Detection”, or YOLO for short. It divides the input image into an S × S grid. If the centre of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes, confidence scores for those boxes, and C class probabilities. Figure 2 below shows the basic idea.


Figure 2: YOLO [You Only Look Once: Unified, Real-Time Object Detection]

Figure 3 below shows the network architecture of the YOLO detector. The network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The convolutional layers were pretrained on the ImageNet classification task.


Figure 3:YOLO Network Architecture

The predictions are encoded as an S × S × (B ∗ 5 + C) tensor. YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can only be assigned one class. As a result, this model struggles with small objects that appear in groups, such as flocks of birds.
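To make the output size concrete, the original YOLO configuration used S = 7, B = 2 and C = 20 (the Pascal VOC classes), which yields a 7 × 7 × 30 tensor:

```python
def yolo_output_shape(S=7, B=2, C=20):
    # Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
    return (S, S, B * 5 + C)

print(yolo_output_shape())  # (7, 7, 30) for the original Pascal VOC configuration
```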

YOLOv3, proposed in 2018, uses a multi-resolution approach with three different grid sizes (8×8, 16×16 and 32×32). Objects are learned according to resolution and anchors, with at least three anchors per resolution to define object aspect ratios at each cell.

More recently, in 2021, YOLOX was proposed. It is anchor-free and uses decoupled output heads for class labels, bounding-box regression and IoU, as shown in Figure 4 below.


Figure 4: YOLOX

SSD (Single Shot Detector)

This approach evaluates a small set of default boxes of different aspect ratios (anchor boxes) at each location in several feature maps with different scales (8 × 8 and 4 × 4 in (b) and (c) in Figure 5 below).


Figure 5: [SSD: Single Shot MultiBox Detector]

The SSD model adds several “feature scaler layers” to the end of a backbone network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. The figure below highlights the differences between YOLO and SSD. For each default box, it predicts both the shape offsets and the confidences for all object categories.


Figure 6: SSD vs YOLO

Inference frame rate (with batch size 8, on a Titan X with cuDNN v4 and an Intel Xeon E5-2667 v3 @ 3.20 GHz) is 46 fps for SSD on 300×300 pixel images with mAP = 74.3%, whereas for YOLO (VGG-16) the frame rate is 21 fps on 448×448 pixel images with mAP = 66.4%. For SSD, the number of candidate bounding boxes can be very large (~8,732), so Non-Maximum Suppression (NMS) post-processing needs to be optimized.
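For reference, greedy NMS itself is short; the sketch below is a plain NumPy version rather than the optimized implementations such detectors actually need at the edge.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,)."""
    order = scores.argsort()[::-1]           # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]    # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the second box is suppressed
```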

RetinaNet


Figure 7: [Focal Loss for Dense Object Detection]

This technique uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture. The authors introduced the Focal Loss to handle the class imbalance typical of object detection datasets.

RetinaNet with a ResNet-101-FPN backbone matches the accuracy of the ResNet-101-FPN Faster R-CNN. The inference time is 172 msec per image on an NVIDIA M40 GPU.

CenterNet

In 2019, CenterNet was proposed, taking a very different approach compared to the other object detection techniques discussed so far. It is trained to produce Gaussian blobs (heatmaps) at the centres of objects, thereby eliminating the need for anchor boxes as well as NMS post-filtering.


Figure 8:[Objects as Points]

Objects are localized by detecting the local maxima of the heatmaps. A single network predicts the keypoints (heatmap local maxima), offsets, and sizes, for a total of C + 4 outputs at each location.
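A minimal sketch of this decoding step is shown below; it keeps a location only if it is the maximum of its 3 × 3 neighbourhood, the standard trick for replacing NMS in heatmap-based detectors. The threshold and top-k values are illustrative, and this is not CenterNet’s exact implementation.

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=10, threshold=0.3):
    """Pick object centres from a (C, H, W) heatmap of per-class scores in [0, 1].

    A location is kept only if it equals the max of its 3x3 neighbourhood,
    which plays the role that NMS plays in anchor-based detectors.
    """
    hmax = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == hmax).float()          # suppress non-maxima
    scores, idx = peaks.flatten().topk(k)
    c, h, w = heatmap.shape
    classes = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    keep = scores > threshold
    return classes[keep], ys[keep], xs[keep], scores[keep]

# Toy usage: one class, a single Gaussian-ish bump centred at (y=5, x=7).
hm = torch.zeros(1, 16, 16)
hm[0, 5, 7] = 0.9
hm[0, 5, 8] = 0.6
print(heatmap_peaks(hm))  # one peak of class 0 at y=5, x=7
```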

CenterNet is not the best in terms of accuracy but provides excellent accuracy-speed trade-offs. It can easily be extended to 3D object detection, multi-object tracking and pose estimation etc.

Evaluation Framework – Accuracy Measurement

True Positive (TP) — Correct detection made by the model.
False Positive (FP) — Incorrect detection made by the detector.
False Negative (FN) — A Ground-truth missed (not detected) by the object detector.

Intersection over Union (IoU)

The IoU metric in object detection evaluates the degree of overlap between the ground truth (gt) and predicted (pd) bounding boxes. It is calculated as the area of the intersection of the two boxes divided by the area of their union: IoU(gt, pd) = area(gt ∩ pd) / area(gt ∪ pd).

IoU ranges between 0 and 1, where 0 shows no overlap and 1 means perfect overlap between ground truth and predictions.
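A minimal computation of IoU for axis-aligned boxes follows; the [x1, y1, x2, y2] box format is an assumption of the sketch.

```python
def iou(gt, pd):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(gt[0], pd[0]), max(gt[1], pd[1])
    ix2, iy2 = min(gt[2], pd[2]), min(gt[3], pd[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_pd = (pd[2] - pd[0]) * (pd[3] - pd[1])
    return inter / (area_gt + area_pd - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333... -- half of each box overlaps
```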

IoU is compared against a threshold (say, α) to decide whether a detection is correct or not. Consequently:

True Positive (TP) is a detection for which IoU(gt, pd) ≥ α.
False Positive (FP) is a detection for which IoU(gt, pd) < α.
False Negative (FN) is a ground truth for which no detection reaches IoU(gt, pd) ≥ α, i.e., a missed ground truth.

If we consider an IoU threshold of α = 0.5, then TPs, FPs and FNs can be identified as shown in the figure above. If we raise the IoU threshold above 0.86, the first instance becomes a FP, and if we lower it below 0.24, the second instance becomes a TP.

Precision is the degree of exactness of the model in identifying only relevant objects. It is the ratio of TPs over all detections made by the model: Precision = TP / (TP + FP).
Recall measures the ability of the model to detect all ground truths, i.e., TPs among all ground truths: Recall = TP / (TP + FN).

A good model is supposed to have high precision and high recall.

Raising the IoU threshold means that more objects will be missed by the model (more FNs and therefore low recall and high precision). Conversely, a low IoU threshold will mean that the model gets more FPs (hence low precision and high recall).

The precision-recall (PR) curve is a plot of precision against recall obtained by varying the detection confidence threshold (at a fixed IoU threshold). As shown in the figures below, the PR curve is not monotonic, so smoothing (interpolation) is applied.


Figure 9: Precision-Recall Curve


Figure 10: Precision-Recall Curve (smoothed)

Average Precision (AP)

AP@α is the Area Under the Precision-Recall Curve (AUC-PR) evaluated at IoU threshold α. AP50 and AP75 mean AP calculated at IoU = 0.5 and IoU = 0.75, respectively.

Mean Average Precision (mAP)

There are as many AP values as there are classes; these are averaged to obtain the metric mean Average Precision (mAP). The Pascal VOC benchmark computes AP by interpolating precision at 11 recall points, [0, 0.1, …, 1.0]. MS COCO describes 12 evaluation metrics, shown below.
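As a rough sketch of how these metrics are computed, the snippet below implements 11-point interpolated AP for a single class and averages per-class APs into mAP. The toy PR values are made up purely for illustration.

```python
import numpy as np

def ap_11_point(recall, precision):
    """Pascal VOC style 11-point interpolated AP.

    recall, precision: arrays describing the PR curve, sorted by decreasing
    confidence (so recall is non-decreasing).
    """
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        # Interpolated precision: the best precision at any recall >= r.
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

def mean_ap(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)

# Toy PR curve for a single class.
recall = np.array([0.1, 0.2, 0.4, 0.6, 0.6, 0.8])
precision = np.array([1.0, 1.0, 0.8, 0.7, 0.6, 0.5])
print(ap_11_point(recall, precision))
print(mean_ap([0.62, 0.48, 0.71]))  # mAP over three hypothetical classes
```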

Comparisons

In this section we summarize comparisons published in publicly available research papers.

The table above provides comparisons between SSD, YOLO and others on the Pascal VOC 2007 dataset. The tests were done on a Titan X GPU with cuDNN v4 and an Intel Xeon E5-2667 v3 @ 3.20 GHz.

Here we see that the SSD model is faster and has higher mAP, but it produces many candidate boxes, so decoding and NMS post-filtering can be time-consuming.

The table below compares RetinaNet with other detectors on the MS COCO dataset. Here, RetinaNet with ResNet-101-FPN and a 600×600 pixel image size runs at 122 msec per image on an NVIDIA M40 GPU. It shows that RetinaNet can achieve very good accuracy with a ResNeXt-101-FPN backbone.

The next table provides comparisons between CenterNet and the previous detectors on the COCO dataset. The tests were performed with an Intel Core i7-8086K CPU, a Titan Xp GPU, PyTorch 0.4.1, CUDA 9.0 and cuDNN 7.1.

Here it can be observed that CenterNet gives the best trade-off between FPS and accuracy.

Choosing An Object Detector

There is no single best detector, and choosing one involves weighing several factors, including application and hardware requirements. In general, highly accurate models are more complex and larger, which makes them unattractive for embedded or resource-constrained devices. On the other hand, smaller models tend to be faster but less accurate. Model size can be reduced, and inference speed enhanced, using quantization from 32-bit floats to 16-bit floats or 8-bit integers (a small quantization sketch follows the table below). Obviously, a significant boost in speed can be achieved on dedicated hardware such as a GPU or NPU, at the expense of portability. The table below provides general observations when it comes to deployment.

Model Type | Accuracy | Speed | Size | Portability
FP-32 bit, natively trained | Highest | Lowest | Largest | High
FP 16 | Low | Low | Small | Medium
Int 8, Int 16 | Lower | Fast | Smaller | Low
Dedicated compute unit (GPU, NPU) | Lower | Fastest | Small | Low
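As a hedged illustration of the quantization step mentioned above (generic TensorFlow Lite post-training quantization, not specific to eIQ; the model path and file names are placeholders):

    import tensorflow as tf

    # Post-training dynamic-range quantization: weights are stored as int8,
    # activations stay in float. The saved-model path is a placeholder.
    converter = tf.lite.TFLiteConverter.from_saved_model("my_detector_saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("detector_quantized.tflite", "wb") as f:
        f.write(tflite_model)

Dynamic-range quantization typically shrinks the weights by roughly a factor of four; the accuracy impact should always be re-measured on a validation set.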

Challenges On Edge Devices

In earlier sections we established some guidelines about object detection algorithms, as well as their pros and cons. Selecting the model that fits a given problem best is a challenging task when working on edge devices. Most of the time, the objective is to maximise accuracy while keeping the inference workload and latency low. The problem is that these are often conflicting requirements when compute resources are constrained or fixed, as shown in the diagram below.

 

Training detectors on custom datasets with eIQ

Usually, the selected model must be fine-tuned or retrained on a custom dataset, without any prior indication of how well it will generalize. The end goal is for the model to generalize as well as possible to real-world images after seeing the training samples. This is a challenge in itself, given the vast number of parameters the model must learn. In addition, a few external parameters play important roles in the training process; these are commonly known as hyperparameters and include the learning rate, batch size, number of epochs, weight decay, etc. Since training involves tuning millions of parameters, these hyperparameters must be chosen carefully, as they can make the difference between a successful training run and a failed one. Along with the hyperparameters, we also need good initialization values so that the model learns to generalize on the custom dataset as fast as possible. If the initialization values are not set properly, we can end up with undesirable outcomes such as overfitting or poor generalization.
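To make the fine-tuning setup concrete, here is a minimal, generic PyTorch/torchvision sketch (independent of eIQ) of re-heading a pretrained detector and choosing starting hyperparameter values; the class count and numbers shown are assumptions to be tuned for each dataset:

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    NUM_CLASSES = 3 + 1  # example: 3 custom classes plus background (assumption)

    # Good initialization: start from COCO-pretrained weights
    # (recent torchvision; older versions use pretrained=True instead)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

    # Typical starting hyperparameters; the right values depend on the dataset
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
    # ... the usual training loop over the custom dataset goes here ...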

Deploying trained detectors to embedded platforms

Before deploying any model on an embedded platform, it is important to have a well-defined and widely tested pipeline; otherwise, compatibility problems, memory errors and issues related to unsupported ops can arise. Additionally, a deep understanding of the model architecture, as well as of machine learning in general, may be required to modify models so that they can run on edge devices.

Unlike classification, object detection algorithms return raw predictions that need to be decoded by task-specific functions. In most cases these functions are not directly supported by embedded platform libraries. Quantization support is not always fully achievable for every model, which forces some models to run in mixed precision: some ops run in low-bit formats (uint8, int8) while the rest run in float16 or float32. Even when this is not a big problem, having the full model run in low-bit formats (uint8, int8) always improves timing and reduces memory and energy consumption.
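Continuing the TensorFlow Lite sketch from the previous section, forcing the whole model into low-bit integer ops typically requires a representative dataset for calibration; ops without an int8 kernel are exactly what push a model into the mixed-precision situation described above. The input shape and data generator below are placeholder assumptions:

    import numpy as np
    import tensorflow as tf

    def representative_data_gen():
        # Placeholder calibration data; in practice, yield a few hundred real,
        # preprocessed input images one at a time.
        for _ in range(100):
            yield [np.random.rand(1, 320, 320, 3).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_saved_model("my_detector_saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data_gen
    # Request full-integer kernels; conversion fails if an op has no int8 kernel,
    # which is when mixed precision becomes unavoidable.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    tflite_model = converter.convert()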

To mitigate the impact of these common problems encountered when performing object detection at the edge, eIQ Portal provides a set of tools that handle these issues for embedded developers. These tools are integrated into an easy-to-use GUI for quick and transparent execution, thereby reducing trial-and-error steps in the process. Complex tasks such as data annotation, data augmentation, training a model (classification or detection), model evaluation on target devices and model deployment are exposed as GUI components, so users do not have to deal with all the intricacies of training and deploying AI models on target hardware.

Conclusions

In this blog we introduced early DNN-based object detectors: R-CNN, Fast R-CNN and Faster R-CNN. More recent ones include YOLO, YOLOX, SSD, RetinaNet and CenterNet. We introduced object detection error types and the precision-recall curve, as well as details of the computation of mAP. We discussed the performance and speed comparisons reported in the literature, along with our own (@Au-Zone) inference performance on an in-house dataset on i.MX8MPlus hardware. Finally, we provided general guidelines and factors involved in choosing an object detector. We will follow up on this post with Part II, which will provide details on real-world performance for many common DNN-based detection models on NXP processors, including the i.MX8MPlus, i.MXRT1060 and i.MXRT1170.

Azhar Quddus, PhD.
Senior Computer Vision Engineer, Au-Zone Technologies

References

R-CNN
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation”, Tech report (v5), UC Berkeley, 2014
https://arxiv.org/pdf/1311.2524.pdf

Fast R-CNN
Ross Girshick, “Fast R-CNN”, Microsoft Research, 2015.
https://arxiv.org/pdf/1504.08083.pdf

Faster R-CNN
Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, Advances in Neural Information Processing Systems, 2015.
https://arxiv.org/pdf/1506.01497.pdf

YOLO-v1
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, “You Only Look Once: Unified, Real-Time Object Detection”, 2015.
https://arxiv.org/pdf/1506.02640.pdf

YOLO-v3
Joseph Redmon, Ali Farhadi, “YOLOv3: An Incremental Improvement”, 2018.
https://arxiv.org/pdf/1804.02767.pdf

YOLO-v4
Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection”, 2020.
https://arxiv.org/pdf/2004.10934.pdf

YOLOX
Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun, “YOLOX: Exceeding YOLO Series in 2021”, 2021.
https://arxiv.org/pdf/2107.08430.pdf

SSD
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg, “SSD: Single Shot MultiBox Detector“, 2015
https://arxiv.org/pdf/1512.02325.pdf

RetinaNet
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár, “Focal Loss for Dense Object Detection”, 2017.
https://arxiv.org/pdf/1708.02002.pdf

CenterNet
Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian, “CenterNet: Keypoint Triplets for Object Detection”, 2019.
https://arxiv.org/pdf/1904.08189.pdf

Accuracy
Object Detection Metrics With Worked Example
https://towardsdatascience.com/on-object-detection-metrics-with-worked-example-216f173ed31e

The post DNN-Based Object Detectors appeared first on Edge AI and Vision Alliance.

]]>
How We Cleaned Up PASCAL and Improved mAP By 13% https://www.edge-ai-vision.com/2022/08/how-we-cleaned-up-pascal-and-improved-map-by-13/ Mon, 22 Aug 2022 08:01:54 +0000 https://www.edge-ai-vision.com/?p=37683 This article was originally published at Hasty’s website. It is reprinted here with the permission of Hasty. We cleaned up all 17.120 images of the PASCAL VOC 2012 dataset in a week using Hasty’s AI-powered QC feature. We found that 6.5% of the images in PASCAL had different errors (missing labels, class label errors, etc.). …

How We Cleaned Up PASCAL and Improved mAP By 13% Read More +

The post How We Cleaned Up PASCAL and Improved mAP By 13% appeared first on Edge AI and Vision Alliance.

]]>
This article was originally published at Hasty’s website. It is reprinted here with the permission of Hasty.

We cleaned up all 17.120 images of the PASCAL VOC 2012 dataset in a week using Hasty’s AI-powered QC feature. We found that 6.5% of the images in PASCAL had different errors (missing labels, class label errors, etc.). So, we fixed them in record time and improved our model’s performance by 13% mAP. In this blog post, we dive deeper into how we did that and what the results were.

Background

More often than not, poor model performance can be traced back to the insufficient quality of the training data. Even in 2022, with data being one of the most vital assets for modern companies, developers often struggle with its poor quality. In Hasty, we want to simplify and derisk vision AI solutions development by making it much faster and more efficient to clean up data.

We have developed an AI Consensus Scoring (AI CS) feature, a part of the Hasty ecosystem that makes manual consensus scoring a thing of the past. It integrates AI into the Quality Control process, making it faster, cheaper, and scaling in performance the more data you add.

Our previous blog post showed that AI Consensus Scoring makes the QC process 35x cheaper than today’s approaches. Today we want to go even further and share a practical example of what you can achieve by paying considerable attention to Quality Control. In this post, we will use AI CS to improve, update, and upgrade one of the most popular Object Detection benchmark datasets, PASCAL VOC 2012.

If you are not familiar with PASCAL, you should know it is a well-known academic dataset used for benchmarking models on vision AI tasks such as Object Detection and Semantic Segmentation. Even though PASCAL is over a decade old, it is still frequently used; it appeared in 160 papers over the last four years. You can also check the results academics achieved on various versions of PASCAL.

The general task seems challenging. The dataset did not change over the past ten years, and research teams worldwide actively use it “AS-IS” for their studies. However, the dataset was annotated a long time ago when the algorithms were not as accurate as today, and the annotation requirements were not as strict. Therefore, the annotators missed some labels that should be in a modern dataset of PASCAL’s caliber.


There is no label for the horse despite the horse being in the foreground and visible. These quality issues are common in PASCAL.

Using a manual workforce to get through the dataset would be costly and incredibly time-consuming. By using AI to do the quality control and improve the quality of PASCAL instead, we wanted to test whether better data results in better models. To perform this test, we set up an experiment consisting of the following steps:

  1. Cleaning PASCAL VOC 2012 using AI Consensus Scoring on the Hasty platform;
  2. Training a custom model on the original PASCAL training set using the Faster R-CNN architecture;
  3. Preparing a custom model on the cleaned-up PASCAL training set using the same Faster R-CNN architecture and parameters;
  4. Drawing our conclusions.

Cleaning PASCAL VOC 2012

Our top priority was to improve the dataset. We got it from Kaggle, uploaded it to the Hasty platform, imported the annotations, and scheduled two AI CS runs. For those unfamiliar with our AI Consensus Scoring capabilities, the feature supports Class, Object Detection, and Instance Segmentation reviews, so it checks annotations’ class labels, bounding boxes, polygons, and masks. Doing the review, AI CS looks for extra or missing labels, artifacts, annotations with the wrong class, and bounding boxes or instances with imprecise shapes.

PASCAL VOC 2012 consists of 17.120 images and ~37.700 labels across 20 different classes. For our task, we ran the Object Detection and Class reviews, which highlighted 28.900 (OD) and 1.320 (Class) potential errors.


With our AI consensus scoring, you can use AI to find potential issues. Then, you can focus on fixing errors instead of spending days (or weeks) finding them first.

Our goal was to review these potential errors and resolve them while trying to be more accurate than the original annotators. In simple words, it means that:

  1. We tried to fix all possible issues on every image where our AI CS predicted at least one potential error;
  2. We did not go deep into the background and did not aim to annotate every possible object. If an object was missing a label but was in the foreground and/or visible to the human eye without zooming, we labeled it;
  3. We tried to make our bounding boxes pixel-perfect;
  4. We also annotated partials (unlabeled parts of the dataset’s class object) because the original dataset features them.

With the cleaning strategy outlined and the goal clear, we embarked upon the fascinating journey of improving PASCAL.

  • We started by reviewing the Class review run, which checked the class labels of existing annotations to find potential mistakes. More than 60% of the AI CS suggestions were of great use, as they helped identify issues in the original dataset that were not immediately apparent. As an example, the annotators used the sofa and chair classes interchangeably. We fixed that by relabelling more than 500 labels across the dataset;


An example of the original annotations. There are two sofas and two armchairs. One of the two armchairs is labeled a sofa, whereas the other is annotated as a chair. Something weird is going on – it needs to be fixed.


This is how we fixed the issue. The armchair is a chair, and the sofa is a sofa.

  • When analyzing the OD and Class reviews, we found that PASCAL’s most prominent issue is not misclassified annotations, weird bounding boxes, or extra labels. The most significant problem is the absence of many annotations. It is hard to estimate the exact number, but we believe there are thousands of unlabeled objects that should have been labeled;
  • The OD review went through the dataset looking for extra or missing labels and bounding boxes of the wrong shape. Not all of the absent annotations were highlighted by AI CS, but we did our best to improve every picture that had at least one missing label predicted by AI CS. As a result, the OD review helped us find 6.600 missing annotations across 1.140 images;

An example of how we reviewed images and resolved errors

  • We spent roughly 80 person-hours reviewing all the suggestions and cleaning up the dataset. It is a fantastic result. Many R&D and Data Science teams that use the conventional approach of manual QC would have killed for an opportunity to review 17.120 images in that timeframe.

Excellent, we did the most challenging task with some significant improvement stats, so it was time to chill and enjoy training some neural networks to prove that you can get better models with better data.

The custom model trained on the original PASCAL

As mentioned above, we decided to set up two experiments – train two models – one on the initial PASCAL and the other on the cleaned version of PASCAL. To do the NNs training, we used another Hasty feature known as Model Playground, a no-code solution allowing you to build AI models in a simplified UI while keeping control over architecture and every crucial NN parameter.

Before jumping to training, we looked for articles featuring PASCAL to better understand the mean Average Precision values achievable on this dataset. We settled on the Faster R-CNN paper, as we have Faster R-CNN implemented in Model Playground. In that paper, with the help of the Faster R-CNN architecture, researchers achieved ~45-55 COCO mAP (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05) on PASCAL VOC 2012. We will not compete with them or draw a direct comparison between the models; we will simply keep 45-55 COCO mAP in mind as the metric value we want to achieve with our solution. We used the PASCAL VOC 2012 train set for training and the PASCAL VOC 2012 validation set for validation.

We went through several iterations of the model throughout the work, trying to find the best hyperparameters for the task. In the end, we opted for:

  • Faster R-CNN architecture with ResNet101 FPN as a backbone;
  • R101-FPN COCO weights were used for model initialization;
  • Blur, Horizontal Flip, Random Crop, Rotate, and Color Jitter as augmentations;
  • AdamW was the solver, and ReduceLROnPlateau was the scheduler (a minimal PyTorch sketch of this pairing follows the list);
  • Just like in every other OD task, a combination of losses was used (RPN Bounding Box loss, RPN Classification loss, final Bounding Box regression loss, and final Classification loss);
  • As a metric, we had COCO mAP. Fortunately, it is directly implemented in Model Playground.
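For readers who want to reproduce a similar setup outside Model Playground, the solver/scheduler pairing from the list above looks roughly like this in plain PyTorch; the model, learning rate and patience below are stand-ins, not the exact values used in this experiment:

    import torch

    model = torch.nn.Linear(10, 2)  # stand-in; swap in the actual Faster R-CNN model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=5)

    for epoch in range(10):
        # ... run one training epoch of the detector here ...
        val_map = 0.40 + 0.005 * epoch      # stand-in for the measured COCO mAP
        scheduler.step(val_map)             # reduce LR when validation mAP plateaus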

It took the model about a day and a half to train. Considering the depth of the architecture, the number of images the network was processing, the number of scheduled training iterations (10.000), and the fact that COCO mAP was calculated every 50 iterations across 5.000 pictures, that is not too long. Here are the results the model achieved.


AverageLoss graph across the training iterations for the original model.


COCO mAP graph across the validation iterations for the original model.

The final COCO mAP result we achieved with this architecture was 0.42 mAP on validation. So the model trained on the original PASCAL does not perform as well as the state-of-the-art architectures. Still, it is not a bad result given the small amount of time and effort we spent building the model (we went through 3 iterations with only one person-hour spent on each). In any case, such a value makes our experiment more interesting: let’s see if we can reach the desired metric value by improving the data without tuning the model’s parameters.

The custom model trained on the updated PASCAL

To train the following model, we used the same training and validation images as for the baseline. The only difference was that the data in the split was better (more labels added and some labels fixed).

Unfortunately, the original train/test split does not feature all of the 17.120 images; some pictures are left out. So, despite adding 6.600 labels to the dataset as a whole, the train/test split gained only ~3.000 new labels and ~190 fixed ones.

Nevertheless, we proceeded and used the improved train/test split of the PASCAL VOC 2012 to train and validate the model.


AverageLoss graph across the training iterations for the updated model.


COCO mAP graph across the validation iterations for the updated model.


Head to head comparison

As you can see, the new model performs better than the original one. It achieved 0.49 COCO mAP on validation compared to the 0.42 value of the previous model. At this point, it is evident that our experiment was successful.

Fantastic, the result is within 45-55 COCO mAP, which means the updated model works better than the original one and delivers the desired metric value. It is time to draw some conclusions and discuss what we have just witnessed.

Conclusions

In this post we have shown you the concept of data-centric AI development. The idea is to improve the data to get a better model, which is precisely what we achieved. Nowadays, when you start hitting the upper ceiling of model performance, it can be challenging and expensive to improve results beyond 1-2% on your key metric by tweaking the model. However, you should never forget that your success does not rest only on your model’s shoulders. There are two crucial components – the algorithm and the data.

An AI solution is only as good as the data it was trained on. At Hasty, we take this to heart, believe it is the core reason for success, and aim to assist by giving you all the necessary capabilities to produce better data. Sure, you should not give up on improving your model, but it might be worth taking a step back to see if your data is any good.

In this post, we did not try to beat any SOTA results or outperform previous researchers. We wanted to show that some time spent improving your data benefits your model’s performance. And we hope that our case, which gave us a 13% COCO mAP increase by adding 3.000 missing labels, was convincing enough to encourage you to find and fix issues in your own data.

The results you can get by cleaning your data and adding more labels are hard to predict. They depend greatly on your task, NN parameters, and many other factors. Even in our case, we cannot be sure that 3.000 more labels would give us another 13% mAP increase (probably not). Still, the results speak for themselves. Even though it is sometimes hard to determine the upper ceiling for improving model metrics through better data, you should give it a shot if you get stuck with an unsatisfactory metric value. Data Science is not only about tuning NN parameters. As the name suggests, it is also about the data, and you should keep that in mind.

How Hasty can help (aka a shameless plug)

At Hasty, we’re experts in building vision AI in an economically viable way. With our unique methodology enabled by our platform, we helped 100s of companies to:

  • deploy vision AI solutions in weeks,
  • stay on budget,
  • and move the needle for their business.

If you are looking to replicate results like the ones you’ve seen in this post, please book a demo.

If you are interested in the cleaned data asset, you can download the clean version of PASCAL VOC 2012 using the following links:

For the raw data, you can go to Kaggle directly.

Vladimir Lyashenko
Content Manager, Hasty

The post How We Cleaned Up PASCAL and Improved mAP By 13% appeared first on Edge AI and Vision Alliance.

]]>
How to Build a Custom Embedded Stereo System for Depth Perception https://www.edge-ai-vision.com/2022/08/how-to-build-a-custom-embedded-stereo-system-for-depth-perception/ Wed, 10 Aug 2022 22:42:53 +0000 https://www.edge-ai-vision.com/?p=37549 This article was originally published at Teledyne FLIR’s website. It is reprinted here with the permission of Teledyne FLIR. There are various 3D sensor options for developing depth perception systems including, stereo vision with cameras, lidar, and time-of-flight sensors.  Each option has its strengths and weaknesses.  A stereo system is typically low cost, rugged enough …

How to Build a Custom Embedded Stereo System for Depth Perception Read More +

The post How to Build a Custom Embedded Stereo System for Depth Perception appeared first on Edge AI and Vision Alliance.

]]>
This article was originally published at Teledyne FLIR’s website. It is reprinted here with the permission of Teledyne FLIR.

There are various 3D sensor options for developing depth perception systems, including stereo vision with cameras, lidar, and time-of-flight sensors. Each option has its strengths and weaknesses. A stereo system is typically low cost, rugged enough for outdoor use, and can provide a high-resolution color point cloud.

There are various off-the-shelf stereo systems available on the market today. Depending on factors such as accuracy, baseline, field-of-view, and resolution, there are times when system engineers need to build a custom system to address specific application requirements.

In this article, we first describe the main parts of a stereo vision system and then provide instructions for making your own custom stereo camera using off-the-shelf hardware components and open-source software. As this setup is focused on being embedded, it computes a depth map of any scene in real time, without the need for a host computer. In a separate article, we discuss how to build a custom stereo system to use with a host computer when space is less of a constraint.

Stereo Vision Overview

Stereo vision is the extraction of 3D information from digital images by comparing the information about a scene from two viewpoints. The relative positions of an object in the two image planes provide information about the depth of the object from the camera.

An overview of a stereo vision system is shown in Figure 1 and consists of the following key steps:

  1. Calibration: Camera calibration refers to both the intrinsic and extrinsic calibration. The intrinsic calibration determines the image center, focal length, and distortion parameters, while the extrinsic calibration determines the 3D positions of the cameras. This is a crucial step in many computer vision applications, especially when metric information about the scene, such as depth, is required. We discuss the calibration step in detail in the Calibration section below.
  2. Rectification: Stereo rectification refers to the process of reprojecting image planes onto a common plane parallel to the line between camera centers. After rectification, corresponding points lie on the same row, which greatly reduces cost and ambiguity of matching. This step is done in the code provided to build your own system.
  3. Stereo matching: This refers to the process of matching pixels between the left and right images, which generates disparity images. The Semi-Global Matching (SGM) algorithm is used in the code provided to build your own system (a minimal OpenCV sketch of steps 2-4 follows Figure 1).
  4. Triangulation: Triangulation refers to the process of determining a point in 3D space given its projection onto the two images. The disparity image will be converted to a 3D point cloud.


Figure 1: Overview of a stereo vision system
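As a rough Python/OpenCV illustration of steps 2 to 4 (the project code provided with this article is in C++; the calibration values and image files below are placeholders, not actual calibration results):

    import cv2
    import numpy as np

    # Placeholder calibration parameters; in practice these come from calibration
    K1 = K2 = np.array([[1700.0, 0, 720], [0, 1700.0, 540], [0, 0, 1]])
    D1 = D2 = np.zeros(5)
    R = np.eye(3)                       # rotation between the two cameras
    T = np.array([[120.0], [0], [0]])   # ~12 cm baseline, in mm
    image_size = (1440, 1080)

    # Step 2: rectification
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)
    m1x, m1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
    m2x, m2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)
    left = cv2.remap(cv2.imread("left.png", 0), m1x, m1y, cv2.INTER_LINEAR)
    right = cv2.remap(cv2.imread("right.png", 0), m2x, m2y, cv2.INTER_LINEAR)

    # Step 3: semi-global (block) matching produces a disparity image
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

    # Step 4: triangulation reprojects disparity into a 3D point cloud
    points_3d = cv2.reprojectImageTo3D(disparity, Q)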

Design Example

Let’s go through a stereo system design example. Here are the requirements for a mobile robot application in a dynamic environment with fast-moving objects: the scene of interest is 2 m in size, the distance from the cameras to the scene is 3 m, and the desired accuracy is 1 cm at 3 m.

You can refer to this article for more details on stereo accuracy. The depth error is given by ΔZ = Z² / (B · f) · Δd, which depends on the following factors:

  • Z is the range
  • B is the baseline
  • f is the focal length in pixels, which is related to the camera field-of-view and image resolution
  • Δd is the disparity (matching) error in pixels

There are various design options that can fulfill these requirements. Based on the scene size and distance requirements above, we can determine the focal length of the lens for a specific sensor. Together with the baseline, we can use the formula above to calculate the expected depth error at 3 m, to verify that it meets the accuracy requirement.
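Before committing to hardware, it helps to plug candidate designs into this formula. A minimal sketch (the field of view, resolution and assumed matching error below are illustrative assumptions, not the exact design values used here):

    import math

    def focal_length_pixels(image_width_px, hfov_deg):
        # f in pixels from the horizontal field of view and image width
        return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    def depth_error_m(z_m, baseline_m, f_px, disparity_error_px=0.25):
        # dZ = Z^2 / (B * f) * dd, with dd the disparity (matching) error
        return (z_m ** 2) / (baseline_m * f_px) * disparity_error_px

    # Illustrative check: 1440-pixel-wide image, ~30 degree HFOV, 12 cm baseline
    f = focal_length_pixels(1440, 30.0)
    print(round(f), "px focal length")
    print(round(depth_error_m(3.0, 0.12, f) * 1000, 1), "mm expected error at 3 m")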

Two options are shown in Figure 2: using lower-resolution cameras with a longer baseline, or higher-resolution cameras with a shorter baseline. The first option uses a larger camera assembly but has a lower computational load, while the second option is more compact but has a higher computational load. For this application, we chose the second option, as a compact size is more desirable for the mobile robot, and we can use the Quartet Embedded Solution for TX2, which has a powerful GPU onboard to handle the processing needs.


Figure 2: Stereo system design options for an example application

Hardware Requirements

For this example, we mount two Blackfly S board-level 1.6 MP cameras, using the Sony Pregius IMX273 global shutter sensor, on a 3D-printed bar with a 12 cm baseline. Both cameras have similar 6 mm S-mount lenses. The cameras connect to the Quartet Embedded Solution for TX2 customized carrier board using two FPC cables. To synchronize the left and right cameras so that they capture images at the same time, a sync cable is made that connects the two cameras. Figure 3 shows the front and back views of our custom embedded stereo system.


Figure 3: Front and back views of our custom embedded stereo system

The following table lists all the hardware components:

Part | Description | Quantity | Link
ACC-01-6005 | Quartet Carrier with TX2 Module 8GB | 1 | https://www.flir.com/products/quartet-embedded-solution-for-tx2/
BFS-U3-16S2C-BD2 | 1.6 MP, 226 FPS, Sony IMX273, Color | 2 | https://www.flir.com/products/blackfly-s-board-level
ACC-01-5009 | S-Mount & IR filter for BFS color board level cameras | 2 | https://www.flir.com/products/s_mount-front
BW3M60B-1000 | 6 mm S-Mount Lens | | http://www.boowon.co.kr/site/down.asp?fileName=BW3M60B-1000.pdf
ACC-01-2401 | 15 cm FPC cable for board level Blackfly S | 2 | https://www.flir.com/products/15-cm-fpc-cable-for-board-level-blackfly-s/
XHG302 | NVIDIA® Jetson™ TX2/TX2 4GB/TX2i Active Heat Sink | 1 | https://connecttech.com/product/nvidia-jetson-tx2-tx1-active-heat-sink/
 | Synchronization cable (make your own) | 1 | https://www.flir.com/support-center/iis/machine-vision/application-note/configuring-synchronized-capture-with-multiple-cameras/
 | Mounting bar (make your own) | 1 |

Both lenses should be adjusted to focus the cameras on the range of distances your application requires.  Tighten the screw (circled in red in Figure 4) on each lens to keep the focus.


Figure 4: Side view of our stereo system showing the lens screw

Software Requirements

a. Spinnaker

Teledyne FLIR Spinnaker SDK comes pre-installed on your Quartet Embedded Solutions for TX2. Spinnaker is required to communicate with the cameras.

b. OpenCV 4.5.2 with CUDA support

OpenCV version 4.5.1 or newer is required for SGM, the stereo matching algorithm we are using. Download the zip file containing the code for this article and unzip it to StereoDepth folder. The script to install OpenCV is OpenCVInstaller.sh. Type the following commands in a terminal:

    cd ~/StereoDepth
    chmod +x OpenCVInstaller.sh
    ./OpenCVInstaller.sh

The installer will ask you to input your admin password and will then start installing OpenCV 4.5.2. It may take a couple of hours to download and build OpenCV.

Calibration

The code to grab stereo images and calibrate them can be found in the “Calibration” folder. Use the SpinView GUI to identify the serial numbers for the left and right cameras. For our settings, the right camera is the master and left camera is the slave. Copy the master and slave camera serial numbers to file grabStereoImages.cpp lines 60 and 61. Build the executable using the following commands in a terminal:

    cd ~/StereoDepth/Calibration
    mkdir build
    mkdir -p images/{left,right}
    cd build
    cmake ..
    make

Print out the checkerboard pattern from this link and attach it to a flat surface to use as the calibration target. For best results while calibrating, set Exposure Auto to Off in SpinView and adjust the exposure so that the checkerboard pattern is clear and the white squares are not overexposed, as shown in Figure 5. After the calibration images are collected, gain and exposure can be set back to auto in SpinView.


Figure 5: SpinView GUI settings

To start collecting the images, type

    ./grabStereoImages

The code should start collecting images at about 1 frame per second. Left images are stored in the images/left folder and right images in the images/right folder. Move the target around so that it appears in every corner of the image. You may rotate the target and take images from close by and from further away. By default, the program captures 100 image pairs, but this can be changed with a command line argument:

    ./grabStereoImages 20

This will collect only 20 pairs of images. Please note this will overwrite any images previously written in the folders. Some sample calibration images are shown in Figure 6.


Figure 6: Sample calibration images

After collecting the images, run the calibration Python code by typing:

    cd ~/StereoDepth/Calibration
    python cameraCalibration.py

This will generate two files called “intrinsics.yml” and “extrinsics.yml”, which contain the intrinsic and extrinsic parameters of the stereo system. The code assumes a 30 mm checkerboard square size by default, but this can be edited if needed. At the end of the calibration, it displays the RMS reprojection error, which indicates how good the calibration is. A typical RMS error for a good calibration should be below 0.5 pixel.

Real-time Depth Map

The code to calculate disparity in real time is in the “Depth” folder. Copy the serial numbers of the cameras into live_disparity.cpp at lines 230 and 231. Build the executable using the following commands in a terminal:

    cd ~/StereoDepth/Depth
    mkdir build
    cd build
    cmake ..
    make

Copy the “intrinsics.yml” and the “extrinsics.yml” files obtained in the calibration step to this folder. To run the real-time depth map demo, type

    ./live_disparity

It displays the left camera image (the raw, unrectified image) and the depth map (our final output). Some example outputs are shown in Figure 7. The distance from the camera is color-coded according to the legend on the right of the depth map. A black region in the depth map means no disparity data was found in that region. Thanks to the NVIDIA Jetson TX2 GPU, it can run at up to 5 frames per second at a resolution of 1440 × 1080 and up to 13 frames per second at a resolution of 720 × 540.

To see the depth at a particular point, click on that point in the depth map and the depth will be displayed, as shown in the last example in Figure 7.


Figure 7: Sample left camera images and corresponding depth map. The bottom depth map also shows the depth at a particular point.

Summary

Using stereo vision to develop depth perception has the advantages of working well outdoors, providing a high-resolution depth map, and being very accessible thanks to low-cost off-the-shelf components. Depending on the requirements, there are a number of off-the-shelf stereo systems on the market. Should it be necessary for you to develop a custom embedded stereo system, it is a relatively straightforward task with the instructions provided here.

The post How to Build a Custom Embedded Stereo System for Depth Perception appeared first on Edge AI and Vision Alliance.

]]>
Buy or Build ML Solutions https://www.edge-ai-vision.com/2022/07/buy-or-build-ml-solutions/ Fri, 29 Jul 2022 08:01:45 +0000 https://www.edge-ai-vision.com/?p=37442 This article was originally published at Hasty’s website. It is reprinted here with the permission of Hasty. This article will look at one of the most complex decisions for most organizations starting new AI projects. Should they buy or build the software they’ll use to develop their AI models? We’ll present arguments for both sides …

Buy or Build ML Solutions Read More +

The post Buy or Build ML Solutions appeared first on Edge AI and Vision Alliance.

]]>
This article was originally published at Hasty’s website. It is reprinted here with the permission of Hasty.

This article will look at one of the most complex decisions for most organizations starting new AI projects. Should they buy or build the software they’ll use to develop their AI models? We’ll present arguments for both sides and share some of our experiences from discussions we had in the past.

As this article is very long, let’s start with the summary.

TL;DR

Summary

Our take (of course with an obvious bias) is that it doesn’t make sense for most organizations to build software across the AI lifecycle. What’s available on the market is getting so good that the chance of you building something better is low. After all, we are dealing with a space that has had billions of euros and dollars invested in it in the last five years.

Of course, there are exceptions to this. As we outlined, commercial software has its disadvantages. But, considering the downside of building something yourself, we strongly recommend that you evaluate the market before making something yourself.

Build AI software if…

  • If what you need can’t be found on the market
  • If getting AI solutions to market fast is not the top priority
  • Building solutions takes time, leading to a high time-to-value for the organization
  • You have the budget to do so (you need at least a couple of million dollars or euros to build commercial grade software)
  • You have the in-house capacity to maintain the software after release
  • You need full control of the development process
  • Another important aspect here is having in-house experience when it comes to building software and having the significant budget required.

Buy AI software if…

  • Getting started on any and all AI projects are of the highest importance to you
  • If cost is important for you – the difference being thousands compared with millions of euros or dollars
  • If you want the flexibility and full functionality developed over years with previous customers and users
  • You want to avoid spending years maintaining software
  • Simply, you want something that works out of the box
  • You will also create more focus internally on what matters most – creating better and better models.

With the summary done, let’s explain how we got to these conclusions:

A background on how the AI software market has changed

The push to get AI into production environments started around a decade ago. At that point, early adopters began to look at using AI for very particular purposes – mainly autonomous driving. Then, there wasn’t a lot of AI software on the market. If you wanted to work with AI, you had to build the tooling yourself.

This was the case for most of the last decade. However, at the end of the decade, many software and tooling became available on the open market. First, we saw many open source projects created to solve particular problems in the AI development process. Then, we also started seeing commercial players like ourselves entering the space.

Today, there are many different options available for you to use if you want to buy software. However, the core question remains – should you purchase something off-the-shelf or build it yourself?

Understanding your needs

To answer that question, first, you need to understand your needs and how they will evolve. A good exercise can be to answer the following questions:

1. How core is AI development to our organization?

This is an essential question for organizations as it is probably the most important deciding if you want to build or buy a solution. Essentially, you need to figure out if AI is at the heart of your organization’s future. To give some examples:

  • For an autonomous driving company, AI is undoubtedly at the core of the business. It’s the main component of the product.
  • If you are an agricultural organization with an already existing business that wants to get into precision agriculture, AI will, in all likelihood, not be a core competency for you. Although it can have a tremendous impact on your business, your business is not AI. You use the technology to improve what you already have.

Understanding this is important as it is one of the main factors deciding if you build or buy.

2. What approach do we want to take to AI?

There are many different approaches to AI. In vision AI, for example, you can do anything from image classification to instance segmentation.

Deciding on your approach to AI is one of the most important decisions you will make, as it will lock you in for the foreseeable future in that approach.

For more information on picking the right approach, you can read a previous article we did on the subject here.

3. What are our requirements for the software?

Different organizations have different needs. Mapping out what you need in terms of both core functionality (i.e., we should be able to train an ML model and compare different models with each other) and supplementary functionality (it would help us if we could delegate work in the software itself) will give you a shopping list of needs that can then be used to source software.

Internally, when we source software, we use the MoSCoW system to map out which features we must have, which we should have, and which would be a bonus but not crucial.

4. What is our budget?

Another crucial requirement that you need to understand is your budget. Do you have millions to spend or a couple of thousand? Depending on your answer, you will know if it’s even feasible to build and get an upper limit for how much you can spend if you decide to buy.

How answering these questions help you in making a decision

With these needs and requirements mapped out, you now have the initial data you need to decide on the first steps on how to proceed.

From here, generally, our advice would always be to first look at what’s available on the market. The reason for that is two-fold. First, there’s so much available AI software today that chances are someone will provide what you need. Secondly, the cost of building AI software can be pretty high.

Looking at ourselves and our competitors as examples, the cost of developing commercial-grade AI software ranges from five to hundreds of millions. So any build decision will likely be a massive investment. We’ve seen that building a complete AI software solution is out of scope for everyone but the most prominent players.

There’s still room to build software for specific parts of your AI pipeline. Here, understanding how AI is core to your business is essential. For example, an AI consultancy might say that their core competency is developing state-of-the-art models for their clients. For them, it might make sense to build software for that specific part of the AI development process as it could give them a competitive advantage while buying solutions for other tasks such as data annotation.

Another example is a medical company where data privacy and security are critical. Although they might decide to buy most tooling, building their data lake and integrating their chosen software with their data lake could make sense. That way, they control what’s essential for them.

With all that said, let’s look at some advantages and disadvantages of the different choices.

The advantages of building it yourself

Control

It almost goes without saying, but building it yourself gives you complete control. You can create what you and your organization want without compromises, and any new needs will automatically be the top priority for development.

Particular needs on core functionality

Closely linked to control are particular needs. Today, there is software for all of the common approaches to AI. But you might be in a field with specific requirements that commercially available tooling doesn’t solve. For example, the audio ML space is underserved in terms of available software, so you might have to build it yourself if you have a complex use case in that space.

Building the correct supplementary functionality

Another common reason to build your own software is that you need particular supplementary functionality. For example, let’s say you have very exact needs for monitoring a model at the edge, where you need to use state-of-the-art encryption for any communication between your software and your model. The chances are that most ML monitoring software doesn’t support the encryption you need yet, so you would have to build it yourself.

Complex integration requirements

Another common reason (and advantage) for developing software in-house is that you have precise requirements in terms of integrations. Maybe you’ve spent a lot of time and money on building a data lake with complex data (let’s say 3D video data, for the sake of argument). Also, you have vast amounts of essential metadata structured in complex ontologies. Now you want any software you use to integrate with that data lake. Here, you might have trouble finding a commercial solution, as you are working with a data type that’s relatively new to the field, and you require the solution to handle complex ontologies. The chances of finding both in one commercially available software package are low.

Creating a competitive advantage

There’s also an argument to be made that you can create a competitive advantage by building the right solution in-house. The focus here lies on it being the right solution as you don’t want to spend millions of dollars or euros making something with a similar feature set that competitors can just buy off the shelf.

To give an example: If you decided to build your annotation tool in 2014, that would have made a lot of sense. At the time, the tooling available on the market wasn’t very mature, and the competitive advantage you could gain compared to others without a massive investment was quite huge.

Today that is no longer the case for most standard annotation methods. So sinking a couple of million into creating an annotation tool – with further maintenance costs to come – is probably not the smartest choice.

However, there are always new frontiers in AI where innovative or very specialized companies can gain an advantage by developing in-house.

Data governance

Another factor in building things yourself is data governance. Essentially, here we are talking about who owns the data. For many organizations, data privacy and security are essential, and it’s becoming increasingly important with regulations like GDPR coming into force around the world. However, the commercial market is adapting to these regulations too. Hasty, as an example, is a European company. As we are directly affected by GDPR, we have done a lot of work making sure we are compliant. We also offer users solutions for keeping data in their environments while still using our solution. We are not alone in this. Most commercial companies understand that this is a growing need, so it’s not as much of a reason to build as it used to be.

Disadvantages of building it yourself

Time-to-value

For most organizations, the biggest issue with building something yourself is time-to-value. Essentially, if you decide to make it, you can expect at least a couple of months of initial development time, followed by a month or two of testing and bug fixing before you have something useable. If what you require can be bought instead, you have spent months building something you could have purchased and started using in a day instead.

Opportunity cost

Another disadvantage is the opportunity cost – i.e., what could the team building your new solution have done instead?

For most organizations we talk to, the goal is to build the best possible AI model to solve a specific problem. You do this by creating great data and spending a lot of time and effort training and benchmarking different models. Adding software development to that equation means less time and budget for creating that model.

Here, you must know the answer to what parts of the AI lifecycle are at the absolute core of your business. If what you are building isn’t part of that core, chances are you are wasting resources that would give you more of a return elsewhere.

Financial cost

We’ve already talked about it, but building software yourself can be very expensive. Before you start building, you better be sure that you need to do so. The cost for building solutions of the same quality and feature sets as you can find on the market is high and goes into millions of euros or dollars.

Maintenance

Often overlooked and underestimated: when building software, you also have to consider the maintenance that goes with it. From the discussions we have had with teams, you usually start with a smaller scope for what your software is supposed to do. Over time, as your needs change, that scope starts creeping, new features are required, and you have to develop a new version of your software. This, combined with everyday maintenance and bug fixing, quickly builds up and leads to high costs in both time and money.

Maintenance is also a considerable bottleneck if what you build is highly integrated into other parts of your software stack. If that’s the case, it means you have to maintain your software and the integrations to other software. If something else changes somewhere else, you have to react to that.

Risk

Additionally, another important aspect here is taking on all of the risk. The project could go over budget or past its deadline. Any data leaks that happen are your sole responsibility, and so on. As you are likely building your solution from scratch, it’s not a question of whether something goes wrong (from my own experience, it does) but of how critical a problem it becomes.

The advantages of buying AI software

Speed

Of course, the main reason for buying anything is speed. You pay, and then you can start working. Compared with spending months building software yourself, you can get started right away. As most teams we talk to have goals they need to achieve in the near future, they often cite this as the main reason for buying solutions.

Cost

The second most crucial aspect when teams choose to buy solutions is cost. Compared with the budget needed to build something similar, most software pricing looks very fair. This makes sense, as you share the development costs with all of the vendor’s other customers. So instead of spending millions, you are spending thousands (or even hundreds) of euros or dollars.

There is an asterisk here. Pricing for AI software varies greatly depending on the provider. For similar solutions, you can see a difference in the price of 10x or even more, depending on the provider. AI software is a new field, and there’s not yet any standardization in pricing. More prominent, better-known providers can ask for higher prices while newer competitors often ask for less.

Another aspect to take into account is pricing transparency. Some solutions require you to talk to sales before giving you a cost proposal. This makes it tricky for customers to choose a software until they’ve gotten a couple of proposals in. If you can excuse us for a second, this is something we are against here in Hasty. We think transparency in pricing is essential for potential customers to understand the cost when evaluating us.

Flexibility

A commercial software provider will also have built a lot of functionality over years of operating. Although it does not always seem relevant at first, there’s a reason that they built that functionality in the first place. Using Hasty as an example, we often find that teams that are new to the space and starting to use our product often focus on annotation and model building. Over time, for many users, it becomes clear that they need better quality control of their data. Luckily for them, that’s something we already built.

Another example is changing AI approaches. If you mid-stream decide to switch from doing object detection to instance segmentation, for example, you can do so without having to rebuild your own software from scratch.

In general, this means that the software you buy can almost always grow with you, and that you don’t have to wait for your internal IT team to ship features.

Focus

The most significant positive effect of buying tooling instead of building it is the focus you gain. Instead of having an AI team with multiple functions – both building models and software – they can wholeheartedly create the best possible model.

Integrations

Similar to functionality, many commercial software providers have already built standard integrations. This means less work for you when you integrate the solution with your AI pipeline.

No maintenance

Of course, another advantage of buying a solution is that you have maintenance included in the price. Except for giving you peace of mind, this also gives you greater cost certainty for the lifetime of your project(s) as you don’t have to worry about maintaining software.

Experience

Another overlooked advantage of buying a solution is that the provider often has experience that is valuable for your project. Using Hasty as an example, we often find ourselves helping new teams navigate the pitfalls and roadblocks that most companies experience when starting new AI projects. In that way, a good provider will not only give you a great product but will also act as a sounding board for you.

Disadvantages of buying AI software

Control

Of course, if you buy something, you can’t control it. That means you are dependent on another organization and their customer support for any needs you have. Here, it can be a good idea to check the customer service record of any provider you are evaluating. Are user reviews positive? Do they have any guarantees regarding time to answer or time for fixing any issues?

If you are giving up control, it will be vital to understand who you are giving up control to and ensure that they are reliable.

Integrations

Hey, we already had this as an advantage! Surely this must be a typo. Unfortunately not. Although most providers have built integrations that plug into user environments, there is a lack of existing integrations between different providers today. For you, that means that if you are planning on using more than one solution, you might have to build integrations between them yourself.

Lack of edge case feature coverage

Commercial tool providers develop with a market in mind. For us, it makes sense to go for the most significant market and build features for the most common use cases. This works well for 90% of cases, but it can be tricky to find a provider if you are in a niche field or working on rare use cases.

Foresight

Another common problem customers have with software providers is a lack of insight into what will come next, or when they can expect the particular feature they need. Usually, companies take one of two approaches to this. The first approach is to publicly share what they are working on next, so that you as a user know what is coming. The second approach is more personal, with providers having a dialogue with you and keeping you up to date.

However, you will never have the same foresight into what’s going on as if you built something yourself.

Customization

Finally, you have the problem of customization. When using commercially available software, you often find that the software would be more beneficial to you with some tweaks. It can be something minor, like being able to link directly from the software to your internal documentation, or something more extensive like changing the logic of the interface in a way that better fits your workflows.

Although commercial providers often listen to your customization needs, from their perspective, what you propose needs to be relevant to the rest of their user base if they are going to build it. Therefore, your ability to customize commercial software is limited.

A shameless plug

For those of you that are looking for an ML platform – look no further! Hasty is a vision AI platform that helps you throughout the ML lifecycle. To date, we can help you with:

  • Automate up to 90% of all annotation work
  • Make quality control 35x faster
  • Train models directly on your data using our low-code model builder
  • Take any custom models trained in Hasty and deploy them back to the annotation environment in one click
  • Export any models you create in commonly used formats
  • Or host any model in our cloud
  • Monitor inferences made in production
  • Most importantly, we offer all this through an API for easy integration.

In short, we take care of a lot of the MLOps so you don’t have to. Book a demo if you want to know more.

Appendix A: A single end-to-end ML software solution Vs. multiple expert ones

If you have chosen to buy an ML solution, you will face another dilemma. As you might know, there are end-to-end ML solutions that support you through the whole ML lifecycle and smaller tools covering different aspects of the process. Let’s say that the first option can be categorized as platforms that aim to handle all your needs, and the second option can be called expert tools that fulfill one function you need. In the context of an AI project, this leaves you with a choice. Do you use an end-to-end solution or use multiple specialist tools?

On the one hand, using an end-to-end solution is an easy path. If you make the right choice, the tool will have all the capabilities you need to work on your Machine Learning project, and you will have no friction between different stages of development. However, an end-to-end solution covers all the stages of the ML lifecycle without specializing in any of them. Therefore, there might be tools with greater functionality for a particular stage.

On the other hand, expert tools are specialized and offer advanced capabilities in specific fields. However, having multiple specialist tools means you will have to build integrations between them – today, there’s very little in terms of pre-built integrations for you to use. Unfortunately, this process is rather opaque and time-consuming, since you need to figure out whether the tools you have chosen even work together and how to make them work.

If you succeed in that, you will also spend some time setting up your working pipeline and solving the source-of-truth problem (you will potentially need yet another piece of software at this point). Still, your working pipeline will remain fragile, since any minor change in a tool’s API can strongly affect it, resulting in a pipeline adaptation task. Thus, using many expert tools comes with maintenance costs.

To summarize, both options have advantages and disadvantages. The primary question you need to answer is whether you want the ease of use of a single solution, or need the expert functionality of multiple tools despite the complexity of combining them.

Appendix B: Outsourcing AI software development

If you decide to build your software, you might be interested in outsourcing the project to a software development provider. This can seem like a viable solution, but it comes with its own pitfalls.

First, most software development services have little to no experience building software for AI use cases. Because these projects often require an understanding of how AI works and expertise in integrating AI into the product itself, it can be difficult to find a provider with the expertise needed to build your solution.

Secondly, when outsourcing software development, you are giving up much of the control that probably made you choose to build the software yourself in the first place.

Finally, if you decide to go down this road, ensure that you have a clear hand-over plan as you, in all likelihood, don’t want to pay a software developer in perpetuity.

Vladimir Lyashenko
Content Manager, Hasty

Transformers in Computer Vision https://www.edge-ai-vision.com/2022/05/transformers-in-computer-vision/ Mon, 09 May 2022 08:01:27 +0000

This technical article was originally published at Axelera AI’s website. It is reprinted here with the permission of Axelera AI.

Convolutional Neural Networks (CNNs) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViTs), which have a higher learning capacity. The fastest ViTs are essentially CNN/Transformer hybrids that combine the best of both worlds: (A) CNN-inspired hierarchical, pyramidal feature maps, in which embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance is not yet established. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs on ImageNet validation, especially when (1) trained from scratch without distillation, (2) in the lower-accuracy (<80%) regime, and (3) at the lower network complexities targeted at Edge devices.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been the dominant Neural Network architectures in Computer Vision for almost a decade, following the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] or RegNet[4], and into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications beyond image classification, such as object detection, panoptic or semantic segmentation, or other specialized tasks. This is done by using them as the backbone of an end-to-end, application-specific Neural Network and fine-tuning the resulting network on the appropriate data set and application.
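As a minimal sketch of this backbone-plus-fine-tuning pattern (assuming PyTorch and torchvision 0.13 or newer; the 10-class head and the specific weights tag are illustrative placeholders, not part of the original article), the example below strips the ImageNet classifier from a pretrained ResNet-50 and attaches a task-specific head:

```python
import torch
import torch.nn as nn
import torchvision

# Load a ResNet-50 pretrained on ImageNet (torchvision >= 0.13 weights API).
backbone = torchvision.models.resnet50(weights="DEFAULT")

# Drop the ImageNet classification head; keep the convolutional trunk
# (plus global average pooling) as a 2048-dimensional feature extractor.
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])

# Attach a task-specific head, e.g. for a hypothetical 10-class problem.
model = nn.Sequential(
    feature_extractor,
    nn.Flatten(),
    nn.Linear(2048, 10),
)

# Fine-tune end-to-end, or freeze the backbone for linear probing.
x = torch.randn(2, 3, 224, 224)   # dummy batch of RGB images
logits = model(x)                 # shape: (2, 10)
print(logits.shape)
```

In practice the same trunk can just as well feed an object detection or segmentation head instead of a simple linear classifier.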

A typical ResNet-style CNN is shown in Figure 1-1 and Figure 1-4 (a). Such networks share several features:

  1. They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions against building a large receptive field.
  2. Training is stabilized by using batch normalization and residual connections.
  3. Feature maps are built hierarchically, by gradually reducing the spatial dimensions (W, H) until they are downscaled by a factor of 32×.
  4. Feature maps are built pyramidally, by increasing the embedding dimensions from tens of channels in the first layers to thousands in the last.


Figure 1-1: Illustration of ResNet34 [3]
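To make features (1) and (2) above concrete, here is a minimal, illustrative sketch of a ResNet-style bottleneck block (the channel counts are placeholders; real ResNets also add downsampling stages that realize features (3) and (4)):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 -> 3x3 -> 1x1 convolutions,
    batch normalization and a residual (skip) connection."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # residual connection

x = torch.randn(1, 256, 56, 56)
print(Bottleneck(256, 64)(x).shape)   # torch.Size([1, 256, 56, 56])
```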

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline[8].

Transformers in Computer Vision

A more radical evolution in Neural Networks for Computer Vision is the move towards using Vision Transformers (ViTs)[9] as a replacement for CNN backbones. Inspired by the astounding performance of Transformer models in Natural Language Processing (NLP)[10], research has moved towards applying the same principles in Computer Vision. Notable examples, among many others, are XCiT[11], PiT[12], DeiT[13] and Swin Transformers[14]. Here, analogously to NLP, images are essentially treated as sequences of image patches: feature maps are modeled as vectors of tokens, with each token representing an embedding of a specific image patch.
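A minimal sketch of this patch-tokenization step follows; the patch size and embedding width are illustrative, and the strided convolution is simply a convenient way to express a shared linear projection of each patch:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each
    patch to an embedding vector (one token per patch)."""
    def __init__(self, patch=16, in_ch=3, dim=384):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch
        # and applying the same linear projection to all of them.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

x = torch.randn(1, 3, 224, 224)
print(PatchEmbedding()(x).shape)   # torch.Size([1, 196, 384])
```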


Figure 1-2: Illustration of the original basic Vision Transformer (ViT), taken from [9]


Figure 1-3: Illustration of a self-attention module. K, Q and V are linear projections of the same input feature map. The attention map is a softmax function of the matrix product QKᵀ. Image taken from [15].

An illustration of a basic ViT is given in Figure 1-2. The ViT is a sequence of stacked MLPs and self-attention layers, with or without residual connections. It uses the multi-headed self-attention mechanism developed for NLP Transformers, see Figure 1-3. Such a self-attention layer has two distinguishing features: it can (1) dynamically ‘guide’ its attention, reweighting the importance of specific features depending on the context, and (2) have a full receptive field when global self-attention is used. The latter is the case when self-attention is applied across all possible input tokens: all tokens, each representing an embedding of a specific spatial image patch, are correlated with each other, giving a full receptive field. Global self-attention is typical in ViTs, but not a requirement. Self-attention can also be made local by limiting the scope of the self-attention module to a smaller set of tokens, in turn reducing the operation’s receptive field at that particular stage.
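The following minimal, single-head sketch (dimensions are illustrative; production ViTs use the multi-headed variant) shows the softmax(QKᵀ/√d)·V computation of Figure 1-3, where the attention map dynamically reweights the value tokens:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head global self-attention: Q, K and V are linear
    projections of the same token sequence; the attention map is
    softmax(QK^T / sqrt(d)) and re-weights V dynamically."""
    def __init__(self, dim=384):
        super().__init__()
        self.dim = dim
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(-2, -1)) / self.dim ** 0.5   # (B, N, N)
        attn = attn.softmax(dim=-1)                          # attention map
        return attn @ v                                      # (B, N, dim)

tokens = torch.randn(1, 196, 384)
print(SelfAttention()(tokens).shape)   # torch.Size([1, 196, 384])
```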

This ViT architecture contrasts strongly with CNNs. In vanilla CNNs without attention mechanisms, (1) features are statically weighted using pretrained weights, rather than dynamically reweighted based on context as in ViTs, and (2) the receptive fields of individual network layers are typically local and limited by the convolutional kernel size.

Part of the success of CNNs is the strong architectural inductive bias implied by the convolutional approach. Convolutions with shared weights explicitly encode how identical patterns repeat across an image. This inductive bias ensures easy training convergence on relatively small datasets, but also limits the modeling capacity of CNNs. Vision Transformers do not enforce such strict inductive biases. This makes them harder to train, but also increases their learning capacity, see Figure 1-5. To achieve good results with ViTs in Computer Vision, these networks are often trained using knowledge distillation with a large CNN-based teacher (as in DeiT[16], for example). This way, part of the inductive bias of CNNs can be softly injected into the training process.
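As an illustration of the idea (not DeiT's exact recipe, which adds a dedicated distillation token and can use hard teacher labels), a generic soft-label distillation loss might look as follows; the temperature and weighting factor are arbitrary placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Blend the usual cross-entropy with a KL term that pulls the
    student's (ViT) output distribution towards the CNN teacher's."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kl

# Example: the teacher is a frozen, pretrained CNN producing logits.
student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```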

Initially, ViTs were directly inspired by NLP Transformers: massive models with a uniform topology and global self-attention, see Figure 1-4 (b). Recent ViTs have a macro-architecture that is closer to that of CNNs (Figure 1-4 (a)), using hierarchical pyramidal feature maps (as in PiT[12], see Figure 1-4 (c)) and local self-attention (as in Swin Transformers[14]). A high-level overview of this evolution is given in Table 1.


Figure 1-4: Comparing the dimension configurations of (a) ResNet-50, a classical CNN with pyramidal feature maps, (b) an early ViT-S/16 [9] with a uniform macro-architecture and (c) a modern PiT-S [12] with CNN-ified pyramidal feature maps. Figure taken from [12].


Table 1: Comparing early ViTs, recent ViTs and modern CNNs
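To make the notion of local self-attention concrete, the sketch below partitions a token grid into non-overlapping windows and attends only within each window. It is a simplified illustration, not Swin's actual implementation (which adds shifted windows and relative position biases); the grid and window sizes are placeholders.

```python
import torch
from torch import nn

def local_self_attention(tokens, attn, grid=14, window=7):
    """Local (windowed) self-attention: split the token grid into
    non-overlapping window x window tiles and attend within each tile,
    so every token's receptive field is limited to its own window."""
    B, N, C = tokens.shape
    x = tokens.view(B, grid, grid, C)
    x = x.view(B, grid // window, window, grid // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    x, _ = attn(x, x, x)                     # attention inside each window
    # Undo the windowing to recover the original token layout.
    x = x.view(B, grid // window, grid // window, window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
    return x

attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)
tokens = torch.randn(1, 196, 384)            # a 14x14 grid of tokens
print(local_self_attention(tokens, attn).shape)   # torch.Size([1, 196, 384])
```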

Comparing CNNs and ViTs for Edge Computing

Even though ViTs have shown State-of-the-Art (SotA) performance on many Computer Vision tasks, they do not necessarily outperform CNNs across the board. This is illustrated in Figure 1-5 and Figure 1-6, which compare the performance of ViTs and CNNs in terms of ImageNet validation accuracy versus model size and complexity, for various training regimes. It is important to distinguish between these training regimes, as not all training methodologies are feasible for a given downstream task. First, for some applications only relatively small datasets are available; in that case, CNNs typically perform better. Second, many ViTs rely on distillation to achieve high performance, which requires a highly accurate pretrained CNN as a teacher, and such a teacher is not always available.

Figure 1-5 (a) illustrates how CNNs and ViTs compare in terms of model size versus accuracy when all types of training are allowed, including distillation and the use of additional data (such as JFT-300M[17]). Here ViTs perform on par with or better than large-scale CNNs, outperforming them in specific ranges. Notably, XCiT[11] models perform particularly well around the 3M-parameter range. However, when neither distillation nor training on extra data is allowed, the difference is less pronounced, see Figure 1-5 (b). In both figures, EfficientNet-B0 and ResNet-50 are indicated as reference points.


Figure 1-5: Comparing CNNs to ViTs in terms of model size (# Params) and ImageNet Top-1 validation accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT[17] and (iii) training with knowledge distillation from a CNN teacher. (b) Shows data for the subset of networks trained from scratch, without CNN-based knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates the lasting competitiveness of CNNs relative to ViTs, especially in the Edge domain for models with fewer than 25M parameters, where performance is very similar between CNNs and ViTs. ResNet-50 and EfficientNet-B0 are given as reference points. Data is taken from [18] and the respective scientific papers.

Figure 1-6 illustrates the same comparison in terms of accuracy versus model complexity for a more limited set of well-known networks. Figures 1-6 (a) and (b) show that CNNs are mostly dominant at lower accuracies and for networks with lower complexity (<1B FLOPS), for all types of training. This holds even for CNN-ified Vision Transformers such as PiT[12], which use a hierarchical architecture with pyramidal feature maps, and for Swin Transformers, which reduce complexity by using local self-attention. Without extra data or distillation, CNNs typically outperform ViTs across the board, especially for networks with lower complexity or with accuracies below 80%. For example, at similar complexity, both RegNets and EfficientNet-style networks significantly outperform XCiT ViTs, see Figure 1-6 (b).


Figure 1-6: Comparing SotA CNNs to ViTs in terms of computational cost (# FLOPS) and ImageNet Top-1 validation accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT[17] and (iii) training with knowledge distillation from a CNN teacher. (b) Shows data for the subset of networks trained from scratch, without extra data or knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates how CNNs are still dominant in the <80% accuracy regime. Even CNN-ified modern ViTs with hierarchical pyramidal feature maps, such as PiT[12], do not outperform EfficientNet[6]- and RegNet[4]-style CNNs. In the 80%+ range, networks with local self-attention such as Swin[14] are on par with or better than RegNets[4]. Data is taken from [18] and the respective scientific papers.

Apart from the high-level differences in Table 1 and the performance differences discussed in this section, bringing ViTs to edge devices imposes some additional requirements. Compared to CNNs, ViTs rely much more heavily on three specific operations that must be properly accelerated on-chip. First, ViTs rely on accelerated softmax operators as part of self-attention, while CNNs only require a softmax as the final layer of a classification network. Second, ViTs typically use smooth non-linear activation functions, while CNNs mostly rely on Rectified Linear Units (ReLU), which are much cheaper to execute and accelerate. Finally, ViTs typically require LayerNorm, a form of layer normalization with dynamic computation of mean and standard deviation to stabilize training. CNNs, however, typically use batch normalization, which only needs to be computed during training and can essentially be ignored at inference time by folding the operation into neighbouring convolutional layers.
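To illustrate why batch normalization is essentially free at inference time, the sketch below folds a trained BatchNorm2d into the preceding convolution using the standard re-scaling of weights and bias (deep learning frameworks ship similar fusion utilities; this simply spells out the arithmetic):

```python
import torch
from torch import nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a trained BatchNorm2d into the preceding Conv2d so that
    inference needs only the convolution itself."""
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var + eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(32).eval()          # use running statistics, as at inference time
x = torch.randn(1, 16, 8, 8)
with torch.no_grad():
    reference = bn(conv(x))             # convolution followed by batch norm
    fused = fold_bn_into_conv(conv, bn) # single convolution, same result
    print(torch.allclose(reference, fused(x), atol=1e-6))   # True
```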

Conclusion

Vision Transformers are rapidly starting to dominate many applications in Computer Vision. Compared to CNNs, they achieve higher accuracies on large data sets thanks to their higher modeling capacity, lower inductive biases and global receptive fields. Modern, smaller ViTs such as PiT and Swin are essentially becoming CNN-ified, reducing receptive fields and using hierarchical pyramidal feature maps. However, CNNs are still on par with, or better than, SotA ViTs on ImageNet in terms of model complexity or size versus accuracy, especially when trained without knowledge distillation or extra data and when targeting lower accuracies.

References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012): 1097-1105.

[2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).

[3] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[4] Radosavovic, Ilija, et al. “Designing network design spaces.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[6] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International Conference on Machine Learning. PMLR, 2019.

[7] He, Xin, Kaiyong Zhao, and Xiaowen Chu. “AutoML: A Survey of the State-of-the-Art.” Knowledge-Based Systems 212 (2021): 106622.

[8] Moons, Bert, et al. “Distilling optimal neural networks: Rapid search in diverse spaces.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[9] Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[10] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[11] El-Nouby, Alaaeldin, et al. “XCiT: Cross-Covariance Image Transformers.” arXiv preprint arXiv:2106.09681 (2021).

[12] Heo, Byeongho, et al. “Rethinking spatial dimensions of vision transformers.” arXiv preprint arXiv:2103.16302 (2021).

[13] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[14] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” arXiv preprint arXiv:2103.14030 (2021).

[15] Li, Yawei, et al. “Spatio-Temporal Gated Transformers for Efficient Video Processing.” NeurIPS ML4AD Workshop, 2021.

[16] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[17] Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.

[18] Ross Wightman, “Pytorch Image Models”, https://github.com/rwightman/pytorch-image-models, seen on January 10, 2022

Bert Moons
System Architect, Axelera AI
