Bear-ing Fruit: A Beginner's Guide to Training a Deep Learning Model for Image Classification

The year is 2023, Web3 is (apparently) out, AI is in, and as long as ChatGPT doesn't start asking for a certain John Connor's whereabouts, it appears society's nuptials with this technology intimate a relationship set to last ad infinitum. For the longest time, machine learning in general, and deep learning in particular, has been an inordinately exclusive game. Why? While the reasons may be numerous, one thing stands out: poorly written textbooks that introduce concepts without defining them.

As Jeremy Howard puts it, "most technical subjects at university are taught “bottom-up”: start with basic foundations and gradually work up to complete useful solutions to real-world problems. Students are expected to spend years doing rote memorization and learning dry, disconnected fundamentals that we claim will pay off later, long after most of them quit the subject." Paul Lockhart, a Columbia math Ph.D., former Brown professor, and K-12 math teacher, imagines in his influential essay "A Mathematician's Lament" a nightmare world where music and art are taught the way math is taught. Unfortunately, this is where many teaching resources on deep learning begin.

Some inspiration

Thankfully, and contrary to what some may believe, you don't need a Ph.D. or any particular academic background to succeed at deep learning. One of the most influential papers of the last decade, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", with over 14,000 citations, was written by Alec Radford when he was an undergraduate! Here is Elon Musk on the extremely challenging task of making self-driving cars at Tesla:

A Ph.D. is definitely not required. All that matters is a deep understanding of AI & ability to implement [Neural Networks] in a way that is actually useful. Don’t care if you even graduated high school.

Whether you're excited to identify if plants are diseased from pictures of their leaves, pick out individuals from a family photo, diagnose TB from X-rays, or distinguish a Basquiat from a Botticelli, deep learning can be set to work on ALMOST any problem. And before you ask: no, you don't need the computing power or the datasets of an institution like Google to do deep learning.

This is not to say the journey to becoming an AI wizard (grandmaster?) is smooth sailing. There will be moments when the journey feels difficult, when you feel stuck and frustration creeps in and makes you doubt yourself. This is common, but you will conquer these challenges if you persevere. DON'T GIVE UP!

Prerequisites

What you don't need    What you need
-------------------    -------------
Lots of math           High school math is sufficient to start
Lots of data           Record-breaking results have been achieved with <50 items of data
Expensive computers    You can get what you need for state-of-the-art work for free

The only prerequisite to get started is knowing how to code (a year's experience is fine), preferably in Python. If you want to learn Python, you can find a link to free courses under the Resources section of this article.

Let's get started

  1. You'll want to create a Kaggle account if you don't already have one. Kaggle provides access to GPUs for free which we'll use to train our model.

To use a GPU on Kaggle, your account must be phone verified. You can enable this on your account page (after you’ve signed up and are logged in) under “Phone Verification”.

  2. Once you’ve got your Kaggle account set up, you’ll need to get familiar with Jupyter Notebook, which is the platform most deep learning researchers and engineers use for their work. If you haven’t used it before, you can learn the ins and outs of it here: Jupyter Notebook 101.

What are we building, exactly?

We are going to be building a bear classifier! A bear what, you ask? Let's imagine you're camping with your family; you spot a lost bear cub near your campsite. The cuteness of the little silvertip drowns out the blaring alarm bells that should be going off in your head, so you stand there trying to figure out what type of bear you're looking at. Fortunately for you, you read this article, followed the steps and built a model that could help identify our little massive red flag. You whip out your phone, snap a pic, and your model correctly identifies it as a grizzly.

So, how do we go about identifying different flavors of danger?

Step 1

# Make sure we've got the latest versions of fastai and duckduckgo_search:
!pip install -Uqq fastai duckduckgo_search

from duckduckgo_search import ddg_images
from fastcore.all import *
from fastai.vision.all import *
from fastai.vision.widgets import *

# Our function to search for images, capped at 150 results per term.
# It returns an L (fastcore's enhanced list) of image URLs.
def search_images(term, max_images=150):
    print(f"Searching for '{term}'")
    return L(ddg_images(term, max_results=max_images)).itemgot('image')

ims = search_images('grizzly bear')
len(ims)
  • fastai is the library we'll use to train our model.

  • fastcore is a library of core utilities that fastai itself is built on. It provides functionality (such as the L class used above) that is useful across many deep-learning tasks.

  • duckduckgo_search will be used to collect the images we'll train our model on.

Running this code will retrieve up to 150 image URLs from DuckDuckGo for the search term "grizzly bear".

Step 2

# Make sure the destination folder exists, then download the first result
Path('images').mkdir(exist_ok=True)
dest = 'images/grizzly.jpg'
download_url(ims[0], dest)

# Open the downloaded image and display a small thumbnail of it
im = Image.open(dest)
im.to_thumb(128,128)

Here, we take a look at one of the downloaded images to check that our function is returning the results we expect. From the output above, it looks like our search function is working.

Step 3

bear_types = 'grizzly','black','teddy'
path = Path('bears')

if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok=True)
        results = search_images(f'{o} bear')
        download_images(dest, urls=results)

fns = get_image_files(path)
fns

We'll use fastai's download_images to download the images from the URLs returned for each of our search terms, putting each type of bear in its own folder.

Our expected output from this stage would look something like this:

[Path('bears/black/00000149.jpg'),Path('bears/black/00000095.jpg'),Path('bears/black/00000133.jpg'),Path('bears/black/00000062.jpg'),Path('bears/black/00000023.jpg'),Path('bears/black/00000029.jpg'),Path('bears/black/00000094.jpg'),Path('bears/black/00000124.jpg'),Path('bears/black/00000105.jpg'),Path('bears/black/00000046.jpg')...]

Step 4

failed = verify_images(fns)
failed
# Our output should look something like this
[Path('bears/black/bdc07369-de49-446c-add5-5fbe1f827808.jpg'),Path('bears/black/a3a95009-c287-4181-a5b5-8938bbb78725.jpg'),Path('bears/black/0fd14f45-cd7f-40d2-b41e-240ddd3e225b.jpg'),Path('bears/black/214163f9-5a36-4b46-a8d7-9b0f806d7f77.jpg'),Path('bears/black/28cf6d45-4a96-48de-85be-41060d42c1f4.jpg'),Path('bears/black/e4be01ee-73f4-4714-9784-d69d15833eab.jpg'),Path('bears/black/275ba4fe-342d-4776-a32a-bb54cc0de6ed.jpg'),Path('bears/teddy/68f60534-fed4-4281-beac-7916236c61f1.jpg'),Path('bears/teddy/b5b02629-e06e-4342-a17a-f3dd9e89f2ad.jpg'),Path('bears/teddy/55bef5e1-1257-4ce3-a324-98a3fb6eb47f.jpg')...]

Here we check whether any of the images we've downloaded are corrupt. To remove all the failed images, we call unlink on each of them:

failed.map(Path.unlink);

Step 5

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))

Now that we have downloaded some data, we need to assemble it in a format suitable for model training. Let's break it down:

  • First, we provide a tuple where we specify what types we want for the independent and dependent variables:

    blocks=(ImageBlock, CategoryBlock)

    The independent variable is the thing we are using to make predictions, and the dependent variable is our target. In this case, our independent variables are images, and our dependent variables are the categories (type of bear) for each image.

  • get_items=get_image_files

    The get_image_files function takes a path, and returns a list of all of the images in that path (recursively, by default)

  • splitter=RandomSplitter(valid_pct=0.2, seed=42)

    Often, datasets that you download will already have a validation set defined. The most important parameter to mention here is valid_pct=0.2. This tells fastai to hold out 20% of the data and not use it for training the model at all. This 20% of the data is called the validation set; the remaining 80% is called the training set. The validation set is used to measure the accuracy of the model. By default, the 20% that is held out is selected randomly. The parameter seed=42 sets the random seed to the same value every time we run this code, which means we get the same validation set every time we run it—this way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.

  • get_y=parent_label

    The independent variable is often referred to as x and the dependent variable is often referred to as y. Here, we are telling fastai what function to call to create the labels in our dataset. parent_label is a function provided by fastai that simply gets the name of the folder a file is in. Because we put each of our bear images into folders based on the type of bear, this is going to give us the labels that we need (see the short sketch after this list).

  • item_tfms=Resize(128)

    Our images are all different sizes, and this is a problem for deep learning: we don't feed the model one image at a time but several of them (what we call a mini-batch). To group them in a big array (usually called a tensor) that is going to go through our model, they all need to be of the same size. So, we need to add a transform that will resize these images to the same size. Item transforms are pieces of code that run on each item, whether it be an image, category, or so forth. fastai includes many predefined transforms; we use the Resize transform here.
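To make parent_label and the random split concrete, here's a quick sanity check you can run in your notebook (illustrative only; it assumes the fns list from Step 3 is still defined):

# parent_label simply returns the name of the folder a file sits in
parent_label(fns[0])  # e.g. 'black'

# A fixed seed makes the 80/20 split reproducible across runs
splits = RandomSplitter(valid_pct=0.2, seed=42)(fns)
len(splits[0]), len(splits[1])  # roughly 80% train, 20% validation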

Step 6

dls = bears.dataloaders(path)

We still need to tell fastai the actual source of our data—in this case, the path where the images can be found.

A DataLoaders includes validation and training dataloaders. DataLoader is a class that provides batches of a few items at a time to the GPU. We can take a look at a few of those items by calling the show_batch method on a DataLoader:

dls.valid.show_batch(max_n=4, nrows=1)
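If you'd also like to inspect the raw tensors the GPU will see, you can grab a single mini-batch (a quick optional check, not part of the main recipe):

# one_batch returns one mini-batch of (images, labels) from the DataLoader
xb, yb = dls.one_batch()
xb.shape, yb.shape  # e.g. (torch.Size([64, 3, 128, 128]), torch.Size([64]))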

By default, Resize crops the images to fit a square shape of the size requested, using the full width or height. This can result in losing some important details. Alternatively, you can ask fastai to pad the images with zeros (black), or squish/stretch them:

bears = bears.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

bears = bears.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

All of these approaches seem somewhat wasteful, or problematic. If we squish or stretch the images they end up as unrealistic shapes, leading to a model that learns that things look different from how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to perform recognition.

Instead, what we normally do in practice is to randomly select part of the image, and crop it to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world: different photos of the same thing may be framed in slightly different ways.

Here's another example where we replace Resize with RandomResizedCrop, which is the transform that provides the behavior we just described. The most important parameter to pass in is min_scale, which determines how much of the image to select at a minimum each time:

bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=4, nrows=1, unique=True)

We used unique=True to have the same image repeated with different versions of this RandomResizedCrop transform. This is a specific example of a more general technique, called data augmentation.

Step 7

Time to conduct data augmentation!

bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)

Data augmentation refers to creating random variations of our input data, such that they appear different but do not actually change the meaning of the data. Examples of common data augmentation techniques for images are rotation, flipping, perspective warping, brightness changes, and contrast changes.

Because our images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms on a batch, we use the batch_tfms parameter.
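If you're curious, many of the techniques listed above map directly onto keyword arguments of aug_transforms. A sketch, with illustrative values rather than recommendations:

# Each knob controls one family of augmentations
tfms = aug_transforms(
    do_flip=True,       # random horizontal flips
    max_rotate=10.,     # rotate up to 10 degrees
    max_warp=0.2,       # perspective warping
    max_lighting=0.3,   # brightness/contrast changes
)
bears.new(item_tfms=Resize(128), batch_tfms=tfms)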

Now that we have assembled our data in a format fit for model training, let's train an image classifier using this data.

Step 8

bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = bears.dataloaders(path)

We don't have a lot of data for our problem (150 pictures of each sort of bear at most), so to train our model, we'll use RandomResizedCrop with an image size of 224 px, which is fairly standard for image classification, and the default aug_transforms.

Step 9

learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

This code tells fastai to create a convolutional neural network (CNN), specifying what architecture to use (i.e., what kind of model to create), what data we want to train it on, and what metric to use. Why a CNN? It's the current state-of-the-art approach to creating computer vision models.

There are many different architectures in fastai. Most of the time, however, picking an architecture isn't a very important part of the deep learning process.

There are some standard architectures that work most of the time, and in this case, we're using one called ResNet. The 18 in resnet18 refers to the number of layers in this variant of the architecture (other options are 34, 50, 101, and 152). Models using architectures with more layers take longer to train and are more prone to overfitting (i.e. you can't train them for as many epochs before the accuracy on the validation set starts getting worse). On the other hand, when using more data, they can be more accurate.
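Trying a deeper variant is a one-word change. A sketch (the learn34 name is just for illustration, and our small dataset doesn't necessarily need the extra depth):

# Same data, same metric, just a deeper architecture
learn34 = vision_learner(dls, resnet34, metrics=error_rate)
learn34.fine_tune(4)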

What is a metric? A metric is a function that measures the quality of the model's predictions using the validation set and will be printed at the end of each epoch. In this case, we're using error_rate, which is a function provided by fastai that does just what it says: tells you what percentage of images in the validation set are being classified incorrectly.
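To see what error_rate actually computes, here's a toy example with made-up prediction tensors (purely illustrative):

import torch
from fastai.metrics import error_rate

preds = torch.tensor([[0.9, 0.05, 0.05],   # predicted class 0
                      [0.2, 0.7,  0.1 ],   # predicted class 1
                      [0.1, 0.2,  0.7 ]])  # predicted class 2
targs = torch.tensor([0, 1, 0])            # true classes
error_rate(preds, targs)  # one of three predictions is wrong -> ~0.33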

Using pre-trained models is the most important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money.

Using a pre-trained model for a task different from what it was originally trained for is known as transfer learning.
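Under the hood, learn.fine_tune(4) is roughly equivalent to the following (a simplified sketch; the real method also adjusts learning rates between the two phases):

learn.freeze()          # train only the newly added head first...
learn.fit_one_cycle(1)  # ...for one epoch
learn.unfreeze()        # then unfreeze the pretrained layers
learn.fit_one_cycle(4)  # and train the whole network for four epochs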

Step 10

Now let's see what mistakes our model is making. To visualize this, we can create a confusion matrix:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

The rows represent all the black, grizzly, and teddy bears in our dataset, respectively. The columns represent the images that the model predicted as black, grizzly, and teddy bears, respectively. Therefore, the diagonal of the matrix shows the images which were classified correctly, and the off-diagonal cells represent those that were classified incorrectly. This is one of the many ways that fastai allows you to view the results of your model. It is calculated using the validation set. With the color coding, the goal is to have white everywhere except the diagonal, where we want dark blue.

It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at all, or are labeled incorrectly, etc.), or a model problem (perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc.). To do this, we can sort our images by their loss.

The loss is a number that is higher if the model is incorrect (especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer.
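A toy illustration of that last point, using the cross-entropy loss that fastai uses by default for classification (made-up numbers, just to show the ordering):

import torch
import torch.nn.functional as F

targ = torch.tensor([0])  # the true class is 0
for name, logits in [('confident & right', torch.tensor([[5., 0., 0.]])),
                     ('right but unsure',  torch.tensor([[1., .8, .8]])),
                     ('confident & wrong', torch.tensor([[0., 5., 0.]]))]:
    print(name, F.cross_entropy(logits, targ).item())
# The loss grows from the first case to the last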

Step 11

Let's see where our errors are coming from, so we can determine whether the problem is with our model or with our dataset. To accomplish this, we will sort our images by their loss:

interp.plot_top_losses(5, nrows=1)

This output shows that the image with the highest loss is one that has been predicted as "black" with high confidence. However, it's labeled (based on our image search) as "grizzly". I'm no mammalogist, but this label seems incorrect and we should probably change it!

Given the last image, it's evident that some of the images in our dataset have problems, and we'll have to clean them up.

Step 12

The intuitive approach to doing data cleaning is to do it before you train a model. But as you've seen in this case, a model can help you find data issues more quickly and easily. So, we normally prefer to train a quick and simple model first, and then use it to help us with data cleaning.

cleaner = ImageClassifierCleaner(learn)
cleaner

fastai includes a handy GUI for data cleaning called ImageClassifierCleaner. It allows you to choose a category and the training versus validation set, view the highest-loss images (in order), and use menus to select images for removal or relabeling.

Using this widget, we can filter our images and correct our dataset.

Step 13

We can see that amongst our "black bears" is an image that contains two bears: one grizzly, and one black. So, we should choose <Delete> in the menu under this image. ImageClassifierCleaner doesn't do the deleting or changing of labels for you; it just returns the indices of items to change.

import shutil

# Delete (unlink) all images selected for deletion
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
# Move images for which we've selected a different category
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

"Cleaning the data and getting it ready for your model are two of the biggest challenges for data scientists; they say it takes 90% of their time. The fastai library aims to provide tools that make it as easy as possible."

-Sylvain Gugger

Step 14

Voilà! Congratulations, you have successfully built a deep learning model, and you can deploy it to be used in practice (I'll save that for another article).

Some considerations

You now have a model ready to be converted into an online application. However, before you get excited and start planning your next camping trip, it's important to consider issues that you could run into in deployment. One of the biggest issues to consider is that understanding and testing the behavior of a deep learning model is much more difficult than with most other code you write. With normal software development, you can analyze the exact steps that the software is taking, and carefully study which of these steps match the desired behavior that you are trying to create. But with a neural network, the behavior emerges from the model's attempt to match the training data, rather than being exactly defined.

This could be disastrous! For instance, let's say we were rolling out a bear detection system that will be attached to video cameras around campsites in national parks and will warn campers of incoming bears. If we used a model trained with the dataset we downloaded, there would be all kinds of problems in practice, such as:

  • Working with video data instead of images

  • Handling nighttime images, which may not appear in this dataset

  • Dealing with low-resolution camera images

  • Ensuring results are returned fast enough to be useful in practice

  • Recognizing bears in positions that are rarely seen in photos that people post online (for example from behind, partially covered by bushes, or when a long way away from the camera)

A big part of the issue is that the kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matter—which isn't the kind of input this system is going to be getting. So, we may need to do a lot of our own data collection and labeling to create a useful system.

Conclusion

This is but a fraction of the possibilities that exist within deep learning. Less than a decade ago, this would have been considered impossible. So well done to you! Deep learning has power, flexibility, and simplicity with capabilities that can be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and many more. For example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. Within months of starting the company, it was announced that its algorithm could identify malignant tumors more accurately than radiologists. The possibilities are endless and we've only just scratched the surface!

Acknowledgments

If you're looking to explore the deep learning space and possibly make it a career, there is no better place to start than fast.ai. There are 9 lessons, and each lesson is around 90 minutes long. The course is based on the 5-star-rated book that did most of the heavy lifting for this article, and it is freely available online. If you're skeptical, you don't have to believe me, but I think this man might be a BIT more convincing:

You don’t need any special hardware or software — you’ll be shown how to use free resources for both building and deploying models. You don’t need any university math either — you'll be taught the calculus and linear algebra you need during the course. The lecturer and founder of the course, Jeremy Howard, is a wonderful teacher who breaks down complex concepts in ways that make them easy to follow.

Thank you so much, Jeremy, you are a godsend!

Resources

fast.ai - if you want to learn how to apply deep learning and machine learning to practical problems

python - if you're looking to learn Python
