Conduct Machine Learning using Torchvision

We will run a machine learning experiment with the Faster R-CNN implementation provided by Torchvision. To speed up training and evaluation, we use Colab's GPU.

1. Set up the Colab environment

        1) First, we need to enable GPUs for the notebook

Navigate to Edit → Notebook Settings

Select GPU from the Hardware Accelerator drop-down

        2) Install the requirements for torchvision


pip install cython
# Install pycocotools from source to replace the version preinstalled in Colab
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../

        Now, you can do all necessary imports

import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image
import pandas as pd
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from engine import train_one_epoch, evaluate
import utils
import transforms as T

        With this setup, a NumPy-related error may occur. To prevent it, downgrade the installed version of NumPy.

pip install numpy==1.17.4

        3) Mount Google Drive in Colab

from google.colab import drive
drive.mount('/content/drive')

        After mounting, the contents of our Drive are available under '/content/drive/My Drive/'.

2. Define the custom Dataset

        Dataset structure required by Torchvision :

  • image: a PIL Image of size (H, W)
  • target: a dict containing the following fields

         - boxes (FloatTensor[N, 4]): the coordinates of the N bounding boxes in [x0, y0, x1, y1] format, ranging from 0 to W and 0 to H

         - labels (Int64Tensor[N]): the label for each bounding box. 0 always represents the background class.

         - image_id (Int64Tensor[1]): an image identifier. It should be unique between all the images in the dataset, and is used during evaluation

         - area (Tensor[N]): The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.

         - iscrowd (UInt8Tensor[N]): instances with iscrowd=True will be ignored during evaluation.

         - (optionally) masks (UInt8Tensor[N, H, W]): The segmentation masks for each one of the objects

         - (optionally) keypoints (FloatTensor[N, K, 3]): For each one of the N objects, it contains the K keypoints in [x, y, visibility] format, defining the object. visibility=0 means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint is dependent on the data representation, and you should probably adapt references/detection/transforms.py for your new keypoint representation

        We don't need masks or keypoints because we will only use Faster R-CNN, which predicts bounding boxes and labels.
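
        As a quick sanity check on the box format described above, the following is a minimal sketch that tests whether each box is in [x0, y0, x1, y1] order and lies inside the image. The function name check_boxes and the sample values are hypothetical, and plain Python lists stand in for tensors.

```python
def check_boxes(boxes, width, height):
    """Return the indices of boxes that violate the [x0, y0, x1, y1] format."""
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # Coordinates must range from 0 to W (x) and from 0 to H (y)
        inside = (0 <= x0 <= width and 0 <= x1 <= width
                  and 0 <= y0 <= height and 0 <= y1 <= height)
        # The first corner must be the top-left one
        ordered = x0 < x1 and y0 < y1
        if not (inside and ordered):
            bad.append(i)
    return bad


# Hypothetical boxes for a 640x480 image
boxes = [
    [10.0, 20.0, 110.0, 220.0],    # valid
    [300.0, 50.0, 250.0, 90.0],    # x0 > x1: wrong order
    [600.0, 400.0, 700.0, 470.0],  # x1 exceeds the image width
]
print(check_boxes(boxes, width=640, height=480))  # prints [1, 2]
```

        Running a check like this over the annotations before training can catch swapped columns early.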

        1) Prepare annotation.csv file

        We will use the annotation.csv file as input. Its columns are [filename, minX, maxX, minY, maxY, classname]; the order of the columns does not matter because they are read by name. Don't forget that you need the original image for each item as well as the annotation.csv file!

        How to extract RoI from images is described in detail in another note. If you are curious, please refer to the note.
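
        To make the file structure concrete, here is a small sketch with made-up contents. The file names, class names and coordinates are hypothetical, and the standard csv module is used instead of pandas only to keep the sketch self-contained. One image may have several rows, one per RoI.

```python
import csv
import io

# Hypothetical file contents; the column order differs from the list above on purpose
csv_text = """filename,classname,minX,maxX,minY,maxY
img_001.png,classname1,10,110,20,220
img_001.png,classname2,300,380,50,90
img_002.png,classname3,5,60,5,40
"""


def boxes_for(csv_file, filename):
    """Collect [minX, minY, maxX, maxY] boxes for one image, looking columns up by name."""
    boxes = []
    for row in csv.DictReader(csv_file):
        if row["filename"] == filename:
            boxes.append([float(row[key]) for key in ("minX", "minY", "maxX", "maxY")])
    return boxes


print(boxes_for(io.StringIO(csv_text), "img_001.png"))
```

        Because each value is looked up by its column name, the boxes come out in [minX, minY, maxX, maxY] order regardless of how the columns are arranged in the file.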

        2) Parse the annotations of one image from the annotation.csv file

def parse_one_annot(filepath, filename):
    # Look up the position and classname of every RoI in one image.
    # Each classname is converted to a label (integer type).
    # Labels start from 1 because 0 is reserved for the background.

    data = pd.read_csv(filepath)
    boxes_array = data[data["filename"] == filename][["minX", "minY", "maxX", "maxY"]].values
    for i in range(len(boxes_array)) :
        minX = boxes_array[i, 0]
        minY = boxes_array[i, 1]
        maxX = boxes_array[i, 2]
        maxY = boxes_array[i, 3]

    classnames = data[data["filename"] == filename][["classname"]]
    classes = []
    for i in range(len(classnames)) :
        if classnames.iloc[i, 0] == 'classname1' : classes.append(1)
        elif classnames.iloc[i, 0] == 'classname2' : classes.append(2)
        elif classnames.iloc[i, 0] == 'classname3' : classes.append(3)
    return boxes_array, classes
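
        The if/elif chain above grows with the number of classes. As a possible alternative, the sketch below keeps the mapping in a dictionary. The names CLASS_TO_LABEL and names_to_labels are hypothetical and not part of the original code.

```python
# Hypothetical mapping; label 0 is left for the background
CLASS_TO_LABEL = {"classname1": 1, "classname2": 2, "classname3": 3}


def names_to_labels(classnames):
    """Convert class names to integer labels, failing loudly on unknown names."""
    labels = []
    for name in classnames:
        if name not in CLASS_TO_LABEL:
            raise ValueError("unknown class name: {!r}".format(name))
        labels.append(CLASS_TO_LABEL[name])
    return labels


print(names_to_labels(["classname2", "classname1", "classname2"]))  # prints [2, 1, 2]
```

        Raising an error on an unknown name is a design choice: a name that is silently skipped would leave fewer labels than boxes for that image.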

        3) Define our custom Dataset class

        root : path where images are stored

        df_path : path where annotation.csv file is stored

class OpenDataset(torch.utils.data.Dataset):
    # Class for creating the dataset and feeding it to the DataLoader.
    # transforms holds the preprocessing applied to each image (left/right flip, etc.)
    def __init__(self, root, df_path, transforms=None):
        self.root = root
        self.transforms = transforms
        self.df = df_path
        names = pd.read_csv(df_path)[['filename']]
        names = names.drop_duplicates()
        self.imgs = list(np.array(names['filename'].tolist()))

    def __getitem__(self, idx):
        # Load image and check image information
        img_path = os.path.join(self.root, self.imgs[idx])
        if img_path.split('.')[-1] != 'png' : img_path += '.png'
        img = Image.open(img_path).convert("RGB")
        box_list, classes = parse_one_annot(self.df, self.imgs[idx])

        # Convert to a format suitable for learning (torch.Tensor type)
        boxes = torch.as_tensor(box_list, dtype=torch.float32)
        labels = torch.as_tensor(classes, dtype=torch.int64)
        image_id = torch.tensor([idx])
        # area is the area of each RoI
        area_list = [(i[2] - i[0]) * (i[3] - i[1]) for i in box_list]
        areas = torch.as_tensor(area_list, dtype=torch.float32)
        # iscrowd marks crowd instances, which are ignored during evaluation.
        # Here every RoI is treated as a single object, so all values are 0.
        iscrowd = torch.zeros((len(boxes),), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = areas
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
    def __len__(self):
        return len(self.imgs)
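
        The area field is what the COCO metric uses to group boxes by size. The sketch below shows the area formula from __getitem__ together with the size buckets, assuming the standard COCO thresholds of 32x32 and 96x96 pixels. The function names are hypothetical, and the exact handling of values that fall on a threshold is simplified.

```python
def box_area(box):
    """Area of a box given as [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def coco_size_bucket(area):
    """Size bucket by pixel area, assuming COCO thresholds of 32^2 and 96^2."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Hypothetical boxes of increasing size
for box in ([0, 0, 20, 20], [0, 0, 50, 40], [10, 10, 210, 160]):
    area = box_area(box)
    print(box, area, coco_size_bucket(area))
```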

        Now we can instantiate our training and validation datasets and pass them to DataLoaders, which control how images are loaded during training and validation (batch size, etc.).

def get_transform(train):
    transforms = []
    # Converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # Flip the image left/right with 50% probability during training
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
dataset_train = OpenDataset(train_root,'/content/drive/My Drive/test/train.csv', transforms = get_transform(train=True))
dataset_val = OpenDataset(val_root,'/content/drive/My Drive/test/val.csv', transforms = get_transform(train=False))

# Randomly reorder the images in each dataset
indices_train = torch.randperm(len(dataset_train)).tolist()
indices_val = torch.randperm(len(dataset_val)).tolist()
dataset_train = torch.utils.data.Subset(dataset_train, indices_train)
dataset_val = torch.utils.data.Subset(dataset_val, indices_val)

# Define Dataloader
data_loader = torch.utils.data.DataLoader(
    dataset_train, batch_size=4, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_val = torch.utils.data.DataLoader(
    dataset_val, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

print("We have: {} examples, {} are training and {} testing".format(len(dataset_train)+len(dataset_val),
len(dataset_train), len(dataset_val)))
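
        When an image is flipped left to right during training, its boxes have to be flipped as well, otherwise they no longer cover the objects. The transforms module copied from references/detection takes care of this. The sketch below only illustrates the arithmetic for a single box; hflip_box is a hypothetical name.

```python
def hflip_box(box, image_width):
    """Mirror an [x0, y0, x1, y1] box left to right inside an image of the given width."""
    x0, y0, x1, y1 = box
    # The old right edge becomes the new left edge, so x0 < x1 still holds
    return [image_width - x1, y0, image_width - x0, y1]


print(hflip_box([10, 20, 110, 220], image_width=640))  # prints [530, 20, 630, 220]
```

        Flipping the same box twice returns the original coordinates, which is an easy way to test an implementation.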

        What if, instead of defining separate datasets for learning and validation as above, you want to split a single dataset into the two?

dataset_train = OpenDataset(train_root,'/content/drive/My Drive/test/train.csv', transforms = get_transform(train=True))
dataset_val = OpenDataset(train_root,'/content/drive/My Drive/test/train.csv', transforms = get_transform(train=False))

# Split the dataset for learning and validation.
# 40 images of the total are used for validation and the rest for learning.
indices = torch.randperm(len(dataset_train)).tolist()
dataset_train = torch.utils.data.Subset(dataset_train, indices[:-40])
dataset_val = torch.utils.data.Subset(dataset_val, indices[-40:])

data_loader = torch.utils.data.DataLoader(
    dataset_train, batch_size=4, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_val = torch.utils.data.DataLoader(
    dataset_val, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

print("We have: {} examples, {} are training and {} testing".format(len(dataset_train)+len(dataset_val),
len(dataset_train), len(dataset_val)))
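
        The split works because both subsets are cut from the same shuffled list of indices, so no image ends up in both. The following sketch shows the idea with the standard random module in place of torch.randperm. The function name, the seed and the dataset size of 200 are assumptions made for illustration.

```python
import random


def split_indices(n_total, n_val, seed=0):
    """Shuffle 0..n_total-1 and hold out the last n_val indices for validation.

    n_val must be at least 1, because indices[:-0] would be an empty list.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:-n_val], indices[-n_val:]


train_idx, val_idx = split_indices(n_total=200, n_val=40)
print(len(train_idx), len(val_idx))        # prints 160 40
print(set(train_idx).isdisjoint(val_idx))  # prints True
```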

3. Train the model

        1) Download and adjust the model

        Suppose you want to start from a model pre-trained on COCO and fine-tune it for your particular classes. Below is a possible way of doing it.

def get_instance_segmentation_model(num_classes):
    # Load a model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # Replace the classifier with a new one, that has
    # num_classes which is user-defined
    # Get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

        2) Set up the model

# Use the GPU for learning; if no GPU is available, fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
num_classes = 4  # 3 class (number of classname) + 1 class (background)
model = get_instance_segmentation_model(num_classes)

# Move model to GPU or CPU
model.to(device)

# Construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# Construct a learning rate scheduler
# The learning rate decreases by 10x every 5 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=5,
                                               gamma=0.1)
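
        To see what such a schedule does to the learning rate, here is a minimal sketch of the step-decay arithmetic. It assumes a step size of 5 epochs and a decay factor of 0.1, which is one reading of "decreases by 10x every 5 epochs", and the function name step_lr is hypothetical.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.1):
    """Learning rate used in a given epoch (counted from 0) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# With the optimizer's lr=0.005 and 10 epochs, the rate drops once, at epoch 5
for epoch in range(10):
    print(epoch, step_lr(0.005, epoch))
```

        Under these assumptions, epochs 0 to 4 use a rate of 0.005 and epochs 5 to 9 use roughly 0.0005.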

        To train, we write our own for loop over the number of epochs. In each epoch we call the train_one_epoch function from engine.py, update the learning rate, and finally evaluate on the validation data.

num_epochs = 10

for epoch in range(num_epochs):
    # Train for 1 epoch and print every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # Update learning rate
    lr_scheduler.step()
    # Evaluate on the validation data
    evaluate(model, data_loader_val, device=device)

        3) Save the model

torch.save(model.state_dict(), "./model.pth")

4. Test the model

        1) Load the model

model = get_instance_segmentation_model(num_classes)
model.load_state_dict(torch.load("./model.pth"))
model.to(device)

        2) Draw prediction on image

from PIL import ImageDraw

def drawPrediction(img, label_boxes, prediction):
    # Convert the image tensor (C, H, W) in [0, 1] back to a PIL image (H, W, C) in [0, 255]
    image = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())
    draw = ImageDraw.Draw(image)

    # Draw the label boxes in green, then the predicted boxes and their scores in red
    for elem in range(len(label_boxes)):
        draw.rectangle([(label_boxes[elem][0], label_boxes[elem][1]),
        (label_boxes[elem][2], label_boxes[elem][3])], 
        outline ="green", width =3)
    for element in range(len(prediction[0]["boxes"])):
        boxes = prediction[0]["boxes"][element].cpu().numpy()
        score = np.round(prediction[0]["scores"][element].cpu().numpy(),
                            decimals= 4)
        draw.rectangle([(boxes[0], boxes[1]), (boxes[2], boxes[3])], 
        outline ="red", width =3)
        draw.text((boxes[0], boxes[1]), text = str(score))
    return image

        3) Test the model

dataset_test = OpenDataset(test_root,'/content/drive/My Drive/test/test.csv', transforms = get_transform(train=False))
for i in range(len(dataset_test)) :
    img, _ = dataset_test[i]
    label_boxes = np.array(dataset_test[i][1]["boxes"])
    # Put the model in evaluation mode
    model.eval()
    with torch.no_grad():
        prediction = model([img.to(device)])
    result = drawPrediction(img, label_boxes, prediction)

