import pprint as pp

import cv2
import pandas as pd
from fastai2.basics import *
from fastai2.vision.all import *

The Goals

This is the first in a series of posts where I start exploring how to build a real-time video classification system. Warning: I'm not a seasoned deep learning practitioner and I've never worked with video data, so if you're very experienced in this area you might want to stop reading here. If you're like me and new to this space, then welcome; hopefully you'll find something that helps you on your journey.

In the rest of this post I'm going to do the following:

  • Outline the data I'll be working with
  • Create some utility functions to work with videos
  • Explore basic data augmentation
  • Explore creating Dataset objects

and I'll also put forth some ideas and thoughts that have come to mind while getting my hands dirty. Lastly, I'll be sprinkling useful links throughout the post.

Setup & Configuration with Some Gotchas

First, you'll need some libraries installed for working with video.

My first round of googling led me to install OpenCV through conda-forge:

conda install -c conda-forge opencv

After starting down this path, though, I came across some information that videos are now first-class citizens in torchvision (version 0.4.0). You can read about the release here. The TLDR of the release notes is that they're building in video transforms, adding IO support through PyAV, and adding video models to their model zoo for transfer learning.

That said, I ended up installing both libraries and I'll be using both, starting out with OpenCV and moving over to PyAV. One major difference to consider before choosing a library is that if you ever plan to run your code on an edge device, you might be stuck with one or the other. I believe the Jetson Nano only supports GStreamer as a backend, with support for OpenCV, whereas PyAV relies on FFmpeg, so there may be issues trying to embed PyAV code on the Jetson Nano... maybe. This isn't confirmed, but it could be an issue and is worth researching.

The Data

I chose to work with a small subset of classes from the UCF-101 dataset:

In brief, UCF-101 contains videos from 101 action classes: human-object interactions, human body motions, human-human interactions, playing instruments, and sports.

The subset of videos I'm working with is all sports related. I downloaded the data and reorganized the subset into a directory I'm calling DATAPATH:

DATAPATH = Path("/home/bibsian/Desktop/action_subset")
pp.pprint([x for x in DATAPATH.ls()])
[Path('/home/bibsian/Desktop/action_subset/BalanceBeam'),
 Path('/home/bibsian/Desktop/action_subset/trainlist_subset.txt~'),
 Path('/home/bibsian/Desktop/action_subset/trainlist_subset.txt'),
 Path('/home/bibsian/Desktop/action_subset/CliffDiving'),
 Path('/home/bibsian/Desktop/action_subset/testlist_subset.txt'),
 Path('/home/bibsian/Desktop/action_subset/BasketballDunk')]

Here are the class labels I'm working with (the above output is a little hard to read):

video_subsets = [x.name for x in sorted(Path(DATAPATH).ls()) if "txt" not in x.name]
print(video_subsets)
['BalanceBeam', 'BasketballDunk', 'CliffDiving']

And just an FYI, the clip duration for these videos is somewhere between 3 and 6 seconds. Here's the filename of an actual video:

(DATAPATH/"BalanceBeam").ls()[0].name
'v_BalanceBeam_g25_c04.avi'

Train/Validation Labels

As we can see from above, each video filename has the following syntax: v_[Action Label]_g[Group Number]_c[Clip Number]
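
To make that syntax concrete, here's a small hypothetical helper (my own addition, not part of the dataset's tooling) that pulls the pieces out of a filename:

import re

def parse_video_name(name):
    """Split 'v_BalanceBeam_g25_c04.avi' into (action, group, clip)."""
    match = re.match(r"v_(?P<action>.+)_g(?P<group>\d+)_c(?P<clip>\d+)", name)
    return match.group("action"), int(match.group("group")), int(match.group("clip"))

parse_video_name('v_BalanceBeam_g25_c04.avi')
('BalanceBeam', 25, 4)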

When creating the train and validation sets, you need to make sure that no clips from the same group appear in both sets; otherwise there would be data leakage and you'd get inflated performance estimates.

The source was kind enough to provide text files for multiple train/validation splits, but since I'm working with a subset I just hand-made the splits to account for that detail above; let's read the labels in and check them out.

df_train = pd.read_csv(Path(DATAPATH)/"trainlist_subset.txt", header=None, names=["label"])
df_valid = pd.read_csv(Path(DATAPATH)/"testlist_subset.txt", header=None, names=["label"])
df_train.label.head(4)
0    BalanceBeam/v_BalanceBeam_g08_c01.avi
1    BalanceBeam/v_BalanceBeam_g08_c02.avi
2    BalanceBeam/v_BalanceBeam_g08_c03.avi
3    BalanceBeam/v_BalanceBeam_g08_c04.avi
Name: label, dtype: object
df_valid.label.head(4)
0    BalanceBeam/v_BalanceBeam_g01_c01.avi
1    BalanceBeam/v_BalanceBeam_g01_c02.avi
2    BalanceBeam/v_BalanceBeam_g01_c03.avi
3    BalanceBeam/v_BalanceBeam_g01_c04.avi
Name: label, dtype: object
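
Since the whole point of the split is keeping groups separate, a quick sanity check is worth running. Here's a minimal sketch (assuming the g## naming convention above) that pulls the group out of each path and asserts the train and validation groups don't overlap:

# Sanity check: no group should appear in both the train and validation splits
def extract_group(label):
    stem = Path(label).stem               # 'v_BalanceBeam_g08_c01'
    action, group = stem.split("_")[1:3]  # ['BalanceBeam', 'g08']
    return f"{action}_{group}"

train_groups = set(df_train.label.map(extract_group))
valid_groups = set(df_valid.label.map(extract_group))
assert train_groups.isdisjoint(valid_groups), "Group leakage between train and validation!"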

Working with the data

So now that you know how to create the train/validation splits let's start writing some code to get a feel for working with videos.

Image Shapes

Before we start working with videos, let's take a step back and think about how we usually work with images.

You can typically represent a regular colored image in the form of a tensor with dimensions, $(C, H, W)$, where $C=Channel$, $H=height$, and $W=width$. Then when passing the data through a network you can tack on the batch size to the front of that tensor so images end up getting processed on the GPU in batches via $(N, C, H, W)$ dimensions, where $N$ is the batch size.

That said, when we start working with videos we need to add a dimension for time. The tensors then have dimensions $(N, C, T, H, W)$, where all the previous dimensions stay the same and $T=time$ accounts for the stack of images that make up a clip of the video (a single frame is a single image; a clip is a stack of frames). And when you actually get around to training a model, you can pass tensors with these dimensions into an nn.Conv3d convolution.
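
To make the shape convention concrete, here's a minimal sketch (random data and layer sizes I picked purely for illustration) of a $(N, C, T, H, W)$ batch going through nn.Conv3d:

import torch
import torch.nn as nn

# fake batch: 2 clips, 3 channels, 16 frames, 112x112 pixels
fake_batch = torch.randn(2, 3, 16, 112, 112)  # (N, C, T, H, W)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
conv3d(fake_batch).shape
torch.Size([2, 8, 16, 112, 112])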

But before we start playing with that, let's just get a feel for video tensors.

Video Tensors

Paths for sample videos

Here I'm setting the Path for a few sample videos

def take_file_from_path(path, list_ix=0):
    """Helper function to take a file from a path by index (defaults to the first)."""
    return str((path).ls()[list_ix])

BALANCE_TEST, DUNK_TEST, CLIFF_TEST = [take_file_from_path(DATAPATH/x) for x in video_subsets]
print(BALANCE_TEST)
/home/bibsian/Desktop/action_subset/BalanceBeam/v_BalanceBeam_g25_c04.avi
Video utils

Here are some functions that will help us work with videos in a very basic way.

from functools import partial

def get_video_feature(video_cap, feature):
    return video_cap.get(feature)

vid_feature = lambda feature: partial(get_video_feature, feature=feature)

get_n_frames = vid_feature(cv2.CAP_PROP_FRAME_COUNT)

get_width = vid_feature(cv2.CAP_PROP_FRAME_WIDTH)

get_height = vid_feature(cv2.CAP_PROP_FRAME_HEIGHT)

get_fps = vid_feature(cv2.CAP_PROP_FPS)  # frames per second

def get_shape(video_cap):
    return int(get_height(video_cap)), int(get_width(video_cap))

def get_frame_features(video_cap):
    return int(get_fps(video_cap)), int(get_n_frames(video_cap))

def describe_video(video_cap: cv2.VideoCapture):
    h, w = get_shape(video_cap)  # get_shape returns (height, width)
    fps, n_frames = get_frame_features(video_cap)
    print(f"Height: {h}, Width: {w}, FPS: {fps}, n_frames: {n_frames}")

def standard_reader(stream_cap, n_frames=30):
    frames = []  # list of (H, W, C) frames; stacking gives (T, H, W, C)
    for _ in range(n_frames):
        read_ok, image = stream_cap.read()
        if not read_ok:
            break
        frames.append(image)
    assert frames, "Check file path, no frames coming from stream capture"
    return frames

def read_video_file(path, n_frames=30, silence=True):
    """Read video from the beginning and return the first n_frames."""
    stream_capture = cv2.VideoCapture(str(path))
    if not silence:
        print(f"Test {str(path)}"); describe_video(stream_capture)
    frames = standard_reader(stream_capture, n_frames)
    stream_capture.release()
    return frames

Now that we have some primitives to read in a video and return a list of frames, we need to convert them to a tensor and rearrange it into the dimensions we're looking for, $(C, T, H, W)$.

# reorder video tensor shape to (C,T,H,W)
video_to_tensor = lambda frames: tensor(frames).permute(3,0,1,2)

Okay, so let's read some videos, turn them into tensors, and check out some data.

Test reads
cliff_tensor = video_to_tensor(read_video_file(CLIFF_TEST, silence=False))
Test /home/bibsian/Desktop/action_subset/CliffDiving/v_CliffDiving_g01_c06.avi
Height: 240, Width: 320, FPS: 25, n_frames: 76
cliff_tensor.shape
torch.Size([3, 30, 240, 320])

We're going to use a fast.ai helper function here for plotting images (show_images)

show_images(cliff_tensor[2,1:6,...], imsize=5)
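
One thing to keep in mind: the slice above pulls out a single channel, so those frames render in grayscale, and OpenCV reads frames in BGR order. If you want to eyeball full-colour frames, a quick sketch (relying on show_images accepting a stack of (C, H, W) images) looks like this:

# flip BGR -> RGB and reorder to (T, C, H, W) so each item is a colour frame
rgb_frames = cliff_tensor.flip(0).permute(1, 0, 2, 3)
show_images(rgb_frames[1:6], imsize=5)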

Cool. We were able to read in the data, reorder the indices so they're in the $(C,T,H,W)$ shape, and view a snippet of the frames. Let's check out a sample of the other video action subsets I'm working with:

Obviously that one above is a diver, so let's check out the basketball dunk.

show_images(video_to_tensor(read_video_file(DUNK_TEST, silence=False))[2, 1:6, ...], imsize=5)
Test /home/bibsian/Desktop/action_subset/BasketballDunk/v_BasketballDunk_g16_c01.avi
Height: 240, Width: 320, FPS: 25, n_frames: 76

And here we have the balance beam.

balance_tensor = video_to_tensor(read_video_file(BALANCE_TEST, silence=False))
Test /home/bibsian/Desktop/action_subset/BalanceBeam/v_BalanceBeam_g25_c04.avi
Height: 240, Width: 320, FPS: 25, n_frames: 102
show_images(balance_tensor[2, 25:, ...], imsize=5) # index from 25 to end of shape (30 in this case)

Cool; we can read in video clips and turn them into tensors. I want to reiterate that I'm still uncertain how to reshape a multi-channel 3D stack of images into a vector, but it's something to think about.

Video Transforms & Data Augmentation

Alright, now that we have a basic understanding of how to manipulate a video and cast it into a tensor, I needed to figure out how to transform video tensors so that we can do data augmentation prior to training; think cropping, resizing, rotating, etc.

The thing about the data augmentations in fast.ai is that I couldn't get them to work on video tensors; they only seem to work on a single PyTorch image or PIL image. After poking around I found the PyTorch video transforms code base (I couldn't get the imports to work, so I copied the source code and linked it below); check out all those transforms!

balance_tensor.shape
torch.Size([3, 30, 240, 320])
# Torch reference
# https://github.com/stephenyan1231/vision/blob/video_transforms/references/video_classification/transforms.py
import torch
import random


def crop(vid, i, j, h, w):
    return vid[..., i:(i + h), j:(j + w)]


def center_crop(vid, output_size):
    h, w = vid.shape[-2:]
    th, tw = output_size

    i = int(round((h - th) / 2.))
    j = int(round((w - tw) / 2.))
    return crop(vid, i, j, th, tw)


def hflip(vid):
    return vid.flip(dims=(-1,))


# NOTE: for those functions, which generally expect mini-batches, we keep them
# as non-minibatch so that they are applied as if they were 4d (thus image).
# this way, we only apply the transformation in the spatial domain
def resize(vid, size, interpolation='bilinear'):
    # NOTE: using bilinear interpolation because we don't work on minibatches
    # at this level
    scale = None
    if isinstance(size, int):
        scale = float(size) / min(vid.shape[-2:])
        size = None
    return torch.nn.functional.interpolate(
        vid, size=size, scale_factor=scale, mode=interpolation, align_corners=False)


def pad(vid, padding, fill=0, padding_mode="constant"):
    # NOTE: don't want to pad on temporal dimension, so let as non-batch
    # (4d) before padding. This works as expected
    return torch.nn.functional.pad(vid, padding, value=fill, mode=padding_mode)


def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255


def normalize(vid, mean, std):
    shape = (-1,) + (1,) * (vid.dim() - 1)
    mean = torch.as_tensor(mean).reshape(shape)
    std = torch.as_tensor(std).reshape(shape)
    return (vid - mean) / std


# Class interface

class RandomCrop(object):
    def __init__(self, size):
        self.size = size

    @staticmethod
    def get_params(vid, output_size):
        """Get parameters for ``crop`` for a random crop.
        """
        h, w = vid.shape[-2:]
        th, tw = output_size
        if w == tw and h == th:
            return 0, 0, h, w
        i = random.randint(0, h - th)
        j = random.randint(0, w - tw)
        return i, j, th, tw

    def __call__(self, vid):
        i, j, h, w = self.get_params(vid, self.size)
        return crop(vid, i, j, h, w)


class CenterCrop(object):
    def __init__(self, size):
        self.size = size

    def __call__(self, vid):
        return center_crop(vid, self.size)


class Resize(object):
    def __init__(self, size):
        self.size = size

    def __call__(self, vid):
        return resize(vid, self.size)


class ToFloatTensorInZeroOne(object):
    def __call__(self, vid):
        return to_normalized_float_tensor(vid)


class Normalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, vid):
        return normalize(vid, self.mean, self.std)


class RandomHorizontalFlip(object):
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, vid):
        if random.random() < self.p:
            return hflip(vid)
        return vid


class Pad(object):
    def __init__(self, padding, fill=0):
        self.padding = padding
        self.fill = fill

    def __call__(self, vid):
        return pad(vid, self.padding, self.fill)

So let's see if these bad boys work:

balance_tensor.shape
torch.Size([3, 30, 240, 320])
show_images(balance_tensor[1, 20:25, ...])
cropped_balance_tensor = CenterCrop((128, 128))(balance_tensor.float())
show_images(cropped_balance_tensor[1, 20:25, ...])

Isn't that pretty cool!
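
For training we'll want more than one augmentation at a time, so here's a minimal sketch of chaining a few of the classes above with torchvision's Compose. The mean/std values are the Kinetics statistics used in torchvision's video classification reference; treat them as placeholders for whatever statistics fit your data.

from torchvision import transforms

# assumes the input is already a float (C, T, H, W) tensor scaled to [0, 1]
train_tfms = transforms.Compose([
    Resize((128, 171)),           # shrink the spatial dims
    RandomHorizontalFlip(p=0.5),  # flip along the width axis
    RandomCrop((112, 112)),       # random spatial crop
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989]),
])

augmented = train_tfms(balance_tensor.float() / 255)
augmented.shape
torch.Size([3, 30, 112, 112])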

Dataset objects

So, now we know how to split our data, read videos into memory and reshape them as necessary, and augment the video tensors. The next logical step is to build some utilities for creating Dataset objects (tuples of (X, y) pairs where X is the training sample and y is the label).

If you ran the code where we loaded the video with OpenCV, you'd notice it was pretty slow. Doing this for a ton of videos and chunking them out into small clips (multiple clips per video) would take a while and also require us to build the Dataset classes ourselves. Luckily, torchvision has some utilities for this that leverage the PyAV library.

from torchvision.datasets.video_utils import VideoClips

# Source: https://github.com/pytorch/vision/releases/tag/v0.4.0
class MyVideoDataset(object):
    def __init__(self, video_paths):
        self.video_clips = VideoClips(video_paths,
                                      clip_length_in_frames=16,
                                      frames_between_clips=1,
                                      frame_rate=15)

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video, audio
    
    def __len__(self):
        return self.video_clips.num_clips()

You might get an error when reading in data (see here); I edited the source code to fix it, but there might be other workarounds if you run into the error linked.

So let's try to create a dataset from 2 videos.

vids_for_ds = [str(x) for x in get_files((DATAPATH/"BalanceBeam"))][0:2]
len(vids_for_ds) # 2 videos
2
#hide_output
ds = MyVideoDataset(vids_for_ds)
len(ds)
106

Let's look at a sample we created

ds[0][0].shape
torch.Size([16, 240, 320, 3])

Looks like it took 16 frames but didn't permute the dimensions to what we want, so let's use our helper function from before to rearrange and check out a sample.

show_images(video_to_tensor(ds[4][0])[1, 1:8, ...])
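
The dataset above only returns the clip and the audio; for actual training we'd also want a label per clip and a DataLoader. Here's a rough sketch of one way to do that (the label lookup is my own addition, not part of the torchvision example; I'm hard-coding class index 0 since both sample videos are BalanceBeam clips):

from torch.utils.data import DataLoader

class LabeledVideoDataset(MyVideoDataset):
    """Hypothetical extension that also returns a class index per clip."""
    def __init__(self, video_paths, labels):
        super().__init__(video_paths)
        self.labels = labels  # one label per video, same order as video_paths

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        video = video.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
        return video, self.labels[video_idx]

labeled_ds = LabeledVideoDataset(vids_for_ds, labels=[0, 0])
dl = DataLoader(labeled_ds, batch_size=4, shuffle=True)
xb, yb = next(iter(dl))
xb.shape, yb.shape  # expect (torch.Size([4, 3, 16, 240, 320]), torch.Size([4]))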

Closing

So there we have it. Hopefully you have a better understanding of how to leverage videos in your applications. Feel free to comment, and thanks for reading.