Action Detection - Part 1
Adding that extra dimension for images
import pprint as pp
import cv2
import pandas as pd
from fastai2.basics import *
from fastai2.vision.all import *
The Goals
This is the first in a series of posts where I start exploring how to build a real-time video classification system. Fair warning: I'm not a seasoned deep learning practitioner and I've never worked with video data, so if you're very experienced in this area you might want to stop reading here. If you're like me and new to this space, then welcome; hopefully you'll find something that helps you on your journey.
In the rest of this post I'm going to do the following:
- Outline the data I'll be working with
- Create some utility functions to work with videos
- Explore basic data augmentation
- Explore creating Dataset objects
and I'll also put forth some ideas and thoughts that have come to mind while getting my hands dirty. Lastly, I'll be sprinkling useful links throughout the post.
Setup & Configuration with Some Gotchas
First, you're going to need a few libraries installed for working with video.
My first round of googling led me to install OpenCV through conda-forge:
conda install -c conda-forge opencv
After starting down this path though, I came across some information that videos are now first-class citizens in torchvision
(version 0.4.0). You can read about the release here. The TLDR of the release notes is that they're building in video transforms, adding IO support through PyAV, and adding to their model zoo for transfer learning.
That said, I ended up installing both libraries and I'll be using both, starting out with OpenCV and moving over to PyAV. One major difference you should consider before choosing a library is that if you ever plan to run your code on an edge device you might be stuck with one or the other. I believe the Jetson Nano only supports GStreamer as a backend, which works with OpenCV, while PyAV relies on ffmpeg, so there may be issues trying to embed PyAV code on the Jetson Nano... maybe. This isn't confirmed, but it could be an issue and is worth researching.
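Just to preview what the PyAV-backed IO looks like (I'll mostly use OpenCV in this post), here's a minimal sketch using torchvision's read_video; this assumes torchvision >= 0.4.0 with PyAV installed, and the file path below is just a placeholder.
# Minimal sketch of torchvision's PyAV-backed video IO (the path is a placeholder)
import torchvision

frames, audio, info = torchvision.io.read_video("some_clip.avi")
print(frames.shape)  # (T, H, W, C) uint8 tensor of decoded frames
print(info)          # metadata dict, e.g. video_fps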
In brief, the dataset I'll be working with is UCF-101, which contains videos of 101 action classes spanning human-object interactions, human body motions, human-human interactions, playing instruments, and sports.
The subset of videos I'm working with is all sports related. I downloaded the data and reorganized the subset into a directory I'm calling DATAPATH:
DATAPATH = Path("/home/bibsian/Desktop/action_subset")
pp.pprint([x for x in DATAPATH.ls()])
Here are the class labels I'm working with (the above output is a little hard to read):
video_subsets = [x.name for x in sorted(Path(DATAPATH).ls()) if "txt" not in x.name]
print(video_subsets)
And just an FYI, the clip duration for these videos is somewhere between 3 and 6 seconds. Here's the filename of an actual video:
(DATAPATH/"BalanceBeam").ls()[0].name
As we can see from above, each video filename has the following syntax: v_[Action Label]_g[Group Number]_c[Clip Number]
When creating the train and validation sets you need to make sure that no clips from the same group end up in both sets; otherwise there would be data leakage and you'd get misleadingly good performance.
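To make that concrete, here's a minimal sketch (these are not the official split files) of how you could parse the group number out of each filename and assign whole groups to either the train or validation side; the filenames below are made up but follow the naming scheme above.
import re, random

def parse_group(fname):
    """Pull the group number out of a name like v_BalanceBeam_g08_c01.avi."""
    return int(re.search(r"_g(\d+)_c\d+", fname).group(1))

def group_split(fnames, valid_frac=0.25, seed=42):
    """Assign whole groups to train or valid so no group spans both sets."""
    groups = sorted({parse_group(f) for f in fnames})
    random.Random(seed).shuffle(groups)
    valid_groups = set(groups[:max(1, int(len(groups) * valid_frac))])
    train = [f for f in fnames if parse_group(f) not in valid_groups]
    valid = [f for f in fnames if parse_group(f) in valid_groups]
    return train, valid

# Made-up filenames that follow the v_[Action]_g[Group]_c[Clip] scheme
names = ["v_BalanceBeam_g08_c01.avi", "v_BalanceBeam_g08_c02.avi",
         "v_BalanceBeam_g12_c01.avi", "v_BalanceBeam_g15_c03.avi"]
train_names, valid_names = group_split(names)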
The source was kind enough to provide text files for multiple train/validation splits, but since I'm working with a subset I just hand-made the splits to account for the detail above; let's read the labels in and check them out.
df_train = pd.read_csv(Path(DATAPATH)/"trainlist_subset.txt", header=None, names=["label"])
df_valid = pd.read_csv(Path(DATAPATH)/"testlist_subset.txt", header=None, names=["label"])
df_train.label.head(4)
df_valid.label.head(4)
Working with the data
So now that you know how to create the train/validation splits, let's start writing some code to get a feel for working with videos.
Image Shapes
Before we start working with videos, let's take a step back and think about how we usually work with images.
You can typically represent a regular color image as a tensor with dimensions $(C, H, W)$, where $C$ = channels, $H$ = height, and $W$ = width. Then, when passing the data through a network, you tack the batch size onto the front of that tensor, so images end up getting processed on the GPU in batches with dimensions $(N, C, H, W)$, where $N$ is the batch size.
That said, when we start working with videos we need to think about adding a dimension for time. When you do this, tensors take on dimensions of $(N, C, T, H, W)$, where all of the previous dimensions stay the same but you add $T$ = time to account for the stack of frames that make up a clip of the video (a single frame is just a single image). And when you actually get around to training a model you can pass tensors of these dimensions into an nn.Conv3d convolution.
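To make the shapes concrete, here's a tiny sketch that pushes a random image batch through nn.Conv2d with $(N, C, H, W)$ and a random clip batch through nn.Conv3d with $(N, C, T, H, W)$; all of the sizes here are arbitrary.
import torch
import torch.nn as nn

# A batch of 4 RGB images: (N, C, H, W)
image_batch = torch.randn(4, 3, 112, 112)
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv2d(image_batch).shape)  # torch.Size([4, 8, 112, 112])

# A batch of 4 RGB clips with 16 frames each: (N, C, T, H, W)
video_batch = torch.randn(4, 3, 16, 112, 112)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv3d(video_batch).shape)  # torch.Size([4, 8, 16, 112, 112])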
But before we start playing with that, let's just get a feel for video tensors.
def take_file_from_path(path, list_ix=0):
    """Helper function to grab the file at index `list_ix` from a path"""
    return str((path).ls()[list_ix])
BALANCE_TEST, DUNK_TEST, CLIFF_TEST = [take_file_from_path(DATAPATH/x) for x in video_subsets]
print(BALANCE_TEST)
from functools import partial
def get_video_feature(video_cap, feature):
    return video_cap.get(feature)
vid_feature = lambda feature: partial(get_video_feature, feature=feature)
get_n_frames = vid_feature(cv2.CAP_PROP_FRAME_COUNT)
get_width = vid_feature(cv2.CAP_PROP_FRAME_WIDTH)
get_height = vid_feature(cv2.CAP_PROP_FRAME_HEIGHT)
get_fps = vid_feature(cv2.CAP_PROP_FPS) # frames per second
def get_shape(video_cap):
    return int(get_height(video_cap)), int(get_width(video_cap))

def get_frame_features(video_cap):
    return int(get_fps(video_cap)), int(get_n_frames(video_cap))
def describe_video(video_cap: cv2.VideoCapture):
    h, w = get_shape(video_cap)  # get_shape returns (height, width)
    fps, n_frames = get_frame_features(video_cap)
    print(f"Height: {h}, Width: {w}, FPS: {fps}, n_frames: {n_frames}")
def standard_reader(stream_cap, n_frames=30):
    frames = []  # a list of (H, W, C) frames, i.e. (T, H, W, C) once stacked
    for f_index in range(n_frames):
        read_ok, image = stream_cap.read()
        if not read_ok:  # stop early if the stream runs out of frames
            break
        frames.append(image)
    assert frames, "Check file path, no frames coming from stream capture"
    return frames
def read_video_file(path, n_frames=30, silence=True):
    """ Read video from the beginning and return the first n_frames """
    stream_capture = cv2.VideoCapture(str(path))
    if not silence:
        print(f"Test {str(path)}"); describe_video(stream_capture)
    return standard_reader(stream_capture, n_frames)
Now that we have some primitives to read in a video and return a list of frames, we need to turn that list into a tensor and rearrange it so it's in the shape we're looking for, $(C, T, H, W)$.
# OpenCV frames come back as (H, W, C), so the stacked list is (T, H, W, C);
# permute to reorder the video tensor to (C, T, H, W)
video_to_tensor = lambda frames: tensor(frames).permute(3, 0, 1, 2)
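One side note: OpenCV decodes frames in BGR order rather than RGB, so if you ever need RGB (for natural-looking colors or for models pretrained on RGB images) you can flip the channel axis after the permute; a one-liner sketch:
# OpenCV gives BGR frames; flipping dim 0 of a (C, T, H, W) tensor yields RGB
bgr_to_rgb = lambda vid_tensor: vid_tensor.flip(0)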
Okay, so let's read some videos, turn them into tensors, and check out some data.
cliff_tensor = video_to_tensor(read_video_file(CLIFF_TEST, silence=False))
cliff_tensor.shape
We're going to use a fast.ai helper function here for plotting images (show_images).
show_images(cliff_tensor[2,1:6,...], imsize=5)
Cool. We were able to read in the data, reorder the indices so they're in the $(C, T, H, W)$ shape, and view a snippet of the frames. Let's check out a sample from the other video action subsets I'm working with:
Obviously the one above is a cliff diver, so let's check out the basketball dunk.
show_images(video_to_tensor(read_video_file(DUNK_TEST, silence=False))[2, 1:6, ...], imsize=5)
And here we have the balance beam:
balance_tensor = video_to_tensor(read_video_file(BALANCE_TEST, silence=False))
show_images(balance_tensor[2, 25:, ...], imsize=5) # index from 25 to end of shape (30 in this case)
Cool; we can read in video clips and turn them into tensors. I'll note that I'm still uncertain how to reshape a multi-channel 3D stack of images into a vector, but it's something to think about.
Video Transforms & Data Augmentation
Alright, now that we have a basic understanding of how to manipulate a video and cast it into a tensor, I needed to figure out how to transform video tensors so that we can do data augmentation prior to training; think cropping, resizing, rotating, etc.
The catch with the data augmentation transforms in fast.ai is that I couldn't get them to work on video tensors; they only seem to work on a single PyTorch image or PIL image. After poking around I found the PyTorch video transforms code base (I couldn't get it to work with imports, so I copied the source code and linked it below); check out all those transforms!
balance_tensor.shape
# Torch reference
# https://github.com/stephenyan1231/vision/blob/video_transforms/references/video_classification/transforms.py
import torch
import random
def crop(vid, i, j, h, w):
    return vid[..., i:(i + h), j:(j + w)]

def center_crop(vid, output_size):
    h, w = vid.shape[-2:]
    th, tw = output_size
    i = int(round((h - th) / 2.))
    j = int(round((w - tw) / 2.))
    return crop(vid, i, j, th, tw)

def hflip(vid):
    return vid.flip(dims=(-1,))

# NOTE: for those functions, which generally expect mini-batches, we keep them
# as non-minibatch so that they are applied as if they were 4d (thus image).
# this way, we only apply the transformation in the spatial domain
def resize(vid, size, interpolation='bilinear'):
    # NOTE: using bilinear interpolation because we don't work on minibatches
    # at this level
    scale = None
    if isinstance(size, int):
        scale = float(size) / min(vid.shape[-2:])
        size = None
    return torch.nn.functional.interpolate(
        vid, size=size, scale_factor=scale, mode=interpolation, align_corners=False)

def pad(vid, padding, fill=0, padding_mode="constant"):
    # NOTE: don't want to pad on temporal dimension, so let as non-batch
    # (4d) before padding. This works as expected
    return torch.nn.functional.pad(vid, padding, value=fill, mode=padding_mode)

def to_normalized_float_tensor(vid):
    return vid.permute(3, 0, 1, 2).to(torch.float32) / 255

def normalize(vid, mean, std):
    shape = (-1,) + (1,) * (vid.dim() - 1)
    mean = torch.as_tensor(mean).reshape(shape)
    std = torch.as_tensor(std).reshape(shape)
    return (vid - mean) / std
# Class interface
class RandomCrop(object):
    def __init__(self, size):
        self.size = size

    @staticmethod
    def get_params(vid, output_size):
        """Get parameters for ``crop`` for a random crop."""
        h, w = vid.shape[-2:]
        th, tw = output_size
        if w == tw and h == th:
            return 0, 0, h, w
        i = random.randint(0, h - th)
        j = random.randint(0, w - tw)
        return i, j, th, tw

    def __call__(self, vid):
        i, j, h, w = self.get_params(vid, self.size)
        return crop(vid, i, j, h, w)

class CenterCrop(object):
    def __init__(self, size):
        self.size = size

    def __call__(self, vid):
        return center_crop(vid, self.size)

class Resize(object):
    def __init__(self, size):
        self.size = size

    def __call__(self, vid):
        return resize(vid, self.size)

class ToFloatTensorInZeroOne(object):
    def __call__(self, vid):
        return to_normalized_float_tensor(vid)

class Normalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, vid):
        return normalize(vid, self.mean, self.std)

class RandomHorizontalFlip(object):
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, vid):
        if random.random() < self.p:
            return hflip(vid)
        return vid

class Pad(object):
    def __init__(self, padding, fill=0):
        self.padding = padding
        self.fill = fill

    def __call__(self, vid):
        return pad(vid, self.padding, self.fill)
So let's see if these bad boys work:
balance_tensor.shape
show_images(balance_tensor[1, 20:25, ...])
cropped_balance_tensor = CenterCrop((128, 128))(balance_tensor.float())
show_images(cropped_balance_tensor[1, 20:25, ...])
Isn't that pretty cool!
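And since each of these transforms is just a callable, you can chain them into a small augmentation pipeline; here's a minimal sketch using torchvision.transforms.Compose with the classes above (the sizes and normalization stats are placeholders, not values tuned for this dataset).
from torchvision import transforms

# A sketch of an augmentation pipeline built from the classes above;
# sizes and mean/std are placeholders, not tuned for this dataset.
train_tfms = transforms.Compose([
    Resize((128, 171)),            # resize the spatial dims of every frame
    RandomHorizontalFlip(p=0.5),   # flip the whole clip half the time
    RandomCrop((112, 112)),        # the same random crop is applied to all frames
    Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
])

augmented = train_tfms(balance_tensor.float() / 255)  # scale to [0, 1] first
augmented.shape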
If you ran the code where we loaded the video with OpenCV, you'd notice it was pretty slow. Doing this for a ton of videos and chunking them out into small clips (multiple clips per video) would take a while and also require us to build the Dataset classes ourselves. Luckily, PyTorch has some utilities for this that leverage the PyAV library.
from torchvision.datasets.video_utils import VideoClips
# Source: https://github.com/pytorch/vision/releases/tag/v0.4.0
class MyVideoDataset(object):
    def __init__(self, video_paths):
        self.video_clips = VideoClips(video_paths,
                                      clip_length_in_frames=16,
                                      frames_between_clips=1,
                                      frame_rate=15)

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video, audio

    def __len__(self):
        return self.video_clips.num_clips()
You might get an error when reading in data (see here); I edited the source code to fix it, but there might be other workarounds if you run into the error linked.
So let's try to create a dataset from 2 videos.
vids_for_ds = [str(x) for x in get_files((DATAPATH/"BalanceBeam"))][0:2]
len(vids_for_ds) # 2 videos
#hide_output
ds = MyVideoDataset(vids_for_ds)
len(ds)
Let's look at a sample we created
ds[0][0].shape
Looks like it grabbed 16 frames per clip but didn't permute the dimensions to what we want, so let's use a helper function from before to rearrange it and check out a sample.
show_images(video_to_tensor(ds[4][0])[1, 1:8, ...])
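From here it's a short hop to feeding clips into a model: wrap the dataset in a regular DataLoader and permute each clip to $(C, T, H, W)$ on the way out. Here's a minimal sketch (the batch size is arbitrary, and a real dataset would also return a class label).
from torch.utils.data import DataLoader

class PermutedVideoDataset(MyVideoDataset):
    """Same clips as MyVideoDataset, but returned as float (C, T, H, W) tensors."""
    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video.permute(3, 0, 1, 2).float() / 255  # (T,H,W,C) -> (C,T,H,W)

dl = DataLoader(PermutedVideoDataset(vids_for_ds), batch_size=4, shuffle=True)
batch = next(iter(dl))
batch.shape  # (N, C, T, H, W) -- the shape an nn.Conv3d-based model expects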
So there we have it. Hopefully you now have a better understanding of how to leverage videos in your applications. Feel free to comment, and thanks for reading.