VideoTrails

VideoTrails:
Representing & Visualizing Structure
in
Video Sequences

by
Vikrant Kobla
David Doermann
Christos Floutsos

Summarized by Ronnie Maor

Introduction

Information in the form of digital video has become widely available over the last few years. This raises the need for means to index and analyze this form of information automatically, as opposed to methods requiring a human viewer to manually extract the relevant information from the video.

This paper describes a technique which uses the temporal structure of the video - i.e. how the individual frames change over time, as opposed to anylizing each frame by itself. The technique described maps each frame to a single point in a low dimensional space. These points, ordered the same way their frames were in the video, constitute the new representation of the video, called a VideoTrail. The technique presented in this paper was specifically designed to work on video sequences encoded using the popular MPEG compression standard. This approach is more efficient that working on the uncompressed pixel data, because we don't need to pay the overhead of decompressing the video. The technique also relies on interesting features encoded in the MPEG stream, such as the differences between frames that are close to each other in time (when encoding P and B pictures). A pretty easy to read introduction to MPEG can be found at bmrc.berkeley.edu/projects/mpeg/mpeg_overview.html. This introduction contains all the information needed in order to read the article (or this summary). A list of frequently asked questions about MPEG containing more detailed information can be found at bmrc.berkeley.edu/projects/mpeg/faq/MPEG-2-FAQ.html.

Creating the VideoTrail

As mentioned above, generating the VideoTrail representation of a video essentialy consists of mapping each frame to a point in a low dimensional space. This is done in two stages:

For each frame we extract a set of features. These features are simply the set of DC values for each of the blocks in the frame.
For I pictures the DC components are readily available. For P and B pictures, the DC components are estimated using a weighted average of the DC components belonging to the Macroblocks which overlap the reference area.
This constitutes a mapping from frames to points in a high dimensional space (typically the dimension is a few thousands).
The points are projected onto a low dimensional space using an algorithm previously developed by the authors, called FastMap. This algorithm approximatly preserves the distances between points, and has the advantage of working in time O(n), where n is the nubmer of input points.

The figure below shows an example of a VideoTrail generated in 3 dimensions. Successive points are connected to show the flow of the video clip. The clip shown is a news interview. It consists of three types of shots - the interviewer, the interviewee, and a medium shot showing both of them.

Analyzing the VideoTrail

The usefullness of the VideoTrail representation is demonstrated by using it to segment the video into different shots, and to detect gradual transitions between them - a non-trivial problem in the compressed domain.
The algorithms presented in this paper were designed to work on points in a 3 dimensional space. An modification of the algorithms for a higher dimensional space can be found in more recent papers.

The analysis consists of three distinct parts:

The VideoTrail is split into sections. Each section corresponds to a shot in the video.
This is done in an online type of algorithm which decides for each point seen whether it will be added to the current section or start a new one.
Intuitively, the point is added to the current section as long as adding it does not cause the Minimal Bounding Rectangle (MBR) of the section to increase too much.
Each section is classified as having either a low or high activity level.
This is done by defining 4 criterions. Each criterion receives a value in [0,1] for the given section. The weighted average of these criterions is then computed, and segments having an overall average of more than 0.5 are classified as having a high level of activity, while those having an average of less than 0.5 are classified as having a low level of activity.
A subset of the group of segments having high activity are classified as being gradual transitions.
In order to do this, we assume that segments having high activity can either be gradual transitions, or they must contain a large degree of motion. Techniques for detecting global motion (such as that caused by panning or tilting), and for detecting zooms are presented. These techniques use the coding of Motion Vectors in the MPEG video.
The high activity segments that were not identified as containing motion, are classified as gradual transitions.

Results

In order to assess the performance of the technique presented for detecting gradual transitions, the authors ran it on 13 video clips containing a wide variety of content, and many types of gradual transitions such as dissolves, fades, wipes, and other special effect edits. The clips contain about 29,000 frames altogether, and 135 gradual transitions. The results of running the algorithms on the clips were compared to ground truth obtained by manually analyzing the clips.
The results show a recall rate of 90.4% (percent of gradual transitions identified as such), and a precision of 89.1% (percent of segments identified as gradual transitions that were actually such). However, it is hard to judge how good these results are, since the paper doesn't present a comparison with existing techniques for solving the same problem.

Other Applications

The paper demonstrated the usefullness of the VideoTrail representation for detecting gradual transitions between shots. Some other problems for which this representation might be usefull are:

Identifying similar shots (such as the same scene shot from two different cameras). Such shots will tend to occupy close regions in the VideoTrail representation.
Classification of action and conversational shots.
Key frame selection. A typical selection of a key from for a shot is the first frame of that shot. The VideoTrail representation suggests using the frame closest to the center of the MBR, or center of mass of the shot. It is likely that this frame will represent the entire shot more accurately.
Once we have identified shots, identified similar shtos, and extracted a key frame for each shot, we can represent the entire structure of the video as a directed graph with nodes corresponding to the extracted key frames, and edges corresponding to moves between them in the video. This could be used for performing analysis for extracting story units.

More Information

The complete paper, as well as previous papers by the authors referred to in this paper, and more recent work, can be downloaded from Vikrant Kobla's research homepage