Video Surveillance

Li Yin
Machine Learning for Li
May 10, 2018

The overall architecture of a video surveillance system is shown in Figure 1.

The most influential academic project is the Video Surveillance and Monitoring (VSAM) project at CMU (Collins et al., 2000; the project ran from 1997 to 1999), alongside related efforts at other institutions (Borg et al., 2005; PETS, 2007).

1. Motion Segmentation

Techniques include temporal differencing, background subtraction, and optical flow.

Temporal differencing extracts the pixel-wise difference between consecutive frames (usually two or three). It is very fast and adaptive to dynamic environments, but generally does a poor job of extracting all the relevant pixels; for example, holes may be left inside moving entities.
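As a rough sketch of how this looks in practice, the loop below computes and thresholds a two-frame difference with OpenCV; the video path and the threshold value of 25 are placeholder assumptions, not from any particular system:

```python
import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # placeholder input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixel-wise absolute difference between consecutive frames.
    diff = cv2.absdiff(gray, prev_gray)
    # Threshold (25 is an arbitrary choice) to get a binary motion mask;
    # note the holes this tends to leave inside moving objects.
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    prev_gray = gray
```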

Background subtraction is very popular for applications with relatively static backgrounds. It detects moving regions by taking the difference between the current image and a reference background image in a pixel-by-pixel fashion. However, it is extremely sensitive to changes in scene lighting and to extraneous events.
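A minimal sketch, assuming a static reference image of the empty scene is available (the file name and threshold are hypothetical):

```python
import cv2

# Reference background: a snapshot of the empty scene (placeholder file).
background = cv2.imread("empty_scene.png", cv2.IMREAD_GRAYSCALE)

def foreground_mask(frame_bgr, thresh=30):  # threshold is an assumption
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, background)   # current image minus reference
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```

Any lighting change pushes many pixel differences past the threshold at once, which is exactly the sensitivity noted above.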

Statistical approaches use the characteristics of individual pixels or groups of pixels to construct more advanced background models, and the statistics of the background can be updated dynamically during processing. Each pixel in the current image can then be classified as foreground or background by comparing it against the statistics of the current background model. This approach is becoming increasingly popular due to its robustness to noise, shadows, changing lighting conditions, etc. (Stauffer & Grimson, 1999).
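OpenCV ships an adaptive Gaussian-mixture background model in the spirit of Stauffer & Grimson (1999); a minimal usage sketch (the parameters shown are the library defaults, and the video path is a placeholder):

```python
import cv2

# Adaptive mixture-of-Gaussians background model; its per-pixel
# statistics are updated on every call to apply().
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)

cap = cv2.VideoCapture("surveillance.mp4")  # placeholder input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # fg_mask: 255 = foreground, 127 = detected shadow, 0 = background.
    fg_mask = subtractor.apply(frame)
```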

Optical flow is the velocity field that warps one image into another (usually very similar) image, and it is generally used to describe the motion (speed and direction) of points or features between images (Watson & Ahumada, 1985). Optical flow methods are very common for assessing motion from a set of images. However, most of them are computationally complex, sensitive to noise, and require specialized hardware for real-time applications.
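For illustration, dense optical flow between two consecutive frames can be computed with Farneback's method in OpenCV (one classical choice among many; the frame file names and parameter values are placeholders):

```python
import cv2

prev_gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # placeholder
next_gray = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)  # placeholder

flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
# flow has shape (H, W, 2): a per-pixel (dx, dy) vector, i.e. the speed
# and direction of apparent motion at every pixel.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```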

2. Object Classification

Now that we know there is a moving object, we need to know what it is before we can reason about its behavior. This way, we can distinguish whether it is a human walking or running (decided by the moving speed) or a car that is speeding (to be more specific, we would probably need a speed detector), or whether it is an object of interest to the investigated application (for which we could do object verification).

There are two main categories of approaches for classifying moving objects: shape-based classification and motion-based classification. Today there are also well-established object classification algorithms based on neural networks. Different descriptions of the shape of motion regions, such as points, boxes, silhouettes, and blobs, are available for classifying moving objects. In general, human motion exhibits a periodic property, so this has also been used as a strong cue for classifying moving objects.
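As a hypothetical sketch of the neural-network route, one could crop each detected motion region and hand it to an off-the-shelf ImageNet classifier; this is not a surveillance-specific model, and its coarse ImageNet labels would still need mapping to classes like "person" or "vehicle":

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Off-the-shelf ImageNet classifier (an illustrative stand-in, not a
# model trained on surveillance data).
weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights).eval()
preprocess = weights.transforms()

def classify_region(crop_pil):
    """crop_pil: a PIL image of one detected motion region."""
    with torch.no_grad():
        logits = model(preprocess(crop_pil).unsqueeze(0))
    # Return the most likely ImageNet category name.
    return weights.meta["categories"][logits.argmax().item()]
```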

A slightly newer idea: steps 1 and 2, motion detection and object classification, could both be handled by semantic segmentation, plus a speed tracker over consecutive frames to recover object speed. Two problems need to be overcome here: 1) semantic segmentation usually works well only on higher-resolution images, so we need to see how well it performs on lower-resolution footage; 2) we need object verification: to detect an object's motion, we must find the same object in the next frame, which is similar to object tracking.

Semantic segmentation could resolve both where the object is and what it is.
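A hypothetical sketch of this idea with an off-the-shelf pretrained segmentation network (the specific model choice is an illustrative assumption):

```python
import torch
from torchvision.models.segmentation import (deeplabv3_resnet50,
                                             DeepLabV3_ResNet50_Weights)

# Pretrained segmentation model (illustrative stand-in, trained on
# generic scenes rather than surveillance footage).
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

def segment(frame_pil):
    """Returns a (H, W) tensor of per-pixel class indices."""
    with torch.no_grad():
        out = model(preprocess(frame_pil).unsqueeze(0))["out"]
    return out.argmax(dim=1)[0]  # "what" and "where" in a single map
```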

3. Object Tracking

Tracking methods can be roughly divided into four major categories, and algorithms from different categories can be integrated together (Cavallaro et al., 2005; Javed & Shah, 2002).

a. Region-based Tracking
Region-based tracking algorithms track objects according to variation of the image regions corresponding to the moving objects. For these algorithms, the motion regions are usually detected by subtracting the background from the current images.
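A toy sketch of the region-based step: label connected foreground regions from any of the motion-segmentation methods above and associate them across frames by nearest centroid (the mask file name, minimum area, and greedy matching scheme are all illustrative assumptions):

```python
import cv2

fg_mask = cv2.imread("fg_mask.png", cv2.IMREAD_GRAYSCALE)  # placeholder mask

def region_centroids(mask, min_area=100):  # min_area is an assumption
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # Label 0 is the background; drop tiny noise regions.
    return [tuple(centroids[i]) for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]

def match_regions(prev_cents, cur_cents):
    """Greedy nearest-centroid association between consecutive frames."""
    pairs, remaining = [], list(cur_cents)
    for p in prev_cents:
        if not remaining:
            break
        c = min(remaining, key=lambda q: (q[0] - p[0])**2 + (q[1] - p[1])**2)
        pairs.append((p, c))
        remaining.remove(c)
    return pairs
```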

b. Contour-based Tracking
In contour-based methods instead of tracking the whole set of pixels comprising an object, the algorithms track only the contour of the object (Isard & Blake, 1996).
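For instance, with OpenCV one can extract just the outlines from a foreground mask (the mask file is a placeholder; a full contour tracker such as the condensation filter of Isard & Blake, 1996, is beyond this sketch):

```python
import cv2

fg_mask = cv2.imread("fg_mask.png", cv2.IMREAD_GRAYSCALE)  # placeholder mask

# Extract only the outer boundaries of the foreground regions.
contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
# Each contour is an (N, 1, 2) array of boundary points; a contour-based
# tracker propagates these outlines over time instead of whole regions.
largest = max(contours, key=cv2.contourArea)
```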

c. Feature-based Tracking
Feature-based methods use features of a video subject to track parts of the object. Feature-based tracking algorithms perform recognition and tracking of objects by extracting elements, clustering them into higher-level features, and then matching those features between images.
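A classic concrete instance is Shi-Tomasi corners tracked with pyramidal Lucas-Kanade optical flow; the sketch below assumes a placeholder video path and default-ish parameters:

```python
import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # placeholder input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
# Extract low-level elements (corner features) to track.
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Match each feature to its new location in the next frame.
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                     points, None)
    points = new_points[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```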

d. Model-based Tracking
Model-based tracking algorithms track objects by matching a projected object model to the image data. The models are usually constructed off-line with manual measurement, CAD tools, or computer vision techniques. Generally, model-based human body tracking involves three main tasks: 1) construction of human body models; 2) representation of a priori knowledge of motion models and motion constraints; and 3) prediction and search strategies. Construction of human body models is the basis of model-based human tracking. In general, the more complex a human body model, the more accurate the tracking results, but the more expensive the computation. Traditionally, the geometric structure of a human body can be represented in four styles: stick figure, 2-D contour, volumetric model, and hierarchical model.

e. Hybrid Tracking
Hybrid approaches combine region-based and feature-based techniques. They exploit the advantages of both by first considering the object as an entity and then tracking its parts.

4. Extraction of Motion Information

Before discussing the details of motion-information extraction, consider Fig. 3, which shows how a surveillance system may extract and learn motion patterns, e.g., a walk cycle, using the 4-level decomposition of human dynamics illustrated in (Bregler, 1997). Each level represents a set of random variables and probability distributions over hypotheses. The lowest level is a sequence of input images; for each pixel, the spatio-temporal image gradient and optionally the color value are represented as random variables. The second level contains the blob hypotheses: each blob is represented with a probability distribution over coherent motion (rotation and translation, or full affine motion), color (HSV values), and spatial "support regions". In the third level, temporal sequences of blob tracks are grouped into linear stochastic dynamical models. At the fourth and highest level, each dynamical model corresponds to the emission probability of a state of a Hidden Markov Model (HMM).

The first important step in motion-based recognition is motion extraction from a sequence of images. Motion perception and interpretation play a very important role in a visual surveillance system. There are generally three methods for extracting motion information from a sequence of images: optical flow, trajectory-based features, and region-based features.

a. Optical Flow Features
Optical flow methods are very common for assessing motion from a set of images. Optical flow is an approximation of the two-dimensional flow field from image intensities: the velocity field that warps one image into another (usually very similar) image. Several methods have been developed; however, accurate and dense measurements are difficult to achieve (Cedras & Shah, 1995).
b. Trajectory-based Features
Trajectories, derived from the locations of particular points on an object over time, are very popular because they are relatively simple to extract and their interpretation is obvious (Morris & Trivedi, 2008). The generation of motion trajectories from a sequence of images typically involves the detection of tokens in each frame and the correspondence of such tokens from one frame to another. The tokens need to be distinctive enough for easy detection and stable through time so that they can be tracked. Tokens include edges, corners, interest points, regions, and limbs. Several solutions have been proposed (Cavallaro et al., 2005; Koller-Meier & Van Gool, 2001; Makris & Ellis, 2005; Bobick & Wilson, 1997) for modeling and recognizing human actions with trajectory-based features. In the first step, an arbitrary, changing number of objects is tracked. From the history of the tracked object states, temporal trajectories are formed that describe the motion paths of these objects. Secondly, characteristic motion patterns are learned, e.g., by clustering these trajectories into prototype curves. In the final step, motion recognition is tackled by tracking the position within these prototype curves, based on the same method used for the object tracking.
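A toy sketch of the clustering step (the fixed trajectory length of 32 samples and k = 5 clusters are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def resample(traj, n=32):
    """traj: (T, 2) array of (x, y) positions; returns a fixed-length (n, 2)."""
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n)
    return np.column_stack([np.interp(t_new, t_old, traj[:, i])
                            for i in range(2)])

def cluster_trajectories(trajectories, k=5):
    # Flatten each resampled trajectory into one feature vector.
    X = np.stack([resample(t).ravel() for t in trajectories])
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    prototypes = km.cluster_centers_.reshape(k, -1, 2)  # prototype curves
    return km.labels_, prototypes
```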
c. Region- or Image-based Features
For certain types of objects or motions, the extraction of precise motion information for each single point is neither desirable nor necessary. Instead, a more general idea of the content of a frame might be sufficient. Features generated from information over a relatively large region, or over the whole image, are referred to here as region-based features. This approach has been used in several studies (Jan, 2004).

5. Behavior Analysis and Understanding

One of the most difficult challenges in the domain of computer vision and artificial intelligence is semantic behavior learning and understanding from observing activities in video (visual) surveillance.

An automated visual surveillance system generally requires a reliable combination of image processing and artificial intelligence techniques. Image processing techniques are used to provide low level image features. Artificial intelligence techniques are used to provide expert decisions.

Extensive research has been reported on low-level image processing techniques such as object detection, recognition, and tracking; however, relatively little research has been reported on reliable classification and understanding of human activities from video image sequences.

Some researchers (Bremond et al., 2006; Ivanov & Bobick, 2000) have proposed and adopted a two-step approach to the problem of video understanding:

• A lower-level image processing visual module is used to extract visual cues and primitive events

• This collected information is used in a higher-level artificial intelligence module for the detection of more complex and abstract behavior patterns

By dividing the problem into two or three sub-problems, researchers can use simpler and more domain-independent techniques in each stage. The first stage usually uses image processing and stochastic techniques for data analysis, while the second stage conducts structural analysis of the symbolic data gathered in the previous step.

For the computer vision community, a natural approach to recognize scenarios consists of using a probabilistic or neural network. The nodes of this network correspond usually to scenarios that are recognized at a given instance with a computed likelihood.

For the artificial intelligence community, a natural way to recognize a scenario is to use a symbolic network whose nodes usually correspond to the Boolean recognition of scenarios. The common characteristic of these approaches is that all fully recognized behaviors are stored.

Another development that has captured researchers' attention is unsupervised behavior learning and recognition: the capability of a vision interpretation system to learn and detect the frequent scenarios of a scene without requiring prior definitions of behaviors by the user.

The automatic video understanding and interpretation needs to know how to represent and recognize behaviors corresponding to different types of concepts, which include (Bremond et al., 2006; Medioni et al., 2001; Levchuk et al., 2010):

• Basic Properties: A basic property is a characteristic of an object such as its trajectory or speed. (We can learn such properties)

• States: A state describes a situation characterizing one or several objects (actors), defined either at a given time (e.g., a subject is agitated) or as a stable situation over a time interval. For the state "an individual stays close to the ticket vending machine," two actors are involved: an individual and a piece of equipment.

• Events: An event is a change of state at two consecutive times (e.g., a subject enters an area of interest).

• Scenarios: A scenario is a combination of states, events or sub-scenarios. Behaviors are specific scenarios, dependent on the application defined by the users. For example, to monitor metro stations, end-users could have defined targeted behaviors: “Loitering”, “Unattended Luggage”, “Vandalism”, “Overcrowding”, “Fighting”, etc.

a. Hidden Markov Models (HMMs): An HMM is a statistical tool used for modeling generative sequences characterized by a set of observable sequences (Brand & Kettnaker, 2000).
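A minimal sketch using the hmmlearn package (an assumed third-party dependency; the feature dimension and state count are placeholders): fit one HMM per behavior class on per-frame motion features, then score new clips against each model.

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency

# One Gaussian HMM per behavior class; 4 hidden states is arbitrary.
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)

# X: all training frames stacked (random placeholders standing in for
# real per-frame features such as position and speed).
X = np.random.randn(300, 3)
lengths = [100, 100, 100]       # frames per training sequence
model.fit(X, lengths)

# Higher log-likelihood = the clip better matches this behavior class.
log_likelihood = model.score(np.random.randn(80, 3))
```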

b. Dynamic Time Warping (DTW): DTW is a technique that computes the non-linear warping function that optimally aligns two variable-length time sequences (Bobick & Wilson, 1997). The warping function can be used to compute the similarity between two time series or to find corresponding regions between them.
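A plain-NumPy sketch of the DTW recurrence for two 1-D sequences (smaller cost means more similar):

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of optimally aligning variable-length sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allowed warping moves: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: same shape, warped
```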

c. Finite-State Machine (FSM): FSM or finite-state automaton or simply a state machine, is a model of behavior composed of a finite number of states, transitions between those states, and actions. A finite state machine is an abstract model of a machine with a primitive internal memory.
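A toy sketch: a hand-written FSM for a hypothetical "loitering" behavior (the states, events, and zone logic are all illustrative assumptions):

```python
# Transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("OUTSIDE", "enter_zone"): "INSIDE",
    ("INSIDE", "exit_zone"): "OUTSIDE",
    ("INSIDE", "timeout"): "LOITERING",   # stayed in the zone too long
    ("LOITERING", "exit_zone"): "OUTSIDE",
}

def step(state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = "OUTSIDE"
for event in ["enter_zone", "timeout"]:
    state = step(state, event)
print(state)  # -> LOITERING
```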

d. Nondeterministic-Finite-State Automaton (NFA): A NFA or nondeterministic finite state machine is a finite state machine where for each pair of state and input symbols, there may be several possible next states. This distinguishes it from the deterministic finite automaton (DFA), where the next possible state is uniquely determined. Although the DFA and NFA have distinct definitions, it may be shown in the formal theory that they are equivalent, in that, for any given NFA, one may construct an equivalent DFA, and vice-versa.

e. Time-Delay Neural Network (TDNN): TDNN is an approach to analyzing time-varying data. In TDNN, the delay units are added to a general static network, and some of the preceding values in a time-varying sequence are used to predict the next value. As larger data sets become available, more emphasis is being placed on neural networks for representing temporal information. TDNN methods have been successfully applied to applications, such as hand gesture recognition and lip reading.
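One way to see a TDNN concretely is as a 1-D convolution over the time axis: each unit looks at a fixed window of delayed inputs. A hypothetical PyTorch sketch (all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    def __init__(self, n_features=10, n_classes=5):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5),  # 5-frame delay window
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3),          # wider effective context
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                   # pool over time
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):  # x: (batch, n_features, time)
        return self.net(x)

logits = TinyTDNN()(torch.randn(2, 10, 40))  # two 40-frame sequences
```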

f. Syntactic/Grammatical Techniques: The basic idea in this approach is to divide the recognition problem into two levels. The lower level is performed using standard independent probabilistic temporal behavior detectors, such as HMMs, to output possible low-level temporal features. These outputs provide the input stream for a stochastic context-free grammar parser. The grammar and parser provide longer range temporal constraints, disambiguate uncertain low-level detection, and allow the inclusion of a priori knowledge about the structure of temporal behavior (Ivanov & Bobick, 2000).

g. Self-Organizing Neural Network: The methods discussed in (a)–(f) all involve supervised learning. They are applicable to known scenes where the types of object motions are already known. Self-organizing neural networks are suited to behavior understanding when the object motions are unrestricted.

h. Agent-Based Techniques: Instead of learning large amounts of behavior patterns using a centralized approach, agent-based methods decompose the learning into interactions of agents with much simpler behaviors and rules (Bryll et al., 2005).

i. Artificial Immune Systems: Several researchers have explored the feasibility of learning behavior patterns and hostile intents at the optical-flow level using artificial immune system approaches (Sarafijanovic & Leboudec, 2004).

6. Person Identification

In most of the video surveillance literature, person identification is achieved by motion analysis and matching, such as gait, gesture, and posture analysis and comparison (Hu et al., 2004). In model-based methods, parameters for gait, gesture, and/or posture, such as joint trajectories, limb lengths, and angular speeds, are measured. Statistical recognition techniques usually characterize the statistical description of motion image sets and have been well developed in automatic gait recognition. Physical-parameter-based methods make use of geometric structural properties of a human body to characterize a person's gait pattern; the parameters used include, for example, height.

7. Camera Handoff and Data Fusion

To expand the surveillance area and provide multiple views of a scene, most visual (or video) surveillance systems are multi-camera based. A multi-camera surveillance system with overlapping fields of view can track objects and recognize their activities against a predefined set of activities or scenarios, or even learn new behavior patterns. Each camera agent performs per-frame detection and tracking of scene objects, and the output data is transmitted to a centralized server where data association and fused object tracking are performed. The tracking result is fed to a video event recognition module where spatial and temporal events relating to the objects are detected and analyzed. Tracking with a single camera easily produces ambiguity due to occlusion or depth; this ambiguity may be eliminated from another view. However, visual surveillance using multiple cameras also brings problems such as camera installation (how to cover the entire scene with the minimum number of cameras), camera calibration, object matching, automated camera switching, and data fusion (Collins et al., 2000).

Most proposed systems use cameras as the sensor, since a camera can provide the resolution needed for accurate classification and position measurement. The disadvantage of image-only detection systems is the high computational cost of classifying a large number of candidate image regions. Accordingly, it has been a trend for several years to use a hierarchical detection structure combining different sensors: in the first step, low-computational-cost sensors identify a small number of candidate regions of interest (ROI). LIDAR (Light Detection and Ranging) is an optical remote sensing technology that measures properties of scattered light to find the range and/or other information about a distant target. The prevalent method of determining the distance to an object or surface is to use laser pulses. As in the similar RADAR technology, which uses radio waves instead of light, the range to an object is determined by measuring the time delay between transmission of a pulse and detection of the reflected signal. As shown in (Szarvas et al., 2006; Premebida et al., 2007), the region-of-interest (ROI) detector in their proposed systems receives the signal from the LIDAR sensor and outputs a list of boxes in three-dimensional (3D) world coordinates. The 3D ROI boxes are obtained by clustering the LIDAR measurements, and each 3D box is projected onto the image plane using the intrinsic and extrinsic camera parameters.
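The projection step at the end can be sketched with the standard pinhole camera model; the intrinsics K and extrinsics (R, t) below are placeholder values, not calibrated parameters:

```python
import numpy as np

def project_points(pts_world, K, R, t):
    """Project (N, 3) world-coordinate points to (N, 2) pixel coordinates."""
    pts_cam = pts_world @ R.T + t      # world frame -> camera frame
    uvw = pts_cam @ K.T                # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])        # placeholder intrinsics
R, t = np.eye(3), np.zeros(3)          # placeholder extrinsics

# Corners of one 3D ROI box from LIDAR clustering (illustrative values).
corners = np.array([[x, y, z] for x in (-1.0, 1.0)
                              for y in (-0.5, 0.5)
                              for z in (9.0, 11.0)])
pixels = project_points(corners, K, R, t)
```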

8. Performance Evaluation

The methods for evaluating the performance of object detection, tracking, classification, and behavior and intent detection in a visual surveillance system are more complex than those for well-established biometric identification applications, such as fingerprint or face recognition, due to the unconstrained environments and the complexity of the challenge itself. Performance Evaluation for Tracking and Surveillance (PETS) is a good starting place when looking into performance evaluation (PETS, 2007). As shown in Fig. 4, PETS has several good data sets for both indoor and outdoor tracking evaluation and event/behavior detection.
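As one concrete building block, detection and tracking results are commonly scored against ground-truth boxes by intersection over union; a generic sketch (not PETS's specific protocol):

```python
# IoU between a predicted and a ground-truth box, each as (x1, y1, x2, y2).
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```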

References

[1] Hu, Weiming, et al. "A survey on visual surveillance of object motion and behaviors." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34.3 (2004): 334–352.

[2] Borg, M., Thirde, D., Ferryman, J., Fusier, F., Valentin, V., Bremond, F., & Thonnat, M. (2005). "Video Surveillance for Aircraft Activity Monitoring," IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 16–21.
