Scene Parsing

more details in
Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

Scene parsing, or semantic segmentation, consists of labeling each pixel in an image with the category of the object it belongs to. It is a challenging task that involves the simultaneous detection, segmentation, and recognition of all the objects in the image.

I have been working on a model (joint work with Laurent Najman and Yann LeCun) that can parse a wide variety of scenes in very little time (about one second per frame on an i7-based computer). The model was briefly introduced in Yann's talk at the NPB workshop (at NIPS): here (the actual scene parsing part starts at 10:00).

In the following clips, recorded in random locations in the NYC area, we demonstrate the generalization capabilities of our model. This model was trained on the rather small Stanford Background dataset, which contains 715 images (520 for training, the rest for testing), labeled into 8 classes.

Scene parsing in Brooklyn.
The system demonstrated here was trained on the Stanford Background dataset. The challenge consists of labeling each pixel with 1 of 8 classes: sky, road, building, tree, grass, water, mountain, object. The last class (object) includes people, cars, trucks, poles, ...

The next clip demonstrates the generalization capabilities of our larger model, trained on the SiftFlow dataset, made of 2,688 images and 33 classes. The video was stitched from 4 GoPro cameras, and shot around the NYU campus. Our parse is far from perfect, but bear in mind that our model performs no global inference, in order to maintain a high frame rate. We can parse a 360 degree image of this type in a bit more than a second on a 4-core Intel i7.

The system demonstrated here was trained on the SiftFlow dataset, a dataset consisting of 2,688 images and 33 classes. The video captures 360 degree scenes around the NYU campus.

A complete description of our system can be found in "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers".

In short, our scene parsing method starts by computing a tree of segments from a graph of pixel dissimilarities. Simultaneously, a set of dense feature vectors is computed which encodes regions of multiple sizes centered on each pixel. The feature extractor is a multiscale convolutional network trained from raw pixels. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The convolutional network feature extractor is trained end-to-end from raw pixels, alleviating the need for engineered features. After training, the system is parameter-free.
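
To make the cover-selection step concrete, here is a minimal sketch of that final stage, assuming the segmentation tree and the per-segment class histograms (the classifier outputs) have already been computed. The Node class and all names are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

def entropy(hist):
    """Shannon entropy of a class histogram (the segment's 'impurity')."""
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

class Node:
    """One segment in the tree: the pixels it covers, its predicted
    class histogram, and its child segments."""
    def __init__(self, pixels, hist, children=()):
        self.pixels = pixels
        self.hist = hist
        self.children = children

def optimal_cover(root, n_pixels):
    """Label each pixel with the class of the minimally-impure
    tree node that covers it."""
    best_impurity = np.full(n_pixels, np.inf)
    best_label = np.zeros(n_pixels, dtype=int)
    stack = [root]
    while stack:
        node = stack.pop()
        h = entropy(node.hist)
        cls = int(np.argmax(node.hist))
        for p in node.pixels:
            if h < best_impurity[p]:
                best_impurity[p] = h
                best_label[p] = cls
        stack.extend(node.children)
    return best_label
```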

Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 2-stage convolutional network, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a segmentation tree is computed via the minimum spanning tree of the dissimilarity graph of neighboring pixels. The segment associated with each node in the tree is encoded by a spatial grid of feature vectors pooled in the segment’s region. A classifier is then applied to all the aggregated feature grids to produce a histogram of categories, the entropy of which measures the "impurity" of the segment. Each pixel is then labeled by the minimally-impure node above it, which is the segment that best "explains" the pixel.
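
The multiscale front end of this diagram can be sketched in a few lines. The following is an illustrative PyTorch-style reconstruction, not the original code: it builds a Laplacian pyramid, runs the same network at every scale, and upsamples the coarser maps to the finest map's size before concatenating; `convnet` and all hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, n_scales=3):
    """Decompose an image (N x C x H x W) into band-pass scales."""
    scales = []
    current = img
    for _ in range(n_scales - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:],
                           mode='bilinear', align_corners=False)
        scales.append(current - up)   # band-pass detail at this scale
        current = down
    scales.append(current)            # residual low-pass band
    return scales

def multiscale_features(img, convnet, n_scales=3):
    """Run the same ConvNet at every scale, then upsample the coarser
    maps to the finest map's size and concatenate along channels."""
    maps = [convnet(level) for level in laplacian_pyramid(img, n_scales)]
    finest = maps[0].shape[-2:]
    maps = [maps[0]] + [F.interpolate(m, size=finest, mode='bilinear',
                                      align_corners=False) for m in maps[1:]]
    return torch.cat(maps, dim=1)     # one long feature vector per pixel
```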

NeuFlow: A Dataflow Processor for Vision

related read
NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision

neuFlow is a home-grown dataflow computer that was designed to optimally compute dense, filter-based vision models. Dataflow processors like this one are well suited for high-speed processing of video streams, can be extremely compact, and could potentially give vision capabilities to any lightweight unmanned vehicle with low power constraints.

I'm particularly interested in the low-power, efficient computation of convolutional networks and the like. Yann LeCun et al. have demonstrated the efficiency of such architectures in pattern/object recognition in images [LeCun 1998, LeCun 2010, Kavukcuoglu 2010, Jarrett 2009].

The neuFlow processor is based on a reconfigurable grid of simple processing units. Such a grid provides a way of dynamically reconfiguring the hardware to run different operations on streams of data, in parallel. The architecture is largely inspired by the old ideas of dataflow computing [ Adams 1968 , Dennis 1974 , Hicks Arvind 1993 ].

neuFlow: overview of the architecture
related read
Large-Scale FPGA-Based Convolutional Networks

The figure above shows an overview of the neuFlow architecture. Several key ingredients were used to create a functional, practical dataflow grid (a toy software model of these elements follows the list):

  • a 2D grid of Processing Tiles (PTs),
  • a PT is made of a bank of streaming, high-throughput operators, attached to a reconfigurable routing fabric (a MUX) that allows a tile to exchange streams of data with neighboring tiles,
  • a Smart Direct Memory Access module (Smart DMA), which interfaces with off-chip memory and provides parallel data transfers, with priority management,
  • global data lines used to connect PTs to the Smart DMA,
  • local data lines used to connect PTs with their 4 neighbors,
  • a Runtime Configuration Bus, used to reconfigure many aspects of the grid at runtime: connections, operators, Smart DMA modes, ... (the configurable elements are depicted as squares in the figure),
  • a flow controller that can reconfigure most of the computing grid and the Smart DMA at runtime.
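
As promised above, here is a toy Python model of these configurable elements. This is only a conceptual sketch of what the runtime configuration manipulates; the field names and command format are invented for illustration and do not reflect the actual hardware configuration words.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingTile:
    """One PT: a bank of streaming operators behind a routing MUX."""
    operator: str = 'idle'                   # active operator, e.g. 'conv2d', 'add', 'tanh'
    mux: dict = field(default_factory=dict)  # input port -> source: 'N','S','E','W' or 'global'

@dataclass
class Grid:
    tiles: list                              # 2D array of ProcessingTile
    dma_mode: str = 'idle'                   # Smart DMA prefetch/writeback mode

def configure(grid, commands):
    """Flow controller: apply one runtime configuration command per tile."""
    for (row, col), operator, routing in commands:
        tile = grid.tiles[row][col]
        tile.operator = operator
        tile.mux = routing

# Example: route a global stream into a convolver, then sum with a neighbor.
grid = Grid(tiles=[[ProcessingTile() for _ in range(3)] for _ in range(3)])
configure(grid, [((0, 0), 'conv2d', {'in': 'global'}),
                 ((0, 1), 'add',    {'a': 'W', 'b': 'global'})])
```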

neuFlow: configured to compute 2 parallel 2D convolutions

This second figure shows our grid configured for a particular operation: a non-linear mapping of a sum of 2D convolutions, performed in one pass. Our grid implements the dataflow idea of having (at least theoretically) no state, or instruction pointer. In the system presented here, the grid itself has no state, but a state does exist in a centralized control unit. For each configuration of the grid, no state is used, and the presence of data drives the computations. Although this stateless design leads to optimal throughput, the system strives to be as general as possible, and provides a mechanism to reconfigure the grid at runtime to sequence different operations, which is crucial for algorithms that require different types of computations. A typical execution of an operation on this system proceeds as follows (a code sketch follows the list):

  • the control unit configures each tile to be used for the computation and each connection between the tiles and their neighbors and/or the global lines, by sending a configuration command to each of them,
  • it configures the Smart DMA to prefetch the data to be processed, and to be ready to write results back to off-chip memory,
  • when the DMA is ready, it triggers the streaming out,
  • each tile processes its respective incoming streaming data, and passes the results to another tile, or back to the Smart DMA,
  • the control unit is notified of the end of operations when the Smart DMA has completed.
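
These five steps can be summarized as a control sequence. The sketch below is purely illustrative pseudocode in Python: every object and method name (flow_controller, smart_dma, and so on) is a hypothetical stand-in for the hardware mechanisms described above, not a real API.

```python
def run_operation(flow_controller, grid, smart_dma, op_config, transfers):
    # 1. configure each tile and its connections (neighbors / global lines)
    flow_controller.configure_grid(grid, op_config)
    # 2. set up the Smart DMA: prefetch inputs, prepare result writeback
    smart_dma.prefetch(transfers.inputs)
    smart_dma.prepare_writeback(transfers.outputs)
    # 3. once the DMA is ready, trigger the streaming out
    smart_dma.wait_ready()
    smart_dma.start_streaming()
    # 4. tiles process their incoming streams and forward results to
    #    other tiles or back to the Smart DMA (purely data-driven:
    #    no instruction pointer in the grid)
    # 5. the control unit is notified when the Smart DMA has completed
    flow_controller.wait_for(smart_dma.done)
```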

related slides
NeuFlow: A Dataflow Architecture for Vision

Our latest neuFlow platform (custom design by Pico Computing). The platform embeds two cameras, 8 GB of DDR3 memory, and one of the largest Xilinx FPGAs.

Convolutional Networks for Vision

extended read
Convolutional Networks and Applications in Vision

One of the key questions of Vision Science (natural and artificial) is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter? More interestingly, how could an artificial vision system learn appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world? In the time-honored approach to computer vision (and to pattern recognition in general), the question is avoided: internal representations are produced by a hand-crafted feature extractor, whose output is fed to a trainable classifier. While the issue of learning features has been a topic of interest for many years, considerable progress has been achieved in the last few years with the development of so-called deep learning methods.

related read
What is the Best Multi-Stage Architecture for Object Recognition?

Good internal representations are hierarchical. In vision, pixels are assembled into edgelets, edgelets into motifs, motifs into parts, parts into objects, and objects into scenes. This suggests that recognition architectures for vision (and for other modalities such as audio and natural language) should have multiple trainable stages stacked on top of each other, one for each level in the feature hierarchy. This raises two new questions: what should go in each stage, and how should such deep, multi-stage architectures be trained? Convolutional Networks (ConvNets) are an answer to the first question. Until recently, the answer to the second question was to use gradient-based supervised learning, but recent research in deep learning has produced a number of unsupervised methods which greatly reduce the need for labeled samples.

A typical convolutional network for pixelwise classification (object recognition).

Convolutional Networks are trainable hierarchical architectures composed of multiple transformation stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map would be a 2D array containing a color channel of the input image (for an audio input each feature map would be a 1D array, and for a video or volumetric image, it would be a 3D array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module.
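
As an illustration of this three-layer anatomy, here is a minimal PyTorch-style sketch of a two-stage ConvNet followed by a classification module. The filter sizes, map counts, and number of classes are arbitrary placeholders, not the architectures used in the cited papers.

```python
import torch.nn as nn

def stage(in_maps, out_maps):
    """One ConvNet stage: filter bank -> non-linearity -> feature pooling."""
    return nn.Sequential(
        nn.Conv2d(in_maps, out_maps, kernel_size=7, padding=3),  # filter bank
        nn.Tanh(),                                               # non-linearity
        nn.MaxPool2d(kernel_size=2),                             # feature pooling
    )

# A typical ConvNet: two 3-layer stages, then a classification module.
net = nn.Sequential(
    stage(3, 16),        # input: 3 feature maps (the color channels)
    stage(16, 64),
    nn.Flatten(),
    nn.LazyLinear(10),   # classifier over 10 hypothetical categories
)
```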

related read
Torch7: A Matlab-like Environment for Machine Learning

Trainable hierarchical vision models, and more generally image processing algorithms, are usually expressed as sequences of operations or transformations. They are well described by a modular approach, in which each module processes an input image bank and produces a new bank. The figure above is a nice graphical illustration of this approach. Each module requires the previous bank to be fully (or at least partially) available before computing its output. This causality prevents simple parallelism from being implemented across modules. However, parallelism can easily be introduced within a module, at several levels, depending on the kind of underlying operations.
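
In code, this modular view reduces to a simple loop; the sketch below is only meant to make the dependency structure explicit.

```python
def run_pipeline(modules, input_bank):
    """Sequentially apply modules; each needs the previous bank in full."""
    bank = input_bank
    for module in modules:
        # no parallelism *across* modules: 'bank' must be complete here,
        # but each module may parallelize internally (over maps, pixels, ...)
        bank = module(bank)
    return bank
```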

We're currently working on a graph-based package for Torch7 that simplifies such descriptions, and also simplifies the mapping of these algorithms to neuFlow (neuFlow's compiler already relies on such descriptions, but with several restrictions).

Flying UAVs

extended read
Visual Tracking and LIDAR Relative Positioning for Automated Launch and Recovery of an Unmanned Rotorcraft from Ships at Sea

One of my first significant projects was the development of a miniaturized tracking system to assist an unmanned helicopter in landing. This system involved substantial real-time video processing [Garratt 2009]. Much of the work was about integrating the whole system into a single chip (an FPGA). This was joint work with Matthew Garratt and Andrew Lambert.

The UNSW UAV, and the FPGA-based tracker I designed.

Sensors and systems for a fully autonomous unmanned helicopter were developed with the aim of completely automating the landing and launch of a small unmanned helicopter from the deck of a ship. In this work, we combined a laser rangefinder (LRF) system with a visual tracking sensor to construct a low-cost guidance system. Our novel LRF system was able to determine both the distance to and the orientation of the deck in one cycle. We constructed an optical sensor to complement the laser system, comprising a digital camera interfaced to a Field Programmable Gate Array (FPGA), which enabled the entire target-tracking computation to be performed in a very small, self-contained form factor. A narrowband light source on the deck was detected by the digital camera and tracked by an algorithm implemented on the FPGA to provide a relative bearing to the deck from the helicopter. By combining the optical sensor bearing with the information from the laser system, an accurate estimate of the helicopter position relative to the deck could be found.
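
The bearing-plus-range fusion at the end can be illustrated with toy geometry. The sketch below is a simplified stand-in for the actual estimator: it assumes a single beacon, ideal measurements, and ignores the attitude compensation a real system would need.

```python
import math

def relative_position(bearing_az, bearing_el, range_to_deck):
    """Toy fusion: a camera bearing (azimuth/elevation, radians) plus a
    laser range gives the beacon's position in the helicopter's frame."""
    x = range_to_deck * math.cos(bearing_el) * math.cos(bearing_az)
    y = range_to_deck * math.cos(bearing_el) * math.sin(bearing_az)
    z = range_to_deck * math.sin(bearing_el)
    return x, y, z

# e.g. a deck beacon 20 m away, 5 degrees left and 30 degrees below level
print(relative_position(math.radians(-5), math.radians(-30), 20.0))
```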