Welcome to The Neuromorphic Engineer  
Neuromorphic computer vision: overcoming 3D limitations

Asynchronous, event-based artificial retinas^{1} are introducing a shift in current methods of tackling visual signal processing.^{2–4} Conventional frame-based image acquisition and processing technologies are not designed to take full advantage of the dynamic characteristics of visual scenes. A collection of snapshots is static and contains much redundant information, because every pixel is sampled repetitively, even if its value is unchanged, and is thus unnecessarily digitized, transmitted, and finally stored.^{5} This waste of resources places a significant burden on memory and computational time in computer vision applications. Event-based dynamic vision sensors (DVSs)^{6} provide a novel and efficient way of encoding light and its temporal variations by asynchronously reacting to, and transmitting, only the scene changes at the exact time they occur.^{5, 7–14} This is similar to retinal outputs, which are massively parallel, asynchronous, and data-driven according to the information retrieved in scenes.^{15}

Here, we describe work on time-oriented computation of 3D information^{2,3} that introduces event-based computation into computer vision techniques such as calibration and stereo matching. This approach uses two DVS models^{5} whose acquisition principle is shown in Figure 1. The DVS models the transient responses of the retina,^{16} uses an address-event representation with 128×128 pixels, and outputs asynchronous address events that signal scene-reflectance changes as they happen. Each pixel is independent and detects changes in log intensity since the last event it emitted. When the change in log intensity exceeds a set threshold (typically 15% contrast), the pixel generates a signed event whose sign depends on whether the log intensity increased or decreased.
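The per-pixel behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the sensor's actual circuit: the function name `dvs_pixel_events`, the sampled-input format, and the choice to reset the reference level to the current log intensity after each event are all assumptions made for the example. The default threshold `log(1.15)` is one way of reading "15% contrast".

```python
import math

def dvs_pixel_events(samples, threshold=math.log(1.15)):
    """Simplified model of one DVS pixel (illustrative assumption).

    samples: iterable of (timestamp, intensity) pairs, intensity > 0.
    threshold: log-intensity change required to emit an event.
    Returns a list of (timestamp, polarity) with polarity +1 or -1.
    """
    events = []
    reference = None  # log intensity at the last emitted event
    for timestamp, intensity in samples:
        log_i = math.log(intensity)
        if reference is None:
            reference = log_i  # first sample only sets the reference
            continue
        delta = log_i - reference
        if abs(delta) > threshold:
            polarity = 1 if delta > 0 else -1
            events.append((timestamp, polarity))
            reference = log_i  # assumed reset rule after each event
    return events

# Hypothetical intensity trace for a single pixel.
trace = [(0, 100.0), (1, 100.0), (2, 120.0), (3, 119.0), (4, 90.0)]
events = dvs_pixel_events(trace)
```

Unchanged or slightly changed samples produce no output at all, which is the source of the sparsity discussed in the text.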
Since the DVS is not clocked like conventional cameras, the timing of events can be conveyed with a very high temporal resolution of approximately 1μs. Thus, the equivalent frame rate is typically several kilohertz. Each event in the stream carries a polarity of −1 or +1, depending on whether a negative or a positive contrast change is detected. The absence of events when no contrast change is detected implies that the redundant visual information usually recorded in frames is not carried in the stream of events.

Figure 1.

The aim of stereovision is to compute depth using two sensors viewing the scene from different positions. In this context, two elementary steps are typically performed: calibration and matching. Calibration allows estimation of the pose between the two sensors of a stereovision acquisition chain. Once a match between two views has been identified, calibration provides all the information needed to estimate the 3D position of the observed point. Calibration is performed only once at the beginning of the process, unless the relative position of the sensors is changed. Matching allows identification of the projections of a scene point in the two images. In the frame-based case, matching relies on neighboring gray-level similarities: two pixels are matched if their neighborhoods are similar.

Figure 2.

The fundamental matrix F relates corresponding points in stereo images. It contains the geometric relations between the 3D points and their projections onto the 2D images of each acquired scene view.^{17} The fundamental matrix associates with an image point p (expressed as a homogeneous vector of size 3×1 in a 2D projective space) a line l in the right image that can be computed as l=Fp (see Figure 2). This line contains all points that can possibly match with p, and it represents the line of sight of pixel p in the right image. If p′ is the match of p, the fundamental matrix satisfies p′^{T}Fp=0.
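As a small illustration of the relations l=Fp and p′^{T}Fp=0, the sketch below evaluates both with plain Python lists. The function names and the example matrix are assumptions for the example: the matrix shown is the textbook fundamental matrix of an ideal rectified pair (pure horizontal translation), chosen only because its epipolar lines are easy to check by hand, and it is not the matrix estimated in this work.

```python
import math

def epipolar_line(F, p):
    """Return l = F p for a homogeneous point p = (x, y, w)."""
    return tuple(sum(F[i][j] * p[j] for j in range(3)) for i in range(3))

def epipolar_residual(F, p, p_prime):
    """Return p'^T F p, which is zero for a perfect match."""
    l = epipolar_line(F, p)
    return sum(p_prime[i] * l[i] for i in range(3))

def distance_to_line(l, q):
    """Euclidean distance from point q = (x, y, 1) to line l = (a, b, c)."""
    a, b, c = l
    return abs(a * q[0] + b * q[1] + c * q[2]) / math.hypot(a, b)

# Illustrative F for an ideal rectified pair: matches share the same row.
F_RECTIFIED = [[0, 0, 0],
               [0, 0, -1],
               [0, 1, 0]]

p = (40, 25, 1)                      # point in the first image
line = epipolar_line(F_RECTIFIED, p)  # line of candidates in the second image
residual = epipolar_residual(F_RECTIFIED, p, (52, 25, 1))
offset = distance_to_line(line, (52, 28, 1))
```

With this matrix, the line for a point in row 25 is the row y=25 of the other image, so a candidate in the same row gives a zero residual and a candidate three rows away lies at a distance of 3 pixels.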
If eight point matches are known, the fundamental matrix is the solution of a set of linear equations.^{18} With more matches, it can be found by solving a linear least-squares minimization problem. With enough matched pairs p′↔p, the equation p′^{T}Fp=0 can be used to compute the unknown matrix F. Each point match gives rise to one linear equation in the unknown entries of F. From all the point matches, we obtain a set of linear equations of the form Af=0, where f is a 9-vector containing the entries of the matrix F, and A is the equation matrix.

Figure 3.

We let two asynchronous, event-based DVS sensors observe a common part of a scene (see Figure 3). A 3D point P moving in space triggers changes of luminance in the sensors' common field of view. The 3D point generates two events, p_{1} and p_{2}, at very close timings in retinas C_{1} and C_{2}, respectively. In an ideal case, corresponding events should be timestamped with equal values, as they are the consequence of the same event happening at a given time. Unfortunately, due to latencies in the acquisition system, the two events will not be generated at the same time.

The idea is then to follow the activity of a single pixel. If p_{2} is the pixel to be monitored in the left retina C_{2}, we can estimate which pixels of the right retina C_{1} are active in a temporal interval around each activation of p_{2}. This activation monitoring provides a coactivation probability measure for each pixel of C_{1}. The probability depends on the geometric link between the pixels' lines of sight: pixels that see the same thing at the same time have a high probability of being coactive. The coactivation probability map of all the pixels of C_{1} therefore contains geometrical information on F. Pixels of C_{1} that tend to be active with p_{2} must lie along the epipolar line Fp_{2}.
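The activation monitoring described above can be sketched as a simple counting procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `coactivation_map`, the `(t, x, y)` event format, the microsecond timestamps, and the window half-width are all hypothetical, and the probability is estimated as the fraction of activations of the monitored pixel during which a given pixel of the other retina fired at least once.

```python
from collections import Counter

def coactivation_map(trigger_times, other_events, half_window_us=100):
    """Estimate coactivation probabilities for pixels of the other retina.

    trigger_times: timestamps (us) at which the monitored pixel p2 fired.
    other_events: list of (t, x, y) events from the other retina C1.
    half_window_us: assumed half-width of the temporal interval.
    Returns {(x, y): fraction of triggers with activity at that pixel}.
    """
    counts = Counter()
    for t0 in trigger_times:
        # Count each pixel at most once per activation of p2.
        active = {(x, y) for t, x, y in other_events
                  if abs(t - t0) <= half_window_us}
        counts.update(active)
    n = len(trigger_times)
    return {pixel: c / n for pixel, c in counts.items()} if n else {}

# Hypothetical event data.
p2_times = [1000, 5000, 9000]
c1_events = [(1010, 3, 4), (1020, 3, 4), (4990, 3, 4),
             (5050, 7, 7), (5200, 7, 7), (9005, 3, 4), (20000, 1, 1)]
prob = coactivation_map(p2_times, c1_events)
```

In this toy data, pixel (3, 4) is active around every activation of the monitored pixel and so receives the highest value, while pixels that fire only outside the windows do not appear in the map. The brute-force scan over all events is kept for clarity; a real event stream would call for a sorted, windowed search.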
The experimental results of monitoring clearly show that the highest coactivation probabilities lie on a line that corresponds to the epipolar line of the left retina's selected pixel (see Figure 4).

Figure 4.

It is interesting to note that the fundamental matrix emerges implicitly from the coactivation of pixels. There is no need to define or match any pattern in the observed scenes. This is an unusual way of estimating the epipolar geometry. The fundamental matrix is the result of temporal coactivation of pixel activities and can therefore be considered a Hebbian fundamental matrix. This process can be performed for all the pixels of C_{1}, thus providing the geometric link between the two retinas.

Once the epipolar geometry has been estimated, it is possible to start the event-matching step. Although exact or near-exact timing alone cannot be used to discriminate the matches, it is possible to define a time window in which true matches are more likely to occur. The idea is to consider the candidate events occurring within the same time window and to use their distance to the corresponding epipolar line. F links C_{2} to C_{1}, and a point p^{2}_{1} appearing in retina C_{2} provides a line l_{21} in C_{1} using Fp^{2}_{1}=l_{21}. The epipolar line l_{21} contains all possible matches of the event occurring at the spatial location p^{2}_{1} in C_{2} (see Figure 5).

Figure 5.

We implemented the stereo matching algorithm in an open-source Java software project. We generated disparity maps from the events by moving a pen back and forth at three different distances from the two retinas, which were separated by about 10cm. The computed disparity is color-coded from blue to red for results varying between 0 and 127 pixels, respectively. The default value for unprocessed background pixels is 0. The matching used a time window of 1ms and a maximum distance to the epipolar line of 1 pixel. The disparity is a decreasing function of the depth.
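The matching rule with its two parameters (a 1ms time window and a maximum distance of 1 pixel to the epipolar line) can be sketched as follows. This is an illustrative sketch in Python rather than the Java implementation mentioned above; the function name `match_event`, the `(t, x, y)` event format with microsecond timestamps, and the decision to leave an event unmatched when several candidates pass both tests are assumptions. The example matrix is again the ideal rectified-pair matrix, used only so the result can be checked by hand.

```python
import math

def match_event(event, candidates, F, window_us=1000, max_dist=1.0):
    """Return the unique candidate matching `event`, or None.

    event: (t, x, y) from retina C2.
    candidates: list of (t, x, y) events from retina C1.
    F: 3x3 fundamental matrix (nested lists) linking C2 to C1.
    A candidate passes if it falls inside the time window and lies
    within max_dist pixels of the epipolar line l = F p.
    """
    t0, x0, y0 = event
    p = (x0, y0, 1)
    a, b, c = (sum(F[i][j] * p[j] for j in range(3)) for i in range(3))
    norm = math.hypot(a, b)
    if norm == 0:
        return None  # degenerate line, no usable constraint
    hits = [cand for cand in candidates
            if abs(cand[0] - t0) <= window_us
            and abs(a * cand[1] + b * cand[2] + c) / norm <= max_dist]
    # Assumed policy: ambiguous or empty results are left unmatched.
    return hits[0] if len(hits) == 1 else None

F_RECTIFIED = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]  # illustrative only
event_c2 = (1000, 40, 25)
events_c1 = [(1200, 52, 25), (1100, 60, 30), (5000, 50, 25)]
match = match_event(event_c2, events_c1, F_RECTIFIED)
```

Here only the first candidate is both close in time and on the epipolar line, so it is returned. Adding a second candidate that also satisfies both constraints makes the function return `None`.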
Some pixels are left unmatched because of multiple candidate matches.

In summary, we have described an event-based stereo chain. We have shown that the asynchronous, high-temporal-resolution properties of event-based acquisition are particularly efficient in terms of computational load and accuracy. The sparse nature of the acquired signals allows exploration of new paradigms for time-oriented computer vision. The entirely event-driven stereo vision algorithm takes full advantage of the data-driven neuromorphic signal, and the combination of spatial and temporal constraints fully uses the high temporal resolution of neuromorphic retinas. Asynchronous event-based acquisition is a promising 3D technology offering as yet unexplored potential to overcome the current limitations of frame-based 3D vision. These sensors should be of great use to the robotics and computer vision communities, especially for embedded computer vision applications. This technology will attract greater interest once retinas with larger spatial resolutions are available. We are currently extending this work to multi-camera networks for real-time 3D streaming.

References
