How to build a custom embedded stereo system for depth perception
DIY Stereo System step by step
There are various 3D sensor options for developing depth perception systems, including stereo vision with cameras, lidar, and time-of-flight sensors. Each option has its strengths and weaknesses. A stereo system is typically low cost, rugged enough for outdoor use, and able to provide a high-resolution color point cloud.
There are various off-the-shelf stereo systems available on the market today. However, depending on factors such as accuracy, baseline, field of view, and resolution, system engineers sometimes need to build a custom system to address specific application requirements.
Stereo vision overview
Stereo vision is the extraction of 3D information from digital images by comparing the information in a scene from two viewpoints. The relative positions of an object in the two image planes provide information about the depth of the object from the camera.
An overview of a stereo vision system consists of the following key steps:
1. Calibration: Camera calibration comprises both intrinsic and extrinsic calibration. The intrinsic calibration determines the image center, focal length, and distortion parameters, while the extrinsic calibration determines the 3D positions of the cameras. This is a crucial step in many computer vision applications, especially when metric information about the scene, such as depth, is required. We will discuss the calibration step in detail in the Calibration section below.
2. Rectification: Stereo rectification refers to the process of reprojecting image planes onto a common plane parallel to the line between camera centers. After rectification, corresponding points lie on the same row, which greatly reduces cost and ambiguity of matching. This step is done in the code provided to build your own system.
3. Stereo matching: This refers to the process of matching pixels between the left and right images, which generates disparity images. The Semi-Global Matching (SGM) algorithm will be used in the code provided to build your own system.
4. Triangulation: Triangulation refers to the process of determining a point in 3D space given its projection onto the two images. The disparity image will be converted to a 3D point cloud.
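To make these four steps more concrete, here is a minimal Python/OpenCV sketch of the pipeline. It is an illustration only: the code provided with this article is written in C++ and uses a CUDA implementation of SGM, and the variable names (K1, D1, K2, D2, R, T) are placeholders for the calibration parameters described below.

# Minimal sketch of the stereo pipeline using OpenCV (illustration only).
# left, right: 8-bit grayscale images; K1, D1, K2, D2, R, T: calibration results.
import cv2
import numpy as np

def stereo_depth(left, right, K1, D1, K2, D2, R, T):
    h, w = left.shape[:2]

    # Rectification: reproject both image planes onto a common plane
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    left_r = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
    right_r = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)

    # Stereo matching: SGM-style matching (parameters are illustrative);
    # OpenCV returns disparity as 16-bit fixed point scaled by 16
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = matcher.compute(left_r, right_r).astype(np.float32) / 16.0

    # Triangulation: convert disparity to a 3D point cloud using the Q matrix
    points_3d = cv2.reprojectImageTo3D(disparity, Q)
    return disparity, points_3d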
Design example
Let’s go through a stereo system design example. Here are the requirements for a mobile robot application in a dynamic environment with fast-moving objects: the scene of interest is about 2 m in size, the distance from the cameras to the scene is 3 m, and the desired accuracy is 1 cm at 3 m.
You can refer to this article for more details on stereo accuracy. The depth error is given by ΔZ = Z² / (B × f) × Δd, which depends on the following factors:
Z is the range
B is the baseline
f is the focal length in pixels, which is related to the camera field-of-view and image resolution
Δd is the disparity error in pixels, which depends on the accuracy of the stereo matching
There are various design options that can fulfill these requirements. Based on the scene size and distance requirements above, we can determine the focal length of the lens for a specific sensor. Together with the baseline, we can use the formula above to calculate the expected depth error at 3 m, to verify that it meets the accuracy requirement.
Two options are shown in Figure 1: lower resolution cameras with a longer baseline, or higher resolution cameras with a shorter baseline. The first option results in a larger camera but a lower computational load, while the second option is more compact but has a higher computational load. For this application, we chose the second option, as a compact size is more desirable for the mobile robot and we can use the Quartet Embedded Solution for TX2, which has a powerful GPU onboard to handle the processing needs.
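As a quick check that this choice meets the 1 cm accuracy requirement, we can plug the numbers for the hardware described in the next section (a 12 cm baseline and 6 mm lenses on the IMX273 sensor, whose pixel pitch is 3.45 µm) into the formula above. The disparity error Δd of 0.2 pixel is an assumed typical value for SGM matching, not a measured figure.

# Back-of-the-envelope depth error check (delta_d = 0.2 px is an assumption)
Z = 3.0               # range (m)
B = 0.12              # baseline (m)
pixel_size = 3.45e-6  # IMX273 pixel pitch (m)
focal_length = 6e-3   # lens focal length (m)
f = focal_length / pixel_size           # focal length in pixels (about 1739)
delta_d = 0.2                           # assumed disparity error (pixels)
delta_Z = Z ** 2 / (B * f) * delta_d    # expected depth error (m)
print(f"depth error at {Z} m: {delta_Z * 1000:.1f} mm")  # about 8.6 mm, under 1 cm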
Hardware requirements
For this example, we mount two Teledyne FLIR Blackfly S board-level 1.6 MP cameras, which use the Sony Pregius IMX273 global shutter sensor, on a 3D-printed bar at a 12 cm baseline. Both cameras have similar 6 mm S-mount lenses. The cameras connect to the Quartet Embedded Solution for TX2 customized carrier board using two FPC cables. To synchronize the left and right cameras so they capture images at the same time, a sync cable connects the two cameras. Figure 3 shows the front and back views of our custom embedded stereo system.
Both lenses should be adjusted to focus the cameras on the range of distances your application requires. Don’t forget to tighten the screw on each lens to keep the focus.
Software requirements
a. Spinnaker
The Teledyne FLIR Spinnaker SDK comes pre-installed on your Quartet Embedded Solution for TX2. Spinnaker is required to communicate with the cameras. Please refer to the Spinnaker SDK documentation for more details.
b. OpenCV 4.5.2 with CUDA support
OpenCV version 4.5.1 or newer is required for SGM, the stereo matching algorithm we are using. Download the zip file containing the code for this article from the link below and unzip it to the StereoDepth folder:
https://flir.app.box.com/s/h4soinqjchdqson2gm1piyf745idw7zb
The script to install OpenCV is OpenCVInstaller.sh. Type the following commands in a terminal:
cd ~/StereoDepth
chmod +x OpenCVInstaller.sh
./OpenCVInstaller.sh
The installer will ask you for your admin password and then start installing OpenCV 4.5.2. It may take a couple of hours to download and build OpenCV.
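Once the installer finishes, a quick sanity check (not part of the provided scripts) is to confirm the version and CUDA support from Python:

# Verify the OpenCV build and CUDA support (run with python3)
import cv2
print(cv2.__version__)                       # should report 4.5.2
print(cv2.cuda.getCudaEnabledDeviceCount())  # should report 1 on the TX2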
Calibration
The code to grab stereo images and calibrate them can be found in the “Calibration” folder. Use the SpinView GUI to identify the serial numbers of the left and right cameras. For our setup, the right camera is the master and the left camera is the slave. Copy the master and slave camera serial numbers into grabStereoImages.cpp at lines 60 and 61. Build the executable using the following commands in a terminal:
cd ~/StereoDepth/Calibration
mkdir build
mkdir -p images/{left,right}
cd build
cmake ..
make
Print out the checkerboard pattern from www.ablwerbung.de/flir4.html and attach it to a flat surface to use as the calibration target. For best results while calibrating, set Exposure Auto to Off in SpinView and adjust the exposure so that the checkerboard pattern is clear and the white squares are not overexposed, as shown in Figure 2. After the calibration images are collected, gain and exposure can be set back to auto in SpinView.
To start collecting the images, type:
./grabStereoImages
The code should start collecting images at about 1 fps. Left images are stored in the images/left folder and right images in the images/right folder. Move the target around so that it appears in every corner of the image. You may rotate the target and take images from close up as well as from further away. By default, the program captures 100 image pairs, but this can be changed with a command line argument:
./grabStereoImages 20
This will collect only 20 pairs of images. Please note this will overwrite any images previously written in the folders.
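For reference, the sketch below shows, in much-simplified Python (PySpin), the kind of loop grabStereoImages performs: open both cameras by serial number, grab an image from each, and save the pairs into the left and right folders. It is an illustration only; the serial numbers and file names are placeholders, and the hardware-trigger setup over the sync cable that the real C++ program uses is omitted.

# Simplified illustration of grabbing image pairs with PySpin (not the provided tool).
# Placeholder serial numbers; hardware trigger configuration is omitted.
import PySpin
import cv2

MASTER_SERIAL = "00000001"  # right camera (placeholder)
SLAVE_SERIAL = "00000002"   # left camera (placeholder)

system = PySpin.System.GetInstance()
cam_list = system.GetCameras()
right_cam = cam_list.GetBySerial(MASTER_SERIAL)
left_cam = cam_list.GetBySerial(SLAVE_SERIAL)

for cam in (left_cam, right_cam):
    cam.Init()
    cam.BeginAcquisition()

for i in range(20):
    left_img = left_cam.GetNextImage(1000)   # 1 s timeout
    right_img = right_cam.GetNextImage(1000)
    cv2.imwrite(f"images/left/left_{i:03d}.png", left_img.GetNDArray())
    cv2.imwrite(f"images/right/right_{i:03d}.png", right_img.GetNDArray())
    left_img.Release()
    right_img.Release()

for cam in (left_cam, right_cam):
    cam.EndAcquisition()
    cam.DeInit()
del cam, left_cam, right_cam
cam_list.Clear()
system.ReleaseInstance()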
After collecting the images, run the calibration Python code by typing:
cd ~/StereoDepth/Calibration
python cameraCalibration.py
This will generate two files, “intrinsics.yml” and “extrinsics.yml”, which contain the intrinsic and extrinsic parameters of the stereo system. The code assumes a 30 mm checkerboard square size by default, but this can be edited if needed. At the end of the calibration, it displays the RMS reprojection error, which indicates the quality of the calibration. A typical RMS error for a good calibration is below 0.5 pixels.
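For readers curious about what a stereo checkerboard calibration involves, the outline below is a rough Python/OpenCV sketch; it is not the exact contents of cameraCalibration.py, and the board dimensions and file patterns are assumptions.

# Rough outline of a stereo checkerboard calibration with OpenCV (illustrative).
import glob
import cv2
import numpy as np

BOARD = (9, 6)   # inner corner count of the checkerboard (assumed)
SQUARE = 0.030   # 30 mm square size, matching the default in the provided script

# 3D coordinates of the checkerboard corners in the board frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("images/left/*.png")),
                  sorted(glob.glob("images/right/*.png"))):
    left = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(left, BOARD)
    ok_r, corners_r = cv2.findChessboardCorners(right, BOARD)
    if ok_l and ok_r:
        obj_pts.append(objp)
        left_pts.append(corners_l)
        right_pts.append(corners_r)

size = left.shape[::-1]  # (width, height)

# Calibrate each camera individually (intrinsics), then the pair (extrinsics)
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
rms, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
print("RMS reprojection error:", rms)  # below 0.5 pixels indicates a good calibration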
Real-time depth map
The code to calculate disparity in real time is in the “Depth” folder. Copy the serial numbers of the cameras into live_disparity.cpp at lines 230 and 231. Build the executable using the following commands in a terminal:
cd ~/StereoDepth/Depth
mkdir build
cd build
cmake ..
make
Copy the “intrinsics.yml” and “extrinsics.yml” files obtained in the calibration step to this folder. To run the real-time depth map demo, type:
./live_disparity
It displays the left camera image (the raw, unrectified image) and the depth map (our final output). The distance from the camera is color-coded according to the legend on the right of the depth map. A black region in the depth map means no disparity data was found there. Thanks to the NVIDIA Jetson TX2 GPU, it can run at up to 5 fps at a resolution of 1440 × 1080 and up to 13 fps at a resolution of 720 × 540.
To see the depth at a particular point, click on that point in the depth map and the depth will be displayed, as shown in the last example in Figure 3.
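A minimal sketch of how such a click-to-read-depth lookup can be implemented is shown below, assuming a disparity image and the Q matrix from rectification are already available (the provided live_disparity.cpp implements this in C++; the function and variable names here are illustrative).

# Minimal sketch: print metric depth at the pixel the user clicks (illustrative).
import cv2

def show_depth_with_click(depth_colormap, disparity, Q, window="depth"):
    # Convert disparity to per-pixel 3D points; the Z channel is the depth
    points_3d = cv2.reprojectImageTo3D(disparity, Q)

    def on_click(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:
            z = points_3d[y, x, 2]
            print(f"Depth at ({x}, {y}): {z:.3f} m")

    cv2.namedWindow(window)
    cv2.setMouseCallback(window, on_click)
    cv2.imshow(window, depth_colormap)
    cv2.waitKey(0)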
Summary
Using stereo vision for depth perception has the advantages of working well outdoors, providing a high-resolution depth map, and being very accessible with low-cost off-the-shelf components. Depending on the requirements, there is a variety of off-the-shelf stereo systems on the market. Should you need to develop a custom embedded stereo system, it is a relatively straightforward task with the instructions provided here.
Author
Dr. Stephen Se
Senior Research Manager