Abstract
Positioning technology is the foundation of intelligent construction machinery, and the current mainstream positioning solution is simultaneous localization and mapping (SLAM), which is mainly divided into lidar SLAM and visual SLAM. Lidar is costly and prone to degradation, or even failure, in scenes with uniform environmental texture; vision sensors, in contrast, are inexpensive and capture rich environmental texture information, which effectively avoids such degradation. To reduce the localization cost of intelligent construction machinery and improve positioning accuracy, this paper builds on the VINS-Fusion stereo visual-inertial tightly coupled system framework: an improved random sample consensus (RANSAC) algorithm is used to reduce feature mismatches, and the Huber kernel function is used to constrain both the IMU residuals and the visual residuals, improving the performance of the SLAM system. Compared with the mainstream VINS-Fusion algorithm, the proposed method reduces the positioning root mean square error on the EuRoC dataset by 12.41% on average, improving positioning accuracy; experimental results in an actual scene further show that the motion trajectory of the proposed algorithm is closer to the real trajectory than that of VINS-Fusion, verifying the effectiveness of the method.
1 Introduction
Simultaneous localization and mapping (SLAM) [1,2,3] technology is widely used in autonomous driving, drones, AR/VR and other fields. SLAM is mainly divided into lidar SLAM, based on lidar, and visual SLAM, based on cameras. Lidar is not affected by changes in illumination and can accurately measure the depth of the environment, and it has been widely used in the field of autonomous driving in recent years. However, high-precision lidar is expensive, is prone to distortion during high-speed motion, and easily degrades in scenes with uniform environmental structure, such as indoor corridors. In contrast, a camera costs far less than a lidar and can make full use of environmental texture information, performing better than lidar in the scenes where lidar tends to degrade. The difficulty of visual SLAM lies in minimizing feature mismatches and the error of the pose estimation process.
The camera and the inertial measurement unit (IMU) are highly complementary. IMU measurements are not affected by environmental characteristics, and the IMU's accurate estimation of fast motion over short periods can effectively compensate for the camera's state estimation during fast motion, while camera measurements can correct the accumulated error of the IMU. The IMU can also be used to correct the point-cloud distortion caused by high-speed lidar motion. Therefore, SLAM fusing vision or lidar with inertial measurements has become a research hotspot in the SLAM field in recent years. There are usually two schemes for fusion with inertial sensors: one is filtering, based on the EKF (extended Kalman filter) or ESKF (error-state Kalman filter) [4, 5], such as MSCKF [6, 7] and FAST-LIO [8]; the other is optimization-based [9, 10], represented by VINS-Mono [11] and LIO-SAM [12].
However, current SLAM schemes still suffer from several problems:
- In scenes where the environmental texture is sparse, or rich only locally, feature extraction is difficult or the extracted feature points cluster in small regions, which reduces the operating efficiency of the system.
- Feature mismatches affect system stability.
- IMU cumulative error affects system efficiency.
In response to the above problems, this paper proposes a SLAM scheme based on the tight coupling of stereo vision and an IMU, built on the VINS-Fusion [13, 14] framework. The main innovations of this paper are as follows:
- A uniform feature extraction mechanism is proposed to ensure a uniform distribution of feature points, which effectively enhances the accuracy of the system.
- An improved RANSAC-based method is proposed to eliminate feature mismatches more effectively.
- For the back-end IMU error term, the Huber kernel function is used to keep the IMU error within an appropriate range.
The method proposed in this paper was tested on a public dataset, verifying that the system has good accuracy and robustness in various scenarios. A real vehicle platform was also built for testing, demonstrating the effectiveness of the proposed method.
2 Overview
The structure of the stereo visual-inertial fusion SLAM system proposed in this paper is shown in Fig. 1. The system starts with front-end data preprocessing, which includes feature extraction and tracking, triangulation of feature points to obtain depth information, and preintegration of the IMU measurements between two consecutive camera frames. In the back-end nonlinear optimization part, the observations extracted by the front end are used to construct error constraints, including the visual reprojection error, the IMU preintegration error, and the prior error from sliding-window marginalization. Loop detection is based on the DBoW2 [15] bag-of-words model.
2.1 Visual Processing
For each new image, FAST corner points are extracted and existing features are tracked by the KLT [16] sparse optical flow method; a minimum pixel distance between two adjacent features is enforced to achieve a uniform distribution of feature points. The improved RANSAC [17] method is used to eliminate feature mismatches: first, cross-validation is used for rough elimination, and then a two-point RANSAC method is used for fine elimination, yielding correctly matched feature points. A simplified sketch of this front end is given below.
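The following sketch shows how such a front end can be assembled with OpenCV. It is an illustration under stated assumptions, not the paper's implementation: `cv2.findFundamentalMat`'s RANSAC stands in for the paper's improved two-point RANSAC, and all thresholds are placeholders.

```python
import cv2
import numpy as np

MIN_DIST = 30  # minimum pixel spacing between features (placeholder value)

def track_front_end(prev_img, curr_img, prev_pts):
    """prev_pts: (N, 1, 2) float32 pixel coordinates from the last image."""
    # 1) KLT forward tracking plus backward cross-check (rough elimination)
    curr_pts, st, _ = cv2.calcOpticalFlowPyrLK(prev_img, curr_img, prev_pts, None)
    back_pts, st2, _ = cv2.calcOpticalFlowPyrLK(curr_img, prev_img, curr_pts, None)
    ok = (st.ravel() == 1) & (st2.ravel() == 1) & \
         (np.linalg.norm(prev_pts - back_pts, axis=2).ravel() < 1.0)
    p0, p1 = prev_pts[ok], curr_pts[ok]
    # 2) Geometric RANSAC (fine elimination); stands in for two-point RANSAC
    if len(p0) >= 8:
        _, inliers = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
        if inliers is not None:
            p0, p1 = p0[inliers.ravel() == 1], p1[inliers.ravel() == 1]
    # 3) Top up with new FAST corners, masked so features stay evenly spread
    mask = np.full(curr_img.shape[:2], 255, np.uint8)
    for x, y in p1.reshape(-1, 2):
        cv2.circle(mask, (int(x), int(y)), MIN_DIST, 0, -1)
    kps = cv2.FastFeatureDetector_create(threshold=20).detect(curr_img, mask)
    new_pts = (np.float32([k.pt for k in kps]).reshape(-1, 1, 2)
               if kps else np.empty((0, 1, 2), np.float32))
    return np.vstack([p1, new_pts])
```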
Another task in this step is the selection of key frames in the sliding window, and there are two selection criteria. The first one is to judge whether the average disparity between the current frame and the previous key frame exceeds the set threshold, and if it exceeds the threshold, it is judged as a new key frame. Another criterion is to judge whether the number of features of the current frame is lower than a certain threshold, and this frame is also regarded as a key frame if it is lower than the threshold. This criterion is to avoid the complete loss of feature tracks.
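These two criteria amount to a simple per-frame test; a minimal sketch follows, with threshold values that are illustrative assumptions rather than the paper's tuned parameters.

```python
import numpy as np

PARALLAX_THRESH = 10.0  # mean disparity (pixels) vs. last keyframe (assumed)
MIN_TRACKED = 50        # minimum surviving feature tracks (assumed)

def is_keyframe(pts_last_kf, pts_curr, n_tracked):
    """pts_last_kf, pts_curr: (N, 2) pixel coordinates of matched features."""
    parallax = np.mean(np.linalg.norm(pts_curr - pts_last_kf, axis=1))
    return parallax > PARALLAX_THRESH or n_tracked < MIN_TRACKED
```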
2.2 IMU Preintegration
The acceleration \({\widehat{a}}_{t}\) and angular velocity \({\widehat{\omega }}_{t}\) measured by the IMU contain noise and are modeled as follows:
$$\widehat{a}_{t}={a}_{t}+{b}_{{a}_{t}}+{R}_{w}^{t}{{\text{g}}}^{w}+{n}_{a},\qquad \widehat{\omega }_{t}={\omega }_{t}+{b}_{{\omega }_{t}}+{n}_{\omega }$$
where \({a}_{t}\) and \({\omega }_{t}\) are the true values of acceleration and angular velocity; \({n}_{a}\) and \({n}_{\omega }\) are Gaussian white noise in the accelerometer and gyroscope measurements, \({n}_{a}\sim N(0,{\sigma }_{a}^{2})\), \({n}_{\omega }\sim N(0,{\sigma }_{\omega }^{2})\); \({b}_{{a}_{t}}\) and \({b}_{{\omega }_{t}}\) are the accelerometer and gyroscope biases, modeled as random walks whose derivatives are Gaussian white noise; \({R}_{w}^{t}\) is the rotation matrix from the world coordinate system to the current IMU coordinate system; and \({{\text{g}}}^{w}\) is the gravitational acceleration vector in the world coordinate system.
Taking image frame \({b}_{k}\) as a reference, the position, velocity and rotation at time \({b}_{k+1}\) obtained from the IMU measurements are modeled as follows [18]:
$${R}_{w}^{{b}_{k}}{p}_{{b}_{k+1}}^{w}={R}_{w}^{{b}_{k}}\left({p}_{{b}_{k}}^{w}+{v}_{{b}_{k}}^{w}\Delta {t}_{k}-\frac{1}{2}{{\text{g}}}^{w}\Delta {t}_{k}^{2}\right)+{\alpha }_{{b}_{k+1}}^{{b}_{k}}$$
$${R}_{w}^{{b}_{k}}{v}_{{b}_{k+1}}^{w}={R}_{w}^{{b}_{k}}\left({v}_{{b}_{k}}^{w}-{{\text{g}}}^{w}\Delta {t}_{k}\right)+{\beta }_{{b}_{k+1}}^{{b}_{k}}$$
$${q}_{w}^{{b}_{k}}\otimes {q}_{{b}_{k+1}}^{w}={\gamma }_{{b}_{k+1}}^{{b}_{k}}$$
where
$${\alpha }_{{b}_{k+1}}^{{b}_{k}}=\iint_{t\in [{t}_{k},{t}_{k+1}]}{R}_{t}^{{b}_{k}}\left({\widehat{a}}_{t}-{b}_{{a}_{t}}\right)d{t}^{2},\quad {\beta }_{{b}_{k+1}}^{{b}_{k}}=\int_{t\in [{t}_{k},{t}_{k+1}]}{R}_{t}^{{b}_{k}}\left({\widehat{a}}_{t}-{b}_{{a}_{t}}\right)dt,\quad {\gamma }_{{b}_{k+1}}^{{b}_{k}}=\int_{t\in [{t}_{k},{t}_{k+1}]}\frac{1}{2}\Omega \left({\widehat{\omega }}_{t}-{b}_{{\omega }_{t}}\right){\gamma }_{t}^{{b}_{k}}dt$$
and \(\Omega (\omega )\) maps the angular velocity to the quaternion derivative. Since the IMU measurement frequency is higher than that of the camera, there are multiple inertial measurements in the time interval \(\Delta {t}_{k}\in \left[{t}_{k},{t}_{k+1}\right]\); the position \({p}_{{b}_{k+1}}^{{b}_{k}}\), velocity \({v}_{{b}_{k+1}}^{{b}_{k}}\) and rotation change \({q}_{{b}_{k+1}}^{{b}_{k}}\) of the IMU over this interval are therefore preintegrated: \({\alpha }_{{b}_{k+1}}^{{b}_{k}}\) is the position preintegration term, \({\beta }_{{b}_{k+1}}^{{b}_{k}}\) the velocity preintegration term, and \({\gamma }_{{b}_{k+1}}^{{b}_{k}}\) the rotation preintegration term. Through the above formulas, the continuously changing state can be obtained using frame \({b}_{k}\) as the reference.
When the estimate of the IMU bias changes only slightly, the preintegration measurements are corrected by the following first-order approximation, which yields more accurate preintegration values without repeatedly re-propagating the IMU measurements and thus saves considerable computing resources in the optimization-based algorithm:
$${\alpha }_{{b}_{k+1}}^{{b}_{k}}\approx {\widehat{\alpha }}_{{b}_{k+1}}^{{b}_{k}}+{J}_{{b}_{a}}^{\alpha }\delta {b}_{{a}_{k}}+{J}_{{b}_{\omega }}^{\alpha }\delta {b}_{{\omega }_{k}},\quad {\beta }_{{b}_{k+1}}^{{b}_{k}}\approx {\widehat{\beta }}_{{b}_{k+1}}^{{b}_{k}}+{J}_{{b}_{a}}^{\beta }\delta {b}_{{a}_{k}}+{J}_{{b}_{\omega }}^{\beta }\delta {b}_{{\omega }_{k}},\quad {\gamma }_{{b}_{k+1}}^{{b}_{k}}\approx {\widehat{\gamma }}_{{b}_{k+1}}^{{b}_{k}}\otimes \left[\begin{array}{c}1\\ \frac{1}{2}{J}_{{b}_{\omega }}^{\gamma }\delta {b}_{{\omega }_{k}}\end{array}\right]$$
where \(J\) denotes the Jacobian matrix of the corresponding preintegration term with respect to the bias to be optimized.
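A minimal mid-point propagation loop for the preintegration terms might look as follows; this is a sketch under assumed data layouts (a real implementation also propagates the covariance and the bias Jacobians \(J\) used in the correction above):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate(imu_samples, ba, bg):
    """imu_samples: list of (dt, acc, gyr) between frames b_k and b_k+1;
    ba, bg: current accelerometer / gyroscope bias estimates (3-vectors)."""
    alpha = np.zeros(3)   # position preintegration term
    beta = np.zeros(3)    # velocity preintegration term
    gamma = R.identity()  # rotation preintegration term
    prev = None
    for dt, acc, gyr in imu_samples:
        if prev is None:  # first sample only seeds the mid-point buffer
            prev = (acc, gyr)
            continue
        acc0, gyr0 = prev
        w = 0.5 * (gyr0 + gyr) - bg                # mid-point angular rate
        gamma_new = gamma * R.from_rotvec(w * dt)
        # mid-point acceleration rotated into the b_k frame, bias removed
        a = 0.5 * (gamma.apply(acc0 - ba) + gamma_new.apply(acc - ba))
        alpha += beta * dt + 0.5 * a * dt * dt
        beta += a * dt
        gamma = gamma_new
        prev = (acc, gyr)
    return alpha, beta, gamma
```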
3 Tightly Coupled Stereo VIO
Once the visual and IMU measurements have been preprocessed and the system has been initialized successfully, tightly coupled nonlinear optimization over the sliding window can be performed, as shown in Fig. 2.
3.1 Optimization Variables
All variables \(\chi \) to be optimized in the sliding window are defined as follows:
$$\chi =\left[{x}_{0},{x}_{1},\cdots ,{x}_{n},{\lambda }_{0},{\lambda }_{1},\cdots ,{\lambda }_{m}\right],\qquad {x}_{k}=\left[{p}_{{b}_{k}}^{w},{v}_{{b}_{k}}^{w},{q}_{{b}_{k}}^{w},{b}_{a},{b}_{g}\right],\quad k\in \left[0,n\right]$$
where \({x}_{k}\) is the state of the IMU at the kth camera frame, including the position, velocity and rotation of the IMU in the world coordinate system at that time, as well as the IMU's accelerometer bias and gyroscope bias; n is the total number of key frames in the sliding window; m is the number of feature points in the sliding window; and \({\lambda }_{i}\) is the inverse of the depth at which the ith feature point was first observed.
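For concreteness, the composition of \(\chi \) can be sketched as a small container; the layout below is a hypothetical illustration, not the paper's data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameState:           # x_k for one key frame
    p: np.ndarray           # position in the world frame, shape (3,)
    v: np.ndarray           # velocity in the world frame, shape (3,)
    q: np.ndarray           # orientation quaternion, shape (4,)
    ba: np.ndarray          # accelerometer bias, shape (3,)
    bg: np.ndarray          # gyroscope bias, shape (3,)

@dataclass
class SlidingWindowState:   # chi
    frames: list            # n + 1 FrameState entries
    inv_depths: np.ndarray  # m inverse depths, one per feature point
```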
The state variables are optimized by minimizing the sum of all residuals in an overall optimization model built over the sliding window:
$$\underset{\chi }{\mathrm{min}}\left\{{\Vert {r}_{p}-{H}_{p}\chi \Vert }^{2}+\sum_{k\in B}\rho \left({\Vert {r}_{B}({\widehat{z}}_{{b}_{k+1}}^{{b}_{k}},\chi )\Vert }_{{P}_{{b}_{k+1}}^{{b}_{k}}}^{2}\right)+\sum_{(l,j)\in C}\rho \left({\Vert {r}_{C}({\widehat{z}}_{l}^{{C}_{j}},\chi )\Vert }_{{P}_{l}^{{C}_{j}}}^{2}\right)\right\}$$
where \(\rho \) is the Huber norm,
$$\rho (s)=\left\{\begin{array}{ll}s, & s\le 1\\ 2\sqrt{s}-1, & s>1\end{array}\right.$$
\({r}_{B}({\widehat{z}}_{{b}_{k+1}}^{{b}_{k}},\chi )\) and \({r}_{C}({\widehat{z}}_{l}^{{C}_{j}},\chi )\) are the residual terms of the IMU and visual measurements, respectively; B is the set of all IMU measurements, and C is the set of feature points within the current sliding window. \(\left\{{r}_{p},{H}_{p}\right\}\) is the prior information from marginalization, used to construct the prior residual term. The entire nonlinear optimization problem is solved with the Ceres solver [19].
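The paper solves this problem with the Ceres solver in C++; the effect of the Huber kernel can nevertheless be illustrated with a small SciPy toy problem (the 2-D state and measurements below are fabricated purely for illustration):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
true_x = np.array([1.0, 2.0])
meas = true_x + 0.05 * rng.standard_normal((20, 2))
meas[0] += 5.0  # one gross outlier, analogous to a feature mismatch

def residuals(x):
    return (meas - x).ravel()  # stacked residuals, analogous to r_B and r_C

plain = least_squares(residuals, np.zeros(2))                 # quadratic loss
robust = least_squares(residuals, np.zeros(2), loss='huber',  # Huber kernel
                       f_scale=0.1)
print("quadratic:", plain.x)   # dragged toward the outlier
print("huber:    ", robust.x)  # stays close to true_x
```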
3.2 IMU Residual Block
The residual term of the IMU measurement between two consecutive frames \({b}_{k}\) and \({b}_{k+1}\) within the sliding window is constructed from the IMU preintegration:
$${r}_{B}({\widehat{z}}_{{b}_{k+1}}^{{b}_{k}},\chi )=\left[\begin{array}{c}{R}_{w}^{{b}_{k}}\left({p}_{{b}_{k+1}}^{w}-{p}_{{b}_{k}}^{w}-{v}_{{b}_{k}}^{w}\Delta {t}_{k}+\frac{1}{2}{{\text{g}}}^{w}\Delta {t}_{k}^{2}\right)-{\widehat{\alpha }}_{{b}_{k+1}}^{{b}_{k}}\\ {R}_{w}^{{b}_{k}}\left({v}_{{b}_{k+1}}^{w}-{v}_{{b}_{k}}^{w}+{{\text{g}}}^{w}\Delta {t}_{k}\right)-{\widehat{\beta }}_{{b}_{k+1}}^{{b}_{k}}\\ 2{\left[{\left({q}_{{b}_{k}}^{w}\right)}^{-1}\otimes {q}_{{b}_{k+1}}^{w}\otimes {\left({\widehat{\gamma }}_{{b}_{k+1}}^{{b}_{k}}\right)}^{-1}\right]}_{xyz}\\ {b}_{{a}_{k+1}}-{b}_{{a}_{k}}\\ {b}_{{\omega }_{k+1}}-{b}_{{\omega }_{k}}\end{array}\right]$$
where \({[\cdot ]}_{xyz}\) denotes extracting the vector part of a quaternion.
3.3 Visual Residual Block
Assuming that the observations of feature point l on the normalized planes of frame i and frame j are \({p}_{l}^{{C}_{i}}({u}_{{C}_{i}},{v}_{{C}_{i}})\) and \({p}_{l}^{{C}_{j}}({u}_{{C}_{j}},{v}_{{C}_{j}})\) respectively, the visual reprojection error is constructed by projecting \({p}_{l}^{{C}_{i}}\) into frame j:
$${r}_{C}({\widehat{z}}_{l}^{{C}_{j}},\chi )={\overline{p} }_{l}^{{C}_{j}}-{p}_{l}^{{C}_{j}},\qquad {\overline{p} }_{l}^{{C}_{j}}={T}_{b}^{c}\,{T}_{w}^{{b}_{j}}\,{T}_{{b}_{i}}^{w}\,{T}_{c}^{b}\,\frac{1}{{\lambda }_{l}}{p}_{l}^{{C}_{i}}$$
where \({\overline{p} }_{l}^{{C}_{j}}\) is the projection of \({p}_{l}^{{C}_{i}}\) into the jth frame, T denotes the transformation matrices between the coordinate systems involved, and \({\lambda }_{l}\) is the inverse depth of feature point l at its first observation.
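Under assumed 4 × 4 homogeneous transforms (\(T_{bc}\): camera-to-body extrinsic, \(T_{wb}\): body-to-world poses), the residual can be sketched as follows; the names are illustrative, not the paper's API.

```python
import numpy as np

def reprojection_residual(p_ci, p_cj, inv_depth, T_bc, T_wb_i, T_wb_j):
    """p_ci, p_cj: normalized-plane observations (u, v) of feature l
    in frames i and j; inv_depth: lambda_l from its first observation."""
    # lift the frame-i observation to 3-D using depth 1 / lambda_l
    pt_ci = np.array([p_ci[0], p_ci[1], 1.0]) / inv_depth
    pt_h = np.append(pt_ci, 1.0)  # homogeneous coordinates
    # camera_i -> body_i -> world -> body_j -> camera_j
    pt_cj = (np.linalg.inv(T_bc) @ np.linalg.inv(T_wb_j)
             @ T_wb_i @ T_bc @ pt_h)
    pred = pt_cj[:2] / pt_cj[2]   # back onto the normalized plane
    return pred - np.asarray(p_cj)
```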
3.4 Marginalization
In order to reduce the computational complexity of sliding-window-based VIO optimization, a marginalization strategy is adopted in which the marginalized measurements are converted into prior information, keeping the number of key frames in the sliding window at the set value.

As shown in Fig. 3, when a new frame arrives, if the latest frame in the sliding window is a key frame, the oldest frame in the sliding window and its corresponding measurements are marginalized out. Otherwise, the latest frame in the current sliding window is marginalized while its IMU measurements are retained, and the current frame enters the sliding window as the new latest frame.

Marginalization is implemented with the Schur complement [20]; prior residuals constructed from all marginalized measurements further constrain the state variables, as sketched below.
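Numerically, folding the marginalized block of a Gauss-Newton system \(H\delta x=b\) into a prior on the remaining states reduces to the following sketch (toy dense matrices; a real implementation works with the structured sparse system):

```python
import numpy as np

def marginalize(H, b, m):
    """H: (n, n) information matrix; b: (n,) information vector;
    m: number of leading states to marginalize out."""
    Hmm, Hmr = H[:m, :m], H[:m, m:]
    Hrm, Hrr = H[m:, :m], H[m:, m:]
    Hmm_inv = np.linalg.inv(Hmm)          # a pseudo-inverse in practice
    H_prior = Hrr - Hrm @ Hmm_inv @ Hmr   # Schur complement
    b_prior = b[m:] - Hrm @ Hmm_inv @ b[:m]
    return H_prior, b_prior               # information form of {r_p, H_p}
```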
4 Experimental Results
In order to evaluate the accuracy and robustness of the proposed system, the algorithm was tested and verified on a public dataset and on a self-built real vehicle platform. The EuRoC public dataset [21] is used to compare the stereo visual-inertial fusion algorithm VINS-Fusion with the algorithm in this paper, and the self-built real vehicle platform is used to verify the effectiveness of the proposed algorithm.
4.1 Dataset Experiment
To verify the localization accuracy of the proposed algorithm, four sequences of the EuRoC visual-inertial dataset were used for testing: MH_04_difficult, MH_05_difficult, V1_03_difficult and V2_03_difficult. The data were collected with sensors mounted on a UAV and include stereo images, IMU measurements, and ground truth. In this experiment, the proposed algorithm is compared with the VINS-Fusion algorithm; the results are shown in Table 1.
It can be seen from Table 1 that the relative pose error (RPE) of the algorithm proposed in this paper is smaller than that of VINS-Fusion on the four selected sequences, and the root mean square error (RMSE) is reduced by 12.41% on average, which proves the effectiveness and robustness of the algorithm in this paper.
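For reference, the RMSE in Table 1 corresponds to the following computation, assuming the estimated and ground-truth trajectories have already been time-associated and aligned (e.g. with the evo toolbox):

```python
import numpy as np

def translation_rmse(est, gt):
    """est, gt: (N, 3) aligned position sequences."""
    err = np.linalg.norm(est - gt, axis=1)  # per-pose translation error
    return np.sqrt(np.mean(err ** 2))

# average improvement as reported: 100 * (rmse_vins - rmse_ours) / rmse_vins
```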
Figure 4 compares the trajectories of the proposed algorithm and VINS-Fusion on the above four sequences. In the figure, ground truth is the real motion trajectory provided by the EuRoC dataset, and Ours denotes the method of this paper. As can be seen from Fig. 4, the trajectory of the proposed algorithm is closer to the real trajectory, with smaller error.
4.2 Real Scene Experiment
To verify the effectiveness of the proposed algorithm, experiments were carried out in an actual scene. The experimental platform is a crawler chassis equipped with a ZED2i stereo camera and GNSS, as shown in Fig. 5. The camera, which has a built-in 6-axis IMU, is used to collect images and IMU data; GNSS data are recorded at the same time to produce the ground truth for the real scene. To facilitate the comparison and ensure that the two algorithms run on the same trajectory, the comparative experiments are conducted on recorded datasets.

The experimental environment (Fig. 6) is a daytime scene with good lighting conditions. The crawler chassis test platform drives along a random path in the open space inside the electromechanical test building while the raw camera images, raw IMU data and GNSS positioning data are collected. The total length of the path is about 113 m.

Figure 7(a) compares the trajectories of the proposed method and VINS-Fusion on the recorded real-world dataset against the GNSS ground-truth trajectory. It can be seen that the overall positioning of the proposed algorithm is slightly better than that of VINS-Fusion and basically coincides with the real trajectory. Figure 7(b), (c) show the relative pose errors of the improved method and the original algorithm in the actual environment, and Table 2 compares their RPE in the real scene. The error of the proposed algorithm is clearly lower than that of the original algorithm, which again confirms the effectiveness of the method.
5 Conclusion
Based on the VINS-Fusion system framework, this paper uses an improved RANSAC algorithm to reduce mismatching in the feature tracking step, iterating only over matched points whose Hamming distance is below a set threshold. In the sliding-window-based nonlinear optimization, the Huber kernel function is applied to both the IMU residuals and the visual residuals to suppress large residual terms. Comparative tests were carried out on the public EuRoC dataset; the results show that the positioning root mean square error of the improved method is reduced by 12.41% on average compared with the original algorithm, effectively improving the accuracy of the SLAM system. Experiments in an actual scene also verify the effectiveness of the method.
References
Yousif K, Bab-Hadiashar A, Hoseinnezhad R (2015) An overview to visual odometry and visual SLAM: applications to mobile robotics. Intell Indus Syst 1(4):289–311
Liu HM, Zhang GF, Bao HJ (2016) A survey of simultaneous localization and mapping methods based on monocular vision. J Comput Aided Des Graph 28(06):855–868
Bresson G, Alsayed Z, Yu L, Glaser S (2017) Simultaneous localization and mapping: a survey of current trends in autonomous driving. IEEE Trans Intell Veh 2(3):194–220
Li XJ, Li JF, Zhu M, Peng NL, Zuo S (2021) Research on location fusion and verification algorithm based on unscented Kalman filter. Automot Eng 43(06):825–832
Ye HY, Chen YY, Liu M (2019) Tightly coupled 3D lidar inertial odometry and mapping. In: International Conference on Robotics and Automation (ICRA), Montreal, Canada, pp 3144–3150
Mourikis AI, Roumeliotis SI (2007) A multi-state constraint Kalman filter for vision-aided inertial navigation. In: Proceedings of the IEEE International Conference on Robotics and Automation, Rome, Italy, pp 3565–3572
Sun K, Mohta K, Pfrommer B, Watterson M, Liu S, Mulgaonkar Y, Taylor CJ, Kumar V (2018) Robust stereo visual inertial odometry for fast autonomous flight. IEEE Robot Autom Lett 3(2):965–972
Xu W, Zhang F (2021) FAST-LIO: a fast, robust LiDAR-inertial odometry package by tightly-coupled iterated Kalman filter. IEEE Robot Autom Lett 6(2):3317–3324
Leutenegger S, Lynen S, Bosse M, Siegwart R, Furgale P (2015) Keyframe-based visual-inertial odometry using nonlinear optimization. Int J Robot Res 34(3):314–334
Loeliger HA, Dauwels J, Hu J, Korl S, Ping L, Kschischang FR (2007) The factor graph approach to model-based signal processing. Proc IEEE 95(6):1295–1322
Qin T, Li PL, Shen SJ (2018) VINS-Mono: a robust and versatile monocular visual-inertial state estimator. IEEE Trans Robot 34(4):1004–1020
Shan TX, Englot B, Meyers D, Wang W, Ratti C, Rus D (2020) LIO-SAM: tightly-coupled lidar inertial odometry via smoothing and mapping. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, pp 5135–5142
Qin T, Pan J, Cao SZ, Shen SJ (2019) A general optimization-based framework for local odometry estimation with multiple sensors. arXiv preprint arXiv:1901.03638
Qin T, Cao S, Pan J, Shen SJ (2019) A general optimization-based framework for global pose estimation with multiple sensors. arXiv preprint arXiv:1901.03642
Gálvez-López D, Tardós JD (2012) Bags of binary words for fast place recognition in image sequences. IEEE Trans Robot 28(5):1188–1197
Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp 674–679
Hartley R, Zisserman A (2003) Multiple view geometry in computer vision. Cambridge University Press, Cambridge
Shen SJ, Michael N, Kumar V (2015) Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs. In: IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, pp 5303–5310
Agarwal S, Mierle K (2012) Ceres solver. http://ceres-solver.org
Sibley G, Matthies L, Sukhatme G (2010) Sliding window filter with application to planetary landing. J Field Robot 27(5):587–608
Burri M, Nikolic J, Gohl P, Schneider T, Rehder J, Omari S, Achtelik MW, Siegwart R (2016) The EuRoC micro aerial vehicle datasets. Int J Robot Res 35(10):1157–1163
Acknowledgements
This work was supported by the Key projects of natural science foundation of Fujian Province (2021J02013), Fujian University industry university research joint innovation project plan (2022H6007) and Industry Cooperation of Major Science and Technology Project of Fujian Province (2022H6028).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)