
ARCore and ARKit, What is under the hood: SLAM (Part 2)

In our last blog post (part 1), we took a look at how algorithms detect keypoints in camera images. These form the basis of our world tracking and environment recognition. But for Mixed Reality, that alone is not enough. We also have to be able to calculate the device's 3D position in the real world. This is often calculated from the spatial distance between the device and multiple keypoints. The process is called Simultaneous Localization and Mapping (SLAM), and it is responsible for all the world tracking we see in ARCore/ARKit.

What we will cover today:

  • How ARCore and ARKit do their SLAM/Visual Inertial Odometry
  • Whether we can D.I.Y our own SLAM with reasonable accuracy to understand the process better

Sensing the world: as a computer

When we start any augmented reality application, on mobile or elsewhere, the first thing it tries to do is detect a plane. When you first start an MR app in ARKit or ARCore, the system doesn't know anything about its surroundings. It starts processing data from the camera and pairs it up with data from other sensors.
Once it has that data, it tries to do the following two things:
  1. Build a point cloud mesh of the environment by building a map
  2. Assign a relative position of the device within that perceived environment
From our previous article, we know it's not always easy to build this map from unique feature points and maintain it. However, it becomes easier in certain scenarios if you have the freedom to place beacons at known locations. This is something we did at Mozfest 2016, when Mozilla still had the Magnets project, which we used as our beacons. A similar approach is used in a few museums to provide turn-by-turn navigation to points of interest as their indoor navigation system. However, Augmented Reality systems don't have this luxury.

A little saga about relationships

We will start with a map... about relationships. Or rather "A Stochastic Map For Uncertain Spatial Relationships" by Smith et al.
In the real world, every object has one exact location. In the AR world, however, we never know that location precisely. To understand the problem, let's assume we are in an empty room, our mobile has detected a reliable unique anchor (A) (this could also be a stationary beacon), and our own position is at (B).
In a perfect situation, we know the distance between A and B, and if we want to move towards a third point C, we can infer exactly how we need to move.

Unfortunately, in the world of AR and SLAM we need to work with imprecise knowledge about the position of A and C. This results in uncertainties and the need to continually correct the locations. 

The points have a relative spatial relationship with each other, and that allows us to get a probability distribution over every possible position. Some of the common methods to deal with the uncertainty and correct positioning errors are the Kalman filter (this is what we used at Mozfest), Maximum a Posteriori (MAP) estimation and Bundle Adjustment.
Since these estimates are not perfect, every new sensor reading also has to update the estimation model.
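To make the "every new sensor update refines the estimate" idea concrete, here is a minimal one-dimensional Kalman filter sketch. It tracks only the distance between the device (B) and the anchor (A). The measurement values, the noise variances and the function names are assumptions I made up for illustration; this is not what ARCore, ARKit or our Mozfest setup actually ran.

```python
def kalman_update(mean, var, measurement, meas_var):
    """Fuse the current estimate with one noisy measurement (1D Kalman update)."""
    gain = var / (var + meas_var)            # how much we trust the new measurement
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var             # uncertainty shrinks after every update
    return new_mean, new_var


def kalman_predict(mean, var, motion, motion_var):
    """Apply a (noisy) motion: the estimate shifts and the uncertainty grows."""
    return mean + motion, var + motion_var


# Very vague prior: we have no idea how far away the anchor is.
distance, variance = 0.0, 1000.0

# Three hypothetical range readings to the anchor, in metres.
for reading in [5.2, 4.9, 5.1]:
    distance, variance = kalman_update(distance, variance, reading, meas_var=0.25)

# We walk roughly 1 m towards the anchor; odometry is imperfect too.
predicted_distance, predicted_variance = kalman_predict(
    distance, variance, motion=-1.0, motion_var=0.1
)

print(f"after measurements: {distance:.2f} m (variance {variance:.3f})")
print(f"after moving:       {predicted_distance:.2f} m (variance {predicted_variance:.3f})")
```

The pattern to notice is that measurements reduce the variance while motion increases it, which is why the estimate has to be corrected continually.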

Aligning the Virtual World

To map our surroundings reliably in Augmented Reality, we need to continually update our measurement data. The assumption is that every sensory input we get contains some inaccuracy. The paper "Globally Consistent Range Scan Alignment for Environment Mapping" by Lu and Milios helps us understand the issue.
Image credits: Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping
Here in figure a, we see how going from position P1 to Pn accumulates little measurement errors over time until the resulting environment map is wrong. But when we align the scans in figure b, the result is considerably improved. To do that, the algorithm keeps track of all local frame data and of the network of spatial relations among the frames.
A common problem at this point is how much data to store to keep doing the above correctly. To reduce complexity, the algorithm often reduces the number of keyframes it stores.
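A small sketch can show both effects: how tiny per-step errors add up to a large drift, and how the number of stored keyframes can be reduced. The 2% bias, the alternating noise term and the distance-based keyframe rule are all assumptions chosen to keep the example deterministic; real systems use far more careful keyframe policies.

```python
def integrate(steps, start=0.0):
    """Dead reckoning in 1D: each pose is the previous pose plus one measured step."""
    poses = [start]
    for step in steps:
        poses.append(poses[-1] + step)
    return poses


def select_keyframes(poses, min_gap):
    """Keep a pose only if it lies at least min_gap away from the last kept pose."""
    keyframes = [poses[0]]
    for pose in poses[1:]:
        if abs(pose - keyframes[-1]) >= min_gap:
            keyframes.append(pose)
    return keyframes


# The device really moves 1 m per step, 50 times.
true_steps = [1.0] * 50
# Hypothetical sensor: a constant 0.02 m bias plus a small alternating error.
measured_steps = [1.0 + 0.02 + 0.03 * (-1) ** i for i in range(50)]

truth = integrate(true_steps)
estimate = integrate(measured_steps)
drift = [abs(e - t) for e, t in zip(estimate, truth)]

keyframes = select_keyframes(estimate, min_gap=5.0)

print(f"drift after 1 step:   {drift[1]:.2f} m")
print(f"drift after 50 steps: {drift[-1]:.2f} m")
print(f"stored {len(keyframes)} keyframes instead of {len(estimate)} poses")
```

Each individual step is only a few centimetres off, yet the error at the end of the path is much larger, which is the situation shown in figure a.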

Let's build the map a.k.a SLAM

To make Mixed Reality feasible, SLAM has to handle the following challenges:
  1. Monocular Camera input
  2. Real-time
  3. Drift

Skeleton of SLAM

How do we deal with these in a Mixed Reality scene?
We start with the principles laid out by Cadena et al. in their paper "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age". From that paper, we can see that the standard architecture of SLAM looks something like this:
Image Credit: Cadena et al
If we deconstruct the diagram, we get the following four modules:
  1. Sensor: On mobiles, this is primarily the camera, augmented by the accelerometer, the gyroscope and, depending on the device, a light sensor. Apart from Project Tango enabled phones, no Android phone had a depth sensor.
  2. Front End: Feature extraction and anchor identification happen here, as we described in the previous post.
  3. Back End: Performs error correction to compensate for drift, and also takes care of localizing the pose model and of the overall geometric reconstruction.
  4. SLAM estimate: This is the result, containing the tracked features and locations.
To better understand it, we can take a look at one of the open source implementations of SLAM.
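As a rough illustration of how data flows through these four modules, here is a toy skeleton. Every class, field and function name is hypothetical, the "image" is already a dictionary of detected feature positions, and the back end uses a deliberately naive rule (the device moved opposite to the average shift of the features it already knows). It only shows the shape of the pipeline, not a real estimator.

```python
from dataclasses import dataclass, field


@dataclass
class SensorFrame:
    """Module 1, Sensor: raw data for one time step (all fields are hypothetical)."""
    timestamp: float
    detections: dict                      # feature id -> (x, y); stands in for camera pixels
    gyro: tuple = (0.0, 0.0, 0.0)         # unused in this toy, shown for completeness
    accel: tuple = (0.0, 0.0, 0.0)


@dataclass
class SlamEstimate:
    """Module 4, SLAM estimate: device position plus the map of tracked features."""
    position: tuple = (0.0, 0.0)
    landmarks: dict = field(default_factory=dict)


def front_end(frame):
    """Module 2, Front End: feature extraction (here the detections are given)."""
    return dict(frame.detections)


def back_end(estimate, features):
    """Module 3, Back End: update position and map from the extracted features."""
    known = [fid for fid in features if fid in estimate.landmarks]
    x, y = estimate.position
    if known:
        # Toy rule: the device moved opposite to the mean shift of known features.
        dx = sum(features[f][0] - estimate.landmarks[f][0] for f in known) / len(known)
        dy = sum(features[f][1] - estimate.landmarks[f][1] for f in known) / len(known)
        x, y = x - dx, y - dy
    landmarks = dict(estimate.landmarks)
    landmarks.update(features)            # remember the latest observation of each feature
    return SlamEstimate(position=(x, y), landmarks=landmarks)


def run_pipeline(frames):
    estimate = SlamEstimate()
    for frame in frames:
        estimate = back_end(estimate, front_end(frame))
    return estimate


frames = [
    SensorFrame(0.0, {1: (10.0, 10.0), 2: (20.0, 5.0)}),
    SensorFrame(0.1, {1: (9.0, 10.0), 2: (19.0, 5.0), 3: (30.0, 30.0)}),
]
result = run_pipeline(frames)
print(result.position, sorted(result.landmarks))
```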

D.I.Y SLAM: Taking a peek at ORB-SLAM

To get hands-on experience with how SLAM works, let's take a look at a recent algorithm by Mur-Artal et al. called ORB-SLAM. We will use the code of its successor, ORB-SLAM2. The algorithm is available on GitHub under GPLv3, and I found this excellent blog which goes into the nifty details of how we can run ORB-SLAM2 on our own computer. I highly encourage you to read it to avoid running into problems during setup.
His talk is also available here and is very interesting.

ORB-SLAM uses only the camera and doesn't utilize any gyroscope or accelerometer input. But the result is still impressive.

  1. Detecting Features: ORB-SLAM, as the name suggests, uses ORB to find keypoints and generate binary descriptors. Internally, ORB follows the same general approach to finding keypoints and generating binary descriptors that we discussed in part 1 for BRISK. In short, ORB-SLAM analyzes each picture to find keypoints and then stores them in a map with a reference to the keyframe they came from. These are used later to correct historical data.
  2. Keypoint > 3D landmark: The algorithm looks for new frames from the camera, and when it finds one it performs keypoint detection on it. The keypoints are then matched with those of the previous frame to get a spatial distance. This gives a good idea of where the same keypoints can be found again in a new frame, and it provides the initial camera pose estimate.
  3. Refine Camera Pose: The algorithm repeats step 2 by projecting the estimated initial camera pose into the next camera frame to search for more keypoints that correspond to the ones it already knows. If it is certain it can find them, it uses the additional data to refine the pose and correct any spatial measurement error.
Green squares = tracked keypoints. Blue boxes = keyframes. Red box = camera view. Red points = local map points.
Image credits: ORB-SLAM video by Raúl Mur Artal
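Two building blocks from the steps above are easy to sketch without any library: matching binary descriptors by Hamming distance, and projecting a known 3D landmark into the image to predict where to search for it. The 8-bit descriptors, the keypoint names, the distance limit and the camera intrinsics below are invented for readability (real ORB descriptors are 256 bits long), and this is my simplified illustration rather than ORB-SLAM2's actual code.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(prev, curr, max_distance=2):
    """Match each previous keypoint to its nearest current keypoint.

    prev and curr map a keypoint id to its descriptor. A match is kept only
    if the Hamming distance is at most max_distance.
    Returns a list of (prev_id, curr_id, distance).
    """
    matches = []
    if not curr:
        return matches
    for prev_id, prev_desc in prev.items():
        curr_id, curr_desc = min(curr.items(), key=lambda kv: hamming(prev_desc, kv[1]))
        distance = hamming(prev_desc, curr_desc)
        if distance <= max_distance:
            matches.append((prev_id, curr_id, distance))
    return matches


def project(point, focal, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates (z points forward)."""
    x, y, z = point
    if z <= 0:
        return None                       # behind the camera, not visible
    return (focal * x / z + cx, focal * y / z + cy)


# Hypothetical 8-bit descriptors from two consecutive frames.
previous_frame = {"a": 0b10110010, "b": 0b01001101}
current_frame = {"p": 0b10110011, "q": 0b01001101, "r": 0b11111111}
matches = match_descriptors(previous_frame, current_frame)

# Where should a landmark 2 m in front of the camera appear in a 640x480 image?
predicted_pixel = project((0.5, -0.25, 2.0), focal=500.0, cx=320.0, cy=240.0)

print(matches)
print(predicted_pixel)
```

The predicted pixel is what lets step 3 search only a small window of the next frame instead of the whole image.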

Returning home a.k.a Loop Closing

One of the goals of MR is that when you walk back to your starting point, the system should understand that you have returned. The inherent inaccuracy and the accumulated error make this hard to detect reliably. In SLAM this is called loop closing. ORB-SLAM handles it by defining a threshold: it tries to match the keypoints of the current frame against previously stored keyframes, and if the matching percentage against an earlier keyframe exceeds the threshold, it knows you have returned.
Loop Closing performed by the ORB-SLAM algorithm.
Image credits: Mur-Artal, R., Montiel
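The threshold idea can be sketched as follows. This follows the simplified description in this post; as far as I know, the actual ORB-SLAM implementation relies on a bag-of-words place recognition module rather than a plain match percentage. The descriptors, the 0.7 threshold and the function names are assumptions for illustration.

```python
def match_ratio(current, candidate, max_distance=2):
    """Fraction of current descriptors that have a close match in the candidate keyframe."""
    if not current:
        return 0.0
    hits = sum(
        1
        for desc in current
        if any(bin(desc ^ other).count("1") <= max_distance for other in candidate)
    )
    return hits / len(current)


def detect_loop(current, keyframes, threshold=0.7, skip_recent=1):
    """Return the index of the best matching old keyframe, or None if no loop is found.

    keyframes is a list of descriptor lists, oldest first. The most recent
    skip_recent keyframes are ignored, since they always look similar.
    """
    candidates = keyframes[: max(0, len(keyframes) - skip_recent)]
    best_index, best_ratio = None, 0.0
    for index, keyframe in enumerate(candidates):
        ratio = match_ratio(current, keyframe)
        if ratio >= threshold and ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index


# Hypothetical 8-bit descriptors of three stored keyframes, oldest first.
stored_keyframes = [
    [0b11110000, 0b00001111, 0b10101010, 0b01010101],   # the starting point
    [0b11001100, 0b00110011, 0b10011001, 0b01100110],   # somewhere else
    [0b11110001, 0b00001111, 0b10101000, 0b11111111],   # the frame we just left
]
# The current view: three of its four descriptors resemble the starting point.
current_view = [0b11110001, 0b00001111, 0b10101000, 0b11111111]

loop_index = detect_loop(current_view, stored_keyframes)
print("loop closed with keyframe", loop_index)
```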
To account for the error, the algorithm then has to propagate the coordinate correction throughout the whole map, using the updated knowledge that the loop should be closed.
The reconstructed map before (up) and after (down) loop closure.
Image credits: Mur-Artal, R., Montiel
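A very simplified way to picture this propagation: once we know the last pose should coincide with the first one, the closing error is spread back over the whole trajectory. The sketch below distributes it linearly along a made-up 2D path. This is only a stand-in for intuition; a real system solves a pose-graph optimization instead, and the coordinates here are invented.

```python
def close_loop(poses):
    """Spread the loop-closing error linearly over a 2D trajectory.

    poses[0] and poses[-1] are assumed to be the same physical place, so any
    difference between them is accumulated drift. Pose i absorbs the fraction
    i / n of that error, where n is the index of the last pose.
    """
    n = len(poses) - 1
    if n < 1:
        return list(poses)
    error_x = poses[-1][0] - poses[0][0]
    error_y = poses[-1][1] - poses[0][1]
    return [
        (x - error_x * i / n, y - error_y * i / n)
        for i, (x, y) in enumerate(poses)
    ]


# Hypothetical walk around a 1 m square; drift leaves the end 0.4 m off in x and y.
drifted = [(0.0, 0.0), (1.0, 0.1), (1.1, 1.1), (0.2, 1.2), (0.4, 0.4)]
corrected = close_loop(drifted)

for before, after in zip(drifted, corrected):
    print(before, "->", tuple(round(value, 2) for value in after))
```

The first pose stays where it is, the last pose snaps back onto it, and every pose in between is nudged a little, which is the effect visible in the before and after maps above.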

SLAM today:

Google: ARCore's documentation describes its tracking method as "concurrent odometry and mapping", which is essentially SLAM plus sensor inputs. Their patent also indicates they have included inertial sensors in the design.

Apple: Apple is also using Visual Inertial Odometry, which they acquired by buying Metaio and FlyBy. I learned a lot about what they are doing from this video from WWDC18.

Additional Read: I found the paper "A comparative analysis of tightly-coupled monocular, binocular, and stereo VINS" to be a nice read to see how different IMUs are used and compared. IMUs are the devices that provide all this inertial sensor data to our devices today. And their calibration is supposed to be crazy difficult.

I hope this post along with the previous one provides a better understanding of how our world is tracked inside ARCore/ARKit.

In a few days, I will start another blog series on how to build Mixed Reality applications, using experimental as well as some stable WebXR APIs to build the demos.
As always, feedback is welcome.

