
ARCore and ARKit: What is under the hood: Anchors and World Mapping (Part 1)

Reading Time: 7 min
Some of you know I have recently been experimenting more with WebXR than with WebVR. When we talk about mobile Mixed Reality, ARKit and ARCore play a pivotal role in mapping and understanding the environment inside our applications.

I am planning to write a series of blog posts on how you can start developing WebXR applications now, starting with the basics and then moving on to its different features. But before that, I wanted to pen down this series on how "world mapping" actually works in ARCore and ARKit, so that we have a better understanding of the Mixed Reality capabilities of the devices we will be working with.

Mapping: feature detection and anchors

Creating apps that work seamlessly with ARCore/ARKit requires a little knowledge of the algorithms that work behind the scenes, and that involves knowing about anchors.

What are anchors:

Anchors are your virtual markers in the real world. As a developer, you anchor a virtual object to the real world, and that 3D model stays glued to that physical location. Anchors also get updated over time as the system learns new information. For example, if you anchor a Pikachu 2 ft away from you and then walk towards it, and the system realises the actual distance is 2.1 ft, it will compensate for that. In real life we have a static coordinate system where every object has its own x, y, z coordinates. Anchors in devices like HoloLens override the rotation and position of the transform component.
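To make the idea concrete, here is a minimal sketch in Python. It is not the ARCore or ARKit API: the `Anchor` and `VirtualObject` classes and their fields are hypothetical names made up for illustration, and rotation is ignored to keep it short. It only shows why an object stored relative to an anchor follows along when the system corrects the anchor.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Anchor:
    """Hypothetical anchor: a world-space position the tracker may refine."""
    position: Vec3

    def update(self, refined_position: Vec3) -> None:
        # The tracking system learned more and corrects its estimate.
        self.position = refined_position


@dataclass
class VirtualObject:
    """Hypothetical object stored as an offset from its anchor."""
    anchor: Anchor
    local_offset: Vec3 = (0.0, 0.0, 0.0)

    def world_position(self) -> Vec3:
        ax, ay, az = self.anchor.position
        ox, oy, oz = self.local_offset
        return (ax + ox, ay + oy, az + oz)


# Anchor a Pikachu 2 ft in front of the camera (z axis, units in feet).
anchor = Anchor(position=(0.0, 0.0, 2.0))
pikachu = VirtualObject(anchor=anchor)
before = pikachu.world_position()

# After walking closer, the system realises the distance was really 2.1 ft.
anchor.update((0.0, 0.0, 2.1))
after = pikachu.world_position()
```

Because the object never stores its own world position, correcting the anchor is enough to move everything attached to it.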

How do anchors stick?

If we follow the Google ARCore documentation, we see that an anchor is attached to something called "trackables", which are feature points and planes in an image. Planes are essentially clustered feature points. For a more in-depth look at what Google says an ARCore anchor does, read their really nice Fundamentals. But for our purposes, we first need to understand what exactly these feature points are.

Feature Points: through the eyes of computer vision

Feature points are distinctive markers on an image that an algorithm can use to track specific things in that image. Normally any distinctive pattern, such as a T-junction or a corner, is a good candidate. On their own, such points are not distinguishable enough from each other to serve as reliable place markers in an image, so the neighbouring pixels are also analyzed and saved as a descriptor.
Now, a good anchor should have reliable feature points attached to it. The algorithm must be able to find the same physical space from different viewpoints. It should be able to accommodate changes in:
  • Camera Perspective
  • Rotation
  • Scale
  • Lighting
  • Motion blur and noise

Reliable Feature Points

This is an open research problem with multiple solutions on board, each with its own set of problems. One of the most popular algorithms, which stems from this paper by David G. Lowe in IJCV, is called Scale Invariant Feature Transform (SIFT). A follow-up work that claims even better speed, Speeded Up Robust Features (SURF) by Bay et al., was published at ECCV in 2006. Both of them are patented at this point, though.

Microsoft HoloLens doesn't really need to do all this heavy lifting, since it can rely on an extensive amount of sensor data. In particular, it gets help from the depth data of its infrared sensors. However, ARCore and ARKit don't enjoy those privileges and have to work with 2D images. Though we cannot say for sure which algorithm ARKit or ARCore actually uses, we can try to replicate the process with a patent-free algorithm to understand how it works.

D.I.Y Feature Detection and keypoint descriptors

To understand the process, we will use an algorithm by Leutenegger et al. called BRISK. Detecting features is a multi-step process. A typical algorithm adheres to the following two steps:
  1. Keypoint Detection: Detecting keypoints can be as simple as detecting corners, which essentially means evaluating the contrast between neighbouring pixels. A common way to do that in a large-scale image is to blur the image to smooth out pixel contrast variation and then run edge detection on it. The rationale is that you would normally want to detect a tree and a house as a whole to achieve reliable tracking, instead of every single twig or window. SIFT and SURF adhere to this approach. However, in a real-time scenario blurring adds a compute penalty that we don't want. In their paper "Machine Learning for High-Speed Corner Detection", Rosten and Drummond proposed a method called FAST, which analyzes the circular surrounding of each pixel p. If the brightness of the neighbouring pixels is lower or higher than that of p by more than a threshold, and a certain number of connected pixels fall into this category, then the algorithm has found a corner.
    Image credits: Rosten E., Drummond T. (2006) Machine Learning for High-Speed Corner Detection. Computer Vision – ECCV 2006.
    Now back to BRISK: out of the 16-pixel circle, 9 consecutive pixels must be brighter or darker than the central one. BRISK also uses down-sized images, allowing it to achieve better invariance to scale.
  2. Keypoint Descriptor: The primary property of the detected keypoints should be that they are unique. The algorithm should be able to find the same feature in a different image taken from a different viewpoint or under different lighting. BRISK concatenates the results of brightness comparisons between different pixels surrounding the centre keypoint into a 512-bit string.
    Image credits: S. Leutenegger, M. Chli and R. Y. Siegwart, “BRISK: Binary Robust Invariant Scalable Keypoints”, 2011 International Conference on Computer Vision
    As we can see from the sample figure, the blue dots form concentric circles and the red circles indicate the individually sampled areas. The results of the brightness comparisons are determined from these. BRISK also ensures rotational invariance, which it calculates from the largest gradients between two samples that are a long distance from each other.
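The two steps above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the real FAST or BRISK implementation: it works on a single scale, skips smoothing and rotation handling, and builds a 16-bit descriptor from opposite points on the circle instead of BRISK's 512-bit sampling pattern. The names `is_corner`, `describe` and `hamming` are made up for this example.

```python
# 16 pixel offsets (dx, dy) on a circle of radius 3 around the candidate pixel.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_corner(img, x, y, threshold=20, n=9):
    """Segment test: n consecutive ring pixels all brighter or all darker."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # 1 checks for a brighter run, -1 for a darker run
        flags = [sign * (value - center) > threshold for value in ring]
        run = 0
        # Walk the ring twice so runs that wrap around the start are counted.
        for flag in flags + flags:
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


# Toy descriptor: compare each ring pixel with the one on the opposite side.
PAIRS = [(CIRCLE[i], CIRCLE[(i + 8) % 16]) for i in range(16)]


def describe(img, x, y):
    """Pack the brightness comparison results into one integer bit string."""
    bits = 0
    for (ax, ay), (bx, by) in PAIRS:
        bit = img[y + ay][x + ax] > img[y + by][x + bx]
        bits = (bits << 1) | int(bit)
    return bits


def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


# 20x20 test image: dark background with a bright block in the lower right.
img = [[200 if (x >= 10 and y >= 10) else 0 for x in range(20)]
       for y in range(20)]
# The same scene under brighter lighting (every pixel shifted by 30).
brighter = [[value + 30 for value in row] for row in img]

corner_found = is_corner(img, 10, 10)    # tip of the bright block
edge_found = is_corner(img, 15, 10)      # straight edge, not a corner
flat_found = is_corner(img, 15, 15)      # uniform area inside the block
distance = hamming(describe(img, 10, 10), describe(brighter, 10, 10))
```

Since the descriptor only records which of two pixels is brighter, a uniform change in lighting leaves every bit unchanged, and matching two descriptors is just a cheap bit count.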

Test out the code

To test out the algorithms, we will use the reference implementations available in OpenCV. We can use pip to install it with Python bindings.

As test images, I used two pictures I had previously captured during my talks at All Things Open and a TechSpeaker meetup. One is an evening shot of the Louvre, which should have enough corners as well as overlapping edges. The other is a portrait-mode picture of my friend in a beer garden, to see how BRISK fares against the existing blur in the image.

Original Image Link

Original Image Link

Visualizing the Feature Points

We use a small code snippet to run BRISK on the two images above and visualize the feature points.

What the code does:
  1. Loads the JPEG into a variable and converts it to grayscale.
  2. Initializes BRISK and runs it. We use the 4 octaves suggested by the paper and an increased threshold of 70. This ensures we get a low number of highly reliable key points. As we will see below, we still got a lot of key points.
  3. Uses detectAndCompute() to get two arrays from the algorithm, holding the keypoints and their descriptors.
  4. Draws the keypoints at their detected positions, indicating keypoint size and orientation through circle diameter and angle. The "DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS" flag does that.

As you can see, most of the key points for the Louvre sit on the castle edges and other visible edges, and almost none are on the floor. The portrait-mode picture of my friend is more diverse, but it also shows some false positives, like the reflection on the glass. Considering how BRISK works, this is normal.


In this first part of understanding what powers ARCore/ARKit, we saw how basic feature detection works behind the scenes and tried our hand at replicating it. These spatial anchors are vital for "glueing" virtual objects to the real world. This demo and hands-on should now help you understand why it is a good idea to design your apps so that users place objects in areas where the device has a chance to create anchors.
Placing a lot of objects on a single non-textured, smooth plane may produce an inconsistent experience, since it's just too hard to detect keypoints on a plain surface (now you know why). As a result, the objects may sometimes drift away if tracking is not good enough.
If your app design encourages placing objects near the corners of a floor or table, then the app has a much better chance of working reliably. Also, Google's guide on anchor placement is an excellent read.

Their short recommendation is:
  • Release spatial anchors if you don’t need them. Each anchor costs CPU cycles, which you can save.
  • Keep the object close to the anchor.
In our next post, we will see how we can use this knowledge to do basic SLAM.

Update: The next post lives here:

