Realm VR:

Realm Suit

(Patent Pending)

The RealmSuit is designed to provide as much information as possible when using a standard RGB camera (like those found in webcams, laptops, and smartphones). It works by placing color gradients along the arms, legs, hands, and feet within the different color channels. As a simple example, detecting a high blue value with low red/green values would mean the user’s right elbow is in that location. By using separate color gradients along each arm, foot, finger, and toe, we have created a way to train pose models by minimizing the difference between the red/green/blue channels of a real image and those of a computer-generated model. This is in contrast to landmark detection, which is complex and can struggle to provide meaningful depth information. While initial models will be trained with the RealmSuit, users will eventually be able to wear whatever they want, combined with ultra-breathable gloves and socks that contain the color gradients.
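To make that training idea concrete, here is a minimal sketch of the per-channel comparison. It assumes the camera frame and a computer-generated rendering of a candidate pose are both available as H x W x 3 arrays of values in [0, 1]; the renderer itself is not shown, and the plain mean-squared difference used here is a stand-in for whatever loss the production models actually use.

    import numpy as np

    def channel_match_loss(real_frame: np.ndarray, rendered_frame: np.ndarray) -> float:
        """Per-channel difference between a camera frame and a rendered candidate pose.

        The closer the candidate pose is to the wearer's actual pose, the better the
        suit's color gradients line up and the smaller this value becomes.
        """
        assert real_frame.shape == rendered_frame.shape
        diff = real_frame - rendered_frame
        # Mean squared error computed separately per channel, then summed,
        # so the red, green, and blue gradients each contribute to the match.
        per_channel = (diff ** 2).mean(axis=(0, 1))  # three values: R, G, B
        return float(per_channel.sum())

    # Toy usage: an exact match scores 0.0; a mismatched pose scores higher.
    frame = np.random.rand(480, 640, 3)
    print(channel_match_loss(frame, frame))
    print(channel_match_loss(frame, np.random.rand(480, 640, 3)))

A pose estimate is then whichever candidate pose drives this value toward zero.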

Visualizing the Color Channels:

While we may not see much of a difference between the body parts, a computer can easily differentiate them.

Large mass of high green values? Clearly a foot. High red with a blue gradient? That’s going to be the left arm.
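As a rough sketch of that logic, the snippet below labels a patch of suit pixels from its channel statistics. The thresholds and the specific part-to-color assignments are illustrative guesses chosen only to mirror the examples above, not the suit's actual encoding.

    import numpy as np

    def coarse_part_label(patch: np.ndarray) -> str:
        """Guess which body part an N x 3 patch of suit RGB values (in [0, 1]) belongs to.

        The thresholds below are placeholders, not the real calibration.
        """
        r, g, b = patch.mean(axis=0)
        blue_spread = patch[:, 2].max() - patch[:, 2].min()  # does blue form a gradient here?

        if g > 0.6 and r < 0.3 and b < 0.3:
            return "foot"       # large mass of high green values
        if r > 0.6 and blue_spread > 0.3:
            return "left arm"   # high red with a blue gradient
        return "unknown"

    # Toy patches: flat green vs. strong red with a blue ramp.
    foot_patch = np.column_stack([np.full(100, 0.1), np.full(100, 0.8), np.full(100, 0.1)])
    arm_patch = np.column_stack([np.full(100, 0.8), np.full(100, 0.1), np.linspace(0.0, 1.0, 100)])
    print(coarse_part_label(foot_patch))  # foot
    print(coarse_part_label(arm_patch))   # left arm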

However, the RealmSuit’s real advancement isn’t just systematically differentiating body parts by color. It also uses gradients within each color channel to distinguish between different positions within each body part.

This is easily demonstrated with the green values along the arms. The green value increases as we move from elbow to wrist. This also happens in different channels along each finger, foot, and toe to provide as much information as possible to our AI models.
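Put another way, a single pixel’s green value can be read directly as a position along the forearm once the green levels at the elbow and wrist are known. The calibration values in the sketch below are made-up placeholders.

    def forearm_position(green: float, g_elbow: float = 0.2, g_wrist: float = 0.9) -> float:
        """Map a suit pixel's green value to a normalized elbow-to-wrist position.

        Returns 0.0 at the elbow and 1.0 at the wrist. The g_elbow and g_wrist
        calibration values are illustrative placeholders, not measured constants.
        """
        t = (green - g_elbow) / (g_wrist - g_elbow)
        return min(max(t, 0.0), 1.0)  # clamp to the forearm

    print(forearm_position(0.2))    # 0.0 -> at the elbow
    print(forearm_position(0.55))   # 0.5 -> mid-forearm
    print(forearm_position(0.9))    # 1.0 -> at the wrist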

Why It Matters:

Current RGB pose estimation techniques rely on complex neural networks that calculate the 3D positions of various landmarks. These networks were trained on ordinary images and do a fairly good job at their intended task. However, extracting 3D information from a single 2D image is complicated. By reducing our network’s job to matching a 2D representation of a pose against an image, we free up computational power for other tasks like face and voice detection. By training our model on your own images, you end up with a motion capture network that knows your headset, your face, your body type, and the way you move. The end result is faster, more accurate motion capture for the entire face and body on lower-end hardware.
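As a very rough illustration of that "match a 2D pose representation to the image" idea, here is a heavily simplified sketch of what a per-user fine-tuning loop could look like. The tiny pose network, the stand-in renderer, and every parameter below are invented placeholders; a real system would pair an actual renderer of the suit's color gradients with the user's own recorded frames, and only the overall render-and-compare structure is meant to carry over.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoseNet(nn.Module):
        """Tiny stand-in pose network: camera frame in, 2D pose parameters out."""
        def __init__(self, num_params: int = 34):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, num_params),
            )

        def forward(self, frame: torch.Tensor) -> torch.Tensor:
            return self.backbone(frame)

    class StandInRenderer(nn.Module):
        """Placeholder for a differentiable renderer of the suit's color gradients.

        A real system would rasterize the suit colors for a candidate pose; here a
        single linear layer maps pose parameters to a small RGB image so the sketch runs.
        """
        def __init__(self, num_params: int = 34, size: int = 32):
            super().__init__()
            self.size = size
            self.proj = nn.Linear(num_params, 3 * size * size)

        def forward(self, pose_params: torch.Tensor) -> torch.Tensor:
            img = torch.sigmoid(self.proj(pose_params))
            return img.view(-1, 3, self.size, self.size)

    def fine_tune(frames, net, renderer, steps: int = 100, lr: float = 1e-3):
        """Fit the network to a user's own frames by matching rendered suit colors
        to the camera image, channel by channel."""
        opt = torch.optim.Adam(list(net.parameters()) + list(renderer.parameters()), lr=lr)
        for _ in range(steps):
            for frame in frames:                              # frame: (1, 3, H, W) in [0, 1]
                target = F.interpolate(frame, size=(32, 32))  # match the renderer's output size
                rendered = renderer(net(frame))
                loss = F.mse_loss(rendered, target)           # per-pixel, per-channel difference
                opt.zero_grad()
                loss.backward()
                opt.step()
        return net

    # Toy usage with random frames standing in for a user's recorded images.
    dummy_frames = [torch.rand(1, 3, 64, 64) for _ in range(4)]
    fine_tune(dummy_frames, PoseNet(), StandInRenderer(), steps=2)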