
Evoking Magic Realism with Augmented Reality Technology


Summary

Diana Hu explores how building a real-world system is largely a software engineering art, requiring choices among a set of tradeoffs.

Bio

Diana Hu is the Director of Engineering and Head of the Augmented Reality Platform at Niantic, creator of AR games like Pokemon Go, Ingress, and Harry Potter Wizards Unite. Diana leads the engineering team, building core technology that enables developers to create shared AR experiences that seamlessly blend the real with the digital. Previously, she was the Cofounder and CTO of Escher Reality.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Hu: The talk today will be about evoking magic realism with augmented reality. Just to start with a quick intro: right now, I'm the Director of Engineering and Head of the AR platform at Niantic. Prior to that, I was a co-founder and CTO of Escher Reality, a startup from the 2017 Y Combinator batch. Our company got acquired by Niantic, so now we're there. I've been working in AR for roughly three to four years. Before that, I had been building large-scale machine learning and computer vision systems for different products since 2012, and I started as a data scientist at Intel Labs and at OnCue Television, so I have experience doing recommender systems, information retrieval, and image understanding for television.

Here is a list of book covers by different authors. On the left is a book by Gabriel Garcia Marquez, a Nobel laureate in literature. He's from Colombia, and the book is about the town of Macondo and tells a lot of the history of Colombia. It tells the day-to-day life of a family, and it's a journey about solitude. The interesting thing about this book is that it tells much of the story with a journalistic, mundane approach, yet a lot of magical things happen that seem perfectly normal to the characters. For example, there are ghosts that age and flying carpets, and that just seems normal to the characters there.

Another book, which you might be familiar with, is "The Metamorphosis" by Franz Kafka. It's the story of a young workaholic who suddenly, one day, transforms into an insect and then goes through a whole philosophical meditation on his perception of reality and society. The interesting thing about this book, too, is the way the story is told: transforming into an insect is pretty weird, but the story makes it sound normal, just another thing that happened — you wake up, you've suddenly become an insect, and that's it.

There's also Haruki Murakami, the Japanese author, with his book "The Wind-Up Bird Chronicle." It's the story of Toru Okada, a kind of detective story about finding his wife's missing cat. That sounds like a very normal story as well, but what's interesting is that as the story unravels, he starts finding a lot of fantastical things happening underneath Tokyo — again, told with a very matter-of-fact storytelling approach, which makes it seem like all these occurrences could happen in our lives, even though they're outlandish.

This last book, which also became a movie, "Like Water for Chocolate" by Laura Esquivel, is the story of a woman who is forbidden to marry and who pours magic into making chocolate. Yet again, the drama in the story is very commonplace.

What do all these books and authors have in common? All of them follow the tradition of a literary movement called magic realism, which is not science fiction or fantasy. With fantasy, the main factor is that it takes place in a world entirely different from our own — "Game of Thrones," "Harry Potter," or "Star Wars" — a completely different world that doesn't seem like it could exist here. Science fiction is also not magic realism, because it describes an altered world where science has made advances far beyond what we have.

The interesting thing about magic realism is that it's fiction that takes place in our world. The stories could happen to your neighbor, or they could be stories your grandma tells you. There's an interesting introduction of a magical element, and the magical element often serves to elevate and tell the story, or to act as criticism of something happening in society — like Kafka in "The Metamorphosis" with its criticism of society and isolation.

A Magic Insight into Reality

How does this relate? Describing this magic insight into reality a bit more: it's the concept of telling the story with a deadpan expression, like journalism — "Yes, there was a flying carpet, and so what?" It's a way of telling a story matter-of-factly, weaving fantastical elements into the day-to-day world — but note that imagination is used to enrich reality, not to escape it like sci-fi or fantasy. It's still grounded in reality, because if it were just magic, it would be pure whimsy. Rooting it in reality builds a layer that illuminates and enriches the reality we experience, making it more beautiful in unexpected ways.

The main thing about the way these stories and books are told is that they capture something that isn't possible in the real world but make it so believable that you do believe magic carpets can fly or ghosts can age; they just seem commonplace because of the way they're told.

How does this relate to augmented reality? My definition of augmented reality is the matter-of-fact inclusion of fantastical elements in the physical world. The analogue of the magic is the digital: digital characters that get displayed and seamlessly blend into the physical world while still following the laws of physics and our human perception of how things behave, but with a sprinkle of the fantastical, because with digital you can represent information in much more interesting ways.

Augmenting Reality

How do you build such a reality? What are some of its components? At Niantic, we have some principles for creating magic realism in AR, and they follow the company's mission. The first is exploration of the world around us. What that means is that there are stories and adventures everywhere, just waiting to be discovered. Games like Ingress or Pokémon Go take you on an adventure in your neighborhood where you find out things you didn't know about some historical landmark, and that's an experience we set out to create. It's a different kind of AR than the common conception of AR as just digital visuals, but it's another aspect of it.

Exercise is the second aspect: we all need a bit of a nudge to move, and being embodied — following the physics and the natural rhythms of your body — adds to the suspension of disbelief for AR. The other aspect is social: we are social animals by default. We create experiences where you can engage in the real world with your friends, and ways to make new friends — not just friends in the social-network sense, but someone real, in the sense that you build a connection with them.

Throughout our journey at Niantic, we've been delighted to hear from people around the world who found not only rewarding, family-friendly, intergenerational entertainment, but also unexpected benefits from our games. Diving deeper into each of these three elements and how we built them into the games: first, the concept of exploration. In AR, we want to be matter-of-fact and follow reality. Part of reality is the diversity in the world — there's weather: it rains, it's sunny, it snows, and so on — so we want that reflected in the game. For AR, there's a feature in Pokémon Go where, when it rains, you see the rain displayed on the digital screen of your phone.

The other aspect, building virtual worlds that push us to exercise and move, capitalizes a bit on our natural rhythms of movement: the world we live in is the game board. You walk down different streets, so as you take a stroll through San Francisco, you're actually moving your digital avatar in the game with you. That's another connection where we're blending the digital with the physical, taking things we take for granted in the real world and making them work in the digital world too.

This other one, about shared and social, is the concept that the digital world should obey rules similar to the real world's in order to maintain the suspension of disbelief. What I'm going to show you on the next slide is a demo of an experience we built in AR called Codename: Neon. It shows a group of people playing together, collecting pellets in the world to gain energy orbs, and then having a game of tag, shooting energy balls at each other. The interesting thing about maintaining physics and maintaining state is that if I collect an energy orb that's here, then my friends should no longer see it, because it's a shared resource. And if I shoot an energy orb, everyone should see it, because reality as we experience it is consistent in time and space.
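To make the shared-resource idea concrete, here is a minimal sketch (in Python, and not Niantic's actual API — the class and names are hypothetical) of a shared world state where an orb pickup is resolved exactly once, so every player sees a consistent outcome:

```python
# Minimal sketch (not Niantic's API): a shared world state where picking up
# an orb is resolved once, so every player sees a consistent result.
import threading

class SharedWorldState:
    def __init__(self, orb_ids):
        self._available = set(orb_ids)   # orbs still present in the shared scene
        self._lock = threading.Lock()    # one authority resolves pickups

    def try_collect(self, orb_id, player):
        """Return True only for the first player to grab the orb."""
        with self._lock:
            if orb_id in self._available:
                self._available.remove(orb_id)
                print(f"{player} collected {orb_id}; it vanishes for everyone")
                return True
            return False

world = SharedWorldState(orb_ids={"orb-1", "orb-2"})
world.try_collect("orb-1", "alice")   # True: alice gets it
world.try_collect("orb-1", "bob")     # False: bob no longer sees it
```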

Playing the demo a little here, you can see that once people get a bit of experience, they just have fun: collecting those white pellets and then going and targeting your friends. If you're mad at them, maybe you can have a friendly fight in AR — less physical, more digital.

Making the Digital Believable

What does it take to make the digital believable, taking this lesson from the literary masters of magic realism, who are able to weave their stories so believably that you're completely absorbed when you read them? What does it take to do that now for AR?

As part of that, at Niantic we've built the Real World Platform, a set of software that all these different games are built on top of, where we enable consistent gameplay, social, mapping, and advanced AR. Underneath, a lot of this powers our games like Ingress, Pokémon Go, and the soon-to-launch Harry Potter: Wizards Unite, which will be exciting sometime later this year. This talk will mostly focus on the AR component, and to go through it I'll start with a brief overview of AR technology and the big building blocks it takes to create AR, because AR is a very interdisciplinary field that draws on many areas of computer science.

In order to make good AR, first you need to understand the world in order to augment it. That means feeding in all the different sensors — mainly, say, the camera. You need to start making sense of what it means: the semantics, labeling what it is that you see. That's the more visual part; I'll talk a bit later about understanding the world in terms of geometry, on the next slide. This whole area of understanding the world is the field of computer vision.

The second is the need for visuals: to display believable digital objects and characters, they need to blend with the real world. You need to create characters in 3D that make sense — there's a whole world of graphics, 3D animation, and all of that. The other bucket is, once you have all these components — "OK, I understand the world, now I can render some characters" — how do you create the experience? The reason augmented reality is such a good fit for gaming is that game developers have long had the natural way of creating worlds and experiences in 3D, with tools like Unity and Unreal and a couple of others for building experiences.

Diving a little deeper into this concept of understanding the world, because we're going to go into more detail on it: besides the semantics — which I think everyone gets: I see something, the sky is blue, it's a block of blue, it's the sky — there's a more abstract level of understanding, think of it as lower-level, as geometry. It's understanding that this blob-shaped thing in the world means there's something here I should not collide with, something there that's like a plane, something there that's a blob but is really a chair. You don't know whether it's a chair, but it's something you should not collide with.

A lot of this comes from the field of robotics, with algorithms like SLAM and VINS, where you take different camera positions and are able to triangulate and build a 3D understanding — stereo through time — doing feature extraction and a couple of other steps to build this representation of the world that's used for AR. Why is it needed for AR? Because in order to display the digital characters, you need to pin them to the world. How do you pin them to the world? You have to roughly know your coordinate systems, and SLAM is the ability to build that.

AR Systems for Human Perception

We talked about some of the components that make up AR. There's another component at the end that consumes all this data: humans. You have to build AR systems for human perception, and this is where it diverges a little from self-driving cars, even though a lot of the algorithms I mentioned earlier are similar, because AR systems are meant to be very interactive for humans. A couple of concepts to keep in mind: there's the famous Miller study on response times, which looked at what response times from a computer system feel good to humans. It's tied a lot to the brain and neurological signals — how fast a signal comes from the world and you're able to interpret it. In summary, what that study says is that anything less than about 100 milliseconds feels instantaneous, real-time — it feels good.

We want to be in that bucket. Anything around a second is fast enough to get a response but doesn't feel instant. Beyond 10 seconds — think staring at a loading bar — you lose the user, the page is gone, it's "OK, this is not working." Based on that, you want to design systems with a budget of less than 100 milliseconds. That's key to creating the suspension of disbelief: to make things matter-of-fact, what just works in the physical world needs to work with our senses.

The other part, on the right side — since we're focusing on visual SLAM — is understanding how long it takes the brain to process images. There's a study by Rayner on eye movements and visual encoding during scene perception, from psychology, which says the retina needs to see an image for about 80 milliseconds before that image is fully registered and understood. Another number from the same work: when you're reading text, it's about 50 to 60 milliseconds, because there's less information entropy — taking in an arbitrary image is a lot more data bandwidth than reading text, which is more conceptual. That gives a sense of the targets we're aiming for.

So we're targeting even less than 100 milliseconds — less than 80 milliseconds — and there's another constraint on top of that: rendering in video games typically runs at 30 to 60 frames per second, which gives you a budget of roughly 17 to 33 milliseconds per frame.
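As a back-of-the-envelope check on those numbers (the thresholds are the ones quoted above, not measurements of any particular system), the per-frame budgets fall out like this:

```python
# Rough budgets from the numbers in the talk (assumed values, not measurements):
# perception thresholds vs. per-frame render budgets.
PERCEPTION_INSTANT_MS = 100   # Miller: under ~100 ms feels instantaneous
RETINA_REGISTER_MS = 80       # Rayner: ~80 ms to fully register a scene image

for fps in (30, 60):
    frame_budget_ms = 1000 / fps
    print(f"{fps} fps -> {frame_budget_ms:.1f} ms per frame "
          f"(inside the {PERCEPTION_INSTANT_MS} ms perception budget)")
```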

Need for Speed as A “Matter-of-Fact”

That brings us to the big assumption and design constraint for building augmented reality systems: the need for speed is what creates the matter-of-factness of augmented reality, where we can do the kind of storytelling the authors I mentioned do. To build believable AR, we need to build AR systems that are really fast, with the additional constraint that they need to run on phones, which have tiny computation budgets, limited battery, cameras that aren't great, and fairly cheap sensors — as opposed to, say, robots or self-driving car systems, where you can afford to put a GPU in the trunk of the car. You don't want that here, because in the future when we move to headsets, you don't want to burn people's hair. So it has to be heavily optimized for a world of low computation, low power, and very fast response.

How do we do that? The approach we've taken is built on two tenets of AR system design. One is super-efficient networking: certain things can be offloaded to the server, but some can't, and even for shared multiplayer AR you want things to be as real-time as possible by default to achieve that small response time. The other is concurrent programming. I used to work at Intel, so I can say that Moore's law is kind of over — the world is moving toward many cores rather than a single big core that keeps getting faster — so we have to get comfortable taking advantage of a world where there are more cores to do processing, rather than big fat cores.

An example is the processor in the recently released iPhone, or even some Samsung devices, with a design of four big cores and four small cores. The four small cores are for fast, simple computation; the four big, fat cores are for more expensive work — so you have to be smarter about how you use them, and moving forward this is only going to increase. It's looking a bit like GPUs at the extreme end, where you have thousands of cores — which is why you can do so many amazing things with deep learning, because we've gotten better at parallelizing a lot of the computation there.

Some of the approaches I'll describe for concurrent programming have to do with lock-free designs and with a concept I'm not sure people here are familiar with, actor models, and we'll go into detail about those. First, networking. Speeding up networking: life is real-time. You don't have loading bars when you're talking to your friend, and you want the AR experience — with characters that interact with you — to also not have loading bars; it should feel natural.

One of the constraints of traditional cloud architecture is that in a lot of cloud-based applications, the machines are hosted somewhere far away — on Amazon, typically Virginia or maybe Oregon — and the round-trip latency when you're playing from somewhere else in the world is in the hundreds of milliseconds. That alone already doesn't make the cut for the sub-100-millisecond human perception number I mentioned, if you believe the number. If you're at hundreds of milliseconds and you're trying to render the AR position of your friend, you're going to render at roughly single-digit frames per second, which is pretty bad — like watching a really, really bad video that won't load from the internet. We don't want that.

How do we achieve something that can run at 30 frames per second? Before I tell you how it works, I'm going to show you that we actually got it working. What you'll see here is actual game footage of a multiplayer AR puzzle-solving game where the players are cloaked in avatars. It looks as if the avatars were following a pre-planned path, but they're really just following the current position of each phone. We're not doing anything special to track the humans; it's really just the position of the phone, and because we do it with such low latency, the avatars cloak the users quite well. It creates this amazing effect that the game design team came up with, which just looks so fun.

This is actually rendering at 60 frames per second — on the iPhone we got this working at 60 frames per second. How do we do that? In terms of design choices for networking, there are two axes. One is "real-time" networking versus non-real-time. The other is sending reliable messages over your network stack versus unreliable messages. If you choose reliable messages, that's the world of the web, where everyone has made a lot of advances and built fantastic tools — HTTP and REST — because we want documents and transactions not to be lossy.

WebSockets is an attempt to get a bit closer to real time, but it's still not good enough. It's an attempt along the lines of: we got so good at doing HTTP that we want to keep doing it for the new world. For AR that's not enough, because the packets with HTTP headers are too heavy, and you can do something better.

In the world of unreliable, real-time messages, you have UDP, which is actually old technology from the UNIX socket world. It's not new, but it's pretty fast because it gets rid of a lot of TCP's assumptions — the handshake and the coordination that goes with it. As for the other quadrant, unreliable and non-real-time: you don't want to be like the U.S. postal service. It has its uses, but not for the system we're designing. I was trying to figure out what to put there, but all I could think of was the postal service.
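To make that concrete, here is roughly what a small, unreliable, real-time pose update over UDP could look like — purely illustrative, not Niantic's protocol; the peer address and message layout are made up. One tiny fixed-size datagram is sent per frame, and a lost packet is simply superseded by the next one rather than retransmitted:

```python
# Illustrative only -- not Niantic's protocol: a tiny pose update packed into a
# fixed-size binary message and sent over UDP, trading reliability for latency.
import socket
import struct
import time

POSE_FORMAT = "<Q3f4f"          # timestamp_us, position xyz, orientation quat xyzw
PEER = ("192.0.2.10", 9999)     # hypothetical peer address (documentation range)

def send_pose(sock, position, quaternion):
    # One small datagram per frame; if it is lost, the next frame replaces it,
    # so there is no point paying TCP's handshake and retransmission cost.
    msg = struct.pack(POSE_FORMAT, int(time.time() * 1e6), *position, *quaternion)
    sock.sendto(msg, PEER)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_pose(sock, position=(0.0, 1.5, -0.3), quaternion=(0.0, 0.0, 0.0, 1.0))
```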

How do we design something that lands in the corner we want to be in? The fact is there's no magic solution — no new network protocol or anything magical out there — it's really about carefully building a combination of both. What we've done is build real-time peer-to-peer technology for AR with our own network protocols. Think of it like WebSockets, but lighter and optimized for computer vision packets. In the cloud world, if you tried to do some of this, phone one sends its current position, it goes to a cell tower, the cell tower to the cloud, the cloud back to a cell tower, and then to the other phone — and that whole round trip is in the hundreds of milliseconds. By that time, you won't see your friend cloaked in the avatar correctly; it will be a bit jarring because it will be off, showing their previous position rather than where they are now.

What we've done instead is cut the whole round trip to the cloud and just talk directly through the cell towers. This is an interesting approach because in the world of 5G, data bandwidth and speed are getting even better. There's another law here — there are all these laws of computing — Edholm's law, which says that wireless communication will at some point be as fast as wireline. The physics of data transmission allows us to get there.

There's another concept, edge computing, which is an industry push to put computation at the cell tower itself. That would be very interesting for AR for all the reasons I mentioned — you could start aggregating some of the computation and doing it at the cell tower. Right now, what we do runs on the phones and burns a bit more of your phone's battery, but later we could do less of that. That's an interesting direction the industry is moving in, and it's where we're betting. Now you've cut the round trip from hundreds of milliseconds down to tens of milliseconds, and you hit your magical budget for human response time.

Speeding up Computation

The other design consideration is speeding up computation in a world that's moving to many cores rather than bigger, fatter cores. How do we do this? Computer vision is hard — there's a lot of work going into progressing the field — and there's also the engineering; marrying the two together, we can achieve very interesting results.

We're going to stay very high level — I'm ignoring a lot of boxes here — on what a traditional augmented reality SLAM pipeline is. At a high level you have four stages. You have the raw sensor inputs, which come from your pixels and from your IMU (a gyro and accelerometer). Those go through a feature extraction box, which turns the super-high-density data — not all of which is useful — into something more useful for localizing and mapping, that is, for creating the AR map. Localizing tells you where you are, and at the same time you're building the map as you go to bootstrap the problem.

Explaining the inputs a little more, so you understand why we don't work with the raw images: for the camera, if you're working at 1080p, imagine getting a matrix of 1080 by 720 pixels times 3 for RGB. Uncompressed, that's a lot — on the order of 10^6 values — at 30 to 60 hertz. It's very hard for any system, even your wimpy phone, to process all of that all the time.

Then you have this other very high-rate data source used in these systems, the inertial measurement unit (IMU), which is basically a gyro and an accelerometer that give you rotation and acceleration in x, y, z; it's also used to tell you where you are in the world. It's lossier — as I was telling Roland, you could technically know where you are in the world from this alone if you integrate the acceleration a couple of times, but a lot of error accumulates because these are super cheap sensors.
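Here is a small sketch of why that matters, using an assumed constant accelerometer bias (the 0.05 m/s² figure is made up for illustration): integrating acceleration twice turns a tiny bias into position error that grows with the square of time.

```python
# Sketch of why raw IMU dead-reckoning drifts: integrating acceleration twice
# turns a tiny constant bias into position error that grows with time squared.
DT = 0.01            # 100 Hz IMU samples
BIAS = 0.05          # assumed 0.05 m/s^2 accelerometer bias (cheap phone sensor)

velocity, position = 0.0, 0.0
for step in range(1, 1001):           # 10 seconds of standing still
    measured_accel = 0.0 + BIAS       # true acceleration is zero
    velocity += measured_accel * DT   # first integral
    position += velocity * DT         # second integral

print(f"after 10 s of standing still, dead-reckoned drift is {position:.2f} m")
```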

As I was telling Roland, Apollo first went to the moon with just that — there were no cameras — but it was a super expensive IMU, tens of thousands of dollars, that could calculate all these numbers, because the math actually works out. Of course, we're not going to put a $50,000 sensor in a phone, but for a self-driving car you can afford more, because if we get the position a bit wrong in AR, the consequences are not terrible. Nobody has died from AR getting a position wrong, but a self-driving car is dangerous, so you do want to put more expensive sensors in cars.

Those are the inputs. The takeaway is that we don't work with the raw data; instead, we do feature extraction. Feature extraction takes this raw camera matrix — 1080 by 720 with 3 channels, or here just one channel in grayscale — and extracts the interesting features based on the texture of the scene. There are a bunch of algorithms here, and this is one of the hard parts of getting it to work reliably, because, as I was mentioning to someone, the whole SLAM problem is a bit of an open-loop problem: how do you know these are the right features, that they work, and that if the lighting changes the extractor still works? That's a whole field of computer vision right there.

You take that super-high-density matrix, and what you end up with is just a vector whose size depends on your feature configuration. Here we're pretending the vector is size four; it's actually a bit longer, maybe more like 100, but that's a lot less than 10^6. That's what you work with in the next couple of stages. The points I showed become these abstract points, and then as you build the AR map — this concept of stereo through time — you build correlations as you move through the video, matching the points you see across frames, and with that you build the AR map.
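As a rough illustration of that reduction — using OpenCV's ORB detector, which is one common feature extractor and not necessarily what Niantic uses in production, and a hypothetical frame.png as the camera image — a full frame of pixels collapses into a few hundred keypoints with compact descriptors:

```python
# A rough illustration with OpenCV's ORB detector -- one common feature
# extractor, not necessarily the one used in production.
import cv2

# Hypothetical camera frame; any grayscale image will do.
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)                   # cap the number of keypoints
keypoints, descriptors = orb.detectAndCompute(frame, None)

print(f"raw pixels per frame: {frame.size:,}")        # on the order of 10^6 values
print(f"kept: {len(keypoints)} keypoints")            # a few hundred points
if descriptors is not None:
    print(f"descriptor matrix: {descriptors.shape}")  # e.g. (500, 32) bytes each
```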

This is what one of those maps looks like for a self-driving car; for AR they're not as dense, but roughly they look like this. This is what you use to tell where your phone is in the real world, so that in the end your characters render properly. That's how it works.

It’ll be Hard to Run in Real-Time

A lot of these algorithms are super expensive to compute. This is from a paper that tried to run SLAM on embedded systems, which are a reasonable proxy for phones. If you look at the numbers, they're all in the range of seconds, which is not good. This is the academic implementation, and the typical implementation does the same thing: the feature extraction pipeline waits until it's done, then sends its results to the other parts, which wait and lock, wait and lock — a lot of busy-loop spinning.

Is there something better you could do? Our answer is yes. We've taken traditional computer vision algorithms that, implemented naively, run at single-digit frames per second, and with what I'm about to show you we were able to achieve 60 frames per second. We did not change the algorithm at all; this is purely a different programming paradigm, using a framework for concurrent programming called the actor model. At a high level, actors are primitive units of computation that are completely isolated and perform some computation over their internal state. They don't block anyone, and each has its own chunk of memory.

Messages carrying state are sent when an actor finishes computing or needs data from another actor — the keyword here is asynchronous. You're not blocking to wait for other stages, because if you benchmark a lot of these SLAM systems, half the time you're just busy-looping and waiting, and the context switching that goes with that is expensive. An actor system gets rid of that because the actors are completely isolated, independent computation blocks.

How does the SLAM system look now? Instead of everything being locked to everything else, you have message queues: each component runs at its own pace, and when it's done, the others pick up the messages and do their computation. What this means is that your feature extraction could take a bit longer, but it can then consume everything it needs to; it doesn't have to wait until the localization or mapping loop is done.
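Here is a toy version of that idea in Python, using threads and queues rather than a real actor framework — the stage names are placeholders, and a production SLAM pipeline would look very different — just to show each stage consuming messages at its own pace instead of locking on the stage before it:

```python
# Toy actor-style pipeline (Python threads + queues): each SLAM stage runs at
# its own pace and exchanges asynchronous messages instead of locking and
# waiting on the stage before it.
import queue
import threading

features_q = queue.Queue()    # feature extraction -> localization/mapping
poses_q = queue.Queue()       # localization -> renderer

def feature_extractor(frames):
    for frame_id in frames:
        features_q.put({"frame": frame_id, "features": f"feat-{frame_id}"})
    features_q.put(None)                          # end-of-stream marker

def localizer_mapper():
    while (msg := features_q.get()) is not None:
        poses_q.put({"frame": msg["frame"], "pose": f"pose-{msg['frame']}"})
    poses_q.put(None)

threads = [
    threading.Thread(target=feature_extractor, args=(range(5),)),
    threading.Thread(target=localizer_mapper),
]
for t in threads:
    t.start()

while (pose := poses_q.get()) is not None:        # the "renderer" consumes poses
    print("render with", pose)
for t in threads:
    t.join()
```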

Summary

To summarize: we talked about magic realism in AR and what we can do to achieve the suspension of disbelief the way authors in literature do — how you make the digital believable, how we build AR systems for human perception, and why, at least for AR, it's so important to build things for speed and optimize for it. Last thing, credits: thanks to Peter and my team, who came up with a lot of these ideas.

Questions and Answers

Participant 1: Thanks for the talk. One of the questions I had was: have you considered deep learning end-to-end approaches that go from an image to 3D reconstruction?

Hu: Yes, those are some of the things we're experimenting with. The main challenge with deep learning is that it takes a lot of computation cycles, and it's not going to run fully in real time on your phone while you still need to render and run the game.

Participant 2: I'd love to hear more about how you evaluate whether these improvements in latency and speed translate into the user experience. Do you test them? How do you measure that someone's having a good experience with an AR app? It seems easy to benchmark the more quantitative metrics, but how do you put that in terms of the user's experience?

Hu: In the end, the reason we designed it the way we did is based on these assumptions about human perception. Some of the things you saw would simply not be possible to build otherwise. If your latency puts the response time at single-digit frames per second, you just couldn't build any of those experiences at all. That's one part — it's a binary switch, whether it's possible or not.

The other question is what you do with incremental improvements once it's good enough. The thing that gets better for a lot of our users is definitely battery consumption: if you get better and more efficient, your battery lasts longer, which is very important. And assuming we move into the world of AR headsets, which we're big believers in, those are even more battery-hungry because of the optics and all the photons you need to shoot in order to render things.

 


 

Recorded at:

Aug 23, 2019
