"Vision Pro's passthrough isn't depth-correct"...?

This is a long post. TL;DR: what did Apple do to make Vision Pro’s passthrough not feel like crap despite not being depth-correct?

Depth-correct passthrough: what’s the big deal?

It’s well documented that video passthrough mixed reality needs to be “depth-correct” (AKA perspective-correct) or you’ll have issues.

But perhaps the best analysis is this one by /u/kguttag. The conclusion is clear: if the passthrough isn’t reprojected to account for the difference between where your eyeballs are and where the outward-facing cameras are (good quick explanation), you’re going to have an experience that feels weird at best and, at worst, is unusable, disorienting, or even dangerous. You’ll have bigger problems than not being able to catch a ball.
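To put rough numbers on the eye-vs-camera offset, here is a toy 2D sketch in Python. Everything in it is my assumption, not a measurement: the function names are made up, the eye sits at the origin looking straight ahead, and the camera sits `camera_forward_m` in front of it (the 4 cm default is a placeholder, not a Vision Pro or Quest spec). It only shows how the angular and scale error of an uncorrected camera image changes with distance.

```python
import math


def uncorrected_error_deg(lateral_m, depth_m, camera_forward_m=0.04):
    """Angular error (degrees) if the camera image is shown to the eye as-is.

    Toy model: eye at the origin looking along +z, camera moved forward by
    camera_forward_m (hypothetical value). The point sits lateral_m to the
    side and depth_m in front of the eye.
    """
    eye_angle = math.atan2(lateral_m, depth_m)
    cam_angle = math.atan2(lateral_m, depth_m - camera_forward_m)
    return math.degrees(cam_angle - eye_angle)


def apparent_scale(depth_m, camera_forward_m=0.04):
    """How much bigger an object looks to the forward camera than to the eye
    (small-angle approximation: apparent size is proportional to 1/distance)."""
    return depth_m / (depth_m - camera_forward_m)


if __name__ == "__main__":
    # Same 20 cm sideways offset, at hand, desk, and room distances.
    for depth in (0.4, 0.7, 3.0):
        err = uncorrected_error_deg(0.2, depth)
        scale = apparent_scale(depth)
        print(f"depth {depth:.1f} m: error {err:.2f} deg, scale x{scale:.3f}")
```

With these assumed numbers the mismatch is a couple of degrees at arm’s length and fades toward zero across a room, which is roughly why hands and desks are where depth-correctness should matter most.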

Its importance was apparent to Meta, which decided that even significant bubble warping distortion was worth the tradeoff. A method in the madness.

Vision Pro defies conventional wisdom (maybe)

To say the Vision Pro prominently features video passthrough mixed reality would be an understatement. Notably, there haven’t really been widespread reports of the passthrough being disorienting. In fact, users claim the opposite: that it’s less sickness-inducing than other headsets.

But according to UploadVR’s review, Vision Pro’s passthrough is actually not depth-correct:

But how I really know Vision Pro isn’t a dynamically reprojected view is that the scale and perspective are slightly off. Yes, that’s right, Apple Vision Pro’s passthrough is not depth-correct. This was the most surprising aspect of Vision Pro for me, and something I’ve seen almost no other review mention.
Being free of the Quest warping distortion is deeply refreshing, can feel sublime in comparison, and is probably what most people mean when they praise Vision Pro’s passthrough. And if you’re sitting on a couch where the only thing close to you is your hands, you probably won’t even notice that the view you’re seeing isn’t depth-correct. But if you’re sat at a desk, you will definitely notice how the table and monitor in front of you skews as you rotate your head, in a way that virtual objects don’t. And at these close ranges, you’ll also notice that the alignment of virtual objects with real objects is slightly off as you move your head. This isn’t because of any tracking error, it’s again, just that Vision Pro’s view of the real world isn’t depth-correct. Lift up Quest 3 and you’ll see real world objects remain in the position and scale they were at when you had the headset on. Lift up Vision Pro and you’ll see everything is slightly offset. Apple prioritized geometric stability at the cost of incorrect depth and scale, while Meta prioritized depth and scale at the cost of harsh bubble warping.

As the review mentioned, there’s very little coverage of this aspect. Given what we know about the importance of passthrough being depth-correct, that’s hard to reconcile with how few complaints there are. Why has almost no one mentioned it other than UploadVR? How are people playing ping pong, (not really) skiing, and walking/skateboarding around New York for hours without trouble?

It can’t be magic, right?

I think only one of these can be true:

1. The importance of passthrough depth-correctness is blown out of proportion.
2. UploadVR is wrong, and Vision Pro’s passthrough is actually depth-correct.
3. Vision Pro is doing something to give users enough depth awareness to make things feel normal(ish) without doing the full dynamic reprojection like the Quest.

(1) is unlikely. There’s too much evidence and too many credible sources. A lot of people would have to collaborate to perpetuate a conspiracy like this (for unclear gains).

(2) is also unlikely. Any Vision Pro owner can independently verify UploadVR’s claim. Norm from Tested also corroborates it.

That leaves only (3). As Norm said in the same video linked above:

… Whatever they’re doing to correct your hands and things in the near-field - it’s a perfect stereo image.

So the big question is: what is it? What exactly is the Vision Pro doing in the passthrough that results in a reasonably comfortable user experience without Quest’s near-field distortion?

Does anyone have a technical explanation or guesses? The most convincing possibility I’ve seen is John Carmack’s tweet about using a single depth estimate for the entire image instead of re-rendering. Still, if this is the better approach, why don’t other headsets use it?
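For anyone who wants to poke at the “single depth” idea, here is a toy 2D sketch in Python of how I read it. This is my interpretation only, not Carmack’s or Apple’s actual method: the function name, the flat “assumed plane”, and the 4 cm forward camera offset are all hypothetical. The camera image is reprojected as if every pixel sat on one plane at `assumed_depth_m`, and the function returns how far off a point at a different real depth ends up.

```python
import math


def single_depth_error_deg(lateral_m, depth_m, assumed_depth_m,
                           camera_forward_m=0.04):
    """Residual angular error (degrees) after reprojecting with ONE depth.

    Toy model: eye at the origin looking along +z, camera moved forward by
    camera_forward_m (hypothetical value). The whole image is treated as if
    it lay on a plane assumed_depth_m in front of the eye.
    """
    # Direction of the camera ray that actually hits the real point.
    slope = lateral_m / (depth_m - camera_forward_m)
    # Where that ray lands on the assumed plane (the guessed position).
    x_on_plane = slope * (assumed_depth_m - camera_forward_m)
    # Angle the eye is shown versus the angle it should have seen.
    shown_angle = math.atan2(x_on_plane, assumed_depth_m)
    true_angle = math.atan2(lateral_m, depth_m)
    return math.degrees(shown_angle - true_angle)


if __name__ == "__main__":
    # Assume the plane is 1 m away, then look at nearer and farther points.
    for depth in (0.4, 1.0, 5.0):
        err = single_depth_error_deg(0.2, depth, assumed_depth_m=1.0)
        print(f"real depth {depth:.1f} m: residual error {err:+.3f} deg")
```

In this model the error is zero for anything at the assumed depth and grows for things nearer or farther, with no per-pixel warping at all. That would at least be consistent with UploadVR’s description of stable geometry but slightly-off close objects, if (big if) something like this is what’s going on.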

Join the discussion on /r/VisionPro.