Cameras, Quaternions, and Google Earth

After implementing an intuitive one-finger 3D translation gesture which kept landmarks under the user's finger as the camera moved, we wanted a similarly intuitive gesture for 3D camera rotation. We ended up replicating the style of camera rotation used in Google Earth, which keeps a landmark at a fixed point on the screen instead of keeping it under the user's finger. Some implementation details were not obvious, which is why we are sharing them with you today.

Fingers as anchors and other flawed approaches

At first, we were not trying to imitate Google Earth, but to generalize existing 2D gestures to a 3D environment, like we did for the camera translation. In a 2D, two-finger rotation, the user places two fingers on two landmarks. Then, as the fingers are moved, the camera is translated so that the first landmark stays under the first finger, the camera is scaled so that the distance between the two landmarks on the screen is the same as the distance between the two fingers, and finally, the camera is rotated around the first landmark so that the second landmark remains under the second finger. Thus, 2D rotation is already much more complicated than 2D translation, as it incorporates three different kinds of camera movements.
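To make the three 2D steps concrete, here is a minimal sketch in Unity-style C#. The class and method names are hypothetical, and the convention used (scale and rotate about the first landmark, then translate it under the first finger) is only one of several equivalent ways to write the same transform.

```csharp
using UnityEngine;

// Hypothetical helper: the 2D transform implied by a two-finger gesture.
public static class TwoFingerGesture2D
{
    // start0/start1: where the two fingers first touched (the landmarks' initial screen positions).
    // now0/now1: where the two fingers are now.
    // p: any screen point at the start of the gesture; the result is where it ends up.
    // Assumes start0 != start1, otherwise the scale is undefined.
    public static Vector2 Apply(Vector2 start0, Vector2 start1,
                                Vector2 now0, Vector2 now1, Vector2 p)
    {
        Vector2 before = start1 - start0;
        Vector2 after = now1 - now0;

        // Scale: the distance between the landmarks must match the distance between the fingers.
        float scale = after.magnitude / before.magnitude;

        // Rotation around the first landmark: the second landmark follows the second finger.
        float angle = Mathf.Atan2(after.y, after.x) - Mathf.Atan2(before.y, before.x);
        float c = Mathf.Cos(angle);
        float s = Mathf.Sin(angle);

        Vector2 d = p - start0;
        Vector2 rotated = new Vector2(c * d.x - s * d.y, s * d.x + c * d.y);

        // Translation: the first landmark stays under the first finger.
        return now0 + scale * rotated;
    }
}
```

By construction, feeding start0 back in yields now0, and feeding start1 yields now1 up to rounding, which is exactly the "landmarks stay under the fingers" constraint.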

Still, in 2D, there is a unique camera configuration which satisfies the simple constraint of keeping the landmarks under the fingers. In 3D, however, we again suffer from under-specification. The two finger positions correspond to two rays coming out of the camera at particular angles, so we can picture the problem of placing the camera so that the landmarks lie under the fingers as placing a V-shaped object in 3D space so that each branch of the V passes through one of the two 3D landmarks. One obvious solution is to make an isosceles triangle with the line connecting the two landmarks as a base and the V's branches as the two equal-length sides. But of course, since we are in 3D and only the base is constrained, the V can freely rotate around this base, so the solution is not unique. With non-isosceles triangles, we get even more solutions.

One naive way to add constraints to the problem would be to ask the user to place three fingers on the screen, constraining three landmarks instead of two. Unfortunately, this solution goes too far, over-constraining the problem. Imagine the camera is looking directly at a cube, whose front face appears on the screen as a square. Place two fingers on the bottom two corners of this square, and the third finger on one of the top two corners. Different camera rotations around this cube will cause the square to be squished into different shapes, either a trapezoid or some suitably distorted version thereof. No orientation, however, could ever turn this square into a non-square rectangle, unless an orthographic projection were used. Yet by moving the third finger vertically, we can easily force the corners to assume a rectangular shape; therefore, in addition to translation, scaling and rotation, three-finger rotation would have to constrain the camera's field-of-view parameter. This is not what we want.

Camera orbit gesture (horizontal only)

Eventually we gave up on the idea of anchoring landmarks to fingers, looked up what Google Earth was doing, and decided to follow suit. When you begin a rotation gesture in Google Earth, the landmark on which you click becomes the center of rotation, around which you can orbit the camera by dragging the mouse horizontally or vertically. These two directions correspond to two separate rotations, which we will handle separately.

For the horizontal direction, we circle around our target, as would a pack of wolves circling their prey. This kind of rotation is very easy, because the rotation is around one of the world's 3 axes: the one pointing towards the sky, which might be Y or Z depending on your coordinate system.

If your system doesn't already provide a primitive implementing a world-aligned rotation around a point, it's very easy to write your own: simply translate the world so that the landmark is at the origin, rotate around Y (or Z), then translate the world back. Unity has a primitive for rotating around a point, where the rotation is expressed as a quaternion. We will deal with the peculiarities of quaternions shortly, but for now, we can simply use the quaternion constructor which is based on Euler angles.

Quaternion hrot = Quaternion.Euler(0,mouseDeltaX,0);
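If you want to write the translate, rotate, translate-back recipe yourself, it could look like the sketch below. The names OrbitMath and RotateAroundPoint are hypothetical, and the code assumes that the position and orientation you pass in are both expressed in world space.

```csharp
using UnityEngine;

// Hypothetical helper: rotate a camera pose around an arbitrary point.
public static class OrbitMath
{
    // pivot: the landmark to orbit around.
    // rot:   the world-space rotation to apply, e.g. hrot.
    public static void RotateAroundPoint(Vector3 pivot, Quaternion rot,
                                         ref Vector3 position,
                                         ref Quaternion orientation)
    {
        Vector3 offset = position - pivot;  // translate so that the landmark is at the origin
        offset = rot * offset;              // rotate around the origin
        position = pivot + offset;          // translate back

        // The camera turns along with its orbit, so the landmark
        // stays at the same point on the screen.
        orientation = rot * orientation;
    }
}
```

You would call it with copies of the camera's position and rotation, then write the results back to the transform.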

Camera orbit gesture (vertical only)

When the mouse moves vertically, the rotation is a bit more complicated, because we are not rotating around one of the world's axes, but around one of the camera's axes. Since we are rotating around the landmark, the axis of rotation will of course pass through it, but this time, instead of throwing a ray towards the sky and rotating around it, we throw a ray along the X axis of the camera's frame of reference.

How do we express this as a quaternion around a point? Well, unlike Euler angles, quaternions can be composed and inverted, so we can adapt our translation/inverse-translation trick. First, rotate the world so that the camera's X axis points towards the world's X axis. Then, apply the rotation around the X axis. Finally, rotate the world back, using the inverse of the first rotation.

Which rotation will change the world so that the camera's X axis points towards the world's X axis? Well, your camera itself has an orientation, which is probably obtained by placing a neutral world-aligned camera and rotating it into place. If we apply the inverse of this rotation onto the world, the camera will snap back into its original world-aligned orientation, as required. In Unity, the rotation which brings a world-aligned camera to your camera orientation is called camera.transform.localRotation. (Strictly speaking, localRotation is relative to the camera's parent, so this assumes the camera has no parent; otherwise, camera.transform.rotation is the world-space orientation.)

So, to recap: first the inverse of the camera's localRotation, then a rotation around X, then the inverse of the inverse of localRotation, which is localRotation itself. Oh, and if things were not complicated enough as they are, quaternions compose from right to left, using a multiplication symbol instead of addition. So the final code is:

Quaternion vrot = localRotation
                * Quaternion.Euler(mouseDeltaY,0,0)
                * Quaternion.Inverse(localRotation);
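One way to convince yourself that this sandwich does what we want: it should be the same rotation as turning by the same angle around the camera's X axis, expressed in world coordinates. The sketch below compares the two; the class and method names are hypothetical, while Quaternion.AngleAxis and Quaternion.Angle are regular Unity calls.

```csharp
using UnityEngine;

// Hypothetical sanity check for the vertical rotation.
public static class VerticalOrbitCheck
{
    // The construction from the text: undo the camera orientation,
    // rotate around the world's X axis, then redo the camera orientation.
    public static Quaternion Sandwich(Quaternion localRotation, float degrees)
    {
        return localRotation
             * Quaternion.Euler(degrees, 0, 0)
             * Quaternion.Inverse(localRotation);
    }

    // The same rotation written directly: turn around the camera's X axis,
    // obtained by rotating the world's X axis by the camera orientation.
    public static Quaternion AroundCameraX(Quaternion localRotation, float degrees)
    {
        Vector3 cameraX = localRotation * Vector3.right;
        return Quaternion.AngleAxis(degrees, cameraX);
    }

    // Angle in degrees between the two results; zero up to rounding error.
    public static float Difference(Quaternion localRotation, float degrees)
    {
        return Quaternion.Angle(Sandwich(localRotation, degrees),
                                AroundCameraX(localRotation, degrees));
    }
}
```

If you find the axis-angle form easier to read, it is a reasonable substitute for the sandwich in your own code.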

One last detail I should mention is that in order to avoid numerical difficulties, all computations should be done relative to an initial state, not accumulated. In practice, this means that you should save the position and localRotation of your camera at the beginning of the gesture, and that the new camera state should be obtained by rotating that position around the landmark, not the one you just calculated on the previous frame.
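In code, "relative to an initial state" could look like the sketch below, shown for the vertical direction only. The component name, the landmark field, and the degreesPerPixel sensitivity are all assumptions of mine, and the code relies on Unity's legacy Input class and on the camera having no parent.

```csharp
using UnityEngine;

// Hypothetical component; attach it to a camera that has no parent, so that
// transform.position and transform.localRotation live in the same space.
public class VerticalOrbitGesture : MonoBehaviour
{
    public Vector3 landmark;               // assumed to be picked elsewhere, e.g. by a raycast
    public float degreesPerPixel = 0.25f;  // assumed sensitivity; sign and scale are a matter of taste

    Vector3 startPosition;
    Quaternion startRotation;
    float startMouseY;
    bool dragging;

    void Update()
    {
        if (Input.GetMouseButtonDown(0))
        {
            // Snapshot the camera once, when the gesture begins.
            startPosition = transform.position;
            startRotation = transform.localRotation;
            startMouseY = Input.mousePosition.y;
            dragging = true;
        }
        if (Input.GetMouseButtonUp(0)) dragging = false;
        if (!dragging) return;

        // Total travel since the gesture began, not the per-frame delta.
        float mouseDeltaY = (Input.mousePosition.y - startMouseY) * degreesPerPixel;
        Orbit(startPosition, startRotation, landmark, mouseDeltaY,
              out Vector3 position, out Quaternion rotation);
        transform.position = position;
        transform.localRotation = rotation;
    }

    // A pure function of the saved start state; nothing carries over between frames.
    public static void Orbit(Vector3 startPosition, Quaternion startRotation,
                             Vector3 landmark, float degrees,
                             out Vector3 position, out Quaternion rotation)
    {
        Quaternion vrot = startRotation
                        * Quaternion.Euler(degrees, 0, 0)
                        * Quaternion.Inverse(startRotation);
        position = landmark + vrot * (startPosition - landmark);
        rotation = vrot * startRotation;
    }
}
```

Because every frame starts again from the saved snapshot, rounding errors cannot pile up, and returning the mouse to where the gesture began brings the camera back to where it started.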

Camera orbit gesture (both directions)

Like matrix multiplication, quaternion composition is not commutative. This means that it makes a difference whether we begin by applying the horizontal rotation or the vertical rotation. So which order should it be? If we consider the user experience, it shouldn't matter whether we first move the mouse horizontally, then vertically, or the other way around; the result should be the same. But consider this: if we rotate horizontally first, around which axis will our vertical rotation take place? Around the X axis in the camera's reference frame, of course. But we have just rotated the camera horizontally, so this X axis is no longer the one given by the localRotation we recorded at the beginning of the gesture! For this reason, it is important to perform the vertical rotation first, that is, to put it on the right-hand side of the quaternion composition.

Quaternion rot = hrot * vrot;
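One observable consequence of the ordering is whether the horizon stays level. The sketch below builds the final camera orientation both ways and measures how far the camera's X axis leaves the horizontal plane. The names are hypothetical, and it assumes a Y-up world and a starting orientation with no roll, such as Quaternion.Euler(20, 30, 0).

```csharp
using UnityEngine;

// Hypothetical comparison of the two composition orders.
public static class OrbitOrder
{
    static Quaternion Hrot(float yawDegrees)
    {
        return Quaternion.Euler(0, yawDegrees, 0);
    }

    static Quaternion Vrot(Quaternion startRotation, float pitchDegrees)
    {
        return startRotation
             * Quaternion.Euler(pitchDegrees, 0, 0)
             * Quaternion.Inverse(startRotation);
    }

    // Order used in the text: vertical on the right, so it is applied first.
    public static Quaternion VerticalFirst(Quaternion startRotation,
                                           float yawDegrees, float pitchDegrees)
    {
        return Hrot(yawDegrees) * Vrot(startRotation, pitchDegrees) * startRotation;
    }

    // The other order, for comparison only.
    public static Quaternion HorizontalFirst(Quaternion startRotation,
                                             float yawDegrees, float pitchDegrees)
    {
        return Vrot(startRotation, pitchDegrees) * Hrot(yawDegrees) * startRotation;
    }

    // Height of the camera's X axis; zero means the horizon is level.
    public static float Tilt(Quaternion orientation)
    {
        return (orientation * Vector3.right).y;
    }
}
```

With the vertical rotation applied first, the tilt stays at zero up to rounding, whereas the other order generally rolls the camera as soon as both mouse deltas are non-zero.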

Concluding remarks

I hope this explanation was useful. There is still one more detail missing, though: we probably want to bound the vertical rotation so that the camera stops when it is looking directly down; any further, and the camera will be upside down. Dealing with this is even trickier than dealing with quaternions, because the notion of "looking straight down" relies on an Euler-angle representation of a quaternion, which is not unique. If there is enough interest, I will probably cover this topic in a future post. Otherwise, well, have fun figuring out the answer on your own! There are a lot of cases to cover, but none is particularly difficult.