Plane

. In this theoretical paper we consider the problem of accurately triangulating a scene plane. Rather than ﬁrst triangulating a set of points and then ﬁtting a plane to these points, we try to minimize the back-projection errors as functions of the plane parameters directly. As this is both geometrically and statistically meaningful our method performs better than the standard two step procedure. Furthermore, we show that the error residuals of this formulation are quasiconvex thereby making it very easy to solve using for example standard local optimization methods.


Introduction
The use of planes and their homographies has become increasingly important since the introduction of graph cuts for dense stereo reconstruction [3]. Since then a number of methods building on this work has been proposed (e.g. [22,13,5]), all working in a similar fashion. Typically a family of planes are used to represent the scene, and using α-expansion each pixel is classified as belonging to one of the planes in the family. The cost of assigning a pixel to a plane is computed using back-projection between the two cameras via plane-induced homography (see [8]). If a pixel, back-projected from one camera to the other, looks similar to the corresponding pixel in the second camera then the cost of assigning that pixel to the current plane is low. Furthermore, to obtain smooth classifications a standard regularization term is added [3].
In this paper we investigate the problem of determining the family of planes accurately. The typical way of determining a scene plane from stereo correspondences is by first triangulating the image points (using for example [7]) and then fitting a plane to the 3D-points (see [8]). There are two downsides to this approach. First, when we are triangulating the points we are not using the knowledge that all the points are lying on the same plane. Second, and perhaps more importantly, when fitting the plane we are measuring distances in 3D. Hence, we are optimizing a quantity that we cannot observe in our data and therefore may be inaccurate. The latter is a problem in particular if the baseline is small. On the other hand fitting problems with a small baseline is very important since descriptors such as SIFT usually perform much better in this case.
Instead we propose to estimate the plane by minimizing the back-projection errors of the plane-induced-homography. We will show that this method gives a good estimation of the plane. Direct estimation (without computing the 3D points) of a scene plane has been considered before. For example in [11,10] two iterative methods are presented. There are however no guarantees of convergence to the global optimum. In [1] both structure and motion is estimated. Here 3D points are constrained to fulfill co-planarity constraints exactly, and a bundle adjustment process is employed. In this paper we assume that the cameras are known, but a similar approach (optimizing reprojection error with the points constrained to lie on the unknown plane) can of course be used. Such a method would however have to rely on triangulation of the 3D points for initialization and would only be locally convergent.
In contrast, we show in this paper that when minimizing back-projection errors of the plane-induced-homography, the problem can be globally optimized using convex optimization. In particular we show that the error residuals are affine functions composed with a projection. It was shown in [9] that this type of functions are examples of quasiconvex functions. Since quasiconvexity is preserved under the max-operation these problems exhibit no local minima when we minimize the max-norm of the errors. More recently minimizing the least squares error, using standard Levenberg-Marquardt type procedures, was addressed in [6,19]. It was shown that for the vast majority of problems from this class it is possible to use use local methods and verify that the solution is in fact globally optimal. Furthermore, in [21,14,16,17] systematic ways for handling outliers in the data was given.

Quasiconvex Optimization
In this section we very briefly recall the definition and basic properties of quasiconvex functions. A much more detailed treatment can be found in [18].
When dealing with optimization problems convexity is a very useful property. For example, when minimizing a convex function we are guaranteed that local methods converge to globally optimal solutions. Unfortunately, convexity does hardly ever occur in multiple view geometry problems because projections are in general not convex. More commonly occurring is the slightly weaker notion of quasiconvexity.

Definition 1. A function f is called quasiconvex on a convex set C if its sublevel sets
are convex for all μ ∈ R.  Figure 1 shows a function that is quasiconvex. It is not convex since it is possible to draw a line between two points on graph, such that the function is above the line. Quasiconvexity is not preserved under addition. It is, however, preserved under the max operation. That is minimizing the maximal error (see equation (4)) is a quasiconvex problem if all the error residuals are quasiconvex. The simplest, and perhaps the most common way of solving such problems is to employ a bisection method. Checking whether there is an x such that f (x) ≤ μ for a fixed μ is a convex problem. Hence, by solving a sequence of convex problems one can find an optimal μ (see [9,12,18]). In [6,19] it was shown that in the vast majority of cases it is also possible to solve the least squares formulation (see equation (7)) using local methods.

3D-Plane Triangulation
Next we consider the problem of determining a scene plane from two images. We assume that the image data is given as two sets of corresponding points in the two images that are known to be projections of points located on the 3D-plane. Let (Here ∼ denotes equality up to an unknown scale factor.) And similar for the inverse homography H 12 = H −1 21 from image 2 to image 1 Now, suppose that the system is not noise free. We would like to find the plane (or homography) that gives the smallest back-projection errors in both images. Therefore we formulate the following minimization problem where and Π : is the projection mapping given by The constraint x 3 > 0 reflects the fact that visible points in the image should be located in front of the cameras. The residual errors are similar to what is used in homography estimation with the L ∞ norm [9,12], note however that the homography H 21 depends on the plane p. In homography estimation it is not possible to use back-projection errors in both images since the inverse of H 21 cannot be parameterized linearly in terms of the elements in H 21 , and therefore does not yield a quasiconvex problem. However in this setting this is not a problem, since, as we will show, both H 21 and its inverse can be parameterized linearly by the plane parameters p.
In (4) and (5) we have used the maximum (squared) residual error as is traditionally done in the L ∞ -framework. However, recent results (see [6,19]) show that solving the least squares formulation using local methods also works well for the same framework. In both cases we need to parameterize H 21 and its inverse linearly. The least squares formulation of our problem is min where In this paper we will use both formulations. The latter is solved using local methods with initialization from the former.

Homography from a Plane
In this section we derive the expression for the homographies H 12 The special case b = 1 is proven in [8] and our proof is just a simple extension. It is however essential, since without it, it is not possible to use both H 12 and its inverse in the error residual, which would prevent us from using symmetric error residuals. Let x and y be the projections of the point z belonging to p. Then Now, since Z is the homogeneous coordinates for z we may we multiply Z with the scale factor b, assuming b is positive, without changing anything. The fact that this assumption is no restriction will be motivated later on. Therefore we obtain Projecting z into image 2 now gives which proves the statement. Now for the general case we use a transformation to prove the following.
then the homography H 21 can be written Let Changing coordinates using T we get in the new coordinate system Substituting (16), (17) and (18) is convex for a fixed μ (see Definition 1, Section 1.1). However since H 3 21 (p)Y i > 0 it is easy to see that (19) is equivalent to where and x i j denotes the j'th coordinate of x i . Since (21)-(23) are linear functions (20) will be a second order cone constraint (see [2]) which is convex. Hence the error residuals are quasiconvex functions (H 12 is handled in the same way), and the theory from [9,12,21,14,6,19,16,20] extends to this problem as well.

Chirality
In the proof of lemma 1 we did not motivate b being positive. It is easy to see that if b < 0 then there will be a sign change when multiplying X with b, and hence X might not have positive depth in the second camera. Similarly when using 1 we want the quantity b − R T 1 t 1 to be positive. This can always be ensured unless the cameras are located on opposite sides of the plane. Let where c 1 and c 2 are the camera centers. Now it is easy to see that if one of these are negative then c 1 and c 2 cannot be on the same side of the plane.
Since the cameras need to see the same points the sign of b is normally not a problem in practice. However, note that in case we for some reason would like to solve a problem where this is not fulfilled one simply multiplies X with −b instead and the same result holds. If we do not know anything about the relative locations of the cameras and the plane we have to test both possibilities. Similar to other multiple view geometry problems (see [9]), it is easily shown that for each choice there is a local minimum.

Experiments
In this section we perform some simple experiments to verify the theory and evaluate the quality of the proposed methods. In equation (14) the homography H 21 is written as a linear expression in a and b. However as 3 parameters is enough for specifying a plane we will choose b to be 1 in all our implementations. Note that we still need to use (14). It is not possible to parameterize both H 21 and H 12 linearly using only (9) (with b=1).

Stability of the Proposed Formulation
As we have mentioned before the standard way of fitting a scene plane to image data, is to first compute 3D-points using triangulation and then fit a plane to these points. In our first experiment we compare this approach to the two proposed formulations on synthetically generated data. We use synthetic data since we would like to know the true parameters of the plane that generated the measurements. The setup is as follows: First we placed 30 points randomly on the plane z = 0, within the box −1 ≤ x ≤ 1, −1 ≤ y ≤ 1. Then we placed two cameras at a distance roughly 4 from the origin with camera centers fulfilling z ≤ −2. We then added noise with standard deviation 0.0025 to the image coordinates. Figure 2 shows a typical example of the generated images. We also selected a maximal value for the baseline, that is, the camera centers was placed closer to each other than this maximal value. Figure 3 shows the results for a number of different values of the maximal baseline. To the left is the back-projection error when   measured with the max-norm (5). For each data point in the graph we generated 500 instances of the problem and computed the median of the back-projection error. We chose the median and not the mean since in rare instances the 3D point triangulation method produces a plane that is almost perpendicular to the true plane, resulting in extremely large back-projection errors. Even though we are averaging over a large number of experiments this gives a noisy graph which is difficult to interpret, therefore we use the median instead (for comparison we plotted one of the graphs obtained when using the mean in Figure 2). For the two proposed formulations the median and the mean are very similar.
To the left is the result when the back-projection error is measured using the maxnorm (4). As expected the proposed L ∞ formulation performs best here (regardless of the baseline).
In the middle is the result when the error is measured with the L 2 -norm (8). Here the proposed L 2 formulation is the winner. Somewhat surprisingly the proposed L ∞ formulation performs better than the 3D-point triangulation approach when the baseline is small. The reason is that when the baseline is small distances in the direction of the depth is difficult to observe accurately in the cameras. Hence, the position of the 3Dpoints in this direction is uncertain. Therefore, using distances in 3D instead of backprojection errors results in a more unstable procedure.
To the right is perhaps the most interesting graph. Here we have plotted the median of the distance from the computed solutions to the true (noiseless solution). If (a true , 1) is the parameters for the true solution the distance is measured as ||a true − a est || 2 where (a est , 1) is the estimated solution with one of the methods. When the baseline is sufficiently large the 3D point triangulation approach is almost as good as the proposed L 2 formulation however when the baseline becomes smaller it is less stable than the other two methods.
To test the stability with respect to noise we also plotted the same figures when varying the noise level instead of the maximal baseline. The results cam be seen in Figure 4. Here the maximal baseline was set to 0.5 and the noise level was varied between 0 and 0.005. The proposed formulations appear to exhibit roughly linear growth in the errors whereas the triangulation approach seem to grow faster.

Outlier Removal and Estimation
Next we evaluate our method on real data. In real settings the data is often corrupted by outliers. A popular method for removing outliers is RANSAC [4], however, here we will use the approach pioneered by Sim and Hartley [21], and later refined in [17]. This is an iterative method that is guaranteed to remove one outlier in each iteration. The algorithm works by solving the problem min s (26) where a i , b i , c i , a i0 , b i0 and c i0 are constructed from the i'th measurement (counting backward and forward homographies separately). It is shown in [17] that by removing the residuals for which the dual variables y are nonzero we are guaranteed to remove one outlier. This procedure is then iterated until the solution is good enough (s ≤ 0). We ran the algorithm on the stereo pair seen in Figure 5. Since the posters located on the wall has similar texture we obtained a lot of mismatches where points on one of the poster matches to points on the other. Using SIFT [15] we determined 678 point correspondences. In order to find a solution with all errors less than 10 pixels 112 iterations was needed. In total 241 image points where discarded. Manual inspection reveals that almost all the discarded cases is either a mismatch or a point not belonging to the dominant plane.

Conclusions
In this theoretical paper we have proposed a procedure for triangulating a scene plane. Our method is based on the framework of quasiconvex optimization making it easy to solve with guaranteed optimality. Furthermore, since we have shown that our problem belongs to this class, all the previously developed theory naturally applies to this problem as well. Since the formulation is based on back-projection which is a geometrically meaningful quantity it is also stable with respect to noise and geometry.