GPstruct: Bayesian Structured Prediction Using Gaussian Processes

Sebastien Bratieres, Novi Quadrianto, and Zoubin Ghahramani, Member, IEEE

IEEE Transactions on Pattern Analysis and Machine Intelligence



Abstract—We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.

Index Terms—Statistical learning, Markov random fields, Gaussian processes, natural language processing, structured prediction

1 INTRODUCTION

Much interesting data does not reduce to points, scalars or single categories. Images, DNA sequences and text, for instance, are not just structured objects comprising simpler independent atoms (pixels, DNA bases and words). The interdependencies among the atoms are rich and define many of the attributes relevant to practical use.

Suppose that we want to label each pixel in an image as to whether it belongs to background or foreground (image segmentation), or we want to decide whether DNA bases are coding or not. The output interdependencies suggest that we will perform better in these tasks by considering the structured nature of the output, rather than solving a collection of independent classification problems.

Existing statistical machine learning models for structured prediction, such as maximum margin Markov networks (M3N) [1], structured support vector machines (SVMstruct) [2] and conditional random fields (CRF) [3], have established themselves as the state-of-the-art solutions for structured problems (cf. Fig. 1 and Table 1 for a schematic representation of model relationships).

In this paper, we focus our attention on CRF-like models due to their probabilistic nature, which allows us to incorporate prior knowledge in a seamless manner. Further, probabilistic models make it possible to compute posterior label probabilities that encode our uncertainty in the predictions.

On the other hand, SVMstruct and M3N offer the ability to use kernel functions for learning using implicit and possibly infinite-dimensional features, thus overcoming the drawbacks of finite-dimensional parametric models such as the CRF. In addition, kernel combination allows the integration of multiple sources of information in a principled manner. These reasons motivate introducing Mercer kernels in CRFs [4], an advantage that we wish to maintain.
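As a concrete illustration of kernel combination, the following sketch (kernel choices and weights are illustrative, not taken from the paper) builds a Gram matrix as a positive-weighted sum of an RBF kernel and a linear kernel; a positive-weighted sum of Mercer kernels is again a Mercer kernel, so the combined Gram matrix stays positive semi-definite:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel: k(x, y) = exp(-||x - y||^2 / (2 l^2))
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * lengthscale ** 2))

def linear_kernel(X, Y):
    return X @ Y.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 inputs with 3 features each

# Combine two information sources with positive weights; the result
# is itself a valid Mercer kernel.
K = 0.7 * rbf_kernel(X, X) + 0.3 * linear_kernel(X, X)

eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-9  # positive semi-definite up to rounding
```

Each summand could carry its own feature representation of the input, which is one principled way to integrate heterogeneous sources of information.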

From a training and inference point of view, most CRF models estimate their parameters point-wise using some form of optimisation. In contrast, [5] provides a Bayesian treatment of the CRF which approximates the posterior distribution of the model parameters, in order to subsequently average over this distribution at prediction time. This method avoids important issues such as overfitting, or the necessity of cross-validation for model selection.
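The difference between plug-in and posterior-averaged prediction can be made concrete with a minimal sketch. The posterior samples below are synthetic stand-ins for what an MCMC sampler over model parameters would produce; the binary-label setup is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior draws of a latent score f for one binary micro-label,
# as an MCMC sampler might return; here drawn synthetically.
f_samples = rng.normal(loc=0.3, scale=1.0, size=10_000)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# Plug-in prediction: squash a single point estimate (the posterior mean).
p_point = sigmoid(f_samples.mean())

# Bayesian prediction: average the predictive probability over the posterior.
p_bayes = sigmoid(f_samples).mean()

# Averaging propagates parameter uncertainty into the prediction,
# pulling the probability toward 0.5 relative to the plug-in estimate.
```

The gap between `p_point` and `p_bayes` is exactly the uncertainty that point-wise optimisation discards.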

Reflecting on this rich history of CRF models, we ask a very natural question: can we have a CRF model which is able to use implicit feature spaces and at the same time estimates a posterior distribution over model parameters? The main drive for pursuing this direction is to combine the best of both worlds from kernelized CRFs and Bayesian CRFs. To achieve this, we investigate the use of Gaussian processes (GP) [6] for modelling structured data where the structure is imposed by a Markov random field (MRF) as in the CRF.

Our contributions are the following:

- a conceptually novel model which combines GPs and CRFs, and its coherent and general formalisation;
- while the structure in the model is imposed by a Markov random field, which is very general, as a proof of concept we investigate a linear chain instantiation;
- a Bayesian learning algorithm which is able to address model selection without the need of cross-validation, a drawback of many popular structured prediction models.

The present paper is structured as follows. Section 2 describes the model itself, its parameterization and its application to sequential data, following up with our proposed learning algorithm in Section 3. In Section 4, we situate our model with respect to existing structured prediction models. An experimental evaluation against other models suited to the same task is discussed in Section 5.

2 THE MODEL

The learning problem addressed in the present paper is structured prediction. Assume data consists of observation-label (or input-output) tuples, which we will note D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}, where N is the size of the dataset, and (x^(n), y^(n)) ∈ X × Y is a data point. In the structured context, y is an object such as a sequence, a grid, or a tree, which exhibits structure in the sense that it consists of interdependent categorical atoms. Sometimes the output y is referred to as the macro-label, while its constituents are termed micro-labels. The observation (input) x may have some structure of its own. Often, the structure of y then reflects the structure of x, so that parts of the label map to parts of the observation, but this is not required. The goal of the learning problem is to predict labels for new observations.

The model we introduce here, which we call GPstruct, in analogy to the structured support vector machine [2], can be succinctly described as follows. Latent variables (LV) mediate the influence of the input on the output. The distribution of the output labels given the input and the LV is defined per clique: in undirected graphical models, a clique is a set of nodes such that every two nodes are connected. Let this distribution be p(y | x, f) = exp(
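The general shape of such a clique-factored distribution can be sketched by brute force on a tiny linear chain: p(y | x, f) is proportional to the exponentiated sum of latent clique potentials, normalized over all candidate macro-labels. The sizes and random potentials below are toy values for illustration only, not the paper's parameterization:

```python
import itertools
import numpy as np

T, L = 3, 2  # chain length and number of micro-labels (toy sizes)
rng = np.random.default_rng(2)
f_unary = rng.normal(size=(T, L))  # latent values for node cliques f(t, x_t, y_t)
f_pair = rng.normal(size=(L, L))   # latent values for edge cliques f(y_t, y_{t+1})

def score(y):
    # Sum of latent clique potentials for one macro-label y.
    s = sum(f_unary[t, y[t]] for t in range(T))
    s += sum(f_pair[y[t], y[t + 1]] for t in range(T - 1))
    return s

# Enumerate all L^T macro-labels and normalize (the normalizer plays
# the role of the partition function); feasible only for tiny chains.
ys = list(itertools.product(range(L), repeat=T))
logits = np.array([score(y) for y in ys])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert np.isclose(probs.sum(), 1.0)
```

In practice the chain structure lets this normalization be computed efficiently by dynamic programming rather than enumeration.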