B Further issues in emulator structure

The general structure of a univariate emulator is as follows: \[f(x) = g(x)^T \beta + u(x),\] where \(g(x)^T \beta\) is the mean function and \(u(x)\), with mean zero, is a Gaussian process or a weakly second-order stationary process. For a general introduction to univariate emulators, we refer the reader to the article Bayesian History Matching of Complex Infectious Disease Models Using Emulation: A Tutorial and a Case Study on HIV in Uganda.

In this appendix we briefly discuss some of the choices that can be made for the regression functions and for the correlation function.

B.1 Regression structure

As already seen, in this tutorial we chose to emulate the mean of the process, expressing it as a linear combination of the regression functions. The simplest possible choice for the regression functions is that of a constant mean: \(g(x)=1\) and \(\beta\) just a scalar. A less trivial structure is given by a polynomial regression. When the input space is one dimensional, this corresponds to \(g(x)^T=(1,x,x^2,...,x^p)\) for \(p\) a natural number. For example, if \(p=1\) we fit a hyperplane, and if \(p=2\) we fit a quadratic surface.

By default, the emulator_from_data function in hmer fits a quadratic surface, while setting quadratic=FALSE fits a hyperplane instead. Here we outline the method implemented by emulator_from_data in its default setting (quadratic surface). We will deal with general basis functions in more advanced case studies.
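As an illustration, the two settings might be selected as follows; the data frame training_data, the named list ranges of parameter ranges and the output name "nS" are placeholders for whatever training data and outputs are being emulated, so this is a sketch rather than a complete call.

```r
library(hmer)

# Sketch only: 'training_data' (a data frame with one column per parameter plus the
# model output), 'ranges' (a named list of parameter ranges) and the output name
# "nS" are placeholders, not objects defined in this appendix.
ems_quadratic <- emulator_from_data(training_data, c("nS"), ranges)                     # default: quadratic surface
ems_linear    <- emulator_from_data(training_data, c("nS"), ranges, quadratic = FALSE)  # hyperplane
```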

B.1.1 Active variables identification and benefits

The first step is the identification of the active variables, i.e. those variables that have the most explanatory power relative to our data. By restricting our attention to the active variables, we reduce the dimension of the input space and therefore the complexity of our model. Given the data, we start by fitting a model consisting of linear terms and pure quadratic terms, without considering possible interaction terms between parameters. We then perform stepwise deletion on the obtained model, using the Bayes Information Criterion (BIC). This process is repeated until no further deletion can improve the BIC. At this point, any parameters that have either their linear or quadratic term left in the model are designated active variables for the surface.
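The following sketch illustrates this step using base R's step() function, which performs backward stepwise deletion with the BIC penalty when k is set to \(\log(n)\); the data frame training_data, the parameters x1, x2, x3 and the output y are placeholders, and the internal hmer implementation may differ in its details.

```r
# Step 1: fit linear and pure quadratic terms only, with no interaction terms.
start_model <- lm(y ~ x1 + x2 + x3 + I(x1^2) + I(x2^2) + I(x3^2),
                  data = training_data)
# Step 2: backward stepwise deletion; k = log(n) makes step() penalise by BIC.
pruned_model <- step(start_model, direction = "backward",
                     k = log(nrow(training_data)), trace = 0)
# Any parameter whose linear or quadratic term survives in 'pruned_model' is an
# active variable for this output.
```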

B.1.2 Building the model

We now build a new model including all linear, quadratic, and interaction terms in the active variables only. As before, stepwise deletion is used to prune the model. Note that when there are fewer data points than there are terms in the model, stepwise addition is used instead, starting from a purely linear model.

While the first step allowed us to identify the active variables, this final model determines the basis functions and coefficients of the regression surface. The residuals of the final model are computed and used to estimate the correlation length and variance \(\sigma^2\) for the correlation structure.
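Continuing the sketch above, and supposing for illustration that x1 and x3 were found to be active, the second model could be built, pruned and used to extract residuals as follows.

```r
# All linear, quadratic and interaction terms in the active variables x1 and x3.
active_model <- lm(y ~ x1 + x3 + I(x1^2) + I(x3^2) + x1:x3, data = training_data)
final_model  <- step(active_model, direction = "backward",
                     k = log(nrow(training_data)), trace = 0)
# Residuals of the final regression surface, used to estimate sigma^2 and the
# correlation structure of u(x).
resids     <- residuals(final_model)
sigma2_hat <- var(resids)   # a simple plug-in estimate of the variance sigma^2
```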

B.2 Correlation structure

Once the regression functions are chosen, we need to specify the correlation between \(u(x)\) and \(u(x^\prime)\) for all possible values of \(x\) and \(x^\prime\). This is a key step, since it will determine how the response \(f(x)\) at one parameter set \(x\) is affected by the response at any other parameter set \(x^\prime\).

A common assumption for the correlation structure is that the variance of \(u(x)\) is independent of \(x\). If its constant value is denoted by \(\sigma^2\), we can then write the covariance of \(u(x)\) and \(u(x^\prime)\) in the form \[\sigma^2 c(x,x^{\prime}),\] where \(c\), which satisfies \(c(x,x)=1\), is the correlation function, or kernel. In our tutorial we also introduced a nugget term \(\delta I_{\{x=x^\prime\}}\) that operates on all variables, replacing \(c(x,x^\prime)\) with \[(1-\delta) c(x,x^{\prime}) + \delta I_{\{x=x^\prime\}}.\] The nugget represents the proportion of the overall variance due to the ensemble variability and ensures that the covariance matrix of \(u(x)\) is not ill-conditioned, making the computation of its inverse numerically stable.
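The toy example below, with assumed point locations and hyperparameter values, shows why the nugget matters numerically: two nearly identical training points make the raw correlation matrix close to singular, and adding the nugget brings its condition number back to a manageable size.

```r
# Three 2-d training points, two of which are nearly identical (illustrative values).
X     <- rbind(c(0.10, 0.50),
               c(0.11, 0.50),   # almost a duplicate of the first point
               c(0.90, 0.20))
theta <- c(0.4, 0.4)
delta <- 0.05

# Squared-exponential correlation matrix and its nugget-adjusted version.
D2       <- as.matrix(dist(sweep(X, 2, theta, "/")))^2   # scaled squared distances
C        <- exp(-D2)
C_nugget <- (1 - delta) * C + delta * diag(nrow(X))

kappa(C)          # large condition number: C is nearly singular
kappa(C_nugget)   # much smaller: inverting the covariance matrix is now stable
```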

The choice of the function \(c\) reflects fundamental characteristics of the process \(u(x)\), such as stationarity, isotropy and smoothness. For example, \(u(x)\) is stationary if \(c(x,x^{\prime})\) depends only on \(x-x^{\prime}\). In this overview we will always assume stationarity, but non-stationary approaches are also possible.

From now on we will add a subscript \(A\) whenever referring to input points, to indicate that we are operating only on the active parameters for the emulator, that is, the parameters that contribute to the regression surface.

An example of a stationary kernel is the squared-exponential correlation function, used in this tutorial: \[c_{\text{SE}}(x,x^{\prime}) = \exp\left(-\sum\limits_{i}\frac{(x_{i,A}-x_{i,A}^{\prime})^2}{\theta_i^2}\right).\] This function, also called the Gaussian correlation function, has the mathematical advantage of being differentiable infinitely many times. In particular, this implies that observing a small continuous fraction of the input space is sufficient to recover the whole process. Such a strong property is not always suitable, since it might be unrealistic to assume that information from a small portion of the input space allows us to infer the behaviour of the process everywhere else.
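A direct R implementation of this kernel, restricted to the active inputs, might look as follows; the function and argument names are ours, not hmer's.

```r
# Squared-exponential (Gaussian) correlation between two points, computed on the
# active coordinates only; theta is the vector of correlation lengths.
c_se <- function(x, xp, theta) exp(-sum((x - xp)^2 / theta^2))

c_se(c(0.2, 0.7), c(0.2, 0.7), theta = c(0.5, 0.5))  # identical points: correlation 1
c_se(c(0.2, 0.7), c(0.4, 0.3), theta = c(0.5, 0.5))  # correlation decays with distance
```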

In such cases, it might be preferable to use the Matérn correlation functions, a family of parametric stationary kernels. The Matérn kernel of order \(\nu\) is defined as \[c_\nu(x,x^{\prime}) = \prod\limits_{i}\frac{(|x_{i,A}-x_{i,A}^{\prime}|/\theta_i)^\nu K_\nu(|x_{i,A}-x_{i,A}^{\prime}|/\theta_i)}{2^{\nu -1}\Gamma(\nu)},\] where \(K_\nu\) is the modified Bessel function of the second kind and order \(\nu\). Compared to the Gaussian kernel, which is differentiable infinitely many times, \(c_\nu\) is ‘less smooth’, being differentiable only \((\left \lceil{\nu}\right \rceil -1)\) times (here \(\left \lceil{\nu}\right \rceil\) denotes the ceiling function). The Matérn kernels can be thought of as a generalisation of the Gaussian kernel: when \(\nu \rightarrow \infty\), the kernel \(c_\nu\) converges to \(c_{\text{SE}}\).
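A sketch of the Matérn kernel in R is given below, using the base besselK function for \(K_\nu\); the case \(x_{i,A}=x_{i,A}^\prime\) is handled separately because the formula above is a \(0\cdot\infty\) limit there, whose limiting value is 1. Names are illustrative, as before.

```r
# Matern correlation of order nu between two points (active coordinates only).
# besselK(z, nu) is base R's modified Bessel function of the second kind K_nu.
c_matern <- function(x, xp, theta, nu) {
  z <- abs(x - xp) / theta
  per_dim <- rep(1, length(z))                     # limiting value 1 when z = 0
  nz <- z > 0
  per_dim[nz] <- (z[nz]^nu * besselK(z[nz], nu)) / (2^(nu - 1) * gamma(nu))
  prod(per_dim)
}
```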

When \(\nu=1/2\), the kernel \(c_{1/2}\) coincides with the so-called exponential kernel: \[c_{\text{exp}}(x,x^{\prime}) = \exp\left(-\sum\limits_{i}\frac{|x_{i,A}-x_{i,A}^{\prime}|}{\theta_i}\right).\] Note that the sample paths of a process with the exponential kernel are not smooth.
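Using the Matérn sketch above (same illustrative naming), one can check numerically that \(\nu = 1/2\) reproduces the exponential kernel.

```r
# Exponential correlation function, active coordinates only.
c_exp <- function(x, xp, theta) exp(-sum(abs(x - xp) / theta))

x <- c(0.2, 0.7); xp <- c(0.4, 0.3); theta <- c(0.5, 0.5)
c_exp(x, xp, theta)               # exponential kernel
c_matern(x, xp, theta, nu = 0.5)  # Matern with nu = 1/2 gives the same value
```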

A third example of a stationary kernel is the rational quadratic kernel, defined by \[c_{\text{RQ}}(x,x^{\prime}) = \prod\limits_{i}\frac{1}{1+(x_{i,A}-x_{i,A}^{\prime})^2/\theta_i^2} .\] Like the Gaussian kernel, this correlation function is differentiable infinitely many times.
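For completeness, a one-line R version of this kernel, with the same illustrative naming as above:

```r
# Rational quadratic correlation, active coordinates only.
c_rq <- function(x, xp, theta) prod(1 / (1 + (x - xp)^2 / theta^2))

c_rq(c(0.2, 0.7), c(0.4, 0.3), theta = c(0.5, 0.5))
```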

Another key property of kernels is isotropy. A stationary kernel is said to be isotropic (or homogeneous) when it depends only on the distance between \(x\) and \(x^\prime\), rather than on the full vector \(x-x^\prime\). In other words, an isotropic kernel depends on the length of \(x-x^\prime\), but not on its direction. All the stationary kernels mentioned above are isotropic if and only if \(\theta_i=\theta_j\) for all \(i,j\).

We conclude by recalling the role of the hyperparameters \(\theta_i\), which are called correlation lengths. As just seen, when \(\theta_i=\theta\) for all \(i\), the kernel is isotropic and therefore has rotational symmetry. In this case, the value \(\theta\) has the following interpretation: the larger \(\theta\) is, the smoother the local variations of the emulator will be. In the non-isotropic case, \(\theta_i<\theta_j\) means that we believe the function to be less smooth with respect to parameter \(i\) than with respect to parameter \(j\).
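The short numerical example below (illustrative values) makes this concrete for a one-dimensional squared-exponential kernel: at the same separation between two inputs, a small correlation length leaves their responses essentially uncorrelated, while a larger one keeps them strongly correlated, which translates into a smoother emulator.

```r
# Effect of the correlation length theta on the 1-d squared-exponential kernel.
c_se_1d <- function(d, theta) exp(-(d / theta)^2)

d <- 0.3                   # fixed separation between two input points
c_se_1d(d, theta = 0.1)    # ~ 1e-4: almost uncorrelated, rapidly varying emulator
c_se_1d(d, theta = 1.0)    # ~ 0.91: strongly correlated, smooth emulator
```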