This article was submitted to AI in Food, Agriculture and Water, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

A method (Ember) for nonstationary spatial modeling with multiple secondary variables, combining geostatistics with Random Forests, is applied to a three-dimensional reservoir model. It extends the Random Forest method to an interpolation algorithm that retains consistency properties similar to those of both geostatistical algorithms and Random Forests. It allows simpler interpolation algorithms to be embedded into the process and combined through the Random Forest training. The algorithm estimates a conditional distribution at each target location; the family of such distributions is called the model envelope. An algorithm to produce stochastic simulations from the envelope is demonstrated. This algorithm allows the influence of the secondary variables, as well as the variability of the result, to vary by location in the simulation.

An issue that practitioners of spatial modeling confront is that the ensemble behavior of the modeled random function might not match observable properties of the dataset in some types of application. For example, some auxiliary hypotheses of a generative model, such as stationarity, needed to allow inference, may leave an unwelcome trace in the predicted results. The subsurface reality itself is rarely stationary, and the usual remedy for scientists is to subdivide the domain of interest into regions which they believe to be approximately stationary (or reducible to stationary by removing a trend) and to create separate models within each region. While this generally works well in practice, it suffers from two potential downsides. First, the number of regions required can be quite large, necessitating a labor-intensive and error-prone procedure. Second, a region which is modeled as stationary may, in fact, have some subtle but distinct sub-regions which, if identified, would affect some desired applications of the model (for example, fluid flow depends strongly on vertical spatial divisions).

A compounding factor is that when predicting a target variable at a location, there are often a number of secondary variables known at that location which are covariate with the target variable of interest, e.g.,

In this article, a simple alternative procedure is proposed which is aimed at reducing these effects. Influenced by the idea of Conditional Random Fields (CRFs) and the classic geostatistical wariness of very highly parametric models, the idea is to directly estimate an envelope of conditional distributions based on the secondary data and on prior speculative spatial models. The envelope of distributions can be thought of as a generalization of a trend model. These distributions can be quite nonstationary, reflecting local heterogeneity patterns. They provide estimates of the target variable, as well as a locally adapted estimate of uncertainty. This can be carried out without invoking a full multivariate distribution. It is, however, not possible to extend the workflow to produce realizations of a random function without such an assumption. In most traditional formulations, this assumption is made up front. Examples are Gaussian Random Fields (GFs), Markov Random Fields (MRFs), Hidden Markov Models (HMMs), and many variants. Since the prior model is generally made with some hypothesis of stationarity, the risk of this hypothesis persisting into the results should be considered.

For the approach considered here, the realizations are made by conditional sampling from the distributions that we have estimated. However, it is only the sampling that needs to be assumed stationary. Hence, the full spatial model is only for the sampling function, and there is no explicit random function model of the target variable itself, only an envelope of distributions which depends on the state of knowledge about the target. To summarize,

(a) A machine learning/conditional distribution estimation algorithm is used to get a distribution at each spatial location. This family of distributions is called the envelope in this article.

(b) A stationary ergodic random function is used to sample from the envelope at each location conditionally on the observed value of the target variable at that location.
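The two-step procedure above can be sketched in a few lines. The following is a minimal, illustrative 1-D sketch, not the Ember implementation: the envelope is reduced to three stored quantiles per location, the sampling function is a stationary Gaussian field with an assumed exponential covariance and range, and inversion of the local conditional CDF is approximated by interpolating between the stored quantiles (uniform scores outside [0.1, 0.9] are clamped to P10/P90).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# --- Hypothetical 1-D envelope: P10/P50/P90 of the conditional
#     distribution at each of n locations (in practice, step (a) output).
n = 200
x = np.linspace(0.0, 10.0, n)
p10 = 0.10 + 0.02 * np.sin(x)           # illustrative quantile curves
p50 = 0.18 + 0.03 * np.sin(x)
p90 = 0.26 + 0.04 * np.sin(x)
quantiles = np.stack([p10, p50, p90])   # shape (3, n)
probs = np.array([0.1, 0.5, 0.9])

# --- Step (b): one draw of a stationary Gaussian sampling function with
#     an exponential covariance (the range, 1.5, is an assumption).
h = np.abs(x[:, None] - x[None, :])
cov = np.exp(-h / 1.5)
L = np.linalg.cholesky(cov + 1e-8 * np.eye(n))  # jitter for stability
z = L @ rng.standard_normal(n)

# Map the Gaussian field to uniform scores, then invert each local
# conditional CDF by interpolating between the stored quantiles.
u = norm.cdf(z)
realization = np.array([np.interp(u[i], probs, quantiles[:, i])
                        for i in range(n)])
```

Because the sampling field is spatially correlated, neighboring locations receive similar uniform scores and hence coherent draws from their local distributions, while the marginal at each location still follows the envelope.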

As well as being the basis for simulating realizations, the envelope of conditional distributions can be used to obtain estimates of the mean, quantiles, robust estimates of uncertainty, and unbiased estimates of “sweet spots” such as locations where porosity is expected to be higher than a user-set threshold. The idea of “embedding” prior spatial models in the estimation of the envelope is what gives rise to the name Ember, which stands for embedded model estimator. This allows use of an essentially unlimited number of additional variables to enhance conditioning of the final model. It turns out to still be possible to make inference about the type of variogram to use for the stationary sampling random function, depending on the assumptions about the type of sampling that we wish to use. A major advantage of this method is that the final estimates are now allowed to be nonstationary. In other words, the predictor may “reweight” the importance of the variables in different parts of the field to adapt to local heterogeneity patterns.

After a more technical introduction to the method, an application to a synthetic subsurface reservoir model is shown. Additional technical discussion of the algorithm is held back to the appendix. Some alternative examples are in

In the classical universal kriging model, the trend is a point property (i.e., there is one value at each target location) and is often considered to be an estimate of the mean or of the low-frequency component of the model. It is not exact, in the sense that it does not honor the observed values of the target variable. Typically, it is constructed either as a local polynomial of the spatial co-ordinates (universal kriging) or using some additional variable(s) (e.g., external drift). In the Ember model, a conditional distribution is estimated at each location. In an analogy with the trend, the conditional distribution is built using the spatial co-ordinates and additional variables. In addition, the envelope estimation step will often use the predictions of simpler models to aid calculation of the conditional distributions at each location by embedding. In this work, the embedded model is kriging. The rationale is that the secondary variables, which might include seismic attributes, stress estimates, distance to the nearest fault, and height above contact, as well as true vertical depth, stratigraphic depth, and x,y spatial locations, do not explicitly contain information about the spatial correlation which is available in kriging through the variogram. In the example, we will see that including the additional information which kriging brings can help constrain the envelope. Depending on the case, the embedded model may be a weak variable, contributing little, or a very strong variable which comes close to determining the solution. We note that embedding models takes a little extra work, as model predictions do not behave exactly the same as data during training.
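To make the embedding idea concrete, here is a hedged, self-contained sketch in the spirit of a quantile-style Random Forest. All data, attribute names, and the inverse-distance interpolator (standing in for the embedded kriging model) are synthetic illustrations, not the Ember algorithm itself: the embedded model's prediction simply enters the forest as one more training column, and per-location conditional distributions are read off by pooling co-leaf training responses tree by tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# --- Toy 1-D setup: sparse "well" observations of porosity and a dense
#     prediction grid; attribute definitions are illustrative stand-ins.
n_wells, n_grid = 40, 200
wx = np.sort(rng.uniform(0.0, 10.0, n_wells))
poro = 0.15 + 0.05 * np.sin(wx) + 0.01 * rng.standard_normal(n_wells)
gx = np.linspace(0.0, 10.0, n_grid)

def secondary(coords):
    # Two synthetic secondary attributes (stand-ins for seismic attributes).
    return np.column_stack([np.sin(coords), np.cos(0.7 * coords)])

def embedded_model(coords):
    # Stand-in for the embedded kriging prediction: inverse-distance
    # weighting of the well values (a real workflow embeds kriging here).
    d = np.abs(coords[:, None] - wx[None, :]) + 1e-6
    w = 1.0 / d**2
    return (w * poro).sum(axis=1) / w.sum(axis=1)

X_train = np.column_stack([wx, secondary(wx), embedded_model(wx)])
X_grid = np.column_stack([gx, secondary(gx), embedded_model(gx)])

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                           random_state=0).fit(X_train, poro)

# --- Envelope: per grid cell, pool the training responses falling in the
#     same leaf as the cell, tree by tree (a simplified, equal-weight
#     variant of the quantile-regression-forest construction).
train_leaves = rf.apply(X_train)   # shape (n_wells, n_trees)
grid_leaves = rf.apply(X_grid)     # shape (n_grid, n_trees)
envelope = np.empty((n_grid, 3))
for i in range(n_grid):
    pooled = np.concatenate(
        [poro[train_leaves[:, t] == grid_leaves[i, t]]
         for t in range(train_leaves.shape[1])])
    envelope[i] = np.quantile(pooled, [0.1, 0.5, 0.9])
p10, p50, p90 = envelope.T
```

The design point is that the forest is free to lean heavily on the embedded prediction where it is informative and to ignore it elsewhere, which is what lets the relative weight of each variable vary across the field.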

Conditional Random Fields (CRFs) avoid construction of the multivariate law (

The form of CRF that we use here to calculate the envelope accommodates and embeds existing spatial models using a Markovian hypothesis. Let

This hypothesis states that the conditional distribution of

In this work, we choose to base the algorithm for calculation of the envelope on a highly successful nonparametric paradigm, the Decision Forest, e.g.,

Embedded models are treated slightly differently to secondary data. They are embedded (

A number of methods have been proposed to do this, for example using Projection Pursuit,

A synthetic reservoir was constructed to allow for exploration of some of the issues raised in realistic surroundings. For a real case study, see

On the left, the true porosity of the reservoir to be modeled is shown. On the right, the synthetic RAI volume created for the study is shown. The locations of the first eight wells are shown on the left, while the thirty-six-well case is on the right.

Plan view with the 36-well configuration on the right and the eight-well one on the left.

A synthetic seismic volume was created from a slightly modified version of the reservoir to ensure that it does not correspond too neatly to the “real” reservoir. Several attributes were derived from this volume. The modeling was performed using 13 data variables and two embedded kriging models giving a total of 15 secondary variables. A cross section of the reservoir together with the facies distribution, which is not known for the study, is shown in

Cross section of the reservoir. The facies used for model construction are shown on the right, although these are not used in the subsequent modeling effort.

Cross section of nine of the thirteen data variables used in modeling. From left to right per row: RAI, chaos, flatness; sweetness, amplitude, dist. to fault; depositional zones, TVD, and strat. depth.

For simulations, a variogram is required for the Sampling Random Function. In many real-world reservoir modeling cases, there is not enough information to calculate reliable variograms, especially in the horizontal direction. In such cases, the range is considered an uncertainty. The same is true using the Ember model, although as we shall see, the uncertainty is often somewhat mitigated by the estimate of the conditional distribution. In the case that a meaningful fit for the variogram of the target variable can be found for classical modeling, then using
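As a reference point, one common variogram model that could be assumed for the sampling random function is the spherical model. The sill and range values below are illustrative assumptions only, not values fitted to the study data.

```python
import numpy as np

def spherical_variogram(h, sill=1.0, vrange=500.0):
    """Spherical variogram: rises to `sill` at distance `vrange`, flat beyond."""
    r = np.clip(np.asarray(h, dtype=float) / vrange, 0.0, 1.0)
    return sill * (1.5 * r - 0.5 * r**3)

# Illustrative lag distances (e.g., meters).
lags = np.array([0.0, 100.0, 250.0, 500.0, 800.0])
gamma = spherical_variogram(lags)
cov = 1.0 - gamma   # covariance of the corresponding unit-sill stationary model
```

When the range is treated as an uncertainty, the same sampling workflow can simply be repeated with several candidate ranges; the envelope itself is unchanged, only the spatial continuity of the draws varies.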

Since Ember simulations condition to wells, a posterior mean can be calculated by averaging many realizations. While it does not seem to bring much new information compared to the prior mean, and so probably does not need to be routinely calculated, the posterior mean for Ember, as well as for Gaussian simulation (sometimes called the E-type), is shown in

The results of stochastic simulation are shown in

Red marks locations where the true porosity is above 20%.

These high values lie within the channel belt. The channel belt is readily identified in the P50 and P90 cases, as well as in the probability map. For context, the color red in the probability map corresponds to estimated probabilities of 0.5 or above of finding sand with porosity above 20%. The spread between the P10 and P90 values shows that while the channel is identifiable, there is still variability, so high porosity patches are still possible outside the belt.
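Quantile (P10/P50/P90), E-type mean, and exceedance-probability maps like those discussed here are simple ensemble statistics over the stack of realizations. A minimal sketch, using synthetic stand-in realizations rather than actual Ember output:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stack of porosity realizations on a 50 x 50 layer grid;
# in a real workflow these would come from the simulation step.
n_real, ny, nx = 100, 50, 50
realizations = 0.18 + 0.05 * rng.standard_normal((n_real, ny, nx))

# E-type (posterior mean) and P10/P50/P90 maps across the ensemble.
e_type = realizations.mean(axis=0)
p10, p50, p90 = np.quantile(realizations, [0.1, 0.5, 0.9], axis=0)

# Exceedance probability: fraction of realizations with porosity > 20%.
prob_above_20 = (realizations > 0.20).mean(axis=0)
sweet_spots = prob_above_20 >= 0.5   # cells that would be flagged red
```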

Turning attention to the 36-well case to get a feel for how additional information changes the Ember estimates,

Upper images are the eight-well case, while the lower ones are the 36-well case. On the left is Ember simulation, while on the right is the associated uncertainty.

Next, a layer in the shoreface sands is considered. Uncertainty is lower in the shoreface and, moreover, reduces more quickly with increasing data due to the greater simplicity. The porosity distribution is not well identified on seismic within the shoreface, so the Ember estimate depends largely on geometric variables and the embedded kriging.

The classic Gaussian simulations referred to in

Ember and Gaussian simulations are shown in

Ember results for a blind well, circled in red. Two layers are on the left, one in channel sand and the other in the shoreface. Their depth locations on the well track are shown with arrows. Tracks are, from left, true porosity of the blind well; prob (poro>15%); the next five tracks are simulations; true porosity (red) superimposed on envelope; and RAI.

In the two layers shown so far, the seismic was a large contributor for the channel case but played little role in the shoreface case. To show an intermediate situation, another layer from a channelized formation is shown in

Same result as in

In the top left of

In the 36-well case, both channels are well identified in the mean, and the simulations tend to respect them. The eastern channel is still better identified due to the combination of seismic and embedded kriging, while the western one is less well defined, as it is less able to exploit the seismic attribute.

So far, we have focused on plan views of several layers of the model. For completeness and to see the value of the additional well information, we return to the cross-sectional view in

Shoreface layer. Left column, top truth and bottom RAI; middle column, Ember estimates and eight-well case on the top; right column, Ember uncertainty and eight-well case on the top.

Before showing some numerical results, it is worth noting that the classic Gaussian simulation model was fitted in a few different ways with separate zones, trends, and covariance models. Three different results are shown in

Top left is truth; bottom left is RAI. For the other three columns, the top is eight wells and the bottom 36 wells. Left to right: mean, uncertainty, and simulation from Ember algorithm.

The best Gaussian model was an MM2 model with an unpublished modification developed by the author for the Petrel software suite to account for multiple secondary variables. An MM2 model really amounts to the estimation or simulation of the residuals accounting for the trends (

Variance of error and IQR for three Gaussian models and the Ember model, as well as for the poor choice of “short” variogram for both types. Eight-well case: green is the best case, and red is the worst case.

8 Well model | Zone | Var error | IQR error | Var sim err | IQR sim err
---|---|---|---|---|---
Gaussian 1 | Channel | 45.20 | 6.12 | 71.16 | 7.28
 | Shoreface | 15.11 | 3.80 | 46.22 | 7.82
Gaussian 2 | Channel | 38.85 | 6.98 | 81.17 | 7.67
 | Shoreface | 10.02 | 2.63 | 26.63 | 5.33
Gaussian 3 | Channel | 45.03 | 6.88 | 80.93 | 8.11
 | Shoreface | 25.28 | 2.52 | 32.13 | 3.57
Ember | Channel | 39.57 | 8.07 | 83.76 | 7.42
 | Shoreface | 8.64 | 2.75 | 12.52 | 2.92
Emb short vario | Channel | 39.57 | 8.07 | 88.61 | 7.62
 | Shoreface | 8.64 | 2.75 | 21.64 | 3.92
Gau short vario | Channel | 48.51 | 8.15 | 95.54 | 9.39
 | Shoreface | 20.71 | 5.66 | 35.65 | 7.17

Variance of error and IQR for three Gaussian models and the Ember model, as well as for the poor choice of “short” variogram for both types, 36-well case.

36 Well model | Zone | Var error | IQR error | Var sim err | IQR sim err
---|---|---|---|---|---
Gaussian 1 | Channel | 32.26 | 3.38 | 42.46 | 5.21
 | Shoreface | 5.39 | 1.31 | 19.13 | 4.39
Gaussian 2 | Channel | 28.28 | 3.98 | 53.38 | 6.59
 | Shoreface | 3.71 | 0.75 | 8.93 | 2.66
Gaussian 3 | Channel | 31.17 | 4.11 | 54.96 | 5.69
 | Shoreface | 6.11 | 1.43 | 7.15 | 2.12
Ember | Channel | 26.29 | 4.35 | 52.67 | 4.51
 | Shoreface | 2.56 | 0.83 | 3.00 | 0.96
Emb short vario | Channel | 26.29 | 4.35 | 52.67 | 4.45
 | Shoreface | 2.56 | 0.83 | 7.10 | 1.50
Gau short vario | Channel | 43.55 | 8.21 | 92.11 | 8.92
 | Shoreface | 14.41 | 4.22 | 30.83 | 6.37

Since the Ember model estimates the conditional distribution at each location, it can be interesting to view what this envelope looks like. To do this, we have selected a location for a blind well and estimated the values there. Results of the Ember model are shown in the eight-well case in

Left is true porosity, while the right-hand side is RAI.

The results show, as expected, that predictions are easier and have lower uncertainty in the shoreface sands. The first track is the true porosity at the blind well location. Each thin black line on the track is a 5% porosity increment, so porosity values above 15% cross the third black line. The second track shows the estimated probability that the location has porosity above 15%. These probabilities are much higher in the shoreface sands, reflecting greater confidence in the estimate there. So, for example, there is a thin layer of shoreface sand just above the lower red arrow with porosity just above 15%. It is predicted with high probability (about 0.75) and appears in all five of the simulations. In fact, the simulations vary little within the shoreface and are all very similar to the true porosity log. This is confirmed by the view of the envelope itself, which is the second track from the right. It shows the true log in red, overlaying the conditional distribution, which is seen as a gray-scale image “spread horizontally” at each depth. The lower bound of the “spread” is the P10, and the upper bound is the P90 of the conditional distribution. The true porosity lies between the P10 and P90 most of the time. The darker the color of the distribution at any point on the track, the higher the probability of having that value (i.e., the color represents probability density). In the shoreface sands, we see that the distribution is very concentrated, meaning the conditional distribution has low variability.

In the channel sands, the conditional distribution has a far larger spread. Interestingly, at some depths, near the top for example (e.g., depth = 2280), the distribution starts very dark on the left, becomes paler in the middle, and returns to dark on the right-hand side. This tells us that the distribution is bimodal at that depth. On reflection, this is not surprising: the prediction knows that it is in a channel interval, and within such intervals it must either be in the channel or not. Hence, it has high porosity or low porosity, but only rarely a “middle” porosity. As expected, we see that the true curve switches back and forth between the extremes of the distribution but does not usually fall in the middle. The simulations, since they sample from this conditional distribution, must have the same switching effect; hence, there will be fewer “shoulder effects” in channel intervals than with a continuous Gaussian model.

Finally in

There are an increasing number of problems which require estimation of a sparsely observed spatially distributed target variable. In the example given in this paper, we considered estimation of petrophysical variables in a subsurface reservoir. While the target variable is sparse, in most situations, there are many other variables which are covariate with the target. These may be directly observed variables such as those obtained from seismic or electromagnetic measurements, but may also include some variables that can be observed by careful consideration of the environment, including spatial or geological position and proximity to important structural features such as faults. The statistical behavior tends to change locally both per variable and for interactions between variables.

Mathematical models of the spatial phenomena generally need to invoke some notion of stationarity to become tractable. For major studies with significant economic or environmental impact, the user may spend a considerable time on model construction. For three-dimensional geological applications, an important part of the art of the practitioner is to segment the model into parts which may be adequately modeled with existing tools. A major objective of the current development is to simplify this workload for the user. The Random Forest algorithm that was used can consume a large number of correlated secondary variables with little overfitting, and it is shown that it can make use of simpler embedded models, provided they have predictive power. Rather than starting with an unrealistic stationary random function as prior for the target variable, with the risk that the weakness of the data coupled with the comparative strength of the prior leads to incorrect predictions, this algorithm opts to initially solve a weaker problem. It produces an envelope of distributions at the target locations which depends on the current state of knowledge about the training data and which can subsequently be sampled from to produce realizations of the target variable given that state of knowledge.

Such a procedure will involve risks. Creating the envelope by resampling means that the model only explores uncertainty within the limits of the observed data. Embedding loses information about the multivariate distribution of the target, so if such information is available, then direct modeling will likely lead to tighter results. Finally, like many spatial modeling algorithms, its performance will degrade when used for extrapolation.

The project data is available at

The author confirms being the sole contributor of this work and has approved it for publication.

The author declares that he is an employee of Schlumberger Ltd. and that the work was conducted as part of that employment. The funder had no role in the study design, data collection and analysis or preparation of the manuscript but gave the author permission to publish.

The Supplementary Material for this article can be found online at: