Bayesian data analysis of extreme values with Stan and rstan.

In this post, I’ll provide an introductory tutorial on fitting an extreme value distribution via Bayesian analysis with Stan and its R interface, rstan. Stan is a probabilistic programming language written in C++ suitable for Bayesian inference, and rstan is a package providing access to Stan through R. More about Stan can be found here.

In extreme value statistics, a very useful distribution for modeling upper-tail data is the Generalized Extreme Value (GEV) distribution. In the frequentist approach, its three parameters (location, scale, and shape) are typically obtained via Maximum Likelihood Estimation (MLE), and their uncertainty is derived only asymptotically from MLE properties. A Bayesian approach, by contrast, explicitly accounts for uncertainty in the parameter estimates by inferring a full distribution for each parameter, the so-called “posterior” distribution.
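To make the likelihood we are about to code explicit: with location \(\mu\), scale \(\sigma\), and shape \(\xi\), the GEV log-density of a single observation \(y\) is

```latex
\log f(y \mid \mu, \sigma, \xi)
  = -\log\sigma - \left(1 + \tfrac{1}{\xi}\right)\log z - z^{-1/\xi},
\qquad z = 1 + \xi\,\frac{y - \mu}{\sigma} > 0,
```

with the Gumbel limit \(\log f = -\log\sigma - z - e^{-z}\), where \(z = (y-\mu)/\sigma\), as \(\xi \to 0\). Summing this over the observations is exactly what the gev_lpdf function coded further down does.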

A typical workflow when using Stan/rstan consists of the following steps:

  1. Develop a Bayesian formulation, i.e., a Stan model file ending with .stan
  2. In R, compile the model with the rstan::stan_model function, then fit it by passing the compiled model and the data as arguments to the rstan::sampling function

Our .stan file (say we name it ourmodel.stan) consists of five blocks, which together construct the Bayesian formulation of our analysis. Inside the functions block, we code a function (gev_lpdf), returning a real, which computes the log-probability density of the GEV distribution. The data block simply declares our data variables, which in this case are: a) the number of observations (N), and b) the observations themselves (y). The purpose of the parameters block is to declare the parameters for which inference is sought. Note that these parameters are declared using lower/upper limits; in particular, the limits of the location parameter depend on the sign of the shape parameter and require the minimum and maximum values of the data, so an additional transformed data block is necessary where we declare these two “new” data quantities. Finally, the model block allows us to provide prior distributions for each one of the parameters (here, normal distributions) and to set the expression for the likelihood function.

functions{
	
	real gev_lpdf(vector y, real loc, real scale, real shape) {
		vector[rows(y)] z;
		vector[rows(y)] lp;
		real loglike;
		int N;
		
		N = rows(y);
		
		if (fabs(shape) > 1e-10) {
			for (i in 1:N) {
				z[i] = 1 + shape * (y[i] - loc) / scale;
				// guard the GEV support: 1 + shape * (y - loc) / scale must be positive
				if (z[i] <= 0)
					reject("GEV support violated for observation ", i);
				lp[i] = (1 + 1 / shape) * log(z[i]) + pow(z[i], -1 / shape);
			}
			loglike = - sum(lp) - N * log(scale);
		} else {
			// Gumbel limit as shape -> 0
			for (i in 1:N) {
				z[i] = (y[i] - loc) / scale;
				lp[i] = z[i] + exp(-z[i]);
			}
			loglike = - sum(lp) - N * log(scale);
		}
		
		return(loglike);
	}
}

data {
	int<lower=0> N;
	vector[N] y;
}

transformed data {
	real miny;
	real maxy;
	
	miny = min(y);
	maxy = max(y);
}

parameters {
	real<lower=-1.0,upper=1.0> shape;
	real<lower=0> scale;
	real<lower=(shape > 0 ? miny : negative_infinity()),
	     upper=(shape > 0 ? positive_infinity() : maxy)> loc;
}

model {
	// weakly informative priors; scale is effectively half-normal due to its lower bound
	scale ~ normal(0, 10);
	shape ~ normal(0, 10);
	loc ~ normal(0, 10);
	// likelihood via the user-defined gev_lpdf
	y ~ gev(loc, scale, shape);
}
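As a quick sanity check on the density coded in gev_lpdf (not part of the original workflow), the same log-likelihood formula can be compared against SciPy's genextreme distribution. Note that SciPy parameterizes the shape as c = -ξ relative to the convention used above; the numbers below are arbitrary illustration values, not fitted estimates.

```python
import numpy as np
from scipy.stats import genextreme

def gev_loglik(y, loc, scale, shape):
    """Log-likelihood matching the Stan gev_lpdf above (shape != 0 branch)."""
    z = 1 + shape * (y - loc) / scale
    if np.any(z <= 0):
        return -np.inf  # outside the GEV support
    lp = (1 + 1 / shape) * np.log(z) + z ** (-1 / shape)
    return -lp.sum() - y.size * np.log(scale)

# arbitrary illustration values
y = np.array([0.6, 0.8, 1.0, 1.3])
loc, scale, shape = 0.78, 0.17, 0.05

ours = gev_loglik(y, loc, scale, shape)
# SciPy's genextreme uses c = -shape relative to the GEV convention above
scipy_ll = genextreme.logpdf(y, c=-shape, loc=loc, scale=scale).sum()
print(np.isclose(ours, scipy_ll))  # the two conventions should agree
```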

Now, in R, we load the rstan package and read some data. Here, for illustration purposes, I’m reading 99 years of annual-maximum storm surge (extreme sea level) data at Boston, MA, which are accessible here. We extract the number of years and the observations, and compile our Stan model via the rstan::stan_model function. To fit the model, the rstan::sampling function is used, passing as main arguments the compiled model and the data, and as additional arguments the number of Markov chain Monte Carlo (MCMC) iterations and chains. Finally, we print the results of the fit. The algorithm has converged (Rhat equals 1 for all parameters), with summary percentiles for the three parameters shown in the printed output below.

library("rstan")

# annual-maximum storm surge data at Boston, MA
storm.surge <- read.csv("Boston-surge.csv")

N <- nrow(storm.surge)
y <- storm.surge[["Surge"]]

# compile the Stan model, then draw MCMC samples
model <- stan_model("ourmodel.stan")

fit <- sampling(model, list(N = N, y = y), iter = 10000, chains = 1)

print(fit)

#Inference for Stan model: fit.
#1 chains, each with iter=10000; warmup=5000; thin=1; 
#post-warmup draws per chain=5000, total post-warmup draws=5000.

#       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
#shape  0.05    0.00 0.04  0.00  0.02  0.04  0.07  0.15   629    1
#scale  0.17    0.00 0.01  0.15  0.16  0.17  0.18  0.20   947    1
#loc    0.78    0.00 0.02  0.74  0.77  0.78  0.79  0.81  1020    1
#lp__  16.45    0.04 1.23 13.36 15.83 16.75 17.37 17.97   826    1