In this post, I’m gonna provide an introductory tutorial for fitting an extreme value distribution via Bayesian analysis with * Stan* and its R interface,

*. Stan is a probabilistic programming language written in C++ suitable for Bayesian inference and rstan is a package providing access to Stan thru R. More about Stan can be found here.*

**rstan**In extreme value statistics, a very useful distribution for modeling upper-tail data is the Generalized Extreme Value (GEV) distribution. Unlike the frequentist approach, where the three parameters (location, scale, and shape) are obtained typically via Maximum Likelihood Estimation (MLE) and their uncertainty is only asymptotically derived by MLE properties, a Bayesian approach explicitly accounts for uncertainty in parameter estimates by inferring a respective distribution for each parameter, the so-called “posterior” distribution.

A typical workflow when using Stan/rstan consists of the following steps:

- Develop a Bayesian formulation, i.e., a Stan model file ending with .stan
- In R, fit the model with the use of rstan package by passing the model file and the data as arguments to the
*rstan::sampling*function

Our .stan file (say we name it ourmodel.stan) consists of 5 blocks which are used to construct the Bayesian formulation of our analysis. Inside the ** functions **block, we code a real function (gev_lpdf) which returns the log-probability density function of the GEV distribution. The

**block comprises of simply declaring our data variables which in this case are: a) the number of observations (N), and b) the observations themselves (y). The purpose of the**

*data**block is to declare the parameters for which inference is sought. Note that these parameters are declared using lower/upper limits; particularly for the limits of the shape parameter, the minimum and maximum values of the data are needed and thus an additional*

**parameters****block is necessary where we should declare these two “new” data points. Finally, the**

*transformed**data**block allows us to provide prior distributions for each one of the parameters (here, normal distributions) and set the expression for likelihood function.*

**model**```
functions{
real gev_lpdf(vector y, real loc, real scale, real shape) {
vector[rows(y)] z;
vector[rows(y)] lp;
real loglike;
int N;
N = rows(y);
if (fabs(shape) > 1e-10) {
for(i in 1:N){
z[i] = 1 + shape * (y[i] - loc) / scale;
lp[i] = (1 + 1 / shape) * log(z[i]) + pow(z[i], -1 / shape);
}
loglike = - sum(lp) - N * log(scale);
} else {
for (i in 1:N){
z[i] = (y[i] - loc) / scale;
lp[i] = z[i] + exp(-z[i]);
}
loglike = - sum(lp) - N * log(scale);
}
return(loglike);
}
}
data {
int<lower=0> N;
vector[N] y;
}
transformed data {
real miny;
real maxy;
miny = min(y);
maxy = max(y);
}
parameters {
real<lower=-1.0,upper=1.0> shape;
real<lower=0> scale;
real<lower=(shape > 0 ? miny : negative_infinity()),
upper=(shape > 0 ? positive_infinity() : maxy) > loc;
}
model {
scale ~ normal(0, 10);
shape ~ normal(0, 10);
loc ~ normal(0, 10);
y ~ gev(loc, scale, shape);
}
```

Now, in R, we load the rstan package and read some data. Here, for illustration purposes, I’m reading 99 years of annual maxima storm surge (extreme sea level) data at Boston, MA, which are accessible here. We extract the number of years and observations and we load our stan model via the *rstan::stan_model* function. To fit the model, the *rstan::sampling* function is utilized passing as main arguments the stan model and data, and as additional arguments the number of Markov Chain Monte Carlo iterations and chains. Finally, we print the results of the fit. The algorithm has converged with summary percentiles for the three parameters shown in the table.

```
library("rstan")
storm.surge <- read.csv("Boston-surge.csv")
N <- nrow(storm.surge)
y <- storm.surge[["Surge"]]
model <- stan_model("ourmodel.stan")
fit <- sampling(model, list(N = N, y = y), iter = 10000, chains = 1)
print(fit)
#Inference for Stan model: fit.
#1 chains, each with iter=10000; warmup=5000; thin=1;
#post-warmup draws per chain=5000, total post-warmup draws=5000.
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#shape 0.05 0.00 0.04 0.00 0.02 0.04 0.07 0.15 629 1
#scale 0.17 0.00 0.01 0.15 0.16 0.17 0.18 0.20 947 1
#loc 0.78 0.00 0.02 0.74 0.77 0.78 0.79 0.81 1020 1
#lp__ 16.45 0.04 1.23 13.36 15.83 16.75 17.37 17.97 826 1
```