
\documentclass{article}

\usepackage{natbib}
\usepackage{graphics}
\usepackage{amsmath}
\usepackage{indentfirst}
\usepackage{url}
\usepackage[utf8]{inputenc}

\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}

% \VignetteIndexEntry{MCMC Example}

\begin{document}

<<foo,include=FALSE,echo=FALSE>>=
options(keep.source = TRUE, width = 60)
@

\title{MCMC Package Example (Version \Sexpr{packageVersion("mcmc")})}
\author{Charles J. Geyer}
\maketitle

\section{The Problem}

This is an example of using the \verb@mcmc@ package in R.  The problem comes
from a take-home question on a (take-home) PhD qualifying exam
(School of Statistics, University of Minnesota).

Simulated data for the problem are in the dataset \verb@logit@.
There are five variables in the data set, the response \verb@y@
and four predictors, \verb@x1@, \verb@x2@, \verb@x3@, and \verb@x4@.

A frequentist analysis for the problem is done by the following R statements
<<frequentist>>=
library(mcmc)
data(logit)
out <- glm(y ~ x1 + x2 + x3 + x4, data = logit,
    family = binomial, x = TRUE)
summary(out)
@

But this problem isn't about that frequentist analysis, we want a Bayesian
analysis.  For our Bayesian analysis we assume the same data model as the
frequentist, and we assume the prior distribution of the five parameters
(the regression coefficients) makes them independent and identically
normally distributed with mean 0 and standard deviation 2.

The log unnormalized posterior density (log likelihood plus log prior)
for this model is calculated by
the following R function.  In order to avoid using either global variables
or \verb@...@ arguments, we use the function factory pattern
(\citealp[Section~10.3.1]{wickham};
\citealp[Section~7.5, especially Subsection~7.5.4]{basic};
see also Appendix~\ref{sec:versus} below).
<<log.unnormalized.posterior>>=
lupost_factory <- function(x, y) function(beta) {
    eta <- as.numeric(x %*% beta)
    logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(- eta)))
    logq <- ifelse(eta < 0, - log1p(exp(eta)), - eta - log1p(exp(- eta)))
    logl <- sum(logp[y == 1]) + sum(logq[y == 0])
    return(logl - sum(beta^2) / 8)
}

lupost <- lupost_factory(out$x, out$y)
@
The tricky calculation of the log likelihood avoids overflow and catastrophic
cancellation in calculation of $\log(p)$ and $\log(q)$ where
\begin{align*}
   p & = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(- \eta)}
   \\
   q & = \frac{1}{1 + \exp(\eta)} = \frac{\exp(- \eta)}{1 + \exp(- \eta)}
\end{align*}
so taking logs gives
\begin{align*}
   \log(p) & = \eta - \log(1 + \exp(\eta)) = - \log(1 + \exp(- \eta))
   \\
   \log(q) & = - \log(1 + \exp(\eta)) = - \eta - \log(1 + \exp(- \eta))
\end{align*}
To avoid overflow, we always chose the case where the argument of $\exp$
is negative.  We have also avoided catastrophic cancellation when
$\lvert\eta\rvert$ is large.  If $\eta$ is large and positive, then
\begin{align*}
   p & \approx 1
   \\
   q & \approx 0
   \\
   \log(p) & \approx - \exp(- \eta)
   \\
   \log(q) & \approx - \eta - \exp(- \eta)
\end{align*}
and our use of the R function \texttt{log1p}, which calculates the
function $x \mapsto \log(1 + x)$
correctly for small $x$ avoids all problems.  The case where $\eta$ is large
and negative is similar.

\section{Beginning MCMC}

With those definitions in place, the following code runs the Metropolis
algorithm to simulate the posterior.
<<metropolis-try-1>>=
set.seed(42)    # to get reproducible results
beta.init <- as.numeric(coefficients(out))

out <- metrop(lupost, beta.init, 1e3)
names(out)
out$accept
@

The arguments to the \verb@metrop@ function used here (there are others
we don't use) are
\begin{itemize}
\item an R function (here \verb@lupost@) that evaluates the log unnormalized
    density of the desired stationary distribution of the Markov chain
    (here a posterior distribution).  Note that (although this example
    does not exhibit the phenomenon) that the unnormalized density may
    be zero, in which case the log unnormalized density is \verb@-Inf@.
\item an initial state (here \verb@beta.init@) of the Markov chain.
\item a number of batches (here \verb@1e3@) for the Markov chain.
    This combines with batch length and spacing (both 1 by default)
    to determine the number of iterations done.
\item additional arguments (here \verb@x@ and \verb@y@) supplied to
    provided functions (here \verb@lupost@).
\item there is no ``burn-in'' argument, although burn-in is easily
    accomplished, if desired (more on this below).
\end{itemize}

The output is in the component \verb@out$batch@ returned by the \verb@metrop@
function.  We'll look at it presently, but first we need to adjust the
proposal to get a higher acceptance rate (\verb@out$accept@).  It is generally
accepted \citep*{grg} that an acceptance rate of about 20\% is right, although
this recommendation is based on the asymptotic analysis of a toy problem
(simulating a multivariate normal distribution) for which one would never
use MCMC and is very unrepresentative of difficult MCMC applications.

\citet{geyer-temp} came to a similar conclusion,
that a 20\% acceptance rate is about right, in a very different situation.
But they also warned that a 20\% acceptance rate could be very wrong
and produced
an example where a 20\% acceptance rate was impossible and attempting to
reduce the acceptance rate below 70\% would keep the sampler from ever
visiting part of the state space.  So the 20\% magic number must be
considered like other rules of thumb we teach in intro courses
(like $n > 30$ means means normal approximation is valid).
We know these rules of thumb can fail.
There are examples in the literature where
they do fail.  We keep repeating them because we want something simple to
tell beginners, and they are all right for some problems.

Be that as it may, we try for 20\%.
<<metropolis-try-2>>=
out <- metrop(out, scale = 0.1)
out$accept
out <- metrop(out, scale = 0.3)
out$accept
out <- metrop(out, scale = 0.5)
out$accept
out <- metrop(out, scale = 0.4)
out$accept
@

Here the first argument to each instance of the \verb@metrop@ function is
the output of a previous invocation.  The Markov chain continues where
the previous run stopped, doing just what it would have done if it had
kept going, the initial state and random seed being the final state and
final random seed of the previous invocation.  Everything stays the same
except for the arguments supplied (here \verb@scale@).
\begin{itemize}
\item The argument \verb@scale@ controls the size of the Metropolis
    ``normal random walk'' proposal.  The default is \verb@scale = 1@.
    Big steps give lower acceptance rates.  Small steps give higher.
    We want something about 20\%.  It is also possible to make \verb@scale@
    a vector or a matrix.  See \verb@help(metrop)@.
\end{itemize}

Because each run starts where the last one stopped (when the first argument
to \verb@metrop@ is the output of the previous invocation), each run serves
as ``burn-in'' for its successor (assuming that any part of that run was
worth anything at all).

\section{Diagnostics}

O.~K.  That does it for the acceptance rate.  So let's do a longer run
and look at the results.
<<label=metropolis-try-3>>=
out <- metrop(out, nbatch = 1e4)
t.test(out$accept.batch)$conf.int
out$time
@
Here we do a Monte Carlo confidence interval
for the true unknown acceptance rate
(what we would see with an infinite Monte Carlo sample size).

Figure~\ref{fig:fig1} (page~\pageref{fig:fig1})
shows the time series plot made by the R statement
<<label=fig1too,include=FALSE>>=
plot(ts(out$batch))
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=FALSE>>=
<<fig1too>>
@
\end{center}
\caption{Time series plot of MCMC output.}
\label{fig:fig1}
\end{figure}

Another way to look at the output is an autocorrelation plot.
Figure~\ref{fig:fig2} (page~\pageref{fig:fig2})
shows the time series plot made by the R statement
<<label=fig2too,include=FALSE>>=
acf(out$batch)
@
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=FALSE>>=
<<fig2too>>
@
\end{center}
\caption{Autocorrelation plot of MCMC output.}
\label{fig:fig2}
\end{figure}

As with any multiplot plot, these are a bit hard to read.  Readers are
invited to make the separate plots to get a better picture.
As with all ``diagnostic'' plots in MCMC, these don't ``diagnose''
subtle problems.
\begin{quotation}
The purpose of regression diagnostics is to find obvious, gross,
embarrassing problems that jump out of simple plots \citep{bogosity}.
\end{quotation}
The time series plots will show \emph{obvious} nonstationarity.
They will not show \emph{nonobvious} nonstationarity.  They
provide no guarantee whatsoever that your Markov chain is sampling
anything remotely resembling the correct stationary distribution
(with log unnormalized density \verb@lupost@).  In this very easy
problem, we do not expect any convergence difficulties and so believe
what the diagnostics seem to show, but one is a fool to trust such
diagnostics in difficult problems.

The autocorrelation plots seem to show that the
the autocorrelations are negligible after about lag 25.
This diagnostic inference is reliable if the sampler is actually
working (has nearly reached equilibrium) and worthless otherwise.
Thus batches of length 25 should be sufficient, but let's use
length 100 to be safe.

A more judicious discussion of asymptotics is found
in \citet[Section~1.11.5]{intro}, which says
\begin{quotation}
Many [diagnostics] come with theorems, but the theorems never prove
the property you really want a diagnostic to have. These theorems say
that if the chain converges, then the diagnostic will probably say that
the chain converged, but they do not say that if the chain pseudo-converges,
then the diagnostic will probably say that the chain did not converge.
\end{quotation}

\section{Monte Carlo Estimates and Standard Errors}

<<label=metropolis-try-4>>=
out <- metrop(out, nbatch = 100, blen = 100,
    outfun = function(z) c(z, z^2))
t.test(out$accept.batch)$conf.int
out$time
@

We have added an argument \verb@outfun@ that gives the functional
of the Markov chain \citep[Section~1.6]{intro} we want to average.
For this problem we are interested
in both posterior mean and variance.  Mean is easy, just average the
variables in question.  But variance is a little tricky.  We need to
use the identity
$$
   \var(X) = E(X^2) - E(X)^2
$$
to write variance as a function of two things that can be estimated
by simple averages.  Hence we want to average the state itself and
the squares of each component.  Hence our \verb@outfun@ returns
\verb@c(z, z^2)@ for an argument (the state vector) \verb@z@.

\subsection{Simple Means}

The grand means (means of batch means) are
<<label=metropolis-batch>>=
apply(out$batch, 2, mean)
@
The first 5 numbers are the Monte Carlo estimates of the posterior means.
The second 5 numbers are the Monte Carlo estimates of the posterior
ordinary second moments.  We get the posterior variances by
<<label=metropolis-batch-too>>=
foo <- apply(out$batch, 2, mean)
mu <- foo[1:5]
sigmasq <- foo[6:10] - mu^2
mu
sigmasq
@

Monte Carlo standard errors (MCSE) are calculated from the batch means.
This is simplest for the means.
<<label=metropolis-mcse-mu>>=
mu.mcse <- apply(out$batch[ , 1:5], 2, sd) / sqrt(out$nbatch)
mu.mcse
@
The extra factor \verb@sqrt(out$nbatch)@ arises because the batch means
have variance $\sigma^2 / b$ where $b$ is the batch length, which is
\verb@out$blen@,
whereas the overall means \verb@mu@ have variance $\sigma^2 / n$ where
$n$ is the total number of iterations, which is \verb@out$blen * out$nbatch@.

\subsection{Functions of Means}

To get the MCSE for the posterior variances we apply the delta method.
Let $u_i$ denote the sequence of batch means of the first kind for one
parameter and $\bar{u}$ the grand mean (the estimate of the posterior mean
of that parameter),
let $v_i$ denote the sequence of batch means of the second kind for the
same parameter and $\bar{v}$ the grand mean (the estimate of the posterior
second absolute moment of that parameter), and let $\mu = E(\bar{u})$ and
$\nu = E(\bar{v})$.  Then the delta method linearizes the nonlinear function
$$
   g(\mu, \nu) = \nu - \mu^2
$$
as
$$
   \Delta g(\mu, \nu) = \Delta \nu - 2 \mu \Delta \mu
$$
saying that
$$
   g(\bar{u}, \bar{v}) - g(\mu, \nu)
$$
has the same asymptotic normal distribution as
$$
   (\bar{v} - \nu) - 2 \mu (\bar{u} - \mu)
$$
which, of course, has variance \verb@1 / nbatch@ times that of
$$
   (v_i - \nu) - 2 \mu (u_i - \mu)
$$
and this variance is estimated by
$$
   \frac{1}{n_{\text{batch}}} \sum_{i = 1}^{n_{\text{batch}}}
   \bigl[ (v_i - \bar{v}) - 2 \bar{u} (u_i - \bar{u}) \bigr]^2
$$
So
<<label=metropolis-mcse-sigmasq>>=
u <- out$batch[ , 1:5]
v <- out$batch[ , 6:10]
ubar <- apply(u, 2, mean)
vbar <- apply(v, 2, mean)
deltau <- sweep(u, 2, ubar)
deltav <- sweep(v, 2, vbar)
foo <- sweep(deltau, 2, ubar, "*")
sigmasq.mcse <- sqrt(apply((deltav - 2 * foo)^2, 2, mean) / out$nbatch)
sigmasq.mcse
@
does the MCSE for the posterior variance.

Let's just check that this complicated \verb@sweep@ and \verb@apply@ stuff
does do the right thing.
<<label=metropolis-mcse-sigmasq-too>>=
sqrt(mean(((v[ , 2] - vbar[2]) - 2 * ubar[2] * (u[ , 2] - ubar[2]))^2) /
    out$nbatch)
@

\paragraph{Comment} Through version 0.5 of this vignette it contained
an incorrect procedure for calculating this MCSE, justified by a handwave
(which was incorrect).
Essentially, it said to use the standard deviation of the batch means called
\verb@v@ here, which appears to be very conservative.

\subsection{Functions of Functions of Means}

If we are also interested in the posterior standard deviation
(a natural question, although not asked on the exam problem),
the delta method gives its standard error in terms of that
for the variance
<<label=metropolis-mcse-sigma>>=
sigma <- sqrt(sigmasq)
sigma.mcse <- sigmasq.mcse / (2 * sigma)
sigma
sigma.mcse
@

\section{A Final Run}

So that's it.  The only thing left to do is a little more precision
(the exam problem directed ``use a long enough run of your Markov chain
sampler so that the MCSE are less than 0.01'')
<<label=metropolis-try-5>>=
out <- metrop(out, nbatch = 500, blen = 400)
t.test(out$accept.batch)$conf.int
out$time
<<metropolis-batch-too>>
<<metropolis-mcse-mu>>
<<metropolis-mcse-sigmasq>>
<<metropolis-mcse-sigma>>
@
and some nicer output, which is presented in three tables
constructed from the R variables defined above
using the R \verb@xtable@ command in the \verb@xtable@ library.
\begin{table}[ht]
\caption{Posterior Means}
\label{tab:mu}
\begin{center}
<<label=tab1,echo=FALSE,results=tex>>=
foo <- rbind(mu, mu.mcse)
dimnames(foo) <- list(c("estimate", "MCSE"),
    c("constant", paste("$x_", 1:4, "$", sep = "")))
library(xtable)
print(xtable(foo, digits = rep(4, 6),
    align = c("l", rep("c", 5))), floating = FALSE,
    caption.placement = "top",
    sanitize.colnames.function = function(x) return(x))
@
\end{center}
\end{table}
\begin{table}[ht]
\caption{Posterior Variances}
\label{tab:sigmasq}
\begin{center}
<<label=tab1,echo=FALSE,results=tex>>=
foo <- rbind(sigmasq, sigmasq.mcse)
dimnames(foo) <- list(c("estimate", "MCSE"),
    c("constant", paste("$x_", 1:4, "$", sep = "")))
library(xtable)
print(xtable(foo, digits = rep(4, 6),
    align = c("l", rep("c", 5))), floating = FALSE,
    caption.placement = "top",
    sanitize.colnames.function = function(x) return(x))
@
\end{center}
\end{table}
\begin{table}[ht]
\caption{Posterior Standard Deviations}
\label{tab:sigma}
\begin{center}
<<label=tab1,echo=FALSE,results=tex>>=
foo <- rbind(sigma, sigma.mcse)
dimnames(foo) <- list(c("estimate", "MCSE"),
    c("constant", paste("$x_", 1:4, "$", sep = "")))
library(xtable)
print(xtable(foo, digits = rep(4, 6),
    align = c("l", rep("c", 5))), floating = FALSE,
    caption.placement = "top",
    sanitize.colnames.function = function(x) return(x))
@
\end{center}
\end{table}

Note for the record that the all the results presented in the tables
are from ``one long run'' where long here took only
<<label=time,echo=FALSE,results=tex>>=
cat(out$time[1], "\n")
@
seconds (on whatever computer it was run on).

\section{New Variance Estimation Functions}

R function \texttt{initseq} (added in version 0.6 of this package)
estimates variances in the Markov chain
central limit theorem (CLT) following the methodology introduced by
\citet[Section~3.3]{practical}.  These methods only apply to scalar-valued
functionals of
reversible Markov chains, but the Markov chains produced by the \texttt{metrop}
function satisfy this condition, even, as we shall see below, when batching
is used.

Rather than redo the Markov chains in the preceding material, we just look
at a toy problem, an AR(1) time series, which can be simulated in one line
of R.  This is the example on the help page for \texttt{initseq}.
<<x>>=
n <- 2e4
rho <- 0.99
x <- arima.sim(model = list(ar = rho), n = n)
@
The time series \texttt{x} is a reversible Markov chain and trivially
a scalar-valued functional of a Markov chain.

Define
\begin{equation} \label{eq:little}
   \gamma_k = \cov(X_i, X_{i + k})
\end{equation}
where the covariances refer to the stationary Markov chain having the
same transition probabilities as \texttt{x}.  Then the variance in the CLT
is
$$
   \sigma^2 = \gamma_0 + 2 \sum_{k = 1}^\infty \gamma_k
$$
\citep[Theorem~2.1]{practical}, that is,
$$
   \bar{x}_n \approx \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right),
$$
where $\mu = E(X_i)$ is the quantity being estimated by MCMC (in this
toy problem we know $\mu = 0$).

Naive estimates of $\sigma^2$ obtained by plugging in empirical
estimates of the gammas do not provide consistent estimation
\citep[Section~3.1]{practical}.  Thus the scheme implemented
by the R function \texttt{initseq}.  Define
\begin{equation} \label{eq:big}
   \Gamma_k = \gamma_{2 k} + \gamma_{2 k + 1}
\end{equation}
\citet[Theorem~3.1]{practical} says that $\Gamma_k$ considered as a function
of $k$ is strictly positive, strictly decreasing, and strictly convex
(provided we are, as stated above, working with a reversible Markov chain).
Thus it makes sense to use estimators that use these properties.
The estimators implemented by the R function \texttt{initseq} and
described by \citet[Section~3.3]{practical} are conservative-consistent
in the sense of Theorem~3.2 of that section.

Figure~\ref{fig:gamma} (page~\pageref{fig:gamma})
shows the time series plot made by the R statement
<<label=figgamtoo,include=FALSE>>=
out <- initseq(x)
plot(seq(along = out$Gamma.pos) - 1, out$Gamma.pos,
        xlab = "k", ylab = expression(Gamma[k]), type = "l")
lines(seq(along = out$Gamma.dec) - 1, out$Gamma.dec, lty = "dotted")
lines(seq(along = out$Gamma.con) - 1, out$Gamma.con, lty = "dashed")
@
\begin{figure}
\begin{center}
<<label=figgam,fig=TRUE,echo=FALSE>>=
<<figgamtoo>>
@
\end{center}
\caption{Plot ``Big Gamma'' defined by \eqref{eq:little} and \eqref{eq:big}.
Solid line, initial positive sequence estimator.
Dotted line, initial monotone sequence estimator.
Dashed line, initial convex sequence estimator.}
\label{fig:gamma}
\end{figure}
One can use whichever curve one chooses, but now that
the \texttt{initseq} function makes the computation trivial, it makes
sense to use the initial convex sequence.

Of course, one is not interested in Figure~\ref{fig:gamma}, except
perhaps when explaining the methodology.  What is actually important
is the estimate of $\sigma^2$, which is given by
<<assvar>>=
out$var.con
(1 + rho) / (1 - rho) * 1 / (1 - rho^2)
@
where for comparison we have given the exact theoretical value of $\sigma^2$,
which, of course, is never available in a non-toy problem.

These initial sequence estimators seem, at first sight to be a competitor
for the method of batch means.  However, appearances can be deceiving.
The two methods are complementary.  The sequence of batch means is itself
a scalar-valued functional of a reversible Markov chain.  Hence the
initial sequence estimators can be applied to it.
<<batx>>=
blen <- 5
x.batch <- apply(matrix(x, nrow = blen), 2, mean)
bout <- initseq(x.batch)
@
Because the batch length is too short, the variance of the batch means
does not estimate $\sigma^2$.  We must account for the autocorrelation
of the batches, shown in Figure~\ref{fig:gambat}.
<<label=figgambattoo,include=FALSE>>=
plot(seq(along = bout$Gamma.con) - 1, bout$Gamma.con,
        xlab = "k", ylab = expression(Gamma[k]), type = "l")
@
\begin{figure}
\begin{center}
<<label=figgambat,fig=TRUE,echo=FALSE>>=
<<figgambattoo>>
@
\end{center}
\caption{Plot ``Big Gamma'' defined by \eqref{eq:little} and \eqref{eq:big}
for the sequence of batch means (batch length \Sexpr{blen}).
Only initial convex sequence estimator is shown.}
\label{fig:gambat}
\end{figure}
Because the the variance is proportional to one over the batch length,
we need to multiply by the batch length to estimate the $\sigma^2$
for the original series.
<<compvar>>=
out$var.con
bout$var.con * blen
@
Another way to look at this is that the MCMC estimator of $\mu$ is
either \texttt{mean(x)} or \texttt{mean(x.batch)}.  And the variance
must be divided by the sample size to give standard errors.  So either
<<ci-con>>=
mean(x) + c(-1, 1) * qnorm(0.975) * sqrt(out$var.con / length(x))
mean(x.batch) + c(-1, 1) * qnorm(0.975) * sqrt(bout$var.con / length(x.batch))
@
is an asymptotic 95\% confidence interval for $\mu$.  Just divide by
the relevant sample size.

\appendix

\section{Dot-dot-dot Versus Global Variables Versus Closures}
\label{sec:versus}

This appendix deals with three ways to pass information to a function
being passed to another R function (a higher-order function), for example when
\begin{itemize}
\item the function being passed is the objective function for an optimization
    done by the higher-order function, which optimizes
    (R function \texttt{optim} for example),
\item the function being passed is the integrand for an integration
    done by the higher-order function, which integrates
    (R function \texttt{integrate} for example),
\item the function being passed is the estimator of a parameter for a
    bootstrap done by the higher-order function, which simulates
    the (bootstrap approximation) of the sampling distribution of the
    estimator (R function \texttt{boot} in R package \texttt{boot},
    for example),
\item the function being passed is the log unnormalized density function
    for a simulation
    done by the higher-order function, which simulates the distribution
    having that unnormalized density
    (R function \texttt{metrop} in this package for example),
\end{itemize}
These ways are
\begin{itemize}
\item using dot-dot-dot (R syntax \verb@...@),
\item using global variables,
\item using closures, also called the function factory pattern.
\end{itemize}
The main body of this vignette uses the third option.  This appendix
explains them all and the virtues and vices of each.

\subsection{Dot-dot-dot}

The dot-dot-dot mechanism is fairly easy to use when only one function is passed
to the higher-order function, but does require care and more work to use.
It is even harder to deal with when more than one function is passed to
the higher-order function.  R functions \texttt{metrop} and \texttt{temper} in
this package can be passed two functions: the log unnormalized density
function and the output function.

\subsection{Only One Function Argument}

Previous versions of this vignette used the dot-dot-dot mechanism everywhere.
In those versions the log unnormalized density function was defined by
<<log.unnormalized.posterior-dot-dot-dot>>=
lupost <- function(beta, x, y) {
     eta <- as.numeric(x %*% beta)
     logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(- eta)))
     logq <- ifelse(eta < 0, - log1p(exp(eta)), - eta - log1p(exp(- eta)))
     logl <- sum(logp[y == 1]) + sum(logq[y == 0])
     return(logl - sum(beta^2) / 8)
}
@
rather than the way it is now done in the main text above.  Note that
everything is the same in the two definitions except here we have
\begin{verbatim}
lupost <- function(beta, x, y) {
\end{verbatim}
where in the main text we now have
\begin{verbatim}
lupost_factory <- function(x, y) function(beta) {
\end{verbatim}
and then we have to execute the function factory \verb@lupost_factory@
to make the \texttt{lupost} function.

But the main difference is that R function \texttt{lupost}
(the user written function specifying the log unnormalized posterior density)
\begin{itemize}
\item here has 3 arguments, \texttt{beta}, \texttt{x},
    and \texttt{y}, and the latter two must be passed via the dot-dot-dot
    mechanism, but
\item there has 1 argument, \texttt{beta}, and it
    just knows about \texttt{x} and \texttt{y} --- they are in its closure.
\end{itemize}

So to use this \texttt{lupost} function, we have to add arguments
\texttt{x} and \texttt{y} to each call to R function \texttt{metrop}.
For example
<<metropolis-try-dot-dot-dot>>=
out <- glm(y ~ x1 + x2 + x3 + x4, data = logit,
    family = binomial, x = TRUE)
x <- out$x
y <- out$y

out <- metrop(lupost, beta.init, 1e3, x = x, y = y)
out$accept
out <- metrop(out, scale = 0.1, x = x, y = y)
out$accept
out <- metrop(out, scale = 0.3, x = x, y = y)
out$accept
out <- metrop(out, scale = 0.5, x = x, y = y)
out$accept
out <- metrop(out, scale = 0.4, x = x, y = y)
out$accept
@
and so forth.

This method has the benefit that we do not have to explain function factories
and has the drawback that we need to keep remembering to add
\verb@x = x, y = y@ to each invocation of R function \texttt{metrop}.
It is unclear to me which pattern is more mysterious to naive users.

\subsection{More Than One Function Argument}

The situation becomes more complicated (see the Warning section of the help
pages for R functions \texttt{metrop} and \texttt{temper}) when more than
one function argument is passed to the higher-order function.  Then
\emph{they all must handle the \emph{same} dot-dot-dot arguments} whether
or not they want them.

So now we must define the output function as
<<outfun-dot-dot-dot>>=
outfun <- function(z, ...) c(z, z^2)
@
The \verb@...@ argument in the function signature is essential because
this function is going to be passed dot-dot-dot arguments \texttt{x}
and \texttt{y}, which it does not need and does not want, so it has to
allow for them (and then not use them).

Then we can continue
<<outfun-try-dot-dot-dot>>=
out <- metrop(out, nbatch = 100, blen = 100, outfun = outfun,
    x = x, y = y)
out$accept
@
and this works.  But we would have gotten an error about unused arguments
if we had defined the output function without the dot-dot-dot as we did
in the main text.

Your humble author was himself confused about this for years and so doubts
that naive R users will find it intuitive.

\subsection{Global Variables}

As every programmer knows global variables are easy to use and evil
\citep{c2}.  They are part of the way of R (or perhaps part of one of
the ways of R).  They are OK for ``very small or one-off programs'' \citep{c2}.
If you are going to throw the code away after one use, fine.
If you are going to give your code to other people or even use it
yourself six months later (after you have long forgotten the details),
then this method is evil and should be avoided.

In this method we define both functions passed to the higher-order function
without \verb@...@ and without using a function factory.  We already
in the preceding section of this appendix defined R objects \texttt{x} and
\texttt{y} as global variables (in the R global environment \verb@.GlobalEnv@).
They are global variables and we use them as such, defining
<<functions-global>>=
lupost <- function(beta) {
     eta <- as.numeric(x %*% beta)
     logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(- eta)))
     logq <- ifelse(eta < 0, - log1p(exp(eta)), - eta - log1p(exp(- eta)))
     logl <- sum(logp[y == 1]) + sum(logq[y == 0])
     return(logl - sum(beta^2) / 8)
}
outfun <- function(z) c(z, z^2)
@
Then the following works
<<doit-global>>=
out <- metrop(lupost, beta.init, 1e3)
out$accept
out <- metrop(out, scale = 0.1)
out$accept
out <- metrop(out, scale = 0.3)
out$accept
out <- metrop(out, scale = 0.5)
out$accept
out <- metrop(out, scale = 0.4)
out$accept
out <- metrop(out, nbatch = 100, blen = 100, outfun = outfun)
out$accept
@

We get the best of both worlds.  We don't need \verb@x = x, y = y@
and we don't need a function factory.

But if we change the name of the global variables to say
\texttt{modmat} and \texttt{resp} instead of \texttt{x} and \texttt{y},
then our code breaks.  R function \texttt{lupost} is looking up
global variables \texttt{x} and \texttt{y} \emph{under those names}
not under any other names.  So your code using this method is rigid and brittle
and probably unusable by others.

Note that if we are using either of the other methods, renaming is no problem.
Using dot-dot-dot we do
\begin{verbatim}
out <- metrop(out, scale = 0.1, x = modmat, y = resp)
\end{verbatim}
Using the function factory we do
\begin{verbatim}
lupost <- lupost_factory(modmat, resp)
\end{verbatim}
See Section~7.5.3 of \citet{basic} for more about this.

\subsection{Function Factory}

The terminology ``function factory pattern'' is apparently due to
\citet[Section~10.3.1]{wickham}.  But it is just a special case about
how closures work (in R and all other languages that have them).
Compare
<<fred>>=
fred <- function(y) function(x) x + y
fred(2)(3)
@
\citep[Section~6.5]{basic}) to
<<currying>>=
lupost_factory <- function(x, y) function(beta) {
    eta <- as.numeric(x %*% beta)
    logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(- eta)))
    logq <- ifelse(eta < 0, - log1p(exp(eta)), - eta - log1p(exp(- eta)))
    logl <- sum(logp[y == 1]) + sum(logq[y == 0])
    return(logl - sum(beta^2) / 8)
}
lupost <- lupost_factory(x, y)
lupost(beta.init)
@
We could also do the same calculation treating \verb@lupost_factory@
as just an ordinary curried function, like R function \texttt{fred}
in the preceding example,
<<currying-too>>=
lupost_factory(x, y)(beta.init)
@

So there is nothing mysterious about function factories.  They are just
one particular use of
the essence of functional programming (closures).
If you really understand functional programming, then you must understand this.

Long ago, when I switched from S to R,
I would not have considered that being an ``knowledgeable user'' of R
required knowledge of closures.  Now I do.  This is partly because other
functional languages like Scheme, Javascript, F\#, Clojure, and Haskell
have become
very popular for general computer programming.  So now we want to emphasize
that we can do in R what we can do in them.
But it is also partly because I now understand more about functional
programming.  I can now see that if you really understand closures,
then you don't need most other features of these programming languages
(including R).  Following \citet{crockford} we can say that all programming
languages (including R) have their good parts and their bad parts and we
should use the good features and avoid the bad features.  The best feature
is closures.  Crockford calls this the best idea ever put in a programming
language.  So we should use it a lot.

If I were starting to write the \texttt{mcmc} package today, I might leave
out the dot-dot-dot arguments of R functions \texttt{metrop},
\texttt{morph.metrop}, and \texttt{temper}.  That would mean that
users could not use the dot-dot-dot pattern.  They would be forced to
use the function factory pattern or the global variables pattern.
And if they had accepted that the global variables pattern is evil,
then they would have to use the function factory pattern.
(If you like the dot-dot-dot pattern, don't worry.  It is not going
to be removed from this package.  Backward compatibility trumps everything.)

\begin{thebibliography}{}

\bibitem[C2 Wiki(2013)]{c2}
C2 Wiki Contributors (2013).
\newblock Global Variables Are Bad.
\newblock \url{http://wiki.c2.com/?GlobalVariablesAreBad}.

\bibitem[Crockford(2008)]{crockford}
Crockford, D. (2008).
\newblock \emph{JavaScript: The Good Parts}.
\newblock O'Reilly, Sebastopol CA.

\bibitem[Gelman et al.(1996)Gelman, Roberts, and Gilks]{grg}
Gelman, A., G.~O. Roberts, and W.~R. Gilks (1996).
\newblock Efficient Metropolis jumping rules.
\newblock In \emph{Bayesian Statistics, 5 (Alicante, 1994)}, pp.~599--607.
  Oxford University Press.

\bibitem[Geyer(1992)]{practical}
Geyer, C.~J. (1992).
\newblock Practical Markov chain Monte Carlo (with discussion).
\newblock \emph{Statistical Science}, 7, 473--511.

\bibitem[Geyer(2006)]{bogosity}
Geyer, C. J. (2006).
\newblock On the bogosity of MCMC diagnostics.
\newblock \url{http://users.stat.umn.edu/~geyer/mcmc/diag.html}.

\bibitem[Geyer(2011)]{intro}
Geyer, C. J. (2011).
\newblock Introduction to Markov chain Monte Carlo.
\newblock In \emph{Handbook of Markov Chain Monte Carlo}, edited by
    Brooks, S., Gelman, A., Jones, G., and Meng, X.-L., pp.~3--48.
\newblock Chapman \& Hall/CRC, Boca Raton, FL.

\bibitem[Geyer(2018)]{basic}
Geyer, C.~J. (2018).
\newblock Stat 3701 lecture notes: Basics of R.
\newblock \url{http://www.stat.umn.edu/geyer/3701/notes/basic.html}.

\bibitem[Geyer and Thompson(1995)]{geyer-temp}
Geyer, C.~J. and E.~A. Thompson (1995).
\newblock Annealing Markov chain Monte Carlo with applications to
    ancestral inference.
\newblock \emph{Journal of the American Statistical Association}, 90, 909--920.

\bibitem[Wickham(2014)]{wickham}
Wickham, H. (2014).
\newblock \emph{Advanced R}.
\newblock Chapman \& Hall/CRC, Boca Raton, FL.
\newblock Also available on the web \url{http://adv-r.had.co.nz/}.

\end{thebibliography}

\end{document}
