2 Introduction: Credibility, Models, and Parameters

2.1 Bayesian inference is reallocation of credibility across possibilities

To make Figure 2.1 we need data. We will create some synthetic data and store them in a DataFrame using DataFrames.jl.

Here we define a function that creates every combination of the supplied values (an analogue of R's expand.grid()), and then we bind those combinations to the Credibility values we will use for plotting.

using DataFrames

function expand_grid(; iters...)
    var_names = collect(keys(iters))
    var_itr = [1:length(x) for x in iters.data]
    var_ix = vcat([collect(x)' for x in Iterators.product(var_itr...)]...)
    out = DataFrame()
    for i in 1:length(var_names)
        out[:, var_names[i]] = collect(iters[i])[var_ix[:, i]]
    end
    return out
end

d = expand_grid(iteration = 1:3, Possibilities = ["A", "B", "C", "D"], stage = ["Prior", "Posterior"])
d2 = DataFrame(Credibility = [fill(.25, 4); 0; fill(1/3, 3); 0; fill(1/3, 3); 0; .5; 0; 0.5; 0; .5; 0; 0.5; fill(0, 3); 1])
sort!(d, [:iteration])
d.Credibility = d2.Credibility

We can take a look at the top few rows of the data with the first() function.

first(d, 5)
5 rows × 4 columns

|   | iteration | Possibilities | stage     | Credibility |
|---|-----------|---------------|-----------|-------------|
|   | Int64     | String        | String    | Float64     |
| 1 | 1         | A             | Prior     | 0.25        |
| 2 | 1         | B             | Prior     | 0.25        |
| 3 | 1         | C             | Prior     | 0.25        |
| 4 | 1         | D             | Prior     | 0.25        |
| 5 | 1         | A             | Posterior | 0.0         |
Now let's plot our version of Figure 2.1. To do so, we will use the Gadfly.jl package. It is my preferred plotting package because its syntax is similar to that of the ggplot2 package in R.

TODO: maybe add the annotation using something like this: https://stackoverflow.com/questions/50925450/gadfly-julia-how-to-add-text-annotation-to-a-plot
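A minimal sketch of how such a plot could be drawn from d with Gadfly's Geom.subplot_grid(); the grouping and theming choices here are assumptions, not a fixed recipe.

using Gadfly

# Sketch of Figure 2.1: bars of Credibility over Possibilities,
# faceted into columns by stage and rows by iteration.
plot(d,
    xgroup = :stage, ygroup = :iteration,
    x = :Possibilities, y = :Credibility,
    Geom.subplot_grid(Geom.bar),
    Guide.xlabel("Possibilities by stage"),
    Guide.ylabel("Credibility by iteration"))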
2.1.1 Data are noisy and inferences are probabilistic.
The next plot involves creating some Gaussian distributions. To do so, we will use the Distributions.jl package. Distributions.jl lets us create, and extract information from, a wide range of probability distributions. We will use this package a lot in subsequent chapters to draw the samples that facilitate Bayesian inference. Note that I am ignoring the scaling issue apparent from the four distributions' densities not summing to one. Also, to create the illusion of a discrete x axis, I will show only the 4 values.

TODO: rescale the curves such that the density on the y axis for each curve sums to one.
using Gadfly
using Distributions
using DataFrames
set_default_plot_size(16cm, 18cm)
ticks = [1, 2, 3, 4] # specify the tick marks

# create some data for the bars
x5 = DataFrame(Possibilities = 1, Credibility = pdf(Normal(1, 1.2), 1))
x6 = DataFrame(Possibilities = 2, Credibility = pdf(Normal(2, 1.2), 2))
x7 = DataFrame(Possibilities = 3, Credibility = pdf(Normal(3, 1.2), 3))
x8 = DataFrame(Possibilities = 4, Credibility = pdf(Normal(4, 1.2), 4))
plot(layer(x5, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x6, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x7, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x8, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x->pdf(Normal(1, 1.2), x), -6, 10, color=[colorant"black"], order=1),
layer(x->pdf(Normal(2, 1.2), x), -6, 10, color=[colorant"black"],order=1),
layer(x->pdf(Normal(3, 1.2), x), -6, 10, color=[colorant"black"],order=1),
layer(x->pdf(Normal(4, 1.2), x), -6, 10, color=[colorant"black"],order=1),
    Guide.xticks(ticks = ticks),
    Guide.xlabel("Possibilities"),
    Guide.ylabel("Credibility"),
    Guide.title("Prior"),
    Theme(
        background_color = "white",
        grid_color = "white",
        bar_spacing = 1mm,
        key_position = :none))
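One possible way to tackle the rescaling TODO above is to normalize the bar heights so the four credibilities sum to one; this is just a sketch of the idea, not code from the original.

# Sketch: rescale so the four Credibility values sum to one
# (reuses the x5 through x8 DataFrames defined above).
total = x5.Credibility[1] + x6.Credibility[1] + x7.Credibility[1] + x8.Credibility[1]
for df in (x5, x6, x7, x8)
    df.Credibility ./= total
end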
Now we can use the same method to create the bottom panel of Figure 2.3. Here I will just hide the y axis. What is important to note is how the allocation of credibility changes based on our three observations (shown in red).
ticks = [1, 2, 3, 4] # specify the tick marks

# create some data for the bars
x5 = DataFrame(Possibilities = 1, Credibility = pdf(Normal(1, 8), 1))
x6 = DataFrame(Possibilities = 2, Credibility = pdf(Normal(2, 3), 2))
x7 = DataFrame(Possibilities = 3, Credibility = pdf(Normal(3, 5), 3))
x8 = DataFrame(Possibilities = 4, Credibility = pdf(Normal(4, 20), 4))
x9 = DataFrame(Possibilities = [1.75, 2.25, 2.75], Credibility = zeros(3))
plot(layer(x5, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x6, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x7, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x8, x = :Possibilities, y = :Credibility, Geom.bar(orientation = :vertical)),
layer(x->pdf(Normal(1, 8), x), -6, 10, color=[colorant"black"], order=1),
layer(x->pdf(Normal(2, 3), x), -6, 10, color=[colorant"black"],order=1),
layer(x->pdf(Normal(3, 5), x), -6, 10, color=[colorant"black"],order=1),
layer(x->pdf(Normal(4, 20), x), -6, 10, color=[colorant"black"],order=1),
layer(x9, x = :Possibilities, y = :Credibility, Geom.point, color=[colorant"red"],size =[.2], order=2), #add data
    Guide.xticks(ticks = ticks),
    Guide.yticks(ticks = []),
    Guide.xlabel("Possibilities"),
    Guide.ylabel("Credibility"),
    Guide.title("Posterior"),
    Coord.Cartesian(ymin = 0, ymax = 0.3),
    Theme(
        background_color = "white",
        grid_color = "white",
        bar_spacing = 1mm,
        key_position = :none))
2.2 Possibilities are parameter values in descriptive models
In the last section, we used the Normal() function to make curves that follow the Gaussian distribution. Now we will showcase the power and flexibility of the Distributions.jl package by assigning a distribution to a name and drawing random samples from that object. By doing so we can simulate data from the specified normal distribution.
using Random
# set the seed to make the simulation reproducible
Random.seed!(2022)
# assign `d` a `Normal()` with μ = 10 and σ = 5
d = Normal(10, 5)
# simulate the data with `rand()` from the specified distribution
simulated_data = DataFrame(x = rand(d, 2000))
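Since we know the parameters we simulated from, a quick sanity check is to compare the sample moments against μ = 10 and σ = 5. This check is my addition; the exact values will vary with the seed.

# sample mean and standard deviation should be close to 10 and 5
mean(simulated_data.x), std(simulated_data.x)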
# plot it
# Figure 2.4.a.
plot(layer(simulated_data, x = :x, Geom.histogram(density = true)),
    layer(x -> pdf(Normal(10, 5), x), -8, 27, color = [colorant"black"], order = 1),
    Guide.xlabel("Data Values"),
    Guide.ylabel("Data Probability"),
    Guide.title("The candidate normal distribution\nhas a μ of 10 and σ of 5."),
    Theme(
        background_color = "white",
        grid_color = "white",
        line_width = 1mm))
# Figure 2.4.b.
plot(layer(simulated_data, x = :x, Geom.histogram(density = true)),
    layer(x -> pdf(Normal(8, 6), x), -8, 27, color = [colorant"black"], order = 1),
    Guide.xlabel("Data Values"),
    Guide.ylabel("Data Probability"),
    Guide.title("The candidate normal distribution\nhas a μ of 8 and σ of 6."),
    Theme(
        background_color = "white",
        grid_color = "white",
        line_style = [:dash],
        line_width = 1mm))
2.3 The steps of Bayesian data analysis
In general, Bayesian analysis of data follows these steps:
- Identify the data relevant to the research questions. What are the measurement scales of the data? Which data variables are to be predicted, and which data variables are supposed to act as predictors?
- Define a descriptive model for the relevant data. The mathematical form and its parameters should be meaningful and appropriate to the theoretical purposes of the analysis.
- Specify a prior distribution on the parameters. The prior must pass muster with the audience of the analysis, such as skeptical scientists.
- Use Bayesian inference to re-allocate credibility across parameter values. Interpret the posterior distribution with respect to theoretically meaningful issues (assuming that the model is a reasonable description of the data; see next step).
- Check that the posterior predictions mimic the data with reasonable accuracy (i.e., conduct a “posterior predictive check”). If not, then consider a different descriptive model.
Perhaps the best way to explain these steps is with a realistic example of Bayesian data analysis. The discussion that follows is abbreviated for purposes of this introductory chapter, with many technical details suppressed. (p. 25)
I will showcase the entire workflow here. In later chapters we’ll cover this workflow in much more detail.
First we need to generate the data and fit a model to those data. In the HtWtDataGenerator.R script, Kruschke provided the code for a function that will simulate height/weight data. Let's rewrite that function in Julia:
using Distributions, Random, DataFrames, StatsBase

function HtWtDataGenerator(nSubj, rndsd = nothing, maleProb = 0.50)
    # Random height, weight generator for males and females. Uses parameters from
    # Brainard, J. & Burmaster, D. E. (1992). Bivariate distributions for height and
    # weight of men and women in the United States. Risk Analysis, 12(2), 267-275.
    # Kruschke, J. K. (2011). Doing Bayesian data analysis:
    # A Tutorial with R and BUGS. Academic Press / Elsevier.
    # Kruschke, J. K. (2014). Doing Bayesian data analysis, 2nd Edition:
    # A Tutorial with R, JAGS and Stan. Academic Press / Elsevier.

    # Specify parameters of multivariate normal (MVN) distributions.
    # Men:
    HtMmu = 69.18
    HtMsd = 2.87
    lnWtMmu = 5.14
    lnWtMsd = 0.17
    Mrho = 0.42
    Mmean = [HtMmu, lnWtMmu]
    Msigma = [HtMsd^2 Mrho * HtMsd * lnWtMsd; Mrho * HtMsd * lnWtMsd lnWtMsd^2]

    # Women cluster 1:
    HtFmu1 = 63.11
    HtFsd1 = 2.76
    lnWtFmu1 = 5.06
    lnWtFsd1 = 0.24
    Frho1 = 0.41
    prop1 = 0.46
    Fmean1 = [HtFmu1, lnWtFmu1]
    Fsigma1 = [HtFsd1^2 Frho1 * HtFsd1 * lnWtFsd1; Frho1 * HtFsd1 * lnWtFsd1 lnWtFsd1^2]

    # Women cluster 2:
    HtFmu2 = 64.36
    HtFsd2 = 2.49
    lnWtFmu2 = 4.86
    lnWtFsd2 = 0.14
    Frho2 = 0.44
    prop2 = 1 - prop1
    Fmean2 = [HtFmu2, lnWtFmu2]
    Fsigma2 = [HtFsd2^2 Frho2 * HtFsd2 * lnWtFsd2; Frho2 * HtFsd2 * lnWtFsd2 lnWtFsd2^2]

    # Randomly generate data values from those MVN distributions.
    if rndsd !== nothing
        Random.seed!(rndsd)
    end

    datamatrix = DataFrame(zeros(nSubj, 3), ["male", "height", "weight"])
    # arbitrary coding values
    maleval = 1
    femaleval = 0

    for i in 1:nSubj
        # Flip coin to decide sex
        sex = wsample([maleval, femaleval], [maleProb, 1 - maleProb], 1)

        if sex[1] == maleval
            datum = rand(MvNormal(Mmean, Msigma))
        elseif sex[1] == femaleval
            Fclust = wsample([1, 2], [prop1, prop2], 1)
            if Fclust[1] == 1
                datum = rand(MvNormal(Fmean1, Fsigma1))
            else
                datum = rand(MvNormal(Fmean2, Fsigma2))
            end
        end

        datum[2] = exp(datum[2])
        datamatrix[i, :] = [sex; round.(datum, digits = 1)]
    end

    return datamatrix
end
HtWtDataGenerator (generic function with 3 methods)
The HtWtDataGenerator() function has three arguments: nSubj determines how many values to generate, rndsd optionally sets the random seed, and maleProb determines how probable we want those values to be from men.
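One way to produce the preview below is a call like this; the subject count and seed here are assumptions chosen just for illustration.

# assumed call: 57 subjects, arbitrary seed for reproducibility
d = HtWtDataGenerator(57, 57)
first(d, 5)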
5 rows × 3 columns

|   | male    | height  | weight  |
|---|---------|---------|---------|
|   | Float64 | Float64 | Float64 |
| 1 | 1.0     | 66.0    | 141.2   |
| 2 | 0.0     | 60.4    | 104.3   |
| 3 | 0.0     | 62.7    | 139.4   |
| 4 | 0.0     | 61.7    | 109.9   |
| 5 | 1.0     | 64.4    | 162.5   |
We’re about ready for the model, which we will fit with the Hamiltonian Monte Carlo (HMC) method via the Turing.jl package. We’ll introduce Turing.jl more fully in Chapter 8. I also recommend you check out the following resources:
- https://turing.ml/v0.21/tutorials/00-introduction/
- http://hakank.org/julia/turing/
- https://storopoli.github.io/Bayesian-Julia/
Following Solomon Kurz’s example, we’ll use weakly-regularizing priors for the intercept and slope, and a half-Cauchy with a fairly large scale parameter for \(\sigma\).
using Turing
using Optim
using MCMCChains, Plots, StatsPlots, Gadfly
using AbstractMCMC
# Define the model: linear regression.
@model function linear_regression(x, y)
    n = length(y)

    # Set the variance prior.
    σ ~ Truncated(Cauchy(0, 10), 0, Inf)
    # Set the intercept prior.
    α ~ Normal(0, 100)
    # Set the prior for beta.
    β ~ Normal(0, 100)

    for i in 1:n
        y[i] ~ Normal(α + β * x[i], σ)
    end
    return y
end
# to be concrete, we assign the values to x and y
x = d.height
y = d.weight

model = linear_regression(x, y)

# Run sampler, collect results
chain_lin_reg = sample(model, NUTS(500, 0.65), MCMCDistributed(), 2_000, 4);
If you wanted a model summary, you could execute display(chain_lin_reg). For more detail, see Chapter 8. Here’s Figure 2.5.a.
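If you prefer numeric tables to the full display, MCMCChains also offers more targeted summaries; both of these helpers appear later in this chapter.

summarystats(chain_lin_reg)  # posterior means, standard deviations, ESS
quantile(chain_lin_reg)      # posterior quantiles for σ, α, and β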
# store in a DataFrame
plot_df = DataFrame(sample(chain_lin_reg[:, :, :], 57))
plot_df.x = x
plot_df.y = y

# plot
abline = Geom.abline(color = "red", style = :dash)
Gadfly.plot(plot_df,
    x = :x, y = :y, Geom.point, color = [colorant"black"], intercept = :α, slope = :β, abline,
    Guide.xlabel("Height in inches"),
    Guide.ylabel("Weight in pounds"),
    Guide.title("Data with 57 credible regression lines"),
    Theme(
        background_color = "white",
        grid_color = "white",
        line_style = [:dash],
        line_width = .1mm))
For Figure 2.5.b., we’ll mark off the mode and 95% highest density interval (HDI).
df = DataFrame(chain_lin_reg)

plt_HDI = DataFrame(quantile(group(chain_lin_reg, :β)))
plt_HDI.β .= mode(chain_lin_reg[:, :β, :])

plot(layer(df, x = :β, Geom.histogram(bincount = 50)),
    layer(plt_HDI, x = :β, xmin = "2.5%", xmax = "97.5%", Geom.point, Geom.errorbar, color = [colorant"red"], size = [.1], order = 1),
    Guide.xlabel("β (slope)"),
    Guide.title("The posterior distribution\nThe mode and 95% HDI intervals are\nthe dot and horizontal line at the bottom."),
    Guide.yticks(ticks = []),
    Theme(
        background_color = "white",
        grid_color = "white",
        bar_spacing = 0.25mm,
        line_width = 1mm))
To make Figure 2.6, we use the predict()
function. First we create a model object with missing values for our dependent variable, then we will predict those values using our posterior samples.
# extract the posterior draws
model_pred = linear_regression(x, Vector{Missing}(missing, length(y)))
posterior_check = predict(model_pred, chain_lin_reg)

# get the summary stats for plotting
posterior_res = summarystats(posterior_check)
# create a DataFrame
plot_df = DataFrame(quantile(posterior_check))
plot_df.x = x
plot_df.y = y
plot_df.predictions = posterior_res[:, :mean]
# plot
Gadfly.plot(
    layer(plot_df, x = :x, y = :y, Geom.point, color = [colorant"black"]),
    layer(plot_df, x = :x, y = :predictions, ymin = "2.5%", ymax = "97.5%", Geom.point, Geom.errorbar, color = [colorant"grey"]),
    Guide.xlabel("Height in inches"),
    Guide.ylabel("Weight in pounds"),
    Guide.title("Data with the percentile-based 95% intervals and\nthe means of the posterior predictions"),
    Theme(
        background_color = "white",
        grid_color = "white",
        line_width = .1mm))
The posterior predictions might be easier to depict with a ribbon and line, instead. Luckily Gadfly.jl
makes it really easy to change.
Gadfly.plot(
    layer(plot_df, x = :x, y = :y, Geom.point, color = [colorant"black"]),
    layer(plot_df, x = :x, y = :predictions, ymin = "2.5%", ymax = "97.5%", Geom.line, Geom.ribbon, color = [colorant"red"]),
    Guide.xlabel("Height in inches"),
    Guide.ylabel("Weight in pounds"),
    Guide.title("Data with the percentile-based 95% intervals and\nthe means of the posterior predictions"),
    Theme(
        background_color = "white",
        grid_color = "white",
        line_width = .2mm))
“We have seen the five steps of Bayesian analysis in a fairly realistic example. This book explains how to do this sort of analysis for many different applications and types of descriptive models” (p. 30).
That’s it! We can now start digging into the details that make up each of the steps outlined in this introductory chapter. Learning Bayesian inference is a rewarding experience. I wish you the best of luck on this exciting journey!
2.4 Session info
using Pkg
Pkg.status()
Status `~/.julia/environments/v1.7/Project.toml`
[80f14c24] AbstractMCMC v4.1.3
[c7e460c6] ArgParse v1.1.4
[c52e3926] Atom v0.12.38
[76274a88] Bijectors v0.10.6
[e28b5b4c] Bootstrap v2.3.3
[336ed68f] CSV v0.10.4
[a81c6b42] Compose v0.9.4
[a93c6f00] DataFrames v1.3.4
[1313f7d8] DataFramesMeta v0.12.0
[864edb3b] DataStructures v0.18.13
[31c24e10] Distributions v0.25.70
[c91e804a] Gadfly v1.3.4
[c27321d9] Glob v1.3.0
[7073ff75] IJulia v1.23.3
[e5e0dc1b] Juno v0.8.4
[c7f686f2] MCMCChains v5.3.1
[c03570c3] Memoize v0.4.4
[a15396b6] OnlineStats v1.5.14
[429524aa] Optim v1.7.2
[d96e819e] Parameters v0.12.3
[91a5bcdd] Plots v1.31.7
[1fd47b50] QuadGK v2.5.0
[295af30f] Revise v3.4.0
[ed01d8cd] Sobol v1.5.0
[03a91e81] SplitApplyCombine v1.2.2
[90137ffa] StaticArrays v1.5.6
[2913bbd2] StatsBase v0.33.21
[4c63d2b9] StatsFuns v1.0.1
[f3b207a7] StatsPlots v0.14.34
[fd094767] Suppressor v0.2.1
[bd369af6] Tables v1.7.0
[fce5fe82] Turing v0.21.12
[9d95f2ec] TypedTables v1.4.1
[9a3f8284] Random