Assigning a value to a power likelihood in a general Bayesian model

C. C. Holmes & S. G. Walker

Department of Statistics, University of Oxford, OX1 3LB &
Department of Mathematics, University of Texas at Austin, 78705.

[email protected] & [email protected]

January 31, 2017

Abstract
Bayesian approaches to data analysis and machine learning are widespread and popular, as they provide intuitive yet rigorous axioms for learning from data; see Bernardo & Smith (2004) and Bishop (2006). However, this rigour comes with a caveat: that the Bayesian model is a precise reflection of Nature. There has been a recent trend to address potential model misspecification by raising the likelihood function to a power, primarily for robustness reasons, though not exclusively. In this paper we provide a coherent specification of the power parameter once the Bayesian model has been specified in the absence of a perfect model.
1 Introduction
Bayesian inference is one of the most important scientific learning paradigms in use today. Its core principle is the use of probability to quantify all aspects of uncertainty in a statistical model, and then, given data x, to use conditional probability to update uncertainty via Bayes theorem, p(θ|x) ∝ f(x;θ)p(θ), where θ denotes the parameters of the model, p(θ|x) the posterior update, f(x;θ) is the data model, and p(θ) is the prior on the model parameters. Bayesian updating is coherent; see, for example, Lindley (2000).

The justification for Bayesian updating proceeds on the assumption that the form of the data model, f(x;θ), is correct up to the unknown parameter value θ. Bayesian learning is optimal, see Zellner (1988), which means that posterior uncertainty is the appropriate reflection of prior uncertainty and the information provided by the data. However, this is only the case when the model is true, which is at odds with the scientific desire for keeping models simple in order to focus on the essential aspects of the system under investigation.

Recently a number of papers have appeared seeking to address the mismatch and allow for Bayesian learning under model misspecification; the key reason is robustness, and the idea is to raise the likelihood to a power. See, for example, Royall & Tsou (2003), Zhang (2006a), Zhang (2006b), Jiang & Tanner (2008), Bissiri, Holmes & Walker (2016), Walker & Hjort (2001), Watson & Holmes (2016), Miller & Dunson (2015), Grünwald & van Ommen (2014), Syring & Martin (2015), and, more generally, Hansen & Sargent (2008). The paper by Bissiri, Holmes & Walker (2016), in particular, provides a formal motivation, using a coherency principle, for raising the likelihood to a power.

For the formal Bayesian analyst, if f(x;θ) is misspecified, then there is no connection between any θ and any observation from this model, and as a consequence no meaningful prior can be set. In this case, it is argued in Bissiri, Holmes & Walker (2016) that it is preferable to look at −log f(x;θ) as simply a loss function linking θ and observation x. Then a formal general-Bayesian update of prior p(θ)
to posterior p_w(θ|x) exists, and for the update to remain coherent it was shown that it must be of the form
\[
p_w(\theta \mid x) \;\propto\; f(x;\theta)^w\, p(\theta),
\qquad
\log p_w(\theta \mid x) \;=\; w\,\log f(x;\theta) + \log p(\theta) + \log Z_w,
\]
where Z_w is the normalising constant ensuring that the posterior distribution integrates to 1, and w is a weighting parameter calibrating the two loss functions for θ, namely −log p(θ) and −log f(x;θ). In this way, w > 0 controls the learning rate of the generalised-Bayesian update, with w = 1 returning the conventional Bayesian solution. Clearly, for w < 1 the update gives less weight to the data relative to the prior compared to the Bayesian model, resulting in a posterior that is more diffuse, while with w > 1 the data are given more prominence.
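As a concrete illustration of the update (not an example taken from the paper), consider a normal location model f(x;θ) = N(x;θ,1) with a conjugate normal prior, for which the power posterior remains normal and the effect of w on the posterior spread is explicit; the particular numbers below are arbitrary.

```python
import numpy as np

def power_posterior_normal(x, w, prior_mean=0.0, prior_prec=1.0):
    """Power posterior p_w(theta | x_1..x_n), proportional to
    [prod_i N(x_i; theta, 1)]^w * N(theta; prior_mean, 1/prior_prec);
    by conjugacy it is normal, so return its mean and variance."""
    x = np.asarray(x, dtype=float)
    post_prec = prior_prec + w * len(x)      # each observation contributes precision w
    post_mean = (prior_prec * prior_mean + w * x.sum()) / post_prec
    return post_mean, 1.0 / post_prec

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)

for w in (0.5, 1.0, 2.0):
    m, v = power_posterior_normal(x, w)
    print(f"w = {w:3.1f}: posterior mean = {m:.3f}, posterior sd = {np.sqrt(v):.3f}")
```

With w < 1 the posterior precision is smaller than under the standard update, so the posterior is more diffuse, matching the description above; w > 1 has the opposite effect.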
The crucial question then becomes how to set w in a formal manner. One needs to be careful, as learning about w can both be overdone (w set too high, so that posterior uncertainty is underestimated) and underdone (w set too low, so that posterior uncertainty is overestimated). The elegant and attractive nature of Bayesian inference when the model precisely matches Nature is that the learning is achieved optimally; i.e. at the correct speed. See Lindley (2000), Bernardo & Smith (2004) and Zellner (1988).
In this paper we propose to set w, once a proper p(θ) and model f(x;θ) have been specified, by matching the prior expected gain in information between prior and posterior from two potential experiments. For Experiment 1, using p_w(θ|x), we compute an expected information gain between p_w(·|x) and p(·), denoted by I_w(x), to be specified later. For Experiment 2 we consider the corresponding gain in information between posterior p(θ|x) and p(θ), which will be I_1(x). Then we set w so that
\[
\int I_w(x)\, f_0(x)\, dx \;=\; \int I_1(x)\, f(x;\theta_0)\, dx, \tag{1}
\]
where f_0(x) is the true, unknown, density and θ_0 is the true parameter value if the parametric model is correct, or else is the parameter value minimizing the Kullback-Leibler divergence between the true model and the parametric family of densities. So, if the model is correct, then f_0(x) = f(x;θ_0) and w will automatically be 1. The rationale for (1) is coherence; that the expected gain in information for learning about θ_0 from a single sample is the same for both experiments. To elaborate: Experiment 1 assumes the data are not necessarily coming from the parametric model, the likelihood is f(x;θ)^w with prior p(θ), and x ∼ f_0(x). According to Bissiri, Holmes & Walker (2016), p_w(θ|x) is a valid update for learning about the θ which minimizes the Kullback-Leibler divergence between f_0(x) and f(x;θ), i.e. θ_0, and for w > 0 the posterior p_w(·|x) will be consistent for θ_0 for regular models. That this is what is being learnt about follows from Berk (1966). Experiment 2 assumes the data are coming from the parametric model, the likelihood is f(x;θ) with prior p(θ), and x ∼ f(x;θ_0). Both experiments are concerned with learning about the same θ_0. We argue that the experimenter should be a priori indifferent between these two experiments with respect to the prior expected gain in information about θ_0. Thus, w is set so that the prior expected gain in information is the same as that which would have been obtained if the parametric model were correct.
We can evaluate both sides of (1) using the observed data, {x_1,...,x_n}, so the left side and right side of (1) are evaluated as
\[
n^{-1}\sum_{i=1}^{n} I_w(X_i) \quad\text{and}\quad \int I_1(x)\, f(x;\hat\theta)\, dx,
\]
respectively, where θ̂ is the maximum likelihood estimator. See White (1982) for the theory behind θ̂ being the appropriate estimator for θ_0. In the next section we define I_w(x) and in Section 3 we present some illustrations.
2 The prior expected information in an experiment
To quantify the prior expected information of an experiment we utilise the well established notion of Fisher information; see Lehmann & Casella (1998). In particular, we shall consider the expected divergence in Fisher information, F(p_1, p_2), between two density functions p_1 and p_2, with exact form given below; see, for example, Otto & Villani (2000). Motivation for this choice is given in the Appendix.
The Fisher relative information divergence of a posterior update from its prior, with likelihood f(x;θ), is given by
\[
F\{p(\cdot),\,p(\cdot\mid x)\} \;=\; \int p(\theta)\left\{\frac{\nabla p(\theta\mid x)}{p(\theta\mid x)} - \frac{\nabla p(\theta)}{p(\theta)}\right\}^2 d\theta,
\]
where ∇ operates on the d-dimensional θ. This is given by
\[
F\{p(\cdot),\,p(\cdot\mid x)\} \;=\; \int p(\theta)\left\{\frac{\nabla f(x;\theta)}{f(x;\theta)}\right\}^2 d\theta
\;=\; \int p(\theta)\sum_{j=1}^{d}\left\{\frac{\partial}{\partial\theta_j}\log f(x;\theta)\right\}^2 d\theta. \tag{2}
\]
Hence, with likelihood f(x;θ)^w, we have I_w(x) = w²∆(x), where ∆(x) = F{p(·), p(·|x)}.
This leads to
\[
w \;=\; \left\{\frac{\int f(x;\theta_0)\,\Delta(x)\,dx}{\int f_0(x)\,\Delta(x)\,dx}\right\}^{1/2}. \tag{3}
\]
This result also highlights why Fisher information is a convenient measure of information in the experiment, as it leads to an explicit formula for the setting of w.
The actual setting of w via (3) is hindered by the lack of knowledge of f_0 and θ_0. However, an empirical approach follows trivially, since we can estimate f_0(x) with the empirical distribution function of the data and then estimate θ_0 with θ̂, the maximum likelihood estimator. Thus
\[
\hat w \;=\; \left\{\frac{\int f(x;\hat\theta)\,\Delta(x)\,dx}{n^{-1}\sum_{i=1}^{n}\Delta(X_i)}\right\}^{1/2}.
\]
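The estimator ŵ can be computed by simple Monte Carlo whenever draws from the prior and from f(·;θ̂) are available: approximate ∆(x) by averaging the squared score over prior draws, the numerator by averaging ∆ over draws from f(·;θ̂), and the denominator by averaging ∆ over the data. The sketch below is an illustrative recipe of ours, not code from the paper; the helper name, the normal location model and the particular prior used in the example are our own choices.

```python
import numpy as np

def estimate_w(data, score, prior_draws, model_sampler, n_model=5000, rng=None):
    """Monte Carlo version of
        w_hat = sqrt( int f(x; theta_hat) Delta(x) dx  /  n^{-1} sum_i Delta(x_i) ),
    with Delta(x) = int {d/dtheta log f(x; theta)}^2 p(theta) dtheta.

    score(x, thetas): the score d/dtheta log f(x; theta), vectorised over thetas;
    prior_draws:      draws theta_1, ..., theta_M from the prior p(theta);
    model_sampler:    function(m, rng) returning m draws from f(.; theta_hat)."""
    rng = np.random.default_rng() if rng is None else rng

    def Delta(x):
        return np.mean(score(x, prior_draws) ** 2)        # prior average of the squared score

    denom = np.mean([Delta(xi) for xi in data])                          # empirical f_0 side
    numer = np.mean([Delta(xi) for xi in model_sampler(n_model, rng)])   # f(.; theta_hat) side
    return np.sqrt(numer / denom)

# Illustration: N(theta, 1) model, N(0, 3^2) prior, data more spread out than the model allows.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=2.0, size=500)
theta_hat = data.mean()                                   # MLE under the N(theta, 1) model
prior_draws = rng.normal(loc=0.0, scale=3.0, size=2000)
normal_score = lambda x, th: x - th                       # d/dtheta log N(x; theta, 1)
w_hat = estimate_w(data, normal_score, prior_draws,
                   lambda m, g: g.normal(loc=theta_hat, scale=1.0, size=m))
print("w_hat =", w_hat)                                   # below 1 for these overdispersed data
```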
A common simplifying choice of model would be from the class of exponential family;
\[
f(x;\theta) \;=\; \exp\left\{\sum_{j=1}^{M}\theta_j\,\phi_j(x) - b(\theta)\right\},
\]
where the (φ_j(x)) are a set of basis functions and b(θ) is the normalizing constant. Then straightforward calculations yield
\[
w^2 \;=\; \frac{\int\!\!\int \sum_{j=1}^{M}\{\phi_j(x)-b'_j(\theta)\}^2\, f(x;\theta_0)\,p(\theta)\,dx\,d\theta}{\int\!\!\int \sum_{j=1}^{M}\{\phi_j(x)-b'_j(\theta)\}^2\, f_0(x)\,p(\theta)\,dx\,d\theta},
\]
where θ_0 is given by ∫ φ_j(x) f_0(x) dx = b'_j(θ_0) for all j = 1,...,M, and b'_j(θ) = ∂b(θ)/∂θ_j. Hence
\[
\hat w^2 \;=\; \frac{\int\!\!\int \sum_{j=1}^{M}\{\phi_j(x)-b'_j(\theta)\}^2\, f(x;\hat\theta)\,p(\theta)\,dx\,d\theta}{n^{-1}\sum_{i=1}^{n}\int \sum_{j=1}^{M}\{\phi_j(x_i)-b'_j(\theta)\}^2\, p(\theta)\,d\theta}.
\]
In general we have, under the usual assumptions on the model, that θ̂ = θ_0 + O_p(n^{-1/2}), and that ∫ ∆²(x) f_0(x) dx < ∞:

Lemma 2.1. If f(x;θ_0) = f_0(x) then ŵ → 1 in probability as n → ∞.

Proof. If we write γ(θ) = ∫ ∆(x) f(x;θ) dx, then we have γ(θ̂) = γ(θ_0) + O_p(n^{-1/2}). Also, γ_n = n^{-1}∑_{i=1}^{n} ∆(x_i) = γ(θ_0) + O_p(n^{-1/2}), and hence we have the result, since ŵ² = γ(θ̂)/γ_n. □
3 Illustrations
We consider illustrations chosen to highlight the essential features of setting w when the model is an exponential family; specifically, Poisson and normal.
3.1 Poisson model
If the model is Poisson, then for some θ > 0 the mass function for observation X = x is given by f(x;θ) = θ^x e^{−θ}/x! for x = 0,1,2,.... Then to find w we need to evaluate the denominator and numerator in (3),
\[
D \;=\; \sum_{x=0}^{\infty}\Delta(x)\,f_0(x) \quad\text{and}\quad N \;=\; \sum_{x=0}^{\infty}\Delta(x)\,f(x;\theta_0),
\]
where
\[
\Delta(x) \;=\; \int_0^{\infty}\left\{\frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\right\}^2 p(\theta)\,d\theta \;=\; \int_0^{\infty}(x/\theta - 1)^2\, p(\theta)\,d\theta,
\]
and θ_0 maximizes ∑_x f_0(x) log f(x;θ); as f(x;θ) is Poisson, we have θ_0 = µ_0, the expected value under f_0, with σ_0² the variance under f_0. Hence, letting a = ∫_{θ>0} θ^{−2} p(θ) dθ and b = ∫_{θ>0} θ^{−1} p(θ) dθ, we find
\[
D \;=\; a(\mu_0^2+\sigma_0^2) - 2b\mu_0 + 1 \quad\text{and}\quad N \;=\; a(\mu_0^2+\mu_0) - 2b\mu_0 + 1.
\]
Then for the Poisson model fit to data arising from f_0(x) we have w² = N/D and D = a(σ_0² − µ_0) + N.
On inspection of the result we see that when σ_0² > µ_0, where the data are “overdispersed”, we find that w < 1. The idea here is that the data will provide larger than expected observations, from a Poisson model perspective, and unless the observations are down-weighted, inference will appear overly precise. Down-weighting the information in the observations will provide a more stable and practical inference for the unknown parameter. Equally, when the data are underdispersed, Bayesian learning will be adjusted to w > 1, accounting for the increased precision in the data for learning about the parameter θ_0 minimising the relative entropy of the model to the data distribution.
To illustrate the performance we conducted the following experiment. We took n = 1000 observations from an overdispersed model, so X given φ is Poisson with mean φ and φ is from the gamma distribution with mean 3.33 and variance 11.11. Thus the variance of the data is 14.44 while the mean of the data is 3.33, so there is a substantial amount of overdispersion. The prior for θ in the Poisson f(x;θ) model was taken to be gamma with mean 3 and variance 3. For this experiment we then computed ŵ using the sample mean (x̄) and sample variance (S²); D̂ = a(x̄² + S²) − 2bx̄ + 1 and N̂ = a(x̄² + x̄) − 2bx̄ + 1. Thus
\[
\hat w^2 \;=\; \frac{a(\bar x^2 + \bar x) - 2b\bar x + 1}{a(\bar x^2 + S^2) - 2b\bar x + 1}.
\]
We plot ŵ against sample size in Fig 1, and note that essentially ŵ < 1, with convergence to a number lower than 1.
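A short sketch of this type of calculation is given below. It simulates overdispersed Poisson–gamma data as just described and evaluates ŵ from the closed-form expressions for D̂ and N̂; a gamma prior with mean 3 and variance 3 corresponds to shape 3 and rate 1, for which a = E(θ⁻²) = 1/2 and b = E(θ⁻¹) = 1/2. The random seed is our own choice, so the numbers will only roughly track Fig 1.

```python
import numpy as np

# Prior: gamma with mean 3 and variance 3, i.e. shape alpha = 3 and rate beta = 1, for which
# a = E(theta^-2) = beta^2/((alpha-1)(alpha-2)) and b = E(theta^-1) = beta/(alpha-1).
alpha, beta = 3.0, 1.0
a = beta**2 / ((alpha - 1.0) * (alpha - 2.0))   # = 0.5
b = beta / (alpha - 1.0)                        # = 0.5

def w_hat_poisson(x):
    """w_hat = sqrt(N_hat / D_hat) with N_hat = a(xbar^2 + xbar) - 2 b xbar + 1 and
    D_hat = a(xbar^2 + S^2) - 2 b xbar + 1, as in the display above."""
    xbar, s2 = x.mean(), x.var(ddof=1)
    n_hat = a * (xbar**2 + xbar) - 2.0 * b * xbar + 1.0
    d_hat = a * (xbar**2 + s2) - 2.0 * b * xbar + 1.0
    return np.sqrt(n_hat / d_hat)

# Overdispersed data: X | phi ~ Poisson(phi) with phi gamma distributed, mean 3.33, variance 11.11.
rng = np.random.default_rng(0)
phi = rng.gamma(shape=3.33**2 / 11.11, scale=11.11 / 3.33, size=1000)
x = rng.poisson(lam=phi)

for n in (50, 200, 1000):
    print(n, round(float(w_hat_poisson(x[:n])), 3))     # settles below 1, as in Fig 1
```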
On the other hand, if the model were true (the so-called M-closed perspective in Bernardo & Smith (2004)), then S² − x̄ → 0 and hence ŵ² → 1. Moreover, using standard asymptotic, large sample size n, properties of models and estimators, we have that 1 − ŵ² → 0 at a speed of n^{-1/2}.
3.2 Exponential family
We provide some further analysis of the general case for the exponential family based on f(x;θ) = c(x) exp{θx − b(θ)}. Then, following the same strategy as in the previous sub-section, and using (3), where now ∆(x) = ∫{x − b'(θ)}² p(θ) dθ and b'(θ_0) = ∫ x f_0(x) dx = ∫ x f(x;θ_0) dx, we can show
\[
w^2 \;=\; \frac{b''(\theta_0) + \int\{b'(\theta_0) - b'(\theta)\}^2\, p(\theta)\,d\theta}{\sigma_0^2 + \int\{b'(\theta_0) - b'(\theta)\}^2\, p(\theta)\,d\theta},
\]
which is estimated via
\[
\hat w^2 \;=\; \frac{b''(\hat\theta) + \int\{\bar X - b'(\theta)\}^2\, p(\theta)\,d\theta}{S^2 + \int\{\bar X - b'(\theta)\}^2\, p(\theta)\,d\theta}.
\]
Thus, whether w converges to 1 or not depends on how the sample variance S² compares with the variance estimator from the model, namely b''(θ̂). Even in the case of regression models, the basic idea is the same when ∆(·) is quadratic, as it would be, for example, for a normal linear regression model.
3.3 Normal model
Here we consider a normal model with unknown mean θ and variance 1. The prior for θ is normal with mean 0 and precision parameter λ. The aim here is to compare our selection of w with an alternative using the Kullback-Leibler divergence; i.e. to set w based on matching
\[
\int D\{p_w(\cdot\mid x),\,p(\cdot)\}\, dF_n(x) \;=\; \int D\{p(\cdot\mid x),\,p(\cdot)\}\, f(x;\hat\theta)\, dx,
\]
where D(q,p) = ∫ q log(q/p). Although there is no closed form solution for w here, we can evaluate it numerically.
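Since both p_w(·|x) and p(·) are normal here, D{p_w(·|x), p(·)} is available in closed form and the matching equation can be solved for w with a one-dimensional root-finder. The sketch below is our own illustration of that numerical step, not the exact computation behind Fig 2; the simulated data and seed are arbitrary choices.

```python
import numpy as np
from scipy.optimize import brentq

lam = 0.01   # prior precision: theta ~ N(0, 1/lam), matching the setting described below

def kl_power_posterior_to_prior(w, x2):
    """D{p_w(.|x), p(.)} for the N(theta,1) model with N(0, 1/lam) prior, where
    p_w(theta|x) = N(w x/(lam+w), 1/(lam+w)); x2 stands in for x^2 (or its average,
    since the divergence is linear in x^2 when averaged over observations)."""
    return 0.5 * (np.log((lam + w) / lam) + lam / (lam + w)
                  + lam * w**2 * x2 / (lam + w)**2 - 1.0)

def w_kl(x):
    """Solve int D{p_w(.|x),p(.)} dF_n(x) = int D{p(.|x),p(.)} f(x; theta_hat) dx for w."""
    theta_hat = x.mean()
    lhs = lambda w: kl_power_posterior_to_prior(w, np.mean(x**2))      # empirical side
    rhs = kl_power_posterior_to_prior(1.0, theta_hat**2 + 1.0)         # E[x^2] under f(.;theta_hat)
    return brentq(lambda w: lhs(w) - rhs, 1e-6, 1e3)

rng = np.random.default_rng(2)
for prec, label in [(0.2, "overdispersed"), (4.0, "underdispersed")]:
    x = rng.normal(loc=0.0, scale=1.0 / np.sqrt(prec), size=50)
    print(f"{label:>15}: KL-matched w = {w_kl(x):.2f}")
```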
First we considered the overdispersed case, and so generated 50 observations from a normal distribution with precision 0.2, using the prior for θ with mean 0 and precision 0.01. Then we looked at the underdispersed case and generated 50 observations from a normal distribution with precision 4, again using the prior for θ with mean 0 and precision 0.01.
In Fig 2, on the left side, we plot three posterior distributions: blue is the posterior using the w from our Fisher information distance; red is the posterior using the w obtained from the Kullback-Leibler divergence; and green is the correct posterior had the model been used with the correct precision parameter of 0.2.

On the right side of Fig 2 we again plot three posterior distributions: blue is the posterior using the w from our Fisher information distance; red is the posterior using the w obtained from the Kullback-Leibler divergence; and green is the correct posterior had the model been used with the correct precision parameter of 4. In both cases we see that our posterior is closer to the posterior based on the correct model; i.e. replacing 1 with the precisions 0.2 and 4, respectively.
4 Discussion
It can be argued that all models are misspecified. Under such a scenario there is no formal connection between any observed x and any θ when looking at f(x;θ) as a density function. On the other hand, when f(x;θ) is viewed through the loss function −log f(x;θ), and we are learning about θ_0 = argmin_{θ∈Θ} ∫ −log f(x;θ) f_0(x) dx, we can interpret the correspondence between x and the object of inference θ. However, as pointed out in Bissiri, Holmes & Walker (2016), in this setting there is a free parameter w introduced by the model misspecification. In this paper we have introduced principles for the specification of w which provide an a priori coherent agenda in terms of the prior expected gain in information about θ_0.
Appendix: Motivation for Fisher information distance
As shown in Walker (2016), the expected (with respect to the prior predictive) Fisher information distance between prior and posterior is given by
\[
\int \bar p(x)\, F(p(\cdot\mid x),\,p(\cdot))\, dx \;=\; \int J(\theta)\, p(\theta)\, d\theta \;=\; E\{J(\Theta\mid X)\} - J(\Theta), \tag{4}
\]
where J(θ) is the Fisher information for θ, p̄(x) is the prior predictive p̄(x) = ∫ f(x;θ) p(θ) dθ, and J(Θ) = ∫ p'(θ)²/p(θ) dθ is known as the Fisher information for the density p(θ), while J(Θ|X) is the Fisher information for the posterior given X. So it has similar properties to the Kullback-Leibler divergence, which relies on the expected differential entropy between prior and posterior.
However, instead of using
\[
F\{p(\cdot\mid x),\,p(\cdot)\} \;=\; \int p(\theta\mid x)\left\{\frac{\partial}{\partial\theta}\log\frac{p(\theta\mid x)}{p(\theta)}\right\}^2 d\theta
\]
to get (4), we use
\[
F\{p(\cdot),\,p(\cdot\mid x)\} \;=\; \int p(\theta)\left\{\frac{\partial}{\partial\theta}\log\frac{p(\theta\mid x)}{p(\theta)}\right\}^2 d\theta.
\]
For the former is suited to the idealized setting of a correct model, whereas we are trying to evaluate the prior and posterior discrepancy, i.e.
\[
\left\{\frac{p'(\theta\mid x)}{p(\theta\mid x)} - \frac{p'(\theta)}{p(\theta)}\right\}^2 \;=\; \left\{\frac{\partial}{\partial\theta}\log f(x;\theta)\right\}^2 \;=\; S^2(x,\theta),
\]
where S(x,θ) is the usual score function, with respect to prior beliefs, for it is only the prior beliefs we assume common to both experimenters; i.e. the one using I_1 and the one using I_w.
We can elaborate further: the prior expected Fisher information, i.e. E_{p(θ)}{J(θ)}, is
\[
\int J(\theta)\, p(\theta)\, d\theta \;=\; \int F\{p(\cdot\mid x),\,p(\cdot)\}\, \bar p(x)\, dx \;=\; \int\!\!\int S^2(x,\theta)\, p(\theta)\, f(x;\theta)\, d\theta\, dx.
\]
This would be the expected information in a single sample, as an expected discrepancy between prior and posterior. However, this expected Fisher information is provided under the idealized setting that the joint density of (x,θ) for the expectation of S²(x,θ) is p(θ)f(x;θ). It would be unrealistic for us to assume the marginal density for x is p̄(x), even for the Bayesian assuming f(x;θ) is correct. A more realistic estimation of the expected squared score function, i.e. the information in a single sample, would be to use the empirically determined joint density p(θ)f(x;θ̂).
For the Bayesian using f(x;θ)^w, the score function is S_w(x,θ) = wS(x,θ), and so this Bayesian would estimate the information using the product measure of the prior and the empirical distribution function, F_n(x), since this Bayesian is assuming the model is incorrect. Matching these two forms of information from a single sample, and about the same parameter, we have
\[
\int\!\!\int S_w^2(x,\theta)\, p(\theta)\, d\theta\, dF_n(x) \;=\; \int\!\!\int S^2(x,\theta)\, p(\theta)\, f(x;\hat\theta)\, d\theta\, dx,
\]
where the term on the left is given by w² ∫∫ S²(x,θ) p(θ) dθ dF_n(x), and recall that ∫ S²(x,θ) p(θ) dθ = F{p(·), p(·|x)}. In short, we are using the square of the score function as a measure of information in a single sample, which also has the interpretation in terms of Fisher distance between prior and posterior.
Acknowledgements
The authors are grateful to two anonymous referees and an Associate Editor for comments and suggestions on a previous version of the paper.
References
BERK, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics 37, 51-58.

BERNARDO, J. M. & SMITH, A. F. M. (2004). Bayesian Theory. IOP Publishing.

BISHOP, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

BISSIRI, P. G., HOLMES, C. C. & WALKER, S. G. (2016). A general framework for updating belief distributions. To appear in J. Roy. Statist. Soc. Ser. B.

GRÜNWALD, P. & VAN OMMEN, T. (2014). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. arXiv preprint arXiv:1412.3730.

HANSEN, L. P. & SARGENT, T. J. (2008). Robustness. Princeton University Press.

JIANG, W. & TANNER, M. A. (2008). Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist. 36, 2207-2231.

LEHMANN, E. L. & CASELLA, G. (1998). Theory of Point Estimation (Springer Texts in Statistics). Springer.

LINDLEY, D. V. (2000). The philosophy of statistics. Journal of the Royal Statistical Society, Series D 49, 293-337.

MILLER, J. & DUNSON, S. (2015). Robust Bayesian inference via coarsening. arXiv:1506.06101.

OTTO, F. & VILLANI, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis 173, 361-400.

ROYALL, R. & TSOU, T. S. (2003). Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. Journal of the Royal Statistical Society, Series B 65, 391-404.

SYRING, N. & MARTIN, R. (2015). Scaling the Gibbs posterior credible regions. arXiv:1509.00922v1.

WATSON, J. & HOLMES, C. C. (2016). Approximate Models and Robust Decisions. Statist. Sci. 31, 465-489.

WALKER, S. G. (2016). Bayesian information in an experiment and the Fisher information distance. Statistics and Probability Letters 112, 5-9.

WALKER, S. G. & HJORT, N. L. (2001). On Bayesian consistency. Journal of the Royal Statistical Society, Series B 63, 811-821.

WHITE, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.

ZELLNER, A. (1988). Optimal information processing and Bayes's theorem. The American Statistician 42, 278-280.
ZHANG, T. (2006a). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34, 2180-2210.

ZHANG, T. (2006b). Information theoretical upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory 52, 1307-1321.
Figure 1: Plot of ŵ against sample size: overdispersed case, Poisson example.

Figure 2: Posterior distributions in the overdispersed case (left figure) and the underdispersed case (right figure) for the normal example: posterior based on the Fisher distance w in blue; posterior based on the Kullback-Leibler w in red; and the true posterior using the correct model from which the data are generated in green.