Table Of ContentBIOINFORMATICS Vol.00no.002007
Pages1–8
Analysis of a Gibbs sampler method for model based
clustering of gene expression data
Anagha Joshia,b, Yves Van de Peera,b, Tom Michoela,b
∗
aDepartmentofPlantSystemsBiology,VIB,Technologiepark927,9052Gent,Belgium,
bDepartmentofMolecularGenetics,UGent,Technologiepark927,9052Gent,Belgium
8
ABSTRACT A variety of heuristic clustering methods have been used, such
0
Motivation:Overthelastdecade,alargevarietyofclusteringalgo- as hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie
0
rithms have been developed to detect coregulatory relationships et al., 1999), or self-organizing maps (Tamayo et al., 1999). Alt-
2
among genes from microarray gene expression data. Model based houghthesemethodshavehadanenormousimpact,theirstatistical
n clustering approaches have emerged as statistically well grounded propertiesaregenerallynotwellunderstoodandimportantparame-
a
methods, but the properties of these algorithms when applied to terssuchasthenumberofclustersarenotdeterminedautomatically.
J
large-scale data sets are not always well understood. An in-depth Therefore, therehasbeenashiftinattentiontowardsmodel based
4
analysiscanrevealimportantinsightsabouttheperformanceofthe clustering approaches in recent years (Yeung et al., 2001; Fraley
1
algorithm,theexpectedqualityoftheoutputclusters,andthepossi- andRaftery,2002;MedvedovicandSivaganesan,2002;Medvedo-
bilitiesforextractingmorerelevantinformationoutofaparticulardata vicetal.,2004;Qin,2006; Dahl,2006). Amodelbasedapproach
]
M set. assumesthatthedataisgeneratedbyamixtureofprobabilitydis-
Results: We have extended an existing algorithm for model based tributions, one for each cluster, and takes explicitly into account
Q
clustering of genes to simultaneously cluster genes and conditi- the noisyness of gene expression data. It allows for a statistical
. ons, and used three large compendia of gene expression data for assessmentoftheresultingclustersandgivesaformalestimatefor
o
S.cerevisiae to analyze its properties. The algorithmuses a Baye- theexpectednumberofclusters.Toinfermodelparametersandclu-
i
b sianapproachandaGibbssamplingproceduretoiterativelyupdate sterassignments,standardstatisticaltechniquessuchasExpectation
- the clusterassignmentofeach geneand condition. Forlarge-scale MaximizationorGibbssamplingareused(Liu,2002).
q
[ data sets, the posterior distribution is strongly peaked on a limited In this paper we use a novel model based clustering method
numberofequiprobableclusterings.AGOannotationanalysisshows whichbuildsupon the method recentlyintroduced byQin(2006).
1 that these local maxima are all biologically equally significant, and We address two key questions that have remained largely unans-
v
thatsimultaneouslyclusteringgenesandconditionsperformsbetter wered for model based clustering methods in general, namely
3
thanonlyclusteringgenesandassumingindependentconditions.A convergence of the Gibbs sampler for very large data sets, and
3
collectionofdistinctequivalentclusteringscanbesummarizedasa non-heuristic reconstruction of gene clusters from the posterior
0
weightedgraphonthesetofgenes,fromwhichweextractfuzzy,over- probabilitydistributionofthestatisticalmodel.
2
. lappingclustersusingagraphspectral method. Thecoresofthese Inthemodel used byQin(2006), itisassumedthat theexpres-
1 fuzzyclusterscontaintightsetsofstronglycoexpressedgenes,while sionlevelsofgenesinoneclusterarerandomsamplesdrawnfrom
0
the overlaps exhibit relations between genes showing only partial a Gaussian distribution and expression levels of different experi-
8
coexpression. mentalconditionsareindependent.Wehaveextendedthismodelto
0
Availability: GaneSh, a Java packagefor coclustering, isavailable allowdependenciesbetweendifferentconditionsinthesamecluster.
:
v underthetermsoftheGNUGeneralPublicLicensefromourwebsite Medvedovicetal.(2004)usedamultivariatenormaldistributionto
Xi athttp://bioinformatics.psb.ugent.be/software. take into account correlation among experimental conditions. Our
Contact:[email protected] approachconsistsofclusteringtheconditionswithineachgeneclu-
r
a Supplementaryinformation:availableonourwebsiteat ster, assuming that the expression levels of the genes in one gene
http://bioinformatics.psb.ugent.be/supplementary data/anjos/gibbs cluster for the conditions in one condition cluster are drawn from
oneGaussian distribution. Hence our model isamodel for coclu-
stering or two-way clustering of genes and conditions. The same
1 INTRODUCTION
statisticalmodelwasalsousedinourrecentapproachtoreconstruct
SincetheseminalpaperbyEisenetal.(1998),nowalmostadecade transcriptionregulatorynetworks(Michoeletal.,2007).Thecoclu-
ago,clusteringformsthebasisforextractingcomprehensibleinfor- steringiscarriedoutbyaGibbssamplerwhichiterativelyupdates
mation out of large-scale gene expression data sets. Clusters of theassignmentofeachgene,andwithineachgeneclustertheassi-
coexpressedgenestendtobeenrichedforspecificfunctionalcate- gnment of each experimental condition, using the full conditional
gories (Eisenet al., 1998), share cis-regulatory sequences intheir distributionsofthemodel.
promoters (Tavazoieet al., 1999), or form thebuilding blocks for ItisknownthataGibbssamplermayhavepoor mixingproper-
reconstructingtranscriptionregulatorynetworks(Segaletal.,2003). tiesifthedistributionbeingapproximatedismulti-modalanditwill
thenhaveaslowconvergencerate(Liu,2002). Previousstudiesof
∗Correspondingauthor,E-mail:[email protected]
©OxfordUniversityPress2007. 1
A.Joshietal
Gibbssamplersformodelbasedclusteringhavenotreportedconver- belong to one fuzzy cluster with very high probability we obtain
gencedifficulties(MedvedovicandSivaganesan,2002;Medvedovic tightclusterswhichshowhigherfunctionalcoherencecomparedto
etal.,2004;Dahl,2006).Inthosestudies,onlydatasetswitharela- standardclusters.Keepingalsogeneswhichbelongwithlowerbut
tively small number of genes (upto a few 100) (Medvedovic and still significant probability to multiple fuzzy clusters, we can ten-
Sivaganesan, 2002; Medvedovic et al., 2004), or a small number tatively identify multifunctional genes or relations between genes
ofexperimentalconditions(lessthan10)(Dahl,2006)wereconsi- showingonlypartialcoexpression. Weshowthatourresultsarein
dered, and special sampling techniques such as reverse annealing goodagreementwithpreviousfuzzyclusteringapproachestogene
(Medvedovic et al., 2004) or merge-split proposals (Dahl, 2006) expressiondata(GaschandEisen,2002).Webelievethatourfuzzy
weresufficienttogenerateawellmixingGibbssampler.Weobserve clusteringmethodtosummarizetheposteriordistributionwillbeof
that for data sets of increasing size the correlation between two generalinterestforallmodelbasedclusteringapproachesandsol-
Gibbssamplerrunsaswellasthenumberofclustersolutionsvisited vestheproblemsassociatedtoheuristicclusteringsofthepairwise
inonerunafterburn-insteadilydecreases.Thismeansthatforlarge- probabilitymatrix.
scaledatasets,theposteriordistributionisverystronglypeakedon Allouranalysesareperformedonthreelarge-scalepubliccom-
multiplelocalmodes.Sincethepeaksaresostrong,weapproximate pendia of gene expression data for S. cerevisiae (Spellman et al.,
theposteriordistributionbyaveragingovermultiplerunsperformed 1998;Gaschetal.,2000;Hughesetal.,2000).
inparallel, each converging quickly toasinglemode. Bycompu-
tingthecorrelationbetweendifferentaveragesofthesamenumber 2 METHODS
of runs we are able to show that the number of distinct modes is
Mathematicalmodel
relativelysmallandaccurateapproximationstotheposteriordistri-
bution can be obtained withas few as 10 modes for around 6000 ForanexpressionmatrixwithNgenesandMconditions,wedefineacoclu-
genes. steringasapartition ofthegenesintoK geneclusters Gk, together with
foreachgenecluster,apartitionofthesetofconditionsintoL condition
Toidentifythefinaloptimalclustering,thetraditionalapproachis k
toselectoutofalltheclusteringsvisitedbytheGibbssamplerthe clusters Ek,l. Weassumethat all data points in acocluster {(i,m): i ∈
onewhichmaximizestheposteriordistribution(maximumaposte- Gk,m ∈ Ek,l} are random samples from the same normal distribution.
ThismodelgeneralizesthemodelusedbyQin(2006),wherethepartitionof
riori(MAP)clustering).However, weshowthatforlargedatasets
conditionsisalwaysfixedatthetrivialpartitionintosingletonsets.
thedifferencesinlikelihoodbetweenthedifferentlocalmaximaare Givenasetofmeansandprecisions(µkl,τkl),acoclusteringCdefinesa
extremely small and statistically insignificant, such that the MAP probabilitydensityondatamatricesD=(xim)by
clusteringisasgoodastakinganylocalmaximumatrandom.AGO
K Lk
(Ashburneretal.,2000)analysisofthedifferentmodesshowsthat
p D|C,(µkl,τkl) = p(xim|µkl,τkl).
alsofrom the biological point of view any difference between the
` ´ kY=1lY=1i∈YGkm∈YEk,l
local modesisinsignificant. Taking intoaccount thefullposterior
Weuseauniformprioronthesetofcoclusteringswithnormal-gammacon-
distributionismoredifficultsincedifferentclusteringsmayhavea
jugatepriorsfortheparametersµ andτ .UsingBayes’rulewefindthe
kl kl
differentnumberofclustersandthelabelingofclustersisnotuni-
probability of a coclustering C with parameters (µkl,τkl) given the data
que(thelabelswitchingproblem(RednerandWalker,1984)).The D. Thenwetake the marginal probability overthe parameters (µkl,τkl)
commonsolutiontothisproblemistoconsiderpairwiseprobabili- toobtainthefinalprobabilityofacoclusteringCgiventhedataD,uptoa
tiesfortwogenesbeingclusteredtogetherornot(Medvedovicand normalizationconstant:
Sivaganesan,2002;Medvedovicetal.,2004;Dahl,2006).Amajor K Lk
questionthathasnotyetrecievedafinalanswerishowtoreconstruct p(C|D)∝ p(µ,τ) p(xim|µ,τ)dµdτ, (1)
gene clusters from these pairwise probabilities. Medvedovic and kY=1lY=1ZZ i∈YGkm∈YEk,l
Sivaganesan (2002) and Medvedovic et al. (2004) use a heuristic
wherep(µ,τ)=p(µ|τ)p(τ)with
hierarchical clustering on the pairwise probability matrix to form
a final clustering estimate. Dahl (2006) introduces a least-squares p(µ|τ)= λ0τ 1/2e−λ02τ(µ−µ0)2, p(τ)= β0α0 τα0−1e−β0τ,
method, which selects out of all clusterings visited by the Gibbs 2π Γ(α0)
` ´
sampler the one which minimizes a distance function to the pair- α0,β0,λ0 >0and−∞<µ0 <∞beingtheparametersofthenormal-
wiseprobabilitymatrix.Inbothapproaches, theprobabilitymatrix gamma prior distribution. We use the values α0 = β0 = λ0 = 0.1
isreducedtoasinglehardclustering.Thisnecessarilyremovesnon- and µ0 = 0.0, resulting in a non-informative prior. We have compared
thenormal-gammapriorwithothernon-informative, conjugate priors, but
transitiverelationsbetween genes (such asa low probability for a
foundnodifferenceinresults(seeSupplementaryInformation).Thedouble
pair of genes to be clustered together even though they both have
integral in eq. (1) can be solved exactly in terms of the sufficient stati-
relativelyhighprobabilitytobeclusteredwiththesamethirdgene) stics T(n) = xn (n = 0,1,2)foreachcocluster. The
whichmayneverthelessbeinformativeandbiologicallymeaningful. kl i∈Gk,m∈Ekl im
log-likelihoodorBayesianscoredecomposesasasumofcoclusterscores:
Weproposethatthepairwiseprobabilitymatrixreflectsasoftor P
fuzzyclusteringofthedata, i.e., genescanbelongtomultipleclu- K Lk
sterswithacertainprobability.Toextractthesefuzzyclustersfrom S(C)=logp(C|D)= Skl, (2)
thepairwiseprobabilitiesweuseamethodfrompatternrecognition kX=1Xl=1
with
theory (Inoue and Urahama, 1999). This method iteratively com-
papnurodtebusapbtdhilaeitteylsamrtghaeetsrptirxeo,ibgcaeobnnivsliattrlyuucemtsaatnardixfucbzoyzryrreecsmpluoosvntidenrignwgfritoehimgtehintevtheeeicgtwoenreviogefhctttohoref, Skl=−21Tk(0l)log(2π)+ 12log`λ0+λ0Tk(0l)´−logΓ(α0)
thegenesassignedtothelastcluster.Byonlykeepinggeneswhich +logΓ(α0+ 21Tk(0l))+α0logβ0−(α0+ 21Tk(0l))logβ1
2
and anentropymeasure
β1=β0+ 21hTk(2l)− (TTk(k(1l0l)))2i+ λ02`(λT0k(1l+)−Tk(µ0l0))TTk(k0(l0l))´2. Hfuzzy= N21ln2Xij h(Fij), (6)
whereNisthedimensionofthesquarematrixF and
Gibbssampleralgorithm
h(q)=−qln(q)−(1−q)ln(1−q)for0≤q≤1.
WeuseaGibbssamplertosamplecoclusteringsfromtheposteriordistribu-
tion(1).Thealgorithmiterativelyupdatestheassignmentofgenestogene Forahardclustering (Fij = 0or1forall i,j), Hfuzzy = 0, andfora
clusters,andforeachgenecluster,theassignmentofconditionstocondition maximallyfuzzyclustering(Fij =0.5foralli,j),Hfuzzy = 1.Inreality,
clustersasfollows: thematrixFisverysparse(mostgenepairswillneverbeclusteredtogether),
1. Initialization: randomly assign N genes to arandom K0 number of soWHefuzazsysuremmeatihnastsamfaulzlzeyvegnenfoer-greenalefmuzaztyrixclFustiesripnrgosd.ucedbyafuzzyclu-
geneclusters,andforeachcluster,randomlyassignM conditionstoa
steringofthegenes,i.e.,weassumethateachgeneihasaprobabilityp
randomL numberofconditionclusters. ik
k,0 tobelongtoeachclusterk,suchthat kpik =1.Toextracttheseproba-
2. ForNcycles,removearandomgeneifromitscurrentcluster.Foreach bilitiesfromF weuseagraphspectralmethod(InoueandUrahama,1999),
genecluster k, calculate theBayesian score S(Ci→k), where Ci→k originally developed forpattern recogPnition andimage analysis, modified
denotesthecoclusteringobtainedfromCbyassigninggeneitocluster heretoenforcethenormalizationconditionsonp .Afuzzyclusterisrepre-
ik
k,keepingallotherassignmentsofgenesandconditionsequal,aswell sentedbyacolumnvectorw = (w1,...,wN)T, withwi theweightof
astheprobabilityS(Ci→0)forthegenetobealoneinitsowncluster. geneiinthiscluster,normalizedsuchthatkwk2 =wTw= w2 =1.
i i
AssigngeneitooneofthepossibleK+1geneclusters, whereK Thecohesiveness oftheclusterwithrespecttothegene-genematrixF is
iQskth∝eceuSrr(eCnit→nku)m,nboerrmoafligzeendesuclcuhsttehrast, ackcoQrdkin=g t1o.theprobabilities definedaswTFw=PijwwiTFFijwwj.BytheRayleigh-RitzthePorem,
3. Fmorfreoamchitgsecnuerrcelnutstcelruskte,r.foFroMreaccyhclceosPn,drietimonovceluastrearnld,ocmalccuolnadteititohne mw6=ax0 wTw =v1TFv1=λ1,
BayesianscoreS(Ck,m→l).Assignconditionmtooneofthepossible whereλ1isthelargesteigenvalueofF andv1thecorresponding(normali-
Lk+1clusters,whereLkisthecurrentnumberofconditionclusters zed)eigenvector.HencethemaximallycohesiveclusterinF isgivenbythe
forgeneclusterk,accordingtotheprobabilities Ql ∝ eS(Ck,m→l), eigenvectorofthelargesteigenvalue.BythePerron-Frobeniustheorem,this
4. nItoerrmatealsizteepd2suacnhdt3hautnPtillcoQnlve=rge1n.ce.Oneiterationisdefinedasexecu- ethigeemnveemctboerrsishiupnpiqroubeaabnidlitaiellsittosecnlutrsiteesra1rebynopnin1eg=atimvea.xvWj1(,evi1c,ajn).thHeenndceefitnhee
tingstep2and3consecutivelyonce,andhenceconsistsofN+K×M genewiththehighestweightinv1isconsideredtheprototypicalgeneforthis
samplingsteps(withKthenumberofgeneclustersafterStep1ofthat cluster,anditwillnotbelongtoanyothercluster.Theprobabilitypi1mea-
iteration). surestowhatextentothergenesarecoexpressedwiththisprototypicalgene.
Tofindthenextmostcohesivecluster,weremovefromF theinformation
This coclustering algorithm simulates a Markov chain which satisfies
alreadycontainedinthefirstclusterbysetting
detailedbalancewithrespecttotheposteriordistribution(1),i.e.,afterasuf-
ficientnumberofiterations,theprobabilitytovisitaparticularcoclustering Fi(j2)= 1−pi1Fij 1−pj1,
Cisgivenexactlybyp(C|D).Theexpectationvalueofanyrealfunctionf
andcomputethelargesteigenpvalueandcorpresponding(normalized)eigen-
withrespecttotheposteriordistributioncanbeapproximatedbyaveraging
overtheiterationsofasufficientlylongGibbssamplerrun: vectorv2forthismatrix.Theprototypicalgeneforthisclustermayalready
have some probability assigned to the previous cluster, so we define the
1 T0+T membershipprobabilitiestothesecondclusterby
E(f)=XC f(C)p(C|D)≈ T t=XT0+1f(Ct) (3) pi2=min maxvj2(,vi2,j)(1−pimax1),1−pi1 .
“ ”
whereCtisthecoclusteringvisitedatiterationtandT0isapossibleburn- Hereimax=argmaxj(v2,j)istheprototypicalgeneforthesecondcluster,
inperiod.WesaythattheGibbssamplerhasconvergediftworunsstarting andwetakethe‘min’toensurethat kpikwillneverexceed1.
fromdifferentrandominitializationsreturnthesameaverages(3)forasui- ThisprocedureofreducingF andcomputingthelargesteigenvalueand
tablesetoftestfunctionsf.Moreprecisely,if{fn}isasetoftestfunctions, correspondingeigenvectortodefinetPhenextclustermembershipprobabili-
definean =E1(fn)theaverageoffninthefirstGibbssamplerrun,and tiesisiterateduntiloneofthefollowingstoppingcriteriaismet:
bn=E2(fn)theaverageoffninthesecondGibbssamplerrun.Wedefine 1. All entries in the reduced matrix F(k) reach 0, i.e., for all genes,
acorrelationmeasureρ(0≤ρ≤1)betweentworunsas
k′<kpik′ = 1, and we have completely determined all fuzzy
ρ= | nanbn| . (4) Pclustersandtheirmembershipprobabilities.
( Pna2n)( nb2n) 2. Thelargesteigenvalue ofthereducedmatrixF(k) hasrank> 1. In
thiscasetheeigenvectorisnolongeruniqueandneednolongerhave
FullconvergenceisreachedipfρP=1. P
nonnegativeentries,sowecannotmakenewclustermembershippro-
Fuzzyclustering babilities outofit. Thismayhappen ifthe(weighted) graphdefined
byconnecting genepairs withnon-zero entries inF(k) isnolonger
Tokeeptrackofthegeneclusters,independentofthe(varying)numberof stronglyconnected(Perron-Frobeniustheorem).
clustersortheirlabeling,weconsiderfunctions
Tocomputeoneormoreofthelargesteigenvalues andeigenvectors for
1 ifgeneiandjbelongtothesamegeneclusterinC large sparse matrices such as F and its reductions F(k) we use efficient
fij(C)= (5) sparse matrix routines, such as for instance implemented in the Matlab®
(0 otherwise
functioneigs.
Ingeneral,theposteriordistribution(1)isnotconcentratedonasinglecoclu-
Datasets
steringandthematrixF = (E(fij))ofexpectation values (seeeq. (3))
consistsofprobabilitiesbetween0and1.Toquantifythisfuzzyness,weuse Weusethreelargecompendiaofgeneexpressiondataforbuddingyeast:
3
A.Joshietal
1. Gaschetal.(2000)dataset:expressionin173stressrelatedconditions. 1
2. Hughes et al. (2000) data set: compendium of expression profiles
correspondingto300diversemutationsandchemicaltreatments.
0.8
3. Spellmanetal.(1998)dataset: 77conditions foralphafactorarrest,
elutriation,andarrestofacdc15temperature-sensitivemutant. ure
s
Weselectthegenespresentinallthreedatasets(6052genes)and,tobeas ea 0.6
m
unbiasedaspossible, nofurtherpostprocessingisdone.WeuseSynTReN n
n(VuamnbdeernofBcuolcnkdeitieotnasl.f,o2r0a06sy)nttohegteicnetrraatnescsrimiptuiloanterdegdualtaatosreytsnwetiwthovrkarywiintgh elatio 0.4
1000genes(seealsoSupplementaryInformation). Corr
Functionalcoherence 0.2
Toestimatetheoverallbiologicalrelevanceoftheclustersweuseamethod
whichcalculatesthemutualinformationbetweenclustersandGOattributes 0
(GibbonsandRoth,2002).ForeachGOslimattribute, wecreateacluster- 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
attributecontingencytablewhererowsareclustersandcolumnsareattribute Number of Iterations
status (‘Yes’ if the gene possesses the attribute, ‘No’ if it is not known
whetherthegenepossessestheattribute). Thetotal mutualinformation is Fig.1. TraceplotofthecorrelationmeasureρbetweentwodifferentGibbs
definedasthesumofmutualinformationsbetweenclustersandindividual samplerrunsasafunctionofthenumberofiterations, forasmalldataset
GOattributes: (100genes,10conditions,topcurve)andalargedataset(1000genes,173
MI= H(C)+H(A)−H(C,A) (7) conditions,bottomcurve).BothdatasetsaresubsetsoftheGaschetal.data
set.
XA
whereCisaclusteringofthegenes,AisaGOattributeandHisShannon’s
entropy,H =− ipilog(pi),andthepiareprobabilitiesobtainedfrom detailatasingleGibbssamplerrun.Itturnsoutthatthecorrelation
thecontingencytables.
P measure between two successive iterations reaches 1 very rapidly
andremainsunchanged afterwards(SeeSupplementary Figure2).
3 RESULTSANDDISCUSSION
Sinceeachiterationinvolvesalargenumberofsamplingsteps(i.e.,
ConvergenceoftheGibbssampleralgorithm
alargenumberofpossibleconfigurationchanges),thisimpliesthat
Westudyconvergenceusingthetestfunctionsf whichindicateif theGibbssamplerveryrapidlyfindsalocalmaximumoftheposte-
ij
geneiandjareclusteredtogetherornot(seeeq.(5)intheMethods) riordistributionfromwhichitcannolonger escape. Weconclude
and compute thecorrelation measure ρ between different runs for thattheposteriordistributionissupportedonmultiplelocalmaxima
thissetoffunctions(seeeq.(4)intheMethods).Inadditiontothe whichoverlaponlypartially,andwithvalleysinbetweenthatcan-
correlation measure, wealso compute theentropy measure H notbecrossedbytheGibbssampler. Theselocalmaximaallhave
fuzzy
(seeeq.(6)intheMethods).Thisparametersummarizesthe‘shape’ approximately the same log-likelihood (see for instance the small
oftheposteriordistribution:avalueof0correspondstohardcluste- varianceinFigure4below)andarethereforeallequallymeaningful.
ringwhichimpliesthatthedistributioniscompletelysupportedona Theprobabilityratiobetween peaksandvalleysissolarge(expo-
singlesolution,themorepositiveH is,themorethedistribution nential in the size of the data set) that an accurate approximation
fuzzy
issupportedonmultiplesolutions. to the posterior distribution is given by averaging over the local
IntheanalysisbelowweusesubsetsfromtheGaschetal.dataset maximaonly.Thosecanbeuncoveredbyperformingmultipleinde-
withavaryingnumberofgenesandconditionsandperformmultiple pendentruns,eachconvergingveryquicklyononeofthemaxima,
Gibbssamplerrunswithalargenumberofiterations.Oneiteration and there is no need for special techniques to also sample in bet-
involvesareassignmentofallgenesandallconditionsinallclusters, weenlocalmaxima. Thenumber of local maxima(Gibbssampler
andhenceinvolvesN+M KsamplingstepsintheGibbssampler, runs)necessaryforagoodapproximationcanbeestimatedasfol-
×
whereN isthenumberofgenes,M thenumberofconditions,and lows.Weperform150independentGibbssamplerrunsandcompute
Kthenumberofclustersatthatiteration(typicallyK √N). foreachthepairwisegene-geneclusteringprobabilitymatrixF (see
∼
First we consider a very small data set (100 genes, 10 conditi- Methods). For each k = 1,...,50, we take twonon-overlapping
ons). WestarttwoGibbssamplerrunsinparallelandcomputethe setsof k solutionsand computetheaverageof theirpairwisepro-
correlationmeasure ρat each iteration, see Figure1. In thiscase, bability matricesF. Then, we compute the correlation measure ρ
ρapproaches itsmaximum value ρ = 1 inlessthan 5000 iterati- betweenthosetwoaverages.Thisisrepeatedseveraltimes,depen-
onsandtheGibbssamplergeneratesawellmixingchainwhichcan dingonthenumberofnon-overlappingsetsthatcanbechosenfrom
easilyexplorethewholespace.Non-zerovaluesoftheentropymea- thepoolof150solutions.Ifforagivenkthecorrelationisalways
sureH (0.105 0.003)indicatethattheposteriordistribution 1,thenthereareatmostklocalmaxima.Figure2showsthatask
fuzzy
±
issupportedonmultipleclusteringsofthegenes. increases,thecorrelationquicklyreachesclosetothisperfectvalue
NextweruntheGibbssampleralgorithmonadatasetwith1000 1.Thisimpliesthatthenumberoflocalmaximaisnottoolargeand
genes and all 173 conditions. Unlike in the previous situation we agoodapproximationtotheposteriordistributioncanbeobtained
observethatthecorrelationbetweentwoGibbssamplerrunssatura- inthiscasealreadywith10to20solutions.SupplementaryFigure1
teswellbelow1(seeFigure1).HencetheGibbssamplerdoesnot showsanexampleofhardclustersformedasaresultofasinglerun
convergetotheposterior distributioninonerun. Wecangainfur- andfuzzyclustersformedbymergingtheresultof10independent
therunderstandingforthelackofconvergencebylookinginmore runs.
4
1 The rapid convergence of the log-likelihood shows that the Gibbs
samplerreachesthelocalmaximaveryquicklyandthelowvariance
shows that the different local maxima are all equally likely. The
0.9
averageover 10runsoftheGOmutualinformationscore(seeeq.
e (7) in the Methods) shows the same rapid convergence and small
sur 0.8 variance(seeSupplementaryFigure6), implyingthatthedifferent
a
e
m maximaarebiologicallyequallymeaningfulaccordingtothisscore.
on 0.7 Thecorrelationbetweendifferentaveragesof10Gibbssamplerruns
elati reaches0.85,avalueweconsiderhighenoughforagoodapproxi-
Corr 0.6 mationof theposterior distribution. The other twodata setsshow
preciselythesamebehavior(seeSupplementaryFigures4and5).
0.5
0.4 -40000
0 5 10 15 20 25 30 35 40 45 50
Number of cluster solutions merged -50000
-60000
Fig.2. Correlationmeasureρbetweendifferentaveragesofthesamenum-
beroflocalmaximaforadatasetof1000genesand173conditions(subset ore -70000
c
oftheGaschetal.dataset). d s -80000
o
o
h
eli -90000
k
In Figure 3, we keep the same 1000 genes and select an incre- ogli -100000
L
asingnumber ofconditions. Asthedatasetincreases, theentropy
-110000
measureH decreases, meaning theclustersbecomeincreasin-
fuzzy
glyhard.Simultaneously,thecorrelationmeasureρdecreasesfrom -120000
about0.85to0.55(seeSupplementaryFigure3).Weconcludethat
-130000
thedepthofthevalleysbetweendifferentlocalmaximaoftheposte- 0 10 20 30 40 50 60 70 80 90 100
riordistributionincreaseswiththesizeofthedatasetanditbecomes Number of Iterations
increasingly more difficult for the Gibbs sampler to escape from
thesemaximaandvisitthewholespaceinonerun. Fig.4. Traceplotoftheaveragelog-likelihoodscoreandstandarddeviation
for10GibbssamplerrunsfortheSpellmanetal.dataset.
0.045
0.04 Two-wayclusteringversusone-wayclustering
0.035 Our coclustering algorithm extends the CRC algorithm of Qin
(2006) by also clustering the conditions for each cluster of genes
e 0.03
ur (‘two-way clustering’), instead of assuming they are always inde-
s
ea 0.025 pendent(‘one-wayclustering’).Wecomparetheclusteringofgenes
m
py 0.02 forthethreeyeastdatasetsusingbothmethods, bycomputingthe
ntro averagenumberofclustersinferred(K),theaveragelog-likelihood
E 0.015 score and the average GO mutual information score for 10 inde-
0.01 pendent runsof eachalgorithm. TheresultsaretabulatedinTable
1 and 2. For all three data sets, both the log-likelihood score and
0.005
theGOmutualinformationscorearehigher(better)forourmethod.
0 TheincreaseinGOmutualinformationscoreisespeciallysignifi-
0 20 40 60 80 100 120 140 160
cantincaseoftheHughesetal.dataset.Thisdatasethasveryfew
Number of Conditions
overexpressedorrepressedvaluesandifeachconditionisconside-
red independent, there are very few distinct profiles which results
Fig.3. EntropymeasureHfuzzy fordatasetswith1000genesandvarying intheformationofveryfewclusters( 15for6052genes). Also
numberofconditions(subsetsoftheGaschetal.dataset). ∼
clusteringtheconditionsgivesmoremeaningfulresultssincediffe-
rentiallyexpressedconditionsformseparateclustersfromonelarge
backgroundclusterofnon-differentiallyexpressedconditions.
Analysisofwholegenomedatasets
Forsimulateddatasets,clustersaredefinedassetsofgenessha-
IfweruntheGibbssampleralgorithmonthethreewholegenome ring the same regulators in the synthetic regulatory network, and
yeast data sets, we are in the situation where the algorithm very thetruenumberofclustersisknown.Hereweconsideragenenet-
rapidlygetsstuckinalocalmaximum.InFigure4weplottheave- workwhosetopologyissubsampledfromanE.colitranscriptional
rageBayesianlog-likelihoodscore(seeeq.(2)intheMethods)for network(VandenBulcke etal.,2006) with1000 genes, ofwhich
10 different Gibbs sampler runs for the Spellman et al. data set. 105transcriptionfactors,and286clusters.Fortwo-wayclustering,
5
A.Joshietal
Table1. One-wayclustering,averagesfor10differentGibbssamplerruns. 350
300
Dataset Avg.K Avg.log-likelihoodscore Avg.MI
250
s
Gaschetal. 52.9(2.6) −6.101(0.014)×105 1.771(0.031) er
st
Hughesetal. 14.9(0.5) 2.530(0.002)×106 0.588(0.044) Clu 200
Spellmanetal. 49.7(2.2) −7.183(0.037)×104 1.491(0.032) of
er 150
b
m
u
N
100
Table2. Two-wayclustering,averagesfor10differentGibbssamplerruns.
50
Dataset Avg.K Avg.log-likelihoodscore Avg.MI 0
0 50 100 150 200 250 300 350 400 450 500
Gaschetal. 84.5(2.5) −5.586(0.012)×105 1.912(0.033) Number of Conditions
Hughesetal. 85.5(2.7) 2.798(0.004)×106 1.511(0.045)
Spellmanetal. 65.4(4.2) −5.112(0.011)×104 1.612(0.032) Fig.5. Numberofgeneclustersforasimulateddatasetwith1000genes
andavaryingnumberofconditions,fortwo-wayclustering(topdatapoints
(×))andone-wayclustering(bottomdatapoints(+))
asweincreasethenumber of conditionsinthesimulateddataset,
moreclustersareformedandthenumberofclusterssaturatesclose byaveraging over10different Gibbssampler runs. FortheGasch
tothetruenumber(seeFigure5).Forone-wayclustering,addition etal. andHughesetal. datasets, fullfuzzyclusteringisachieved
ofconditionsdoesnotaffecttheinferrednumberofclusterswhich with500fuzzyclusters(all6052geneshavetotalassignmentpro-
isanorderofmagnitudesmallerthanthetruenumber (seeFigure babilityPkpik >0.98).FortheSpellmanetal.datasetthesecond
5).Fortwo-wayclustering,duetotheclusteringofconditions, the stoppingcriteriumismetafterproducing321fuzzyclusters.
numberofmodelparametersisreduced,andgreaterstatisticalaccu- Ingeneral,weobservethatthealgorithmfirstproducesonevery
racycanbeachieved, evenwhenthenumber ofgenesinacluster largefuzzy cluster corresponding toan average expression profile
becomessmall. thatalmostallgenescanrelateto.Thisclusterisofnointerestfor
Thecorrelationmeasureρbetweentrueclustersandinferredclu- further analysis. Then it produces a number of fuzzy clusters of
stersalsoshowsahighervaluefortwo-wayclusteringoverone-way varying size which show interesting coexpression profiles and are
(SupplementaryFigure8). usefulforfurtheranalysis. Forthethreedatasetsconsideredhere,
Unlike for simulated data sets, the inferred number of clusters thisnumber isaround 100, consistent withtheaveragenumber of
does not depend much upon the number of conditions for real clustersindifferent Gibbs sampler runs (see Table2). Theremai-
biological data sets (Supplementary Figure 7), i.e., even if more ning fuzzy clusters are typically very small and consist mostly of
conditionsareadded,thealgorithmdoesnotgeneratemoreclusters. noise. Liketheveryfirstcluster, theyareofnointerestforfurther
Thisisbecauseinsimulateddata,everyadditionofaconditionadds analysis.
newinformation,butforrealdatasetsthatmightnotbethecase.In Since every gene belongs toevery cluster, we use a probability
ordertogetthetrueclustersfromtheexpressiondata,wedonotonly cutoff to remove from each cluster the genes which belong to it
needmoreconditionsbutalsothateachnew conditioncontributes withaverysmallprobability.Thesmallerthecutoff,themoregenes
information different from the information already available from belongtoacluster,whichresultsintomorefuzzyclustersandvice
thepreviousconditions. Thismightbeareasonwhythealgorithm versa.Table3showsthetotalnumberofgenesassignedtoatleast
clusters6052genesinonly 80clusters(seeTable2). one fuzzy cluster with different cutoff values and in brackets the
∼ numberofgenesassignedtoatleasttwofuzzyclusters.
Fuzzyclusters
The goal of merging different Gibbs sampler solutions and for-
Ouralgorithmreturnsasummaryoftheposteriordistributioninthe mingfuzzyclustersistoextractadditionalinformationoutofadata
formofagene-genematrixwhoseentriesaretheprobabilitiesthata setthatisnotcapturedbyasinglehardclusteringsolution.Thiscan
pairofgenesisclusteredtogether.Toconvertthesepairwiseproba- beachievedintwoways.First,byobtainingtightclustersoffewbut
bilitiesbacktoclustersweuseagraphspectralmethodasexplained highly coexpressed genes with a high probability cutoff. Second,
in the Methods. The method produces fuzzy overlapping clusters by characterizing genes which belong to multiple clusters with a
whereeachgeneibelongstoeachfuzzyclusterk withaprobabi- significantprobability.
lity p , such that P p = 1. The size of a fuzzy cluster k is Forallthreedatasets,ataprobabilitycutoffof0.5,wegetasub-
ik k ik
defined as P p . The algorithm iteratively produces new fuzzy setofgeneswhichbelongtoonlyoneclusterwithhighprobability.
i ik
clusters until all the information in the pairwise matrix is conver- Table3 showsthat each dataset retainsat least20% of itsgenes.
tedintoclusters(1st stoppingcriterium, seeMethods), oruntilthe Thesearesetsofstronglycoexpressedgeneswhichclustertogether
mathematicalconditionsunderlyingthealgorithmceasetohold(2nd inalmosteveryhardclustersolution. Ribosomal genesshow such
stoppingcriterium,seeMethods).Weappliedthealgorithmtopair- astrongcoexpression patterninallthethreedatasetswheremost
wiseprobability matrices for each of the three data sets, obtained genesbelongtothisclusterwithaprobabilitycloseto1(seeFigure
6
Table 3. Number of genes clustered and number of genes ERP2,RET2,RET3,SEC13,SEC21,SEC24andothers.Cluster34
belonging to multiple clusters with different membership containsgenesrepressedundernitrogenstressandstationarystate.
probabilitycutoffvalues. 20percentofthegenesincluster27alsobelongtocluster34with
a significant membership. These include genes encoding for ER
vesicle coat proteins like RET2, RET3, SEC13 and others which
Dataset 0.1 0.3 0.5
areinduced under DTTstress aswell as repressed under nitrogen
stress and stationary state. Also RIO1, an essential serine kinase,
Gaschetal. 6045(4356) 4062(344) 1781(0)
belongstotwoclusterswithasignificantprobability.Itclusterswith
Hughesetal. 6052(4554) 3959(34) 2254(0)
Spellmanetal. 6052(5187) 3158(139) 1255(0) genesinvolvedinribosomalbiogenesisandassembly(Gaschetal.
data cluster 3) as well as with genes functioning as generators of
precursormetabolitesandenergy(Gaschetal. datacluster7). We
findsimilarobservationsfortheHughesetal. andSpellmanetal.
datasets.GenesCLN1,CLN2andotherDNAsynthesisgeneslike
6). At least 75% of all the genes in cluster 2 (Gasch et al. data),
CLB6 which are known to be regulated by SBF during S1 phase
cluster3(Hughesetal.data)andcluster2(Spellmanetal.data)are
(Kochetal.,1996)belongtocluster19(Spellmanetal.data).They
locatedinribosome.
alsobelongwithsignificantprobabilitytocluster4(Spellmanetal.
data).Morethanonethirdofthegenesincluster4arepredictedto
becellcycleregulatedgenes.
CONCLUSION
Wehavedevelopedanalgorithmtosimultaneouslyclustergenesand
conditionsandsamplesuchcoclusteringsfromaBayesianprobabi-
listicmodel.Forlargedatasets,themodelissupportedonmultiple
equivalent local maxima. The average of these local maxima can
berepresentedbyamatrixofpairwisegene-geneclusteringproba-
bilitiesandwehaveintroducedanewmethodforextractingfuzzy,
Fig.6. RibosomalgenesformatightclusterintheHughesetal.dataset. overlappingclustersfromthismatrix.Thismethodisabletoextract
(Duetospaceconstraintsonlythefirstfewgenesareshown;forthecomplete
information out of the data set that is not available from a single,
figure,seetheSupplementaryInformation.)
hardclustering.
Localbutverystrongcoexpressionpatternscanalsobedetected FUNDING
byourmethod.Cluster15oftheGaschetal.datasetconsistsofonly
EarlyStage Marie Curie Fellowship to A.J.; Postdoctoral Fellow-
4genesclusteredtogetherwithprobability1(seeFigure7).These
shipoftheResearchFoundationFlanders(Belgium)toT.M.
fourgenes,GAL1,GAL2,GAL7,andGAL10,areenzymesinthe
galactosecatabolicpathwayandrespondtodifferentcarbonsources
duringsteadystate. Theyarestronglyupregulatedwhengalactose ACKNOWLEDGEMENT
isusedasacarbonsource(2nd experimentclusterinFigure7)and Wethank Steven Maere and Vanessa Vermeirssen for helpful dis-
stronglydownregulatedwithanyothersugarasacarbonsource(1st cussions.
experimentclusterinFigure7).Ineveryhardclustersolution,these
4genesareclusteredtogetheralongwithothergenes. Bymerging
REFERENCES
thesehardclustersolutionstoformfuzzyclusters,wegetatightbut
moremeaningfulclusterwithonly4genes. Ashburner, M., Ball, C.A., Blake, J. A., Botstein, D., Butler, H.,
Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig,
J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A.,
Lewis,S.,Matese,J.C.,Richardson,J.E.,Ringwald,M.,Rubin,
G.M.,andSherlock,G.(2000). Geneontology:toolfortheuni-
ficationofbiology.TheGeneOntologyConsortium. NatGenet,
25,25–29.
Dahl,D.B.(2006). Model-basedclusteringforexpressiondatavia
aDirichletprocessmixturemodel. InK.-A.Do, P.Mu¨ller, and
Fig.7. FourgenesGAL1, GAL2, GAL7andGAL10formatightcluster
M.Vannucci,editors,Bayesianinferenceforgeneexpressionand
showingconditionalcoexpressionintheGaschetal.dataset.
proteomics,pages201–218.CambridgeUniversityPress.
Eisen,M.B.,Spellman,P.T.,Brown,P.O.,andBotstein,D.(1998).
Table 3 shows that many genes belong to two or more clusters Clusteranalysisanddisplayofgenome-wideexpressionpatterns.
withasignificantprobability.FortheGaschetal.dataset,wefind ProcNatlAcadSciUSA,95(25),14863–14868.
similarobservationsasin(GaschandEisen,2002).Cluster27con- Fraley,C.andRaftery,A.E.(2002).Model-basedclustering,discri-
tains genes localized in endoplasmic reticulum (ER) and induced minantanalysis,anddensityestimation.JAmerStatisticalAssoc,
underdithiothreitol(DTT)stresslikeFKB2,JEM1, ERD2,ERP1, 97,611–631.
7
A.Joshietal
Gasch, A. P. and Eisen, M. B. (2002). Exploring the conditio- learningalgorithmsusingsimulateddata. BMCBioinformatics,
nalcoregulationofyeastgeneexpressionthroughfuzzyk-means 8Suppl2,S5.
clustering. GenomeBiol,3(11),RESEARCH0059. Qin,Z.S.(2006).Clusteringmicroarraygeneexpressiondatausing
Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., weighted Chinese restaurant process. Bioinformatics, 22(16),
Eisen, M. B., Storz, G., Botstein, D., andBrown, P.O.(2000). 1988–1997.
Genomic expression programs in the response of yeast cells to Redner, R.A.andWalker, H.F.(1984). Mixturedensities, maxi-
environmentalchanges. MolBiolCell,11(12),4241–4257. mum likelihood, and the EM algorithm. SIAM Review, 26(2),
Gibbons, F. D. and Roth, F. P. (2002). Judging the quality of 195–239.
geneexpression-basedclusteringmethodsusinggeneannotation. Segal,E.,Shapira,M.,Regev,A.,Pe’er,D.,Botstein,D.,Koller,D.,
GenomeRes,12(10),1574–1581. andFriedman,N.(2003). Modulenetworks: identifyingregula-
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., tory modules and their condition-specific regulators from gene
Stoughton, R., Armour, C. D., Bennett, H.A., Coffey, E., Dai, expressiondata. NatGenet,34,166–167.
H.,He,Y.D.,Kidd,M.J.,King,A.M.,Meyer,M.R.,Slade,D., Spellman,P.T.,Sherlock,G.,Zhang,M.Q.,Iyer,V.R.,Anders,K.,
Lum, P.Y.,Stepaniants, S.B.,Shoemaker, D.D.,Gachotte, D., Eisen,M.B.,Brown,P.O.,Botstein,D.,andFutcher,B.(1998).
Chakraburtty,K.,Simon,J.,Bard,M.,andFriend,S.H.(2000). Comprehensive identification of cell cycle-regulated genes of
Functional discovery via a compendium of expression profiles. theyeastSaccharomycescerevisiaebymicroarrayhybridization.
Cell,102(1),109–126. MolBiolCell,9(12),3273–3297.
Inoue,K.andUrahama,K.(1999). Sequentialfuzzyclusterextrac- Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S.,
tion by a graph spectral method. Pattern Recogn. Lett., 20(7), Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999). Inter-
699–705. preting patterns of gene expression with self-organizing maps:
Koch, C., Schleiffer, A., Ammerer, G., and Nasmyth, K. methods and application to hematopoietic differentiation. Proc
(1996). Switchingtranscriptiononandoffduringtheyeastcell NatlAcadSciUSA,96(6),2907–2912.
cycle:Cln/Cdc28kinasesactivateboundtranscriptionfactorSBF Tavazoie,S.,Hughes,J.D.,Campbell,M.J.,Cho,R.J.,andChurch,
(Swi4/Swi6)atstart,whereasClb/Cdc28kinasesdisplaceitfrom G. M. (1999). Systematic determination of genetic network
thepromoterinG2. GenesDev,10(2),129–141. architecture. NatGenet,22(3),281–285.
Liu, J. S. (2002). Monte Carlo strategies in scientific computing. Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remor-
Springer. tel, P., Ma, H., Verschoren, A., De Moor, B., and Marchal, K.
Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infi- (2006). SynTReN:ageneratorofsyntheticgeneexpressiondata
nitemixturemodelbasedclusteringofgeneexpressionprofiles. for design and analysis of structure learning algorithms. BMC
Bioinformatics,18(9),1194–1206. Bioinformatics,7,43.
Medvedovic,M.,Yeung,K.Y.,andBumgarner,R.E.(2004).Baye- Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo,
sian mixture model based clustering of replicated microarray W.L.(2001). Model-based clusteringanddatatransformations
data. Bioinformatics,20(8),1222–1232. forgeneexpressiondata. Bioinformatics,17(10),977–987.
Michoel, T., Maere, S., Bonnet, E., Joshi, A., Saeys, Y., Vanden
Bulcke,T.,VanLeemput,K.,vanRemortel,P.,Kuiper,M.,Mar-
chal,K.,andVandePeer,Y.(2007). Validatingmodulenetwork
8