IMPROVING CACHE PERFORMANCE WITH ADAPTIVE CACHE
TOPOLOGIES AND DEFERRED COHERENCE MODELS
By
YONGJOON LEE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1999
To
My parents and family
ACKNOWLEDGMENTS
Special thanks are due to my committee chairman, Dr. Jih-Kwon Peir, without
whose unending patience, guidance, and expertise this effort would never have
reached fruition. Thanks are also due to the rest of my committee, Dr. Timothy
Davis, Dr. Kenneth O, Dr. Sanguthevar Rajasekaran, and Dr. Sartaj Sahni.
Many thanks are due to Windsor Hsu, who helped to run a simulation program,
and Byung-Kwon Chung, who set up the multiprocessor simulation environment with
considerable effort.
Also, my deepest appreciation is due to my wife, Chanwon, without whose
patience, understanding, and encouragement I never would have embarked on,
progressed through, and now completed the doctoral program and this effort.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
ABSTRACT
INTRODUCTION
Issues of Cache Memory
Performance Evaluation Methods
Tracing Facilities
Workloads
Machine Models
Overview of the Dissertation
CAPTURING DYNAMIC MEMORY REFERENCE BEHAVIOR WITH ADAPTIVE CACHE TOPOLOGY
Introduction
Statement of Problem
Cache Miss Behaviors
Underutilized Cache Frames
Adaptive Group-Associative Caches
An Example Design
Performance Impact
Performance Evaluation
Simulation Model
Workloads and Traces
Conventional L1 Miss Ratios
The Configuration of SHT and OUT Directory
Improvement of Holes
Comparison with Other Cache Organizations
Partitioned LRU Replacement of SHT and OUT-Directory
Performance with Embedded Victim Cache
The Result of TPC-C-like Benchmark
Previous Works
DATA PREFETCHING
Introduction
Statement of Problem
Handling Data Prefetching Using Group-Associative Cache
Performance of Data Prefetching
The Configuration of the Group-Associative Cache for Data Prefetch
The Comparison with Other Cache Topologies
The Results of SPEC95 Workloads
The Result of the Commercial Workload TPC-C-like
DEFERRED CACHE COHERENCE MODELS
Introduction
Statement of Problem
Synchronization Primitives
Deferred Cache Coherence Protocol
Reconcile Partially-Modified Cache Lines
Performance Evaluation
Simulation Model
Performance Comparison
Related Work
CONCLUSION
APPENDIX A Compulsory, Capacity, and Conflict Misses for SPECint95
APPENDIX B Compulsory, Capacity, and Conflict Misses for SPECfp95
REFERENCES
BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
IMPROVING CACHE PERFORMANCE WITH ADAPTIVE CACHE
TOPOLOGIES AND DEFERRED COHERENCE MODELS
By
Yongjoon Lee
August, 1999
Chairman: Jih-Kwon Peir
Major Department: Computer and Information Science and Engineering
Memory references exhibit locality and are therefore not uniformly distributed
across the sets of a cache. This skew reduces the effectiveness of a cache because
it results in the caching of a considerable number of less-recently used lines. In
this dissertation, a technique that dynamically identifies these less-recently used
lines and effectively utilizes the cache frames is described. These underutilized
cache frames can be occupied by the more-recently used cache lines. Also, these
frames can be used to further reduce the miss ratio through data prefetching. In the
proposed design, the possible locations in which a line can reside are not
predetermined. Instead, the cache is dynamically partitioned into groups. Because
both the number of groups and the associativity of each group adapt to the dynamic
reference pattern, this design is called the adaptive group-associative cache.
Performance evaluation shows that the group-associative cache is able to achieve a
hit ratio better than that of a 4-way set-associative cache. For some of the
workloads, the hit ratio approaches that of a fully associative cache.
Private caches are a critical component for hiding memory access latency in
high-performance multiprocessor systems. However, multiple processors may
concurrently update distinct portions of a cache line and cause unnecessary cache
invalidations under traditional cache coherence protocols.
In this research, a deferred cache coherence model is described, which allows
a cache line to be shared in multiple caches in an inconsistent state as long as the
processors are guaranteed not to access any stale data. Multiple write requests to
different portions of a cache line can be performed locally without invalidation. An
efficient mechanism to reconcile multiple inconsistent copies of the modified line is
described to satisfy the data dependence. Simulation results show that the proposed
cache coherence model improves the performance of parallel applications
compared to the conventional MESI and delayed coherence protocols.
In summary, a new adaptive cache topology which utilizes the cache frames
and a new cache coherence model which minimizes the cache coherence activities are
proposed. The performance evaluation shows that the proposed schemes improve the
overall performance of the cache memory in both uniprocessor and multiprocessor
systems.
CHAPTER 1
INTRODUCTION
1.1 Issues of Cache Memory
Programs exhibit both temporal locality, the tendency to reuse recently
accessed data items, and spatial locality, the tendency to reference data items that
are close to each other. Together, these tendencies are called the "principle of
locality." Because of this locality behavior, a program tends to access a relatively
small portion of its address space, normally referred to as the working set.
For example, most programs that contain loops tend to access the same instructions
repeatedly from the instruction cache, which shows high temporal locality.
Instructions are normally accessed sequentially, which exhibits high spatial
locality. Also, accesses to the elements of a data array or a record from the data
cache in different iterations of a loop normally tend to have a high degree of
spatial locality.
A memory hierarchy, which consists of multiple levels of memory with different
access times and sizes, is introduced to exploit this locality behavior. Faster
memories are normally more expensive than slower memories. Therefore, it is
advantageous to build a memory system as a hierarchy of levels, with the fast memory
close to the processor and slower memory at the lower levels. When the memory system
is organized as a hierarchy, a level close to the processor is generally called a
cache. The performance of cache memory is critical to memory access time, which
affects overall computer system performance.
The performance goal of adding a cache memory is to achieve an average
memory access time approaching that of the cache memory. The average memory access
time is a measure of the time it takes to read a data item from the memory system.
To achieve this goal, the majority of memory references should be satisfied by the
cache, i.e., the cache miss ratio should be close to zero. This is possible because of
the locality-of-reference property of memory reference streams, even though the size
of the cache is much smaller than the total size of the memory address space of a
program.
The average memory access time can be computed based on three parameters:
Average memory access time = Hit time + Miss ratio × Miss penalty
1. Hit time: the time to access the cache memory, which includes the time to
determine whether the access is a hit or a miss.
2. Miss ratio: the fraction of memory accesses not found in the cache.
3. Miss penalty: the time to access the data from memory when the requested
data is not present in the cache.
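The formula and its three parameters can be illustrated with a minimal sketch. The function name and the numeric values below are illustrative assumptions and are not taken from the dissertation; all times are assumed to be in processor cycles.

```python
def average_memory_access_time(hit_time, miss_ratio, miss_penalty):
    """Return hit_time + miss_ratio * miss_penalty (times share one unit)."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    return hit_time + miss_ratio * miss_penalty


# Hypothetical values: 1-cycle hit time and 50-cycle miss penalty.
amat_low = average_memory_access_time(hit_time=1, miss_ratio=0.02, miss_penalty=50)
amat_high = average_memory_access_time(hit_time=1, miss_ratio=0.10, miss_penalty=50)
print(amat_low, amat_high)
```

With these assumed values, a miss ratio of 2% gives about 2 cycles per access, while a miss ratio of 10% gives about 6 cycles, which shows how strongly the miss ratio dominates once the miss penalty is large.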
To reduce the average memory access time, the hit time, the miss ratio, and the miss
penalty need to be reduced [21]. However, reducing the cache hit time while also
reducing the miss ratio is generally hard to achieve. To reduce the miss ratio,
cache memories are normally organized with added complexity so that they represent
the global locality of reference as accurately as possible. Global locality of
reference means that any data item in the cache memory should have been referenced
more recently than any data item that is not in the cache memory. This complexity
normally increases the cache access time because of the more complex hardware
logic. The goal of a memory hierarchy is to reduce the overall memory access time,
not just the number of misses.
When a cache miss occurs, it can be categorized as a compulsory miss, a
capacity miss, or a conflict miss. Compulsory, capacity, and conflict misses are
defined as follows by Hennessy and Patterson [22].
• Compulsory Misses - The first access to a block finds that the block is not in
the cache, so the block must be brought into the cache. These are also called
cold start misses or first reference misses.
• Capacity Misses - If the cache cannot contain all the blocks needed during
execution of a program, capacity misses will occur due to blocks being discarded
and later retrieved.
• Conflict Misses - If the block-placement strategy is set-associative or
direct-mapped, conflict misses (in addition to compulsory and capacity misses) will
occur because a block can be discarded and later retrieved if too many blocks
map to its set. These are also called collision misses.
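The three categories above can be separated in a trace-driven simulation. The sketch below uses one common approach, which is an assumption here and not necessarily the method used in this dissertation: a miss is compulsory if the block has never been referenced, a capacity miss if a fully associative LRU cache with the same number of frames would also miss, and a conflict miss otherwise. The function name, the modulo set mapping, and the traces are hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_frames, associativity):
    """Count hits and compulsory/capacity/conflict misses for an LRU cache."""
    num_sets = num_frames // associativity
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    full = OrderedDict()  # fully associative LRU reference cache, same size
    seen = set()          # every block referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference cache first.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_frames:
                full.popitem(last=False)  # evict the least-recently used block
            full[block] = True

        # Look up the cache under study (modulo set mapping is assumed).
        target_set = sets[block % num_sets]
        if block in target_set:
            target_set.move_to_end(block)
            counts["hit"] += 1
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif not full_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            if len(target_set) >= associativity:
                target_set.popitem(last=False)
            target_set[block] = True
        seen.add(block)
    return counts


# Blocks 0 and 4 collide in a 4-frame direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_frames=4, associativity=1))
```

In the example trace, the first two references are compulsory misses and the last two are conflict misses, because a fully associative cache of the same size would have held both blocks.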
Restrictions on where a cache line can be placed give rise to several conventional
cache organizations. For each cache organization, the hit time and the miss ratio are
different. If each line has only one place where it can be placed in the cache, the
cache is said to be direct-mapped. If a cache line can be placed anywhere in the
cache, the cache is said to be fully associative. If a cache line can be placed in a
restricted set of places in the cache, the cache is said to be set-associative. If
there are n lines in a set, the cache organization is called n-way set-associative.
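The three placement rules differ only in how many frames are candidates for a given line. A minimal sketch follows, assuming a modulo mapping from block address to set and a contiguous frame numbering within each set; both are illustrative assumptions, as is the function name.

```python
def candidate_frames(block_address, num_frames, associativity):
    """Return the frame indices where a block may be placed.

    associativity == 1          -> direct-mapped (one candidate frame)
    associativity == num_frames -> fully associative (every frame)
    otherwise                   -> n-way set-associative (n candidates)
    """
    if num_frames % associativity != 0:
        raise ValueError("associativity must divide num_frames")
    num_sets = num_frames // associativity
    set_index = block_address % num_sets  # assumed modulo set mapping
    first = set_index * associativity     # assumed contiguous frames per set
    return list(range(first, first + associativity))


# The same block in an 8-frame cache under three organizations.
print(candidate_frames(13, num_frames=8, associativity=1))  # direct-mapped
print(candidate_frames(13, num_frames=8, associativity=2))  # 2-way
print(candidate_frames(13, num_frames=8, associativity=8))  # fully associative
```

A lookup has to compare tags in every candidate frame, so the length of the returned list is a rough indicator of how much search work each organization requires per access.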
The direct-mapped cache topology can generally achieve a faster cache hit time
than a set-associative or fully associative cache. This is due to the fact that the
direct-mapped cache is simple and searches only one location of the cache to
determine whether a reference is a hit or not. An advantage of a set-associative cache over the