Register Optimizations for Stencils on GPUs

Prashant Singh Rawat (The Ohio State University), Aravind Sukumaran-Rajam (The Ohio State University), Atanas Rountev (The Ohio State University), Fabrice Rastello (INRIA), Louis-Noël Pouchet (Colorado State University), P. Sadayappan (The Ohio State University)
Abstract

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

CCS Concepts  • Software and its engineering → Compilers;

ACM Reference Format:
Prashant Singh Rawat, Aravind Sukumaran-Rajam, Atanas Rountev, Fabrice Rastello, Louis-Noël Pouchet, and P. Sadayappan. 2018. Register Optimizations for Stencils on GPUs. In PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 24–28, 2018, Vienna, Austria. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3178487.3178500

1 Introduction

Stencil computations are an important computational motif in many scientific applications. Typically, a simple stencil computation updates elements of one or more output arrays using elements in the spatial neighborhood from one or more input arrays. The footprint of a stencil is determined by its order, which is the number of input elements from the center read along each dimension. In many scientific applications, the stencil order determines the computational accuracy. For this reason, high-order stencils have been gaining popularity. However, the inherent data reuse within or across statements in such high-order stencils exposes performance challenges that are not addressed by current stencil optimizers.

A significant focus in optimizing stencil computations has been to fuse operations across time steps or across a sequence of stencils in a pipeline [5, 21, 22, 36, 44, 54, 59]. With high-order stencils, the operational intensity is sufficiently high so that even with just a simple spatial tiling, the computation should theoretically not be memory-bandwidth bound. Consider a GPU with around 300 GBytes/sec global memory bandwidth and a peak double-precision performance of around 1.5 TFLOPS. The required operational intensity to be compute-bound and not memory-bandwidth bound is around 5 FLOPs/byte, or 40 FLOPs per double-word. Many high-order stencil computations have much higher arithmetic intensities than 40. For such stencils, achieving a high degree of reuse in cache is very feasible, but high performance is not realized on GPUs. The main hindrance to performance is the high register pressure with such codes, resulting in excessive register spilling and a subsequent loss of performance.

As we elaborate in the next section, existing register management techniques in production compilers are not well equipped to address the problem of register pressure for high-order stencils. Addressing this problem in the context of GPUs is even more challenging, since most of the widely used GPU compilers like NVCC [38] are closed-source. Even the recent effort by Google (gpucc [57]) only exposes the front-end to the user, and uses the NVCC backend as a black box to perform instruction scheduling and register allocation.

In this paper, we develop an effective pattern-driven global optimization strategy for instruction reordering to address this problem. The key idea behind the instruction reordering approach is to model reuse in high-order stencil computations by using an abstraction of a DAG of trees with shared nodes/leaves, and exploit the fact that optimal scheduling to minimize registers for a single tree with distinct operands at the leaves is well known [47]. We thus devise a statement reordering strategy for a DAG of trees with shared nodes that enables reduction of register pressure to improve performance. The paper makes the following contributions:

• It proposes a framework for multi-statement stencils that reduces register pressure by reordering instructions across statements.
• It describes novel heuristics to schedule a DAG of trees that reuse data using a minimal number of registers.
• It demonstrates the effectiveness of the proposed framework on a number of register-constrained stencil kernels.
Figure 1. Comparing the same stencil computation with different sweeping orders.

(a) Stencil with lexicographical sweeps:

for (i=2; i<N-2; i++)
  for (j=2; j<N-2; j++) {
    out[i][j] = 0;
    for (ii=-2; ii<=2; ii++)
      for (jj=-2; jj<=2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];
  }

(b) Stencil with reverse-lexicographical sweeps:

for (i=2; i<N-2; i++)
  for (j=2; j<N-2; j++) {
    out[i][j] = 0;
    for (ii=2; ii>=-2; ii--)
      for (jj=-2; jj<=2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];
  }
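The "unrolled version" of Figure 1a discussed in Section 2 has roughly the following shape. This is only a minimal sketch with an illustrative unroll factor of 2 and boundary handling omitted; the experiments in Section 5 use other factors.

  /* Sketch: Figure 1a with the i-loop unrolled by 2 (illustrative factor).
     out[i][j] and out[i+1][j] read overlapping 5x5 neighborhoods of in[],
     so 20 of the 25 loads for the second point can be reused from registers
     if the schedule keeps their live ranges short. */
  for (int i = 2; i < N-2; i += 2)
    for (int j = 2; j < N-2; j++) {
      out[i][j] = 0;
      out[i+1][j] = 0;
      for (int ii = -2; ii <= 2; ii++)
        for (int jj = -2; jj <= 2; jj++) {
          out[i][j]   += in[i+ii][j+jj]   * w[ii+2][jj+2];
          out[i+1][j] += in[i+1+ii][j+jj] * w[ii+2][jj+2];
        }
    }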
2 Background and Motivation

Register Allocation and Instruction Scheduling  A compiler has several optimization passes, register allocation and instruction scheduling being two of them. Passes before register allocation manipulate an intermediate representation with an unbounded number of temporary variables. The goal of register allocation is to assign those temporaries to physical storage locations, favoring the few but fast registers over the slower but larger memory.

For a fixed schedule, a common approach to perform register allocation is to build an interference graph of the program, which captures the intersection of the live-ranges of temporaries at any program point. Register assignment is then reduced to coloring the interference graph, where each color represents a distinct register [10, 11]. Interfering nodes in the interference graph are assigned different colors due to their adjacency. The number of registers needed by the coloring algorithm is lower-bounded by the maximum number of intersecting live-ranges at any program point (MAXLIVE). If MAXLIVE is more than the number of physical registers, spilling of registers and the consequent load/store operations from/to memory are unavoidable.

Register pressure can sometimes be alleviated by reordering the schedule of dependent instructions to reduce MAXLIVE. Reordering independent instructions is often used to enhance the amount of instruction-level parallelism (ILP), for hiding memory access latency. Thus, there is a complex interplay between instruction scheduling and register allocation, affecting instruction-level parallelism and register pressure, and the associated optimization problem is highly combinatorial. Production compilers generally use heuristics for increasing ILP, with a best-effort greedy control on register pressure. For typical application codes, the negative effect on register pressure is not very significant. However, for high-order stencil codes with a large number of operations and a lot of potential register-level reuse, the impact can be very high, as illustrated by an example below.

Illustrative Example  Consider an unrolled version of the double-precision 2D Jacobi stencil computation (Figure 1a) from [50]. NVCC interleaves the contribution from each input point to different output points to increase instruction-level parallelism (ILP). The interleaving performed to increase ILP also has the serendipitous effect of reducing the live range of the register data, and a consequent reduction in register pressure. Nvprof [39] profiling data on a Tesla K40c device shows that under maximum occupancy, this version performs 3.73E+06 spill transactions, achieving 467 GFLOPS.

Figure 1b shows the same stencil computation after changing the order of accumulation. Exactly the same contributions are made to each result array element, but the order of the contributions has been reversed. With this access pattern for the code in Figure 1b, NVCC fails to perform the same interleaving despite allowing reassociation via appropriate compilation flags. In fact, the register pressure is now exacerbated by the consecutive scheduling of independent operations to increase ILP. For this version, 1.58E+08 spill transactions were measured, with performance dropping to 51 GFLOPS.

This example illustrates a problem with register allocation when the computation has a specific reuse pattern, characteristic of high-order stencil computations. The problem stems from the fact that for most compilers the register allocation and instruction scheduling algorithms that operate at a basic-block level have a peephole view of the computation – they make scheduling/allocation decisions without a global perspective, and thus sometimes work antagonistically. Meanwhile, stencil computations typically have a very regular access pattern. With a better understanding of the pattern, and a global perspective on the computation, it is feasible to devise an instruction reordering strategy to alleviate register pressure.
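As a toy illustration of this tension (our own example, not one of the paper's benchmarks), consider two equivalent orderings of the same contributions to two output points; Schedule B consumes each loaded value in all of its uses right away, in the spirit of the input-point-major interleaving described above for Figure 1a. The names in, out, and w0..w2 are illustrative.

  // Schedule A: all loads hoisted to expose ILP -- four loaded temporaries
  // are live simultaneously before any of them is consumed.
  double t0 = in[i-1], t1 = in[i], t2 = in[i+1], t3 = in[i+2];
  out[i]   = w0*t0 + w1*t1 + w2*t2;
  out[i+1] = w0*t1 + w1*t2 + w2*t3;

  // Schedule B: each loaded value feeds all of its uses immediately, so only
  // one loaded value (plus the two accumulators) stays live at any point.
  out[i]  = w0*in[i-1];
  double t = in[i];   out[i] += w1*t;  out[i+1]  = w0*t;
  t = in[i+1];        out[i] += w2*t;  out[i+1] += w1*t;
  out[i+1] += w2*in[i+2];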
Solution Approach  In this paper, we circumvent the complexity of the general optimization problem of instruction reordering and register allocation by devising a pattern-specific optimization strategy. Stencil computations involve accumulation of contributions from array data elements in a small neighborhood around each element. The additive contributions to a data element may be viewed as an expression tree. Thus, for multi-statement stencils, we have a DAG of expression trees. Due to the fact that an element may contribute to several result elements, the trees within the DAG can have many shared leaves.

Given a single tree without any shared leaves, it is well known [47] how to schedule its operations in order to minimize the number of registers needed. We use this as the basis for developing heuristics to schedule the operations from the DAG of trees with shared leaves. In contrast to the problem of reordering an arbitrary sequence of instructions to minimize register pressure, a structured approach of adapting the optimal schedule for isolated trees to the case of a DAG of trees with shared leaves results in an efficient and effective algorithm that we develop in the next two sections.

3 Scheduling DAG of Expression Trees

Stencil computations are often succinctly represented using a domain-specific language (DSL). Listing 1 shows a 7-point Jacobi stencil expressed in an illustrative DSL, similar in spirit to stencil computation DSLs such as SDSL [25] and Forma [43]. The core computation is shown in lines 2–4. As with similar DSLs, the user can specify unroll factors for loop iterators (line 9). Loop unrolling, or thread coarsening on GPUs, is often used to exploit register-level reuse in the code. The computation is automatically unrolled as a preprocessing step, before the code is generated and optimized.

Listing 1. The input representation in the DSL
1  function j3d7pt (out, in, a, b, c) {
2    out[k][j][i] = a*(in[k+1][j][i]) + b*(in[k][j-1][i] +
3      in[k][j][i-1] + in[k][j][i] + in[k][j][i+1] +
4      in[k][j+1][i]) + c*(in[k-1][j][i]);
5  }
6  parameter L, M, N;
7  iterator k, j, i;
8  double in[L][M][N], out[L][M][N], a, b, c;
9  unroll k=2, j=2;
10 j3d7pt (out, in, a, b, c);
11 return out;

It is important to note that using a DSL is not a prerequisite for using the scheduling techniques proposed in this work. As described shortly, our approach works on a DAG of expression trees. This DAG can be automatically extracted either from the DSL representation or from C/Fortran code.

A stencil statement can be defined by the stencil shape (as in lines 2–4) and the input/output data (as in line 8). Each such stencil statement can be represented by a labeled expression tree. For example, the tree corresponding to the computation in Listing 1 has array element out[k,j,i] as its root, scalars a, b, c and accesses to elements of array in as its leaves, and arithmetic operators ∗ and + as inner nodes.

An expression tree for a stencil computation has three types of nodes: (1) nodes n ∈ N_mem representing accesses to memory locations, (2) nodes n ∈ N_op representing binary/unary arithmetic operators, and (3) leaf nodes representing constants. All leaf nodes in N_mem correspond to reads of array elements (e.g., in[k+1, j, i]) or scalars. The root of the expression tree is also in N_mem and corresponds to a write to an array element (e.g., out[k, j, i]) or a scalar. We associate a unique label with each read/written memory location, and assign to each node in N_mem the corresponding label. The remaining tree nodes are in N_op. Figure 2b shows the expression tree for an illustrative expression.

Figure 2. Expression tree example: (a) the illustrative stencil statement out = a + (b * c[i]) + d[i] + ((e[i] * f) / 2.3); (b) its expression tree; (c) the expression tree with accumulations.

In a preprocessing step, we introduce k-ary nodes for associative operators. For example, for the tree in Figure 2b, the chain of + nodes is replaced with a single "accumulation" + node. Figure 2c shows the resulting expression tree; the numbers on the nodes will be described shortly. The semantics of an accumulation node is as expected: the value is initialized as appropriate (e.g., 0 for +, 1 for ∗) and the contributions of the children are accumulated in arbitrary order.

We often consider a sequence of stencil computations; for example, in image processing pipelines [43]. Each computation in the sequence will be represented by a separate expression tree. Similarly, unrolling will result in distinct expression trees for each unrolled instance. For example, after unrolling along dimensions k and j in Listing 1, there will be a sequence of four expression trees. In some cases the output of a tree is used as an input to a later tree in the sequence. In such a case, there is a flow dependence: the root of the producer tree has the same label as some leaf node of the consumer tree (without an in-between tree that writes to that label). In the input to our analysis, this flow is represented by a dependence edge from the root node to the leaf node. Thus, the entire computation is represented as a DAG of expression trees.
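A minimal C++ sketch of this representation (the type and field names are ours, not the framework's API):

  #include <string>
  #include <utility>
  #include <vector>

  // Node kinds: memory accesses (reads/writes, one label per distinct
  // location), operators (possibly k-ary accumulation nodes), and constants.
  enum class Kind { Mem, Op, Const };

  struct Node {
    Kind kind;
    std::string label;            // for Mem nodes, e.g. "in[k+1][j][i]"
    char op;                      // for Op nodes: '+', '*', ...
    std::vector<Node*> children;  // k children for an accumulation node
  };

  struct Tree {
    Node* root;                   // a Mem node: the written array element/scalar
  };

  // The whole computation: a sequence of trees plus flow-dependence edges
  // from a producer tree's root to matching leaves of later trees.
  struct Dag {
    std::vector<Tree> trees;
    std::vector<std::pair<Node*, Node*>> flow_edges;  // (producer root, consumer leaf)
  };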
Throughout the paper, we make the following two assumptions: (1) the assembly instructions generated for the DAG of trees after register allocation are of the form r1 ← r2 op r3, where r1 and r2 can be the same; (2) each operand/result requires exactly one register. This condition is only enforced to simplify the presentation of the next two sections, and can be very easily relaxed [4]. Our objective is to schedule the computations in the DAG so that the register pressure is reduced.

3.1 Sethi-Ullman Scheduling

We will use "data sharing" to refer to cases where the same memory location is accessed at multiple places. There are two types of data sharing: (1) within a tree: several nodes from N_mem have the same label; and (2) across trees: in a DAG of trees, nodes from distinct trees have the same label.

A classic result, due to Sethi and Ullman [47], applies to a single expression tree without data sharing (i.e., each n ∈ N_mem has a unique label), and with binary/unary operators. They present a scheduling algorithm that minimizes the number of registers needed to evaluate such an expression tree under a spill-free model.¹ Each tree node n is assigned an Ershov number [1]; we will refer to them as "Sethi-Ullman numbers" and denote them by su(n). They are defined as

  su(n) =  1                       if n is a leaf
           su(n1)                  if n has a lone child n1
           max(su(n1), su(n2))     if su(n1) ≠ su(n2)
           1 + su(n1)              if su(n1) = su(n2)          (1)

¹ In a spill-free model of the computation, a data element is loaded only once into a register for all its uses/defs.

The last two cases apply to a binary op node n with children n1 and n2. Intuitively, su(n) is the smallest possible number of registers used for the evaluation of the subtree rooted at n. The first two cases are self-explanatory. For a binary op node n, if one child n′ has a higher register requirement (case 3), this "big" child should be evaluated first. The result of n′ will be stored in a register, which will be alive while the second ("small") child is being evaluated. The remaining su(n′)−1 registers used for n′ are available (and enough) to evaluate the small child. Finally, the register of n′ can be used to store the result for n, meaning that su(n) is equal to su(n′). If the order of evaluation were reversed, the result of the small child would have to be kept in a register while n′ is being evaluated, which would lead to a sub-optimal su(n) = 1 + su(n′). In the last case in equation (1), both children have the same register needs; thus, their relative order of evaluation is irrelevant and one extra register is needed for n. Of course, under the definitions in equation (1), su(n) is the same as MAXLIVE for the tree rooted at n.

It is straightforward to generalize Sethi-Ullman numbering to trees containing accumulation nodes (as in Figure 2c). Each such accumulation node n has children n_i for 1 ≤ i ≤ k. Let mx = max_i{su(n_i)}. If there is a single child n_j with su(n_j) = mx, this child is scheduled for evaluation first, and therefore su(n) = mx. If two or more children n_j have su(n_j) = mx, one of them is scheduled first; however, in this case su(n) = 1 + mx. In both cases, the order of evaluation of the remaining children is irrelevant. Figure 2c shows the Sethi-Ullman numbers for the sample expression tree.
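Equation (1), together with the accumulation-node generalization above, can be computed bottom-up. The following is only a sketch, using the Node type from the earlier sketch rather than the framework's actual code:

  // Sethi-Ullman number of a tree *without* data sharing, per equation (1),
  // extended to k-ary accumulation nodes as described in the text.
  int su(const Node* n) {
    if (n->children.empty()) return 1;                       // leaf
    if (n->children.size() == 1) return su(n->children[0]);  // lone child
    int mx = 0, children_at_mx = 0;
    for (const Node* c : n->children) {
      int s = su(c);
      if (s > mx) { mx = s; children_at_mx = 1; }
      else if (s == mx) ++children_at_mx;
    }
    // Binary node: max of the two if they differ, else 1 + su(n1).
    // Accumulation node: mx if a single child attains mx, else 1 + mx.
    return (children_at_mx == 1) ? mx : 1 + mx;
  }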
Note that the schedules produced by this approach perform atomic evaluation of subexpressions: one of the children is evaluated completely before the other ones are considered. For a tree without data sharing, this restriction does not affect the optimality of the result. In the presence of data sharing, atomic evaluation may not be optimal.

Since stencils read values from a limited spatial neighborhood, data sharing often manifests in the DAG of expression trees. For example, in Listing 1, in[k][j][i] will be an input to all four expression trees corresponding to the unrolled stencil statements. One can also find other nodes in Listing 1 that will be shared across multiple expression trees. For such DAGs, the Sethi-Ullman algorithm cannot be directly applied to obtain an optimal schedule. In Section 3.2, we present an approach to compute an optimal atomic schedule for a DAG of expression trees with data sharing. In cases when finding an optimal evaluation can be prohibitively expensive, Section 3.3 presents heuristics to trade off optimality in favor of pruning the exploration space. Finally, restricting the evaluation to be atomic can generate sub-optimal schedules. Section 4 presents a remedial slice-and-interleave algorithm that performs interleaving in the output schedule generated by the approach presented in Section 3.2.

3.2 Scheduling a Tree with Data Sharing

Figure 3. Scheduling a tree with data sharing: (a) tree with data sharing; (b) scheduling cost.

Figure 3a shows an expression tree with data sharing. For illustration, nodes with the same label are connected in the figure. Recall that we assume a spill-free model; therefore a shared label loaded once into a register will remain live for all its uses. With data sharing, there is a possibility that (1) a label is already live before we begin the recursive evaluation of a subtree that has its subsequent use, and/or (2) a label must remain live even after the evaluation of the subtree in which it is used. The optimal schedule of a subtree is affected by the labels that are live before and after the evaluation of the subtree. Therefore, we need to add live-in/out states as parameters to the computation of the optimal schedule of a subtree. In this section, we present an approach to optimally schedule a tree with data sharing, under the model of atomic evaluation of children; we defer the interleaving of computation across subtrees to Section 4.

For a node n, let uses(n) be the set of labels used in the subtree rooted at n. Figure 3a shows uses(n) for each internal node n. The live-in set for a node n, denoted by in(n), contains all labels that are live before the subtree rooted at n is evaluated. The live-out set is

  out(n) = (in(n) ∪ uses(n)) \ kill(n)          (2)

where kill(n) is the set of labels that have their last uses in the subtree rooted at n. Note that kill(n) is context-dependent, i.e., the set will vary depending on the order in which the node is evaluated. The kill sets can be computed on the fly by maintaining the number of occurrences of each label l in the current schedule, and comparing it with the total number of occurrences of l in the entire DAG.
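A sketch of that bookkeeping (the names are ours):

  #include <map>
  #include <set>
  #include <string>

  using Labels = std::set<std::string>;

  // total[l]: occurrences of label l in the whole DAG (computed once up front);
  // seen[l]:  occurrences of l in the schedule emitted so far.
  struct LabelCounts {
    std::map<std::string, int> total, seen;

    // Record one more use of l; returns true iff this was its last use,
    // i.e. l belongs to the kill set at this point of the schedule.
    bool use(const std::string& l) { return ++seen[l] == total[l]; }
  };

  // Live-out per equation (2): labels live before n, plus labels used in the
  // subtree rooted at n, minus those whose last use lies inside that subtree.
  Labels live_out(const Labels& in, const Labels& uses, const Labels& kill) {
    Labels out = in;
    out.insert(uses.begin(), uses.end());
    for (const std::string& l : kill) out.erase(l);
    return out;
  }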
We now show how to compute a modified Sethi-Ullman number, su′, for each node n, when provided with an "evaluation context" in terms of live-in and live-out labels. Consider a node n with some in and out state. Just before the evaluation of n begins, |in(n)| registers are live. Similarly, just after the evaluation of n finishes, |out(n)| registers will be live. During the evaluation of n, additional registers may become live, while some of the other live registers may be released. Now su′(n, in, out) represents the maximum number of registers that were simultaneously live at any point during the evaluation of n. We also define su′(π, in, out), where π is a sequence of the children nodes of n. This value will represent the maximum number of registers that were simultaneously live at any point during the evaluation of the sequence of children described by π. For simplicity we will use su′(n) instead of su′(n, in, out), but the definitions will use the live-in/out sets in(n) and out(n).

For a leaf node n ∈ N_mem with |in(n)| = α,

  su′(n) =  α + 1    if label(n) ∉ in(n)
            α        if label(n) ∈ in(n)

The first case implies that a new register must be reserved for the label of n if it is not already live before the evaluation of n. The second case is self-explanatory.

To compute su′ for a k-ary (binary or accumulation) node n with children n1 ... nk, we need to explore all k! evaluation orders of the children. Let π be any permutation of the children of n representing their evaluation order. Then su′(n) = min_π su′(π).

For the purpose of explanation, suppose the permutation π = ⟨n1, n2⟩ is one particular evaluation order for a binary node n with children n1, n2. To compute su′(π), first we determine the live-in and live-out sets for nodes in π as follows: in(n1) = in(n), in(n2) = out(n1), and out(n2) = out(n); here out(n1) is as defined in equation (2). This provides the required context to compute su′(n1) and su′(n2). Let mx = max_i{su′(n_i)}, so that mx equals the maximum number of simultaneously live registers at any time during the evaluation of π. Then,

  su′(π) =  1 + mx    if n_i ∈ N_mem and label(n_i) ∈ out(n)
            mx        otherwise

In case 2, if n1 ∈ N_op, or if n1 ∈ N_mem but label(n1) ∉ out(n), then the result of the computation, identified by the label of the node n, can reuse the register of n1 (similarly for n2). However, in case 1, where both n1 and n2 are leaf nodes in N_mem and both must be live after evaluating n, we need an additional register to hold the result.

For an accumulation node with k children, consider permutation π = ⟨n1, n2, ..., nk⟩. Suppose we have computed all su′(n_i) and let mx = max_i{su′(n_i)}. Then,

  su′(π) =  mx        if su′(n1) = mx and n1 ∈ N_op and su′(n_j) ≠ mx, 2 ≤ j ≤ k
            mx        if su′(n1) = mx and n1 ∈ N_mem and label(n1) ∉ out(n) and su′(n_j) ≠ mx, 2 ≤ j ≤ k
            1 + mx    otherwise

Just like the generalization of su(n) for accumulation nodes in Section 3.1, su′ = mx when the following two conditions hold: (1) n1 requires the maximum number of simultaneously live registers, and the rest of the nodes in π can be completely evaluated using the registers released by n1, and (2) the register holding n1 can be reused by n, i.e., either n1 ∈ N_op (case 2), or n1 is a leaf node that is not live beyond this point (case 1). In all other scenarios, we need mx + 1 registers (case 3).

Figure 4. Example: computing su′(π): (a) tree with context at root; (b) optimal schedule.

The computation of su′(n) for a tree without an evaluation context is shown in Figure 3b, and with an evaluation context is shown in Figure 4a. For the same tree, Figure 4b shows the permutation with minimum su′. In all three figures, the children of a node are ordered left-to-right, which defines the corresponding permutation.
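A sketch of the two base rules above, using the Node, Kind, and Labels types from the earlier sketches. The su′ values of the children are assumed to have been computed with the contexts in(n1) = in(n), in(n2) = out(n1), out(n2) = out(n); this is illustrative only.

  #include <algorithm>

  // Leaf rule: a new register is needed only if the label is not already live.
  int su_leaf(const Node* n, const Labels& in) {
    return static_cast<int>(in.size()) + (in.count(n->label) ? 0 : 1);
  }

  // One permutation <n1, n2> of a binary node: an extra register holds the
  // result only when neither child's register can be reused, i.e. both
  // children are memory nodes whose labels stay live in out(n).
  int su_binary_perm(const Node* n1, const Node* n2, int su1, int su2,
                     const Labels& out) {
    int mx = std::max(su1, su2);
    bool n1_stays = (n1->kind == Kind::Mem) && out.count(n1->label);
    bool n2_stays = (n2->kind == Kind::Mem) && out.count(n2->label);
    return (n1_stays && n2_stays) ? 1 + mx : mx;
  }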
In some cases, exhaustively exploring all permutations of the children may be unnecessary. In the tree of Figure 4a, there are two subtree operands of the accumulation node that share no data (the first and the third subtree operands in the left-to-right order). Therefore, even though the scheduling within those two subtrees may be influenced by the evaluation context, they do not influence each other's evaluation. Let passthrough denote the labels that are live both before and after the evaluation of node n: passthrough(n) = in(n) ∩ out(n). Then, for a k-ary node n, any two of its children n_i and n_j do not influence each other's evaluation if

  (uses(n_i) ∩ uses(n_j)) \ passthrough(n) = ∅          (3)

In such a scenario, for the node n, we can create maximal clusters of its children that share data, but the only data shared among clusters is the passthrough labels of n. For example, if children t1 and t2 share label l1, and children t2 and t3 share label l2, where l1, l2 are non-passthrough labels of the parent node, then {t1, t2, t3} must belong to the same cluster. We extend the intuition behind the Sethi-Ullman scheduling algorithm to establish that different clusters cannot influence each other's evaluation. Then, for each cluster c_i, we can independently compute su′(c_i) with in(c_i) = in(n). We only need to explore all permutations within the non-singleton clusters. We propose the following theorems to establish an evaluation order for the clusters.

Theorem 3.1. For k clusters c_i, 1 ≤ i ≤ k, such that |in(c_i)| ≤ |out(c_i)|, the one with larger su′(c_i) − |out(c_i)| will be prioritized for evaluation over others in the optimal schedule. In the special case where all the clusters have the same su′(c_i) − |out(c_i)|, they can be evaluated in any order without affecting MAXLIVE.

This result is a direct consequence of the Sethi-Ullman algorithm. The cluster with larger su′(c_i) − |out(c_i)| will release more registers, which can be reused by the next cluster. The special case too is a direct consequence of the Sethi-Ullman algorithm, where two sibling nodes with the same su can be evaluated in any order (case 4 of equation (1)).

Theorem 3.2. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| ≤ |out(c2)|, c1 must be evaluated before c2 in the optimal schedule.

We prove the result by contradiction. Suppose that c2 is evaluated before c1 in the optimal schedule. Since the schedule is optimal, su′(c2) ≥ su′(c1). Now we change this optimal schedule by moving the evaluation of c1 before c2. Evaluating c1 earlier will release |in(c1)| − |out(c1)| (i.e., ≥ 1) registers, which can then be used in the evaluation of c2. Based on the previous equations, su′(c2) will either decrease or remain the same, depending on whether the number of registers released by c1 is greater than, or equal to, 1. This modified schedule therefore either has the same, or a lower, su′ than the optimal schedule, making it an optimal schedule.

Theorem 3.3. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| > |out(c2)|, the one with smaller su′ must be prioritized for evaluation in the optimal schedule.

Again, we prove the result by contradiction. Suppose that su′(c1) < su′(c2), and c2 is scheduled before c1 in the optimal schedule. We change this schedule by moving the evaluation of c1 before that of c2. From Theorem 3.2, su′(c2) after this change will either remain the same, or decrease. Thus, su′ for the new schedule will either be the same, or reduce if su′(c2) was the maximum, making it an optimal schedule.

Based on these theorems, Algorithm 1 summarizes the evaluation of an optimal schedule for a tree with data sharing.

Algorithm 1: Schedule-Tree(n, in, out)
Input: A tree rooted at n with live-in/out contexts in and out
Output: An optimal schedule S for the tree
1  sched_cost ← ∅, S ← ∅;
2  C ← create_maximal_clusters(n);  (Sec. 3.2)
3  foreach cluster c in C do
4    if |c| is 1 then
5      in(c) ← in(n);
6      out(c) ← computed using equation (2);
7      sched_cost[c] ← su′(c);
8    else
9      compute in and out for each tree in c;  (Sec. 3.2)
10     π ← all permutations of the trees in c;
11     sched_cost[c] ← su′(π);
12 P ← sequence clusters using sched_cost and Thms. 3.1, 3.2, 3.3;
13 foreach subtree ts in the sequence described by P do
14   append the schedule for ts in S;
15 return S;

3.3 Heuristics for Tractability

For a non-singleton cluster c, the algorithm presented in Section 3.2 can become prohibitively expensive if |c| is large. For example, when |c| changes from 7 to 8, the permutations explored increase from 5040 to 40320. We now present some heuristics that trade off optimality for tractability, and a caching technique to further speed up the algorithm.

Pruning Heuristics  We begin by establishing, for any node n, the bounds on su′(n). When n is evaluated with non-empty contexts in and out, the bounds are:

  su′(n, ∅, ∅) ≤ su′(n, in, out) ≤ su′(n, ∅, ∅) + |in ∪ out|

We prove the lower bound by contradiction; the proof for the upper bound is omitted due to space constraints.

su′(n, ∅, ∅) ≤ su′(n, in, out): Assume to the contrary that su′(n, in, out) < su′(n, ∅, ∅). We will modify the schedule S corresponding to su′(n, in, out) as follows: prepend a stage to S that loads the labels ∈ in(n) into |in| registers, and make in(n) = ∅. Append a stage to S that stores all the labels ∈ out(n) from the respective registers into memory, and make out(n) = ∅. This modified schedule corresponds to su′(n, ∅, ∅), and hence su′(n, ∅, ∅) = su′(n, in, out).

With the bounds established, instead of exploring all permutations, we can sacrifice optimality and stop further exploration when we are close to the optimal schedule. We use a tunable parameter d, and stop trying the permutations any further when su′(n, in, out) − su′(n, ∅, ∅) ≤ d. For the experimental evaluation in Section 5, we set d to 1.
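A sketch of the pruned search over a cluster's child permutations, using the Node and Labels types from the earlier sketches. Here su_of_permutation is a stand-in for the su′(π) computation of Section 3.2; the real exploration also orders clusters using Theorems 3.1–3.3 and applies the partitioning heuristic described next.

  #include <algorithm>
  #include <climits>
  #include <numeric>
  #include <vector>

  // Hypothetical helper: cost of evaluating the subtrees in the given order.
  int su_of_permutation(const std::vector<Node*>& subtrees,
                        const std::vector<int>& order,
                        const Labels& in, const Labels& out);

  int best_su_for_cluster(const std::vector<Node*>& subtrees, const Labels& in,
                          const Labels& out, int lower_bound, int d) {
    std::vector<int> order(subtrees.size());
    std::iota(order.begin(), order.end(), 0);     // identity permutation
    int best = INT_MAX;
    do {
      best = std::min(best, su_of_permutation(subtrees, order, in, out));
      if (best - lower_bound <= d) break;         // close enough to optimal: stop
    } while (std::next_permutation(order.begin(), order.end()));
    return best;
  }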
For a cluster c with |c| > 8, we also apply a partitioning heuristic, which recursively partitions the subtrees in c into sub-partitions where each sub-partition can be of a maximum size p, with p < 8. The partitioning is based on either of the two criteria:

− on "label affinity": the subtrees that share the maximum labels are greedily assigned to the same sub-partition as long as the size of the sub-partition is less than p. Such partitioning is based on the notion that evaluating subtrees with maximum uses together will potentially reduce passthrough labels, and MAXLIVE.
− on "release potential": the subtrees that have the last uses of some labels are placed in a sub-partition, and that sub-partition is eagerly evaluated. This partitioning is based on the notion that the released registers can be reused by the next partition.

Once the sub-partitions are created, we only exhaustively explore all permutations of subtrees within a sub-partition. If the number of sub-partitions created is less than 8, then we also try all the permutations of the sub-partitions themselves. For example, if |c| = 8, and the partitioning heuristic creates two sub-partitions p1 and p2 of size 4 each, then our exploration space will be {p1, p2} and {p2, p1}, while exploring all 4! permutations of subtrees within p1 and p2 each – a total of 2 × 4! × 4! permutations instead of 8! permutations. We also let the user externally specify a threshold that upper-bounds the total number of permutations for a tree.

Memoization  For a node n, a lot of permutations of its children will differ in only a few positions. In such cases, we end up recomputing su′ for a child multiple times, even when the live-in/out context for the child remains unchanged. These recomputations can be avoided by a simple memoization, where for a node n, we map su′(n) as a function of a minimal context. The minimal context strips away labels that are not in uses(n), but are in passthrough(n). The su′(n) with minimal context can be suitably adjusted to get su′(n) with a different context that has some passthrough labels added to the minimal live-in/out. For example, suppose that su′(n) is 3 when the minimal in(n) = {a, b} and the minimal out(n) = ∅. Then su′(n) when evaluating it with in(n) = {a, b, c}, out(n) = {c} and c ∉ uses(n) will be 2+1 = 3, and the optimal schedule will remain unchanged. Memoization greatly reduces the total evaluation time, thereby enabling the exploration of a large number of permutations.
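A sketch of such a memoization table keyed by the minimal context, using the Node and Labels types from the earlier sketches. CtxKey, minimal, and su_prime are our illustrative names; the framework may canonicalize contexts differently.

  #include <map>
  #include <set>
  #include <string>
  #include <utility>

  using CtxKey = std::pair<const Node*, std::pair<Labels, Labels>>;

  // Hypothetical helper: the full context-sensitive su' computation.
  int su_prime(const Node* n, const Labels& in, const Labels& out);

  // Strip labels that the subtree never uses (they are passthrough for n).
  Labels minimal(const Labels& ctx, const Labels& uses) {
    Labels m;
    for (const std::string& l : ctx)
      if (uses.count(l)) m.insert(l);
    return m;
  }

  int su_prime_memo(const Node* n, const Labels& in, const Labels& out,
                    const Labels& uses, std::map<CtxKey, int>& cache) {
    CtxKey key{n, {minimal(in, uses), minimal(out, uses)}};
    auto it = cache.find(key);
    if (it == cache.end())
      it = cache.emplace(key, su_prime(n, key.second.first, key.second.second)).first;
    // Callers adjust this cached value for the passthrough labels stripped
    // from the query context, as described in the text (not shown here).
    return it->second;
  }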
3.4 Scheduling a DAG of Expression Trees

For each stencil statement that is mapped to an expression tree, Section 3.2 described a way to schedule it. This section ties everything together for a multi-statement stencil by describing how to schedule a DAG of expression trees.

Algorithm 2: Schedule-DAG(D, R)
Input: D: DAG of expression trees, R: per-thread register limit
Output: An optimal schedule S for D
1  D′ ← D; fusion_feasible ← true; tree_order ← ∅; S ← ∅;
2  while fusion_feasible do
3    foreach pair of transitive dep-free nodes ti, tj in D′ do
4      M ∪= compute_metric(D′, ti, tj);  (Sec. 3.4)
5    sort_descending(M);
6    (tp, tq, fusion_feasible) ← find_fusion_candidate(M);
7    fuse tp and tq;
8    update_dependence_edges(D′, tp, tq);
9  foreach node d in D′ do
10   append the tree sequence of d in tree_order;
11 split_versions ← create_split_versions(tree_order);
12 foreach split in split_versions do
13   S′ ← ∅;
14   foreach kernel k in split do
15     foreach tree t in k do
16       compute in and out for t;
17       append output of Schedule-Tree(t, in, out) to S′;
18   execute S′ after compiling it with register limit R;
19   S ← S′ if S′ is a faster schedule than S, or if S is ∅;
20 return S;

For optimal scheduling, one needs to explore all topological orders for the trees in the DAG, and then evaluate all the trees independently for each topological order. This may be practical if the size of the DAG is small. Otherwise, we must sacrifice optimality for tractability, and fix the evaluation order of the trees in the DAG before the trees are individually evaluated.

We use the greedy heuristic described by Rawat et al. [44] to fix the evaluation order of trees in the DAG. At each step, the heuristic tries to fix the evaluation order of two nodes in the DAG. We begin by computing, for each pair of transitive dependence-free trees p_i in the DAG, a metric M_i that encodes: (a) the number of labels shared between them, and (b) the number of common input arrays read by them. Among the computed M_i, we choose the one that has the highest non-zero value, and fix the evaluation order of its tree pair to be contiguous to enhance reuse proximity in the final schedule. The DAG is updated by fusing the nodes corresponding to the two trees into a "macro node". Post fusion, we update the dependence edges to and from the macro node, and recompute the metrics for the next step. The algorithm terminates when no more nodes can be fused.

Once the algorithm terminates, we perform a topological sort of the final DAG, and expand the DAG macro nodes to their tree sequences. For these ordered trees, we can generate code versions with different degrees of splits. One extreme would be a version where all the trees are in a single kernel (max-fuse), and another extreme would be a version where each tree is a distinct kernel (max-split) [8, 55].
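One simple way create_split_versions (Algorithm 2, line 11) could enumerate such versions is to cut the fixed tree order into progressively more kernels. The following sketch, using the Tree type from the earlier sketch, assumes that strategy; the framework's actual implementation may form splits differently, e.g., only at dependence boundaries.

  #include <algorithm>
  #include <vector>

  using Kernel = std::vector<Tree>;   // trees fused into one GPU kernel
  using Split  = std::vector<Kernel>; // one candidate code version

  std::vector<Split> create_split_versions(const std::vector<Tree>& ordered) {
    std::vector<Split> versions;
    const int n = static_cast<int>(ordered.size());
    for (int kernels = 1; kernels <= n; ++kernels) {   // max-fuse ... max-split
      Split split;
      const int chunk = (n + kernels - 1) / kernels;   // roughly equal groups
      for (int start = 0; start < n; start += chunk)
        split.emplace_back(ordered.begin() + start,
                           ordered.begin() + std::min(start + chunk, n));
      versions.push_back(split);
    }
    return versions;
  }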
For compute-intensive stencils with many-to-many reuse, a single kernel can have extremely high register pressure, sometimes causing spills despite allowing for the maximum permitted registers per thread. For such cases, performing kernel fission instead of generating a single kernel for the entire computation might improve performance. The split kernels will incur additional data transfers from global memory, but the register pressure per kernel will be much lower, giving the user an opportunity to further enhance register-level reuse via unrolling. Note that none of the production GPU compilers are capable of performing kernel fusion/fission optimizations. For each split version created, the tree sequence in it is evaluated using Algorithm 1. The returned schedule is the one that gives maximum performance. Algorithm 2 outlines the entire process.

4 Interleaving Expressions

At this point, we have a schedule for the entire DAG of trees, but with atomic evaluation enforced. However, interleaving within/across trees can be instrumental in reducing MAXLIVE. For example, in the unrolled stencil of Listing 1, there is no reuse within a stencil statement, but plenty of reuse across stencil statements. We will see later in Section 5 that relaxing the constraint of atomic evaluation, and performing interleaving, is imperative for performance in such stencils. A compiler optimization that performs some interleaving is common subexpression elimination (CSE). However, we require a more general interleaving that works at the granularity of common labels instead of common subexpressions.

Figure 5. Example: interleaving to reduce MAXLIVE: (a) original tree; (b) interleaving expressions.

For example, Figure 5a shows an expression tree where su′(S) is the largest, and the operands of the accumulation node are evaluated in order from left to right in the final schedule. Also, {c[i], b} ∉ uses(S). The fact that {c[i], b} ∈ passthrough(S) adds to su′(S). By slicing the expression (e[i]∗b)/c[i] and placing it after the expression b∗c[i] as shown in Figure 5b, c[i] and b will no longer be in in(S). Instead of those two labels, a temporary label holding the value of the sliced expression will be added to in(S), and hence su′(S) will reduce by 1. Note that this is not CSE, but a more general optimization aimed to reduce MAXLIVE. This slice-and-interleave optimization slices a target expression, and interleaves it with a source expression, so that su′ at a program point reduces. It subsumes CSE if the source and target expressions are the same.

We perform the slice-and-interleave optimization at two levels: (a) within an expression tree, where the source and target expressions belong to the same tree; and (b) across the expression trees in the DAG, where source and target expressions belong to different trees. For a chosen source expression e_s rooted at node n, we compute a set of labels, L_ilv, which is a union of all the labels that were observed in the schedule till now, with the labels that have a single occurrence in the DAG.

We now try to find a set of target expressions operating on just the labels from L_ilv. To find the target expressions, we start with minimal expressions, i.e., the simplest expressions whose operands are leaf nodes ∈ N_mem. Once we find a minimal expression e_m that operates only on the labels ∈ L_ilv, we find the root node r of e_m, and grow e_m to the expression rooted at parent(r). We continue to grow the expression till we have a maximal expression that only operates on the labels ∈ L_ilv. For each target expression thus discovered, we check if slicing and placing it between the source expression e_s and the subtree t_s immediately following e_s in the schedule decreases |in(t_s)|. If it does, then slice-and-interleave is performed.

Algorithm 3: slice-and-interleave(T, in, out)
Input: T: an input tree with schedule S and contexts in and out
Output: S: the schedule after applying slice-and-interleave
1  L_ilv ← ∅;
2  min_exprs ← sequence of minimal expressions extracted from S whose operands are leaf nodes;
3  foreach expression s in min_exprs do
4    L_ilv ← all the labels seen in the schedule till s ∪ labels with single occurrence in T;
5    foreach expression t appearing after s in min_exprs do
6      if t only operates on the labels in L_ilv then
7        t′ ← maximal expression obtained by growing t until it operates on the labels in L_ilv;
8        if live ranges reduced by placing t′ after s then
9          slice and place t′ after s in S;
10         move t after s in min_exprs;
11 return S;
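A sketch of the growing step (lines 6–7 of Algorithm 3), using the Node, Kind, and Labels types from the earlier sketches. The parent map and the helper names are ours.

  #include <map>
  #include <set>
  #include <string>

  // Collect the labels of all memory leaves under a node.
  void collect_labels(const Node* n, Labels& acc) {
    if (n->kind == Kind::Mem) acc.insert(n->label);
    for (const Node* c : n->children) collect_labels(c, acc);
  }

  bool only_uses(const Node* n, const Labels& L_ilv) {
    Labels used;
    collect_labels(n, used);
    for (const std::string& l : used)
      if (!L_ilv.count(l)) return false;
    return true;
  }

  // Grow a minimal target expression upward while the enclosing expression
  // still reads only labels from L_ilv; stop (backtrack) at the first parent
  // that does not, and return the maximal qualifying expression.
  Node* grow_target(Node* expr, const Labels& L_ilv,
                    const std::map<Node*, Node*>& parent) {
    Node* t = expr;
    auto it = parent.find(t);
    while (it != parent.end() && it->second != nullptr &&
           only_uses(it->second, L_ilv)) {
      t = it->second;              // e.g. e[i]*b  ->  (e[i]*b)/c[i]
      it = parent.find(t);
    }
    return t;
  }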
Illustrative Example  Let b∗c[i] be the source expression in the tree of Figure 5a. One of the explored target expressions will be e[i]∗b, since it only uses nodes ∈ L_ilv. Now we try to grow the target expression by changing the root from ∗ to /, making (e[i]∗b)/c[i] the new target expression. All the labels used in the grown target expression also belong to L_ilv. A further attempt to grow will change the target expression to ((e[i]∗b)/c[i])∗f[i]. However, f[i] ∉ L_ilv. Therefore, we backtrack and finalize (e[i]∗b)/c[i] as a target expression, since it is the maximal expression with all the labels in L_ilv. Placing the target expression after the source expression decreases in(S) by 1. Therefore, we perform the slice-and-interleave optimization. Algorithm 3 outlines the slice-and-interleave algorithm that tries out different source expressions, and continuously finds the target expressions within the tree to interleave, in order to reduce the live ranges. The slice-and-interleave across the trees in a DAG is similar.

5 Experimental Evaluation

Our framework parses stencil statements written in a subset of C: the array access indices in the stencil statements must be an affine function of the surrounding loop iterators, program parameters, and constants; loop iterators and parameters must be immutable in the stencil statements. The framework supports auto-unrolling along different dimensions to expose spatial reuse across stencil statements.

To ensure a tight coupling, several prior efforts on guiding register allocation or instruction scheduling were implemented as a compiler pass in research/prototype compilers [7, 16, 20, 41, 45], or open-source production compilers [29, 46]. However, like some other recent efforts [6, 28, 50], we implement our reordering optimization at source level for the following reasons: (1) it allows external optimizations for closed-source compilers like NVCC; (2) it allows us to perform transformations like exposing FMAs using operator distributivity, and performing kernel fusion/fission, which can be performed more effectively and efficiently at source level; and (3) it is input-dependent, not machine- or compiler-dependent – with an implementation coupled to compiler passes, it would have to be re-implemented across compilers with different intermediate representations. Our framework massages the input to a form that is more amenable to further optimizations by any GPU compiler, and we use appropriate compilation flags whenever possible to ensure that our reordering optimization is not undone by the compiler passes.

We evaluate our framework for the benchmarks listed in Table 1 on a Tesla K40c GPU (peak double-precision performance 1.43 TFLOPS, peak bandwidth 288 GB/s) with NVCC-8.0 [38] and the LLVM-5.0.0 compiler (previously gpucc [57]). The first five benchmarks are stencils typically used in iterative processes such as solving partial differential equations [26]. The remaining three are representative of complex stencil operations extracted from applications. hypterm is a routine from the ExpCNS Compressible Navier-Stokes mini-application from DoE [17]; the last two stencils are from the Geodynamics Seismic Wave SW4 application code [51]. For each benchmark, the original version is as written by application developers without any loop unrolling; the unrolled version has the loops unrolled explicitly; and the reordered version is the output from our code generator. On an Intel i7-4770K processor, the code generator generated each reordered version in under 4 seconds. When the net unrolling factor is limited to 4, the size of each reordered version is under 600 lines of code. The read-only arrays are annotated with the restrict keyword in all the versions to allow efficient loads via the texture pipeline.

Table 1. Benchmark characteristics
Benchmark   | N     | UF | k | F   | R  | A   | U
2d25pt      | 8192² | 4  | 2 | 33  | 2  | 104 | 44
2d64pt      | 8192² | 4  | 4 | 73  | 2  | 260 | 92
2d81pt      | 8192² | 4  | 4 | 95  | 2  | 328 | 112
3d27pt      | 512³  | 4  | 1 | 30  | 2  | 112 | 58
3d125pt     | 512³  | 4  | 2 | 130 | 2  | 504 | 204
hypterm     | 300³  | 1  | 4 | 358 | 13 | 310 | 152
rhs4th3fort | 300³  | 1  | 2 | 687 | 7  | 696 | 179
derivative  | 300³  | 1  | 2 | 486 | 10 | 493 | 165
N: domain size, UF: unrolling factor, k: stencil order, F: FLOPs per point, R: # arrays, A: total elements accessed per thread, U: unique elements accessed per thread.

All the stencils are double-precision, compiled with the NVCC flags '--use_fast_math -Xptxas "-dlcm=ca"', and the LLVM flags '-O3 -ffast-math -ffp-contract=fast'. Since none of the versions use shared memory, using 'dlcm=ca' for NVCC enhances performance by caching the global memory accesses at L1. However, we notice no discernible performance difference when compiling without the '--use_fast_math' flag in NVCC. We explore different instruction schedulers implemented in LLVM (default, list-hybrid, and list-burr) for the original and unrolled code, and report the best performance. To minimize instruction reordering for our reordered code, we use LLVM's default instruction scheduler, and do not use the -ffast-math option during compilation. We test all the versions against a standard C implementation of the benchmarks for correctness: the difference in each computed output value with that of the C implementation must be less than a tolerance of 1E-5.

Loop Unrolling  For the experiments, the iterative kernels were unrolled along a single dimension to expose spatial reuse. Loop unrolling offers the compiler an opportunity to exploit ILP, but scheduling independent instructions contiguously may increase register pressure. Consider an unrolled version of 2d25pt, compiled with 32 registers. From Table 1, it is clear that the unrolled code has a high degree of reuse. Listing 2 shows the SASS (Shader ASSembler) snippet generated using NVCC for the unrolled version of 2d25pt after register allocation. The instructions not relevant to the discussion are omitted in Listings 2 and 3, leading to non-contiguous line numbers. The lines highlighted in red show the instructions involving the same memory location – line 1 loads a value from global memory into register R4, and spills it in line 2 without using R4 in any of the intermediate instructions. Such wasteful spills are a characteristic of register-constrained codes. The same value is reloaded from local memory into R4 in line 4, and R4 is subsequently used in lines 5 and 8. The uses of R4 are placed far apart in the SASS, adding to the register pressure. Interspersed with these instructions are the load (line 3) and subsequent uses of register R12. The interleaving increases ILP, but the uses of R12 are also placed very far apart. A better schedule can perhaps achieve the same ILP with less register pressure and fewer spills.
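For reference, the unrolled 2d25pt kernel discussed above has roughly the following per-thread shape. This is only a sketch with assumed names (in, out, w, i, j); per Table 1 the real generated code unrolls by 4 and touches 104 array elements per thread, of which 44 are unique.

  // Sketch: order-2, 25-point 2D stencil, unrolled by 4 along i.
  // The four 5x5 input neighborhoods overlap, which is the register-level
  // reuse that the schedules discussed above either exploit or spill away.
  for (int u = 0; u < 4; ++u) {
    double acc = 0.0;
    for (int jj = -2; jj <= 2; ++jj)
      for (int ii = -2; ii <= 2; ++ii)
        acc += w[jj+2][ii+2] * in[j+jj][i+u+ii];
    out[j][i+u] = acc;
  }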
Listing 2. SASS snippet for the unrolled code
1  106 /*0328*/  @P0 LDG.E.64 R4, [R24];
2  144 /*0458*/  @P0 STL [R1+0x10], R4;
3  332 /*0a38*/  @P0 LDG.E.64 R12, [R8];
4  350 /*0ac8*/  @P0 LDL.LU R4, [R1+0x10];
5  354 /*0ae8*/  @P0 DADD R16, R16, R4;
6  358 /*0b08*/  @P0 DADD R16, R12, R10;
7  376 /*0b98*/  @P0 DFMA R6, R12, c[0x2][0x40], R14;
8  374 /*0b88*/  @P0 DADD R16, R6, R4;
9  436 /*0d78*/  @P0 DADD R12, R12, R24;

Listing 3. SASS snippet for the reordered code
1  163 /*04f0*/  @P0 DFMA R14, R22, c[0x2][0x8], R14;
2  164 /*04f8*/  @P0 DFMA R8, R22, c[0x2][0x18], R8;
3  166 /*0508*/  @P0 DFMA R12, R22, c[0x2][0x30], R12;
4  175 /*0550*/  @P0 DFMA R8, R20, c[0x2][0x30], R8;
5  176 /*0558*/  @P0 DFMA R12, R20, c[0x2][0x18], R12;
6  178 /*0568*/  @P0 DFMA R8, R30, c[0x2][0x38], R8;
7  183 /*0590*/  @P0 DFMA R22, R20, c[0x2][0x8], R10;
8  184 /*0598*/  @P0 DFMA R10, R20, c[0x2][0x18], R14;
9  187 /*05b0*/  @P0 DFMA R16, R30, c[0x2][0x20], R12;
10 191 /*05d0*/  @P0 DFMA R10, R30, c[0x2][0x20], R10;

Listing 3 shows the SASS snippet for the reordered code generated by our code generator. Using operator distributivity, the multiplication of the coefficient with the additive contributions is converted by our preprocessing pass into fused multiply-adds. Notice that all the uses of register R20 (highlighted in red) are tightly coupled. The same holds for registers R22, R30, and the remaining instructions. Independent FMAs are scheduled together without increasing the MAXLIVE. This reduces register pressure without compromising ILP. Therefore, even though the unrolled version performs fewer FLOPs than the reordered version, we incur fewer spill LDL/STL instructions per thread (101 for unrolled vs. 7 for reordered).

For the 3d125pt stencil, Table 2 shows some profiling metrics gathered by Nvprof with NVCC. The texture throughput for the original code indicates that the stencil performance is limited by the texture cache bandwidth. Loop unrolling halves the accesses to texture cache and the executed load instructions, but results in a significant drop in IPC due to lowered occupancy. To better expose reordering opportunities after unrolling, the preprocessing pass of the reordering framework exploits operator distributivity and converts all the contributions in an individual statement to FMA operations. Therefore, instead of 130 FLOPs per stencil point, the reordered version performs 250 FLOPs. As measured by Nvprof, we incur a 2× increase in floating point operations, but achieve significant reuse in registers at a higher occupancy, which consequently improves the IPC and execution time.

Table 2. Metrics for 3d125pt for tuned configurations
Version   | reg | IPC  | inst. exec. | ld/st exec. | FLOPs    | L2 reads | tex txn | tex GB/s
Original  | 128 | 1.76 | 2.74E+9     | 5.28E+8     | 1.73E+10 | 5.27E+8  | 4.19E+9 | 899.53
Unrolled  | 255 | 1.12 | 1.36E+9     | 2.14E+8     | 1.72E+10 | 2.94E+8  | 1.67E+9 | 457.23
Reordered | 64  | 2.00 | 1.41E+9     | 2.14E+8     | 3.34E+10 | 1.55E+8  | 1.67E+9 | 791.16

Register Pressure Sensitivity  In GPUs, the number of registers per thread can be varied at compile time by trading off the occupancy. Many auto-tuning efforts have recently been proposed to that end [24, 32]. Table 4 shows the performance, and the local memory transactions reported by Nvprof, with varying register pressure. Due to space constraints, we only present the data for the NVCC compiler. We make the following observations: (a) our optimization strategy reduces the register pressure for all the thread configurations; (b) increasing registers per thread for codes exhibiting very high spills results in better performance, e.g., 8× for rhs4th3fort; and (c) for low spills, better performance can be achieved by either increasing occupancy (e.g., reordered code for 3d125pt and hypterm), or maximizing registers per thread (e.g., all the codes for rhs4th3fort).

Finding a right balance between register pressure and occupancy is non-trivial, and an active research field [24, 32, 53, 58]. We perform a simple auto-tuning by varying the tile sizes by powers of 2, and varying registers per thread [32]. The best performance in GFLOPS for the auto-tuned code with the NVCC and LLVM compilers is shown in Figure 6. Unlike the case with 32 and 64 registers per thread, the unrolled code outperforms the original code for all benchmarks, highlighting the importance of loop unrolling and register-level reuse. Our reordering optimization improves the performance by (a) producing a code version that uses fewer registers, and hence can achieve higher occupancy; and (b) helping expose and schedule independent FMAs together for simple accumulation stencils, thereby hiding latency.

Kernel Fission  From Table 1, we select the last three multi-statement, compute-intensive stencils, for which we anticipate a high volume of spills in the max-fuse form, and expect kernel fission to be beneficial. For these three stencils, we generate versions with varying degrees of splits (Section 3.4). Some splits require promoting the storage from scalars to global arrays, while others require recomputations due to dependence edges in the DAG. Table 3 shows the Nvprof metrics with NVCC for two reordered codes: a version with maximum fusion (max-fuse), and a version with split kernels.

Table 3. Metrics for reordered max-fuse vs. split versions
Metrics      | rhs4th3fort           | hypterm               | derivative
             | max-fuse  | split-3   | max-fuse  | split-3   | max-fuse  | split-2
Inst. exec.  | 8.52E+9   | 8.25E+8   | 7.48E+8   | 7.71E+8   | 8.79E+8   | 8.96E+8
IPC          | 1.07      | 1.11      | 0.97      | 1.06      | 1.02      | 1.14
DRAM reads   | 9.07E+7   | 1.65E+8   | 1.57E+8   | 1.77E+8   | 1.34E+8   | 2.47E+8
ld/st exec.  | 1.55E+8   | 1.08E+8   | 1.27E+8   | 1.46E+8   | 1.45E+8   | 1.30E+8
FLOPs        | 1.73E+10  | 1.81E+10  | 9.66E+9   | 9.36E+9   | 1.28E+10  | 1.34E+10
tex txn.     | 1.11E+9   | 8.24E+8   | 9.72E+8   | 1.06E+9   | 1.14E+9   | 1.01E+9
L2 read txn. | 4.64E+8   | 3.79E+8   | 6.52E+8   | 5.90E+8   | 4.97E+8   | 4.51E+8
GFLOPS       | 237.16    | 274.52    | 140.71    | 155.02    | 168.27    | 182.83

Note that in each case, even though the DRAM reads increase going from the max-fuse to the split version, the IPC also increases. This is because the register pressure per kernel is much lower in the split version, and hence we can unroll the computation to further exploit register-level reuse. This increase in register-level reuse is reflected in the reduced