Register Optimizations for Stencils on GPUs

Prashant Singh Rawat (The Ohio State University), Aravind Sukumaran-Rajam (The Ohio State University), Atanas Rountev (The Ohio State University), Fabrice Rastello (INRIA), Louis-Noël Pouchet (Colorado State University), P. Sadayappan (The Ohio State University)
Abstract

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

CCS Concepts  • Software and its engineering → Compilers;

ACM Reference Format:
Prashant Singh Rawat, Aravind Sukumaran-Rajam, Atanas Rountev, Fabrice Rastello, Louis-Noël Pouchet, and P. Sadayappan. 2018. Register Optimizations for Stencils on GPUs. In PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 24–28, 2018, Vienna, Austria. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3178487.3178500

1 Introduction

Stencil computations are an important computational motif in many scientific applications. Typically, a simple stencil computation updates elements of one or more output arrays using elements in the spatial neighborhood from one or more input arrays. The footprint of a stencil is determined by its order, which is the number of input elements from the center read along each dimension. In many scientific applications, the stencil order determines the computational accuracy. For this reason, high-order stencils have been gaining popularity. However, the inherent data reuse within or across statements in such high-order stencils exposes performance challenges that are not addressed by current stencil optimizers.

A significant focus in optimizing stencil computations has been to fuse operations across time steps or across a sequence of stencils in a pipeline [5, 21, 22, 36, 44, 54, 59]. With high-order stencils, the operational intensity is sufficiently high so that even with just a simple spatial tiling, the computation should theoretically not be memory-bandwidth bound. Consider a GPU with around 300 GBytes/sec global memory bandwidth and a peak double-precision performance of around 1.5 TFLOPS. The required operational intensity to be compute-bound and not memory-bandwidth bound is around 5 FLOPs/byte, or 40 FLOPs per double-word. Many high-order stencil computations have much higher arithmetic intensities than 40. For such stencils, achieving a high degree of reuse in cache is very feasible, but high performance is not realized on GPUs. The main hindrance to performance is the high register pressure with such codes, resulting in excessive register spilling and a subsequent loss of performance.

As we elaborate in the next section, existing register management techniques in production compilers are not well equipped to address the problem of register pressure for high-order stencils. Addressing this problem in the context of GPUs is even more challenging, since most of the widely used GPU compilers like NVCC [38] are closed-source. Even the recent effort by Google (gpucc [57]) only exposes the front-end to the user, and uses the NVCC backend as a black box to perform instruction scheduling and register allocation.

In this paper, we develop an effective pattern-driven global optimization strategy for instruction reordering to address this problem. The key idea behind the instruction reordering approach is to model reuse in high-order stencil computations by using an abstraction of a DAG of trees with shared nodes/leaves, and exploit the fact that optimal scheduling to minimize registers for a single tree with distinct operands at the leaves is well known [47]. We thus devise a statement reordering strategy for a DAG of trees with shared nodes that enables reduction of register pressure to improve performance. The paper makes the following contributions:

• It proposes a framework for multi-statement stencils that reduces register pressure by reordering instructions across statements.
• It describes novel heuristics to schedule a DAG of trees that reuse data using a minimal number of registers.
• It demonstrates the effectiveness of the proposed framework on a number of register-constrained stencil kernels.
Figure 1. Comparing the same stencil computation with different sweeping orders.

(a) Stencil with lexicographical sweeps:

for (i=2; i<N-2; i++)
  for (j=2; j<N-2; j++) {
    out[i][j] = 0;
    for (ii=-2; ii<=2; ii++)
      for (jj=-2; jj<=2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];
  }

(b) Stencil with reverse-lexicographical sweeps:

for (i=2; i<N-2; i++)
  for (j=2; j<N-2; j++) {
    out[i][j] = 0;
    for (ii=2; ii>=-2; ii--)
      for (jj=-2; jj<=2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];
  }
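The "unrolled version" of Figure 1a discussed in Section 2 has roughly the following shape. This is only a minimal sketch with an illustrative unroll factor of 2 and boundary handling omitted; the experiments in Section 5 use other factors.

  /* Sketch: Figure 1a with the i-loop unrolled by 2 (illustrative factor).
     out[i][j] and out[i+1][j] read overlapping 5x5 neighborhoods of in[],
     so 20 of the 25 loads for the second point can be reused from registers
     if the schedule keeps their live ranges short. */
  for (int i = 2; i < N-2; i += 2)
    for (int j = 2; j < N-2; j++) {
      out[i][j] = 0;
      out[i+1][j] = 0;
      for (int ii = -2; ii <= 2; ii++)
        for (int jj = -2; jj <= 2; jj++) {
          out[i][j]   += in[i+ii][j+jj]   * w[ii+2][jj+2];
          out[i+1][j] += in[i+1+ii][j+jj] * w[ii+2][jj+2];
        }
    }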
2 Background and Motivation

Register Allocation and Instruction Scheduling  A compiler has several optimization passes, register allocation and instruction scheduling being two of them. Passes before register allocation manipulate an intermediate representation with an unbounded number of temporary variables. The goal of register allocation is to assign those temporaries to physical storage locations, favoring the few but fast registers over the slower but larger memory.

For a fixed schedule, a common approach to perform register allocation is to build an interference graph of the program, which captures the intersection of the live-ranges of temporaries at any program point. Register assignment is then reduced to coloring the interference graph, where each color represents a distinct register [10, 11]. Interfering nodes in the interference graph are assigned different colors due to their adjacency. The number of registers needed by the coloring algorithm is lower-bounded by the maximum number of intersecting live-ranges at any program point (MAXLIVE). If MAXLIVE is more than the number of physical registers, spilling of registers and the consequent load/store operations from/to memory are unavoidable.

Register pressure can sometimes be alleviated by reordering the schedule of dependent instructions to reduce MAXLIVE. Reordering independent instructions is often used to enhance the amount of instruction-level parallelism (ILP), for hiding memory access latency. Thus, there is a complex interplay between instruction scheduling and register allocation, affecting instruction-level parallelism and register pressure, and the associated optimization problem is highly combinatorial. Production compilers generally use heuristics for increasing ILP, with a best-effort greedy control on register pressure. For typical application codes, the negative effect on register pressure is not very significant. However, for high-order stencil codes with a large number of operations and a lot of potential register-level reuse, the impact can be very high, as illustrated by an example below.

Illustrative Example  Consider an unrolled version of the double-precision 2D Jacobi stencil computation (Figure 1a) from [50]. NVCC interleaves the contribution from each input point to different output points to increase instruction-level parallelism (ILP). The interleaving performed to increase ILP also has the serendipitous effect of reducing the live range of the register data, and a consequent reduction in register pressure. Nvprof [39] profiling data on a Tesla K40c device shows that under maximum occupancy, this version performs 3.73E+06 spill transactions, achieving 467 GFLOPS.

Figure 1b shows the same stencil computation after changing the order of accumulation. Exactly the same contributions are made to each result array element, but the order of the contributions has been reversed. With this access pattern for the code in Figure 1b, NVCC fails to perform the same interleaving despite allowing reassociation via appropriate compilation flags. In fact, the register pressure is now exacerbated by the consecutive scheduling of independent operations to increase ILP. For this version, 1.58E+08 spill transactions were measured, with performance dropping to 51 GFLOPS.

This example illustrates a problem with register allocation when the computation has a specific reuse pattern, characteristic of high-order stencil computations. The problem stems from the fact that for most compilers the register allocation and instruction scheduling algorithms that operate at a basic-block level have a peephole view of the computation – they make scheduling/allocation decisions without a global perspective, and thus sometimes work antagonistically. Meanwhile, stencil computations typically have a very regular access pattern. With a better understanding of the pattern, and a global perspective on the computation, it is feasible to devise an instruction reordering strategy to alleviate register pressure.
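As a toy illustration of this tension (our own example, not one of the paper's benchmarks), consider two equivalent orderings of the same contributions to two output points; Schedule B consumes each loaded value in all of its uses right away, in the spirit of the input-point-major interleaving described above for Figure 1a. The names in, out, and w0..w2 are illustrative.

  // Schedule A: all loads hoisted to expose ILP -- four loaded temporaries
  // are live simultaneously before any of them is consumed.
  double t0 = in[i-1], t1 = in[i], t2 = in[i+1], t3 = in[i+2];
  out[i]   = w0*t0 + w1*t1 + w2*t2;
  out[i+1] = w0*t1 + w1*t2 + w2*t3;

  // Schedule B: each loaded value feeds all of its uses immediately, so only
  // one loaded value (plus the two accumulators) stays live at any point.
  out[i]  = w0*in[i-1];
  double t = in[i];   out[i] += w1*t;  out[i+1]  = w0*t;
  t = in[i+1];        out[i] += w2*t;  out[i+1] += w1*t;
  out[i+1] += w2*in[i+2];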
Solution Approach  In this paper, we circumvent the complexity of the general optimization problem of instruction reordering and register allocation by devising a pattern-specific optimization strategy. Stencil computations involve accumulation of contributions from array data elements in a small neighborhood around each element. The additive contributions to a data element may be viewed as an expression tree. Thus, for multi-statement stencils, we have a DAG of expression trees. Due to the fact that an element may contribute to several result elements, the trees within the DAG can have many shared leaves.

Given a single tree without any shared leaves, it is well known [47] how to schedule its operations in order to minimize the number of registers needed. We use this as the basis for developing heuristics to schedule the operations from the DAG of trees with shared leaves. In contrast to the problem of reordering an arbitrary sequence of instructions to minimize register pressure, a structured approach of adapting the optimal schedule for isolated trees to the case of a DAG of trees with shared leaves results in an efficient and effective algorithm that we develop in the next two sections.

3 Scheduling DAG of Expression Trees

Stencil computations are often succinctly represented using a domain-specific language (DSL). Listing 1 shows a 7-point Jacobi stencil expressed in an illustrative DSL, similar in spirit to stencil computation DSLs such as SDSL [25] and Forma [43]. The core computation is shown in lines 2–4. As with similar DSLs, the user can specify unroll factors for loop iterators (line 9). Loop unrolling, or thread coarsening on GPUs, is often used to exploit register-level reuse in the code. The computation is automatically unrolled as a preprocessing step, before the code is generated and optimized.

Listing 1. The input representation in the DSL
1  function j3d7pt (out, in, a, b, c) {
2    out[k][j][i] = a*(in[k+1][j][i]) + b*(in[k][j-1][i] +
3      in[k][j][i-1] + in[k][j][i] + in[k][j][i+1] +
4      in[k][j+1][i]) + c*(in[k-1][j][i]);
5  }
6  parameter L, M, N;
7  iterator k, j, i;
8  double in[L][M][N], out[L][M][N], a, b, c;
9  unroll k=2, j=2;
10 j3d7pt (out, in, a, b, c);
11 return out;

It is important to note that using a DSL is not a prerequisite for using the scheduling techniques proposed in this work. As described shortly, our approach works on a DAG of expression trees. This DAG can be automatically extracted either from the DSL representation or from C/Fortran code.

A stencil statement can be defined by the stencil shape (as in lines 2–4) and the input/output data (as in line 8). Each such stencil statement can be represented by a labeled expression tree. For example, the tree corresponding to the computation in Listing 1 has array element out[k,j,i] as its root, scalars a, b, c and accesses to elements of array in as its leaves, and arithmetic operators ∗ and + as inner nodes.

An expression tree for a stencil computation has three types of nodes: (1) nodes n ∈ N_mem representing accesses to memory locations, (2) nodes n ∈ N_op representing binary/unary arithmetic operators, and (3) leaf nodes representing constants. All leaf nodes in N_mem correspond to reads of array elements (e.g., in[k+1, j, i]) or scalars. The root of the expression tree is also in N_mem and corresponds to a write to an array element (e.g., out[k, j, i]) or a scalar. We associate a unique label with each read/written memory location, and assign to each node in N_mem the corresponding label. The remaining tree nodes are in N_op. Figure 2b shows the expression tree for an illustrative expression.

Figure 2. Expression tree example: (a) the illustrative stencil statement out = a + (b * c[i]) + d[i] + ((e[i] * f) / 2.3); (b) its expression tree; (c) the expression tree with accumulations.

In a preprocessing step, we introduce k-ary nodes for associative operators. For example, for the tree in Figure 2b, the chain of + nodes is replaced with a single "accumulation" + node. Figure 2c shows the resulting expression tree; the numbers on the nodes will be described shortly. The semantics of an accumulation node is as expected: the value is initialized as appropriate (e.g., 0 for +, 1 for ∗) and the contributions of the children are accumulated in arbitrary order.

We often consider a sequence of stencil computations; for example, in image processing pipelines [43]. Each computation in the sequence will be represented by a separate expression tree. Similarly, unrolling will result in distinct expression trees for each unrolled instance. For example, after unrolling along dimensions k and j in Listing 1, there will be a sequence of four expression trees. In some cases the output of a tree is used as an input to a later tree in the sequence. In such a case, there is a flow dependence: the root of the producer tree has the same label as some leaf node of the consumer tree (without an in-between tree that writes to that label). In the input to our analysis, this flow is represented by a dependence edge from the root node to the leaf node. Thus, the entire computation is represented as a DAG of expression trees.
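A minimal C++ sketch of this representation (the type and field names are ours, not the framework's API):

  #include <string>
  #include <utility>
  #include <vector>

  // Node kinds: memory accesses (reads/writes, one label per distinct
  // location), operators (possibly k-ary accumulation nodes), and constants.
  enum class Kind { Mem, Op, Const };

  struct Node {
    Kind kind;
    std::string label;            // for Mem nodes, e.g. "in[k+1][j][i]"
    char op;                      // for Op nodes: '+', '*', ...
    std::vector<Node*> children;  // k children for an accumulation node
  };

  struct Tree {
    Node* root;                   // a Mem node: the written array element/scalar
  };

  // The whole computation: a sequence of trees plus flow-dependence edges
  // from a producer tree's root to matching leaves of later trees.
  struct Dag {
    std::vector<Tree> trees;
    std::vector<std::pair<Node*, Node*>> flow_edges;  // (producer root, consumer leaf)
  };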
Throughout the paper, we make the following two assumptions: (1) the assembly instructions generated for the DAG of trees after register allocation are of the form r1 ← r2 op r3, where r1 and r2 can be the same; (2) each operand/result requires exactly one register. This condition is only enforced to simplify the presentation of the next two sections, and can be very easily relaxed [4]. Our objective is to schedule the computations in the DAG so that the register pressure is reduced.

3.1 Sethi-Ullman Scheduling

We will use "data sharing" to refer to cases where the same memory location is accessed at multiple places. There are two types of data sharing: (1) within a tree: several nodes from N_mem have the same label; and (2) across trees: in a DAG of trees, nodes from distinct trees have the same label.

A classic result, due to Sethi and Ullman [47], applies to a single expression tree without data sharing (i.e., each n ∈ N_mem has a unique label), and with binary/unary operators. They present a scheduling algorithm that minimizes the number of registers needed to evaluate such an expression tree under a spill-free model.¹ Each tree node n is assigned an Ershov number [1]; we will refer to them as "Sethi-Ullman numbers" and denote them by su(n). They are defined as

  su(n) =  1                       if n is a leaf
           su(n1)                  if n has a lone child n1
           max(su(n1), su(n2))     if su(n1) ≠ su(n2)
           1 + su(n1)              if su(n1) = su(n2)          (1)

¹ In a spill-free model of the computation, a data element is loaded only once into a register for all its uses/defs.

The last two cases apply to a binary op node n with children n1 and n2. Intuitively, su(n) is the smallest possible number of registers used for the evaluation of the subtree rooted at n. The first two cases are self-explanatory. For a binary op node n, if one child n′ has a higher register requirement (case 3), this "big" child should be evaluated first. The result of n′ will be stored in a register, which will be alive while the second ("small") child is being evaluated. The remaining su(n′)−1 registers used for n′ are available (and enough) to evaluate the small child. Finally, the register of n′ can be used to store the result for n, meaning that su(n) is equal to su(n′). If the order of evaluation were reversed, the result of the small child would have to be kept in a register while n′ is being evaluated, which would lead to a sub-optimal su(n) = 1 + su(n′). In the last case in equation (1), both children have the same register needs; thus, their relative order of evaluation is irrelevant and one extra register is needed for n. Of course, under the definitions in equation (1), su(n) is the same as MAXLIVE for the tree rooted at n.

It is straightforward to generalize Sethi-Ullman numbering to trees containing accumulation nodes (as in Figure 2c). Each such accumulation node n has children n_i for 1 ≤ i ≤ k. Let mx = max_i{su(n_i)}. If there is a single child n_j with su(n_j) = mx, this child is scheduled for evaluation first, and therefore su(n) = mx. If two or more children n_j have su(n_j) = mx, one of them is scheduled first; however, in this case su(n) = 1 + mx. In both cases, the order of evaluation of the remaining children is irrelevant. Figure 2c shows the Sethi-Ullman numbers for the sample expression tree.
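Equation (1), together with the accumulation-node generalization above, can be computed bottom-up. The following is only a sketch, using the Node type from the earlier sketch rather than the framework's actual code:

  // Sethi-Ullman number of a tree *without* data sharing, per equation (1),
  // extended to k-ary accumulation nodes as described in the text.
  int su(const Node* n) {
    if (n->children.empty()) return 1;                       // leaf
    if (n->children.size() == 1) return su(n->children[0]);  // lone child
    int mx = 0, children_at_mx = 0;
    for (const Node* c : n->children) {
      int s = su(c);
      if (s > mx) { mx = s; children_at_mx = 1; }
      else if (s == mx) ++children_at_mx;
    }
    // Binary node: max of the two if they differ, else 1 + su(n1).
    // Accumulation node: mx if a single child attains mx, else 1 + mx.
    return (children_at_mx == 1) ? mx : 1 + mx;
  }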
Note that the schedules produced by this approach perform atomic evaluation of subexpressions: one of the children is evaluated completely before the other ones are considered. For a tree without data sharing, this restriction does not affect the optimality of the result. In the presence of data sharing, atomic evaluation may not be optimal.

Since stencils read values from a limited spatial neighborhood, data sharing often manifests in the DAG of expression trees. For example, in Listing 1, in[k][j][i] will be an input to all four expression trees corresponding to the unrolled stencil statements. One can also find other nodes in Listing 1 that will be shared across multiple expression trees. For such DAGs, the Sethi-Ullman algorithm cannot be directly applied to obtain an optimal schedule. In Section 3.2, we present an approach to compute an optimal atomic schedule for a DAG of expression trees with data sharing. In cases when finding an optimal evaluation can be prohibitively expensive, Section 3.3 presents heuristics to trade off optimality in favor of pruning the exploration space. Finally, restricting the evaluation to be atomic can generate sub-optimal schedules. Section 4 presents a remedial slice-and-interleave algorithm that performs interleaving in the output schedule generated by the approach presented in Section 3.2.

3.2 Scheduling a Tree with Data Sharing

Figure 3. Scheduling a tree with data sharing: (a) tree with data sharing; (b) scheduling cost.

Figure 3a shows an expression tree with data sharing. For illustration, nodes with the same label are connected in the figure. Recall that we assume a spill-free model; therefore a shared label loaded once into a register will remain live for all its uses. With data sharing, there is a possibility that (1) a label is already live before we begin the recursive evaluation of a subtree that has its subsequent use, and/or (2) a label must remain live even after the evaluation of the subtree in which it is used. The optimal schedule of a subtree is affected by the labels that are live before and after the evaluation of the subtree. Therefore, we need to add live-in/out states as parameters to the computation of the optimal schedule of a subtree. In this section, we present an approach to optimally schedule a tree with data sharing, under the model of atomic evaluation of children; we defer the interleaving of computation across subtrees to Section 4.

For a node n, let uses(n) be the set of labels used in the subtree rooted at n. Figure 3a shows uses(n) for each internal node n. The live-in set for a node n, denoted by in(n), contains all labels that are live before the subtree rooted at n is evaluated. The live-out set is

  out(n) = (in(n) ∪ uses(n)) \ kill(n)          (2)

where kill(n) is the set of labels that have their last uses in the subtree rooted at n. Note that kill(n) is context-dependent, i.e., the set will vary depending on the order in which the node is evaluated. The kill sets can be computed on the fly by maintaining the number of occurrences of each label l in the current schedule, and comparing it with the total number of occurrences of l in the entire DAG.
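A sketch of that bookkeeping (the names are ours):

  #include <map>
  #include <set>
  #include <string>

  using Labels = std::set<std::string>;

  // total[l]: occurrences of label l in the whole DAG (computed once up front);
  // seen[l]:  occurrences of l in the schedule emitted so far.
  struct LabelCounts {
    std::map<std::string, int> total, seen;

    // Record one more use of l; returns true iff this was its last use,
    // i.e. l belongs to the kill set at this point of the schedule.
    bool use(const std::string& l) { return ++seen[l] == total[l]; }
  };

  // Live-out per equation (2): labels live before n, plus labels used in the
  // subtree rooted at n, minus those whose last use lies inside that subtree.
  Labels live_out(const Labels& in, const Labels& uses, const Labels& kill) {
    Labels out = in;
    out.insert(uses.begin(), uses.end());
    for (const std::string& l : kill) out.erase(l);
    return out;
  }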
We now show how to compute a modified Sethi-Ullman number, su′, for each node n, when provided with an "evaluation context" in terms of live-in and live-out labels. Consider a node n with some in and out state. Just before the evaluation of n begins, |in(n)| registers are live. Similarly, just after the evaluation of n finishes, |out(n)| registers will be live. During the evaluation of n, additional registers may become live, while some of the other live registers may be released. Now su′(n, in, out) represents the maximum number of registers that were simultaneously live at any point during the evaluation of n. We also define su′(π, in, out), where π is a sequence of the children nodes of n. This value will represent the maximum number of registers that were simultaneously live at any point during the evaluation of the sequence of children described by π. For simplicity we will use su′(n) instead of su′(n, in, out), but the definitions will use the live-in/out sets in(n) and out(n).

For a leaf node n ∈ N_mem with |in(n)| = α,

  su′(n) =  α + 1    if label(n) ∉ in(n)
            α        if label(n) ∈ in(n)

The first case implies that a new register must be reserved for the label of n if it is not already live before the evaluation of n. The second case is self-explanatory.

To compute su′ for a k-ary (binary or accumulation) node n with children n1 ... nk, we need to explore all k! evaluation orders of the children. Let π be any permutation of the children of n representing their evaluation order. Then su′(n) = min_π su′(π).

For the purpose of explanation, suppose the permutation π = ⟨n1, n2⟩ is one particular evaluation order for a binary node n with children n1, n2. To compute su′(π), first we determine the live-in and live-out sets for nodes in π as follows: in(n1) = in(n), in(n2) = out(n1), and out(n2) = out(n); here out(n1) is as defined in equation (2). This provides the required context to compute su′(n1) and su′(n2). Let mx = max_i{su′(n_i)}, so that mx equals the maximum number of simultaneously live registers at any time during the evaluation of π. Then,

  su′(π) =  1 + mx    if n_i ∈ N_mem and label(n_i) ∈ out(n)
            mx        otherwise

In case 2, if n1 ∈ N_op, or if n1 ∈ N_mem but label(n1) ∉ out(n), then the result of the computation, identified by the label of the node n, can reuse the register of n1 (similarly for n2). However, in case 1, where both n1 and n2 are leaf nodes in N_mem and both must be live after evaluating n, we need an additional register to hold the result.

For an accumulation node with k children, consider permutation π = ⟨n1, n2, ..., nk⟩. Suppose we have computed all su′(n_i) and let mx = max_i{su′(n_i)}. Then,

  su′(π) =  mx        if su′(n1) = mx and n1 ∈ N_op and su′(n_j) ≠ mx, 2 ≤ j ≤ k
            mx        if su′(n1) = mx and n1 ∈ N_mem and label(n1) ∉ out(n) and su′(n_j) ≠ mx, 2 ≤ j ≤ k
            1 + mx    otherwise

Just like the generalization of su(n) for accumulation nodes in Section 3.1, su′ = mx when the following two conditions hold: (1) n1 requires the maximum number of simultaneously live registers, and the rest of the nodes in π can be completely evaluated using the registers released by n1, and (2) the register holding n1 can be reused by n, i.e., either n1 ∈ N_op (case 2), or n1 is a leaf node that is not live beyond this point (case 1). In all other scenarios, we need mx + 1 registers (case 3).

Figure 4. Example: computing su′(π): (a) tree with context at root; (b) optimal schedule.

The computation of su′(n) for a tree without an evaluation context is shown in Figure 3b, and with an evaluation context is shown in Figure 4a. For the same tree, Figure 4b shows the permutation with minimum su′. In all three figures, the children of a node are ordered left-to-right, which defines the corresponding permutation.
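A sketch of the two base rules above, using the Node, Kind, and Labels types from the earlier sketches. The su′ values of the children are assumed to have been computed with the contexts in(n1) = in(n), in(n2) = out(n1), out(n2) = out(n); this is illustrative only.

  #include <algorithm>

  // Leaf rule: a new register is needed only if the label is not already live.
  int su_leaf(const Node* n, const Labels& in) {
    return static_cast<int>(in.size()) + (in.count(n->label) ? 0 : 1);
  }

  // One permutation <n1, n2> of a binary node: an extra register holds the
  // result only when neither child's register can be reused, i.e. both
  // children are memory nodes whose labels stay live in out(n).
  int su_binary_perm(const Node* n1, const Node* n2, int su1, int su2,
                     const Labels& out) {
    int mx = std::max(su1, su2);
    bool n1_stays = (n1->kind == Kind::Mem) && out.count(n1->label);
    bool n2_stays = (n2->kind == Kind::Mem) && out.count(n2->label);
    return (n1_stays && n2_stays) ? 1 + mx : mx;
  }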
In some cases, exhaustively exploring all permutations of the children may be unnecessary. In the tree of Figure 4a, there are two subtree operands of the accumulation node that share no data (the first and the third subtree operands in the left-to-right order). Therefore, even though the scheduling within those two subtrees may be influenced by the evaluation context, they do not influence each other's evaluation. Let passthrough denote the labels that are live both before and after the evaluation of node n: passthrough(n) = in(n) ∩ out(n). Then, for a k-ary node n, any two of its children n_i and n_j do not influence each other's evaluation if

  (uses(n_i) ∩ uses(n_j)) \ passthrough(n) = ∅          (3)

In such a scenario, for the node n, we can create maximal clusters of its children that share data, but the only data shared among clusters is the passthrough labels of n. For example, if children t1 and t2 share label l1, and children t2 and t3 share label l2, where l1, l2 are non-passthrough labels of the parent node, then {t1, t2, t3} must belong to the same cluster. We extend the intuition behind the Sethi-Ullman scheduling algorithm to establish that different clusters cannot influence each other's evaluation. Then, for each cluster c_i, we can independently compute su′(c_i) with in(c_i) = in(n). We only need to explore all permutations within the non-singleton clusters. We propose the following theorems to establish an evaluation order for the clusters.

Theorem 3.1. For k clusters c_i, 1 ≤ i ≤ k, such that |in(c_i)| ≤ |out(c_i)|, the one with larger su′(c_i) − |out(c_i)| will be prioritized for evaluation over others in the optimal schedule. In the special case where all the clusters have the same su′(c_i) − |out(c_i)|, they can be evaluated in any order without affecting MAXLIVE.

This result is a direct consequence of the Sethi-Ullman algorithm. The cluster with larger su′(c_i) − |out(c_i)| will release more registers, which can be reused by the next cluster. The special case too is a direct consequence of the Sethi-Ullman algorithm, where two sibling nodes with the same su can be evaluated in any order (case 4 of equation (1)).

Theorem 3.2. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| ≤ |out(c2)|, c1 must be evaluated before c2 in the optimal schedule.

We prove the result by contradiction. Suppose that c2 is evaluated before c1 in the optimal schedule. Since the schedule is optimal, su′(c2) ≥ su′(c1). Now we change this optimal schedule by moving the evaluation of c1 before c2. Evaluating c1 earlier will release |in(c1)| − |out(c1)| (i.e., ≥ 1) registers, which can then be used in the evaluation of c2. Based on the previous equations, su′(c2) will either decrease or remain the same, depending on whether the number of registers released by c1 is greater than, or equal to, 1. This modified schedule therefore either has the same, or a lower, su′ than the optimal schedule, making it an optimal schedule.

Theorem 3.3. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| > |out(c2)|, the one with smaller su′ must be prioritized for evaluation in the optimal schedule.

Again, we prove the result by contradiction. Suppose that su′(c1) < su′(c2), and c2 is scheduled before c1 in the optimal schedule. We change this schedule by moving the evaluation of c1 before that of c2. From Theorem 3.2, su′(c2) after this change will either remain the same, or decrease. Thus, su′ for the new schedule will either be the same, or reduce if su′(c2) was the maximum, making it an optimal schedule.

Based on these theorems, Algorithm 1 summarizes the evaluation of an optimal schedule for a tree with data sharing.

Algorithm 1: Schedule-Tree(n, in, out)
Input: A tree rooted at n with live-in/out contexts in and out
Output: An optimal schedule S for the tree
1  sched_cost ← ∅, S ← ∅;
2  C ← create_maximal_clusters(n);  (Sec. 3.2)
3  foreach cluster c in C do
4    if |c| is 1 then
5      in(c) ← in(n);
6      out(c) ← computed using equation (2);
7      sched_cost[c] ← su′(c);
8    else
9      compute in and out for each tree in c;  (Sec. 3.2)
10     π ← all permutations of the trees in c;
11     sched_cost[c] ← su′(π);
12 P ← sequence clusters using sched_cost and Thms. 3.1, 3.2, 3.3;
13 foreach subtree ts in the sequence described by P do
14   append the schedule for ts in S;
15 return S;

3.3 Heuristics for Tractability

For a non-singleton cluster c, the algorithm presented in Section 3.2 can become prohibitively expensive if |c| is large. For example, when |c| changes from 7 to 8, the permutations explored increase from 5040 to 40320. We now present some heuristics that trade off optimality for tractability, and a caching technique to further speed up the algorithm.

Pruning Heuristics  We begin by establishing, for any node n, the bounds on su′(n). When n is evaluated with non-empty contexts in and out, the bounds are:

  su′(n, ∅, ∅) ≤ su′(n, in, out) ≤ su′(n, ∅, ∅) + |in ∪ out|

We prove the lower bound by contradiction; the proof for the upper bound is omitted due to space constraints.

su′(n, ∅, ∅) ≤ su′(n, in, out): Assume to the contrary that su′(n, in, out) < su′(n, ∅, ∅). We will modify the schedule S corresponding to su′(n, in, out) as follows: prepend a stage to S that loads the labels ∈ in(n) into |in| registers, and make in(n) = ∅. Append a stage to S that stores all the labels ∈ out(n) from the respective registers into memory, and make out(n) = ∅. This modified schedule corresponds to su′(n, ∅, ∅), and hence su′(n, ∅, ∅) = su′(n, in, out).

With the bounds established, instead of exploring all permutations, we can sacrifice optimality and stop further exploration when we are close to the optimal schedule. We use a tunable parameter d, and stop trying the permutations any further when su′(n, in, out) − su′(n, ∅, ∅) ≤ d. For the experimental evaluation in Section 5, we set d to 1.
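A sketch of the pruned search over a cluster's child permutations, using the Node and Labels types from the earlier sketches. Here su_of_permutation is a stand-in for the su′(π) computation of Section 3.2; the real exploration also orders clusters using Theorems 3.1–3.3 and applies the partitioning heuristic described next.

  #include <algorithm>
  #include <climits>
  #include <numeric>
  #include <vector>

  // Hypothetical helper: cost of evaluating the subtrees in the given order.
  int su_of_permutation(const std::vector<Node*>& subtrees,
                        const std::vector<int>& order,
                        const Labels& in, const Labels& out);

  int best_su_for_cluster(const std::vector<Node*>& subtrees, const Labels& in,
                          const Labels& out, int lower_bound, int d) {
    std::vector<int> order(subtrees.size());
    std::iota(order.begin(), order.end(), 0);     // identity permutation
    int best = INT_MAX;
    do {
      best = std::min(best, su_of_permutation(subtrees, order, in, out));
      if (best - lower_bound <= d) break;         // close enough to optimal: stop
    } while (std::next_permutation(order.begin(), order.end()));
    return best;
  }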
For a cluster c with |c| > 8, we also apply a partitioning heuristic, which recursively partitions the subtrees in c into sub-partitions where each sub-partition can be of a maximum size p, with p < 8. The partitioning is based on either of the two criteria:

− on "label affinity": the subtrees that share the maximum labels are greedily assigned to the same sub-partition as long as the size of the sub-partition is less than p. Such partitioning is based on the notion that evaluating subtrees with maximum uses together will potentially reduce passthrough labels, and MAXLIVE.
− on "release potential": the subtrees that have the last uses of some labels are placed in a sub-partition, and that sub-partition is eagerly evaluated. This partitioning is based on the notion that the released registers can be reused by the next partition.

Once the sub-partitions are created, we only exhaustively explore all permutations of subtrees within a sub-partition. If the number of sub-partitions created is less than 8, then we also try all the permutations of the sub-partitions themselves. For example, if |c| = 8, and the partitioning heuristic creates two sub-partitions p1 and p2 of size 4 each, then our exploration space will be {p1, p2} and {p2, p1}, while exploring all 4! permutations of subtrees within p1 and p2 each – a total of 2 × 4! × 4! permutations instead of 8! permutations. We also let the user externally specify a threshold that upper-bounds the total number of permutations for a tree.

Memoization  For a node n, a lot of permutations of its children will differ in only a few positions. In such cases, we end up recomputing su′ for a child multiple times, even when the live-in/out context for the child remains unchanged. These recomputations can be avoided by a simple memoization, where for a node n, we map su′(n) as a function of a minimal context. The minimal context strips away labels that are not in uses(n), but are in passthrough(n). The su′(n) with minimal context can be suitably adjusted to get su′(n) with a different context that has some passthrough labels added to the minimal live-in/out. For example, suppose that su′(n) is 3 when the minimal in(n) = {a, b} and the minimal out(n) = ∅. Then su′(n) when evaluating it with in(n) = {a, b, c}, out(n) = {c} and c ∉ uses(n) will be 2+1 = 3, and the optimal schedule will remain unchanged. Memoization greatly reduces the total evaluation time, thereby enabling the exploration of a large number of permutations.
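A sketch of such a memoization table keyed by the minimal context, using the Node and Labels types from the earlier sketches. CtxKey, minimal, and su_prime are our illustrative names; the framework may canonicalize contexts differently.

  #include <map>
  #include <set>
  #include <string>
  #include <utility>

  using CtxKey = std::pair<const Node*, std::pair<Labels, Labels>>;

  // Hypothetical helper: the full context-sensitive su' computation.
  int su_prime(const Node* n, const Labels& in, const Labels& out);

  // Strip labels that the subtree never uses (they are passthrough for n).
  Labels minimal(const Labels& ctx, const Labels& uses) {
    Labels m;
    for (const std::string& l : ctx)
      if (uses.count(l)) m.insert(l);
    return m;
  }

  int su_prime_memo(const Node* n, const Labels& in, const Labels& out,
                    const Labels& uses, std::map<CtxKey, int>& cache) {
    CtxKey key{n, {minimal(in, uses), minimal(out, uses)}};
    auto it = cache.find(key);
    if (it == cache.end())
      it = cache.emplace(key, su_prime(n, key.second.first, key.second.second)).first;
    // Callers adjust this cached value for the passthrough labels stripped
    // from the query context, as described in the text (not shown here).
    return it->second;
  }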
3.4 Scheduling a DAG of Expression Trees

For each stencil statement that is mapped to an expression tree, Section 3.2 described a way to schedule it. This section ties everything together for a multi-statement stencil by describing how to schedule a DAG of expression trees.

Algorithm 2: Schedule-DAG(D, R)
Input: D: DAG of expression trees, R: per-thread register limit
Output: An optimal schedule S for D
1  D′ ← D; fusion_feasible ← true; tree_order ← ∅; S ← ∅;
2  while fusion_feasible do
3    foreach pair of transitive dep-free nodes ti, tj in D′ do
4      M ∪= compute_metric(D′, ti, tj);  (Sec. 3.4)
5    sort_descending(M);
6    (tp, tq, fusion_feasible) ← find_fusion_candidate(M);
7    fuse tp and tq;
8    update_dependence_edges(D′, tp, tq);
9  foreach node d in D′ do
10   append the tree sequence of d in tree_order;
11 split_versions ← create_split_versions(tree_order);
12 foreach split in split_versions do
13   S′ ← ∅;
14   foreach kernel k in split do
15     foreach tree t in k do
16       compute in and out for t;
17       append output of Schedule-Tree(t, in, out) to S′;
18   execute S′ after compiling it with register limit R;
19   S ← S′ if S′ is a faster schedule than S, or if S is ∅;
20 return S;

For optimal scheduling, one needs to explore all topological orders for the trees in the DAG, and then evaluate all the trees independently for each topological order. This may be practical if the size of the DAG is small. Otherwise, we must sacrifice optimality for tractability, and fix the evaluation order of the trees in the DAG before the trees are individually evaluated.

We use the greedy heuristic described by Rawat et al. [44] to fix the evaluation order of trees in the DAG. At each step, the heuristic tries to fix the evaluation order of two nodes in the DAG. We begin by computing, for each pair of transitive dependence-free trees p_i in the DAG, a metric M_i that encodes: (a) the number of labels shared between them, and (b) the number of common input arrays read by them. Among the computed M_i, we choose the one that has the highest non-zero value, and fix the evaluation order of its tree pair to be contiguous to enhance reuse proximity in the final schedule. The DAG is updated by fusing the nodes corresponding to the two trees into a "macro node". Post fusion, we update the dependence edges to and from the macro node, and recompute the metrics for the next step. The algorithm terminates when no more nodes can be fused.

Once the algorithm terminates, we perform a topological sort of the final DAG, and expand the DAG macro nodes to their tree sequences. For these ordered trees, we can generate code versions with different degrees of splits. One extreme would be a version where all the trees are in a single kernel (max-fuse), and another extreme would be a version where each tree is a distinct kernel (max-split) [8, 55].
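One simple way create_split_versions (Algorithm 2, line 11) could enumerate such versions is to cut the fixed tree order into progressively more kernels. The following sketch, using the Tree type from the earlier sketch, assumes that strategy; the framework's actual implementation may form splits differently, e.g., only at dependence boundaries.

  #include <algorithm>
  #include <vector>

  using Kernel = std::vector<Tree>;   // trees fused into one GPU kernel
  using Split  = std::vector<Kernel>; // one candidate code version

  std::vector<Split> create_split_versions(const std::vector<Tree>& ordered) {
    std::vector<Split> versions;
    const int n = static_cast<int>(ordered.size());
    for (int kernels = 1; kernels <= n; ++kernels) {   // max-fuse ... max-split
      Split split;
      const int chunk = (n + kernels - 1) / kernels;   // roughly equal groups
      for (int start = 0; start < n; start += chunk)
        split.emplace_back(ordered.begin() + start,
                           ordered.begin() + std::min(start + chunk, n));
      versions.push_back(split);
    }
    return versions;
  }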
For compute-intensive stencils with many-to-many reuse, a single kernel can have extremely high register pressure, sometimes causing spills despite allowing for the maximum permitted registers per thread. For such cases, performing kernel fission instead of generating a single kernel for the entire computation might improve performance. The split kernels will incur additional data transfers from global memory, but the register pressure per kernel will be much lower, giving the user an opportunity to further enhance register-level reuse via unrolling. Note that none of the production GPU compilers are capable of performing kernel fusion/fission optimizations. For each split version created, the tree sequence in it is evaluated using Algorithm 1. The returned schedule is the one that gives maximum performance. Algorithm 2 outlines the entire process.

4 Interleaving Expressions

At this point, we have a schedule for the entire DAG of trees, but with atomic evaluation enforced. However, interleaving within/across trees can be instrumental in reducing MAXLIVE. For example, in the unrolled stencil of Listing 1, there is no reuse within a stencil statement, but plenty of reuse across stencil statements. We will see later in Section 5 that relaxing the constraint of atomic evaluation, and performing interleaving, is imperative for performance in such stencils. A compiler optimization that performs some interleaving is common subexpression elimination (CSE). However, we require a more general interleaving that works at the granularity of common labels instead of common subexpressions.

Figure 5. Example: interleaving to reduce MAXLIVE: (a) original tree; (b) interleaving expressions.

For example, Figure 5a shows an expression tree where su′(S) is the largest, and the operands of the accumulation node are evaluated in order from left to right in the final schedule. Also, {c[i], b} ∉ uses(S). The fact that {c[i], b} ∈ passthrough(S) adds to su′(S). By slicing the expression (e[i]∗b)/c[i] and placing it after the expression b∗c[i] as shown in Figure 5b, c[i] and b will no longer be in in(S). Instead of those two labels, a temporary label holding the value of the sliced expression will be added to in(S), and hence su′(S) will reduce by 1. Note that this is not CSE, but a more general optimization aimed to reduce MAXLIVE. This slice-and-interleave optimization slices a target expression, and interleaves it with a source expression, so that su′ at a program point reduces. It subsumes CSE if the source and target expressions are the same.

We perform the slice-and-interleave optimization at two levels: (a) within an expression tree, where the source and target expressions belong to the same tree; and (b) across the expression trees in the DAG, where source and target expressions belong to different trees. For a chosen source expression e_s rooted at node n, we compute a set of labels, L_ilv, which is a union of all the labels that were observed in the schedule till now, with the labels that have a single occurrence in the DAG.

We now try to find a set of target expressions operating on just the labels from L_ilv. To find the target expressions, we start with minimal expressions, i.e., the simplest expressions whose operands are leaf nodes ∈ N_mem. Once we find a minimal expression e_m that operates only on the labels ∈ L_ilv, we find the root node r of e_m, and grow e_m to the expression rooted at parent(r). We continue to grow the expression till we have a maximal expression that only operates on the labels ∈ L_ilv. For each target expression thus discovered, we check if slicing and placing it between the source expression e_s and the subtree t_s immediately following e_s in the schedule decreases |in(t_s)|. If it does, then slice-and-interleave is performed.

Algorithm 3: slice-and-interleave(T, in, out)
Input: T: an input tree with schedule S and contexts in and out
Output: S: the schedule after applying slice-and-interleave
1  L_ilv ← ∅;
2  min_exprs ← sequence of minimal expressions extracted from S whose operands are leaf nodes;
3  foreach expression s in min_exprs do
4    L_ilv ← all the labels seen in the schedule till s ∪ labels with single occurrence in T;
5    foreach expression t appearing after s in min_exprs do
6      if t only operates on the labels in L_ilv then
7        t′ ← maximal expression obtained by growing t until it operates on the labels in L_ilv;
8        if live ranges reduced by placing t′ after s then
9          slice and place t′ after s in S;
10         move t after s in min_exprs;
11 return S;
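A sketch of the growing step (lines 6–7 of Algorithm 3), using the Node, Kind, and Labels types from the earlier sketches. The parent map and the helper names are ours.

  #include <map>
  #include <set>
  #include <string>

  // Collect the labels of all memory leaves under a node.
  void collect_labels(const Node* n, Labels& acc) {
    if (n->kind == Kind::Mem) acc.insert(n->label);
    for (const Node* c : n->children) collect_labels(c, acc);
  }

  bool only_uses(const Node* n, const Labels& L_ilv) {
    Labels used;
    collect_labels(n, used);
    for (const std::string& l : used)
      if (!L_ilv.count(l)) return false;
    return true;
  }

  // Grow a minimal target expression upward while the enclosing expression
  // still reads only labels from L_ilv; stop (backtrack) at the first parent
  // that does not, and return the maximal qualifying expression.
  Node* grow_target(Node* expr, const Labels& L_ilv,
                    const std::map<Node*, Node*>& parent) {
    Node* t = expr;
    auto it = parent.find(t);
    while (it != parent.end() && it->second != nullptr &&
           only_uses(it->second, L_ilv)) {
      t = it->second;              // e.g. e[i]*b  ->  (e[i]*b)/c[i]
      it = parent.find(t);
    }
    return t;
  }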
Illustrative Example  Let b∗c[i] be the source expression in the tree of Figure 5a. One of the explored target expressions will be e[i]∗b, since it only uses nodes ∈ L_ilv. Now we try to grow the target expression by changing the root from ∗ to /, making (e[i]∗b)/c[i] the new target expression. All the labels used in the grown target expression also belong to L_ilv. A further attempt to grow will change the target expression to ((e[i]∗b)/c[i])∗f[i]. However, f[i] ∉ L_ilv. Therefore, we backtrack and finalize (e[i]∗b)/c[i] as a target expression, since it is the maximal expression with all the labels in L_ilv. Placing the target expression after the source expression decreases in(S) by 1. Therefore, we perform the slice-and-interleave optimization. Algorithm 3 outlines the slice-and-interleave algorithm that tries out different source expressions, and continuously finds the target expressions within the tree to interleave, in order to reduce the live ranges. The slice-and-interleave across the trees in a DAG is similar.

5 Experimental Evaluation

Our framework parses stencil statements written in a subset of C: the array access indices in the stencil statements must be an affine function of the surrounding loop iterators, program parameters, and constants; loop iterators and parameters must be immutable in the stencil statements. The framework supports auto-unrolling along different dimensions to expose spatial reuse across stencil statements.

To ensure a tight coupling, several prior efforts on guiding register allocation or instruction scheduling were implemented as a compiler pass in research/prototype compilers [7, 16, 20, 41, 45], or open-source production compilers [29, 46]. However, like some other recent efforts [6, 28, 50], we implement our reordering optimization at source level for the following reasons: (1) it allows external optimizations for closed-source compilers like NVCC; (2) it allows us to perform transformations like exposing FMAs using operator distributivity, and performing kernel fusion/fission, which can be performed more effectively and efficiently at source level; and (3) it is input-dependent, not machine- or compiler-dependent – with an implementation coupled to compiler passes, it would have to be re-implemented across compilers with different intermediate representations. Our framework massages the input to a form that is more amenable to further optimizations by any GPU compiler, and we use appropriate compilation flags whenever possible to ensure that our reordering optimization is not undone by the compiler passes.

We evaluate our framework for the benchmarks listed in Table 1 on a Tesla K40c GPU (peak double-precision performance 1.43 TFLOPS, peak bandwidth 288 GB/s) with NVCC-8.0 [38] and the LLVM-5.0.0 compiler (previously gpucc [57]). The first five benchmarks are stencils typically used in iterative processes such as solving partial differential equations [26]. The remaining three are representative of complex stencil operations extracted from applications. hypterm is a routine from the ExpCNS Compressible Navier-Stokes mini-application from DoE [17]; the last two stencils are from the Geodynamics Seismic Wave SW4 application code [51]. For each benchmark, the original version is as written by application developers without any loop unrolling; the unrolled version has the loops unrolled explicitly; and the reordered version is the output from our code generator. On an Intel i7-4770K processor, the code generator generated each reordered version in under 4 seconds. When the net unrolling factor is limited to 4, the size of each reordered version is under 600 lines of code. The read-only arrays are annotated with the restrict keyword in all the versions to allow efficient loads via the texture pipeline.

Table 1. Benchmark characteristics
Benchmark   | N     | UF | k | F   | R  | A   | U
2d25pt      | 8192² | 4  | 2 | 33  | 2  | 104 | 44
2d64pt      | 8192² | 4  | 4 | 73  | 2  | 260 | 92
2d81pt      | 8192² | 4  | 4 | 95  | 2  | 328 | 112
3d27pt      | 512³  | 4  | 1 | 30  | 2  | 112 | 58
3d125pt     | 512³  | 4  | 2 | 130 | 2  | 504 | 204
hypterm     | 300³  | 1  | 4 | 358 | 13 | 310 | 152
rhs4th3fort | 300³  | 1  | 2 | 687 | 7  | 696 | 179
derivative  | 300³  | 1  | 2 | 486 | 10 | 493 | 165
N: domain size, UF: unrolling factor, k: stencil order, F: FLOPs per point, R: # arrays, A: total elements accessed per thread, U: unique elements accessed per thread.

All the stencils are double-precision, compiled with the NVCC flags '--use_fast_math -Xptxas "-dlcm=ca"', and the LLVM flags '-O3 -ffast-math -ffp-contract=fast'. Since none of the versions use shared memory, using 'dlcm=ca' for NVCC enhances performance by caching the global memory accesses at L1. However, we notice no discernible performance difference when compiling without the '--use_fast_math' flag in NVCC. We explore different instruction schedulers implemented in LLVM (default, list-hybrid, and list-burr) for the original and unrolled code, and report the best performance. To minimize instruction reordering for our reordered code, we use LLVM's default instruction scheduler, and do not use the -ffast-math option during compilation. We test all the versions against a standard C implementation of the benchmarks for correctness: the difference in each computed output value with that of the C implementation must be less than a tolerance of 1E-5.

Loop Unrolling  For the experiments, the iterative kernels were unrolled along a single dimension to expose spatial reuse. Loop unrolling offers the compiler an opportunity to exploit ILP, but scheduling independent instructions contiguously may increase register pressure. Consider an unrolled version of 2d25pt, compiled with 32 registers. From Table 1, it is clear that the unrolled code has a high degree of reuse. Listing 2 shows the SASS (Shader ASSembler) snippet generated using NVCC for the unrolled version of 2d25pt after register allocation. The instructions not relevant to the discussion are omitted in Listings 2 and 3, leading to non-contiguous line numbers. The lines highlighted in red show the instructions involving the same memory location – line 1 loads a value from global memory into register R4, and spills it in line 2 without using R4 in any of the intermediate instructions. Such wasteful spills are a characteristic of register-constrained codes. The same value is reloaded from local memory into R4 in line 4, and R4 is subsequently used in lines 5 and 8. The uses of R4 are placed far apart in the SASS, adding to the register pressure. Interspersed with these instructions are the load (line 3) and subsequent uses of register R12. The interleaving increases ILP, but the uses of R12 are also placed very far apart. A better schedule can perhaps achieve the same ILP with less register pressure and fewer spills.
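For reference, the unrolled 2d25pt kernel discussed above has roughly the following per-thread shape. This is only a sketch with assumed names (in, out, w, i, j); per Table 1 the real generated code unrolls by 4 and touches 104 array elements per thread, of which 44 are unique.

  // Sketch: order-2, 25-point 2D stencil, unrolled by 4 along i.
  // The four 5x5 input neighborhoods overlap, which is the register-level
  // reuse that the schedules discussed above either exploit or spill away.
  for (int u = 0; u < 4; ++u) {
    double acc = 0.0;
    for (int jj = -2; jj <= 2; ++jj)
      for (int ii = -2; ii <= 2; ++ii)
        acc += w[jj+2][ii+2] * in[j+jj][i+u+ii];
    out[j][i+u] = acc;
  }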
Listing 2. SASS snippet for the unrolled code
1  106 /*0328*/  @P0 LDG.E.64 R4, [R24];
2  144 /*0458*/  @P0 STL [R1+0x10], R4;
3  332 /*0a38*/  @P0 LDG.E.64 R12, [R8];
4  350 /*0ac8*/  @P0 LDL.LU R4, [R1+0x10];
5  354 /*0ae8*/  @P0 DADD R16, R16, R4;
6  358 /*0b08*/  @P0 DADD R16, R12, R10;
7  376 /*0b98*/  @P0 DFMA R6, R12, c[0x2][0x40], R14;
8  374 /*0b88*/  @P0 DADD R16, R6, R4;
9  436 /*0d78*/  @P0 DADD R12, R12, R24;

Listing 3. SASS snippet for the reordered code
1  163 /*04f0*/  @P0 DFMA R14, R22, c[0x2][0x8], R14;
2  164 /*04f8*/  @P0 DFMA R8, R22, c[0x2][0x18], R8;
3  166 /*0508*/  @P0 DFMA R12, R22, c[0x2][0x30], R12;
4  175 /*0550*/  @P0 DFMA R8, R20, c[0x2][0x30], R8;
5  176 /*0558*/  @P0 DFMA R12, R20, c[0x2][0x18], R12;
6  178 /*0568*/  @P0 DFMA R8, R30, c[0x2][0x38], R8;
7  183 /*0590*/  @P0 DFMA R22, R20, c[0x2][0x8], R10;
8  184 /*0598*/  @P0 DFMA R10, R20, c[0x2][0x18], R14;
9  187 /*05b0*/  @P0 DFMA R16, R30, c[0x2][0x20], R12;
10 191 /*05d0*/  @P0 DFMA R10, R30, c[0x2][0x20], R10;

Listing 3 shows the SASS snippet for the reordered code generated by our code generator. Using operator distributivity, the multiplication of the coefficient with the additive contributions is converted by our preprocessing pass into fused multiply-adds. Notice that all the uses of register R20 (highlighted in red) are tightly coupled. The same holds for registers R22, R30, and the remaining instructions. Independent FMAs are scheduled together without increasing the MAXLIVE. This reduces register pressure without compromising ILP. Therefore, even though the unrolled version performs fewer FLOPs than the reordered version, we incur fewer spill LDL/STL instructions per thread (101 for unrolled vs. 7 for reordered).

For the 3d125pt stencil, Table 2 shows some profiling metrics gathered by Nvprof with NVCC. The texture throughput for the original code indicates that the stencil performance is limited by the texture cache bandwidth. Loop unrolling halves the accesses to texture cache and the executed load instructions, but results in a significant drop in IPC due to lowered occupancy. To better expose reordering opportunities after unrolling, the preprocessing pass of the reordering framework exploits operator distributivity and converts all the contributions in an individual statement to FMA operations. Therefore, instead of 130 FLOPs per stencil point, the reordered version performs 250 FLOPs. As measured by Nvprof, we incur a 2× increase in floating point operations, but achieve significant reuse in registers at a higher occupancy, which consequently improves the IPC and execution time.

Table 2. Metrics for 3d125pt for tuned configurations
Version   | reg | IPC  | inst. exec. | ld/st exec. | FLOPs    | L2 reads | tex txn | tex GB/s
Original  | 128 | 1.76 | 2.74E+9     | 5.28E+8     | 1.73E+10 | 5.27E+8  | 4.19E+9 | 899.53
Unrolled  | 255 | 1.12 | 1.36E+9     | 2.14E+8     | 1.72E+10 | 2.94E+8  | 1.67E+9 | 457.23
Reordered | 64  | 2.00 | 1.41E+9     | 2.14E+8     | 3.34E+10 | 1.55E+8  | 1.67E+9 | 791.16

Register Pressure Sensitivity  In GPUs, the number of registers per thread can be varied at compile time by trading off the occupancy. Many auto-tuning efforts have recently been proposed to that end [24, 32]. Table 4 shows the performance, and the local memory transactions reported by Nvprof, with varying register pressure. Due to space constraints, we only present the data for the NVCC compiler. We make the following observations: (a) our optimization strategy reduces the register pressure for all the thread configurations; (b) increasing registers per thread for codes exhibiting very high spills results in better performance, e.g., 8× for rhs4th3fort; and (c) for low spills, better performance can be achieved by either increasing occupancy (e.g., reordered code for 3d125pt and hypterm), or maximizing registers per thread (e.g., all the codes for rhs4th3fort).

Finding a right balance between register pressure and occupancy is non-trivial, and an active research field [24, 32, 53, 58]. We perform a simple auto-tuning by varying the tile sizes by powers of 2, and varying registers per thread [32]. The best performance in GFLOPS for the auto-tuned code with the NVCC and LLVM compilers is shown in Figure 6. Unlike the case with 32 and 64 registers per thread, the unrolled code outperforms the original code for all benchmarks, highlighting the importance of loop unrolling and register-level reuse. Our reordering optimization improves the performance by (a) producing a code version that uses fewer registers, and hence can achieve higher occupancy; and (b) helping expose and schedule independent FMAs together for simple accumulation stencils, thereby hiding latency.

Kernel Fission  From Table 1, we select the last three multi-statement, compute-intensive stencils, for which we anticipate a high volume of spills in the max-fuse form, and expect kernel fission to be beneficial. For these three stencils, we generate versions with varying degrees of splits (Section 3.4). Some splits require promoting the storage from scalars to global arrays, while others require recomputations due to dependence edges in the DAG. Table 3 shows the Nvprof metrics with NVCC for two reordered codes: a version with maximum fusion (max-fuse), and a version with split kernels.

Table 3. Metrics for reordered max-fuse vs. split versions
Metrics      | rhs4th3fort           | hypterm               | derivative
             | max-fuse  | split-3   | max-fuse  | split-3   | max-fuse  | split-2
Inst. exec.  | 8.52E+9   | 8.25E+8   | 7.48E+8   | 7.71E+8   | 8.79E+8   | 8.96E+8
IPC          | 1.07      | 1.11      | 0.97      | 1.06      | 1.02      | 1.14
DRAM reads   | 9.07E+7   | 1.65E+8   | 1.57E+8   | 1.77E+8   | 1.34E+8   | 2.47E+8
ld/st exec.  | 1.55E+8   | 1.08E+8   | 1.27E+8   | 1.46E+8   | 1.45E+8   | 1.30E+8
FLOPs        | 1.73E+10  | 1.81E+10  | 9.66E+9   | 9.36E+9   | 1.28E+10  | 1.34E+10
tex txn.     | 1.11E+9   | 8.24E+8   | 9.72E+8   | 1.06E+9   | 1.14E+9   | 1.01E+9
L2 read txn. | 4.64E+8   | 3.79E+8   | 6.52E+8   | 5.90E+8   | 4.97E+8   | 4.51E+8
GFLOPS       | 237.16    | 274.52    | 140.71    | 155.02    | 168.27    | 182.83

Note that in each case, even though the DRAM reads increase going from the max-fuse to the split version, the IPC also increases. This is because the register pressure per kernel is much lower in the split version, and hence we can unroll the computation to further exploit register-level reuse. This increase in register-level reuse is reflected in the reduced