Table Of ContentHigh-Performance High-Order Simulation of Wave and
Plasma Phenomena
by
AndreasKlo¨ckner
Dipl.-Math. techn.,Universita¨tKarlsruhe(TH);Karlsruhe,Germany,2005
M.S.,BrownUniversity;Providence,RI,2006
Adissertationsubmittedinpartialfulfillmentofthe
requirementsforthedegreeofDoctorofPhilosophy
inTheDivisionofAppliedMathematicsatBrownUniversity
PROVIDENCE,RHODEISLAND
May2010
(cid:13)c Copyright2010 byAndreasKlo¨ckner
ThisdissertationbyAndreasKlo¨cknerisacceptedinitspresentform
byTheDivisionofAppliedMathematicsassatisfyingthe
dissertationrequirementforthedegreeofDoctorofPhilosophy.
Date
JanSickmannHesthaven,Ph.D.,Advisor
RecommendedtotheGraduateCouncil
Date
JohnnyGuzma´n,Ph.D.,Reader
Date
Chi-WangShu,Ph.D.,Reader
ApprovedbytheGraduateCouncil
Date
SheilaBonde,DeanoftheGraduateSchool
iii
Vitae
Biographical Information
Birth August5th,1977
Konstanz,Germany
Education
2005–2010 Ph.D.inAppliedMathematics(inprogress)
DivisionofAppliedMathematics,BrownUniversity,Providence,
RI
Advisor: JanHesthaven
2005 DiplomdegreeinAppliedMathematics(Technomathematik)
Institut fu¨r Angewandte Mathematik, Universita¨t Karlsruhe, Ger-
many
Advisor: WillyDo¨rfler
2001–2002 ExchangeStudent,DepartmentofMathematics
UniversityofNorthCarolinaatCharlotte,Charlotte,NC
2000 VordiplominComputerScience,Universita¨tKarlsruhe,Germany
Experience
6/2006–9/2006 J.WallaceGivensResearchAssociate
MathematicsandComputerScienceDiv.,ArgonneNat’l
Laboratory,Illinois
Worked on high-order unstructured electromagnetic simulation
of particle accelerators (with Paul Fischer, Misun Min, and col-
leaguesatANL’sAdvancedPhotonSource).
iv
2/2005–7/2005 ResearchAssociate(WissenschaftlicherMitarbeiter)
Institutfu¨rAngewandteMathematik,Universita¨tKarlsruhe,
Germany
Workedonvariousextensionsofmythesisresearch(withWilly
Do¨rfler).
5/2002–11/2002 ResearchIntern
DaimlerChryslerResearch&Technology,PaloAlto,CA
Worked on driver stress detection, precision GPS, and software
infrastructure(withStefanSchro¨dl).
Publications
2010 ViscousShockCapturingwithanExplicitlyTime-Stepped
DiscontinuousGalerkinMethod.
AK,T.Warburton,J.S.Hesthaven. Inpreparation.
2009 PyCUDA:GPURun-TimeCodeGenerationfor
High-PerformanceComputing.
AK, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih.
Submitted, available at http://arxiv.org/abs/0911.
3456.
2009 NodalDiscontinuousGalerkinMethodsonGraphicsProcessors.
AK,T.Warburton,J.Bridge,J.S.Hesthaven. JournalofCompu-
tationalPhysics,Volume228,Issue21,20November2009.
2009 DeterministicNumericalSchemesfortheBoltzmannEquation.
A.Narayan,AK.BrownUniversityScientificComputingTechnni-
calReport2009-39.
2005 OntheComputationofMaximallyLocalizedWannierFunctions.
DiplomThesis,Universita¨tKarlsruhe,Germany.
v
Acknowledgments
First and foremost, the support of my advisor Jan Hesthaven has been the cornerstone
of my working life in the past five years. He was a source of questions, of answers, of
inspiration, he encouraged me to be bold in the scientific questions I pursue, all while
givingmegreatfreedominfollowingmyinterests. Hehasalsopatientlyputupwiththe
thingsthatturnedoutnottobesosmartinhindsight. Beyondscience,hehasbeenarole
modelanda tremendousinfluenceonmy lifeasawhole. Iconsidermyselflucky tohave
hadhimasamentor.
Overtheyears,IhaveworkedverycloselywithTimWarburtonatRiceUniversityon
manyofthetopicsthatthisthesisdiscusses. Hisgenerosity,help,andinsighthavebenefited
meinmanyways.
Both of the above, along with Chi-Wang Shu and Johnny Guzma´n have graciously
agreedto serveon myPhD committee, spenttime thinkingabout mywork, andprovided
invaluablefeedback.
Throughoutmygraduatestudies,Ihavehadthehonorofworkingonvariousprojects
withalargeanddiversegroupofpeople. Theirinsights,commentaryandencouragement,
sharedinmanyconversations,wereandcontinuetobeagreatassettomyscientificlife.
ThegraduatestudentandpostdoccommunityatBrown’sDivisionofAppliedMathe-
maticsis agreat crowdin whichtogrow upacademically. Many ofyouhavebecomemy
vi
friends,andIhopewewillbeabletostaycloseevenaslifescattersusacrosstheglobe.
Nvidia Corporation have been very generous with equipment and travel support and
were instrumental in initiating, furthering and publicizing the GPU-based part of my
research.
Manycontributorsaroundtheworldhavecreatedtheopen-sourcesoftwareandtools
on which my work has crucially depended. This notably includes the communities that
haveformedaroundmyvariousprojects.
Partsofthisthesisarebasedontwopublications,[Klo¨ckneretal.,2009b]and[Klo¨ckner
etal.,2009a]. Mycoauthorshavecontributedconsiderablytobotharticlesthroughtheir
ideas,suggestions,andfeedback.
Last,butbynomeansleast,myparentsBinaandHeinrichKlo¨cknerhave,throughout
myentirelife,givenmetheirunconditionalsupport,advice,andlove.
Thankyou,allofyou.
Some of the computational meshes used in this work were generated using Triangle
[Shewchuk,1996]andTetGen[SiandGaertner,2005]. ThesurfacemeshforFigure5.10
originatesintheFlightGearflightsimulatorandwasprocessedusingBlenderandMeshLab,
atooldevelopedwiththesupportoftheEpochEuropeanNetworkofExcellence.
vii
Abstract of “High-Performance High-Order Simulation of Wave and Plasma Phenomena”
byAndreasKlo¨ckner,Ph.D.,BrownUniversity,May2010
Thisthesispresentsresultsaimingtoenhanceandbroadentheapplicabilityofthediscon-
tinuousGalerkin(“DG”)methodinavarietyofways. DGwaschosenasafoundationfor
this work because it yields high-order finite element discretizations with very favorable
numericalpropertiesforthetreatmentofhyperbolicconservationlaws.
Inafirstpart,IexamineprogressthatcanbemadeonimplementationaspectsofDG.
In adapting the method to mass-market massively parallel computation hardware in the
formofgraphicsprocessors(“GPUs”),Iobtainanincreaseincomputationperformanceper
unitofcostbymorethananorderofmagnitudeoverconventionalprocessorarchitectures.
KeytothisadvanceisarecipethatadaptsDGtoavarietyofhardwarethroughautomated
self-tuning. I discuss new parallel programming tools supporting GPU run-time code
generation which are instrumental in the DG self-tuning process and contribute to its
reaching application floating point throughput greater than 200 GFlops/s on a single
GPU andgreater than 3 TFlops/son a 16-GPUcluster in simulationsof electromagnetics
problemsinthreedimensions. Ifurtherbrieflydiscussthesolverinfrastructurethatmakes
thispossible.
Inthesecondpartofthethesis,Iintroduceanumberofnewnumericalmethodswhose
motivationispartlyrootedinthe opportunitycreatedbyGPU-DG:First,Iconstruct and
examine a novel GPU-capable shock detector, which, when used to control an artificial
viscosity,helps stabilize DGcomputationsingas dynamicsandanumber ofotherfields.
Second,Idescribemypursuitofamethodthatallowsthesimulationofrarefiedplasmas
using a DG discretization of the electromagnetic field. Finally, I introduce new explicit
multi-rate time integrators for ordinary differential equations with multiple time scales,
withafocusonapplicabilitytoDGdiscretizationsoftime-dependentproblems.
Contents
Vitae iv
Acknowledgments vi
1 Introduction 1
1.1 AboutthisThesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 TheScientificMethodandtheComputationalExperiment . . . . . . . . . 3
1.3 AnArgumentforHybridCodes . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 AssemblingaSetofTools . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 ReproducibilityforResultsinthisThesis . . . . . . . . . . . . . . . . . . 7
2 Preliminaries 10
2.1 TheDiscontinuousGalerkinMethod . . . . . . . . . . . . . . . . . . . . 11
2.1.1 ImplementingDG . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 GPUHardware: ABriefIntroduction . . . . . . . . . . . . . . . . . . . . 15
2.2.1 SpecificsofNvidiahardware . . . . . . . . . . . . . . . . . . . . 18
3 ACode-GeneratingDiscontinuousGalerkinSolver 21
3.1 OntheDesignofaDiscontinuousGalerkinPDESolver . . . . . . . . . . 22
3.2 ALanguageforDiscontinuousGalerkinMethods . . . . . . . . . . . . . 26
3.2.1 FluxesandFlux-LocalBinding . . . . . . . . . . . . . . . . . . . 30
3.2.2 CommonSubexpressionElimination . . . . . . . . . . . . . . . . 31
3.2.3 AnExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 TheProcessingPipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 TypeInferenceandOperatorSpecialization . . . . . . . . . . . . 35
3.3.2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Target-SpecificRewriting . . . . . . . . . . . . . . . . . . . . . . 37
3.4 TheVirtualMachine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 TheCompilationStep . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 TheExecutionModel . . . . . . . . . . . . . . . . . . . . . . . . 40
viii
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 CodeGenerationonGraphicsProcessors 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 GPUSoftwareCreation . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 ProblemsSolvedbyGPURun-TimeCodeGeneration . . . . . . . . . . . 51
4.3.1 AutomatedTuning . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 TheCostofFlexibility . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 High-PerformanceAbstractions . . . . . . . . . . . . . . . . . . 54
4.3.4 GPUsandtheNeedforFlexibility . . . . . . . . . . . . . . . . . 56
4.4 PyCUDA:AScripting-BasedApproachtoGPURTCG . . . . . . . . . . 57
4.4.1 AbstractionsinPyCUDA . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 CodeGenerationwithPyCUDA . . . . . . . . . . . . . . . . . . 63
4.4.3 PyOpenCL:OpenCLandGPURTCG . . . . . . . . . . . . . . . 66
4.5 SuccessfulApplications . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 DiscontinuousGalerkinMethodsonGraphicsProcessors 70
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 DGontheGPU:Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 DGontheGPU:Implementation . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 HowtoreadthisSection . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 FluxLifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.3 FluxExtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.4 Element-LocalDifferentiation . . . . . . . . . . . . . . . . . . . 89
5.4 ExperimentalResults . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.1 FurtherResults: DoublePrecision,DistributedComputation . . . 105
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 ViscousShockCapturinginaTime-ExplicitDiscontinuousGalerkinMethod111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2 BasicDesignConsiderations . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 ApplicationsandEquations . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.1 AdvectionEquation . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.2 Second-OrderWaveEquation . . . . . . . . . . . . . . . . . . . 121
6.3.3 Burgers’Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.4 Euler’sEquationsofGasDynamics . . . . . . . . . . . . . . . . 123
6.4 ASmoothness-EstimatingDetectorfortheSelectiveApplicationofArtifi-
cialViscosity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.1 DetectionMethodsintheLiterature . . . . . . . . . . . . . . . . 124
6.4.2 EstimatingSolutionSmoothness . . . . . . . . . . . . . . . . . . 127
6.4.3 AmbiguitiesinTwoandMoreDimensions . . . . . . . . . . . . . 140
6.5 FromSmoothnesstoViscosity . . . . . . . . . . . . . . . . . . . . . . . 142
6.5.1 ScalingtheViscosity . . . . . . . . . . . . . . . . . . . . . . . . 142
6.5.2 SmoothingtheViscosity . . . . . . . . . . . . . . . . . . . . . . 146
ix
Description:dissertation requirement for the degree of Doctor of Philosophy. Date. Jan Sickmann Worked on high-order unstructured electromagnetic simulation of particle