Design Insights for MapReduce from Diverse
Production Workloads
Yanpei Chen
Sara Alspaugh
Randy H. Katz
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2012-17
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.html
January 25, 2012
Copyright © 2012, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Acknowledgement
The authors are grateful for the feedback from our colleagues at UC
Berkeley AMP Lab, Cloudera, and Facebook. We especially appreciate the
inputs from Anthony Joseph, David Zats, Matei Zaharia, Jolly Chen, Todd
Lipcon, Aaron T. Myers, John Wilkes, and Srikanth Kandula. This research
is supported in part by AMP Lab
(https://amplab.cs.berkeley.edu/sponsors/), and the DARPA- and SRC-
funded MuSyC FCRP Multiscale Systems Center.
Design Insights for MapReduce from Diverse Production Workloads

Yanpei Chen, Sara Alspaugh, Randy Katz
University of California, Berkeley

Abstract

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.

1 Introduction

The MapReduce computing paradigm is gaining widespread adoption across diverse industries as a platform for large-scale data analysis, accelerated by open-source implementations such as Apache Hadoop. The scale of such systems and the depth of their software stacks create complex challenges for the design and operation of MapReduce clusters. A number of recent studies looked at MapReduce-style systems within a handful of large technology companies [19, 8, 39, 33, 30, 10]. The stand-alone datasets used in these studies mean that the associated design insights may not generalize across use cases, an important concern given the emergence of MapReduce users in different industries [5]. Consequently, there is a need to develop systematic knowledge of MapReduce behavior both at established users within technology companies and at recent adopters in other industries.

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera's customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data, covering over two million jobs that moved approximately 1.6 exabytes spread over 5000 machines (Table 3). They are collected using standard tools in Hadoop, and facilitate comparison both across workloads and over time.

Our analysis reveals a range of workload behavior. Key similarities between workloads include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. Key differences include the frequency of data re-access, the burstiness of the workload, the balance between computation and data bandwidth, the analytical frameworks used on top of MapReduce, and the multi-dimensional descriptions of common job categories. These results provide empirical evidence to inform the design and evaluation of schedulers, caches, optimizers, and networks, in addition to revealing insights for cluster provisioning, benchmarking, and workload monitoring.

Table 1 summarizes our findings and serves as a roadmap for the rest of the paper. Our methodology extends [19, 18, 17], and organizes the analysis according to three conceptual aspects of a MapReduce workload: data, temporal, and compute patterns. Section 3 looks at data patterns. This includes aggregate bytes in the MapReduce input, shuffle, and output stages, the distribution of per-job data sizes, and the per-file access frequencies and intervals. Section 4 focuses on temporal patterns. It analyzes variation over time across multiple workload dimensions, quantifies burstiness, and extracts temporal correlations between different workload dimensions. Section 5 examines compute patterns. It analyzes the balance between aggregate compute and data size, the distribution of task sizes, and the breakdown of common job categories both by job names and by multi-dimensional job behavior. Our contributions are:
• Analysis of seven MapReduce production workloads from five industries totaling over two million jobs,
• Derivation of design and operational insights, and
• Methodology of analysis and the deployment of a public workload repository with workload replay tools.
We invite other MapReduce researchers and users to add to our workload repository and derive additional insights using the data in this paper. We hope that such public data will allow researchers and cluster operators to better understand and optimize MapReduce systems.
Section | Observations | Implications/Interpretations
3.1 | Input data forms 51-77% of all bytes, shuffle 8-35%, output 8-23%. | Need to re-assess the sole focus on shuffle-like traffic for datacenter networks.
3.2 | The majority of job input sizes are MB-GB. | TB-scale benchmarks such as TeraSort are not representative.
3.2 | For the Facebook workload, over a year, input and shuffle sizes increase by over 1000×, but output size decreases by roughly 10×. | Growing customers and raw data size over time, distilled into possibly the same set of metrics.
3.3 | For all workloads, data accesses follow the same Zipf distribution. | Tiered storage is beneficial. Uncertain cause for the common patterns.
3.3 | 90% of jobs access small files that make up 1-16% of stored bytes. | Cache data if file size is below a threshold.
3.3 | Up to 78% of jobs read input data that is recently read or written by other jobs. 75% of re-accesses occur within 6 hrs. | LRU or threshold-based cache eviction viable. Tradeoffs between eviction policies need further investigation.
4.1 | There is a high amount of noise in job submit patterns in all workloads. | Online prediction of data and computation needs will be challenging.
4.1 | Daily diurnal pattern evident in some workloads. | Human-driven interactive analysis, and/or daily automated computation.
4.2 | Workloads are bursty, with peak-to-median hourly job submit rates of 9:1 or greater. | Scheduling and placement optimizations essential under high load. Increase utilization or conserve energy during inherent workload troughs.
4.2 | For Facebook, over a year, peak-to-median job submit rates decrease from 31:1 to 9:1, while more internal organizations use MapReduce. | Multiplexing many workloads helps smooth out burstiness. However, utilization remains low.
4.2 | Workload intensity is multi-dimensional (jobs, I/O, task-times, etc.). | Need multi-dimensional metrics for burstiness and other time series properties.
4.3 | Avg. temporal correlation between job submit & data size is 0.21; for job submit & compute time it is 0.14; for data size & compute time it is 0.62. | Schedulers need to consider metrics beyond the number of active jobs. MapReduce workloads remain data- rather than compute-centric.
5.1 | Workloads on avg. spend 68% of time in map, 32% of time in reduce. | Optimizing read latency/locality should be a priority.
5.1 | Most workloads have task-seconds per byte of 1×10^-7 to 7×10^-7; one workload has task-seconds per byte of 9×10^-4. | Build balanced systems for each workload according to task-seconds per byte. Develop benchmarks to probe a given value of task-seconds per byte.
5.2 | For all workloads, task durations range from seconds to hours. | Need mechanisms to choose uniform task sizes across jobs.
5.3.1 | <10 job name types make up >60% of all bytes or all compute time. | Per-job-type prediction and manual tuning are possible.
5.3.1 | All workloads are dominated by 2-3 frameworks. | Meta-schedulers need to multiplex only a few frameworks.
5.3.2 | Jobs touching <10 GB of total data make up >92% of all jobs, and often have <10 tasks or even a single task. | Schedulers should design for small jobs. Re-assess the priority placed on addressing task stragglers.
5.3.2 | Five out of seven workloads contain map-only jobs. | Not all jobs benefit from network optimizations for shuffle.
5.3.2 | Common job types change significantly over a year. | Periodic re-tuning and re-optimization is necessary.

Table 1: Summary of observations from the workloads, and associated design implications and interpretations.
2 Background

This section reviews key previous work on workload characterization in general and highlights our advances. In addition, we describe the traces we used.

2.1 Prior Work

The desire for thorough system measurement predates the rise of MapReduce. Workload characterization studies have been invaluable in helping designers identify problems, analyze causes, and evaluate solutions. Table 2 summarizes some studies founded on understanding realistic system behavior. They fall into the categories of network systems [29, 34, 36, 16, 24, 8], storage systems [35, 16, 32, 13, 19, 10], and large-scale data centers running computation paradigms including MapReduce [8, 39, 33, 10, 30, 11]. Network and storage subsystems are key components for MapReduce. This paper builds on these studies.

A striking trend emerges when we order the studies by trace date. In the late 1980s and early 1990s, measurement studies often captured system behavior for only one setting [35, 29]. The stand-alone nature of these studies is due to the emerging measurement tools of the time, which created considerable logistical and analytical challenges. Studies in the 1990s and early 2000s often traced the same system under multiple use cases [34, 36, 16, 32, 13, 24]. The increased generality likely comes from a combination of improved measurement tools, wide adoption of certain systems, and better appreciation of what good system measurement enables.

The trend reversed in recent years, with stand-alone studies becoming common again [19, 8, 39, 33, 10, 30]. This is likely due to the tremendous scale of the systems of interest. Only a few organizations can afford systems of thousands or even millions of machines. Concurrently, the vast improvement in measurement tools creates an over-abundance of data, presenting new challenges in deriving useful insights from the deluge of trace data.

These technology trends create a pressing need to generalize beyond the initial point studies. As MapReduce use diversifies to many industries, system designers need to optimize for common behavior [11], in addition to improving the particulars of individual use cases.

Some studies amplified their breadth by working with ISPs [36, 24] or enterprise storage vendors [13], i.e., intermediaries who interact with a large number of end customers. The emergence of enterprise MapReduce vendors presents us with similar opportunities to generalize beyond single-point MapReduce studies.

2.2 Workload Traces Overview

We analyze seven workloads from various Hadoop deployments. All seven come from clusters that support business-critical processes. Five are workloads from Cloudera's enterprise customers in e-commerce, telecommunications, media, and retail. Two others are Facebook workloads on the same cluster across two different years.
Study | Traced system | Trace date | Companies | Industries | Findings & contributions
Ousterhout et al. [35] | BSD file system | 1985 | 1 | 1 | Various findings re bandwidth & access patterns
Leland et al. [29] | Ethernet packets | 1989-92 | 1 | 1 | Ethernet traffic exhibits self-similarity
Mogul [34] | HTTP requests | 1994 | 2 | 2 | Persistent connections are vital for HTTP
Paxson [36] | Internet traffic | 1994-95 | 35 | >3 | Various findings re packet loss, re-ordering, & delay
Breslau et al. [16] | Web caches | 1996-98 | 6 | 2 | Web caches see various Zipf access patterns
Mesnier et al. [32] | NFS | 2001-04 | 2 | 1 | Predict file properties based on name and attributes
Bairavasundaram et al. [13] | Enterprise storage | 2004-08 | 100s | Many | Quantifies silent data corruption in reliable storage
Feamster et al. [24] | BGP configurations | 2005 | 17 | N/A | Discovers BGP misconfigurations by static analysis
Chen et al. [19] | Enterprise storage | 2007 | 1 | 1 | Various findings re placement, consolidation, caching
Alizadeh et al. [8] | Data center TCP traffic | 2009 | 1 | 1 | Use explicit congestion notification to improve TCP
Thereska et al. [39] | Storage for web apps | 2009 | 1 | 1 | Save energy by powering down data replicas
Mishra et al. [33] | Web apps backend | 2009 | 1 | 1 | Task classification for capacity planning & scheduling
Ananthanarayanan et al. [10] | Web search (Dryad) | 2009 | 1 | 1 | Alleviate hotspots by pre-emptive data replication
Meisner et al. [30] | Web search | 2010 | 1 | 1 | Interactive latency complicates hardware power management
Ananthanarayanan et al. [11] | Hadoop and Dryad | 2009-11 | 3 | 1 | Cache data based on heavy-tail access patterns

Table 2: Summary of prior work. The list includes network system studies [29, 34, 36, 16, 24, 8], storage system studies [35, 16, 32, 13, 19, 10], and studies on data centers running large-scale computation paradigms including MapReduce [8, 39, 33, 10, 30, 11]. Note that studies in the 1990s and early 2000s have good breadth, while studies in the late 2000s returned to being stand-alone.
Trace | Machines | Length | Date | Jobs | Bytes
CC-a | <100 | 1 month | 2011 | 5759 | 80 TB
CC-b | 300 | 9 days | 2011 | 22974 | 600 TB
CC-c | 700 | 1 month | 2011 | 21030 | 18 PB
CC-d | 400-500 | 2+ months | 2011 | 13283 | 8 PB
CC-e | 100 | 9 days | 2011 | 10790 | 590 TB
FB-2009 | 600 | 6 months | 2009 | 1129193 | 9.4 PB
FB-2010 | 3000 | 1.5 months | 2010 | 1169184 | 1.5 EB
Total | >5000 | ≈1 year | - | 2372213 | 1.6 EB

Table 3: Summary of traces. CC is short for "Cloudera Customer". FB is short for "Facebook". Bytes touched is computed as the sum of input, shuffle, and output data sizes for all jobs.

Figure 1: Aggregate fraction of all bytes that come from input, shuffle, or output of each workload.
These workloads offer a rare opportunity to survey Hadoop use cases across several technology and traditional industries (Cloudera customers), and to track the growth of a leading Hadoop deployment (Facebook).

Table 3 gives some detail about these workloads. The trace lengths are limited by the logistical challenges of shipping trace data for offsite analysis. The Cloudera customer workloads have raw logs approaching 100 GB, requiring us to set up specialized file transfer tools. Transferring raw logs is infeasible for the Facebook workloads, requiring us to query Facebook's internal monitoring tools. Combined, the workloads contain over a year's worth of trace data, covering a non-trivial amount of jobs and bytes processed by the clusters.

The data is logged by standard tools in Hadoop; no additional tracing tools were necessary. The workload traces contain per-job statistics for job ID (numerical key), job name (string), input/shuffle/output data sizes (bytes), duration, submit time, map/reduce task time (slot-seconds), map/reduce task counts, and input/output file paths. We call each of these characteristics a numerical dimension of a job. Some traces have some data dimensions unavailable.
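The per-job statistics above map naturally onto a flat record type. The following is a minimal sketch of such a record in Python; the field names are illustrative stand-ins for the dimensions listed above, not the literal column names used in the Hadoop job history logs or in Facebook's log database.

from dataclasses import dataclass
from typing import Optional

@dataclass
class JobRecord:
    # One per-job trace entry; field names are illustrative, not the
    # literal schema of the Hadoop history logs or Facebook's database.
    job_id: int                    # numerical key
    job_name: str                  # string name assigned by the submitting framework
    submit_time: float             # job submit time, seconds since epoch
    duration: float                # job duration in seconds
    input_bytes: int               # input data size
    shuffle_bytes: int             # shuffle data size
    output_bytes: int              # output data size
    map_task_time: float           # total map task time in slot-seconds
    reduce_task_time: float        # total reduce task time in slot-seconds
    map_tasks: int                 # number of map tasks
    reduce_tasks: int              # number of reduce tasks
    input_path: Optional[str] = None    # hashed path; unavailable in some traces
    output_path: Optional[str] = None   # hashed path; unavailable in some traces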
We obtained the Cloudera traces by doing a time-range selection of per-job Hadoop history logs based on the file timestamp. The Facebook traces come from a similar query on Facebook's internal log database. The traces reflect no logging interruptions, except for the cluster in CC-d, which was taken offline several times due to operational reasons. There are some inaccuracies at trace start and termination, due to incomplete jobs at the trace boundaries. The length of our traces far exceeds the typical job length on these systems, leading to negligible errors. To capture weekly behavior for CC-b and CC-e, we intentionally queried for 9 days of data to allow for inaccuracies at trace boundaries.

3 Data Access Patterns

Data movement is a key function of these clusters, so understanding data access patterns is crucial. This section looks at data patterns in aggregate (§3.1), by jobs (§3.2), and per-file (§3.3).

3.1 Aggregate input/shuffle/output sizes

Figure 1 shows the aggregate input, shuffle, and output bytes. These statistics reflect I/O bytes seen from the MapReduce API. Different MapReduce environments lead to two interpretations with regard to actual bytes moved in hardware.
If we assume that task placement is random, and locality is negligible for all three input, shuffle, and output stages, then all three MapReduce data movement stages involve network traffic. Further, because task placement is random, the aggregate traffic looks like N-to-N shuffle traffic for all three stages. Under these assumptions, recent research correctly optimizes for N-to-N traffic patterns for datacenter networks [7, 8, 25, 20].

However, the default behavior in Hadoop is to attempt to place map tasks for increased locality of input data. Hadoop also tries to combine or compress map outputs and optimize placement for reduce tasks to increase rack locality for shuffle. By default, for every API output block, HDFS stores one copy locally, another within-rack, and a third cross-rack. Under these assumptions, data movement would be dominated by input reads, with read locality optimizations being worthwhile [41]. Further, while inter-rack traffic from shuffle can be decreased by reduce task placement, HDFS output by design produces cross-rack replication traffic, which has yet to be optimized.

Facebook uses HDFS RAID, which employs Reed-Solomon erasure codes to tolerate 4 missing blocks at 1.4× storage cost [15, 37]. Parity blocks are placed in a non-random fashion. Combined with efforts to improve locality, this design creates another environment in which we need to reassess the optimization priority between MapReduce API input, shuffle, and output.

3.2 Per-job data sizes

Figure 2 shows the distribution of per-job input, shuffle, and output data sizes for each workload. The median per-job input, shuffle, and output sizes respectively differ by 6, 8, and 4 orders of magnitude. Most jobs have input, shuffle, and output sizes in the MB to GB range. Thus, benchmarks of TB and above [6, 4] capture only a narrow set of input, shuffle, and output patterns.

Figure 2: Data size for each workload. Showing input, shuffle, and output size per job.

From 2009 to 2010, the Facebook workloads' per-job input and shuffle size distributions shift right (become larger) by several orders of magnitude, while the per-job output size distribution shifts left (becomes smaller). Raw and intermediate data sets have grown while the final computation results have become smaller. One possible explanation is that Facebook's customer base (raw data) has grown, while the final metrics (output) that drive business decisions have remained the same.

3.3 Access frequency and intervals

This section analyzes HDFS file access frequency and intervals based on hashed file path names. The FB-2009 and CC-a traces do not contain path names, and the FB-2010 trace contains path names for input only.

Figure 3: Log-log file access frequency vs. rank, for input and output files. Showing a Zipf distribution of the same shape (slope) for all workloads.

Figure 3 shows the distribution of HDFS file access frequency, sorted by rank according to descending frequency. Note that the distributions are graphed on log-log axes, and form approximate straight lines. This indicates that the file accesses follow a Zipf distribution.

The generalized Zipf distribution has the form in Equation 1, where f(r; α, N) is the access frequency of the rth-ranked file, N is the total size of the distribution, i.e., the number of unique files in the workload, and α is the shape parameter of the distribution. Also, H_{N,α} is the Nth generalized harmonic number. Figure 3 graphs log(f(r; α, N)) against log(r). Thus, a linear log-log graph indicates a distribution of the form in Equation 1, with N/H_{N,α} being a vertical shift given by the size of the distribution, and α reflecting the slope of the line.

f(r; α, N) = N / (r^α · H_{N,α}),   where H_{N,α} = ∑_{k=1}^{N} 1/k^α    (1)

Figure 3 indicates that few files account for a very high number of accesses. Thus, any data caching policy that includes those files will bring considerable benefit.

Further, the slope, i.e., the α parameter of the distributions, is approximately 5/6 across workloads and for both inputs and outputs. Thus, file access patterns are Zipf distributions of the same shape. Figure 3 suggests the existence of common computation needs that lead to the same file access behavior across different industries.
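To check this behavior on a new trace, it is enough to fit a line to the log-log rank-frequency data. Below is a minimal sketch of such a fit in Python using numpy; it assumes a list of per-file access counts and only illustrates the straight-line reading of Figure 3, not the exact fitting procedure used here.

import numpy as np

def zipf_slope(access_counts):
    # Estimate the Zipf shape parameter alpha from per-file access counts.
    # Fits log(frequency) = c - alpha * log(rank) by least squares, matching
    # the straight-line behavior of Figure 3 on log-log axes.
    # Assumes every file has at least one access (counts > 0).
    freq = np.sort(np.asarray(access_counts, dtype=float))[::-1]  # descending
    rank = np.arange(1, len(freq) + 1)
    slope, _intercept = np.polyfit(np.log(rank), np.log(freq), 1)
    return -slope  # alpha; roughly 5/6 for the workloads studied here

As a synthetic sanity check, counts generated as 1000 / rank**(5/6) recover an alpha near 5/6.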
The above observations indicate only that caching helps. If there is no correlation between file sizes and access frequencies, maintaining cache hit rates would require caching a fixed fraction of bytes stored. This design is not sustainable, since caches intentionally trade capacity for performance, and cache capacity grows more slowly than full data capacity. Fortunately, further analysis suggests more viable caching policies.

Figure 4: Access patterns vs. input/output file size. Showing the cumulative fraction of jobs with input/output files of a certain size (top) and the cumulative fraction of all stored bytes from input/output files of a certain size (bottom).

Figure 4 shows data access patterns plotted against file sizes. The distributions of the fraction of jobs versus file size vary widely (top graphs), but converge in the upper right corner. In particular, 90% of jobs access files of less than a few GBs (note the log-scale axis). These files account for up to only 16% of bytes stored (bottom graphs). Thus, a viable cache policy would be to cache files whose size is less than a threshold. This policy would allow cache capacity growth rates to be detached from the growth rate in data.

Figure 5: Data re-access intervals. Showing the interval between when an input file is re-read (left), and when an output is re-used as the input for another job (right).

Further analysis also suggests cache eviction policies. Figure 5 shows the distribution of time intervals between data re-accesses. 75% of the re-accesses take place within 6 hours. Thus, a possible cache eviction policy would be to evict entire files that have not been accessed for longer than a workload-specific threshold duration.

Figure 6: Fraction of jobs that read a pre-existing input path. Note that output path information is missing from FB-2010.

Figure 6 further shows that up to 78% of jobs involve data re-accesses (CC-c, CC-d, CC-e), while for other workloads the fraction is lower. Thus, the same cache eviction policy potentially translates to different benefits for different workloads.

Future work should analyze file path semantics and hierarchy to see if the small data sets are stand-alone or samples of larger data sets. Frequent use of data samples would suggest opportunities to pre-generate data samples that preserve various kinds of statistics. This analysis requires proprietary file name information, and would be possible within each MapReduce user's organization.
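One way to make these observations concrete is a cache that admits only files below a size threshold and evicts whole files once they have been idle longer than a workload-specific limit (6 hours would cover 75% of the re-accesses above). The following Python sketch illustrates that policy under assumed thresholds; it is not a tuned or complete design.

import time

class ThresholdCache:
    # Illustrative policy from Section 3.3: admit only files below a size
    # threshold, evict whole files idle longer than a workload-specific limit.
    # The default thresholds below are assumptions, not tuned values.

    def __init__(self, max_file_bytes=1 << 30, idle_limit_s=6 * 3600):
        self.max_file_bytes = max_file_bytes  # admission threshold (assumed 1 GB)
        self.idle_limit_s = idle_limit_s      # eviction threshold (75% of re-accesses occur within 6 h)
        self.last_access = {}                 # file path -> last access time

    def access(self, path, size_bytes, now=None):
        # Record an access; return True if the file is served from or admitted to the cache.
        now = time.time() if now is None else now
        if path in self.last_access:
            self.last_access[path] = now      # hit: refresh the idle timer
            return True
        if size_bytes <= self.max_file_bytes:
            self.last_access[path] = now      # admit small files only
            return True
        return False                          # large files bypass the cache

    def evict_idle(self, now=None):
        # Evict entire files not accessed within the idle limit.
        now = time.time() if now is None else now
        stale = [p for p, t in self.last_access.items() if now - t > self.idle_limit_s]
        for p in stale:
            del self.last_access[p]
        return stale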
4 Workload Variation Over Time

The intensity of a MapReduce workload depends on the job submission rate, as well as on the computation and data access patterns of the jobs that are submitted. System occupancy depends on the combination of these multiple time-varying dimensions. This section looks at workload variation over a week (§4.1), quantifies burstiness, a common feature of all workloads (§4.2), and computes temporal correlations between different workload dimensions (§4.3).

4.1 Weekly time series

Figure 7 depicts the time series of four dimensions of workload behavior over a week. The first three columns respectively represent the cumulative job counts, amount of I/O (again counted from the MapReduce API), and computation time of the jobs submitted in that hour. The last column shows cluster utilization, which reflects how the cluster services the submitted workload described by the preceding columns, and depends on the cluster hardware and execution environment.

Figure 7: Workload behavior over a week. From left to right: (1) jobs submitted per hour; (2) aggregate I/O (i.e., input + shuffle + output) size of jobs submitted; (3) aggregate map and reduce task time in task-hours of jobs submitted; (4) cluster utilization in average active slots. From top row to bottom: CC-a, CC-b, CC-c, CC-d, CC-e, FB-2009, and FB-2010. Note that for CC-c, CC-d, and FB-2009, utilization data is not available from the traces. Also note that some time axes are misaligned due to short, week-long trace lengths (CC-b and CC-e), or gaps from missing data in the trace (CC-d).

The first feature to observe in the graphs of Figure 7 is that noise is high. This means that even though the number of jobs submitted is known, it is challenging to predict how many I/O and computation resources will be needed as a result. Also, standard signal processing methods to quantify the signal-to-noise ratio would be challenging to apply to these time series, since neither the signal nor the noise models are known.

Some workloads exhibit daily diurnal patterns, revealed by Fourier analysis and, in some cases, visually identifiable (e.g., job submission for FB-2010, utilization for CC-e). In Section 6, we combine this observation with several others to speculate that there is an emerging class of interactive and semi-streaming workloads.
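As a concrete illustration of the Fourier check, the sketch below looks for a spectral peak at a 24-hour period in an hourly job-submission series using numpy. The peak-detection threshold is an assumption for illustration, not the exact procedure used in this paper.

import numpy as np

def has_diurnal_peak(hourly_counts, ratio=3.0):
    # Return True if the 24-hour component stands out in the spectrum of an
    # hourly time series (e.g., jobs submitted per hour over several days).
    # 'ratio' is an assumed, illustrative prominence threshold.
    x = np.asarray(hourly_counts, dtype=float)
    x = x - x.mean()                        # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # cycles per hour
    target = np.argmin(np.abs(freqs - 1.0 / 24.0))  # bin closest to a 24 h period
    return spectrum[target] > ratio * np.median(spectrum[1:])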
4.2 Burstiness

Another feature of Figure 7 is the bursty submission patterns in all dimensions. Burstiness is an often discussed property of time-varying signals, but it is not precisely measured. A common metric is the peak-to-average ratio. There are also domain-specific metrics, such as for bursty packet loss on wireless links [38]. Here, we extend the concept of peak-to-average ratio to quantify a key workload property: burstiness.

We start defining burstiness by first using the median rather than the arithmetic mean as the measure of "average". The median is statistically robust against data outliers, i.e., extreme but rare bursts [26]. For two given workloads with the same median load, the one with higher peaks, that is, a higher peak-to-median ratio, is more bursty. We then observe that the peak-to-median ratio is the same as the 100th-percentile-to-median ratio. While the median is statistically robust to outliers, the 100th percentile is not. This implies that the 99th, 95th, or 90th percentile should also be calculated. We extend this line of thought and compute the general nth-percentile-to-median ratio for a workload. We can graph this vector of values, with the ratio (nth percentile)/median on the x-axis, versus n on the y-axis. The resultant graph can be interpreted as a cumulative distribution of arrival rates per time unit, normalized by the median arrival rate. This graph is an indication of how bursty the time series is. A more horizontal line corresponds to a more bursty workload; a vertical line represents a workload with a constant arrival rate.
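This metric is straightforward to compute from a binned time series. The Python sketch below returns the nth-percentile-to-median ratios for an hourly series such as task-seconds submitted per hour; plotting the ratios on the x-axis against n on the y-axis reproduces the style of Figure 8.

import numpy as np

def burstiness_curve(hourly_values, percentiles=range(1, 101)):
    # n-th-percentile-to-median ratios for a time series, e.g., task-seconds
    # (map + reduce time) submitted per hour. A flatter curve (large ratios
    # at high n) indicates a burstier workload; assumes a nonzero median.
    x = np.asarray(hourly_values, dtype=float)
    med = np.median(x)
    return [(n, float(np.percentile(x, n)) / med) for n in percentiles]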
Figure 8: Workload burstiness. Showing the cumulative distribution of task-time (sum of map time and reduce time) per hour. To allow comparison between workloads, all values have been normalized by the median task-time per hour for each workload. Two artificial sinusoidal submit patterns are also shown, with min-max ranges the same size as the mean (sine + 2) and 10% of the mean (sine + 20).

Figure 8 graphs this metric for one of the dimensions of our workloads. We also graph two different sinusoidal signals to illustrate how common signals appear under this burstiness metric. Figure 8 shows that for all workloads, the highest and lowest submission rates are orders of magnitude from the median rate. This indicates a level of burstiness far above the workloads examined by prior work, which have more regular diurnal patterns [39, 30]. For the workloads here, scheduling and task placement policies will be essential under high load. Conversely, mechanisms for conserving energy would be beneficial during periods of low utilization.

For the Facebook workloads, over a year, the peak-to-median ratio dropped from 31:1 to 9:1, accompanied by more internal organizations adopting MapReduce. This shows that multiplexing many workloads (workloads from many organizations) helps decrease burstiness. However, the workload remains bursty.

4.3 Time series correlations

Figure 9: Correlation between different submission pattern time series. Showing pair-wise correlation between jobs per hour, (input + shuffle + output) bytes per hour, and (map + reduce) task times per hour.

We also computed the correlation between the workload submission time series in all three dimensions, shown in Figure 9. The average temporal correlation between job submit and data size is 0.21; for job submit and compute time it is 0.14; for data size and compute time it is 0.62. The correlation between data size and compute time is by far the strongest. We can visually verify this in the 2nd and 3rd columns for CC-e in Figure 7. This indicates that MapReduce workloads remain data-centric rather than compute-centric. Also, schedulers and load balancers need to consider dimensions beyond the number of active jobs.
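The reported values are plain pairwise correlations between the hourly time series. A minimal way to reproduce them, assuming three equal-length per-hour series, is sketched below.

import numpy as np

def pairwise_correlations(jobs_per_hour, bytes_per_hour, task_seconds_per_hour):
    # Pearson correlation between the three hourly workload dimensions
    # summarized in Figure 9: job counts, I/O bytes, and task-time per hour.
    series = {
        "jobs": np.asarray(jobs_per_hour, dtype=float),
        "bytes": np.asarray(bytes_per_hour, dtype=float),
        "task-seconds": np.asarray(task_seconds_per_hour, dtype=float),
    }
    names = list(series)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            out[(a, b)] = float(np.corrcoef(series[a], series[b])[0, 1])
    return out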
5 Computation Patterns

MapReduce is designed as a combined storage and computation system. This section examines computation patterns by task times (§5.1), task granularity (§5.2), and common job types by both job names (§5.3.1) and a multi-dimensional analysis (§5.3.2).

5.1 Task times

Figure 10: Aggregate fraction of all map and reduce task times for each workload.

Figure 10 shows the aggregate map and reduce task durations for each workload. On average, workloads spend 68% of time in map and 32% of time in reduce. For Facebook, the fraction of time spent mapping increases by 10% over a year. Thus, a design priority should be to optimize map task components, such as read locality, read bandwidth, and map output combiners.

Figure 11: Task-seconds per byte for each workload.

Figure 11 shows, for each workload, the ratio of aggregate task durations (map time + reduce time) over the aggregate bytes (input + shuffle + output). This ratio aims to capture the amount of computation per data in the absence of CPU/disk/network-level logs. The unit of measurement is task-seconds per byte. Task-seconds measure the computation time of multiple tasks, e.g., two tasks of 10 seconds each equal 20 task-seconds. This is a good unit of measurement if parallelization overhead is small; it approximates the amount of computational resources required by a job in a way that is agnostic to the degree of parallelism (e.g., X task-seconds divided into T1 tasks equals X task-seconds divided into T2 tasks).
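Given the per-job dimensions described in Section 2.2, the metric reduces to one division per workload. The sketch below uses the illustrative record fields introduced there; it restates the definition and is not the exact analysis code.

def task_seconds_per_byte(jobs):
    # Aggregate task-seconds per byte for one workload.
    # 'jobs' is an iterable of per-job records with map/reduce task times
    # (slot-seconds) and input/shuffle/output sizes (bytes), using the
    # illustrative field names from the record sketch in Section 2.2.
    total_task_seconds = sum(j.map_task_time + j.reduce_task_time for j in jobs)
    total_bytes = sum(j.input_bytes + j.shuffle_bytes + j.output_bytes for j in jobs)
    return total_task_seconds / total_bytes  # roughly 1e-7 to 7e-7 for most workloads here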
Figure 11 shows that the ratio of computation per data ranges from 1×10^-7 to 7×10^-7 task-seconds per byte, with the FB-2010 workload having 9×10^-4 task-seconds per byte. Task-seconds per byte clearly separates the workloads. A balanced system should be provisioned specifically to service the task-seconds per byte of a particular workload. Existing hardware benchmarks