Design Insights for MapReduce from Diverse
Production Workloads
Yanpei Chen
Sara Alspaugh
Randy H. Katz
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2012-17
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-17.html
January 25, 2012
Copyright © 2012, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Acknowledgement
The authors are grateful for the feedback from our colleagues at UC
Berkeley AMP Lab, Cloudera, and Facebook. We especially appreciate the
inputs from Anthony Joseph, David Zats, Matei Zaharia, Jolly Chen, Todd
Lipcon, Aaron T. Myers, John Wilkes, and Srikanth Kandula. This research
is supported in part by AMP Lab
(https://amplab.cs.berkeley.edu/sponsors/), and the DARPA- and SRC-
funded MuSyC FCRP Multiscale Systems Center.
Design Insights for MapReduce from Diverse Production Workloads

Yanpei Chen, Sara Alspaugh, Randy Katz
University of California, Berkeley

Abstract

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.

1 Introduction

The MapReduce computing paradigm is gaining widespread adoption across diverse industries as a platform for large-scale data analysis, accelerated by open-source implementations such as Apache Hadoop. The scale of such systems and the depth of their software stacks create complex challenges for the design and operation of MapReduce clusters. A number of recent studies looked at MapReduce-style systems within a handful of large technology companies [19, 8, 39, 33, 30, 10]. The stand-alone datasets used in these studies mean that the associated design insights may not generalize across use cases, an important concern given the emergence of MapReduce users in different industries [5]. Consequently, there is a need to develop systematic knowledge of MapReduce behavior both at established users within technology companies and at recent adopters in other industries.

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera's customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data, covering over two million jobs that moved approximately 1.6 exabytes spread over 5000 machines (Table 3). They are collected using standard tools in Hadoop, and facilitate comparison both across workloads and over time.

Our analysis reveals a range of workload behavior. Key similarities between workloads include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. Key differences include the frequency of data re-access, the burstiness of the workload, the balance between computation and data bandwidth, the analytical frameworks used on top of MapReduce, and the multi-dimensional descriptions of common job categories. These results provide empirical evidence to inform the design and evaluation of schedulers, caches, optimizers, and networks, in addition to revealing insights for cluster provisioning, benchmarking, and workload monitoring.

Table 1 summarizes our findings and serves as a roadmap for the rest of the paper. Our methodology extends [19, 18, 17], and organizes the analysis according to three conceptual aspects of a MapReduce workload: data, temporal, and compute patterns. Section 3 looks at data patterns. This includes aggregate bytes in the MapReduce input, shuffle, and output stages, the distribution of per-job data sizes, and the per-file access frequencies and intervals. Section 4 focuses on temporal patterns. It analyzes variation over time across multiple workload dimensions, quantifies burstiness, and extracts temporal correlations between different workload dimensions. Section 5 examines compute patterns. It analyzes the balance between aggregate compute and data size, the distribution of task sizes, and the breakdown of common job categories both by job names and by multi-dimensional job behavior. Our contributions are:
• Analysis of seven MapReduce production workloads from five industries totaling over two million jobs,
• Derivation of design and operational insights, and
• Methodology of analysis and the deployment of a public workload repository with workload replay tools.
We invite other MapReduce researchers and users to add to our workload repository and derive additional insights using the data in this paper. We hope that such public data will allow researchers and cluster operators to better understand and optimize MapReduce systems.
Section | Observations | Implications/Interpretations
3.1 | Input data forms 51-77% of all bytes, shuffle 8-35%, output 8-23%. | Need to re-assess the sole focus on shuffle-like traffic for datacenter networks.
3.2 | The majority of job input sizes are MB-GB. | TB-scale benchmarks such as TeraSort are not representative.
3.2 | For the Facebook workload, over a year, input and shuffle sizes increase by over 1000×, but output size decreases by roughly 10×. | Growing customers and raw data size over time, distilled into possibly the same set of metrics.
3.3 | For all workloads, data accesses follow the same Zipf distribution. | Tiered storage is beneficial. Uncertain cause for the common patterns.
3.3 | 90% of jobs access small files that make up 1-16% of stored bytes. | Cache data if file size is below a threshold.
3.3 | Up to 78% of jobs read input data that is recently read or written by other jobs. 75% of re-accesses occur within 6 hrs. | LRU or threshold-based cache eviction viable. Tradeoffs between eviction policies need further investigation.
4.1 | There is a high amount of noise in job submit patterns in all workloads. | Online prediction of data and computation needs will be challenging.
4.1 | Daily diurnal pattern evident in some workloads. | Human-driven interactive analysis, and/or daily automated computation.
4.2 | Workloads are bursty, with peak-to-median hourly job submit rates of 9:1 or greater. | Scheduling and placement optimizations essential under high load. Increase utilization or conserve energy during inherent workload troughs.
4.2 | For Facebook, over a year, peak-to-median job submit rates decrease from 31:1 to 9:1, while more internal organizations use MapReduce. | Multiplexing many workloads helps smooth out burstiness. However, utilization remains low.
4.2 | Workload intensity is multi-dimensional (jobs, I/O, task-times, etc.). | Need multi-dimensional metrics for burstiness and other time series properties.
4.3 | Avg. temporal correlation between job submit & data size is 0.21; for job submit & compute time it is 0.14; for data size & compute time it is 0.62. | Schedulers need to consider metrics beyond the number of active jobs. MapReduce workloads remain data- rather than compute-centric.
5.1 | Workloads on avg. spend 68% of time in map, 32% of time in reduce. | Optimizing read latency/locality should be a priority.
5.1 | Most workloads have task-seconds per byte of 1×10^-7 to 7×10^-7; one workload has task-seconds per byte of 9×10^-4. | Build balanced systems for each workload according to task-seconds per byte. Develop benchmarks to probe a given value of task-seconds per byte.
5.2 | For all workloads, task durations range from seconds to hours. | Need mechanisms to choose uniform task sizes across jobs.
5.3.1 | <10 job name types make up >60% of all bytes or all compute time. | Per-job-type prediction and manual tuning are possible.
5.3.1 | All workloads are dominated by 2-3 frameworks. | Meta-schedulers need to multiplex only a few frameworks.
5.3.2 | Jobs touching <10 GB of total data make up >92% of all jobs, and often have <10 tasks or even a single task. | Schedulers should design for small jobs. Re-assess the priority placed on addressing task stragglers.
5.3.2 | Five out of seven workloads contain map-only jobs. | Not all jobs benefit from network optimizations for shuffle.
5.3.2 | Common job types change significantly over a year. | Periodic re-tuning and re-optimization is necessary.

Table 1: Summary of observations from the workloads, and associated design implications and interpretations.
2 Background

This section reviews key previous work on workload characterization in general and highlights our advances. In addition, we describe the traces we used.

2.1 Prior Work

The desire for thorough system measurement predates the rise of MapReduce. Workload characterization studies have been invaluable in helping designers identify problems, analyze causes, and evaluate solutions. Table 2 summarizes some studies founded on understanding realistic system behavior. They fall into the categories of network systems [29, 34, 36, 16, 24, 8], storage systems [35, 16, 32, 13, 19, 10], and large-scale data centers running computation paradigms including MapReduce [8, 39, 33, 10, 30, 11]. Network and storage subsystems are key components for MapReduce. This paper builds on these studies.

A striking trend emerges when we order the studies by trace date. In the late 1980s and early 1990s, measurement studies often captured system behavior for only one setting [35, 29]. The stand-alone nature of these studies is due to the emerging measurement tools of the time, which created considerable logistical and analytical challenges. Studies in the 1990s and early 2000s often traced the same system under multiple use cases [34, 36, 16, 32, 13, 24]. The increased generality likely comes from a combination of improved measurement tools, wide adoption of certain systems, and better appreciation of what good system measurement enables.

The trend reversed in recent years, with stand-alone studies becoming common again [19, 8, 39, 33, 10, 30]. This is likely due to the tremendous scale of the systems of interest. Only a few organizations can afford systems of thousands or even millions of machines. Concurrently, the vast improvement in measurement tools creates an over-abundance of data, presenting new challenges in deriving useful insights from the deluge of trace data.

These technology trends create a pressing need to generalize beyond the initial point studies. As MapReduce use diversifies to many industries, system designers need to optimize for common behavior [11], in addition to improving the particulars of individual use cases.

Some studies amplified their breadth by working with ISPs [36, 24] or enterprise storage vendors [13], i.e., intermediaries who interact with a large number of end customers. The emergence of enterprise MapReduce vendors presents us with similar opportunities to generalize beyond single-point MapReduce studies.

2.2 Workload Traces Overview

We analyze seven workloads from various Hadoop deployments. All seven come from clusters that support business-critical processes. Five are workloads from Cloudera's enterprise customers in e-commerce, telecommunications, media, and retail. Two others are Facebook workloads on the same cluster across two different years.
Study | Traced system | Trace date | Companies | Industries | Findings & contributions
Ousterhout et al. [35] | BSD file system | 1985 | 1 | 1 | Various findings re bandwidth & access patterns
Leland et al. [29] | Ethernet packets | 1989-92 | 1 | 1 | Ethernet traffic exhibits self-similarity
Mogul [34] | HTTP requests | 1994 | 2 | 2 | Persistent connections are vital for HTTP
Paxson [36] | Internet traffic | 1994-95 | 35 | >3 | Various findings re packet loss, re-ordering, & delay
Breslau et al. [16] | Web caches | 1996-98 | 6 | 2 | Web caches see various Zipf access patterns
Mesnier et al. [32] | NFS | 2001-04 | 2 | 1 | Predict file properties based on name and attributes
Bairavasundaram et al. [13] | Enterprise storage | 2004-08 | 100s | Many | Quantifies silent data corruption in reliable storage
Feamster et al. [24] | BGP configurations | 2005 | 17 | N/A | Discovers BGP misconfigurations by static analysis
Chen et al. [19] | Enterprise storage | 2007 | 1 | 1 | Various findings re placement, consolidation, caching
Alizadeh et al. [8] | Data center TCP traffic | 2009 | 1 | 1 | Use explicit congestion notification to improve TCP
Thereska et al. [39] | Storage for web apps | 2009 | 1 | 1 | Save energy by powering down data replicas
Mishra et al. [33] | Web apps backend | 2009 | 1 | 1 | Task classification for capacity planning & scheduling
Ananthanarayanan et al. [10] | Web search (Dryad) | 2009 | 1 | 1 | Alleviate hotspots by pre-emptive data replication
Meisner et al. [30] | Web search | 2010 | 1 | 1 | Interactive latency complicates hardware power management
Ananthanarayanan et al. [11] | Hadoop and Dryad | 2009-11 | 3 | 1 | Cache data based on heavy-tail access patterns

Table 2: Summary of prior work. The list includes network system studies [29, 34, 36, 16, 24, 8], storage system studies [35, 16, 32, 13, 19, 10], and studies on data centers running large-scale computation paradigms including MapReduce [8, 39, 33, 10, 30, 11]. Note that studies in the 1990s and early 2000s have good breadth, while studies in the late 2000s returned to being stand-alone.
Trace | Machines | Length | Date | Jobs | Bytes
CC-a | <100 | 1 month | 2011 | 5759 | 80 TB
CC-b | 300 | 9 days | 2011 | 22974 | 600 TB
CC-c | 700 | 1 month | 2011 | 21030 | 18 PB
CC-d | 400-500 | 2+ months | 2011 | 13283 | 8 PB
CC-e | 100 | 9 days | 2011 | 10790 | 590 TB
FB-2009 | 600 | 6 months | 2009 | 1129193 | 9.4 PB
FB-2010 | 3000 | 1.5 months | 2010 | 1169184 | 1.5 EB
Total | >5000 | ≈1 year | - | 2372213 | 1.6 EB

Table 3: Summary of traces. CC is short for "Cloudera Customer". FB is short for "Facebook". Bytes touched is computed as the sum of input, shuffle, and output data sizes for all jobs.

Figure 1: Aggregate fraction of all bytes that come from input, shuffle, or output of each workload.
These workloads offer a rare opportunity to survey Hadoop use cases across several technology and traditional industries (Cloudera customers), and to track the growth of a leading Hadoop deployment (Facebook).

Table 3 gives some detail about these workloads. The trace lengths are limited by the logistical challenges of shipping trace data for offsite analysis. The Cloudera customer workloads have raw logs approaching 100 GB, requiring us to set up specialized file transfer tools. Transferring raw logs is infeasible for the Facebook workloads, requiring us to query Facebook's internal monitoring tools. Combined, the workloads contain over a year's worth of trace data, covering a non-trivial amount of jobs and bytes processed by the clusters.

The data is logged by standard tools in Hadoop; no additional tracing tools were necessary. The workload traces contain per-job statistics for job ID (numerical key), job name (string), input/shuffle/output data sizes (bytes), duration, submit time, map/reduce task time (slot-seconds), map/reduce task counts, and input/output file paths. We call each of these characteristics a numerical dimension of a job. Some traces have some data dimensions unavailable.
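The per-job statistics above map naturally onto a flat record type. The following is a minimal sketch of such a record in Python; the field names are illustrative stand-ins for the dimensions listed above, not the literal column names used in the Hadoop job history logs or in Facebook's log database.

from dataclasses import dataclass
from typing import Optional

@dataclass
class JobRecord:
    # One per-job trace entry; field names are illustrative, not the
    # literal schema of the Hadoop history logs or Facebook's database.
    job_id: int                    # numerical key
    job_name: str                  # string name assigned by the submitting framework
    submit_time: float             # job submit time, seconds since epoch
    duration: float                # job duration in seconds
    input_bytes: int               # input data size
    shuffle_bytes: int             # shuffle data size
    output_bytes: int              # output data size
    map_task_time: float           # total map task time in slot-seconds
    reduce_task_time: float        # total reduce task time in slot-seconds
    map_tasks: int                 # number of map tasks
    reduce_tasks: int              # number of reduce tasks
    input_path: Optional[str] = None    # hashed path; unavailable in some traces
    output_path: Optional[str] = None   # hashed path; unavailable in some traces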
We obtained the Cloudera traces by doing a time-range selection of per-job Hadoop history logs based on the file timestamp. The Facebook traces come from a similar query on Facebook's internal log database. The traces reflect no logging interruptions, except for the cluster in CC-d, which was taken offline several times due to operational reasons. There are some inaccuracies at trace start and termination, due to incomplete jobs at the trace boundaries. The length of our traces far exceeds the typical job length on these systems, leading to negligible errors. To capture weekly behavior for CC-b and CC-e, we intentionally queried for 9 days of data to allow for inaccuracies at trace boundaries.

3 Data Access Patterns

Data movement is a key function of these clusters, so understanding data access patterns is crucial. This section looks at data patterns in aggregate (§3.1), by jobs (§3.2), and per-file (§3.3).

3.1 Aggregate input/shuffle/output sizes

Figure 1 shows the aggregate input, shuffle, and output bytes. These statistics reflect I/O bytes seen from the MapReduce API. Different MapReduce environments lead to two interpretations with regard to actual bytes moved in hardware.
If we assume that task placement is random, and locality is negligible for all three input, shuffle, and output stages, then all three MapReduce data movement stages involve network traffic. Further, because task placement is random, the aggregate traffic looks like N-to-N shuffle traffic for all three stages. Under these assumptions, recent research correctly optimizes for N-to-N traffic patterns for datacenter networks [7, 8, 25, 20].

However, the default behavior in Hadoop is to attempt to place map tasks for increased locality of input data. Hadoop also tries to combine or compress map outputs and optimize placement for reduce tasks to increase rack locality for shuffle. By default, for every API output block, HDFS stores one copy locally, another within-rack, and a third cross-rack. Under these assumptions, data movement would be dominated by input reads, with read locality optimizations being worthwhile [41]. Further, while inter-rack traffic from shuffle can be decreased by reduce task placement, HDFS output by design produces cross-rack replication traffic, which has yet to be optimized.

Facebook uses HDFS RAID, which employs Reed-Solomon erasure codes to tolerate 4 missing blocks at 1.4× storage cost [15, 37]. Parity blocks are placed in a non-random fashion. Combined with efforts to improve locality, this design creates another environment in which we need to reassess the optimization priority between MapReduce API input, shuffle, and output.

3.2 Per-job data sizes

Figure 2 shows the distribution of per-job input, shuffle, and output data sizes for each workload. The median per-job input, shuffle, and output sizes respectively differ by 6, 8, and 4 orders of magnitude. Most jobs have input, shuffle, and output sizes in the MB to GB range. Thus, benchmarks of TB and above [6, 4] capture only a narrow set of input, shuffle, and output patterns.

Figure 2: Data size for each workload. Showing input, shuffle, and output size per job.

From 2009 to 2010, the Facebook workloads' per-job input and shuffle size distributions shift right (become larger) by several orders of magnitude, while the per-job output size distribution shifts left (becomes smaller). Raw and intermediate data sets have grown while the final computation results have become smaller. One possible explanation is that Facebook's customer base (raw data) has grown, while the final metrics (output) that drive business decisions have remained the same.

3.3 Access frequency and intervals

This section analyzes HDFS file access frequency and intervals based on hashed file path names. The FB-2009 and CC-a traces do not contain path names, and the FB-2010 trace contains path names for input only.

Figure 3: Log-log file access frequency vs. rank, for input and output files. Showing a Zipf distribution of the same shape (slope) for all workloads.

Figure 3 shows the distribution of HDFS file access frequency, sorted by rank according to descending frequency. Note that the distributions are graphed on log-log axes, and form approximate straight lines. This indicates that the file accesses follow a Zipf distribution.

The generalized Zipf distribution has the form in Equation 1, where f(r; α, N) is the access frequency of the rth-ranked file, N is the total size of the distribution, i.e., the number of unique files in the workload, and α is the shape parameter of the distribution. Also, H_{N,α} is the Nth generalized harmonic number. Figure 3 graphs log(f(r; α, N)) against log(r). Thus, a linear log-log graph indicates a distribution of the form in Equation 1, with N/H_{N,α} being a vertical shift given by the size of the distribution, and α reflecting the slope of the line.

f(r; α, N) = N / (r^α · H_{N,α}),   where H_{N,α} = ∑_{k=1}^{N} 1/k^α    (1)

Figure 3 indicates that few files account for a very high number of accesses. Thus, any data caching policy that includes those files will bring considerable benefit.

Further, the slope, i.e., the α parameter of the distributions, is approximately 5/6 across workloads and for both inputs and outputs. Thus, file access patterns are Zipf distributions of the same shape. Figure 3 suggests the existence of common computation needs that lead to the same file access behavior across different industries.
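To check this behavior on a new trace, it is enough to fit a line to the log-log rank-frequency data. Below is a minimal sketch of such a fit in Python using numpy; it assumes a list of per-file access counts and only illustrates the straight-line reading of Figure 3, not the exact fitting procedure used here.

import numpy as np

def zipf_slope(access_counts):
    # Estimate the Zipf shape parameter alpha from per-file access counts.
    # Fits log(frequency) = c - alpha * log(rank) by least squares, matching
    # the straight-line behavior of Figure 3 on log-log axes.
    # Assumes every file has at least one access (counts > 0).
    freq = np.sort(np.asarray(access_counts, dtype=float))[::-1]  # descending
    rank = np.arange(1, len(freq) + 1)
    slope, _intercept = np.polyfit(np.log(rank), np.log(freq), 1)
    return -slope  # alpha; roughly 5/6 for the workloads studied here

As a synthetic sanity check, counts generated as 1000 / rank**(5/6) recover an alpha near 5/6.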
The above observations indicate only that caching helps. If there is no correlation between file sizes and access frequencies, maintaining cache hit rates would require caching a fixed fraction of bytes stored. This design is not sustainable, since caches intentionally trade capacity for performance, and cache capacity grows more slowly than full data capacity. Fortunately, further analysis suggests more viable caching policies.

Figure 4: Access patterns vs. input/output file size. Showing the cumulative fraction of jobs with input/output files of a certain size (top) and the cumulative fraction of all stored bytes from input/output files of a certain size (bottom).

Figure 4 shows data access patterns plotted against file sizes. The distributions of the fraction of jobs versus file size vary widely (top graphs), but converge in the upper right corner. In particular, 90% of jobs access files of less than a few GBs (note the log-scale axis). These files account for up to only 16% of bytes stored (bottom graphs). Thus, a viable cache policy would be to cache files whose size is less than a threshold. This policy would allow cache capacity growth rates to be detached from the growth rate in data.

Figure 5: Data re-access intervals. Showing the interval between when an input file is re-read (left), and when an output is re-used as the input for another job (right).

Further analysis also suggests cache eviction policies. Figure 5 shows the distribution of time intervals between data re-accesses. 75% of the re-accesses take place within 6 hours. Thus, a possible cache eviction policy would be to evict entire files that have not been accessed for longer than a workload-specific threshold duration.

Figure 6: Fraction of jobs that read a pre-existing input path. Note that output path information is missing from FB-2010.

Figure 6 further shows that up to 78% of jobs involve data re-accesses (CC-c, CC-d, CC-e), while for other workloads the fraction is lower. Thus, the same cache eviction policy potentially translates to different benefits for different workloads.

Future work should analyze file path semantics and hierarchy to see if the small data sets are stand-alone or samples of larger data sets. Frequent use of data samples would suggest opportunities to pre-generate data samples that preserve various kinds of statistics. This analysis requires proprietary file name information, and would be possible within each MapReduce user's organization.
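One way to make these observations concrete is a cache that admits only files below a size threshold and evicts whole files once they have been idle longer than a workload-specific limit (6 hours would cover 75% of the re-accesses above). The following Python sketch illustrates that policy under assumed thresholds; it is not a tuned or complete design.

import time

class ThresholdCache:
    # Illustrative policy from Section 3.3: admit only files below a size
    # threshold, evict whole files idle longer than a workload-specific limit.
    # The default thresholds below are assumptions, not tuned values.

    def __init__(self, max_file_bytes=1 << 30, idle_limit_s=6 * 3600):
        self.max_file_bytes = max_file_bytes  # admission threshold (assumed 1 GB)
        self.idle_limit_s = idle_limit_s      # eviction threshold (75% of re-accesses occur within 6 h)
        self.last_access = {}                 # file path -> last access time

    def access(self, path, size_bytes, now=None):
        # Record an access; return True if the file is served from or admitted to the cache.
        now = time.time() if now is None else now
        if path in self.last_access:
            self.last_access[path] = now      # hit: refresh the idle timer
            return True
        if size_bytes <= self.max_file_bytes:
            self.last_access[path] = now      # admit small files only
            return True
        return False                          # large files bypass the cache

    def evict_idle(self, now=None):
        # Evict entire files not accessed within the idle limit.
        now = time.time() if now is None else now
        stale = [p for p, t in self.last_access.items() if now - t > self.idle_limit_s]
        for p in stale:
            del self.last_access[p]
        return stale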
4 Workload Variation Over Time

The intensity of a MapReduce workload depends on the job submission rate, as well as on the computation and data access patterns of the jobs that are submitted. System occupancy depends on the combination of these multiple time-varying dimensions. This section looks at workload variation over a week (§4.1), quantifies burstiness, a common feature of all workloads (§4.2), and computes temporal correlations between different workload dimensions (§4.3).

4.1 Weekly time series

Figure 7 depicts the time series of four dimensions of workload behavior over a week. The first three columns respectively represent the cumulative job counts, amount of I/O (again counted from the MapReduce API), and computation time of the jobs submitted in that hour. The last column shows cluster utilization, which reflects how the cluster services the submitted workload described by the preceding columns, and depends on the cluster hardware and execution environment.

Figure 7: Workload behavior over a week. From left to right: (1) jobs submitted per hour; (2) aggregate I/O (i.e., input + shuffle + output) size of jobs submitted; (3) aggregate map and reduce task time in task-hours of jobs submitted; (4) cluster utilization in average active slots. From top row to bottom: CC-a, CC-b, CC-c, CC-d, CC-e, FB-2009, and FB-2010. Note that for CC-c, CC-d, and FB-2009, utilization data is not available from the traces. Also note that some time axes are misaligned due to short, week-long trace lengths (CC-b and CC-e), or gaps from missing data in the trace (CC-d).

The first feature to observe in the graphs of Figure 7 is that noise is high. This means that even though the number of jobs submitted is known, it is challenging to predict how many I/O and computation resources will be needed as a result. Also, standard signal processing methods to quantify the signal-to-noise ratio would be challenging to apply to these time series, since neither the signal nor the noise models are known.

Some workloads exhibit daily diurnal patterns, revealed by Fourier analysis and, in some cases, visually identifiable (e.g., job submission for FB-2010, utilization for CC-e). In Section 6, we combine this observation with several others to speculate that there is an emerging class of interactive and semi-streaming workloads.
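As a concrete illustration of the Fourier check, the sketch below looks for a spectral peak at a 24-hour period in an hourly job-submission series using numpy. The peak-detection threshold is an assumption for illustration, not the exact procedure used in this paper.

import numpy as np

def has_diurnal_peak(hourly_counts, ratio=3.0):
    # Return True if the 24-hour component stands out in the spectrum of an
    # hourly time series (e.g., jobs submitted per hour over several days).
    # 'ratio' is an assumed, illustrative prominence threshold.
    x = np.asarray(hourly_counts, dtype=float)
    x = x - x.mean()                        # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # cycles per hour
    target = np.argmin(np.abs(freqs - 1.0 / 24.0))  # bin closest to a 24 h period
    return spectrum[target] > ratio * np.median(spectrum[1:])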
4.2 Burstiness

Another feature of Figure 7 is the bursty submission patterns in all dimensions. Burstiness is an often discussed property of time-varying signals, but it is not precisely measured. A common metric is the peak-to-average ratio. There are also domain-specific metrics, such as for bursty packet loss on wireless links [38]. Here, we extend the concept of peak-to-average ratio to quantify a key workload property: burstiness.

We start defining burstiness by first using the median rather than the arithmetic mean as the measure of "average". The median is statistically robust against data outliers, i.e., extreme but rare bursts [26]. For two given workloads with the same median load, the one with higher peaks, that is, a higher peak-to-median ratio, is more bursty. We then observe that the peak-to-median ratio is the same as the 100th-percentile-to-median ratio. While the median is statistically robust to outliers, the 100th percentile is not. This implies that the 99th, 95th, or 90th percentile should also be calculated. We extend this line of thought and compute the general nth-percentile-to-median ratio for a workload. We can graph this vector of values, with the ratio (nth percentile)/median on the x-axis, versus n on the y-axis. The resultant graph can be interpreted as a cumulative distribution of arrival rates per time unit, normalized by the median arrival rate. This graph is an indication of how bursty the time series is. A more horizontal line corresponds to a more bursty workload; a vertical line represents a workload with a constant arrival rate.
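This metric is straightforward to compute from a binned time series. The Python sketch below returns the nth-percentile-to-median ratios for an hourly series such as task-seconds submitted per hour; plotting the ratios on the x-axis against n on the y-axis reproduces the style of Figure 8.

import numpy as np

def burstiness_curve(hourly_values, percentiles=range(1, 101)):
    # n-th-percentile-to-median ratios for a time series, e.g., task-seconds
    # (map + reduce time) submitted per hour. A flatter curve (large ratios
    # at high n) indicates a burstier workload; assumes a nonzero median.
    x = np.asarray(hourly_values, dtype=float)
    med = np.median(x)
    return [(n, float(np.percentile(x, n)) / med) for n in percentiles]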
Figure 8: Workload burstiness. Showing the cumulative distribution of task-time (sum of map time and reduce time) per hour. To allow comparison between workloads, all values have been normalized by the median task-time per hour for each workload. Two artificial sinusoidal submit patterns are also shown, with min-max ranges the same size as the mean (sine + 2) and 10% of the mean (sine + 20).

Figure 8 graphs this metric for one of the dimensions of our workloads. We also graph two different sinusoidal signals to illustrate how common signals appear under this burstiness metric. Figure 8 shows that for all workloads, the highest and lowest submission rates are orders of magnitude from the median rate. This indicates a level of burstiness far above the workloads examined by prior work, which have more regular diurnal patterns [39, 30]. For the workloads here, scheduling and task placement policies will be essential under high load. Conversely, mechanisms for conserving energy would be beneficial during periods of low utilization.

For the Facebook workloads, over a year, the peak-to-median ratio dropped from 31:1 to 9:1, accompanied by more internal organizations adopting MapReduce. This shows that multiplexing many workloads (workloads from many organizations) helps decrease burstiness. However, the workload remains bursty.

4.3 Time series correlations

Figure 9: Correlation between different submission pattern time series. Showing pair-wise correlation between jobs per hour, (input + shuffle + output) bytes per hour, and (map + reduce) task times per hour.

We also computed the correlation between the workload submission time series in all three dimensions, shown in Figure 9. The average temporal correlation between job submit and data size is 0.21; for job submit and compute time it is 0.14; for data size and compute time it is 0.62. The correlation between data size and compute time is by far the strongest. We can visually verify this in the 2nd and 3rd columns for CC-e in Figure 7. This indicates that MapReduce workloads remain data-centric rather than compute-centric. Also, schedulers and load balancers need to consider dimensions beyond the number of active jobs.
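The reported values are plain pairwise correlations between the hourly time series. A minimal way to reproduce them, assuming three equal-length per-hour series, is sketched below.

import numpy as np

def pairwise_correlations(jobs_per_hour, bytes_per_hour, task_seconds_per_hour):
    # Pearson correlation between the three hourly workload dimensions
    # summarized in Figure 9: job counts, I/O bytes, and task-time per hour.
    series = {
        "jobs": np.asarray(jobs_per_hour, dtype=float),
        "bytes": np.asarray(bytes_per_hour, dtype=float),
        "task-seconds": np.asarray(task_seconds_per_hour, dtype=float),
    }
    names = list(series)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            out[(a, b)] = float(np.corrcoef(series[a], series[b])[0, 1])
    return out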
5 Computation Patterns

MapReduce is designed as a combined storage and computation system. This section examines computation patterns by task times (§5.1), task granularity (§5.2), and common job types by both job names (§5.3.1) and a multi-dimensional analysis (§5.3.2).

5.1 Task times

Figure 10: Aggregate fraction of all map and reduce task times for each workload.

Figure 10 shows the aggregate map and reduce task durations for each workload. On average, workloads spend 68% of time in map and 32% of time in reduce. For Facebook, the fraction of time spent mapping increases by 10% over a year. Thus, a design priority should be to optimize map task components, such as read locality, read bandwidth, and map output combiners.

Figure 11: Task-seconds per byte for each workload.

Figure 11 shows, for each workload, the ratio of aggregate task durations (map time + reduce time) over the aggregate bytes (input + shuffle + output). This ratio aims to capture the amount of computation per data in the absence of CPU/disk/network-level logs. The unit of measurement is task-seconds per byte. Task-seconds measure the computation time of multiple tasks, e.g., two tasks of 10 seconds each equal 20 task-seconds. This is a good unit of measurement if parallelization overhead is small; it approximates the amount of computational resources required by a job in a way that is agnostic to the degree of parallelism (e.g., X task-seconds divided into T1 tasks equals X task-seconds divided into T2 tasks).
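Given the per-job dimensions described in Section 2.2, the metric reduces to one division per workload. The sketch below uses the illustrative record fields introduced there; it restates the definition and is not the exact analysis code.

def task_seconds_per_byte(jobs):
    # Aggregate task-seconds per byte for one workload.
    # 'jobs' is an iterable of per-job records with map/reduce task times
    # (slot-seconds) and input/shuffle/output sizes (bytes), using the
    # illustrative field names from the record sketch in Section 2.2.
    total_task_seconds = sum(j.map_task_time + j.reduce_task_time for j in jobs)
    total_bytes = sum(j.input_bytes + j.shuffle_bytes + j.output_bytes for j in jobs)
    return total_task_seconds / total_bytes  # roughly 1e-7 to 7e-7 for most workloads here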
Figure 11 shows that the ratio of computation per data ranges from 1×10^-7 to 7×10^-7 task-seconds per byte, with the FB-2010 workload having 9×10^-4 task-seconds per byte. Task-seconds per byte clearly separates the workloads. A balanced system should be provisioned specifically to service the task-seconds per byte of a particular workload. Existing hardware benchmarks