Table Of ContentBandits meet Computer Architecture:
Designing a Smartly-allocated Cache
YonatanGlassner
KobyCrammer
ElectricalEngineeringDepartment
TheTechnion-IsraelInstituteofTechnology
6 Haifa,Israel32000
1 [email protected],[email protected]
0
2
February2,2016
n
a
J
1 Abstract cache to the threads in a way to maximize hit-rate,
3
namelythefractionofmemorycallsgetansweredby
] In many embedded systems, such as imaging sys- the (fast) cache, and not the (slow) memory. An ef-
G
tems, the system has a single designated purpose, fective allocation would take into consideration the
L
and same threads are executed repeatedly. Profiling various requirements of the threads and their behav-
.
s thread behavior, allows the system to allocate each ior when memory lacks. However, the exact nature
c
[ thread its resources in a way that improves overall of this behavior is unknown, and should be learned
systemperformance.Westudyanonlineresourceal- from experience. The repeatability of such systems
1
v locationproblem,wherearesourcemanagersimulta- provides opportunity for good threads identification
9 neously allocates resources (exploration), learns the (hit rate as function of given resources), which al-
0
impactonthedifferentconsumers(learning)andim- lowstheresourceallocatortoapproachoptimalallo-
3
0 proves allocation towards optimal performance (ex- cations.
0 ploitation).Webuildontherichframeworkofmulti-
. The problem of making decisions under uncer-
2 armed bandits and present online and offline algo-
tainty,orpartialknowledge,wasstudiedextensively
0
rithms. Through extensive experiments with both
6 intheliterature,andapopularmodelforthisproblem
1 synthetic data and real-world cache allocation to
istheMulti-ArmedBandit(MAB)[1].Inoursetting,
: threadsweshowthe meritsandpropertiesofoural-
v on each iteration the decision maker must choose
i gorithms.
X an allocation of the available resources (memory)
r for the threads to be executed, corresponding to the
a
1 Introduction ’arms’intheMABmodel.Subsequently,allthreads
are executed (arm is pulled), and a stochastic hit
rate (reward) is observed for each thread. A suitable
Consider a real-time X-ray system used for surgery.
framework for our setting is the combinatorial ban-
Such a system performs extensive real time image
dits(akaCMAB)framework,underfullinformation
processingofastreamofimages,andisrequirednot
feedbacksettings.
to have delays, nor loose frames. In practice such a
systemexecutesmanythreads(suchasFFTonvari- A typical relation between the memory allocated
ous parts of the image) on a few cores and a shared to a thread and its hit-rate is presented in Fig. 1 (we
cache,whichallowsfastaccesstomemory.Sincethe present the hit-rate vs the amount of memory allo-
amountofcacheislimited,thereisaneedtoallocate cated for two applications: gcc and bzip, which are
1
part of the CPU SPEC 2006 benchmark). This rela- In their fundamental paper, Auer et al. [4] pre-
tion is stochastic, and, its mean is characterized by sented the UCB1 (upper confidence bound) algo-
two important properties: it is monotonic in the al- rithm. On each iteration, the algorithm chooses the
locatedmemory,andexhibitsa’diminishingreturns’ armwiththehighestUCBoftheestimatedexpected
phenomena.Notethatthememory–expectedhit-rate value. Their key-method performs the exploration-
relation is clearly non-linear, thus discouraging the exploitation tradeoff implicitly. They prove that the
useoflinearbanditsbasedapproaches. number of time steps a sub-optimal arm is played
Many existing approaches for this problem are isbounded,yieldinglogarithmicregret(performance
static and hard-coded into hardware (see Liu et al. difference of an algorithm and optimal policy) uni-
[2]). Moreover, they ignore the specific characteris- formlyforallfinitetimes.
tics of the threads. Instead, we propose to learn the
Another line of research is when the algorithm
statisticalnatureofthreads,anduseittomakeagood may choose more than a single arm, and there
dynamic allocation. We study the empirical proper-
is a dependency between the arms. See again the
tiesofabenchmarkapplications.Wesuggestapara-
manuscript [3] for details and examples. The Com-
metricmodelwiththesamequalitativepropertiesas
binatorial MAB (CMAB) is a special case of MAB.
the real data. The model parameters are estimated
Here, arms (sometime also called super arms) are a
duringrun-time.However,allocationsaremadedur-
combination of basic arms, chosen from a finite set
ingtheprocess,eveniftheestimatesareonlyrough. F. Therefore, there is a structured dependency be-
The rest of the paper is organized as follows: in tween super arms. Ignoring that structure and using
Sec. 2 we describe related work in the field of multi traditional algorithms yields poor performance (see
armedbandits.Inaddition,webrieflyproviderelated e.g.theworkofGaietal.[5]),comparedtostructure-
work in the field of dynamic resource allocation in consideredalgorithm.
computer systems. In Sec. 3 we formulate our prob-
In a more recent paper, Chen et al. [6] proposed
lem and the modeling function chosen with respect
ageneralCMABformulation.Oneachstep,asuper
to the formulation. In Sec. 4 we introduce our algo-
arm, which is a subset of arms, is chosen out of a
rithms together with their analysis of expected per-
finite subset group F ∈ 2[m], where 2[m] is the set
formance. Empirical study with both synthetic and
of all possible subsets of arms, taken from m basic
real data is summarized in Sec. 5, and we conclude
arms. The expected reward is a general function of
with Sec. 6. Due to lack of space, some Technical
the set of arms played and expected performance of
materialandproofsarenotprovidedinthispaper.
all arms. In their algorithm, they assumed the exis-
tenceofanOracle,whichprovidesagoodsuperarm
2 Related work withhighprobability.Inourspecificproblem,wedo
not assume such an oracle exists, thus can not use
their framework directly. We rather provide a self-
The MAB problem is widely studied these days,
contained algorithm, which senses the environment,
where different formulations model various
exploration-exploitation (aka exp-exp) tradeoff providespredictionsandacts.
alternatives. We clearly can not review all variants, Inamorepracticalaspect,resourceallocationwas
and refer the reader to a recent manuscript in the investigatedinthefieldofcomputerarchitecture(see
area[3]. the work of Liu et al. [2] and of Suh et al. [7] for
Lai and Robbins [1] proposed one of the earli- cache hit rate optimization, and of Bitirgen et al.
estMABversion,inwhichthereareN independent [8] for global resources optimization using Artifi-
arms, each producing stochastic i.i.d rewards, taken cialNeuralNetwork).However,weareinterestedin
fromaknownfamilyofdistributions,withunknown providingageneralframeworkfortheresourceallo-
parameters.Theobjectiveistochoosearmssequen- cation problem, rather than a fine-tuned per domain
tially,soastomaximizethetotalreward. practicalsolution.
2
Recently, Lattimore et al. [9] studied a similar re- 1. The expected reward of the algorithm at time t is
source allocation problem, where a system manager givenby,
allocated resources to maximize system gain. How-
(cid:34) N (cid:35)
ever, they assume a piece-wise linear cut-off model, E (cid:88)s = ρ(cid:16)f(cid:126),m(cid:126) (cid:17) .
t,i t
defined by a single parameter, which is the change
i=1
point between the linear and constant range. We use
Similarly, the optimal expected reward is given by,
a different model which we believe is closer to real
(cid:16) (cid:17)
world behavior. Specifically, we assume a nonlinear ρ f(cid:126),m(cid:126)∗ . The expected instantaneous regret is
function, controlled by two parameters to model the defined to be the difference of rewards, R(t) =
(cid:16) (cid:17) (cid:16) (cid:17)
consumer gain. Our more complicated model yields ρ f(cid:126),m(cid:126)∗ −ρ f(cid:126),m(cid:126) ,andthecumulativeexpected
t
differentalgorithmswithdifferentbehavior.
regret is the sum of instantaneous regrets, R =
(cid:80)T R(t). The goal of a learning algorithm is to
t=1
3 Problem Setting minimizetheexpectedcumulativeregret.
It remains to define a family of parametric re-
We now describe formally the memory to threads wardfunctionsF fromwhichf willbechosen.We
i
allocation problem. There are N threads (or arms) restrict our discussion to families with two natural
which share M identical units of memory. We will properties, which are inherent for the resource allo-
next assume, by a simple normalization, that all re- cationmodel:
sourcesaresummedto1.Oniterationtanallocation Monotonicity: Allocating more memory does not
algorithmpartitionsthememorytothethreads,allo- decrease the expected reward, that is f(m ) ≥
1
catingm ∈ [0,1]fractionofthememorytothread f(m )form ≥ m .
t,i 2 1 2
i. We denote by m(cid:126) = (m ...m ) the resource Diminishing returns: Allocating more of memory
t t,1 t,N
allocationvector,whereclearlythealgorithmcannot does not increase the per-memory unit expected re-
allocatemorethan100%oftheresourcenorallocate ward, that is, f(m +δ)−f(m ) ≥ f(m +δ)−
1 1 2
(cid:80)
negative resources, thus m ≤ 1 and m ≥ f(m )form ≤ m .
i t,i t,i 2 1 2
0,∀t,i. Once the resources are allocated, or parti- These two properties were also observed in our
tioned, each thread i receives a stochastic bounded taskofallocatingmemory.InFig.1,mentionedpre-
reward s ∈ [0,1] based on these resources. We viously,wecanclearlyobserve,thatthesetwoabove
t,i
denote the reward vector by (cid:126)s = (s ...s ). propertiesholdforrealapplications.
t t,1 t,N
We assume that the expectation of each reward is Theaboveobservationmotivatedustoproposethe
given by a function of the resource allocated, that is following simple family of functions, with two pa-
E[s ] = f (m ), where f (x) is a fixed unknown rametersγ andβ ,
t,i i t,i i i i
(or partly unknown) function. In our application of
f (m ;γ ) = γ ·mβi
allocating memory to threads, the algorithm reward i i i i i
isthehit-rateobtainedforthespecificallocation.We
where γ ,β ∈ [0,1]. The parameter γ indicates the
denote by f(cid:126) = (f (·)...f (·)), the vector of ex- i i i
1 N maximalexpectedrewardofthreadiifallresources
pectedreward-functions,andby,
are allocated to it, as f (1;γ ) = γ , and thus is
i i i
boundedbyaunit,themaximalpossibleexpectedre-
N
(cid:16) (cid:17) (cid:88)
ρ f(cid:126),m(cid:126) = fi(mi), ward.βi isacurvatureparameterandisboundedby
a linear line. For β ≈ 0 an infinitesimal amount of
i
resource allows maximal gain, while for larger val-
The expected reward of allocation m(cid:126) with reward
ues of β the hit-rate - memory dependency is closer
functionsf(cid:126).GivenasetofN functionsf(cid:126),anoptimal
tolinear.
allocationmaximizestheexpectedrewardandgiven For a given β(cid:126), we identify a concave function
(cid:110) (cid:111)
by m(cid:126)∗ = argmaxm(cid:126) ρ f(cid:126),m(cid:126) subject to (cid:80)imi ≤ f(m) with the single associated parameter (cid:126)γ. We
3
• Input:β(cid:126)
• Initialization: Play N steps, where for each
arm i, allocate all memory to thread i, m =
t,i
δ andreceiverewards .
t,i t,i
• Fort = N +1...T
– Computeestimates: γˆ
t,i
– ComputeUCBγ = min{1,γˆ }+e
(cid:101)t,i t,i t,i
– Allocatememoryaccordingto{γ }
(cid:101)t
– Executewithm(cid:126) allocation
t
– Receivereward(cid:126)s ∈ {0,1}N
t
Figure1:Expectedhit-ratevsmemoryallocationfor
bzip and gcc applications. Error-bars indicate a unit
standard-deviation.Highervaluesindicatebetterper-
Figure2:UCB-RAAlgorithmtoallocatememoryto
formance.
threads.
abuse notation and write the expected instantaneous
(cid:16) (cid:17)
rewardas,ρ (cid:126)γ,m(cid:126);β(cid:126) = (cid:80)N γ mβi ,whichcanbe
i i i bination as an arm. We can now play with all arms
computed analytically if there is a shared parameter
using the existing MAB algorithms (i.e. UCB1 of
β acrossallthreads.Weusethatpropertyinthecon-
Auer et al. [4]). However, such an approach ignores
vergenceproof.
the structure of the problem and, since the actions
Weusethefactthatρ((cid:126)γ,m(cid:126))isaweightedβ-norm
are exponential in the number of threads, that ap-
oftheallocationvectorm(cid:126) tofindtheoptimalalloca-
proach is not feasible in practice. We propose two
tionm(cid:126)∗ whentheparametervector(cid:126)γ isknown.
algorithms. The first algorithm performs an initial
Lemma 1. Assuming β = β,∀i, the opti- pure exploration stage, and then a pure exploitation
i
mal allocation which maximizes the gain m(cid:126)∗ = stage. The second algorithm is a UCB-like algo-
argmax ρ((cid:126)γ,m(cid:126)(cid:48))isgivenby, rithm, which fundamentally trades-off exploration-
m(cid:126)(cid:48)
exploitation.Weanalyzethealgorithmsintheshared
1
γ1−β β case,whileourexperimentsalsodonefortheper-
m∗ = i , (1)
i N 1 threadβ parameters.
(cid:80) γ1−β i
j
j=1 We start with the first algorithm. When the num-
andtheexpectedrewardisgivenby, ber of rounds T is known, the algorithm performs
maxρ(cid:0)(cid:126)γ,m(cid:126)(cid:48)(cid:1) = ρ((cid:126)γ,m(cid:126)∗) = (cid:107)(cid:126)γ(cid:107) . (2) pure-exploitationforξT rounds,allocatinganequal
m(cid:126)(cid:48) 1−1β amount of 1/N memory to all threads. At this point
thealgorithmcomputesestimationoftheparameters
For brevity, the proof is not given here, but it can
γˆ . In the remaining rounds, the algorithm allocates
beeasilyderivedbyapplyingHo¨lderinequality. i
m (γˆ ...γˆ )using(1)asiftheestimatesaretheop-
i 1 N
timalparameters.WecallthatalgorithmFETE(First
4 Algorithms
ExploreThenExploit)anditissummarizedinFig.4.
Asimpleapproachforsolvingourproblemistodis- We use the following linear estimator for γ from
i
crete the allocation space of m(cid:126) and treat each com- pairs {(m ,s )}T (all parameters are subjected
t,i t,i t=1
4
• Input 0 ≤ ξ ≤ 1 the exploration-exploitation
tradeoff parameter, β the system parameter,
horizonT.
• For t = 1,...,ξT: allocate m = 1 for all
t,i N
threadsi = 1···N
t
(cid:80) mβτ,i·sτ,i
• ComputeEstimates:γˆ = τ=1
i t
(cid:80) m2β
τ,i
τ=1
• For t = ξT +1,..., T: allocate memory ac-
cordingtoestimatesγˆ using(1):
(a)
1
γˆ1−β
m = i ,i = 1···N
t,i
N 1
(cid:80)γˆ1−β
i
i=1
Figure 4: First exploit then explore algorithm for
memoryallocation.
OnecanshowthattheMMSEestimatorisgivenby
(b)
(cid:80)t mβ ·s
τ=1 τ,i τ,i
γˆ =
Figure3:Syntheticdataresults.(a)Performanceof t,i (cid:80)t m2β
five algorithms on synthetic data with fixed N = 8 τ=1 τ,i
and β = 0.1. The cumulative hit-rate (reward) is It is also concentrated around its mean, applying
shown vs number of iterations. Five algorithms are Hoeffdinginequality[10]
evaluated: Uniform (Fair-Share), (cid:15)-greedy (with or (cid:32) t (cid:33)
(cid:88)
without the model), UCB-RA, and the FETE algo- Pr[|γˆ −γ | > (cid:15)] ≤ 2exp −2(cid:15)2· m2β .
t,i i τ
rithm (b) Algorithms performance for the case of
τ=1
known but different β’s. FETE produces good re- (4)
sults,UCB-RApresentsthebestperformance.
An interesting property of that concentration in-
equality is that the variance is monotonically de-
creasingintheamountofresourcesanarmreceives.
tothreadi)
However, dependency is not linear. The algorithm
must perform efficient exploration steps given this
(cid:80)t ω ·s
γˆ = τ=1 τ τ . (3) property.
t (cid:80)t ω ·mβ Thefollowingtheoremboundstheregretoftheal-
τ=1 τ τ
gorithm, by setting ξ as a function of the number of
roundsT.
It is easy to show that this general formulation of a
linearestimator,givenasetofcoefficients ωτ,i ∈ Theorem 1. Assuming βi = β,∀i, the regret of
τ=1,···,t the algorithm in Fig. 4 is upper bounded by, R ≤
R,isunbiased: ξTN+√2N√1 √T(cid:112)log(2/δ).Furthermore,plug-
ξ
(cid:113)
E[γˆ ] = E(cid:34) (cid:80)tτ=1ωτ ·sτ (cid:35) = (cid:80)tτ=1ωτ,i·γmβτ,i = γging the opt(cid:113)imal value, ξopt = 3 ln2(T2δ) , yields,
t (cid:80)tτ=1ωτ ·mβτ (cid:80)tτ=1ωτ,i·mβτ,i R ≤ 2NT23 3 ln(22δ) = O(T23).
5
IncontrasttotheFETEalgorithmwhichseparates We first assume shared β and random values of
the exploration and exploitation to distinct epochs, (cid:126)γ.Wevariedthenumberofthreads-N.Oncethese
we now propose an alternative algorithm which in- parameterswereset,weexecutedeachalgorithmfor
herently combines exp-exp, using the UCB tech- 10,000 iterations, and computed reward and regret
nique. with respect to the optimal allocation (according to
Our algorithmic approach for the second algo- thevaluesetfor(cid:126)γ andβ).
rithm is inspired by the UCB1 algorithm of Auer et Next, we ran our algorithms for known values of
al. [4]. However, since each arm reward is a func- different β, one per process. Yet, since knowing the
tionofitsgivenresources,weprovideateachtimet β’s is a strong assumption, we assumed that the β’s
a Model Upper Confidence Bound (which is simply areknownuptoasmalldifference.Thisassumption
a UCB on the model parameters) which we obtain isrealistic,asweobservedinrealdatathatthereare
bythestatisticalmodel,andpreviousallocationsand roughlytwoclustersofprograms:memory-intensive
rewards. We then optimize the allocation according andnon-intensive.Memory-intensiveprogramshave
to the optimistic model, rather than the current esti- lower β’s in contrast with the non-intensive ones.
matedone. This clustering can be performed off-line, with re-
Inthegeneralcase,ouralgorithmestimatesmodel spect to the program’s characteristics. For the FETE
parameters {γˆt,i} and computes optimal allocation algorithm we simply took the highest β as a shared
withrespecttoUCBuponmodelvariables.Thisen- one.
suresthatallthreadswillreceiveenoughmemoryto
We compare five algorithms: (1) Fair-Share,
explore (estimate) well, and that the parameters es-
which is simply a uniform allocation, (2)+(3) (cid:15) -
t
timators will converge quickly enough to their true
greedy, which performs a random exploration step
values.
with probability (cid:15) , or otherwise exploits. We com-
t
We call the algorithm UCB-RA for resource allo-
pare two versions of (cid:15) -greedy: the first uses the
t
cation, and it is summarized in Fig. 2. In each step,
modelintheexploitationstepwhiletheseconduses
we provide a UCB e on the γ estimator. In the
t,i t,i the empiric best-so-far allocation. The parameter (cid:15)
t
general case, we set et,i using (4). For the analysis, decreases over time, and is given by: (cid:15)t = √(cid:15)0 We
we again assume that β is equal for all threads. For t
i tried different values of (cid:15) , and took the one with
0
thatcase,wesete = e = t−η forη = 0.5(1−β).
t,i t bestperformance.(4)TheexplorethenexploitFig.4.
Thistermisclearlynotoptimal,sinceattimetithas
(5)UCB-RA of Fig. 2. The methods of Liu et al.
shared value for all threads, regardless of their per-
[2], Suh et al. [7], Bitirgen et al. [8] are not relevant
formance so far. This choice of η allows us to prove
tooursettings,asthefirstisnotdesignedtooptimize
thefollowingtheorem,whichisprovidedherewith-
thecumulativehit-rate,thesecondusedapre-defined
outproof,forspacelimitation.
model,andthethirdassumesthestatisticsareknown
(i.e.,thereisaseparatetrainingphase).
Theorem 2. The regret of the algorithm of Fig. 2 is
(cid:16) 1+β(cid:17) Experiments are summarized in Fig. 3. The case
upperboundedby,O(cid:101) F(N,β)+T 2 ,wherethe
of same β is not a realistic case, and was simu-
functionF isindependentofT.
lated to support the theoretical analysis. One can
clearly observe the ”knee” of the FETE algorithm,
5 Experimental results which mark the exploration-exploitation transition
point. The uniform and (cid:15)-greedy algorithm w\o the
We evaluate our algorithms’ performance on both model suffer a linear regret. In the case of differ-
synthetic and real data. Starting with the synthetic ent and partly-known β’s (as explained previously),
data, we generated data using that model: a thread our UCB-RA algorithm obtain the best results (see
producesaBernoullidistributedbinarystochasticre- Fig.3(b)).
ward,suchthatPr(s = 1) = γ mβi. We performed extensive experiments to evaluate
t,i i t,i
6
(a) (b) (c)
Figure 5: Synthetic data results. Algorithms performance for the case of known but different β’s. FETE
producesgoodresults,UCB-RApresentsbetterones.Tough(cid:15)-greedypresentsthebestresults-itsstochas-
ticityparameterhastobetunedmanually,whichmeansitislessrobust.
the algorithms in a real-world setting. The task is to Performancefor2,3and4programcombinations
divide L1-cache among threads which are executed issummarizedinFig.5.Eachpointinthegraphrep-
onthecore. resents a specific combination of applications to be
We chose six programs belong to the SPEC- executed.Thex-axisisanindexofaspecificcombi-
CPU2006 benchmark: bzip, lbm, mcf, waves and nation.The y-axisisthe IPCnormalizedbyIPC
opt
2 gcc instances with different inputs. We executed (1isthemaximalvalue).Theoptimalallocationwas
each of the six programs and recorded all the mem- computedbyanexhaustivesearchintheoptionalal-
ory accesses, both read and write. We then divided locationsspace.
each memtrace to T distinct consecutive segments. For each algorithm, higher points indicate better
Wesimulatedeachofthesememtracesusingacache performance. In the left panel we choose 2 out of 6
simulator,with2Kavailable2-setcache,dividedinto programs (15 combinations), in 13 of which UCB-
blocksofsize16bit. RA was better than FairShare. In the middle graph
Weevaluatecombinationsof2,3or4outofthe6 - 19 out of 20, and in the right graph - 13 out of 15.
programs. For example, we started the execution of Notethattherewardtendstobemoresignificantthan
bzipandgccatthesametime,andsotheyneededto FairSharewhentheproblemishard,i.e.,thehitrate
sharememory. is low. Note that though (cid:15)-greedy achieves high per-
Two metrics are used: the hit-rate, which is the formance, we still needed to set its parameters, and
frequency of memory accesses that were stored in thusislessrobust.
the cache, and instructions per cycle (IPC). Specif-
ically, we computed the memory dependent IPC
6 Discussion
(and not the total IPC, which depends on many pa-
rameters). For simplicity, we replaced the different
Weinvestigatedstatisticalmethodstoallocatemem-
cache hierarchy timings and probabilities with one
ory to threads. We proposed a simple model for the
term: Memory , and used the following equation
time problem, that accurately captures the properties of
tocomputetheIPC:
the real memory allocation problem. We provided
(IPC )−1 = MR×Memory +(1−MR)×Cacthweo alg,orithms for the task, and performed an em-
mem time time
piricalstudywithbothsyntheticandreal-worlddata.
where MR is the average miss-rate of a program, Weexecutedseveralrealapplicationsinacontrolled
Memory and Cache are the time (in cycles) memory environment and analyzed a few allocation
time time
needed for a memory access and cache access. The strategies.Thememory-UCBoutperformedallother
formerisabout20timesthelatter. methods.
7
Although we have restricted our discussion to al- [10] W. Hoeffding, “Probability inequalities for sums of
locatingmemorytothreads,thetoolsdevelopedhere bounded random variables,” Journal of the American
Statistical Association, vol. 58, no. 301, pp. 13–30,
canbeusedinothercontexts,suchasallocatingmain
March 1963. [Online]. Available: http://www.jstor.org/
memory and cores to processes, allocating network
stable/2282952?
bandwidthtoclients,andsoon.
References
[1] T.L.LaiandH.Robbins,“Asymptoticallyefficientadap-
tive allocation rules,” Advances in applied mathematics,
vol.6,no.1,pp.4–22,1985.
[2] C.Liu,A.Sivasubramaniam,andM.Kandemir,“Organiz-
ingthelastlineofdefensebeforehittingthememorywall
forcmps,”inSoftware,IEEProceedings-. IEEE,2004,
pp.176–185.
[3] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of
stochastic and nonstochastic multi-armed bandit prob-
lems,” Foundations and Trends in Machine Learning,
vol.5,no.1,pp.1–122,2012.
[4] P.Auer,N.Cesa-Bianchi,andP.Fischer,“Finite-timeanal-
ysisofthemultiarmedbanditproblem,”Machinelearning,
vol.47,no.2-3,pp.235–256,2002.
[5] Y. Gai, B. Krishnamachari, and R. Jain, “Combina-
torial network optimization with unknown variables:
Multi-armed bandits with linear rewards,” arXiv preprint
arXiv:1011.4748,2010.
[6] W. Chen, Y. Wang, and Y. Yuan, “Combinatorial multi-
armed bandit: General framework and applications,” in
Proceedingsofthe30thInternationalConferenceonMa-
chineLearning(ICML-13),2013,pp.151–159.
[7] G.E.Suh,S.Devadas,andL.Rudolph,“Anewmemory
monitoringschemeformemory-awareschedulingandpar-
titioning,” in High-Performance Computer Architecture,
2002. Proceedings. Eighth International Symposium on.
IEEE,2002,pp.117–128.
[8] R.Bitirgen,E.Ipek,andJ.F.Martinez,“Coordinatedman-
agementofmultipleinteractingresourcesinchipmultipro-
cessors:Amachinelearningapproach,”inProceedingsof
the 41st annual IEEE/ACM International Symposium on
Microarchitecture. IEEE Computer Society, 2008, pp.
318–329.
[9] T.Lattimore,K.Crammer,andC.Szepesva´ri,“Optimalre-
sourceallocationwithsemi-banditfeedback,”inProceed-
ings of the 30th Conference on Uncertainty in Artificial
Intelligence(UAI),2014.
8