My question is the following: How to calculate the regret in practice?
I am trying to implement the regret matching algorithm but I do not understand how to do it.
- First, I have $n$ players with the joint action space $\mathcal{A}=\{a_0, a_1,\cdots,a_m\}^n.$
- Then, I fix some period $T$. The action set $A^t\in\mathcal{A}$ is the action set chosen by players at time $t$. After the period $T$ (every player has chosen an action). So I get $u_i(A^t)$.
Now the regret of player $i$ of not playing action $a_i$ in the past is: (here $A^t\oplus a_i$ denotes the strategy set obtained if player $i$ changed its strategy from $a'_i$ to $a_i$) $$\max\limits_{a_i\in A_i}\left\{\dfrac{1}{T}\sum_{t\leqslant T}\left(u_i(A^t\oplus a_i )-u_i(A^t)\right)\right\}.$$ I do not understand how to calculate this summation. Why there is a max over the action $a_i\in A_i$? Should I calculate the regret of all actions in $A_i$ and calculate the maximum? Also, In Hart's paper, the maximum is $\max\{R, 0\}$. Why is there such a difference?
I mean if the regret was: $\dfrac{1}{T}\sum_{t\leqslant T}\left(u_i(A^t\oplus a_i )-u_i(A^t)\right),$ the calculation would be easy for me.
The regret is defined in the following two papers [1] (see page 4, equation (2.1c)) and [2] (see page 3, section I, subsection B).
- A simple adaptive procedure leading to correlated equilibrium by S. Hart et al (2000)
- Distributed algorithms for approximating wireless network capacity by Michael Dinitz (2010)
I would like to get some helps from you. Any suggestions step by step how to implement such an algorithm please?
Asked By : Azzo
Answered By : Rahul Savani
The index set of the max operation is $A_i$, the actions of player $i$. The formula says: take each such action $a_i \in A_i$ and compute its regret (with the sub-formula you say you can implement easily), and then take the maximum of those regrets.
The reason for the $\max(R,0)$ is that actions with negative regrets are performing worse than the action currently chosen.
To implement this in code, just set a temporary variable $t$ to be 0. Now loop through the actions one by one, and for each action $a$, compute its regret $r$, and set $t$ as $\max(r,t)$. Note that this approach includes the $\max(R,0)$ operation; to do this without that, set $t$ initially to $-\infty$.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/27915
0 comments:
Post a Comment
Let us know your responses and feedback