# A Painless Introduction to Reinforcement Learning: Q-Learning

By Feng Chao · Published November 20, 2017 · Estimated reading time: 17 minutes

## 8 Q-Learning

### 8.1 Q-Learning

Recall the on-policy TD (SARSA-style) update, which moves the estimate toward the target $R(s')+q(s',a')$ with a $1/N$ step size:

$$q_t(s,a)=q_{t-1}(s,a)+\frac{1}{N}\left[R(s')+q_{t-1}(s',a')-q_{t-1}(s,a)\right]$$

Q-Learning replaces the sampled next action $a'$ with a maximization over all next actions, which makes the update off-policy:

$$q_t(s,a)=q_{t-1}(s,a)+\frac{1}{N}\left[R(s')+\max_{a'} q_{t-1}(s',a')-q_{t-1}(s,a)\right]$$
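To see the max-based target in action, here is a minimal, self-contained sketch of tabular Q-Learning with the same $1/N$ averaging step size, run on a hypothetical three-state chain rather than the article's snake environment (all names below are illustrative):

```python
import numpy as np

# Hypothetical toy environment: a chain of states 0..2, where state 2 is
# terminal and reaching it yields reward 1; everything else yields 0.
N_STATES, N_ACTIONS, TERMINAL = 3, 2, 2

def step(s, a):
    s2 = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == TERMINAL else 0.0), s2

q = np.zeros((N_STATES, N_ACTIONS))   # Q-value estimates
n = np.zeros((N_STATES, N_ACTIONS))   # visit counts (the N in 1/N)
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != TERMINAL:
        a = rng.integers(N_ACTIONS)            # behaviour policy: uniform random
        r, s2 = step(s, a)
        # Q-Learning target: R(s') + max_a' q(s', a'); 0 beyond the terminal state.
        target = r + (0.0 if s2 == TERMINAL else q[s2].max())
        n[s, a] += 1
        q[s, a] += (target - q[s, a]) / n[s, a]  # incremental mean, 1/N step size
        s = s2

print(q)
```

Because the behaviour policy is random while the target takes the max, the learned Q-values approach the optimal ones even though the agent never acts greedily during training.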

```python
def q_learning(self):
    # Alternate Q-value evaluation and greedy policy improvement
    # until the policy stops changing.
    iteration = 0
    while True:
        iteration += 1
        self.q_learn_eval()
        ret = self.policy_improve()
        if not ret:
            break

def q_learn_eval(self):
    episode_num = 1000
    env = self.snake
    for i in range(episode_num):
        env.start()
        state = env.pos
        prev_state = -1
        prev_act = -1
        while True:
            act = self.policy_act(state)
            reward, state = env.action(act)
            if prev_act != -1:
                # Q-Learning target: R + max_a' q(s', a'); 0 at terminal (-1).
                return_val = reward + (0 if state == -1
                                       else np.max(self.value_q[state, :]))
                self.value_n[prev_state][prev_act] += 1
                # Incremental mean with 1/N step size.
                self.value_q[prev_state][prev_act] += (
                    return_val - self.value_q[prev_state][prev_act]
                ) / self.value_n[prev_state][prev_act]
            prev_act = act
            prev_state = state
            if state == -1:
                break
```

```
Timer Temporal Difference Iter COST:4.24033594131
return_pi=81
[0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0]
policy evaluation proceed 94 iters.
policy evaluation proceed 62 iters.
policy evaluation proceed 46 iters.
Iter 3 rounds converge
Timer PolicyIter COST:0.318824052811
return_pi=84
[0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0
 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0]
```

### 8.3 Outlook

#### 8.3.1 Function Approximation

$$S \times A \rightarrow \mathbb{R}$$

$$obj=\frac{1}{2}\sum_{i=1}^N\left(v'_i(s,a;w) - v_i\right)^2$$

$$\frac{\partial obj}{\partial w}=\sum_{i=1}^N (v_i' -v_i) \frac{\partial v_i'}{\partial w}$$

$$\frac{\partial v_i'}{\partial w}=1$$

When the model's gradient with respect to $w$ is identically 1 (that is, the model reduces to a single constant parameter), the optimal solution is simply the sample mean:

$$v_i'=\frac{1}{N}\sum_{i=1}^N v_i$$
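As a quick numeric check of this claim, the following sketch (a hypothetical toy setup, not the article's code) runs gradient descent on a single constant parameter $w$ and recovers the sample mean:

```python
import numpy as np

# When v'(s,a;w) = w, dv'/dw = 1, so the gradient of
# obj = 1/2 * sum_i (w - v_i)^2 is simply sum_i (w - v_i).
v = np.array([1.0, 2.0, 6.0])   # hypothetical sampled value targets
w = 0.0
lr = 0.1

for _ in range(200):
    grad = np.sum(w - v)         # sum_i (v'_i - v_i) * 1
    w -= lr * grad / len(v)      # averaged step for stability

print(w, v.mean())
```

The iterate contracts toward the fixed point $w = \bar{v}$, which is exactly the sample-mean solution stated above.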

#### 8.3.2 Policy Gradient

$$\max_{\theta}\; E_{\pi}[v_{\pi}(s_0)]$$

$$\nabla v_{\pi}(s_0)=E_{\pi}\left[\gamma^t G_t \nabla \log \pi(a_t|s_t;\theta)\right]$$

$$\theta_{t+1}=\theta_t + \alpha G_t \nabla \log\pi(a_t|s_t;\theta_t)$$

$$\theta_{t+1}=\theta_t + \alpha \left(G_t-b(s_t)\right) \nabla \log\pi(a_t|s_t;\theta_t)$$
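The baseline update above can be sketched on a two-armed bandit, where each episode is a single step so the $\gamma^t$ factor drops out (a hypothetical setup; the softmax policy and running-average baseline are illustrative choices, not from the article):

```python
import numpy as np

# REINFORCE with a baseline: theta += alpha * (G - b) * grad log pi(a).
rng = np.random.default_rng(0)
theta = np.zeros(2)              # one preference per arm
alpha = 0.1
baseline = 0.0                   # running-average baseline b
rewards = np.array([0.0, 1.0])   # arm 1 is strictly better

for t in range(1, 2001):
    pi = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=pi)
    G = rewards[a]                             # one-step return
    grad_log = -pi.copy()                      # grad log pi(a) = one_hot(a) - pi
    grad_log[a] += 1.0
    theta += alpha * (G - baseline) * grad_log
    baseline += (G - baseline) / t             # update b as a running mean

pi = np.exp(theta) / np.exp(theta).sum()
print(pi)
```

Subtracting the baseline does not change the expected gradient, but it shrinks its variance, so the probability mass shifts toward the better arm with far less noise than vanilla REINFORCE.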

#### 8.3.3 Actor-Critic

$$\theta_{t+1}=\theta_t + \alpha \left(R_{t+1}+\gamma \hat{v}(s_{t+1};w)-\hat{v}(s_t;w)\right) \nabla \log\pi(a_t|s_t;\theta_t)$$
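A minimal sketch of this one-step Actor-Critic update, assuming a hypothetical two-state chain, a tabular critic $\hat{v}$, and a softmax actor (none of these choices come from the article):

```python
import numpy as np

# One-step Actor-Critic: the TD error delta = R + gamma*v(s') - v(s)
# replaces the Monte Carlo return in the policy-gradient update.
rng = np.random.default_rng(1)
N_STATES, TERMINAL, GAMMA = 2, 2, 0.9
theta = np.zeros((N_STATES, 2))   # actor: per-state action preferences
v = np.zeros(N_STATES)            # critic: tabular state values
alpha, beta = 0.1, 0.1            # actor and critic step sizes

def step(s, a):
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == TERMINAL else 0.0), s2

for _ in range(2000):
    s = 0
    while s != TERMINAL:
        pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
        a = rng.choice(2, p=pi)
        r, s2 = step(s, a)
        v_next = 0.0 if s2 == TERMINAL else v[s2]
        delta = r + GAMMA * v_next - v[s]       # TD error
        v[s] += beta * delta                     # critic update
        grad_log = -pi                           # grad log pi = one_hot(a) - pi
        grad_log[a] += 1.0
        theta[s] += alpha * delta * grad_log     # actor update
        s = s2

probs = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
print(probs)
```

The critic's bootstrapped TD error gives the actor a per-step learning signal instead of waiting for the full return $G_t$, which is the key trade of Actor-Critic: lower variance at the cost of the critic's bias.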