# 无痛的增强学习入门：差分时序法

| 作者 冯超 7 他的粉丝 发布于 2017年11月3日. 估计阅读时间: 13 分钟 | QCon上海2018 关注大数据平台技术选型、搭建、系统迁移和优化的经验。

## 7 差分时序法

### 7.2 差分时序

$$q(s,a)=\sum_{s'}p(s'|s,a)[R + \sum_{a'} \pi(a|s')q(s',a')]$$

$$q(s,a)=\frac{1}{N} \sum_{i=1}^N [R(s'_i)+q(s'_i,a'_i)]$$

$$q_t(s,a)=q_{t-1}(s,a)+\frac{1}{N}[R(s')+q(s',a')-q_{t-1}(s,a)]$$

### 7.3 On-policy：SARSA

for i in range(episode_num):
env.start()
state = env.pos
prev_act = -1
while True:
act = self.policy_act(state)
reward, state = env.action(act)
if prev_act != -1:
return_val = reward + (0 if state == -1 else self.value_q[state][act])
self.value_n[prev_state][prev_act] += 1
self.value_q[prev_state][prev_act] += (return_val - \
self.value_q[prev_state][prev_act]) / \
self.value_n[prev_state][prev_act]

prev_act = act
prev_state = state

if state == -1:
break

def td_demo():
np.random.seed(0)
env = Snake(10, [3,6])
agent = MonteCarloAgent(100, 2, env)
with timer('Timer Temporal Difference Iter'):
agent.td_opt()
print 'return_pi={}'.format(eval(env,agent))
print agent.policy

agent2 = TableAgent(env.state_transition_table(), env.reward_table())
with timer('Timer PolicyIter'):
agent2.policy_iteration()
print 'return_pi={}'.format(eval(env,agent2))
print agent2.policy

Timer Temporal Difference Iter COST:3.8632440567
return_pi=80
[0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0]
policy evaluation proceed 94 iters.
policy evaluation proceed 62 iters.
policy evaluation proceed 46 iters.
Iter 3 rounds converge
Timer PolicyIter COST:0.330899000168
return_pi=84
[0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0]