Share on Facebook Share on Twitter Email
Answers.com

Q-learning

 
Wikipedia: Q-learning

Q-learning is a reinforcement learning technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. A recent variation called delayed-Q learning has shown substantial improvements, bringing PAC bounds to Markov decision processes[citation needed].

Contents

Algorithm

The problem model consists of an agent, states S and a number of actions per state A. By performing an action a, where a \in A, the agent can move from state to state. Each state provides the agent a reward (a real or natural number) or punishment (a negative reward). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal for each state.

The algorithm therefore has a function which calculates the Quality of a state-action combination:

Q: S \times A \to \mathbb{R}

Before learning has started, Q returns a fixed value, chosen by the designer. Then, each time the agent is given a reward (the state has changed) new values are calculated for each combination of a state s from S, and action a from A. The core of the algorithm is a simple value iteration update. It assumes the old value and makes a correction based on the new information.

Q(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{old~value} + \underbrace{\alpha_t(s_t,a_t)}_{learning~rate} \times [\overbrace{\underbrace{r_{t}}_{reward} + \underbrace{\gamma}_{discount~factor} \underbrace{\max_{a}Q(s_{t+1}, a)}_{max~future~value}}^{expected~discounted~reward} - \overbrace{Q(s_t,a_t)}^{old~value}]

Where rt is the reward given at time t, αt(s,a) (0 < \alpha \le 1) the learning rates, may be the same value for all pairs. The discount factor γ is such that 0 \le \gamma < 1

The above formula is equivalent to:

Q(s_t,a_t) \leftarrow Q(s_t,a_t)(1-\alpha_t(s_t,a_t)) + \alpha_t(s_t,a_t) [r_{t} + \gamma \max_{a}Q(s_{t+1}, a)]

Note that one cannot calculate maxaQ(st + 1,a) when st + 1 is a terminal state, as an agent will take no further action in such a state. However, maxaQ(st + 1,a) corresponds to the utility of a state st and when a state is terminal, its utility is simply the reward received in that state (i.e., the last reward of an episode).[1] Thus, the final updating rule of an episode becomes:

Q(s_t,a_t) \leftarrow Q(s_t,a_t)(1-\alpha_t(s_t,a_t)) + \alpha_t(s_t,a_t) [r_{t} + \gamma r_{t+1}]

  1. ^ http://ai.vancouver.wsu.edu/~wallaces/courses/2008/ai/qlearning.pdf

Influence of variables on the algorithm

Learning rate

The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.

Discount factor

The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the Q values will diverge.

Implementation

Q-learning at its simplest uses tables to store data. This very quickly loses viability with increasing levels of complexity of the system it is monitoring/controlling. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his Backgammon playing temporal difference learning research. An adaptation of the standard neural network is required because the required result (from which the error signal is generated) is itself generated at run-time.

See also

External links



Search unanswered questions...
Enter a question here...
Search: All sources Community Q&A Reference topics
 
 
Learn More
SARSA
Alpha (machine learning)
Reinforcement learning

Where is the Q? Read answer...
Can you answer a q? Read answer...
What is Q? Read answer...

Help us answer these
Christmas french words to learn beginning with q?
Do you have a q?
What is you Q?

Post a question - any question - to the WikiAnswers community:

 

Copyrights:

Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Q-learning" Read more