Reinforcement Learning
Reinforcement Learning (RL) is the branch of machine learning where agents learn by doing. Unlike supervised learning (learning from labeled data) or unsupervised learning (finding patterns), RL is about learning from interaction.
An agent takes actions, the environment responds with feedback (a reward or penalty), and over time the agent learns a strategy, called a policy, that maximizes long-term rewards.
When I first experimented with RL, I trained an agent to play the classic game of Snake. Initially, the snake kept running into walls. But after thousands of tries, it started moving more carefully, collecting food, and surviving longer. Watching it learn felt like watching curiosity in action.
Real-world applications:
- Robotics → teaching robots to walk or grasp objects.
- Games → DeepMind’s AlphaGo beating world champions.
- Business → optimizing ad placements or supply chains.
Section 1: Core Concepts
1. Agent
The learner or decision-maker.
- In chess → the player (human or AI).
- In a self-driving car → the car’s AI system.
2. Environment
The world the agent interacts with.
- In chess → the board and opponent.
- For a car → roads, traffic, pedestrians.
3. Reward
Feedback that tells the agent how good its action was.
- In chess → +1 for winning, -1 for losing.
- For a car → negative reward for accidents, positive reward for reaching destination safely.
Example Visualization
+-----------+       action       +-------------+
|   Agent   | -----------------> | Environment |
|           | <----------------- |             |
+-----------+   reward, state    +-------------+
Insight:
The beauty of RL is that the agent doesn’t need explicit instructions. It just needs a way to act, observe, and receive rewards. Over time, it discovers what works best.
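To make that act-observe-reward loop concrete, here is a minimal sketch of what the interaction might look like in code. The Environment class and its step method are hypothetical stand-ins for illustration, not a specific library's API; the agent here just picks random actions.

import random

# Hypothetical 1D environment: the agent starts at position 0, the goal is position 4.
class Environment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action: 0 = move left, 1 = move right
        if action == 1:
            self.state = min(self.state + 1, 4)
        else:
            self.state = max(self.state - 1, 0)
        reward = 1 if self.state == 4 else 0   # reward only at the goal
        done = self.state == 4
        return self.state, reward, done

env = Environment()
done = False
while not done:
    action = random.choice([0, 1])             # act
    state, reward, done = env.step(action)     # observe the result
    print(f"action={action}, state={state}, reward={reward}")

A learning agent would replace the random choice with a policy that improves as rewards come in, which is exactly what Q-Learning does in the next section.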
Section 2: Q-Learning Basics
Q-Learning is one of the most popular RL algorithms. It helps an agent learn which actions to take in which situations (states).
The idea:
- Maintain a Q-table, where each entry Q(state, action) represents the expected future reward.
- Update the table as the agent interacts with the environment.
Update rule (don’t worry if it looks mathy):
Q(s, a) ← Q(s, a) + α * [ r + γ * max_a′ Q(s′, a′) - Q(s, a) ]
Where:
- s = current state
- a = action taken
- r = reward received
- s′ = next state
- α = learning rate (how much new info matters)
- γ = discount factor (importance of future rewards)
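To see what one application of this rule actually does, here is a single hand-worked update with made-up numbers (the same α and γ used in the code below). The agent has just reached the goal for the first time, so every Q-value is still 0:

alpha, gamma = 0.1, 0.9   # learning rate and discount factor
q_sa = 0.0                # current estimate Q(s, a)
reward = 1.0              # reward for stepping onto the goal
max_q_next = 0.0          # best Q-value in the next state (still 0 this early)

# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a))
q_sa = q_sa + alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)   # 0.1, the estimate takes a small step toward the target

Each visit nudges the estimate a little further; with α = 0.1 the agent trusts new information, but only a bit at a time.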
Code Example: Simple Gridworld Q-Learning
Imagine a 1D world: the agent starts at position 0, goal is at position 4.
import numpy as np
import random

# Parameters
n_states = 5
actions = [0, 1]          # 0 = left, 1 = right
q_table = np.zeros((n_states, len(actions)))
alpha = 0.1               # learning rate
gamma = 0.9               # discount factor
episodes = 20

for episode in range(episodes):
    state = 0             # start at position 0
    while state != 4:
        # Choose action (epsilon-greedy)
        if random.uniform(0, 1) < 0.2:   # explore
            action = random.choice(actions)
        else:                            # exploit
            action = np.argmax(q_table[state])

        # Take action
        if action == 1:   # move right
            next_state = min(state + 1, 4)
        else:             # move left
            next_state = max(state - 1, 0)

        # Reward
        reward = 1 if next_state == 4 else 0

        # Q-update
        q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action])

        state = next_state

    print(f"Episode {episode+1} finished. Q-table:\n", q_table)
Explanation of Code
- q_table = np.zeros((n_states, len(actions))) → start with all Q-values = 0 (no knowledge).
- epsilon-greedy → sometimes explore randomly (20%), sometimes exploit known best action.
- reward = 1 if next_state == 4 else 0 → only reaching the goal gives a reward.
- Q-update → adjusts the Q-value toward better estimates using the formula.
Over episodes, the agent learns: “Keep moving right until you reach the goal.”
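Once training finishes, the learned policy can be read straight out of the Q-table: in every state, pick the action with the highest Q-value. A small sketch, assuming the q_table variable from the example above is still in scope:

import numpy as np

# The greedy policy is simply the best-valued action in each state.
greedy_actions = np.argmax(q_table, axis=1)   # 0 = left, 1 = right
print(greedy_actions)   # states 0-3 should end up preferring "right" (1); state 4 is the goal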
I once gave a workshop where students trained a Q-learning agent to cross a maze. At first, it wandered aimlessly. But after enough episodes, the agent reliably found the shortest path. Watching their excitement when the agent "got it" reminded me why I love teaching RL.
Lessons Learned
- RL is about learning by interaction, not labels.
- The core concepts of Agent, Environment, and Reward are the building blocks.
- Q-Learning is a foundational algorithm that teaches agents to maximize long-term rewards using a Q-table.
- Even simple toy examples like a 1D world or Gridworld help cement the intuition behind RL.
Final thought:
Reinforcement learning feels closer to teaching a child than training a model. You don’t show it the answer — you let it try, fail, and learn. And that’s what makes RL so powerful and fascinating.
Frequently Asked Questions
What is reinforcement learning?
Reinforcement learning is a type of machine learning where an agent learns by interacting with an environment and receiving rewards or penalties for its actions.
What are the agent, environment, and reward in RL?
The agent is the learner or decision-maker, the environment is the world the agent interacts with, and the reward is feedback that guides learning.
What is Q-Learning?
Q-Learning is a reinforcement learning algorithm that teaches agents the best actions to take in each state by updating a Q-table with expected rewards.
How is Q-Learning different from supervised learning?
Unlike supervised learning, Q-Learning doesn’t require labeled data. It learns through trial and error, making it suitable for dynamic environments like games or robotics.
Where is reinforcement learning used in the real world?
Reinforcement learning powers self-driving cars, game-playing AIs like AlphaGo, industrial robotics, recommendation engines, and resource optimization systems.