Deep Reinforcement Learning: 0 to 100

Ever wondered how you'd teach a robot to land a drone without programming every single move? That's exactly what I set out to explore. I spent weeks building a game where a virtual drone has to figure out how to land on a platform, not by following pre-programmed instructions, but by learning from trial and error, just like how you learned to ride a bike.

This is Reinforcement Learning (RL), and it's fundamentally different from other machine learning approaches. Instead of showing the AI thousands of examples of "correct" landings, you give it feedback: "Hey, that was pretty good, but maybe try being more gentle next time?" or "Yikes, you crashed, probably don't do that again." Through countless attempts, the AI figures out what works and what doesn't.

In this post, I'm documenting my journey from RL fundamentals to building a working system that (mostly!) teaches a drone to land. You'll see the successes, the failures, and all the weird behaviors I had to debug along the way.

1. Reinforcement learning: Overview

Much of the idea can be related to Pavlov's dog and Skinner's rat experiments. The idea is that you give the subject a 'reward' when it does something you want it to do (positive reinforcement) and a 'penalty' when it does something bad (negative reinforcement). Through many repeated attempts, your subject learns from this feedback, gradually discovering which actions lead to success, similar to how Skinner's rat learned which lever presses produced food rewards.

Fig 1. Pavlov's classical conditioning experiment (AI-generated image by Google's Gemini)

In the same fashion, we want a system that will learn to do things (or tasks) such that it can maximize the reward and minimize the penalty. Note this fact about maximizing reward, which will come in later.

1.1 Core Concepts

When talking about systems that can be implemented programmatically on computers, the best practice is to write clear definitions for ideas that can be abstracted. In the study of AI (and more specifically, reinforcement learning), the core ideas can be boiled down to the following:

  1. Agent (or Actor): This is our subject from the previous section. This can be the dog, a robot trying to navigate a huge factory, a video game NPC, etc.
  2. Environment (or the world): This can be a place, a simulation with restrictions, a video game's virtual game world, etc. I think of this like, "A box, real or virtual, where the agent's entire life is confined; it only knows of what happens inside the box. We, as the overlords, can alter this box, while the agent will assume that god is exacting his will on his world."
  3. Policy: Just like in governments, companies, and many similar entities, 'policies' dictate "What actions should be taken when given a certain situation".
  4. State: This is what the agent "sees" or "knows" about its current situation. Think of it as the agent's snapshot of reality at any given moment, like how you see the traffic light color, your speed, and the distance to the intersection when driving.
  5. Action: Now that our agent can 'see' things in its environment, it may want to do something about its state. Maybe it just woke up from a long night's slumber, and now it wants a cup of coffee. In this case, the first thing it might do is get out of bed. This is an action that the agent takes to achieve its goal, i.e., GET SOME COFFEE!
  6. Reward: Every time the actor executes an action (of its own volition), something may change in the world. For example, our agent got out of bed and started walking towards the kitchen, but then, because it's so bad at walking, it tripped and fell. In this scenario, the god (us) rewards it with a punishment for being bad at walking (negative reward). But then the agent makes it to the kitchen and gets the coffee, so the god (us) rewards it with a cookie (positive reward).
Fig. 2 Illustration of a theoretical RL system

As you can imagine, most of these key components need to be tailored to the specific task/problem that we want the agent to solve.
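To make these pieces concrete, here is a tiny, self-contained sketch of how agent, environment, state, action, and reward interact in code. The coin-flip environment is made up purely for illustration; it has nothing to do with the drone game introduced below.

import random

class CoinFlipEnv:
    """A toy environment: the agent guesses a coin flip; reward +1 for a correct guess."""
    def reset(self):
        return 0                               # a dummy state
    def step(self, action):
        flip = random.randint(0, 1)
        reward = 1.0 if action == flip else -1.0
        done = True                            # one-step episodes
        return 0, reward, done

def random_policy(state):
    return random.randint(0, 1)                # ignore the state, act randomly

env, policy = CoinFlipEnv(), random_policy
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = policy(state)                     # policy: state -> action
    state, reward, done = env.step(action)     # environment reacts and rewards
    total_reward += reward                     # the agent tries to maximize this sum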

2. The Gym

Now that we understand the basics, you might be wondering: how do we actually build one of these systems? Let me show you the game I built.

For this post, I've written a bespoke video game that anyone can access and use to train their own machine learning agent to play the game.

The full code repository can be found on GitHub (please star it). I intend to use this repository for more games and simulation code, along with more advanced techniques that I'll implement in my next installments of posts on RL.

Delivery Drone

The delivery drone is a game where the objective is to fly a drone (presumably carrying deliveries) onto a platform. To win the game, we have to land. To land, we have to meet the following criteria:

  1. Be within landing proximity of the platform
  2. Be slow enough
  3. Be upright (landing upside down is more like crashing than landing)

All information on how to run the game can be found in the GitHub repository.

Here's what the game looks like:

Fig. 3 A screenshot of the game that I made for this project

If the drone flies off the screen or touches the ground, it will be considered a 'crash' case and thus result in a failure.

State description

The drone observes 15 continuous values that completely describe its situation; the full state vector is listed in the policy section below.

Landing Success Criteria: The drone must simultaneously achieve:

  1. Horizontal alignment: within platform bounds (|dx| < 0.0625)
  2. Stable approach speed: less than 0.3
  3. Level orientation: tilt less than 20° (|angle| < 0.111)
  4. Correct altitude: bottom of the drone touching the platform top

It's like parallel parking: you need the right position, the right angle, and to be moving slowly enough not to crash!
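As a rough illustration, a check along these lines captures the first three thresholds (this is my own sketch using the DroneState fields that appear later in the post, not the game's internal landing logic):

def meets_landing_criteria(state: DroneState) -> bool:
    """Sketch of the landing thresholds listed above; the real check also
    requires the drone's bottom to be touching the platform top."""
    aligned = abs(state.dx_to_platform) < 0.0625   # horizontal alignment within platform bounds
    slow = state.speed < 0.3                       # stable approach speed
    upright = abs(state.drone_angle) < 0.111       # tilt under ~20 degrees
    return aligned and slow and upright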

How can someone design a policy?

There are many ways to design a policy. It can be Bayesian (maintaining probability distributions over beliefs), it can be a simple lookup table for discrete states, a hand-coded rule system ("if distance < 10, then brake"), a decision tree, or, as we'll explore, a neural network that learns the mapping from states to actions through gradient descent.

Effectively, we want something that takes in the aforementioned state, performs some computation using this state, and returns the action that should be performed.
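For example, the hand-coded variety could be as simple as the sketch below. The thresholds and sign conventions here are assumptions made up for illustration; this is not the policy we end up training.

def rule_based_policy(state: DroneState) -> dict:
    """A toy rule-based policy: thrust up when descending quickly, and fire a side
    thruster to level out. Thresholds and sign conventions are made up."""
    action = {"main_thrust": 0, "left_thrust": 0, "right_thrust": 0}
    if state.drone_vy > 0.2:          # assumed: positive vy means falling
        action["main_thrust"] = 1
    if state.drone_angle > 0.05:      # tilted one way -> counter-rotate
        action["right_thrust"] = 1
    elif state.drone_angle < -0.05:   # tilted the other way
        action["left_thrust"] = 1
    return action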

Deep Learning to build a policy?

So how do we design a policy that can handle continuous states (like actual drone positions) and learn complex behaviors? This is where neural networks come in.

In the case of neural networks (or deep learning), it's often best to work with action probabilities, i.e., "What action is likely the best given the current state?". So, we can define a neural network that takes in the state as a 'vector' or 'collection of vectors'. This vector or collection of vectors has to be constructed from the observed state. For our delivery drone game, the state vector is:

State vector (from our 2D drone game)

The drone observes its absolute position, velocities, orientation, fuel, platform position, and derived metrics. Our continuous state is:

s = [x, y, v_x, v_y, θ, ω, fuel, p_x, p_y, d, Δx, Δy, speed, landed, crashed]

where the components are: the drone's position (x, y), its linear velocities (v_x, v_y), its tilt angle θ and angular velocity ω, its remaining fuel, the platform position (p_x, p_y), the distance d and the horizontal/vertical offsets (Δx, Δy) to the platform, the speed magnitude, and two binary flags for landed and crashed.

All components are normalized to roughly [0, 1] or [-1, 1] ranges for stable neural network training.

Action space (three independent binary thrusters)

Instead of discrete action combinations, we treat each thruster independently:

  • Main thruster (upward thrust)
  • Left thruster (clockwise rotation)
  • Right thruster (counter-clockwise rotation)

Each action is sampled from a Bernoulli distribution, giving us 3 independent binary decisions per timestep.

Neural-network policy (probabilistic with Bernoulli sampling)

Let f_θ(s) be the network outputs after the sigmoid activation, so p = f_θ(s) gives one probability per thruster. The policy is a product of independent Bernoulli distributions:

π_θ(a | s) = Π_{i=1}^{3} p_i^{a_i} · (1 − p_i)^{(1 − a_i)},  where a_i ∈ {0, 1}

Minimal Python sketch (from our implementation)

# build the state vector from a DroneState
s = np.array([
    state.drone_x, state.drone_y,
    state.drone_vx, state.drone_vy,
    state.drone_angle, state.drone_angular_vel,
    state.drone_fuel,
    state.platform_x, state.platform_y,
    state.distance_to_platform,
    state.dx_to_platform, state.dy_to_platform,
    state.speed,
    float(state.landed), float(state.crashed)
])

# the network outputs a probability for each thruster (after sigmoid)
action_probs = policy(torch.tensor(s, dtype=torch.float32))  # shape: (3,)

# sample each thruster independently from a Bernoulli distribution
dist = Bernoulli(probs=action_probs)
action = dist.sample()  # shape: (3,), e.g., [1, 0, 1] means main + right thrusters

This shows how we map the game's physical observations into a 15-dimensional normalized state vector and produce independent binary decisions for each thruster.

Code setup (part 1): Imports and game socket setup

We first need the game's socket listener to start. For this, you can navigate to the delivery_drone directory in my repository and run the following commands:

pip install -r requirements.txt # run this once to set up the required modules
python socket_server.py --render human --port 5555 --num-games 1 # run this whenever you need to run the game in socket mode

NOTE: You need PyTorch to run the code. Please make sure that you have set it up beforehand.

import os
import torch
import torch.nn as nn
import math
import numpy as np

from torch.distributions import Bernoulli

# Import the game's socket client
from delivery_drone.game.socket_client import DroneGameClient, DroneState

# set up the client and connect to the server
client = DroneGameClient()
client.connect()

How to design a reward function?

So what makes a good reward function? This is arguably the hardest part of RL (and where I spent a LOT of my debugging time 🫠).

The reward function is the soul of any RL implementation (and trust me, get this wrong and your agent will do the weirdest things). In theory, it should define what 'good' behaviour should be learnt and what 'bad' behaviour should not be learnt. Each action taken by our agent is characterised by the total accumulated reward for each behaviour trait exhibited by the action. For example, if you want the drone to land gently, you might give positive rewards for being close to the platform and moving slowly, while penalizing crashes or running out of fuel; the agent then learns to maximise the sum of all these rewards over time.

Advantage: A better way to measure effective reward

When training our policy, we don't just want to know whether an action rewarded us; we want to know whether it was better than usual. This is the intuition behind the advantage.

The advantage tells us: "Was this action better or worse than what we normally expect?"

In our implementation, we:

  1. Collect multiple episodes and calculate their returns (total discounted rewards)
  2. Compute the baseline as the mean return across all episodes
  3. Calculate advantage = return − baseline for each timestep
  4. Normalize advantages to have mean = 0 and std = 1 (for stable training)

Why this helps:

  • Actions with positive advantage → better than average → increase their probability
  • Actions with negative advantage → worse than average → decrease their probability
  • Reduces variance in gradient updates (more stable learning)

This simple baseline already gives us much better training than raw returns! It weighs the full sequence of actions against the outcomes (crashed or landed) so that the policy learns to take actions that lead to higher advantage.
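In code, those four steps amount to only a few lines. Here is a minimal NumPy sketch with made-up return values (the real computation over collected episodes appears later in the post):

import numpy as np

# returns from several collected episodes, flattened over timesteps (made-up numbers)
all_returns = np.array([12.0, 8.5, -3.0, 20.1, 5.5, -1.2])

baseline = all_returns.mean()                 # step 2: mean return as the baseline
advantages = all_returns - baseline           # step 3: advantage = return - baseline
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # step 4: normalize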

After a lot of trial and error, I designed the following reward function. The key insight was to condition rewards on both proximity AND vertical position: the drone must be above the platform to receive positive rewards, which prevents exploitation strategies like hovering below the platform.

A short note on inversely (and non-linearly) scaling rewards

Often, we want to reward behaviors inversely proportional to certain state values. For example, the distance to the platform ranges from 0 to ~1.41 (normalized by the window width). We want a high reward when the distance ≈ 0 and a low reward when far away. I use various scaling functions for this:

Fig. 4 Gaussian scaler function (an exponentially decaying curve)

Examples of other useful scaling functions

Helper functions:

def inverse_quadratic(x, decay=20, scaler=10, shifter=0):
    """Reward decreases quadratically with distance"""
    return scaler / (1 + decay * (x - shifter)**2)

def scaled_shifted_negative_sigmoid(x, scaler=10, shift=0, steepness=10):
    """Sigmoid function scaled and shifted"""
    return scaler / (1 + np.exp(steepness * (x - shift)))

def calc_velocity_alignment(state: DroneState):
    """
    Calculate how well the drone's velocity is aligned with the optimal direction to the platform.
    Returns cosine similarity: 1.0 = perfect alignment, -1.0 = opposite direction
    """
    # Optimal direction: from drone to platform
    optimal_dx = -state.dx_to_platform
    optimal_dy = -state.dy_to_platform
    optimal_norm = math.sqrt(optimal_dx**2 + optimal_dy**2)

    if optimal_norm < 1e-6:  # Already at the platform
        return 1.0

    optimal_dx /= optimal_norm
    optimal_dy /= optimal_norm

    # Current velocity direction
    velocity_norm = state.speed
    if velocity_norm < 1e-6:  # Not moving
        return 0.0

    velocity_dx = state.drone_vx / velocity_norm
    velocity_dy = state.drone_vy / velocity_norm

    # Cosine similarity
    return velocity_dx * optimal_dx + velocity_dy * optimal_dy
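The Gaussian scaler shown in Fig. 4 is not among the helpers above; a version along these lines (the exact parameters are my assumption) produces the same kind of peaked, exponentially decaying curve:

def gaussian_scaler(x, scaler=10, width=0.1, center=0.0):
    """Peaks at `scaler` when x == center and decays exponentially with distance from it."""
    return scaler * np.exp(-((x - center) ** 2) / (2 * width ** 2))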

Code for the current reward function:

def calc_reward(state: DroneState):
    rewards = {}
    total_reward = 0

    # 1. Time penalty - distance-based (penalize more when far)
    minimum_time_penalty = 0.3
    maximum_time_penalty = 1.0
    rewards['time_penalty'] = -inverse_quadratic(
        state.distance_to_platform,
        decay=50,
        scaler=maximum_time_penalty - minimum_time_penalty
    ) - minimum_time_penalty
    total_reward += rewards['time_penalty']

    # 2. Distance & velocity alignment - ONLY when above the platform
    velocity_alignment = calc_velocity_alignment(state)
    dist = state.distance_to_platform

    rewards['distance'] = 0
    rewards['velocity_alignment'] = 0

    # Key condition: the drone must be above the platform (dy > 0) to get positive rewards
    if dist > 0.065 and state.dy_to_platform > 0:
        # Reward movement toward the platform when the velocity is aligned
        if velocity_alignment > 0:
            rewards['distance'] = state.speed * scaled_shifted_negative_sigmoid(dist, scaler=4.5)
            rewards['velocity_alignment'] = 0.5

    total_reward += rewards['distance']
    total_reward += rewards['velocity_alignment']

    # 3. Angle penalty - distance-based threshold
    abs_angle = abs(state.drone_angle)
    max_angle = 0.20
    max_permissible_angle = ((max_angle - 0.111) * dist) + 0.111
    extra = abs_angle - max_permissible_angle
    rewards['angle'] = -max(extra, 0)
    total_reward += rewards['angle']

    # 4. Speed penalty - penalize excessive speed
    rewards['speed'] = 0
    speed = state.speed
    max_speed = 0.4
    if dist < 1:
        rewards['speed'] = -2 * max(speed - 0.1, 0)
    else:
        rewards['speed'] = -1 * max(speed - max_speed, 0)
    total_reward += rewards['speed']

    # 5. Vertical position penalty - penalize being below the platform
    rewards['vertical_position'] = 0
    if state.dy_to_platform > 0:  # Drone is above the platform (GOOD)
        rewards['vertical_position'] = 0
    else:  # Drone is below the platform (BAD!)
        rewards['vertical_position'] = state.dy_to_platform * 4.0  # Negative penalty
    total_reward += rewards['vertical_position']

    # 6. Terminal rewards
    rewards['terminal'] = 0
    if state.landed:
        rewards['terminal'] = 500.0 + state.drone_fuel * 100.0
    elif state.crashed:
        rewards['terminal'] = -200.0
        # Extra penalty for crashing far from the target
        if state.distance_to_platform > 0.3:
            rewards['terminal'] -= 100.0
    total_reward += rewards['terminal']

    rewards['total'] = total_reward
    return rewards

And yes, those magic numbers like 4.5, 0.065, and 4.0? They came from a lot of trial and error. Welcome to RL, where hyperparameter tuning is half art, half science, and half luck (yes, I know that's three halves).

def compute_returns(rewards, gamma=0.99):
    """
    Compute discounted returns (G_t) for each timestep based on the Bellman equation

    G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
    """
    returns = []
    G = 0

    # Compute backwards (more efficient)
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    return returns
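For example, with rewards [1, 0, 2] and gamma = 0.5, the backward pass gives G_2 = 2.0, then G_1 = 0 + 0.5 * 2.0 = 1.0, then G_0 = 1 + 0.5 * 1.0 = 1.5:

print(compute_returns([1, 0, 2], gamma=0.5))  # [1.5, 1.0, 2.0]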

The important thing to note is that reward functions require careful trial and error. One mistake or over-reward here, and the agent goes off optimizing behaviour that exploits the errors. This leads us to reward hacking.

Reward hacking

Reward hacking occurs when an agent finds an unintended way to maximize reward without actually solving the task you wanted it to solve. The agent isn't "cheating" on purpose; it's doing exactly what you told it to do, just not what you meant for it to do.

Classic example: If you reward a cleaning robot for "no visible dirt," it might learn to turn off its camera instead of cleaning!

My painful learning experience: I found this out the hard way. In an early version of my drone landing reward function, I gave the drone points for being "stable and slow" anywhere near the platform. Sounds reasonable, right? Wrong! Within 50 training episodes, my drone learned to just hover in place forever, racking up free points. It was technically optimal for my badly-designed reward function, but actually landing? Nope! I watched it hover for 5 minutes straight before I realized what was happening.

Here's the problematic code I wrote:

# DO NOT COPY THIS!
# If the drone is above the platform (|dx| < 0.0625) and close (distance < 0.25):
corridor_reward = inverse_quadratic(distance, decay=20, scaler=15)  # Up to 15 points
if stable and slow:
    corridor_reward += 10  # An extra 10 points!
# Total potential: 25 points per step!

An example of reward hacking in action:

Fig. 5 The drone learnt to hover around the platform and farm rewards
Fig. 6 Plot that shows that the drone is clearly reward hacking

Creating a policy network

As discussed above, we're going to use a neural network as the policy that powers the brain of our agent. Here's a simple implementation that takes in the state vector and computes a probability distribution over 3 independent actions:

  1. Activate the main thruster
  2. Activate the left thruster
  3. Activate the right thruster
def state_to_array(state):
    """Helper function to convert a DroneState dataclass into a float32 tensor"""
    data = np.array([
        state.drone_x,
        state.drone_y,
        state.drone_vx,
        state.drone_vy,
        state.drone_angle,
        state.drone_angular_vel,
        state.drone_fuel,
        state.platform_x,
        state.platform_y,
        state.distance_to_platform,
        state.dx_to_platform,
        state.dy_to_platform,
        state.speed,
        float(state.landed),
        float(state.crashed)
    ])

    return torch.tensor(data, dtype=torch.float32)

class DroneGamerBoi(nn.Module):
    def __init__(self, state_dim=15):
        super().__init__()

        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.LayerNorm(64),
            nn.ReLU(),
            nn.Linear(64, 3),
            nn.Sigmoid()
        )

    def forward(self, state):
        if isinstance(state, DroneState):
            state = state_to_array(state)

        return self.network(state)

Effectively, instead of the action space being one categorical choice over 2³ = 8 combinations, I reduced it to decisions over the 3 independent thrusters using Bernoulli sampling. This reduction makes optimization easier by treating each thruster independently rather than as one big categorical choice (at least that's what I think; I may be wrong, but it worked for me!).
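To see the difference between the two formulations, here is a rough side-by-side sketch using PyTorch distributions (only the Bernoulli version corresponds to what the post actually does; the probabilities are dummy values):

import torch
from torch.distributions import Categorical, Bernoulli

# Option A: one categorical choice over all 2**3 = 8 thruster combinations
probs_8 = torch.softmax(torch.randn(8), dim=0)   # dummy probabilities for illustration
combo = Categorical(probs=probs_8).sample()      # a single index in [0, 7]

# Option B (used here): three independent on/off decisions, one per thruster
probs_3 = torch.sigmoid(torch.randn(3))          # dummy per-thruster probabilities
thrusters = Bernoulli(probs=probs_3).sample()    # e.g. tensor([1., 0., 1.])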


Training a policy with policy gradients

Learning Strategies: When Should We Update?

Here's a question that tripped me up early on: should we update the policy after every single action, or wait and see how the whole episode plays out? It turns out this choice matters a lot.

When you try to optimize based purely on the reward received for a single action, you run into a high-variance problem (basically, the training signal is super noisy and the gradients point in random directions!). What I mean by "high variance" is that the optimization algorithm receives extremely mixed signals in the gradient used to update the parameters of our policy network. For the same action, the system may emit one gradient direction, but a slightly different state (with the same action) might yield something completely opposite. This leads to slow, and possibly no, training.

There are three ways we could update our policy:

Learning after every action (Per-Step Updates)

The drone fires its thruster once, gets a small reward, and immediately updates its entire strategy. This is like adjusting your basketball form after every single shot: way too reactive! One lucky action that increases the reward doesn't necessarily mean the agent did well, and one unlucky action doesn't mean it did badly. The learning signal is just too noisy.

My first attempt: I tried this approach early on. The drone would wiggle around randomly, make one lucky move that got a tiny bit more reward, immediately overfit to that exact move, and then crash repeatedly trying to reproduce it. It was painful to watch, like watching someone learn the wrong lesson from pure chance.

Learning after one full attempt (Per-Episode Updates)

Better! Now we let the drone try to land (or crash), see how the whole attempt went, and then update. This is like finishing an episode and then thinking about what to improve. At least now we see the full consequences of our actions. But here's the problem: what if that one landing was just lucky? Or unlucky? We're still basing our learning on a single data point.

Learning from multiple attempts (Multi-Episode Batch Updates)

This is the sweet spot. We run several (6 in my case) drone landing attempts concurrently, see how they all went, and then update our policy based on the average performance. Some attempts might get lucky, some unlucky, but averaged together, they give a much clearer picture of what actually works. Although this is quite heavy on the computer, if you can run it, it works far better than either of the previous methods. Of course, this method is certainly not the best, but it's fairly simple to understand and implement; there are other (and better) methods.

Here's the code to collect multiple episodes in the drone game:

def collect_episodes(client: DroneGameClient, policy: nn.Module, max_steps=300):
    """
    Collect episodes with early stopping

    Args:
        client: The game's socket client
        policy: PyTorch module
        max_steps: Maximum steps per episode (default: 300)
    """
    num_games = client.num_games

    # Initialize storage
    all_episodes = [{'states': [], 'actions': [], 'log_probs': [], 'rewards': [], 'done': False}
                    for _ in range(num_games)]

    # Reset all games
    game_states = [client.reset(game_id) for game_id in range(num_games)]
    step_counts = [0] * num_games  # Track steps per game

    while not all(ep['done'] for ep in all_episodes):
        # Batch active games
        batch_states = []
        active_game_ids = []

        for game_id in range(num_games):
            if not all_episodes[game_id]['done']:
                batch_states.append(state_to_array(game_states[game_id]))
                active_game_ids.append(game_id)

        if len(batch_states) == 0:
            break

        # Batched inference
        batch_states_tensor = torch.stack(batch_states)
        batch_action_probs = policy(batch_states_tensor)
        batch_dist = Bernoulli(probs=batch_action_probs)
        batch_actions = batch_dist.sample()
        batch_log_probs = batch_dist.log_prob(batch_actions).sum(dim=1)

        # Execute actions
        for i, game_id in enumerate(active_game_ids):
            action = batch_actions[i]
            log_prob = batch_log_probs[i]

            next_state, _, done, _ = client.step({
                "main_thrust": int(action[0]),
                "left_thrust": int(action[1]),
                "right_thrust": int(action[2])
            }, game_id)

            reward = calc_reward(next_state)

            # Store data
            all_episodes[game_id]['states'].append(batch_states[i])
            all_episodes[game_id]['actions'].append(action)
            all_episodes[game_id]['log_probs'].append(log_prob)
            all_episodes[game_id]['rewards'].append(reward['total'])

            # Update state and step count
            game_states[game_id] = next_state
            step_counts[game_id] += 1

            # Check done conditions
            if done or step_counts[game_id] >= max_steps:
                # Apply a timeout penalty if we hit max steps without landing
                if step_counts[game_id] >= max_steps and not next_state.landed:
                    all_episodes[game_id]['rewards'][-1] -= 500  # Timeout penalty

                all_episodes[game_id]['done'] = True

    # Return episodes
    return [(ep['states'], ep['actions'], ep['log_probs'], ep['rewards'])
            for ep in all_episodes]

The Maximization-Minimization Puzzle

In typical deep learning (supervised learning), we minimize a loss function:

θ* = argmin_θ L(θ)

We want to go "downhill" toward lower loss (better predictions).

But in reinforcement learning, we want to maximize the total reward! Our goal is:

maximize J(θ) = E_{τ ~ π_θ}[ Σ_t γ^t · r_t ]   (the expected discounted return when acting under the policy π_θ)

The problem: Deep learning frameworks are built for minimization, not maximization. How do we turn "maximize reward" into "minimize loss"?

The simple trick: Maximizing J(θ) = Minimizing −J(θ)

So our loss function becomes:

L(θ) = −J(θ)

Now, gradient descent will climb up the reward landscape (more like gradient ascent), because we are going downhill on the negative reward!

The REINFORCE Algorithm (Policy Gradient)

The policy gradient theorem (Williams, 1992) tells us how to compute the gradient of the expected reward:

∇_θ J(θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ]

(I know, I know: this looks intimidating. But stick with me, it's actually quite elegant once you see what's going on!)

Where:

  • π_θ(a_t | s_t) is the probability our policy assigns to taking action a_t in state s_t
  • G_t is the discounted return from timestep t onwards
  • θ are the weights of the policy network

In plain English (because that formula is dense):

  • If action a_t led to a high return G_t, increase its probability
  • If action a_t led to a low return G_t, decrease its probability
  • The gradient tells us which direction to adjust the neural network weights

Adding a Baseline (Variance Reduction)

Using raw returns G_t leads to high variance (noisy gradients). We improve this by subtracting a baseline b(s_t):

∇_θ J(θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t)) ]

The simplest baseline is the mean return:

b = (1/N) Σ_i G_i

This gives us the advantage: A_t = G_t − b

  • Positive advantage → action was better than average → increase its probability
  • Negative advantage → action was worse than average → decrease its probability

Why this helps: Instead of "this action gave reward 100" (is that good?), we now have "this action gave 100 when the average is 50" (that's great!). Relative performance is clearer than absolute.

Our Implementation

In our drone landing code, we use REINFORCE with a baseline:

# 1. Collect episodes and compute returns
#    (rewards and log_probs_tensor come from the episodes gathered by collect_episodes)
returns = compute_returns(rewards, gamma=0.99)  # G_t with discounting
returns_tensor = torch.tensor(returns, dtype=torch.float32)

# 2. Compute the baseline (mean of all returns)
baseline = returns_tensor.mean()

# 3. Compute advantages
advantages = returns_tensor - baseline

# 4. Normalize advantages (further variance reduction)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# 5. Compute the loss (note the negative sign!)
loss = -(log_probs_tensor * advantages).mean()

# 6. Gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()

We repeat the above loop as many times as we like, or until the drone learns to land properly. Check out this notebook for more code!
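Putting the pieces together, here is a rough sketch of the outer training loop, assembled from the snippets above (it assumes the client, DroneGamerBoi, collect_episodes, and compute_returns defined earlier). The number of iterations, the optimizer, and the learning rate are my assumptions; the real notebook may differ and adds logging on top.

policy = DroneGamerBoi(state_dim=15)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)    # learning rate is an arbitrary choice

for iteration in range(500):                                  # arbitrary number of training iterations
    episodes = collect_episodes(client, policy, max_steps=300)

    # flatten returns and log-probs across all episodes in the batch
    all_returns, all_log_probs = [], []
    for states, actions, log_probs, rewards in episodes:
        returns = compute_returns(rewards, gamma=0.99)
        all_returns.append(torch.tensor(returns, dtype=torch.float32))
        all_log_probs.append(torch.stack(log_probs))

    returns_tensor = torch.cat(all_returns)
    log_probs_tensor = torch.cat(all_log_probs)

    # baseline, advantages, and the REINFORCE update from the snippet above
    advantages = returns_tensor - returns_tensor.mean()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    loss = -(log_probs_tensor * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()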

Current Results (the reward function is still quite flawed)

After countless hours of tweaking rewards, adjusting hyperparameters, and watching my drone crash in creative new ways, I finally got it working (mostly!). Though my reward function isn't perfect, I do think it is able to teach a policy network. Here's a successful landing:

Fig. 6 The drone learnt something!

Pretty cool, right? But here's where things get interesting (and frustrating)…


The persistent hovering problem: A fundamental limitation

Even with the improved reward function that conditions rewards on vertical position (dy_to_platform > 0), the trained policy still exhibits a frustrating behavior: when the drone misses the platform, it learns to descend toward it but then hovers below the platform rather than attempting to land.

I spent over a week staring at reward plots (and changing reward functions), wondering why my "fixed" reward function was still producing this hovering behavior. When I finally plotted the accumulated rewards, the pattern became crystal clear, and honestly, I couldn't even be mad at the agent for finding this strategy.

What's happening?

By analyzing the accumulated rewards over an episode where the drone hovers below the platform, I discovered something interesting:

Fig. 7 Gif showing the "hovering below the platform" problem
Fig. 8 Plot that shows that the drone is clearly reward hacking

The plots reveal that:

  • Distance reward (orange): Accumulates to ~+70 early, then plateaus (no more rewards)
  • Velocity alignment (green): Accumulates to ~+30 early, then plateaus
  • Time penalty (blue): Steadily accumulates to ~-250 (keeps getting worse)
  • Vertical position (brown): Steadily accumulates to ~-200 (penalty for being below)
  • Total reward: Ends around -400 to -600 (after the timeout)

The key insight: The drone descends from above the platform (accumulating distance and velocity rewards on the way down), passes through the platform height, and then settles into hovering below it instead of completing the landing. Once below, it stops getting positive rewards (notice how the distance and velocity lines plateau around steps 50-60) but keeps accumulating time penalties and vertical position penalties. Still, this strategy remains viable because attempting to land risks an immediate -200 crash penalty, whereas hovering below "only" costs ~-400 to -600 over the full episode.

Why does this happen?

The fundamental issue is that our reward function r(s', a) can only see the current state, not the trajectory. Think about it: at any single timestep, the reward function can't tell the difference between:

  • A drone making progress toward landing (approaching from above with a controlled descent)
  • A drone exploiting the reward structure (oscillating to farm rewards)

Both might have dy_to_platform > 0 at a given moment, so they receive identical rewards! The agent isn't dumb; it's just optimizing exactly what you told it to optimize.

So what would actually fix this?

To truly solve this problem, I personally think that rewards should depend on state transitions: r(s, a, s') instead of just r(s, a). This would let you reward based on the transition (s being the current state, and s' the next state), for example:

  • Progress: Only reward if distance(s') < distance(s) (actually getting closer!)
  • Vertical improvement: Only reward if the drone is consistently moving upward relative to the platform
  • Trajectory consistency: Penalize rapid direction changes that indicate oscillation

This is a more principled solution than trying to patch the current reward function with increasingly harsh penalties (which is basically what I tried for a while, and it didn't really work). The oscillation exploit exists because we're fundamentally missing information about the trajectory; a rough sketch of a progress-based term follows.
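As a minimal sketch of that idea (not implemented in the repository; the function name and the weight are assumptions), a transition-based progress term could look like this:

def progress_reward(prev_state: DroneState, state: DroneState, weight=5.0):
    """Illustrative transition-based term: positive only when the drone actually
    got closer to the platform this step, i.e. distance(s') < distance(s)."""
    progress = prev_state.distance_to_platform - state.distance_to_platform
    return weight * progress  # weight is a placeholder, not a tuned value

Because it compares s and s', hovering in place earns nothing and drifting away is penalized.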

In the next post, I'll explore Actor-Critic methods and techniques that can incorporate temporal information to prevent these exploitation strategies. Stay tuned!

If you find a way to fix this, please reach out to me!

This brings us to the end of this post on "the simplest way to do Deep Reinforcement Learning."


Next on the list

  • Actor-Critic systems
  • DQL
  • PPO & GRPO
  • Applying this to systems that require vision 👀

References

Foundational Stuff

  1. Turing, A. M. (1950). "Computing Machinery and Intelligence."
    • Original Turing Test paper
  2. Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning.
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Classical Conditioning & Behavioral Psychology

  1. Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Oxford University Press.
    • Classical conditioning experiments
  2. Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. Appleton-Century-Crofts.
    • Operant conditioning and the Skinner Box

Policy Gradient Methods

  1. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Advances in Neural Information Processing Systems.
    • Theoretical foundations of policy gradients
  2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv preprint arXiv:1506.02438.

Neural Networks & Deep Learning

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Online Resources

  1. Karpathy, A. "Deep Reinforcement Learning: Pong from Pixels."
  2. Spinning Up in Deep RL by OpenAI

Code Repository

  1. Jumle, V. (2025). "Reinforcement Learning 101: Delivery Drone Landing."

Friend

  1. Singh, Navroop Kaur. (2025): For providing "Positive Vibes & Attention". Thank you!

All images in this article are either AI-generated (using Gemini), personally made by me, or screenshots & plots that I made.
