The fundamental problem was that my reward function could only see the current state, not the trajectory. When I rewarded the drone for being close to the platform, it couldn't tell the difference between a drone making progress toward landing and a drone that had already passed through the platform and was now exploiting the reward structure from below. The reward function r(s') just looked at where the drone was, not how it got there or where it was going. (This will become a recurring theme, by the way. Reward engineering haunts me in my sleep at this point.)
But here's where things get interesting. While I was staring at my drone hovering below the platform for what felt like the hundredth time, I kept thinking: why am I waiting for the whole episode to finish before learning anything? REINFORCE made me collect a full trajectory, watch the drone crash (or occasionally land), compute all the returns, and only then update the policy. What if we could just… learn after every single step? Like, get immediate feedback as the drone flies? Wouldn't that be far more efficient?
That's Actor-Critic. And spoiler alert: it works way better than I expected. Well, after I fixed three major bugs, rewrote my reward function twice, spent two days thinking PyTorch was broken (it wasn't, I was just using it wrong), and finally understood why my discount factor was making terminal rewards completely invisible. But we'll get to all of that.
In this post, I'm going to walk you through my entire journey implementing Actor-Critic methods for the drone landing task. You'll see the successes, the frustrating failures, and the debugging marathons. Here's what we're covering:
Basic Actor-Critic with TD error, which got me to a 68% success rate and converged twice as fast as REINFORCE. This part worked surprisingly well once I fixed the moving target bug (more on that nightmare later).
My attempt at Generalized Advantage Estimation (GAE), which completely failed. I spent three entire days debugging why my critic values were exploding into the thousands, tried every fix I could think of, and eventually just… gave up and moved on. Sometimes you need to know when to pivot. (I'm still a bit salty about this one, honestly.)
Proximal Policy Optimization (PPO), which finally gave me stable, robust performance and taught me why the entire RL industry just uses it by default. Turns out when OpenAI says "this is the thing," they're probably right.
But more importantly, you'll learn about the three critical bugs that nearly derailed everything. These aren't small "oops, typo" bugs. These are "stare at training curves for six hours, wondering if you fundamentally misunderstand neural networks" bugs:
- The moving target problem that made my critic loss oscillate forever because I didn't detach the TD target (this one made me question my entire understanding of backpropagation)
- The gamma value that was too low, making the landing reward worth roughly 0.00007 after discounting, so my agent just learned to crash immediately because why bother trying? (I printed the actual discounted values and laughed, then cried)
- The reward exploits, where my drone learned to zoom past the platform at maximum speed, collect distance rewards on the way, and crash far away because that was somehow better than landing. This taught me that 90% of RL really is reward engineering, and the other 90% is debugging why your reward engineering didn't work. (Yes, I know that's 180%. That's how much work it is.)
Let's dive in. Grab some coffee, you're going to need it. All of the code can be found in my repository on my GitHub.
What’s Actor-Critic?
REINFORCE had one fundamental problem: we had to wait. Wait for the drone to crash. Wait for the episode to end. Wait to compute the full return. Then, and only then, could we update the policy. One learning signal per episode. For a 150-step trajectory, that's one update after watching 150 actions play out.
I ran REINFORCE for 1200 iterations (6 hours on my machine) to hit a 55% success rate. And the whole time I kept thinking: this feels wasteful. Why can't I learn during the episode?
Actor-Critic fixes this with a simple idea: train a second neural network (the "critic") to estimate future returns for any state. Then use those estimates to update the policy after every single step. No more waiting for episodes to finish. Just continuous learning as the drone flies.
The result? A 68% success rate in 600 iterations (3 hours). Half the time. Better performance. Same hardware.
How it works: two networks collaborate in real time.
The Actor (π(a|s)): the same policy network from REINFORCE. It takes the current state and outputs action probabilities. This is the network that actually controls the drone.
The Critic (V(s)): a new network. It takes the current state and estimates "how good is this state?" It outputs a single value representing expected future rewards. It isn't tied to any specific action; it just evaluates states.
Here's the clever part: the critic provides immediate feedback. The actor takes an action, the environment updates, and the critic immediately evaluates whether that moved us to a better or worse state. The actor learns from this signal and adjusts. The critic simultaneously learns to make better predictions. Both networks improve together as episodes unfold.

In code, they look like this:
import torch.nn as nn

class DroneGamerBoi(nn.Module):
    """The Actor: outputs action probabilities"""
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
            nn.Linear(64, 3),  # Three independent thrusters
            nn.Sigmoid()
        )

    def forward(self, state):
        return self.network(state)  # Output: probabilities for each thruster


class DroneTeacherBoi(nn.Module):
    """The Critic: outputs a state-value estimate"""
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
            nn.Linear(64, 1)  # Single value: V(s)
        )

    def forward(self, state):
        return self.network(state)  # Output: scalar value estimate
Notice the critic network is almost identical to the actor, except the final layer outputs a single value (how good is this state?) instead of action probabilities.
The Bootstrapping Trick
Okay, right here’s the place it will get intelligent. In REINFORCE, we would have liked the complete return to replace the coverage:
[ G_t = r_t + gamma r_{t+1} + gamma^2 r_{t+2} + cdots + gamma^{T-t} r_T ]
We needed to wait till the episode ended to know all of the rewards. However what if… we didn’t? What if we simply estimated the long run utilizing our critic community?
As a substitute of computing the precise return, we estimate it:
[ G_t approx r_t + gamma V(s_{t+1}) ]
That is known as bootstrapping. The critic “bootstraps” its personal worth estimate to approximate the complete return. We use its prediction of “how good will the subsequent state be?” to estimate the return proper now.
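To make the contrast concrete, here's a minimal sketch (my own helper names, not code from the repository) of the two targets side by side, assuming a critic like the one above and states already converted to tensors:

import torch

gamma = 0.99

def monte_carlo_return(rewards):
    """REINFORCE-style target: needs every reward until the episode ends."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def bootstrapped_target(reward, next_state, critic):
    """Actor-Critic target: one real reward, then trust the critic for the rest."""
    with torch.no_grad():  # used as a target, not trained through (more on this later)
        return reward + gamma * critic(next_state).item()

The Monte Carlo version can't even be computed until the episode is over; the bootstrapped version is available the moment the drone takes a step.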

Why does this help?
Lower variance. We're not waiting for the actual random sequence of future rewards. We're using an estimate based on what we've learned about states in general. The estimate can be off (the critic will sometimes be wrong!), but it's far less noisy than any single episode outcome.
Online learning. We can update immediately at every step. No need to finish the episode first. As soon as the drone takes one action, we know the immediate reward and we can estimate what comes next, so we can learn.
Better sample efficiency. In REINFORCE with 6 parallel games, each drone learns once per completed episode. In Actor-Critic with 6 parallel games, each drone learns at every step (about 150 steps per episode). That's 150x more learning signals per episode!
Of course, there's a trade-off: we introduce bias. If our critic is wrong (and it will be, especially early in training), our agent learns from incorrect estimates. But the critic doesn't have to be perfect. It just needs to be less noisy than a single episode outcome. As the critic gradually improves, the actor learns from better feedback. They bootstrap each other upward. In practice, the variance reduction is so powerful that it's worth accepting the small bias.
TD Error: The New Advantage
Now we need to answer: how much better or worse was this action than expected?
In REINFORCE, we had the advantage: the actual return minus a baseline. The baseline was a global average. But we can do much better. Instead of a global baseline, we use the critic's state-specific estimate.
The TD (temporal difference) error is our new advantage:
\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
In plain terms:
- \( r_t + \gamma V(s_{t+1}) \) = the TD target. The immediate reward plus our estimate of the next state's value.
- \( V(s_t) \) = our prediction for the current state.
- \( \delta_t \) = the difference. Did we do better or worse than expected?
If \( \delta_t > 0 \), we did better than expected → reinforce that action.
If \( \delta_t < 0 \), we did worse than expected → decrease that action's probability.
If \( \delta_t \approx 0 \), we were spot on → the action was about average.
This is far more informative than REINFORCE's global baseline. The signal is now state-specific. A drone in a nasty spin might get a -10 reward and that's actually pretty good (it usually gets -50 there). But if it's hovering peacefully over the platform, -10 is terrible. The critic knows the difference. The TD error captures it.
Right here’s how this flows by means of the coaching loop (simplified):
# 1. Take one action in each parallel game
action = actor(state)
next_state, reward = env.step(action)

# 2. Get value estimates
value_current = critic(state)
value_next = critic(next_state)

# 3. Compute the TD error (our advantage)
td_error = reward + gamma * value_next - value_current

# 4. Update the critic: it should have predicted better.
# The critic wants to minimize prediction error, so we use a squared error.
# The gradient pushes the critic's predictions closer to the observed targets.
critic_loss = td_error ** 2
critic_loss.backward()
critic_optimizer.step()

# 5. Update the actor: reinforce or discourage based on the TD error
# (same policy gradient as REINFORCE, but with the TD error instead of returns)
actor_loss = -log_prob(action) * td_error
actor_loss.backward()
actor_optimizer.step()
Discover we’re updating each networks per step, not per episode. That’s the web studying magic.
Yet another comparability to make this crystal clear:
| Methodology | What We Study From | Timing | Baseline |
|---|---|---|---|
| REINFORCE | Full return G_t | After episode ends | World common of all returns |
| Actor-Critic | TD error δ_t | After each step | State-specific V(s_t) |
The second is extra exact, extra informative, and arrives a lot sooner.

This is why Actor-Critic converged in 600 iterations on my machine while REINFORCE needed 1200. Same reward function, same environment, same drone. But getting feedback after every step instead of every 150 steps? That's a 150x information advantage per iteration.
The Three Bugs: A Debugging Odyssey
Alright, I’m about to let you know about three bugs that almost broke me. Not “oops, off-by-one error” damaged. I imply the type of damaged the place you stare at coaching curves for six hours, critically query whether or not you perceive backpropagation, debug your code 5 instances, after which spend one other two hours studying tutorial papers to persuade your self you’re not insane.
These bugs are sufficiently subtle that even skilled RL practitioners must watch out. The excellent news: when you perceive them, they develop into apparent. The dangerous information: you must perceive them first, and I discovered the onerous method.
Bug #1: The Transferring Goal Drawback
The Setup
I implemented Actor-Critic exactly the way that seemed logical. I have two networks. One predicts actions, one predicts values. Simple, right? I wrote out the TD error computation:
# Compute value estimates
values = critic(batch_data['states'])
next_values = critic(batch_data['next_states'])

# Compute TD targets and errors
td_targets = rewards + gamma * next_values * (1 - dones)
td_errors = td_targets - values

# Critic loss
critic_loss = (td_errors ** 2).mean()

# Backward pass
critic_loss.backward()
This looked completely reasonable to me. We compute what we expected (values), we compute what we should have gotten (td_targets), we measure the error, and we update. Standard supervised learning stuff.
The Symptom: Nothing Works
I trained for 200 iterations and the critic loss was… sitting around 500-1000 and not moving. Not decreasing, not increasing, just oscillating wildly like a sine wave. I checked my reward function. Looked fine. I checked the critic network. Standard architecture, nothing weird. I checked the TD error values themselves. They were bouncing around between -50 and +50, which seemed reasonable given the reward scale.
But the loss refused to converge.
I spent two days on this. I added dropout, thinking maybe it was overfitting. (Wrong problem, didn't help.) I lowered the learning rate from 1e-3 to 1e-4, thinking maybe the optimizer was overshooting. (Nope, it just learned more slowly while oscillating.) I checked whether my environment was returning NaNs. (It wasn't.) I even wondered if PyTorch's autograd had a bug. (Spoiler: PyTorch is fine, I was the bug.)
The Breakthrough
I was reading the Actor-Critic chapter in Sutton & Barto (again, for the fifth time) when something caught my eye. The pseudocode had a line about "computing the next value estimate." And I thought: wait, when I compute next_values = critic(next_states), what happens to those gradients during backprop?
And then my brain went click. Oh no. The target moves as we try to optimize toward it. This is called the moving target problem.
Why This Breaks Everything
When we compute next_values = critic(next_states) without detaching, PyTorch's autograd flows gradients through BOTH V(s) and V(s'). That means we're updating the prediction AND the target simultaneously: the critic chases a target that moves every time it updates. The gradient becomes:
\[ \frac{\partial L}{\partial \theta} = 2 \cdot \left( r + \gamma V(s') - V(s) \right) \cdot \left( \gamma \frac{\partial V(s')}{\partial \theta} - \frac{\partial V(s)}{\partial \theta} \right) \]
That γ · ∂V(s')/∂θ term is the problem: we're telling the critic to change the target, not just the prediction. The loss oscillates forever.
The Fix (Finally)
I needed to treat the TD target as a fixed constant. In PyTorch, that means keeping gradients out of it:
# ✅ CORRECT
values = critic(batch_data['states'])
with torch.no_grad():  # CRITICAL LINE
    next_values = critic(batch_data['next_states'])
td_targets = rewards + gamma * next_values * (1 - dones)
td_errors = td_targets - values
critic_loss = (td_errors ** 2).mean()
critic_loss.backward()
The torch.no_grad() context manager says: "Compute these next values, but don't remember how you computed them. For gradient purposes, treat this as a constant." Now during the backward pass:
\[ \frac{\partial L}{\partial \theta} = 2 \cdot \left( r + \gamma V(s') - V(s) \right) \cdot \left( - \frac{\partial V(s)}{\partial \theta} \right) \]
That problematic term is gone! Now we're only updating V(s), the prediction, to match the fixed target r + γV(s'). This is exactly what we want.
The TD target becomes what it should be: a fixed label, like the ground truth in supervised learning. We're no longer trying to hit a moving target. We're just trying to predict something stable.
I changed exactly one line. The critic loss went from oscillating chaotically around 500-1000 to decreasing smoothly: 500 → 250 → 100 → 35 → 8 over 200 iterations. This bug is insidious because the buggy code looks completely reasonable. Always detach your TD targets.
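As an aside, calling .detach() on the next-state values achieves the same effect as the no_grad() block (no_grad() additionally skips building the autograd graph for that forward pass). A quick sketch of the equivalent variant:

# Equivalent fix: detach the next-state values so no gradient flows into the target
values = critic(batch_data['states'])
next_values = critic(batch_data['next_states']).detach()  # the target is now a constant
td_targets = rewards + gamma * next_values * (1 - dones)
td_errors = td_targets - values
critic_loss = (td_errors ** 2).mean()
critic_loss.backward()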

Bug #2: Gamma Too Low (Invisible Rewards)
The Setup
Alright, Bug #1 was subtle. This bug is embarrassingly obvious in hindsight. But you know what? Sometimes the most obvious mistakes are the easiest to miss, because you don't expect the problem to be that simple.
I fixed the moving target bug and the critic loss immediately started converging. Fantastic! I felt like a real engineer for a second there. But then I ran the agent through a full training run and… nothing. Absolutely nothing improved. The drone would take a few random moves and then immediately crash into the ground or fly off the screen. No learning. No improvement. No signs of life.
Actually, wait. The critic was learning. The loss was going down. But the drone wasn't getting any better. That seemed backwards. Why would the critic learn to predict values if the agent wasn't learning anything from those values?
The Discovery
I printed the TD targets and they were all negative, ranging from -5 to -30. No sign of the +500 landing reward. Then I did the math: with 150-step episodes and gamma = 0.90:
\[ 500 \times 0.90^{150} \approx 0.00007 \]
The landing reward had been discounted into oblivion. The agent learned to crash immediately because trying to land was literally invisible to the value function.
The discount factor γ controls the effective horizon (≈ 1/(1-γ)). With gamma = 0.90, that's only about 10 steps, far too short for 100-300 step episodes.
The fix: change gamma from 0.90 to 0.99.
The Impact
I changed gamma from 0.90 to 0.99. Same network, same rewards, same everything else.
The result: at iteration 5, the drone moved toward the platform. At iteration 50, it slowed down when approaching. At iteration 100, the first landing. By iteration 600, a 68% success rate.
One parameter change, completely different agent behavior. The terminal reward went from invisible to crystal clear. Always check that the effective horizon (1/(1-γ)) matches your episode length.
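Here's a quick back-of-the-envelope check you can run yourself (using this post's numbers: a +500 landing reward and roughly 150-step episodes):

# Effective horizon and the discounted value of the terminal reward, seen from step 0
for gamma in (0.90, 0.99):
    horizon = 1 / (1 - gamma)        # rough number of steps the agent "cares" about
    discounted = 500 * gamma ** 150  # landing reward as seen from the start of the episode
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, discounted landing reward ~ {discounted:.5f}")

With γ = 0.90 the landing reward is worth about 0.00007 from the drone's starting position; with γ = 0.99 it is still worth about 110. No wonder the agent stopped trying to land.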
Bug #3: Reward Exploits (The Arms Race)
At this point, I'd fixed both the moving target problem and the gamma issue. My agent was actually learning! It approached the platform, slowed down occasionally, and even landed sometimes. I was genuinely excited. Then I started watching the failures more carefully, and something weird emerged.
After fixing bugs #1 and #2, the agent discovered two new exploits:
Zoom-past: accelerate toward the platform at maximum speed, overshoot, and crash far away. Net reward: -140 (approach rewards +60, crash penalty -200). Better than crashing immediately (-300), but not landing.

Hovering: get close to the platform and vibrate in place with tiny movements (speed 0.01-0.02) to farm approach rewards indefinitely while avoiding the crash penalty.

Why This Happens: The Fundamental Problem
Here's the thing that bothered me: my reward function could only see the current state, not the trajectory.
The reward function is r(s', a): given the next state and the action I just took, compute my reward. It has no memory. It can't tell the difference between:
- A drone making real progress toward landing: approaching from above with a controlled, purposeful descent
- A drone farming the reward structure: hovering with meaningless micro-movements
Both scenarios might have:
- distance_to_platform < 0.3 (close to the target)
- velocity > 0 (technically moving)
- velocity_alignment > 0 (pointed in the right direction)
The agent isn't dumb. It's doing exactly what I told it to do: maximize the scalar rewards I'm feeding it. The problem is that the rewards don't actually encode landing; they encode proximity and motion. And proximity without landing is exploitable.
This is the core insight of reward hacking: the agent will find loopholes in your reward specification, not because it's clever, but because you under-specified the task.
The Fix: Reward State Transitions, Not Snapshots
The fix: reward based on state transitions r(s, s'), not just the current state r(s'). Instead of asking "Is distance < 0.3?", ask "Did we get closer (distance_delta > 0) AND move fast enough to mean it (speed ≥ 0.15)?"
def calc_reward(state: DroneState, prev_state: DroneState = None):
    rewards = {}  # individual reward components (landing/crash terms omitted here)
    if prev_state is not None:
        distance_delta = prev_state.distance_to_platform - state.distance_to_platform
        velocity = state.velocity
        velocity_toward_platform = calculate_alignment(state)  # cosine similarity
        MIN_MEANINGFUL_SPEED = 0.15
        if velocity >= MIN_MEANINGFUL_SPEED and velocity_toward_platform > 0.1:
            speed_multiplier = 1.0 + velocity * 2.0
            rewards['approach'] = distance_delta * 15.0 * speed_multiplier
        elif velocity < 0.05:
            rewards['hovering_penalty'] = -1.0
    return sum(rewards.values())
Key changes: (1) reward distance_delta (progress), not proximity, (2) a minimum-speed threshold blocks hovering, (3) a speed multiplier encourages decisive movement.
To use this, track prev_state in your training loop and pass it to calc_reward(next_state, prev_state), as sketched below.
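Here's a minimal sketch of what that tracking looks like in the step loop; the variable names and the state_to_tensor helper are illustrative rather than taken from the repository, and I'm assuming an environment whose step() returns the next state and a done flag:

# Hypothetical per-step loop showing prev_state tracking for transition-based rewards
prev_state = None
state = env.reset()
done = False
while not done:
    action = actor(state_to_tensor(state))        # state_to_tensor: assumed conversion helper
    next_state, done = env.step(action)
    reward = calc_reward(next_state, prev_state)  # reward sees the transition, not a snapshot
    # ... compute the TD error and update actor/critic here, as in the loop above ...
    prev_state, state = state, next_state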
90% of RL is reward engineering. The other 90% is debugging your reward engineering. Rewards are a specification of the objective, and the agent will find every loophole.
Basic Actor-Critic Results
I have to admit, when I fixed the third bug (that velocity-magnitude-weighted reward function) and launched a fresh training run with all three fixes in place, I was skeptical. I'd spent so much time chasing my tail with these algorithms that I half expected Actor-Critic to hit some new, creative failure mode I hadn't anticipated. But something surprising happened: it just… worked.
And I mean really worked. Better than REINFORCE, in fact, and noticeably so. After all those hours debugging REINFORCE's reward hacking, I was expecting Actor-Critic to at least match its performance. Instead, it blew past it.

Why This Beats REINFORCE (And Why That Matters):
Actor-Critic's online updates create a feedback loop that REINFORCE can't match. Every single step, the critic whispers in the actor's ear: "Hey, that state is good" or "That state is bad." It's not a global baseline like the one REINFORCE uses. It's a state-specific evaluation that gets better and better as the critic learns.
That's why the convergence is 2x faster. That's why the final performance is 13% better. That's why the learning curves are so clean.
And all of it hinged on three things: detaching the TD target, using the right discount factor, and tracking state transitions in the reward function. No new algorithmic tricks needed. Just a correct implementation.
What's Next: Pushing Beyond Actor-Critic
With Actor-Critic working reasonably well, you may have noticed that the policy consistently lands the drone on the left side of the platform, and that the actions are slightly jittery. To address this, I'm working on covering Proximal Policy Optimization (PPO), which is supposed to help by "making the learning process more stable". The nice thing is that this method has been used by researchers at OpenAI to train their flagship GPT models.
References
Foundational RL Papers
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Actor-Critic Methods
- Konda, V. R., & Tsitsiklis, J. N. (2000). "Actor-Critic Algorithms." SIAM Journal on Control and Optimization, 42(4), 1143-1166.
- Theoretical foundations of Actor-Critic, with convergence proofs
- Mnih, V., Badia, A. P., Mirza, M., et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning." International Conference on Machine Learning.
Temporal Difference Learning
- Sutton, R. S. (1988). "Learning to Predict by the Methods of Temporal Differences." Machine Learning, 3(1), 9-44.
- The original TD learning paper
Earlier Posts in This Series
- Jumle, V. (2025). "Deep Reinforcement Learning: 0 to 100 – Policy Gradients (REINFORCE)."
Code Repository & Implementation
- Jumle, V. (2025). "Reinforcement Learning 101: Delivery Drone Landing."
All images in this article are either AI-generated (using Gemini or Sora), personally made by me, or screenshots & plots that I made, unless specified otherwise.
