Building an Adaptive Routing Agent with Reinforcement Learning and PyTorch

Reinforcement Learning looks deceptively simple when you first encounter it.

An agent takes actions, receives rewards, and eventually learns a policy. At least that is the theory.

In practice, RL systems are unstable, highly sensitive to reward design, and often difficult to generalize beyond their training environments.

I wanted to explore those challenges more deeply by building a small but research-oriented project:

An Adaptive Routing Agent trained with Deep Q-Networks (DQN) in a custom Gridworld environment.

The objective was not just to make the agent “solve the maze.”

The real goal was to study:

learning behavior,
reward shaping,
convergence stability,
and generalization across environments.

This project became a surprisingly good demonstration of how reinforcement learning intersects with optimization and decision-making systems.

Project Goal

The project trains a reinforcement learning agent to navigate a routing environment with:

obstacles,
movement costs,
penalties,
and dynamic layouts.

The agent learns policies through trial and error while optimizing cumulative reward.

Instead of focusing only on success rate, the project analyzes:

convergence speed,
reward sensitivity,
variance across runs,
and robustness to unseen layouts.

That evaluation mindset turned out to be much more valuable than simply achieving a working policy.

Why Routing Problems Matter

Routing appears everywhere:

delivery systems,
robotics,
warehouse automation,
traffic systems,
autonomous navigation,
and supply chain optimization.

Traditional optimization approaches often rely on:

heuristics,
graph search,
or mathematical programming.

Reinforcement learning introduces another perspective:

Can an agent learn routing behavior directly from interaction?

That question makes RL particularly interesting for adaptive or uncertain environments.

Choosing the Environment

I used a custom Gridworld environment built with Gymnasium.

The setup is intentionally simple:

an agent,
a goal location,
obstacles,
and movement costs.

The agent can move:

up,
down,
left,
or right.

The environment provides immediate feedback through rewards and penalties.

This simplicity makes it easier to study RL behavior without unnecessary complexity.
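To make the setup concrete, here is a minimal sketch of such a grid environment in plain Python. It is not the project's `env.py`: the class name `GridWorld`, its fields, and the action ordering are assumptions made for illustration. It only mirrors the return convention of Gymnasium's `reset` and `step` (observation, reward, terminated, truncated, info) without subclassing `gymnasium.Env`. The default reward values are the initial scheme described in the next section, and the rule that an agent stays in place after hitting an obstacle is also an assumption.

```python
from dataclasses import dataclass

# Action index -> (dx, dy), with y growing downward. Ordering is arbitrary.
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right


@dataclass
class GridWorld:
    size: int = 5
    start: tuple = (0, 0)
    goal: tuple = (4, 4)
    obstacles: frozenset = frozenset()
    max_steps: int = 50
    goal_reward: float = 10.0
    step_cost: float = -1.0
    obstacle_penalty: float = -5.0

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._obs(), {}

    def _obs(self):
        return [self.pos[0], self.pos[1], self.goal[0], self.goal[1]]

    def step(self, action):
        dx, dy = MOVES[action]
        # Clamp to the grid so a move into the border leaves the agent in place.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.steps += 1
        if (x, y) in self.obstacles:
            # Assumption: the agent is penalized and does not move.
            reward = self.obstacle_penalty
        else:
            self.pos = (x, y)
            reward = self.goal_reward if self.pos == self.goal else self.step_cost
        terminated = self.pos == self.goal
        truncated = self.steps >= self.max_steps and not terminated
        return self._obs(), reward, terminated, truncated, {}
```

Keeping the reward values as constructor arguments makes it easy to swap reward schemes between experiments without touching the transition logic.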

Reward Design

One of the most important parts of reinforcement learning is reward design, often loosely called reward shaping.

The initial reward scheme looked like this:

+10  -> reaching the goal
-1   -> each movement step
-5   -> hitting obstacles

At first glance, this seems straightforward.

But even small reward modifications dramatically changed learning behavior.

For example:

increasing movement penalties encouraged shorter routes,
large obstacle penalties caused overly conservative behavior,
sparse rewards slowed convergence,
dense rewards sometimes produced unintended policies.

This project reinforced an important RL lesson:

Reward functions define behavior more than algorithms do.
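A small worked example shows why the movement penalty matters. The sketch below compares the discounted return of a direct 8-step route with a 14-step detour under different step costs. The path lengths, the discount factor of 0.99, and the assumption that the final move earns only the goal reward are illustrative choices, not values from the project.

```python
def path_return(n_steps, step_cost=-1.0, goal_reward=10.0, gamma=0.99):
    """Discounted return of a path that reaches the goal on its n-th move.

    The first n_steps - 1 moves each earn step_cost; the final move earns
    goal_reward instead of the step cost (an assumption of this sketch).
    """
    total = 0.0
    for t in range(n_steps):
        reward = goal_reward if t == n_steps - 1 else step_cost
        total += (gamma ** t) * reward
    return total


for cost in (0.0, -0.1, -1.0):
    direct, detour = path_return(8, cost), path_return(14, cost)
    print(f"step_cost={cost:+.1f}  direct={direct:6.2f}  "
          f"detour={detour:6.2f}  gap={direct - detour:5.2f}")
```

With no step cost, the only pressure toward short routes comes from discounting, so the gap between the two paths is small. As the step cost grows, the gap widens and the shorter route becomes much more attractive.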

Building the Agent with DQN

The agent was implemented using Deep Q-Networks (DQN) in PyTorch.

DQN combines:

Q-learning,
neural networks,
and experience replay.

Instead of storing a Q-table with one entry per state-action pair, the network takes a state as input and outputs an estimated Q-value for each action.
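For reference, the standard DQN training objective can be written as below. Here θ denotes the online network's parameters, θ⁻ the parameters of a periodically synced target network, γ the discount factor, d a flag that is 1 when the episode terminated, and 𝒟 the replay buffer. The target network is part of the usual DQN recipe, but the post does not say whether this project used one, so treat θ⁻ as an assumption.

```latex
y = r + \gamma \,(1 - d)\, \max_{a'} Q(s', a'; \theta^{-}),
\qquad
L(\theta) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}
\left[ \big( y - Q(s, a; \theta) \big)^{2} \right]
```

The target y is treated as a constant when differentiating, so gradients flow only through Q(s, a; θ).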

The workflow looked like this:

Observe state
Select action
Receive reward
Store experience
Sample replay batch
Update neural network
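The steps above can be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the project's `agent.py`: the layer widths, the hyperparameters, the use of a separate target network, the mean squared error loss, and the `(state, action, reward, next_state, done)` tuple format are all choices made for this example.

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_ACTIONS, STATE_DIM = 4, 4  # four moves; [agent_x, agent_y, goal_x, goal_y]


def make_q_net():
    # Layer widths are illustrative, not taken from the project.
    return nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, N_ACTIONS),
    )


def select_action(q_net, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        q_values = q_net(torch.tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())


def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient step on a random replay batch; returns the loss value."""
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions).squeeze(1)
    with torch.no_grad():  # no gradient flows through the target
        next_max = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_max
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a training loop, the buffer would typically be a `deque(maxlen=...)` that receives one tuple per environment step, with `dqn_update` called once the buffer holds at least `batch_size` transitions and the target network refreshed from the online network every fixed number of steps.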

Even though DQN is considered a foundational RL algorithm today, it still demonstrates many real RL challenges:

instability,
variance,
sensitivity to hyperparameters,
and inconsistent convergence.

Defining the State Space

The state representation included:

agent position,
goal position,
and obstacle layout information.

A simplified example:

state = [agent_x, agent_y, goal_x, goal_y]
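As an illustration, a helper like the one below could build that vector. The function name `encode_state`, the scaling of coordinates to the range 0 to 1, and the optional four "blocked neighbor" flags (one simple way to include obstacle information) are assumptions for this sketch rather than the project's actual encoding.

```python
def encode_state(agent, goal, size, obstacles=frozenset(), with_neighbors=False):
    """Return a flat list of floats describing the current situation.

    Coordinates are divided by (size - 1) so every entry lies in [0, 1];
    this scaling is an assumption, not something the post specifies.
    """
    scale = max(size - 1, 1)
    state = [agent[0] / scale, agent[1] / scale, goal[0] / scale, goal[1] / scale]
    if with_neighbors:
        # One flag per move (up, down, left, right): 1.0 if that cell is
        # an obstacle or lies outside the grid, else 0.0.
        for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            x, y = agent[0] + dx, agent[1] + dy
            off_grid = not (0 <= x < size and 0 <= y < size)
            state.append(1.0 if off_grid or (x, y) in obstacles else 0.0)
    return state
```

The four-element version is blind to obstacles, so an agent using it can only learn a single fixed layout. The eight-element version gives the network a local view that at least has a chance of carrying over to other layouts.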

More advanced representations could include:

local neighborhood encoding,
obstacle maps,
or graph-based states.

Keeping the state compact made training easier while still allowing meaningful experiments.

Tracking Learning Metrics

One thing I wanted to avoid was evaluating the agent using only “success” or “failure.”

Instead, the project tracked several metrics:

Episode Reward

Measures cumulative reward across episodes.

Convergence Speed

Tracks how quickly the policy stabilizes.

Variance Across Random Seeds

RL results can vary dramatically depending on initialization.

Failure Rate

Important for identifying unstable policies.

These metrics exposed behavior that would otherwise remain hidden behind average reward numbers.

Experimenting with Reward Variants

The most interesting part of the project was running controlled experiments.

I compared:

Reward Scheme A vs Reward Scheme B
Different learning rates
Different exploration settings
New unseen environment layouts

This revealed how fragile reinforcement learning systems can be.

Sometimes:

higher rewards produced worse navigation policies,
faster convergence led to poorer generalization,
and seemingly minor parameter changes destabilized training completely.

That instability is one of the defining characteristics of practical RL.
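Comparisons like these only mean something when each configuration is run with several random seeds and summarized consistently. The sketch below computes the kinds of metrics described earlier from per-seed reward logs. The function names, the moving-average definition of convergence, and the definition of failure rate as the fraction of seeds that never reach the threshold are assumptions made for illustration.

```python
from statistics import mean, pstdev


def moving_average(values, window):
    """Trailing moving average; the first entry covers values[0:window]."""
    return [mean(values[i - window + 1:i + 1]) for i in range(window - 1, len(values))]


def convergence_episode(rewards, threshold, window=20):
    """First episode at which the moving-average reward reaches threshold, or None."""
    for i, avg in enumerate(moving_average(rewards, window)):
        if avg >= threshold:
            return i + window - 1
    return None


def summarize(runs, threshold, window=20, tail=50):
    """Summarize runs, a dict mapping seed -> list of per-episode rewards."""
    finals = [mean(r[-tail:]) for r in runs.values()]
    conv = {seed: convergence_episode(r, threshold, window) for seed, r in runs.items()}
    return {
        "mean_final_reward": mean(finals),
        "std_across_seeds": pstdev(finals),
        # Assumption: a seed counts as a failure if it never converges.
        "failure_rate": sum(c is None for c in conv.values()) / len(runs),
        "convergence_episode": conv,
    }
```

Reporting the spread across seeds next to the mean makes it much harder to mistake one lucky run for a real improvement.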

Visualizing Agent Behavior

Visualization made the experiments much easier to interpret.

The project included:

training curves,
policy heatmaps,
and failure case analysis.

Training curves showed whether learning stabilized over time.

Policy heatmaps revealed:

preferred movement regions,
obstacle avoidance behavior,
and inefficient routing tendencies.

Failure analysis was especially useful because it highlighted:

local minima,
repetitive loops,
and exploration failures.
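Two of these diagnostics are easy to approximate from logged trajectories. The sketch below assumes each trajectory is a list of `(x, y)` cells visited during one episode. The function names and the "visited at least three times" loop heuristic are illustrative choices, and the visit counts are only a rough stand-in for a policy heatmap.

```python
from collections import Counter


def visit_counts(trajectories, size):
    """Count how often each cell is visited; rows are y, columns are x."""
    counts = Counter(cell for traj in trajectories for cell in traj)
    return [[counts[(x, y)] for x in range(size)] for y in range(size)]


def find_loop(trajectory, min_repeats=3):
    """Return the first cell visited at least min_repeats times, else None."""
    seen = Counter()
    for cell in trajectory:
        seen[cell] += 1
        if seen[cell] >= min_repeats:
            return cell
    return None
```

The grid returned by `visit_counts` can be passed to a plotting call such as Matplotlib's `imshow` to get a heatmap, while `find_loop` flags episodes where the agent keeps returning to the same cell.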

Generalization Challenges

One major experiment involved testing agents on unseen layouts.

An agent trained on one environment often struggled in slightly modified environments.

For example:

moving obstacles,
changing goal positions,
or introducing new map structures

could significantly reduce performance.

This demonstrates a broader issue in reinforcement learning:

Agents often overfit to their training environments instead of learning transferable behavior.

Generalization remains one of the biggest open problems in RL research.
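A generalization test is only meaningful if the evaluation layouts really are unseen. One way to guarantee that, sketched below, is to sample training and test layouts with different seeds, reject duplicates, and keep only layouts where the goal is reachable. The function names, the grid size, the obstacle count, and the fixed start and goal corners are assumptions for this example.

```python
import random
from collections import deque


def is_solvable(size, obstacles, start, goal):
    """Breadth-first search: True if the goal is reachable from start."""
    frontier, seen = deque([start]), {start}
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == goal:
            return True
        for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in obstacles and nxt not in seen):
                seen.add(nxt)
                frontier.append(nxt)
    return False


def sample_layouts(n, size=5, n_obstacles=4, seed=0, exclude=()):
    """Draw n distinct solvable obstacle layouts, skipping any in exclude.

    Assumes n is far below the number of possible layouts; otherwise the
    rejection loop may not terminate.
    """
    rng = random.Random(seed)
    start, goal = (0, 0), (size - 1, size - 1)
    free = [(x, y) for x in range(size) for y in range(size)
            if (x, y) not in (start, goal)]
    layouts, banned = [], set(exclude)
    while len(layouts) < n:
        obstacles = frozenset(rng.sample(free, n_obstacles))
        if obstacles not in banned and is_solvable(size, obstacles, start, goal):
            layouts.append(obstacles)
            banned.add(obstacles)
    return layouts


train_layouts = sample_layouts(20, seed=0)
test_layouts = sample_layouts(10, seed=1, exclude=train_layouts)  # disjoint from training
```

Comparing success rates on `train_layouts` and `test_layouts` then gives a direct measure of how much the policy depends on the specific maps it was trained on.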

Project Structure

The repository was intentionally organized into modular components:

rl-routing-agent/
│
├── env.py
├── agent.py
├── train.py
├── eval.py
├── plots.py
├── requirements.txt
└── README.md

`env.py`

Defines the Gridworld environment and reward logic.

`agent.py`

Contains the DQN implementation and neural network.

`train.py`

Handles training loops and experiment execution.

`eval.py`

Runs evaluation experiments and computes metrics.

`plots.py`

Generates visualizations and training curves.

Reinforcement Learning vs Optimization

One of the most interesting aspects of this project was the connection between RL and classical optimization.

Routing problems are traditionally solved using:

shortest path algorithms,
mixed integer programming,
heuristics,
or metaheuristics.

Reinforcement learning approaches the problem differently:

learning through interaction,
adapting dynamically,
and optimizing long-term reward.

However, RL introduces tradeoffs:

weaker guarantees,
instability,
expensive training,
and poor sample efficiency.

This project helped clarify where RL is powerful and where classical optimization still dominates.

What I Learned

This project taught me that reinforcement learning is far less predictable than most tutorials suggest.

A working RL demo can hide:

unstable learning,
reward exploitation,
or poor generalization.

The most valuable insight was realizing that RL evaluation matters just as much as RL training.

Tracking:

variance,
convergence,
and robustness

often reveals more than average reward alone.

It also strengthened my understanding of:

experimental design,
optimization tradeoffs,
and ML system evaluation.

Future Improvements

Several extensions could make the project more advanced:

Multi-Agent Routing

Introduce cooperative or competing agents.

Curriculum Learning

Gradually increase environment difficulty.

Classical Optimization Baselines

Compare RL against:

A* search,
genetic algorithms,
or heuristic routing methods.
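As a starting point for such a baseline, here is a compact A* sketch for a 4-connected grid with unit movement cost, which matches a uniform per-step penalty. The function signature and the grid representation (a set of blocked `(x, y)` cells) are assumptions for this example.

```python
import heapq


def a_star(size, obstacles, start, goal):
    """Return a shortest 4-connected path as a list of cells, or None."""
    def h(cell):
        # Manhattan distance never overestimates on a unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (cost + heuristic, cost, cell)
    came_from, best_cost = {start: None}, {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was found later
        for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in obstacles:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None
```

The length of the returned path gives the optimal number of moves for a layout, so the agent's route can be scored as a ratio against it rather than judged only on whether it reached the goal.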

Dynamic Environments

Add moving obstacles or stochastic rewards.

Graph Neural Networks

Represent routing environments as graphs instead of grids.

Final Thoughts

Reinforcement learning is often presented as a breakthrough technology capable of solving complex decision-making problems automatically.

The reality is more nuanced.

RL systems are:

fragile,
highly sensitive,
difficult to stabilize,
and challenging to generalize.

But that complexity is exactly what makes them fascinating.

This project was less about building a perfect routing agent and more about understanding how learning systems behave under uncertainty, constraints, and imperfect reward structures.

And honestly, the failure cases turned out to be more educational than the successful ones.

Key Concepts Covered

Reinforcement Learning
Deep Q-Networks (DQN)
Reward shaping
Gridworld environments
Routing optimization
Learning stability
Generalization in RL
PyTorch
Gymnasium
Policy evaluation
RL experimentation
AI decision systems

Here’s a link to this project: https://github.com/ishkhan97/Adaptive-Routing-Agent-Reinforcement-Learning