Modern fulfillment centers such as those operated by Amazon, Walmart, and Alibaba rely on fleets of autonomous mobile robots that move items between storage shelves, packing stations, and loading docks. These robots must make real-time navigation decisions in a highly dynamic environment where other robots, forklifts, and human workers are constantly moving.
The animation below simulates a simplified version of such a system. Two autonomous robots must repeatedly travel across a warehouse floor to deliver items to a packing station while avoiding collisions with moving forklifts. The robots learn their navigation strategy through reinforcement learning, with one using the SARSA algorithm and the other using Q-Learning.
The warehouse floor is modeled as a grid of locations representing the robots' possible positions. Each square represents a state in the reinforcement learning model.
The robots begin each episode at a random starting location. Because the start varies from episode to episode, they explore more of the grid and learn a route to the packing station from many different points in the warehouse, not just from one fixed origin.
The green square represents the packing station, which is the destination.
Yellow squares represent moving forklifts, which create dynamic hazards in the environment. Forklifts move back and forth across the warehouse aisles, periodically blocking routes that would otherwise be safe. Because of this, the robots must learn not only the shortest route, but also routes that avoid high-risk areas.
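The back-and-forth forklift motion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation's actual code: the `Forklift` class, the `patrol` helper, the one-cell-per-step speed, and the idea that each forklift stays in a single row are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Forklift:
    """Hypothetical forklift that patrols one row (aisle) of the grid."""
    row: int
    col: int
    direction: int = 1  # +1 moves right, -1 moves left

    def step(self, n_cols: int) -> None:
        """Advance one cell, reversing direction at either end of the aisle."""
        if n_cols < 2:
            return  # nowhere to go in a one-cell aisle
        nxt = self.col + self.direction
        if nxt < 0 or nxt >= n_cols:
            self.direction = -self.direction
            nxt = self.col + self.direction
        self.col = nxt


def patrol(forklift: Forklift, n_cols: int, n_steps: int) -> list:
    """Return the columns the forklift occupies over n_steps time steps."""
    cols = []
    for _ in range(n_steps):
        forklift.step(n_cols)
        cols.append(forklift.col)
    return cols
```

With a four-cell aisle and a forklift starting at column 0, `patrol(Forklift(row=2, col=0), 4, 6)` visits columns 1, 2, 3, 2, 1, 0, so a cell that is safe at one time step is blocked a few steps later.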
At every time step, each robot chooses one of four actions: move right, move left, move up, or move down.
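One way to encode the grid states and the four actions is shown below. It is a sketch under assumptions: states are taken to be `(row, col)` tuples with row 0 at the top, and a move that would leave the grid is assumed to leave the robot where it is. The names `ACTIONS` and `move` are hypothetical.

```python
# Assumed convention: a state is (row, col), with row 0 at the top of the grid.
ACTIONS = {
    "right": (0, 1),
    "left": (0, -1),
    "up": (-1, 0),
    "down": (1, 0),
}


def move(state, action, n_rows, n_cols):
    """Apply an action and return the new state.

    Assumption: a move that would leave the grid keeps the robot in place.
    """
    d_row, d_col = ACTIONS[action]
    row, col = state[0] + d_row, state[1] + d_col
    if 0 <= row < n_rows and 0 <= col < n_cols:
        return (row, col)
    return state
```

For example, `move((0, 0), "right", 3, 3)` returns `(0, 1)`, while `move((0, 0), "up", 3, 3)` returns `(0, 0)` because the robot is already on the top row.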
The robots receive feedback through a reward signal:
Each movement step: −1 (time penalty)
Collision with forklift: −120 (severe penalty)
Successful delivery: +60 (goal reward)
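The reward signal above could be written as a small function. The three values come from the list; everything else is an assumption made for illustration, in particular that a collision or a delivery ends the episode and that the step penalty is not added on top of those two outcomes. The function name and arguments are hypothetical.

```python
STEP_PENALTY = -1          # each movement step
COLLISION_PENALTY = -120   # collision with a forklift
GOAL_REWARD = 60           # successful delivery


def reward(next_state, goal, forklift_cells):
    """Return (reward, done) for a robot arriving in next_state.

    Assumptions: collisions and deliveries both end the episode, and a
    forklift occupying the goal cell counts as a collision.
    """
    if next_state in forklift_cells:
        return COLLISION_PENALTY, True
    if next_state == goal:
        return GOAL_REWARD, True
    return STEP_PENALTY, False
```

Because a collision costs twice as much as a delivery earns, a route that risks a forklift aisle has to save many steps before it is worth taking.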
The arrows displayed in each cell represent the currently learned "best action" according to each algorithm's Q-table. You can use the toggle to view either the SARSA policy or the Q-Learning policy.
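Turning a Q-table into the displayed arrows amounts to picking the highest-valued action in each cell. The sketch below assumes the Q-table is a dictionary mapping `(row, col)` to a list of four values in the order right, left, up, down, and that cells with no learned values are shown as a dot; the animation's actual data layout is not given in the text.

```python
ARROWS = ["→", "←", "↑", "↓"]  # assumed action order: right, left, up, down


def policy_arrows(q_table, n_rows, n_cols):
    """Return a grid of arrows showing the greedy action in each cell."""
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            values = q_table.get((r, c), [0.0] * 4)
            if all(v == 0.0 for v in values):
                row.append("·")  # nothing learned for this cell yet
            else:
                # Ties are broken in favor of the first action in ARROWS.
                row.append(ARROWS[values.index(max(values))])
        grid.append(row)
    return grid
```

Calling this once on the SARSA table and once on the Q-Learning table gives the two views that the toggle switches between.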
As learning progresses, you can observe how the policy evolves:
Initially random or blank, before the robots have learned anything.
Gradually forming structured routes as the robots discover the goal.
Eventually stabilizing into efficient, distinct navigation paths that highlight the difference between on-policy and off-policy decision-making.
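The on-policy versus off-policy difference comes down to one term in the update rule. The sketch below shows the standard tabular SARSA and Q-Learning updates; the dictionary-of-lists Q-table and the default values of the learning rate `alpha` and discount factor `gamma` are assumptions, since the text does not state the settings used in the animation.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy: bootstrap from the action the robot will actually take next."""
    target = r if done else r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
```

SARSA's target includes the value of exploratory moves the robot really makes, so cells next to a forklift aisle tend to look worse and its routes usually keep a wider margin. Q-Learning assumes the best next action is always taken, so it tends to favor the shortest route even when that route passes close to a hazard.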