Distributed Path-finding System: Q-Learning in Bio-inspired Robotics

Overview

What is our project about?

For our Coding 3 project, our group made a quadruped robot, TOTI WAVE, and a PATH-FINDING FLOOR, the environment where TOTI WAVE lives.

The relationship between animals and the natural environment inspired our robots.

We named our quadruped robot TOTI WAVE: TOTI is short for “Tortoise Roboti”, because our robot was inspired by a tortoise that walks slowly and carefully, while WAVE, its last name, was inspired by how it moves like a wave towards its destination.

We decided to make our environment a robot, the PATH-FINDING FLOOR, because we were inspired by how animals and their habitats complement each other: they influence each other to survive and evolve.

What does it do?

Smart homes in the future will have machine learning as the software and a physical robot as the hardware.

We imagine a future where robots will assist humans in many different aspects of their lives, including the implementation of smart home systems and home robot assistants.

TOTI WAVE robot could be the physical robot interacting with physical tasks around the house, such as cleaning, preparing meals, taking care of children, etc. Because TOTI WAVE has 4 legs, it could easily avoid obstacles or climb up and down the stairs.

The PATH-FINDING FLOOR robot could be installed in the house to help the humans and home robot assistants inside it navigate. This technology could help people with tasks around the house, such as finding misplaced or forgotten items and helping guests find the shortest route to a bathroom.

As a start, we made TOTI WAVE and the PATH-FINDING FLOOR communicate with each other through light. TOTI WAVE has two light sensors that help it detect the shortest route to another room, as found by the PATH-FINDING FLOOR.

Initial Idea & Research

Imitating Nature’s Animals: Frog & Wind-up Frog Toy

Moving forward by jumping or bouncing like a frog, inspired by nature.

We wanted to create something unique that moves forward by jumping or bouncing, so we borrowed the structure of a wind-up frog, a children's toy. This toy uses the spring tension of the "wind-up" mechanism to provide the upward jumping force.

Wind-up Frog Toy Mechanism

In our efforts to replace the spring tension with a DC motor, Yilidaer and Royer conducted extensive research on the internal gear structure. They eventually decided to use a half-gear mechanism combined with a spring to generate instantaneous force.

However, when we printed our demo components using 3D printing, we found that all the parts were too rough, the gears didn't mesh properly, and there was significant resistance. It was also challenging to control the stability of its landing, as well as the distance and direction. Considering that we would need to implement complex actions like turning in the future, we decided to change our approach after completing the structural demo.

So we changed our idea.

1st Robot: QUADRUPED ROBOT Inspired by Tortoise

We decided to focus on a walking robot that has 4 legs (quadruped) mimicking a Tortoise.

Royer was inspired by a YouTube video of a quadruped robot climbing stairs. He then motivated our group to create a quadruped robot, pointing out that its movement makes it appear both cute and unique for presentations. Since each leg of the robot has 2 DOF (degrees of freedom), the quadruped robot could avoid obstacles or walk on challenging terrain. At the same time, Royer also found open-source code and models for a quadruped robot, which made us believe this approach was feasible.

At first, we thought of making the robot learn how to find the shortest path despite the challenging terrain. However, since Tesalonika's simulation of Q-learning with Arduino and LEDs worked, Ye suggested that we could use light sensors so the walking robot walks towards the shining LEDs.

2nd Robot: PATH-FINDING FLOOR Inspired by Physarum Polycephalum (Slime Mould or “The Blob”)

A team of Japanese and Hungarian scientists revealed that P. polycephalum can solve the shortest path problem.

Looking for intelligence in nature, Tesalonika found videos on YouTube about Slime Mould. Slime Mould or “The Blob” are informal names given to a single-celled organism called Physarum Polycephalum. It is bright yellow and hangs out in cool, shady, wet areas of the forest floor.

Slime Mould has demonstrated behaviours akin to both single-celled organisms and eusocial insects. For instance, research conducted by a team of Japanese and Hungarian scientists revealed that Slime Mould can solve the shortest path problem.

As shown in the graphic above, in only 26 hours, Slime Mould could re-make the design of the Tokyo Railway path that was made by the brightest minds and experienced engineers.

There is also another thing that Slime Mould could do, which is to find a way through a maze. First, we thought of adding a maze-solver feature from the Slime Mould to our robot. However, due to its complex application and our limited knowledge at the beginning of this project, we decided to re-adjust our idea.

Tesalonika suggested making an environment robot that could communicate with the walking robot to find the shortest path instead. Ye Ouyang then added the idea of making the floor light up brightly so the walking robot could detect the light with LDR sensors.

The Core Concepts of Reinforcement Learning

Source: Basic Terminologies of Reinforcement Learning

As we understand it, Reinforcement Learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximise some notion of cumulative reward.

Reinforcement learning in the context of robotics involves an autonomous agent (the robot) learning to perform tasks by interacting with its environment and receiving feedback in the form of rewards. The robot observes its current state, takes actions to affect its state, and uses a policy to decide on these actions to maximise cumulative rewards over time.

This process enables the robot to learn complex behaviours, such as navigation, manipulation, and interaction, by continuously improving its performance through trial and error, without requiring explicit programming for every possible scenario.

Here are the core concepts of reinforcement learning as we understand it:

  • Agent is the entity that makes decisions and takes actions in the environment to achieve a goal.

  • Environment is everything the agent interacts with and within which it performs actions. The environment responds to the agent's actions and presents new situations (states) and rewards.

  • State is a representation of the current situation of the agent within the environment. The state contains all the information necessary to describe the situation the agent is in.

  • Action is the set of all possible moves the agent can make at a given state. The action taken by the agent influences the subsequent state and reward.

  • Reward is a scalar feedback signal received by the agent after taking an action. The reward indicates the immediate benefit of the action taken. The goal of the agent is to maximise the cumulative reward over time.

  • Policy (𝜋) is a strategy used by the agent to determine its actions based on the current state. The policy can be deterministic or stochastic, mapping states to actions.

  • Value Function is a function that estimates the expected cumulative reward (future rewards) starting from a state (or a state-action pair):

      • State Value Function (𝑉(𝑠)) is the expected cumulative reward starting from state 𝑠 and following the policy 𝜋.

      • Action Value Function (𝑄(𝑠,𝑎)) is the expected cumulative reward starting from state 𝑠, taking action 𝑎, and then following the policy 𝜋.

  • Exploration vs. Exploitation is the dilemma the agent faces when deciding whether to explore new actions to discover their effects and rewards (exploration) or to exploit known actions that yield high rewards (exploitation).

  • Discount Factor (𝛾) is a factor between 0 and 1 that determines the importance of future rewards. A discount factor close to 0 makes the agent prioritise immediate rewards, while a discount factor close to 1 makes the agent weigh long-term rewards more heavily.
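To make the discount factor concrete, here is a small sketch (with made-up reward numbers, not values from our robot) showing how 𝛾 shifts the balance between immediate and future rewards:

```python
def discounted_return(rewards, gamma):
    # Cumulative discounted reward: r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Hypothetical episode: small rewards now, one large reward at the end.
rewards = [1, 1, 1, 10]

print(discounted_return(rewards, 0.9))  # high gamma: the final reward dominates
print(discounted_return(rewards, 0.1))  # low gamma: mostly the first reward counts
```

With 𝛾 = 0.9 the big final reward still contributes strongly to the return, while with 𝛾 = 0.1 it is almost invisible, so the agent would chase immediate rewards instead.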

There are key algorithms in reinforcement learning, divided into two types: model-based and model-free. Sub-dividing further, some algorithms fall under on-policy and others under off-policy types.

How to calculate the Q-table

Additionally, we found a very helpful tutorial that clearly explains reinforcement learning and provides step-by-step coding instructions. Following the tutorial, we reproduced the code and ran it on the computer, which allowed us to gain a deeper understanding of reinforcement learning.

Bellman Equation

‘S’ represents the current state, ‘a’ the agent's action from that state, and ‘S′’ the state resulting from the action. ‘r’ is the reward you get for taking the action, and ‘γ’ is the discount factor. So, the Q-value for state ‘S’ taking action ‘a’ is the sum of the instant reward and the discounted future reward (the value of the resulting state). The discount factor ‘γ’ determines how much importance you want to give to future rewards. Say you go to a state that is further away from the goal state, but from that state the chance of encountering a state with snakes is lower; then the future reward is higher even though the instant reward is lower.
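In symbols, the update described above can be written as:

```latex
Q(S, a) = r + \gamma \max_{a'} Q(S', a')
```

That is, the value of taking action 𝑎 in state 𝑆 equals the instant reward 𝑟 plus 𝛾 times the value of the best action available from the resulting state 𝑆′.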

Then we used this case study to learn how to calculate the Q table.

The initial Q-table would look like this (states along the rows and actions along the columns):

Q Matrix

U — up, D — down, L — left, R — right, E represents NULL, i.e., there can be no such transitions.

R Matrix

Here is the algorithmic approach for calculating the Q table:

  1. Initialise the Q-matrix with all zeros. Set a value for ‘γ’. Fill in the rewards matrix.

  2. For each episode, select a random starting state (here we restrict the starting state to state 1).

  3. Select one among all possible actions for the current state (S).

  4. Travel to the next state (S’) as a result of that action (a).

  5. Of all possible actions from the state (S’), select the one with the highest Q-value.

  6. Update the Q-table using eqn. 1.

  7. Set the next state as the current state.

  8. If the goal state is reached, end.
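The eight steps above can be sketched in Python. The 4-state reward matrix here is a toy example (state 3 is the goal), not the tutorial's actual map:

```python
import random

random.seed(0)

# Toy reward matrix: R[s][a] = -1 marks an impossible transition (the 'E'
# entries), 0 a valid move, and 100 the move into the goal state (state 3).
R = [
    [-1,  0, -1,  -1],
    [ 0, -1,  0,  -1],
    [-1,  0, -1, 100],
    [-1, -1, -1,  -1],
]
GOAL, GAMMA = 3, 0.8
Q = [[0.0] * 4 for _ in range(4)]                  # step 1: all zeros

for _ in range(200):                               # step 2: episodes
    s = 0                                          # fixed starting state
    while s != GOAL:                               # step 8: stop at the goal
        valid = [a for a in range(4) if R[s][a] != -1]
        a = random.choice(valid)                   # step 3: pick an action
        s_next = a                                 # step 4: action index == next state
        best_next = max(Q[s_next])                 # step 5: best future Q-value
        Q[s][a] = R[s][a] + GAMMA * best_next      # step 6: update via eqn. 1
        s = s_next                                 # step 7

# Read off the learned shortest path by always taking the best valid action.
path, s = [0], 0
while s != GOAL:
    s = max((a for a in range(4) if R[s][a] != -1), key=lambda a: Q[s][a])
    path.append(s)
print(path)
```

Note that step 6 here uses the full Bellman assignment with no learning rate, which matches the tutorial's deterministic setting; with noisy rewards one would blend the new estimate into the old one instead.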

Here is the code: https://github.com/Yee-Ouyang/Coding_3/tree/master/RL/treasure_maze

Model-Free Reinforcement Learning on Arduino: Q-Learning

Our team wanted to try reinforcement learning on Arduino, a microcontroller board our team members were already familiar with. However, because the Arduino has very little memory, we could only achieve this with a simple, model-free algorithm such as Q-learning.

Q-learning is an off-policy, model-free algorithm: it learns the value of the greedy policy even while exploring with random actions. The ‘Q’ in Q-learning stands for the quality of actions, i.e., how effective an action is at maximising the rewards generated through the algorithmic process.

In Q-learning, a reward matrix is used to store the earned rewards: the value at each position denotes the reward for making that state transition. For instance, a transition worth 50 is stored as the value 50 at the corresponding position.

Tesalonika found a tutorial by Gerardo Franco Delgado on LinkedIn and by Ali Demir on GitHub about doing Q-learning on Arduino. The tutorial uses small LEDs to simulate a house with some rooms. The coding tutorial shows how we could train the “brain” of the robot to find the shortest path from one room to another.

Tesalonika and Ye Ouyang found other tutorials that focus on a walking, physical robot. The first is “Obstacle Ovaider Robot” on GitHub by Zakaria Pro. The second is “Machine Learning on Arduino” by Nikodem Bartnik on YouTube. And the third is “Q-learning Crawler Robot” by Xtian on YouTube. In theory, it is possible to train the walking quadruped robot to walk towards the light emitted by the path-finding floor on Arduino using Q-learning.

To make the light shine brighter, Tesalonika and Yilidaer scaled up the small LEDs by replacing them with WS2812B LED pixels, using the tutorial from How To Mechatronics. Meanwhile, Ye and Royer worked on the coding of the quadruped robot, programming functions for four different movements (forward, backward, left, and right) following the tutorials mentioned above.

Model-Free Reinforcement Learning through Neural Networks

At the same time, based on the materials provided by the instructor in class, "Machine Learning Crawler Robot Using Reinforcement Learning, Neural Net and Q-Learning", Ye attempted to use neural networks to train the Arduino robot. Ye found that the algorithm published on the website had significant errors and tried to modify the code. Although the final code ran without obvious errors, we ultimately decided to use Q-learning for reinforcement learning. Therefore, this code was not deployed on the physical robot, and its effectiveness remains unknown.

Here is the code: https://github.com/YeeOuyang/Coding_3/blob/master/RL/Neural_Net_train/4/4.ino

Process & Coding

The Making of TOTI WAVE

TOTI WAVE: Adjusting 3D Models

After we were sure that we wanted to build the open-source robot as the physical body for our Q-learning experiments, we began building it with Rhino and 3D printing.

Yilidaer and Royer 3D printed the first version of the open-source 3D model, but in practice we found that the robot's head structure was too heavy for the SG90 motors to carry. So we used Rhino to re-model the head of the robot.

TOTI WAVE: 3D Printing and Model Assembly

During this phase, the most important thing was to ensure that each part was sized appropriately for our motors to function properly. After making several adjustments, we started combining the parts with the servos. When we encountered challenges with sizing or assembly, we addressed them by reprinting some parts on a sturdier machine.

The whole process of the early prototype.

We also spent a lot of time resetting the beginning angle of these motors so that they could directly run the same angle number for all 4 legs. After we solved the problems of the legs, we combined them again with the head so it truly looked like a robot.

TOTI WAVE: Motor Assembly, Wiring

Wiring methods and diagrams in open-source code

Since the original code was designed for a robot controlled by the ESP8266 Wi-Fi board, Ye modified the code for compatibility with the Arduino Leonardo and made further adjustments to the programming.

coding for digital input

Yilidaer and Royer set the initial angles of all motors to 90 degrees, handled the soldering of the circuits, mounted them onto the mechanical structure, and took care of the wiring.

First Generation Wiring vs. Second Generation Wiring

TOTI WAVE: Moving Functions

Then Ye wrote very basic code to control the movement of the servos by angle, checking whether the assembled hardware could move normally. However, when Ye tried to deploy it on the tortoise robot, he found the angle-based control to be quite crude. Even though Yilidaer and Royer had already set all servo angles to zero through code before mounting them on the 3D-printed parts, there were still discrepancies in the initial positions after installation. Therefore, Ye had to adjust each servo individually. Fortunately, all legs could move and perform forward walking. After testing, we found that the servo angles were inaccurate, leading to the decision to control the servo rotation by amplitude instead.

Ye revised the control logic and functions for servo motion, incorporating automatic servo calibration to make the movements smoother.

TOTI WAVE: LDR Sensors

Ye also conducted tests using LED lights and LDR sensors to observe the intensity of the light and the sensitivity of the sensors, preparing the parameters for the reinforcement learning code.

Small demo for testing

TOTI WAVE: Reinforcement Learning

Here is the code: https://github.com/Yee-Ouyang/Coding_3/tree/master/tortoise/iOlly_v1_3

This program utilizes the Q-learning reinforcement learning algorithm and real-time data (light intensity readings) to automatically adjust the robot's behaviour.

Core Functions:

  • Environmental Sensing: Uses two light-dependent resistors (LDRs) to detect the light intensity on the left and right sides.

  • Learning and Decision-Making: Records and updates the expected utility of each action in each state using the Q-table structure.

  • Action Execution: Controls the robot to perform specific actions, such as turning left, turning right, or moving straight, based on the current state and the information learned.

Environmental Sensing:

Ye uses two light-dependent resistors (LDRs) to detect the light intensity on the left and right sides.

Real-time data input

Learning and Decision-Making:

Ye set up a 3x3 Q-table corresponding to states (three states of the robot, defined by the difference in light intensity between the left and right sides) and actions (moving straight, turning left, turning right), and included three cases that invoke the physical robot's motion functions.

Setting up the Q-table

The reward is positive if the light intensity increases and negative otherwise.

Setting up the reward

Regarding parameter adjustments for decision-making, Ye set the reading range for the LDR sensors. Due to factors related to real-world deployment and the sensitivity of the sensors, a range threshold of 50 was established. Only when the readings of the two sensors differ by more than 50 is an action of 'turning left' or 'turning right' performed.

Setting up the decision making

Ye set the other parameters for Q-learning and used an ε-greedy strategy to balance exploration and exploitation, thereby optimising the robot's behavioural decision-making.

Setting up the Q-learning & ε-greedy strategy

Action Execution:

Three cases call the functions for moving straight, turning left, and turning right.

3 cases
The motion functions for 3 cases
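Putting the pieces above together, here is a desktop sketch in Python of the state, reward, and ε-greedy logic (not the actual Arduino code linked above; the function names and the exact learning parameters are our own illustrative choices, though the threshold of 50 matches the description):

```python
import random

EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9   # learning parameters (illustrative values)
THRESHOLD = 50                          # minimum LDR difference before turning
ACTIONS = ["straight", "left", "right"]
Q = [[0.0] * 3 for _ in range(3)]       # 3 states x 3 actions

def classify_state(ldr_left, ldr_right):
    # 0 = brighter on the left, 1 = roughly equal, 2 = brighter on the right
    diff = ldr_left - ldr_right
    if diff > THRESHOLD:
        return 0
    if diff < -THRESHOLD:
        return 2
    return 1

def choose_action(state):
    # epsilon-greedy: explore with probability EPSILON, otherwise exploit
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[state][a])

def reward(prev_total, new_total):
    # positive if the overall light intensity increased, negative otherwise
    return 1 if new_total > prev_total else -1

def update(state, action, r, next_state):
    # standard Q-learning update with learning rate ALPHA
    Q[state][action] += ALPHA * (r + GAMMA * max(Q[next_state]) - Q[state][action])
```

On the robot, the loop body reads both LDRs, classifies the state, chooses an action, performs the corresponding motion function, re-reads the sensors, and then calls the update with the resulting reward.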

In practice, doing reinforcement learning with Q-learning on Arduino takes longer than expected. Because the Arduino board's memory is limited and volatile, the robot cannot remember what it previously learnt, so it needs to re-learn every time we turn it on.

See the videos of the TOTI WAVE walking, here:

The Making of PATH-FINDING FLOOR

To make the “environment” robot that we named, PATH-FINDING FLOOR, Tesalonika tried and followed the tutorial by Gerardo Franco Delgado on LinkedIn and Ali Demir on Github about doing Q-learning on Arduino.

In our code, we followed different tutorials that are the same in principle. The tutorial by Ali uses 7 LEDs to represent 7 rooms (0-6). The map is depicted in the picture below:

Map or environment for our Q-learning simulation.

The objective is to find the shortest route from ROOM 0 to ROOM 6. We decided to use this map for our path-finding floor because it’s more interesting and random for the quadruped robot.

The Q-table (or Q-matrix) in this code is a 7x7 matrix, initially filled with zeros. It represents the learned values of state-action pairs, which get updated during the Q-learning process. Each element Q[state][action] in the Q-table indicates the expected utility (or quality) of taking a particular action in a particular state.

The mapping matrix defines which states can be reached from each current state. This helps in selecting the next action during the Q-learning process.

The reward matrix 𝑅 and the mapping matrix are used to define the environment, including the connections between states, obstacles (walls), and the target. The reward matrix 𝑅 specifies the reward values for transitioning from one state to another. It uses -1 for walls, 0 for valid connections, and 100 for the target state.
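As an illustration of this convention, the sketch below builds 𝑅 from a list of room connections and learns a route from room 0 to room 6. Since the map picture is not reproduced here, the connections are an assumed example, not the floor's actual layout:

```python
import random

random.seed(0)

N, TARGET, GAMMA = 7, 6, 0.8

# Assumed example doorways between rooms; the real map differs.
connections = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 6), (3, 6)]

# Reward matrix: -1 = wall, 0 = valid connection, 100 = move into the target.
R = [[-1] * N for _ in range(N)]
for a, b in connections:
    R[a][b] = R[b][a] = 0
for s in range(N):
    if R[s][TARGET] == 0:
        R[s][TARGET] = 100

Q = [[0.0] * N for _ in range(N)]
for _ in range(300):
    s = 0
    while s != TARGET:
        a = random.choice([a for a in range(N) if R[s][a] != -1])
        Q[s][a] = R[s][a] + GAMMA * max(Q[a])    # next state == action index
        s = a

# The rooms on the greedy path are the LEDs the floor would light up.
path, s = [0], 0
while s != TARGET:
    s = max((a for a in range(N) if R[s][a] != -1), key=lambda a: Q[s][a])
    path.append(s)
print(path)
```

On the floor itself, each room index maps to one LED (later one WS2812B strip), so lighting the learned path is just iterating over `path` and switching those pixels on.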

To see the full code please click here: https://github.com/tesawibisono/arduino-q-learning/blob/main/sketch_path_finding_floor/sketch_path_finding_floor.ino

See the videos of the PATH-FINDING FLOOR here:

Scaling Up PATH-FINDING FLOOR

Ye Ouyang suggested the idea of making the floor light up brightly so the walking robot could detect the light with LDR sensors. To scale up the PATH-FINDING FLOOR, Tesalonika changed the small LEDs to brighter WS2812B NeoPixel LED strips that can be programmed individually on Arduino. This also meant that we needed to modify and adjust the Q-learning code.

Ye helped find the right library for the WS2812B NeoPixel LED strips, and Tesalonika then integrated it with the Q-learning code.

Meanwhile, Yilidaer and Royer helped with the base for a bigger path-finding floor. Yilidaer and Royer used an A2 board and worked on the measurements. Then Tesalonika added thick cardboard to add height and a stronger structure for the quadruped robot to walk on. We used 4 LEDs in one strip to make each tile of the floor shine brighter, so the robot can detect it.

The Q-learning works, but because of the hardware changes, some LEDs in some of the strips do not light up the way they are supposed to. Fortunately, the path-finding floor is still able to function, with some strips sometimes having only 1 or 2 LEDs lighting up.

Reflections

From Simulation to Physical Friction

Although the algorithm is logically perfect, in reality, the friction of the servo motors, the distribution of the center of gravity, and the undulations of the robot during movement all become "noise." I realized that the core challenge of robot development lies not in writing the algorithm, but in how to use code to accommodate and compensate for physical uncertainties.

Efficiency vs. Biomimicry

When initially choosing the technical approach, I opted for a quadruped robot to mimic the cautious, undulating rhythm of a tortoise, sacrificing some movement speed and stability. This resulted in larger fluctuations in sensor values during later movements, making reinforcement learning more challenging.

Optimization Under Extreme Hardware Constraints

Deploying a Reinforcement Learning algorithm on an Arduino Leonardo was an "extreme challenge" due to its severely limited memory. A major technical hurdle was the lack of data telemetry: since the system lacked Bluetooth/Wi-Fi modules and all available Serial Ports were occupied, I could not stream real-time LDR sensor values or algorithm performance metrics back to a monitor. This created a "black box" debugging environment, where I had to rely solely on observing physical robotic behaviors to infer and iterate on the underlying code logic. If given more time, I would implement a dedicated telemetry system or utilize a board with multiple hardware serials to enhance data visibility and algorithmic transparency.

date published

8 June 2023

reading time

15 min

my role

Lead Interaction Developer & System Architect

main tasks

1. Algorithmic Implementation 2. Gait Optimization & Motion Control 3. Distributed Sensor Interaction

.say hello

Open to full-time, freelance, or collaborations. Let’s connect!
