Intro
Humanoid robots are machines resembling the human body in shape and movement, designed to work alongside people and interact with our tools. They are still an emerging technology, but some industry forecasts predict billions of humanoids by 2050. Currently, the most advanced prototypes are NEO by 1X Technologies, Optimus by Tesla, Atlas by Boston Dynamics, and G1 by China’s Unitree Robotics.
There are two ways for a robot to perform a task: manual control (when you specifically program what it has to do) or Artificial Intelligence (it learns how to do things by trying). In particular, Reinforcement Learning allows a robot to learn the best actions through trial and error to achieve a goal, so it can adapt to changing environments by learning from rewards and penalties without a predefined plan.
In practice, it is crazy expensive to have a real robot learning how to perform a task. Therefore, state-of-the-art approaches learn in simulation where data generation is fast and cheap, and subsequently transfer the knowledge to the real robot (“sim-to-real” / “sim-first” approach). That enables the parallel training of multiple models in simulation environments.
The most used 3D physics simulators on the market are PyBullet (beginners), Webots (intermediate), MuJoCo (advanced), and Gazebo (professionals). You can use any of them as standalone software or through Gymnasium (the maintained fork of OpenAI’s Gym), a library for developing Reinforcement Learning algorithms, built on top of different physics engines.
In this tutorial, I’m going to show how to build a 3D simulation for a humanoid robot with Artificial Intelligence. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to full code at the end of the article).
Setup
An environment is a simulated space where agents can interact and learn to perform a task. It has a defined observation space (the information agents receive) and action spaces (the set of possible actions).
I will use Gymnasium (pip install gymnasium) to load one of the default environments made with MuJoCo (Multi-Joint dynamics with Contact, pip install mujoco).
import gymnasium as gym
env = gym.make("Humanoid-v4", render_mode="human")
obs, info = env.reset()
env.render()

The agent is a 3D bipedal robot that can move like a human. It has 12 links (solid body parts) and 17 actuated joints (the flexible connections between them). You can find the full description in the Gymnasium documentation.
Before starting a new simulation, you must reset the environment with obs, info = env.reset(). That command returns the agent’s initial observation obs and a dictionary info with auxiliary diagnostic data about the robot.

The obs is what the agent sees (i.e. through its sensors): for Humanoid-v4, it is a 376-dimensional vector of joint positions, velocities, and contact forces. An AI model must process those observations to decide what action to take.
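To get a feel for these shapes without running MuJoCo, here is a minimal sketch of a random linear “policy” mapping a Humanoid-sized observation (376 values) to an action (17 values). The weight matrix W and the fake observation are illustrative stand-ins, not part of the environment:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(17, 376))  # one row of weights per joint

def linear_policy(obs):
    # squash the result into the Humanoid action range [-0.4, 0.4]
    return np.clip(W @ obs, -0.4, 0.4)

fake_obs = rng.normal(size=376)  # stand-in for the vector env.reset() returns
action = linear_policy(fake_obs)
print(action.shape)  # (17,)
```

A real agent replaces the random W with learned parameters, but the input/output shapes stay exactly these.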

Usually, all Gym environments have the same structure. The first thing to check is the action space, the set of all the possible actions. For the Humanoid simulation, an action represents the force applied to one of its 17 joints (within a range of -0.4 and +0.4 to indicate the direction of the push).
env.action_space #Box(-0.4, 0.4, (17,), float32)

env.action_space.sample() #one random force per joint

A simulation should cover at least one episode: a complete run of the agent interacting with the environment, from start to termination. Each episode is a loop of reset() -> step() -> render(). Let’s run one single episode with the humanoid taking random actions (no AI yet).
import time

env = gym.make("Humanoid-v4", render_mode="human")
obs, info = env.reset()
reset = False #reset if the humanoid falls or the episode ends
episode = 1
total_reward, step = 0, 0

for _ in range(240):
    ## action
    step += 1
    action = env.action_space.sample() #random action
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render() #render one physics step
    time.sleep(1/240) #slow down to real time (240 steps × 1/240 second sleep = 1 second)
    if (step == 1) or (step % 100 == 0): #print the first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated:
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}") #print the last step
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()


As the episode continues and the robot moves, we receive a reward. In this case, it’s positive if the agent stays up or moves forward, and it’s a negative penalty if it falls and touches the ground. The reward is the most important concept for AI because it defines the goal. It is the feedback signal we get from the environment after every action, indicating whether that move was useful or not. Therefore, it can be used to optimize the decision-making of the robot through Reinforcement Learning.
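The per-step rewards are typically combined into a single number, the discounted return, which is what RL algorithms actually maximize. Here is a minimal sketch (the function name and gamma values are illustrative):

```python
# Discounted return: sum of rewards where the reward t steps in the
# future is weighted by gamma**t, so near-term rewards count more.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # fold from the last step backwards
        g = r + gamma * g
    return g

# three steps of reward 1 with gamma=0.5 -> 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With gamma close to 1 (the common choice), the agent is encouraged to stay up for the whole episode rather than grab a quick reward and fall.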
Reinforcement Learning
At every step of the simulation, the agent observes the current situation (i.e. its position in the environment), takes an action (i.e. moves one of its joints), and receives a positive or negative response (a reward or a penalty). This cycle repeats until the simulation ends. RL is a type of Machine Learning that trains the agent to maximize the reward through trial and error. If successful, the robot learns the best course of action.
Mathematically, RL is based on the Markov Decision Process, in which the future only depends on the present situation, and not the past. To put it in simple words, the agent doesn’t need memory of previous steps to decide what to do next. For example, a robot only needs to know its current position and velocity to choose its next move, it doesn’t need to remember how it got there.
RL is all about maximizing the reward. So, the entire art of building a simulation is designing a reward function that truly reflects what you want (here, the goal is not to fall down). The most basic RL algorithm updates its list of preferred actions after receiving a positive reward. The speed at which that happens is the learning rate: if this number is too high, the agent overcorrects, while if it’s too low, it keeps making the same mistakes and learns painfully slowly.
The preferred action updates are also impacted by the exploration rate, which is the frequency of a random choice, basically it’s the AI’s curiosity level. Usually, it’s relatively high at the beginning (when the agent knows nothing) and decays over time as the robot exploits its knowledge.
import gymnasium as gym
import time
import numpy as np

env = gym.make("Humanoid-v4", render_mode="human")
obs, info = env.reset()
reset = True #reset if the humanoid falls or the episode ends
episode = 1
total_reward, step = 0, 0
exploration_rate = 0.5 #start wild
preferred_action = np.zeros(env.action_space.shape) #knowledge to update with experience

for _ in range(1000):
    ## action
    step += 1
    exploration = np.random.normal(loc=0, scale=exploration_rate, size=env.action_space.shape) #add random noise
    action = np.clip(a=preferred_action+exploration, a_min=env.action_space.low, a_max=env.action_space.high) #keep within the valid action range
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    if reward > 0:
        preferred_action += (action-preferred_action)*0.05 #learning_rate
        exploration_rate = max(0.05, exploration_rate*0.99) #min_exploration=0.05, decay_exploration=0.99
    ## render
    env.render()
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0):
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated:
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}") #print the last step
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()


Obviously, that is way too basic for a complex environment like the Humanoid, so the agent will keep falling even if it updates the preferred actions.
Deep Reinforcement Learning
When the relationship between actions and rewards is non-linear, you need Neural Networks. Deep RL can handle high-dimensional inputs and estimate the expected future rewards of actions by leveraging the power of Deep Neural Networks.
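As a rough intuition of what such a network does, here is a tiny two-layer perceptron in NumPy mapping a Humanoid-sized observation (376 values) to an action (17 values). The layer size and random weights are purely illustrative; real Deep RL libraries train these weights by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(scale=0.1, size=(64, 376)), np.zeros(64)  # hidden layer
W2, b2 = rng.normal(scale=0.1, size=(17, 64)), np.zeros(17)   # output layer

def mlp_policy(obs):
    h = np.tanh(W1 @ obs + b1)         # the non-linearity is what a
    return 0.4 * np.tanh(W2 @ h + b2)  # linear model cannot capture

action = mlp_policy(rng.normal(size=376))  # fake observation as input
print(action.shape)  # (17,)
```

The final tanh times 0.4 keeps the outputs inside the Humanoid action range, so the network can be dropped in wherever the random policy was used before.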
In Python, the easiest way to use Deep RL algorithms is through Stable-Baselines3, a collection of the most famous models, already pre-implemented and ready to go. Please note that there is Stable Baselines (written in TensorFlow) and Stable-Baselines3 (written in PyTorch). Nowadays, everyone is using the latter.
pip install torch
pip install stable-baselines3
One of the most commonly used Deep RL algorithms is PPO (Proximal Policy Optimization), as it is simple and stable. The goal of PPO is to maximize the total expected reward while making small updates to the policy, keeping the growth steady.
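The “small updates” come from PPO’s clipped surrogate objective: the ratio between the new and old policy probabilities is clipped to [1−eps, 1+eps], capping how far a single update can move the policy. A minimal sketch of that formula (the function name and values are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(action|state) / pi_old(action|state)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)  # pessimistic (lower) bound

# a ratio of 1.5 with positive advantage gets capped at 1 + eps = 1.2
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # [1.2]
```

Taking the minimum of the clipped and unclipped terms means the policy never gains extra objective value by moving too far in one step, which is where PPO’s stability comes from.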
I shall use Stable-Baselines3 to train a PPO agent on the Gym Humanoid environment. There are a few things to keep in mind:
- We don’t need to render the env graphically, so the training can proceed at accelerated speed.
- The Gym env must be wrapped into DummyVecEnv to make it compatible with the Stable-Baselines3 vectorized format.
- Regarding the Neural Network model, PPO uses a Multi-layer Perceptron (MlpPolicy) for numeric inputs, a Convolutional NN (CnnPolicy) for images, and a combined model (MultiInputPolicy) for observations of mixed types.
- Since I’m not rendering the humanoid, I find it very useful to look at the training progress on TensorBoard, a toolkit to visualize statistics in real time (pip install tensorboard). I created a folder named “logs”, and I can just run tensorboard --logdir=logs/ on the terminal to serve the dashboard locally (http://localhost:6006/).
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

## environment
env = gym.make("Humanoid-v4") #no rendering to speed up
env = DummyVecEnv([lambda: env])

## train
print("Training START")
model = PPO(policy="MlpPolicy", env=env, verbose=0,
            learning_rate=0.005, ent_coef=0.005, #exploration
            tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=3_000_000, #about 1h
            tb_log_name="model_humanoid", log_interval=10)
print("Training DONE")

## save
model.save("model_humanoid")

After the training is complete, we can load the new model and test it in the rendered environment. Now, the agent won’t be updating the preferred actions anymore. Instead, it will use the trained model to predict the next best action given the current state.
import gymnasium as gym
import time
from stable_baselines3 import PPO

env = gym.make("Humanoid-v4", render_mode="human")
model = PPO.load(path="model_humanoid", env=env)
obs, info = env.reset()
reset = False #reset if the humanoid falls or the episode ends
episode = 1
total_reward, step = 0, 0

for _ in range(1000):
    ## action
    step += 1
    action, _ = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render()
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0): #print the first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated:
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}") #print the last step
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()

Please note that at no point in this tutorial did we explicitly program the robot to stay up. We are not controlling the agent: the robot is simply reacting to the reward function of its environment. In fact, if you train the RL model for much longer (e.g. 30 million time steps), you’ll start seeing the robot not only standing up perfectly, but also walking forward. So, when it comes to training an agent with AI, the design of the 3D world and its rules matters more than building the robot itself.
Conclusion
This article was a tutorial introducing MuJoCo and Gym, and how to create 3D simulations for Robotics. We used the Humanoid environment to learn the basics of Reinforcement Learning, and trained a Deep Neural Network to teach the robot how not to fall down. New tutorials with more advanced robots will come.
Full code for this article: GitHub
I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.
👉 Let’s Connect 👈

(All images are by the author unless otherwise noted)
