
remote-training
by Positronic-Robotics
Python-native stack for real-life ML robotics
SKILL.md
name: remote-training
description: Manages remote training infrastructure on Nebius VMs. Use for building/pushing Docker images, starting/stopping VM machines (train, train2, train3), running training jobs, dataset generation, and starting inference servers.
Remote Training Infrastructure
Overview
This skill manages the Positronic training infrastructure on Nebius GPU VMs. It covers Docker image management, VM lifecycle, training jobs, dataset generation, and inference server deployment.
Prerequisites
- Docker contexts configured for VMs: vm-train, vm-train2, vm-train3
- AWS S3 access configured for checkpoint/dataset storage
- Nebius CLI authenticated (for VM start/stop)
Available Machines
| Context | GPU | Use Case |
|---|---|---|
| desktop | RTX 3060 (12GB) | Dataset generation, GR00T inference, lerobot training |
| notebook | RTX 4060 Laptop (8GB) | Light tasks, testing, dataset generation |
| vm-train | H100 (80GB) | GR00T/OpenPI training and inference |
| vm-train2 | H100 (80GB) | GR00T/OpenPI training and inference |
| vm-train3 | H100 (80GB) | GR00T/OpenPI training and inference |
Important: Only GR00T training and OpenPI training/inference require an H100. GR00T inference, dataset generation, and lerobot jobs can run on desktop.
Docker Images
Image Overview
| Image | Source | Depends On | Used For |
|---|---|---|---|
| positro/positronic | positronic/docker/ | - | Dataset conversion, lerobot training/inference |
| positro/gr00t | positronic/docker/ | positro/gr00t-base | GR00T training and inference |
| positro/gr00t-base | gr00t/docker/ | - | Base image for GR00T |
| positro/openpi | positronic/docker/ | positro/openpi-base | OpenPI training and inference |
| positro/openpi-base | openpi/docker/ | - | Base image for OpenPI |
Build Order for Cross-Repo Changes
If you modify code in ../gr00t or ../openpi:
- For gr00t changes:
  cd /home/vertix/dev/gr00t/docker
  make push # Pushes positro/gr00t-base
  cd /home/vertix/dev/positronic/docker
  make push-groot # Rebuilds and pushes positro/gr00t with new base
- For openpi changes:
  cd /home/vertix/dev/openpi/docker
  make push # Pushes positro/openpi-base
  cd /home/vertix/dev/positronic/docker
  make push-openpi # Rebuilds and pushes positro/openpi with new base
- For positronic-only changes:
  cd /home/vertix/dev/positronic/docker
  make push-training # Just positro/positronic
  # Or for specific images:
  make push-groot # positro/gr00t
  make push-openpi # positro/openpi
  make push # All images
VM Machine Management
Start a VM
../internal/scripts/start.sh train
../internal/scripts/start.sh train2
../internal/scripts/start.sh train3
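Note: Requires Nebius CLI authentication. Run it from a terminal with browser access for the OAuth flow; for headless environments, see "Nebius Auth (Manual Flow for Headless Environments)" under Common Issues.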
Check VM Status
ssh -o ConnectTimeout=5 vertix@vm-train 'echo connected'
ssh -o ConnectTimeout=5 vertix@vm-train2 'echo connected'
ssh -o ConnectTimeout=5 vertix@vm-train3 'echo connected'
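The three checks above can be wrapped in a loop. This is a minimal sketch, assuming the same vertix@<host> SSH targets; the function name check_vms and the SSH_BIN override (added only so the loop can be exercised without real VMs) are illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which training VMs answer over SSH.
# SSH_BIN is an assumption added for testing; it defaults to the real ssh.
SSH_BIN="${SSH_BIN:-ssh}"

check_vms() {
  local host
  for host in "$@"; do
    # BatchMode avoids hanging on a password prompt;
    # ConnectTimeout=5 matches the manual commands above.
    if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 \
        "vertix@${host}" 'echo connected' >/dev/null 2>&1; then
      echo "${host}: up"
    else
      echo "${host}: unreachable"
    fi
  done
}

# Usage: check_vms vm-train vm-train2 vm-train3
```

A host reported as unreachable may simply be stopped; start it with the start.sh script before retrying.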
Docker Contexts
docker context ls # List available contexts
docker --context vm-train ps # Check containers on vm-train
docker --context vm-train2 ps # Check containers on vm-train2
Pipeline Overview
1. Data Collection (positronic-data-collection)
↓
2. Dataset Conversion (positronic-to-lerobot) [desktop]
↓
3. [OpenPI only] Generate Stats (openpi-stats) [desktop]
↓
4. Training (groot-train / openpi-train) [H100]
↓
5. Inference Server (groot-server / openpi-server) [H100 or desktop]
↓
6. Inference Client (positronic-inference) [local]
Dataset Generation
Convert Positronic Dataset to LeRobot Format
From docker/ directory (can run on desktop):
docker compose run --rm --pull always positronic-to-lerobot convert \
--dataset=@positronic.cfg.ds.internal.sim_stack_groot_ft \
--dataset.observation=.groot_rot6d_joints \
--dataset.action=.groot_rot6d \
--output_dir=s3://interim/sim_ft/groot_rot6d_q/ \
--fps=15
Observation/Action Configs
| Observation | Description |
|---|---|
| .groot | EE pose (quaternion) |
| .groot_joints | EE pose + joint positions |
| .groot_rot6d | EE pose (6D rotation) |
| .groot_rot6d_joints | 6D rotation + joint positions |
| .eepose | For OpenPI/ACT |

| Action | Description |
|---|---|
| .groot | EE delta (quaternion) |
| .groot_rot6d | EE delta (6D rotation) |
| .absolute_position | Absolute EE pose |
GR00T Training
From docker/ directory, on H100 VM:
docker --context vm-train compose run --rm --pull=always groot-train \
--input_path=s3://interim/sim_ft/groot_rot6d_q/ \
--output_path=s3://checkpoints/sim_ft/groot_rot6d_q/ \
--exp_name=YYMMDD \
--num_train_steps=20000 \
--save_steps=2000 \
--num_workers=4 \
--modality_config=ee_rot6d_q
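The YYMMDD placeholder in --exp_name can be filled from the current date. This is a small sketch, assuming YYMMDD means two-digit year, month, then day; check it against the naming of existing checkpoint directories before relying on it.

```shell
# Assumption: YYMMDD = two-digit year, month, day of the run's start date.
exp_name="$(date +%y%m%d)"
echo "exp_name=${exp_name}"
# Then pass it to the training command: --exp_name="${exp_name}"
```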
GR00T Modality Configs
| Config | Description |
|---|---|
| ee | End-effector pose (quaternion) |
| ee_q | EE pose + joint feedback |
| ee_rot6d | EE pose with 6D rotation |
| ee_rot6d_q | 6D rotation + joint feedback |
| ee_rot6d_rel | 6D rotation, relative actions |
| ee_rot6d_q_rel | 6D rotation + joints, relative actions |
OpenPI Training
From docker/ directory. Stats generation can run on desktop; training requires an H100 VM:
# 1. Generate stats (can run on desktop)
docker compose run --rm openpi-stats \
--input_path=s3://interim/my_lerobot_data \
--output_path=s3://interim/openpi_assets
# 2. Train (requires H100)
docker --context vm-train compose run --rm --pull=always openpi-train \
--input_path=s3://interim/my_lerobot_data \
--stats_path=s3://interim/openpi_assets/assets/ \
--output_path=s3://checkpoints/openpi \
--exp_name=experiment_v1
Inference Servers
GR00T Server (requires GPU)
docker compose run --rm --service-ports groot-server \
ee_rot6d_joints \
--checkpoints_dir=s3://checkpoints/sim_ft/groot_rot6d_q/040126/
Available variants: ee, ee_joints, ee_rot6d, ee_rot6d_joints, ee_rot6d_rel, ee_rot6d_joints_rel
The server exposes a WebSocket API on port 8000 (same as lerobot-server for interchangeability).
OpenPI Server (requires H100)
docker --context vm-train compose run --rm --service-ports openpi-server \
--checkpoints_dir=s3://checkpoints/openpi/pi05_positronic_lowmem/experiment_v1/
LeRobot/ACT Server (can run on desktop)
docker compose run --rm --service-ports lerobot-server \
--checkpoints_dir=s3://checkpoints/act/experiment_v1/
Inference Client
All servers (GR00T, LeRobot, OpenPI) now expose the same WebSocket API on port 8000, so the client uses the same .remote policy config.
With GUI (requires display)
uv run positronic-inference sim \
--policy=.remote \
--policy.host=desktop \
--policy.port=8000 \
--driver.show_gui
Headless (no display required)
MUJOCO_GL=egl uv run positronic-inference sim \
--policy=.remote \
--policy.host=desktop \
--policy.port=8000 \
--driver.show_gui=False \
--driver.simulation_time=10
Server Types
| Server Type | Encoder/Decoder Config | Notes |
|---|---|---|
| GR00T | --observation_encoder=.groot_rot6d_joints --action_decoder=.groot_rot6d | Matches modality_config=ee_rot6d_q |
| LeRobot ACT | --observation_encoder=.eepose --action_decoder=.absolute_position | Default configs |
| OpenPI | Uses internal encoding | No encoder/decoder args needed |
Monitoring Background Jobs
When running jobs in background:
# Check progress percentage
grep -o '[0-9]*%' /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output | tail -1
# View recent output
tail -50 /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output
# Check for completion/errors
grep -i "error\|complete\|finished" /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output
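The three checks above can be folded into one status helper. This is a sketch only: the function name task_status and its output strings are hypothetical, and the keyword matching is as crude as the grep commands it wraps (any line containing "error" counts as a failure).

```shell
# Hypothetical helper: summarize a background task's output file.
# Usage: task_status /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output
task_status() {
  local file="$1" pct
  if [ ! -f "$file" ]; then
    echo "missing"
    return 1
  fi
  # Errors take priority over completion markers.
  if grep -qi "error" "$file"; then
    echo "error"
  elif grep -qiE "complete|finished" "$file"; then
    echo "done"
  else
    # Fall back to the last progress percentage seen, if any.
    pct="$(grep -o '[0-9][0-9]*%' "$file" | tail -n 1)"
    echo "running ${pct:-unknown}"
  fi
}
```

Calling it from a loop with sleep gives a simple poller without re-reading the full log by hand.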
Common Issues
CUDA Out of Memory
Each GR00T server uses ~6GB GPU memory. On 12GB GPUs (desktop), only run one server at a time.
Port Already Allocated
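A leftover server container is still holding port 8000. Find it, then stop and remove it:
docker ps -a | grep -E "groot-server|openpi-server|lerobot-server"
docker stop <container_id> && docker rm <container_id>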
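As an alternative to grepping by service name, docker ps can filter running containers by published port. This is a sketch: the function name containers_on_port and the DOCKER_BIN override (added only so the command line can be checked without a Docker daemon) are assumptions, not part of the repository.

```shell
# Hypothetical helper: list running containers that publish a given host port.
# DOCKER_BIN is an assumption added for testing; it defaults to the real docker.
containers_on_port() {
  "${DOCKER_BIN:-docker}" ps --filter "publish=$1" --format '{{.ID}} {{.Names}}'
}

# Usage: containers_on_port 8000
# Then: docker stop <container_id> && docker rm <container_id>
```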
VM Not Reachable
- Start the VM: ../internal/scripts/start.sh train2
- Verify SSH: ssh -o ConnectTimeout=5 vertix@vm-train2 'echo connected'
Parquet Object Array Error
If dataset generation fails with "ValueError: setting an array element with a sequence", the fix is in positronic/dataset/vector.py: use np.stack() to convert object arrays to proper 2D arrays.
gladLoadGL Error (Headless)
Use MUJOCO_GL=egl environment variable for headless rendering:
MUJOCO_GL=egl uv run positronic-inference sim --driver.show_gui=False ...
Nebius Auth (Manual Flow for Headless Environments)
When running from a headless environment without browser access:
- Start nebius in background with --no-browser:
  nebius --no-browser --auth-timeout 5m iam whoami 2>&1
  Run this in background and extract the auth URL from output.
- Give the auth URL to the user - they click it and authenticate in their browser.
- User's browser redirects to localhost URL like:
  http://127.0.0.1:PORT/?code=XXX&state=YYY
  The page won't load (expected). User copies this full URL from address bar.
- Curl the localhost URL on the machine running nebius:
  curl -s "http://127.0.0.1:PORT/?code=XXX&state=YYY"
  # Returns: "Login is successful, you may close the browser tab"
- Auth completes - nebius background process finishes, credentials are cached.
After authentication, VM start scripts will work:
../internal/scripts/start.sh train
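The "extract the auth URL from output" step in the manual flow can be scripted once the nebius output is captured to a file. This is a sketch under a stated assumption: the auth URL is the first https:// link the CLI prints. The exact output format is not documented here, so verify it against a real run. The function name extract_auth_url and the log path are hypothetical.

```shell
# Hypothetical helper: pull the first https:// URL out of captured nebius output.
# Assumption: the auth URL is the first https:// link in the log.
extract_auth_url() {
  grep -oE 'https://[^[:space:]]+' "$1" | head -n 1
}

# Usage sketch (log path is illustrative):
#   nebius --no-browser --auth-timeout 5m iam whoami > /tmp/nebius-auth.log 2>&1 &
#   sleep 2; extract_auth_url /tmp/nebius-auth.log
```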


