sidebar_position: 13 title: "RL Training" description: "Reinforcement learning on agent behaviors with Tinker-Atropos — environment discovery, training, and evaluation" lang: ru

RL Training

Hermes Agent includes an integrated RL (Reinforcement Learning) training pipeline built on Tinker-Atropos. It lets you train language models on environment-specific tasks using GRPO (Group Relative Policy Optimization) with LoRA adapters, orchestrated entirely through the agent's tool interface.

Overview

The RL training system consists of three components:

Atropos: a trajectory API server that coordinates environment interactions, manages rollout groups, and computes advantages
Tinker: a training service that handles model weights, LoRA training, sampling/inference, and optimizer steps
Environments: Python classes that define tasks, scoring, and reward functions (e.g., GSM8K math problems)

The agent can discover environments, configure training parameters, launch training runs, and monitor metrics, all through a set of rl_* tools.
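To make the "computes advantages" step more concrete, here is a minimal sketch of group-relative advantages in the GRPO style: each completion's reward is normalized against the other completions sampled for the same prompt. The function name, the use of the population standard deviation, and the epsilon guard are illustrative assumptions, not the Atropos implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one rollout group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (1.0), two incorrect (0.0)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. A group where every completion scores the same yields all zeros, so it contributes no learning signal.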

Requirements

RL training requires:

Python >= 3.11 (required by the Tinker package)
TINKER_API_KEY: API key for the Tinker training service
WANDB_API_KEY: API key for Weights & Biases metrics tracking
The tinker-atropos submodule (at tinker-atropos/ relative to the Hermes root)

# Set up API keys
hermes config set TINKER_API_KEY your-tinker-key
hermes config set WANDB_API_KEY your-wandb-key

If both keys are present and Python >= 3.11 is available, the rl toolset is enabled automatically.
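The enablement rule can be sketched as a small check. This is an illustration of the condition described here, not the Hermes source. `rl_toolset_enabled` and `REQUIRED_KEYS` are hypothetical names, and the sketch reads the keys from a plain mapping instead of the Hermes config store.

```python
import sys
from collections.abc import Mapping

REQUIRED_KEYS = ("TINKER_API_KEY", "WANDB_API_KEY")


def rl_toolset_enabled(
    env: Mapping[str, str],
    version: tuple[int, int] = sys.version_info[:2],
) -> bool:
    """Return True when both API keys are set and Python is at least 3.11."""
    has_keys = all(env.get(key) for key in REQUIRED_KEYS)
    return has_keys and version >= (3, 11)
```

Passing the version as a parameter keeps the check easy to exercise without switching interpreters.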

Available Tools

Tool	Description
`rl_list_environments`	Discover available RL environments
`rl_select_environment`	Select an environment and load its config
`rl_get_current_config`	View configurable and locked fields
`rl_edit_config`	Modify configurable training parameters
`rl_start_training`	Start a training run (spawns 3 processes)
`rl_check_status`	Monitor training progress and WandB metrics
`rl_stop_training`	Stop a running training job
`rl_get_results`	Get final metrics and model weights
`rl_list_runs`	List all active and completed runs
`rl_test_inference`	Quick inference test using OpenRouter

Workflow

1. Discover Environments

List the available RL environments

The agent calls rl_list_environments(), which scans tinker-atropos/tinker_atropos/environments/ using AST parsing to find Python classes that inherit from BaseEnv. Each environment defines:

Dataset loading: where the training data comes from (e.g., HuggingFace datasets)
Prompt construction: how to format items for the model
Scoring/verification: how to evaluate model outputs and assign rewards
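The AST-based discovery can be sketched with the standard library `ast` module. This is a simplified illustration that works on a source string and matches only direct bases by name. `find_env_classes` is a hypothetical helper, not the function the toolset uses.

```python
import ast


def find_env_classes(source: str, base_name: str = "BaseEnv") -> list[str]:
    """Return names of classes in `source` that list `base_name` as a direct base."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for base in node.bases:
            # Handle both `BaseEnv` and `some_module.BaseEnv` spellings
            name = base.id if isinstance(base, ast.Name) else getattr(base, "attr", None)
            if name == base_name:
                found.append(node.name)
    return found


example = '''
class GSM8KEnv(BaseEnv):
    pass

class Helper:
    pass
'''
env_classes = find_env_classes(example)
```

Parsing instead of importing means a file can be inspected without executing it or installing its dependencies.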

2. Select and Configure

Select the GSM8K environment and show me the configuration

The agent calls rl_select_environment("gsm8k_tinker"), then rl_get_current_config() to view all parameters.

Configuration fields fall into two categories:

Configurable fields (can be modified):
group_size: Number of completions per item (default: 16)
batch_size: Training batch size (default: 128)
wandb_name: WandB run name (auto-set to {env}-{timestamp})
Other environment-specific parameters

Locked fields (infrastructure settings, cannot be changed):
tokenizer_name: Model tokenizer (e.g., Qwen/Qwen3-8B)
rollout_server_url: Atropos API URL (http://localhost:8000)
max_token_length: Maximum token length (8192)
max_num_workers: Maximum parallel workers (2048)
total_steps: Total training steps (2500)
lora_rank: LoRA adapter rank (32)
learning_rate: Learning rate
max_token_trainer_length: Max tokens for trainer (9000)
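The split between configurable and locked fields suggests a guard like the one below. It is a sketch under assumptions: `apply_edit` and `LOCKED_FIELDS` are hypothetical names, and the real rl_edit_config may validate types or report errors differently.

```python
LOCKED_FIELDS = {
    "tokenizer_name",
    "rollout_server_url",
    "max_token_length",
    "max_num_workers",
    "total_steps",
    "lora_rank",
    "learning_rate",
    "max_token_trainer_length",
}


def apply_edit(config: dict, field: str, value) -> dict:
    """Return a copy of `config` with `field` updated, refusing locked fields."""
    if field in LOCKED_FIELDS:
        raise ValueError(f"{field} is locked and cannot be changed")
    updated = dict(config)  # copy so the caller's config is left untouched
    updated[field] = value
    return updated
```

For example, `apply_edit({"group_size": 16}, "group_size", 8)` returns `{"group_size": 8}`, while an attempt to change `lora_rank` raises `ValueError`.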

3. Start Training

Start the training run

The agent calls rl_start_training() which:

Generates a YAML config file merging locked settings with configurable overrides
Creates a unique run ID
Spawns three processes:
Atropos API server (run-api): trajectory coordination
Tinker trainer (launch_training.py): LoRA training + FastAPI inference server on port 8001
Environment (environment.py serve): the selected environment connecting to Atropos

The processes start with staggered delays (5s for API, 30s for trainer, 90s more for environment) to ensure proper initialization order.
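The first two steps (merging settings and creating a run ID) can be sketched as follows. Writing the YAML file and spawning the processes are left out. `build_run_config` is a hypothetical helper, and the 8-character hex ID is an assumption based on the example ID abc12345 used in this guide.

```python
import uuid


def build_run_config(locked: dict, overrides: dict) -> tuple[str, dict]:
    """Merge overrides with locked settings and create a run ID (illustrative)."""
    # Apply locked values last so they always win over user overrides
    merged = {**overrides, **locked}
    run_id = uuid.uuid4().hex[:8]  # assumption: short hex ID
    return run_id, merged


locked = {"total_steps": 2500, "lora_rank": 32}
overrides = {"group_size": 8, "lora_rank": 64}  # the lora_rank override is ignored
run_id, config = build_run_config(locked, overrides)
```

Applying the locked values last is one simple way to guarantee that an override can never leak into an infrastructure setting.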

4. Monitor Progress

Check the status of training run abc12345

The agent calls rl_check_status(run_id) which reports:

Process status (running/exited for each of the 3 processes)
Running time
WandB metrics (step, reward mean, percent correct, eval accuracy)
Log file locations for debugging

📝 Note

Rate Limiting: Status checks are rate-limited to once every **30 minutes** per run ID. This prevents excessive polling during long-running training jobs that take hours.
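A per-run-ID limiter of this kind can be sketched in a few lines. The class name and the injectable clock are assumptions made for illustration. Only the 30-minute interval comes from the note above.

```python
import time

MIN_INTERVAL_SECONDS = 30 * 60  # 30 minutes between checks for the same run


class StatusRateLimiter:
    """Allow at most one status check per run ID per interval (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_check: dict[str, float] = {}

    def allow(self, run_id: str) -> bool:
        now = self._clock()
        last = self._last_check.get(run_id)
        if last is not None and now - last < MIN_INTERVAL_SECONDS:
            return False  # too soon since the last allowed check
        self._last_check[run_id] = now
        return True
```

Because the limit is tracked per run ID, checking one run does not block a status check on another.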

5. Stop or Get Results

Stop the training run
# or
Get the final results for run abc12345

rl_stop_training() terminates all three processes in reverse order (environment → trainer → API). rl_get_results() retrieves final WandB metrics and training history.
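The reverse-order shutdown can be sketched like this. The dictionary keys and the `stop_all` helper are hypothetical. The only requirement on each process object is a `terminate()` method, which `subprocess.Popen` provides.

```python
START_ORDER = ["api", "trainer", "environment"]


def stop_all(processes: dict) -> list[str]:
    """Terminate processes in reverse start order and return the order used."""
    stopped = []
    for name in reversed(START_ORDER):
        proc = processes.get(name)
        if proc is not None:
            proc.terminate()  # e.g. subprocess.Popen.terminate()
            stopped.append(name)
    return stopped
```

Stopping the environment first means it is not left talking to an API server or trainer that has already gone away.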

Inference Testing

Before committing to a full training run, you can test if an environment works correctly using rl_test_inference. This runs a few steps of inference and scoring using OpenRouter — no Tinker API needed, just an OPENROUTER_API_KEY.

Test the selected environment with inference

Default configuration:
3 steps × 16 completions = 48 rollouts per model
Tests 3 models at different scales for robustness:
qwen/qwen3-8b (small)
z-ai/glm-4.7-flash (medium)
minimax/minimax-m2.7 (large)
Total: ~144 rollouts

This validates:
Environment loads correctly
Prompt construction works
Inference response parsing is robust across model scales
Verifier/scoring logic produces valid rewards
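The rollout counts for the default configuration follow from simple multiplication, shown here as a quick check. The variable names are illustrative.

```python
steps = 3        # inference steps per model
group_size = 16  # completions per step
num_models = 3   # models tested

rollouts_per_model = steps * group_size          # 48
total_rollouts = rollouts_per_model * num_models  # 144
```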

Tinker API Integration

The trainer uses the Tinker API for model training operations:

ServiceClient — Creates training and sampling clients
Training client — Handles forward-backward passes with importance sampling loss, optimizer steps (Adam), and weight checkpointing
Sampling client — Provides inference using the latest trained weights

The training loop:
1. Fetches a batch of rollouts from Atropos (prompt + completions + scores)
2. Converts to Tinker Datum objects with padded logprobs and advantages
3. Runs forward-backward pass with importance sampling loss
4. Takes an optimizer step (Adam: lr=4e-5, β1=0.9, β2=0.95)
5. Saves weights and creates a new sampling client for next-step inference
6. Logs metrics to WandB
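To illustrate the loss used in the forward-backward pass, here is a plain-Python sketch of a common form of importance sampling loss: each token's advantage is weighted by the probability ratio between the model being trained and the policy that sampled the rollout. It does not call the Tinker API. The function name, the argument layout, and the sum reduction are assumptions.

```python
import math


def importance_sampling_loss(
    sampling_logprobs: list[float],
    training_logprobs: list[float],
    advantages: list[float],
) -> float:
    """Negative sum of (probability ratio * advantage) over tokens (illustrative)."""
    total = 0.0
    for lp_sample, lp_train, adv in zip(sampling_logprobs, training_logprobs, advantages):
        ratio = math.exp(lp_train - lp_sample)  # p_train / p_sample
        total += ratio * adv
    return -total
```

When the training and sampling logprobs match, every ratio is 1 and the loss reduces to the negative sum of advantages. The ratio corrects for drift between the weights that sampled the data and the weights being updated.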

Architecture Diagram

flowchart LR
    api["Atropos API<br/>run-api<br/>port 8000"]
    env["Environment<br/>BaseEnv implementation"]
    infer["OpenAI / sglang<br/>inference API<br/>port 8001"]
    trainer["Tinker Trainer<br/>LoRA training + FastAPI"]

    env <--> api
    env --> infer
    api -->|"batches: tokens, scores, logprobs"| trainer
    trainer -->|"serves inference"| infer

Creating Custom Environments

To create a new RL environment:

Create a Python file in tinker-atropos/tinker_atropos/environments/
Define a class that inherits from BaseEnv
Implement the required methods:
load_dataset(): Load your training data
get_next_item(): Provide the next item to the model
score_answer(): Score model outputs and assign rewards
collect_trajectories(): Collect and return trajectories
Optionally define a custom config class inheriting from BaseEnvConfig

Study the existing gsm8k_tinker.py as a template. The agent can help you create new environments — it can read existing environment files, inspect HuggingFace datasets, and write new environment code.
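As a rough picture of how the four methods fit together, here is a toy environment. It is a sketch only: `BaseEnv` below is a stand-in so the example runs on its own, and the method signatures, return shapes, and synchronous style are assumptions. Check gsm8k_tinker.py for the real base class and signatures.

```python
class BaseEnv:
    """Stand-in for the real base class, only so this sketch is self-contained."""


class EchoEnv(BaseEnv):
    """Toy environment: the model must repeat a word exactly."""

    def load_dataset(self):
        # A real environment would load a dataset here (e.g., from HuggingFace)
        self.items = [
            {"prompt": "Repeat: cat", "answer": "cat"},
            {"prompt": "Repeat: dog", "answer": "dog"},
        ]
        self.cursor = 0

    def get_next_item(self):
        # Cycle through the dataset one item at a time
        item = self.items[self.cursor % len(self.items)]
        self.cursor += 1
        return item

    def score_answer(self, item, completion: str) -> float:
        # Reward 1.0 for an exact match after trimming whitespace, else 0.0
        return 1.0 if completion.strip() == item["answer"] else 0.0

    def collect_trajectories(self, item, completions):
        # Pair each completion with its score for the trainer
        return [
            {"prompt": item["prompt"], "completion": c, "score": self.score_answer(item, c)}
            for c in completions
        ]
```

In this sketch the completions are passed in directly. A real environment would obtain them from the inference server before scoring.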

WandB Metrics

Training runs log to Weights & Biases with these key metrics:

Metric	Description
`train/loss`	Training loss (importance sampling)
`train/learning_rate`	Current learning rate
`reward/mean`	Mean reward across groups
`logprobs/mean`	Mean reference logprobs
`logprobs/mean_training`	Mean training logprobs
`logprobs/diff`	Logprob drift (reference - training)
`advantages/mean`	Mean advantage values
`advantages/std`	Advantage standard deviation
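The logprob and advantage metrics in the table can be derived from per-token values roughly as follows. This is a sketch: `summarize` is a hypothetical helper, and taking the difference of means and the population standard deviation are assumptions about how the trainer aggregates.

```python
from statistics import mean, pstdev


def summarize(ref_logprobs: list[float], train_logprobs: list[float], advantages: list[float]) -> dict:
    """Compute a few of the logged metrics from flat lists of values (illustrative)."""
    ref_mean = mean(ref_logprobs)
    train_mean = mean(train_logprobs)
    return {
        "logprobs/mean": ref_mean,
        "logprobs/mean_training": train_mean,
        "logprobs/diff": ref_mean - train_mean,  # reference - training
        "advantages/mean": mean(advantages),
        "advantages/std": pstdev(advantages),
    }


metrics = summarize([-1.0, -3.0], [-1.5, -2.5], [1.0, -1.0])
```

A `logprobs/diff` that moves steadily away from zero indicates that the trained weights are drifting from the policy that generated the rollouts.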

Log Files

Each training run generates log files in ~/.hermes/logs/rl_training/:

logs/
├── api_{run_id}.log        # Atropos API server logs
├── trainer_{run_id}.log    # Tinker trainer logs
├── env_{run_id}.log        # Environment process logs
└── inference_tests/        # Inference test results
    ├── test_{env}_{model}.jsonl
    └── test_{env}_{model}.log

These logs are invaluable for debugging when training fails or produces unexpected results.
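When debugging a specific run, it can help to build the three per-run log paths from the run ID. The helper below is a sketch based on the directory and file name patterns listed in this section. `log_paths` is a hypothetical name.

```python
from pathlib import Path

LOG_DIR = Path.home() / ".hermes" / "logs" / "rl_training"


def log_paths(run_id: str, log_dir: Path = LOG_DIR) -> dict[str, Path]:
    """Return the expected log file path for each of the three processes."""
    return {
        "api": log_dir / f"api_{run_id}.log",
        "trainer": log_dir / f"trainer_{run_id}.log",
        "env": log_dir / f"env_{run_id}.log",
    }


paths = log_paths("abc12345")
```

The trainer log is usually the first place to look when a run exits early, followed by the environment log for scoring errors.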