---
skill_id: when-training-rl-agents-use-agentdb-learning
name: agentdb-reinforcement-learning-training
description: "Train AI agents using AgentDB's 9 reinforcement learning algorithms including Q-Learning, DQN, PPO, and Actor-Critic. Build self-learning agents, implement RL training loops with experience replay, and deploy optimized models to production."
version: 1.0.0
category: agentdb
subcategory: machine-learning
trigger_pattern: "when-training-rl-agents"
agents:
  - ml-developer
  - safla-neural
  - performance-benchmarker
complexity: advanced
estimated_duration: 6-10 hours
prerequisites:
  - AgentDB basics
  - Reinforcement learning fundamentals
  - Neural network knowledge
  - Python/TypeScript proficiency
outputs:
  - Trained RL agents
  - Learning plugin modules
  - Performance benchmarks
  - Deployment pipeline
validation_criteria:
  - Training converges successfully
  - Reward curve shows improvement
  - Agent passes validation tasks
  - Benchmarks meet targets
evidence_based_techniques:
  - Self-consistency validation
  - Program-of-thought decomposition
  - Chain-of-verification
  - Multi-agent consensus
metadata:
  author: claude-flow
  created: 2025-10-30
  updated: 2025-10-30
tags:
  - agentdb
  - reinforcement-learning
  - neural-networks
  - ai-training
  - q-learning
---

AgentDB Reinforcement Learning Training

Overview

Train AI learning plugins with AgentDB's 9 reinforcement learning algorithms, including Decision Transformer, Q-Learning, SARSA, Actor-Critic, and PPO. Build self-learning agents, implement RL training loops with experience replay, and optimize agent behavior through experience.

When to Use This Skill

Use this skill when you need to:

  • Train autonomous agents that learn from experience
  • Implement reinforcement learning systems
  • Optimize agent behavior through trial and error
  • Build self-improving AI systems
  • Deploy RL agents in production environments
  • Benchmark and compare RL algorithms

Available RL Algorithms

  1. Q-Learning - Value-based, off-policy
  2. SARSA - Value-based, on-policy
  3. Deep Q-Network (DQN) - Deep RL with experience replay
  4. Actor-Critic - Policy gradient with value baseline
  5. Proximal Policy Optimization (PPO) - Policy gradient with a clipped surrogate objective (a practical stand-in for trust-region methods)
  6. Decision Transformer - Offline RL with transformers
  7. Advantage Actor-Critic (A2C) - Synchronous advantage estimation
  8. Twin Delayed DDPG (TD3) - Continuous control
  9. Soft Actor-Critic (SAC) - Maximum entropy RL
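As a rough rule of thumb drawn from the list above: value-based methods (Q-Learning, SARSA, DQN) suit discrete action spaces, TD3 and SAC target continuous control, and Decision Transformer is the offline option when you only have logged trajectories. A minimal sketch of that decision, assuming the plugin accepts algorithm identifiers analogous to the 'dqn'/'ppo' strings used later in this skill (the 'sac', 'td3', and 'decision-transformer' identifiers are assumptions, not confirmed API values):

// Illustrative helper, not part of the AgentDB API: map task properties to an
// algorithm identifier string (identifiers other than 'dqn'/'ppo' are assumed).
type ActionSpaceKind = 'discrete' | 'continuous';

function suggestAlgorithm(actionSpace: ActionSpaceKind, offlineDataOnly: boolean): string {
  if (offlineDataOnly) return 'decision-transformer'; // offline RL from logged trajectories
  if (actionSpace === 'continuous') return 'sac';     // or 'td3' for continuous control
  return 'dqn';                                       // discrete actions with experience replay
}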

SOP Framework: 5-Phase RL Training Deployment

Phase 1: Initialize Learning Environment (1-2 hours)

Objective: Set up the AgentDB learning infrastructure and configure the environment

Agent: ml-developer

Steps:

  1. Install AgentDB Learning Module
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
  2. Initialize learning database
import { AgentDB, LearningPlugin } from 'agentdb-learning';

const learningDB = new AgentDB({
  name: 'rl-training-db',
  dimensions: 512, // State embedding dimension
  learning: {
    enabled: true,
    persistExperience: true,
    replayBufferSize: 100000
  }
});

await learningDB.initialize();

// Create learning plugin
const learningPlugin = new LearningPlugin({
  database: learningDB,
  algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
  config: {
    batchSize: 64,
    learningRate: 0.001,
    discountFactor: 0.99,
    explorationRate: 1.0,
    explorationDecay: 0.995
  }
});

await learningPlugin.initialize();
  3. Define environment
import { Environment } from '@agentdb/environments';

const environment = new Environment({
  name: 'grid-world',
  stateSpace: {
    type: 'continuous',
    shape: [10, 10],
    bounds: [[0, 10], [0, 10]]
  },
  actionSpace: {
    type: 'discrete',
    actions: ['up', 'down', 'left', 'right']
  },
  rewardFunction: (state, action, nextState) => {
    // Distance to goal reward
    const goalDistance = Math.sqrt(
      Math.pow(nextState[0] - 9, 2) +
      Math.pow(nextState[1] - 9, 2)
    );
    return -goalDistance + (goalDistance === 0 ? 100 : 0);
  },
  terminalCondition: (state) => {
    return state[0] === 9 && state[1] === 9; // Reached goal
  }
});

await environment.initialize();
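Note that the reward bonus and terminalCondition above use exact equality on coordinates declared as continuous, so they only trigger if the agent lands exactly on (9, 9). If your states do not snap to integer grid cells, a tolerance-based goal test is more robust; a sketch (GOAL_TOLERANCE is an illustrative value):

// Tolerance-based goal test for a continuous state (illustrative; exact
// equality is fine if states are snapped to integer grid cells).
const GOAL = [9, 9];
const GOAL_TOLERANCE = 0.1;

function reachedGoal(state: number[]): boolean {
  return Math.hypot(state[0] - GOAL[0], state[1] - GOAL[1]) <= GOAL_TOLERANCE;
}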
  4. Set up monitoring
const monitor = learningPlugin.createMonitor({
  metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
  logInterval: 100, // Log every 100 episodes
  saveCheckpoints: true,
  checkpointInterval: 1000
});

monitor.on('episode-complete', (episode) => {
  console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});

Memory Pattern:

await agentDB.memory.store('agentdb/learning/environment', {
  name: environment.name,
  stateSpace: environment.stateSpace,
  actionSpace: environment.actionSpace,
  initialized: Date.now()
});

Validation:

  • Learning database initialized
  • Environment configured and tested
  • Monitor capturing metrics
  • Configuration stored in memory

Phase 2: Configure RL Algorithm (1-2 hours)

Objective: Select and configure an RL algorithm for the learning task

Agent: ml-developer

Steps:

  1. Select algorithm
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
  algorithm: 'dqn',
  config: {
    networkArchitecture: {
      layers: [
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
      ]
    },
    learningRate: 0.001,
    batchSize: 64,
    replayBuffer: {
      size: 100000,
      prioritized: true,
      alpha: 0.6,
      beta: 0.4
    },
    targetNetwork: {
      updateFrequency: 1000,
      tauSync: 0.001 // Soft update
    },
    exploration: {
      initial: 1.0,
      final: 0.01,
      decay: 0.995
    },
    training: {
      startAfter: 1000, // Start training after 1000 experiences
      updateFrequency: 4
    }
  }
});

await dqnAgent.initialize();
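The targetNetwork block above specifies both a periodic update (updateFrequency) and a soft-update coefficient (tauSync). How the plugin combines them is internal to AgentDB, but the standard Polyak soft-update rule that a tau coefficient usually controls is sketched below:

// Standard Polyak (soft) target-network update controlled by tau
// (a sketch of the usual rule; the plugin's internals may differ):
//   theta_target <- tau * theta_online + (1 - tau) * theta_target
function softUpdate(online: number[], target: number[], tau: number): number[] {
  return target.map((wTarget, i) => tau * online[i] + (1 - tau) * wTarget);
}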
  2. Configure hyperparameters
const hyperparameters = {
  // Learning parameters
  learningRate: 0.001,
  discountFactor: 0.99, // Gamma
  batchSize: 64,

  // Exploration
  epsilonStart: 1.0,
  epsilonEnd: 0.01,
  epsilonDecay: 0.995,

  // Experience replay
  replayBufferSize: 100000,
  minReplaySize: 1000,
  prioritizedReplay: true,

  // Training
  maxEpisodes: 10000,
  maxStepsPerEpisode: 1000,
  targetUpdateFrequency: 1000,

  // Evaluation
  evalFrequency: 100,
  evalEpisodes: 10
};

dqnAgent.setHyperparameters(hyperparameters);
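The exploration settings above imply the usual multiplicative epsilon-decay schedule; a small sketch to see where it bottoms out (the plugin's exact schedule may differ, for example per-step rather than per-episode decay):

// Multiplicative epsilon decay implied by epsilonStart/epsilonDecay/epsilonEnd
// (a sketch; whether AgentDB decays per episode or per step is not specified here).
function epsilonAt(episode: number): number {
  const { epsilonStart, epsilonEnd, epsilonDecay } = hyperparameters;
  return Math.max(epsilonEnd, epsilonStart * Math.pow(epsilonDecay, episode));
}

// With these values: epsilon is about 0.61 after 100 episodes, about 0.08 after 500,
// and it hits the 0.01 floor around episode 919 (0.995^919 is just under 0.01).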
  3. Set up experience replay
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';

const replayBuffer = new PrioritizedReplayBuffer({
  capacity: 100000,
  alpha: 0.6, // Prioritization exponent
  beta: 0.4, // Importance sampling
  betaIncrement: 0.001,
  epsilon: 0.01 // Small constant for stability
});

dqnAgent.setReplayBuffer(replayBuffer);
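For reference, the alpha and beta parameters above control the standard prioritized-replay math (sampling probabilities and importance-sampling weights); a sketch independent of the @agentdb/rl-algorithms internals:

// Standard prioritized-replay math controlled by alpha and beta (a sketch):
//   P(i) = p_i^alpha / sum_k p_k^alpha                   (sampling probability)
//   w_i  = (N * P(i))^(-beta), normalized by max_j w_j   (importance weight)
function samplingProbabilities(priorities: number[], alpha: number): number[] {
  const scaled = priorities.map((p) => Math.pow(p, alpha));
  const total = scaled.reduce((a, b) => a + b, 0);
  return scaled.map((s) => s / total);
}

function importanceWeights(probs: number[], beta: number): number[] {
  const raw = probs.map((p) => Math.pow(probs.length * p, -beta));
  const maxW = Math.max(...raw);
  return raw.map((w) => w / maxW); // normalize so weights only scale updates down
}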
  4. Configure training loop
const trainingConfig = {
  episodes: 10000,
  stepsPerEpisode: 1000,
  warmupSteps: 1000,
  trainFrequency: 4,
  targetUpdateFrequency: 1000,
  saveFrequency: 1000,
  evalFrequency: 100,
  earlyStoppingPatience: 500,
  earlyStoppingThreshold: 0.01
};

dqnAgent.setTrainingConfig(trainingConfig);

Memory Pattern:

await agentDB.memory.store('agentdb/learning/algorithm-config', {
  algorithm: 'dqn',
  hyperparameters: hyperparameters,
  trainingConfig: trainingConfig,
  configured: Date.now()
});

Validation:

  • Algorithm selected and configured
  • Hyperparameters validated
  • Replay buffer initialized
  • Training config set

Phase 3: Train Agents (3-4 hours)

Objective: Execute training iterations and optimize agent behavior

Agent: safla-neural

Steps:

  1. Start training loop
async function trainAgent() {
  console.log('Starting RL training...');

  const trainingStats = {
    episodes: [],
    totalReward: [],
    episodeLength: [],
    loss: [],
    explorationRate: []
  };

  for (let episode = 0; episode < trainingConfig.episodes; episode++) {
    let state = await environment.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let episodeLoss = 0;

    for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
      // Select action
      const action = await dqnAgent.selectAction(state, {
        explore: true
      });

      // Execute action
      const { nextState, reward, done } = await environment.step(action);

      // Store experience
      await dqnAgent.storeExperience({
        state,
        action,
        reward,
        nextState,
        done
      });

      // Train if enough experiences
      if (dqnAgent.canTrain()) {
        const loss = await dqnAgent.train();
        episodeLoss += loss;
      }

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) break;
    }

    // Update target network
    if (episode % trainingConfig.targetUpdateFrequency === 0) {
      await dqnAgent.updateTargetNetwork();
    }

    // Decay exploration
    dqnAgent.decayExploration();

    // Log progress
    trainingStats.episodes.push(episode);
    trainingStats.totalReward.push(episodeReward);
    trainingStats.episodeLength.push(episodeLength);
    trainingStats.loss.push(episodeLoss / episodeLength);
    trainingStats.explorationRate.push(dqnAgent.getExplorationRate());

    if (episode % 100 === 0) {
      console.log(`Episode ${episode}:`, {
        reward: episodeReward.toFixed(2),
        length: episodeLength,
        loss: (episodeLoss / episodeLength).toFixed(4),
        epsilon: dqnAgent.getExplorationRate().toFixed(3)
      });
    }

    // Save checkpoint
    if (episode % trainingConfig.saveFrequency === 0) {
      await dqnAgent.save(`checkpoint-${episode}`);
    }

    // Evaluate
    if (episode % trainingConfig.evalFrequency === 0) {
      const evalReward = await evaluateAgent(dqnAgent, environment);
      console.log(`Evaluation at episode ${episode}: ${evalReward.toFixed(2)}`);
    }

    // Early stopping
    if (checkEarlyStopping(trainingStats, episode)) {
      console.log('Early stopping triggered');
      break;
    }
  }

  return trainingStats;
}

const trainingStats = await trainAgent();
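The loop above calls checkEarlyStopping, which the skill never defines. A minimal sketch that ties it to earlyStoppingPatience and earlyStoppingThreshold from trainingConfig (an assumption about the intended semantics, not the skill's own implementation):

// Sketch of checkEarlyStopping: stop once the 100-episode reward average has
// failed to improve on the previous window by earlyStoppingThreshold for
// earlyStoppingPatience consecutive episodes.
let episodesWithoutImprovement = 0;

function checkEarlyStopping(stats: { totalReward: number[] }, episode: number): boolean {
  const window = 100;
  if (stats.totalReward.length < window * 2) return false;

  const recent = stats.totalReward.slice(-window);
  const previous = stats.totalReward.slice(-window * 2, -window);
  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;
  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);

  episodesWithoutImprovement =
    improvement < trainingConfig.earlyStoppingThreshold ? episodesWithoutImprovement + 1 : 0;
  return episodesWithoutImprovement >= trainingConfig.earlyStoppingPatience;
}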
  2. Monitor training progress
monitor.on('training-update', (stats) => {
  // Calculate moving averages
  const window = 100;
  const recentRewards = stats.totalReward.slice(-window);
  const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;

  // Store metrics
  agentDB.memory.store('agentdb/learning/training-progress', {
    episode: stats.episodes[stats.episodes.length - 1],
    avgReward: avgReward,
    explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
    timestamp: Date.now()
  });

  // Plot learning curve (if visualization enabled)
  if (monitor.visualization) {
    monitor.plot('reward-curve', stats.episodes, stats.totalReward);
    monitor.plot('loss-curve', stats.episodes, stats.loss);
  }
});
  3. Handle convergence
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
  if (stats.totalReward.length < windowSize * 2) {
    return false;
  }

  const recent = stats.totalReward.slice(-windowSize);
  const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);

  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);

  return improvement < threshold;
}
  4. Save trained model
await dqnAgent.save('trained-agent-final', {
  includeReplayBuffer: false,
  includeOptimizer: false,
  metadata: {
    trainingStats: trainingStats,
    hyperparameters: hyperparameters,
    finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
  }
});

console.log('Training complete. Model saved.');

Memory Pattern:

await agentDB.memory.store('agentdb/learning/training-results', {
  algorithm: 'dqn',
  episodes: trainingStats.episodes.length,
  finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
  converged: checkConvergence(trainingStats),
  modelPath: 'trained-agent-final',
  timestamp: Date.now()
});

Validation:

  • Training completed or converged
  • Reward curve shows improvement
  • Model saved successfully
  • Training stats stored

Phase 4: Validate Performance (1-2 hours)

Objective: Benchmark the trained agent and validate its performance

Agent: performance-benchmarker

Steps:

  1. Load trained agent
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
  2. Run evaluation episodes
async function evaluateAgent(agent, env, numEpisodes = 100) {
  const results = {
    rewards: [],
    episodeLengths: [],
    successRate: 0
  };

  for (let i = 0; i < numEpisodes; i++) {
    let state = await env.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let success = false;

    for (let step = 0; step < 1000; step++) {
      const action = await agent.selectAction(state, { explore: false });
      const { nextState, reward, done } = await env.step(action);

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) {
        success = env.isSuccessful(state);
        break;
      }
    }

    results.rewards.push(episodeReward);
    results.episodeLengths.push(episodeLength);
    if (success) results.successRate += 1;
  }

  results.successRate /= numEpisodes;

  return {
    meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
    stdReward: calculateStd(results.rewards),
    meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
    successRate: results.successRate,
    results: results
  };
}

const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
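Two small gaps in the evaluation code above: calculateStd is never defined, and env.isSuccessful(state) assumes the environment exposes a success check beyond the terminalCondition configured in Phase 1. A minimal calculateStd:

// Minimal standard-deviation helper used by evaluateAgent above.
function calculateStd(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}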
  3. Compare with baseline
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);

// Calculate improvement
const improvement = {
  rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
  lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
  successImprovement: evalResults.successRate - randomResults.successRate
};

console.log('Improvement over random:', improvement);
  4. Run comprehensive benchmarks
const benchmarks = {
  performanceMetrics: {
    meanReward: evalResults.meanReward,
    stdReward: evalResults.stdReward,
    successRate: evalResults.successRate,
    meanEpisodeLength: evalResults.meanLength
  },
  algorithmComparison: {
    dqn: evalResults,
    random: randomResults,
    improvement: improvement
  },
  inferenceTiming: {
    actionSelection: 0,
    totalEpisode: 0
  }
};

// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();
for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  await trainedAgent.selectAction(state, { explore: false });
}
const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;

await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
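The timing loop above folds environment.randomState() into the measured interval and reports only a mean. If the latency criteria later in this skill matter to you, timing each selectAction call on its own and reporting percentiles gives a cleaner picture; a sketch:

// Optional refinement (not in the skill as written): per-call timings and percentiles.
const latencies: number[] = [];
for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  const t0 = performance.now();
  await trainedAgent.selectAction(state, { explore: false });
  latencies.push(performance.now() - t0);
}
latencies.sort((a, b) => a - b);
const p50 = latencies[Math.floor(latencies.length * 0.5)];
const p95 = latencies[Math.floor(latencies.length * 0.95)];
console.log(`Action selection latency: p50 ${p50.toFixed(2)} ms, p95 ${p95.toFixed(2)} ms`);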

Memory Pattern:

await agentDB.memory.store('agentdb/learning/validation', {
  evaluated: true,
  meanReward: evalResults.meanReward,
  successRate: evalResults.successRate,
  improvement: improvement,
  timestamp: Date.now()
});

Validation:

  • Evaluation completed (100 episodes)
  • Mean reward exceeds threshold
  • Success rate acceptable
  • Improvement over baseline demonstrated

Phase 5: Deploy Trained Agents (1-2 hours)

Objective: Deploy the trained agent to a production environment

Agent: ml-developer

Steps:

  1. Export production model
await trainedAgent.export('production-agent', {
  format: 'onnx', // or 'tensorflowjs', 'pytorch'
  optimize: true,
  quantize: 'int8', // Quantization for faster inference
  includeMetadata: true
});
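If the ONNX export is used, the model can also be served outside AgentDB, for example with onnxruntime-node. A minimal sketch; the input/output tensor names ('state', 'q_values') depend on how the model was exported and are assumptions here:

// Serving the ONNX export with onnxruntime-node (optional; tensor names are assumptions).
import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('./production-agent.onnx');

async function selectActionOnnx(state: number[]): Promise<number> {
  const input = new ort.Tensor('float32', Float32Array.from(state), [1, state.length]);
  const outputs = await session.run({ state: input });
  const qValues = outputs['q_values'].data as Float32Array;
  return qValues.indexOf(Math.max(...qValues)); // greedy action = argmax over Q-values
}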
  2. Create inference API
import express from 'express';

const app = express();
app.use(express.json());

// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');

app.post('/api/predict', async (req, res) => {
  try {
    const { state } = req.body;

    const action = await productionAgent.selectAction(state, {
      explore: false,
      returnProbabilities: true
    });

    res.json({
      action: action.action,
      probabilities: action.probabilities,
      confidence: action.confidence
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RL agent API running on port 3000');
});
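The handler above passes req.body.state to the agent as-is. A small guard in front of inference keeps malformed requests out of the model; the expected length of 2 matches the grid-world state above and should be adjusted to your state space:

// Minimal request validation for /api/predict (illustrative; adjust the expected length).
function isValidState(state: unknown): state is number[] {
  return Array.isArray(state)
    && state.length === 2
    && state.every((v) => typeof v === 'number' && Number.isFinite(v));
}

// Inside the route handler, before selectAction:
// if (!isValidState(state)) {
//   return res.status(400).json({ error: 'state must be an array of 2 finite numbers' });
// }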
  3. Set up monitoring
import { ProductionMonitor } from '@agentdb/monitoring';

const prodMonitor = new ProductionMonitor({
  agent: productionAgent,
  metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
  alerting: {
    latencyThreshold: 100, // ms
    anomalyDetection: true
  }
});

await prodMonitor.start();
  4. Create deployment pipeline
const deploymentPipeline = {
  stages: [
    {
      name: 'validation',
      steps: [
        'Load trained model',
        'Run validation suite',
        'Check performance metrics',
        'Verify inference speed'
      ]
    },
    {
      name: 'export',
      steps: [
        'Export to production format',
        'Optimize model',
        'Quantize weights',
        'Package artifacts'
      ]
    },
    {
      name: 'deployment',
      steps: [
        'Deploy to staging',
        'Run smoke tests',
        'Deploy to production',
        'Monitor performance'
      ]
    }
  ]
};

await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);

Memory Pattern:

await agentDB.memory.store('agentdb/learning/production', {
  deployed: true,
  modelPath: 'production-agent',
  apiEndpoint: 'http://localhost:3000/api/predict',
  monitoring: true,
  timestamp: Date.now()
});

Validation:

  • Model exported successfully
  • API running and responding
  • Monitoring active
  • Deployment pipeline documented

Integration Scripts

Complete Training Script

#!/bin/bash
# train-rl-agent.sh

set -e

echo "AgentDB RL Training Script"
echo "=========================="

# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms

# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"

# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"

# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"

# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"

echo "Training complete!"

Quick Start Script

// quickstart-rl.ts
import { setupRLTraining } from './setup';

async function quickStart() {
  console.log('Starting RL training quick setup...');

  // Setup
  const { learningDB, environment, agent } = await setupRLTraining({
    algorithm: 'dqn',
    environment: 'grid-world',
    episodes: 1000
  });

  // Train
  console.log('Training agent...');
  const stats = await agent.train(environment, {
    episodes: 1000,
    logInterval: 100
  });

  // Evaluate
  console.log('Evaluating agent...');
  const results = await agent.evaluate(environment, {
    episodes: 100
  });
  console.log('Results:', results);

  // Save
  await agent.save('quickstart-agent');
  console.log('Quick start complete!');
}

quickStart().catch(console.error);

Evidence-Based Success Criteria

  1. Training Convergence (Self-Consistency)

    • Reward curve stabilizes
    • Moving average improvement < 1%
    • Agent achieves consistent performance
  2. Performance Benchmarks (Quantitative; see the check after this list)

    • Mean reward exceeds baseline by 50%
    • Success rate > 80%
    • Inference time < 10ms per action
  3. Algorithm Validation (Chain-of-Verification)

    • Hyperparameters validated
    • Exploration-exploitation balanced
    • Experience replay functioning
  4. Production Readiness (Multi-Agent Consensus)

    • Model exported successfully
    • API responds within latency threshold
    • Monitoring active and alerting
    • Deployment pipeline documented
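The quantitative thresholds in item 2 can be checked mechanically against the Phase 4 results; a sketch using the evalResults, improvement, and benchmarks objects from that phase:

// Mechanical check of the quantitative criteria against the Phase 4 objects.
const criteriaMet = {
  rewardVsBaseline: improvement.rewardImprovement >= 0.5,             // >= 50% over random
  successRate: evalResults.successRate > 0.8,                         // > 80%
  inferenceLatencyMs: benchmarks.inferenceTiming.actionSelection < 10 // < 10 ms per action
};
console.log('Quantitative success criteria:', criteriaMet);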

Additional Resources
