---
name: optimization-benchmark
description: Comprehensive performance benchmarking, regression detection, automated testing, and performance validation. Use for CI/CD quality gates, baseline comparisons, and systematic performance testing.
---
# Benchmark Suite Skill
## Overview
This skill provides automated performance testing: benchmark execution, regression detection, performance validation, and quality assessment against defined performance targets.
## When to Use
- Running performance benchmark suites before deployment
- Detecting performance regressions between versions
- Validating SLA compliance through automated testing
- Load, stress, and endurance testing
- CI/CD pipeline performance quality gates
- Comparing performance across configurations
## Quick Start
```bash
# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300

# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10

# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>

# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency

# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>
```
## Architecture
```
+-----------------------------------------------------------+
|                      Benchmark Suite                      |
+-----------------------------------------------------------+
|  Benchmark Runner  |  Regression Detector  |  Validator   |
+--------------------+-----------------------+--------------+
          |                      |                   |
          v                      v                   v
+------------------+  +-------------------+  +--------------+
| Benchmark Types  |  | Detection Methods |  | Validation   |
| - Throughput     |  | - Statistical     |  | - SLA        |
| - Latency        |  | - ML-based        |  | - Regression |
| - Scalability    |  | - Threshold       |  | - Scalability|
| - Coordination   |  | - Trend Analysis  |  | - Reliability|
+------------------+  +-------------------+  +--------------+
          |                      |                   |
          v                      v                   v
+-----------------------------------------------------------+
|                   Reporter & Comparator                   |
+-----------------------------------------------------------+
```
## Benchmark Types
### Standard Benchmarks
| Benchmark | Metrics | Duration | Targets |
|-----------|---------|----------|---------|
| Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 |
| Latency | p50, p90, p95, p99, max | 5 min | p50<100ms, p99<1s |
| Scalability | linear coefficient, efficiency retention | variable | coefficient>0.8 |
| Coordination | message latency, sync time | 5 min | <50ms |
| Fault Tolerance | recovery time, failover success | 10 min | <30s recovery |
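The latency targets above are percentile-based. For reference, percentiles can be computed from raw latency samples with a nearest-rank calculation; a minimal sketch in plain JavaScript, where `percentile` is an illustrative helper and not part of the claude-flow API:
```javascript
// Nearest-rank percentile over raw latency samples (milliseconds).
// Illustrative helper only, not part of the claude-flow API.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [12, 25, 38, 47, 55, 68, 80, 92, 350, 900];
const summary = {
  p50: percentile(latencies, 50),
  p90: percentile(latencies, 90),
  p99: percentile(latencies, 99),
  max: Math.max(...latencies)
};
// Check against the table's targets: p50 < 100ms, p99 < 1s
console.log(summary, summary.p50 < 100 && summary.p99 < 1000);
// -> { p50: 55, p90: 350, p99: 900, max: 900 } true
```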
### Test Campaign Types
1. **Load Testing**: Gradual ramp-up to sustained load
2. **Stress Testing**: Find breaking points
3. **Volume Testing**: Large data set handling
4. **Endurance Testing**: Long-duration stability
5. **Spike Testing**: Sudden load changes
6. **Configuration Testing**: Different settings comparison
## Core Capabilities
### 1. Comprehensive Benchmarking
```javascript
// Run benchmark suite
const results = await benchmarkSuite.run({
  duration: 300000,     // 5 minutes
  iterations: 10,       // 10 iterations
  warmupTime: 30000,    // 30 seconds warmup
  cooldownTime: 10000,  // 10 seconds cooldown
  parallel: false,      // Sequential execution
  baseline: previousRun // Compare with baseline
});
// Results include:
// - summary: Overall scores and status
// - detailed: Per-benchmark results
// - baseline_comparison: Delta from baseline
// - recommendations: Optimization suggestions
```
### 2. Regression Detection
Multi-algorithm detection:
| Method | Description | Use Case |
|--------|-------------|----------|
| Statistical | CUSUM change point detection | Detect gradual degradation |
| Machine Learning | Anomaly detection models | Identify unusual patterns |
| Threshold | Fixed limit comparisons | Hard performance limits |
| Trend | Time series regression | Long-term degradation |
```bash
# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>
# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95
```
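For the statistical method, a one-sided CUSUM over successive measurements illustrates the idea. This is a minimal sketch, not the skill's internal implementation; the baseline mean, allowance `k`, and threshold `h` are tuning assumptions:
```javascript
// One-sided CUSUM for detecting an upward shift in a metric (e.g. latency).
// baselineMean, k (allowance), and h (alarm threshold) are assumed tunables.
function cusumDetect(samples, baselineMean, k, h) {
  let s = 0;
  for (let i = 0; i < samples.length; i++) {
    // Accumulate only the excess over (baselineMean + k); reset at zero.
    s = Math.max(0, s + (samples[i] - baselineMean - k));
    if (s > h) return { regression: true, atSample: i, cusum: s };
  }
  return { regression: false, cusum: s };
}

// Latencies drift upward after the fifth sample.
const latencies = [101, 98, 103, 99, 102, 118, 121, 125, 130, 128];
console.log(cusumDetect(latencies, 100, 5, 30));
// -> flags a regression once the cumulative excess over 105ms passes 30
```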
### 3. Automated Performance Testing
```javascript
// Execute test campaign
const campaign = await tester.runTestCampaign({
  tests: [
    { type: 'load', config: loadTestConfig },
    { type: 'stress', config: stressTestConfig },
    { type: 'endurance', config: enduranceConfig }
  ],
  constraints: {
    maxDuration: 3600000, // 1 hour max
    failFast: true        // Stop on first failure
  }
});
```
### 4. Performance Validation
Validation framework with multi-criteria assessment:
| Validation Type | Criteria |
|-----------------|----------|
| SLA Validation | Availability, response time, throughput, error rate |
| Regression Validation | Comparison with historical data |
| Scalability Validation | Linear scaling, efficiency retention |
| Reliability Validation | Error handling, recovery, consistency |
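A minimal sketch of the SLA check, using the same min/max criteria shape as the targets in this document; `validateSLA` is an illustrative helper, and the real validator's interface may differ:
```javascript
// Compare measured results against min/max criteria and collect violations.
function validateSLA(results, criteria) {
  const violations = [];
  for (const [metric, bounds] of Object.entries(criteria)) {
    const value = results[metric];
    if (bounds.max !== undefined && value > bounds.max) {
      violations.push(`${metric}=${value} exceeds max ${bounds.max}`);
    }
    if (bounds.min !== undefined && value < bounds.min) {
      violations.push(`${metric}=${value} below min ${bounds.min}`);
    }
  }
  return { passed: violations.length === 0, violations };
}

const results = { availability: 0.999, p99_latency: 1200, error_rate: 0.002 };
const criteria = {
  availability: { min: 0.995 },
  p99_latency: { max: 1000 },
  error_rate: { max: 0.005 }
};
console.log(validateSLA(results, criteria));
// -> { passed: false, violations: ['p99_latency=1200 exceeds max 1000'] }
```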
## MCP Integration
```javascript
// Comprehensive benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    const [benchmark, metrics, trends, cost] = await Promise.all([
      mcp.benchmark_run({ suite: config.suite || 'comprehensive' }),
      mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }),
      mcp.trend_analysis({ metric: 'performance', period: '24h' }),
      mcp.cost_analysis({ timeframe: '24h' })
    ]);
    return { benchmark, metrics, trends, cost, timestamp: Date.now() };
  },

  // Quality assessment
  async assessQuality(criteria) {
    return await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability']
    });
  }
};
```
## Key Metrics
### Benchmark Targets
```javascript
const benchmarkTargets = {
  throughput: {
    requests_per_second: { min: 1000, optimal: 5000 },
    tasks_per_second: { min: 100, optimal: 500 },
    messages_per_second: { min: 10000, optimal: 50000 }
  },
  latency: {
    p50: { max: 100 },  // 100ms
    p90: { max: 200 },  // 200ms
    p95: { max: 500 },  // 500ms
    p99: { max: 1000 }, // 1s
    max: { max: 5000 }  // 5s
  },
  scalability: {
    linear_coefficient: { min: 0.8 },
    efficiency_retention: { min: 0.7 }
  }
};
```
### CI/CD Quality Gates
| Gate | Criteria | Action on Failure |
|------|----------|-------------------|
| Performance | < 10% degradation | Block deployment |
| Latency | p99 < 1s | Warning |
| Error Rate | < 0.5% | Block deployment |
| Scalability | > 80% linear | Warning |
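The gate table can be encoded directly. A minimal sketch, where `evaluateGates` and the result field names are illustrative rather than a claude-flow API; thresholds match the table above:
```javascript
// Encode the gate table: blocking gates fail the build, warning gates do not.
const gates = [
  { name: 'performance', check: r => r.degradation_pct < 10,     blocking: true },
  { name: 'latency',     check: r => r.p99_latency < 1000,       blocking: false },
  { name: 'error_rate',  check: r => r.error_rate < 0.005,       blocking: true },
  { name: 'scalability', check: r => r.linear_coefficient > 0.8, blocking: false }
];

function evaluateGates(results) {
  const failures = gates.filter(g => !g.check(results));
  return {
    block: failures.some(g => g.blocking),
    warnings: failures.filter(g => !g.blocking).map(g => g.name)
  };
}

const run = { degradation_pct: 4, p99_latency: 1200, error_rate: 0.001, linear_coefficient: 0.85 };
console.log(evaluateGates(run));
// -> { block: false, warnings: ['latency'] } : deploy, but surface the warning
```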
## Load Testing Example
```javascript
// Load test with gradual ramp-up
const loadTest = {
  type: 'load',
  phases: [
    { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 },
    { phase: 'sustained', duration: 300000, load: 100 },
    { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 }
  ],
  successCriteria: {
    p99_latency: { max: 1000 },
    error_rate: { max: 0.01 },
    throughput: { min: 80 } // % of expected
  }
};
```
## Stress Testing Example
```javascript
// Stress test to find breaking point
const stressTest = {
  type: 'stress',
  startLoad: 100,
  maxLoad: 10000,
  loadIncrement: 100,
  duration: 60000, // Per load level
  breakingCriteria: {
    error_rate: { max: 0.05 },  // 5% errors
    latency_p99: { max: 5000 }, // 5s latency
    timeout_rate: { max: 0.10 } // 10% timeouts
  }
};
```
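A sketch of the driver loop implied by this config, reusing the `stressTest` object above; `findBreakingPoint` is illustrative, and `measureAtLoad` is a stand-in for whatever actually applies load and collects metrics:
```javascript
// Walk load levels upward until any breaking criterion is exceeded.
async function findBreakingPoint(test, measureAtLoad) {
  for (let load = test.startLoad; load <= test.maxLoad; load += test.loadIncrement) {
    const m = await measureAtLoad(load, test.duration);
    if (m.error_rate > test.breakingCriteria.error_rate.max ||
        m.latency_p99 > test.breakingCriteria.latency_p99.max ||
        m.timeout_rate > test.breakingCriteria.timeout_rate.max) {
      return { breakingPoint: load, metrics: m };
    }
  }
  return { breakingPoint: null }; // survived maxLoad
}

// Simulated measurement: error rate climbs past 5% around load 2000.
const simulate = async (load) => ({
  error_rate: load / 40000,
  latency_p99: 500 + load / 5,
  timeout_rate: load / 200000
});
findBreakingPoint(stressTest, simulate).then(console.log);
// -> breaking point at load 2100 (error_rate crosses 0.05 first)
```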
## Integration Points
| Integration | Purpose |
|-------------|---------|
| Performance Monitor | Continuous monitoring data for benchmarking |
| Load Balancer | Validates load balancing effectiveness |
| Topology Optimizer | Tests topology configurations |
| CI/CD Pipeline | Automated quality gates |
## Best Practices
1. **Consistent Environment**: Run benchmarks in consistent, isolated environments
2. **Warmup Period**: Always include warmup to eliminate cold-start effects
3. **Multiple Iterations**: Run multiple iterations for statistical significance (see the sketch after this list)
4. **Baseline Maintenance**: Keep baseline updated with expected performance
5. **Historical Tracking**: Store all benchmark results for trend analysis
6. **Realistic Workloads**: Use production-like workload patterns
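One way to act on practices 2 through 5: aggregate each metric over the iterations and compare confidence intervals instead of single runs. A minimal sketch, assuming a normal approximation for the interval:
```javascript
// Aggregate a metric across iterations: mean, sample stddev, and an
// approximate 95% confidence interval (normal assumption, illustrative only).
function summarize(samples) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const stddev = Math.sqrt(variance);
  const margin = 1.96 * stddev / Math.sqrt(n); // 95% CI half-width
  return { mean, stddev, ci95: [mean - margin, mean + margin] };
}

// Throughput (requests/sec) from 10 iterations of the same benchmark.
const runs = [4820, 4950, 4700, 5010, 4880, 4760, 4930, 4990, 4850, 4910];
console.log(summarize(runs));
// If the baseline's mean falls outside this interval, the change is
// unlikely to be run-to-run noise.
```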
## Example: CI/CD Integration
```bash
#!/bin/bash
# ci-performance-gate.sh

# Run benchmark suite
RESULTS=$(npx claude-flow benchmark-run --suite quick --output json)

# Compare with baseline
COMPARISON=$(npx claude-flow benchmark-compare \
  --current "$RESULTS" \
  --baseline ./baseline.json)

# Check for regressions
if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then
  echo "Performance regression detected!"
  echo "$COMPARISON" | jq '.regressions'
  exit 1
fi
echo "Performance validation passed"
exit 0
```
## Related Skills
- `optimization-monitor` - Real-time performance monitoring
- `optimization-analyzer` - Bottleneck analysis and reporting
- `optimization-load-balancer` - Load distribution optimization
- `optimization-topology` - Topology performance testing
---
## Version History
- **1.0.0** (2026-01-02): Initial release - converted from benchmark-suite agent with comprehensive benchmarking, regression detection, automated testing, and performance validation