5dlabs

healer

by 5dlabs

Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents

2🍴 1📅 Jan 24, 2026

SKILL.md


name: healer
description: Healer monitoring expertise - detection patterns, API endpoints, dual-model architecture, and remediation workflows. Use when monitoring Play workflows or debugging agent failures.

Healer Skill

Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.

When to Use

  • Monitoring Play workflow execution
  • Debugging agent failures (pre-flight, runtime)
  • Understanding detection patterns (A10, A11, A12)
  • Checking session status

Healer API Endpoints

Endpoint                     Method  Purpose
/health                      GET     Health check
/api/v1/session/start        POST    MCP calls this on play()
/api/v1/session/{play_id}    GET     Get session details
/api/v1/sessions             GET     List all sessions
/api/v1/sessions/active      GET     List active sessions only

Check Active Sessions

curl http://localhost:8083/api/v1/sessions/active | jq
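
The same check can be scripted. The sketch below assumes Healer listens on http://localhost:8083, as in the curl example, and that the endpoint returns JSON. The response schema is not documented here, so the code decodes the body and returns it unchanged. The helper names `endpoint_url`, `session_url`, and `fetch_active_sessions` are illustrative, not part of Healer.

```python
import json
import urllib.parse
import urllib.request

HEALER_BASE = "http://localhost:8083"  # assumption: same host/port as the curl example


def endpoint_url(path: str, base: str = HEALER_BASE) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def session_url(play_id: str, base: str = HEALER_BASE) -> str:
    """Build the URL for GET /api/v1/session/{play_id}, escaping the id."""
    return endpoint_url("/api/v1/session/" + urllib.parse.quote(play_id, safe=""), base)


def fetch_active_sessions(base: str = HEALER_BASE, timeout: float = 5.0):
    """GET /api/v1/sessions/active and return the decoded JSON body as-is."""
    url = endpoint_url("/api/v1/sessions/active", base)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_active_sessions()` requires a running Healer instance; the URL helpers work offline.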

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Pattern                          Alert Code  Meaning
tool inventory mismatch          A10         Agent missing declared tools
Tool inventory MISMATCH          A10         Specific tool unavailable
declared tools.*missing          A10         Tools in config not in CLI
cto-config.*(missing|invalid)    A11         Config not loaded/synced
mcp.*failed to initialize        A12         MCP server init failure
tools-server.*unreachable        A12         Tools-server down
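
To make the mapping concrete, here is a minimal sketch that treats the patterns above as regular expressions and returns an alert code for a single log line. It is not Healer's implementation. Two assumptions are baked in: matching is case-insensitive (which folds the two "tool inventory mismatch" rows into one), and the 60-second window is measured from a caller-supplied `secs_since_start`. The names `PREFLIGHT_PATTERNS` and `preflight_alert` are hypothetical.

```python
import re

# (regex, alert code) pairs taken from the pre-flight table; first match wins.
# Assumption: case-insensitive matching, so both "mismatch" spellings share one row.
PREFLIGHT_PATTERNS = [
    (re.compile(r"tool inventory mismatch", re.IGNORECASE), "A10"),
    (re.compile(r"declared tools.*missing", re.IGNORECASE), "A10"),
    (re.compile(r"cto-config.*(missing|invalid)", re.IGNORECASE), "A11"),
    (re.compile(r"mcp.*failed to initialize", re.IGNORECASE), "A12"),
    (re.compile(r"tools-server.*unreachable", re.IGNORECASE), "A12"),
]

PREFLIGHT_WINDOW_SECS = 60  # "within 60s of agent start"


def preflight_alert(line: str, secs_since_start: float):
    """Return the alert code for a log line, or None if unmatched or past the window."""
    if secs_since_start > PREFLIGHT_WINDOW_SECS:
        return None
    for pattern, code in PREFLIGHT_PATTERNS:
        if pattern.search(line):
            return code
    return None
```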

Priority 2: Runtime Failures

Pattern                          Severity  Action
panicked at, fatal error         Critical  Immediate escalation
timeout, connection refused      High      Infrastructure issue
max retries exceeded             High      Agent exhausted attempts
permission denied.*filesystem    Critical  Can't read/write files
unauthorized|invalid token       Critical  Auth broken
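
When several runtime patterns fire in the same log window, the most severe one should drive the response. The sketch below shows one way to pick it. It assumes the comma-separated entries ("panicked at, fatal error") are alternatives, that matching is case-sensitive as written, and that Critical outranks High. `RUNTIME_PATTERNS` and `worst_severity` are hypothetical names.

```python
import re

# (regex, severity) rows from the runtime table; comma-separated
# entries are treated as alternatives (an assumption).
RUNTIME_PATTERNS = [
    (re.compile(r"panicked at|fatal error"), "Critical"),
    (re.compile(r"timeout|connection refused"), "High"),
    (re.compile(r"max retries exceeded"), "High"),
    (re.compile(r"permission denied.*filesystem"), "Critical"),
    (re.compile(r"unauthorized|invalid token"), "Critical"),
]

SEVERITY_RANK = {"High": 1, "Critical": 2}


def worst_severity(lines):
    """Return the highest severity matched across the given log lines, or None."""
    worst = None
    for line in lines:
        for pattern, severity in RUNTIME_PATTERNS:
            if not pattern.search(line):
                continue
            if worst is None or SEVERITY_RANK[severity] > SEVERITY_RANK[worst]:
                worst = severity
    return worst
```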

Priority 3: Lifecycle Issues

Pattern                  Meaning
template not found       Prompt template missing
prompt.*missing          Agent instructions not loaded
role.*undefined          Agent role not set
task context.*empty      Task details not injected

Dual-Model Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        DUAL-MODEL HEALER ARCHITECTURE                        │
│                                                                              │
│   DATA SOURCES                                                              │
│   ├─ Loki (all pod logs)                                                    │
│   ├─ Kubernetes (CodeRuns, Pods, Events)                                    │
│   ├─ GitHub (PRs, comments, CI status)                                      │
│   └─ CTO Config (expected tools, agent settings)                            │
│                              │                                               │
│                              ▼                                               │
│   MODEL 1: EVALUATION AGENT                                                 │
│   ├─ Parses and comprehends ALL logs                                        │
│   ├─ Correlates events across agents                                        │
│   ├─ Identifies root cause                                                  │
│   └─ Creates GitHub Issue with analysis                                     │
│                              │                                               │
│                              ▼                                               │
│   MODEL 2: REMEDIATION AGENT                                                │
│   ├─ Reads the GitHub issue                                                 │
│   ├─ Implements the fix                                                     │
│   ├─ Creates PR with changes                                                │
│   └─ Marks issue resolved                                                   │
└─────────────────────────────────────────────────────────────────────────────┘

Session Notification Flow

MCP play() call
    │
    ▼
POST /api/v1/session/start
    │
    └─ Payload: {
         play_id,
         repository,
         cto_config: { agents, tools },
         tasks: [...]
       }
    │
    ▼
Healer stores session with expected tools per agent
    │
    ▼
CodeRuns start with Healer already aware
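
The notification step can be sketched as a small request builder. The payload keys come from the flow above; everything else is an assumption, including the value types for `agents`, `tools`, and `tasks`, the JSON content type, and the localhost:8083 base URL. The function names are hypothetical, and the request is only prepared, not sent.

```python
import json
import urllib.request


def build_session_payload(play_id, repository, agents, tools, tasks):
    """Assemble the session-start payload using the keys shown in the flow."""
    return {
        "play_id": play_id,
        "repository": repository,
        "cto_config": {"agents": agents, "tools": tools},
        "tasks": tasks,
    }


def session_start_request(payload, base="http://localhost:8083"):
    """Prepare (but do not send) a POST /api/v1/session/start request."""
    return urllib.request.Request(
        base.rstrip("/") + "/api/v1/session/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it to a running Healer instance.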

Watch Logs

Pod Logs

# Watch all CTO pods
kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100

# Watch specific agent CodeRun
kubectl logs -n cto -l app=coderun -f

Loki Query

{namespace="cto"} |= "error" | json

Pre-Flight Checklist (Verify within 60s)

For every agent run, Healer verifies:

Prompts

  • Agent type identified
  • Role matches task
  • Template loaded
  • Language context set

MCP Tools (from CTO Config)

  • CTO config loaded
  • Remote tools accessible
  • Local servers initialized
  • Tools-server reachable
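
The tool checks come down to comparing what CTO config declares for an agent against what the agent actually reports at startup. A minimal sketch of that comparison follows; how Healer obtains either list is not described here, so both are plain inputs, and `missing_tools` is a hypothetical helper.

```python
def missing_tools(declared, available):
    """Return declared tools the agent did not report, sorted for stable output."""
    return sorted(set(declared) - set(available))
```

A non-empty result corresponds to the A10 condition (agent missing declared tools). Extra tools that the agent reports but the config does not declare are ignored in this sketch.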

Escalation

When issues are detected:

  1. Evaluation Agent creates GitHub issue with root cause
  2. Remediation Agent attempts fix (if automatable)
  3. Discord notification for P0/P1 critical issues
  4. Human escalation if remediation fails

Configuration

In cto-config.json:

{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}
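
A short sketch of reading these settings from a parsed cto-config.json. Missing keys come back as `None` rather than a guessed default, since the document does not state what the defaults are. `healer_settings` is a hypothetical helper name.

```python
import json

# The fragment shown above, embedded so the example is self-contained.
CONFIG_TEXT = """
{
  "defaults": {
    "play": {"healerEndpoint": "http://localhost:8083"},
    "remediation": {"maxIterations": 3, "syncTimeoutSecs": 300}
  }
}
"""


def healer_settings(config: dict) -> dict:
    """Extract the Healer-related values; absent keys are returned as None."""
    defaults = config.get("defaults", {})
    play = defaults.get("play", {})
    remediation = defaults.get("remediation", {})
    return {
        "endpoint": play.get("healerEndpoint"),
        "max_iterations": remediation.get("maxIterations"),
        "sync_timeout_secs": remediation.get("syncTimeoutSecs"),
    }


settings = healer_settings(json.loads(CONFIG_TEXT))
```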
