healer

Name: healer
Rating: 65
Author: 5dlabs

by 5dlabs

Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents

⭐ 2🍴 1📅 Jan 24, 2026

ai-agents ai-powered-development autonomous-development code-generation devops-automation github-automation kubernetes-operator mcp-protocol

View on GitHub Run in Manus

SKILL.md

name: healer description: Healer monitoring expertise - detection patterns, API endpoints, dual-model architecture, and remediation workflows. Use when monitoring Play workflows or debugging agent failures.

Healer Skill

Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.

When to Use

Monitoring Play workflow execution
Debugging agent failures (pre-flight, runtime)
Understanding detection patterns (A10, A11, A12)
Checking session status

Healer API Endpoints

Endpoint	Method	Purpose
`/health`	GET	Health check
`/api/v1/session/start`	POST	MCP calls this on play()
`/api/v1/session/{play_id}`	GET	Get session details
`/api/v1/sessions`	GET	List all sessions
`/api/v1/sessions/active`	GET	List active sessions only

Check Active Sessions

curl http://localhost:8083/api/v1/sessions/active | jq

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Pattern	Alert Code	Meaning
`tool inventory mismatch`	A10	Agent missing declared tools
`Tool inventory MISMATCH`	A10	Specific tool unavailable
`declared tools.*missing`	A10	Tools in config not in CLI
`cto-config.*(missing\|invalid)`	A11	Config not loaded/synced
`mcp.*failed to initialize`	A12	MCP server init failure
`tools-server.*unreachable`	A12	Tools-server down

Priority 2: Runtime Failures

Pattern	Severity	Action
`panicked at`, `fatal error`	Critical	Immediate escalation
`timeout`, `connection refused`	High	Infrastructure issue
`max retries exceeded`	High	Agent exhausted attempts
`permission denied.*filesystem`	Critical	Can't read/write files
`unauthorized\|invalid token`	Critical	Auth broken

Priority 3: Lifecycle Issues

Pattern	Meaning
`template not found`	Prompt template missing
`prompt.*missing`	Agent instructions not loaded
`role.*undefined`	Agent role not set
`task context.*empty`	Task details not injected

Dual-Model Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        DUAL-MODEL HEALER ARCHITECTURE                        │
│                                                                              │
│   DATA SOURCES                                                              │
│   ├─ Loki (all pod logs)                                                    │
│   ├─ Kubernetes (CodeRuns, Pods, Events)                                    │
│   ├─ GitHub (PRs, comments, CI status)                                      │
│   └─ CTO Config (expected tools, agent settings)                            │
│                              │                                               │
│                              ▼                                               │
│   MODEL 1: EVALUATION AGENT                                                 │
│   ├─ Parses and comprehends ALL logs                                        │
│   ├─ Correlates events across agents                                        │
│   ├─ Identifies root cause                                                  │
│   └─ Creates GitHub Issue with analysis                                     │
│                              │                                               │
│                              ▼                                               │
│   MODEL 2: REMEDIATION AGENT                                                │
│   ├─ Reads the GitHub issue                                                 │
│   ├─ Implements the fix                                                     │
│   ├─ Creates PR with changes                                                │
│   └─ Marks issue resolved                                                   │
└─────────────────────────────────────────────────────────────────────────────┘

Session Notification Flow

MCP play() call
    │
    ▼
POST /api/v1/session/start
    │
    └─ Payload: {
         play_id,
         repository,
         cto_config: { agents, tools },
         tasks: [...]
       }
    │
    ▼
Healer stores session with expected tools per agent
    │
    ▼
CodeRuns start with Healer already aware

Watch Logs

Pod Logs

# Watch all CTO pods
kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100

# Watch specific agent CodeRun
kubectl logs -n cto -l app=coderun -f

Loki Query

{namespace="cto"} |= "error" | json

Pre-Flight Checklist (Verify within 60s)

For every agent run, Healer verifies:

Prompts

Agent type identified
Role matches task
Template loaded
Language context set

MCP Tools (from CTO Config)

CTO config loaded
Remote tools accessible
Local servers initialized
Tools-server reachable

Escalation

When issues detected:

Evaluation Agent creates GitHub issue with root cause
Remediation Agent attempts fix (if automatable)
Discord notification for P0/P1 critical issues
Human escalation if remediation fails

Configuration

In cto-config.json:

{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}

Reference Documentation

docs/heal-play.md - Full Healer specification
crates/healer/ - Healer implementation
crates/healer/src/scanner.rs - Detection patterns

Score

Total Score

65/100

Based on repository quality metrics

✓SKILL.md

SKILL.mdファイルが含まれている

+20

✓LICENSE

ライセンスが設定されている

+10

○説明文

100文字以上の説明がある

0/10

○人気

GitHub Stars 100以上

0/15

✓最近の活動

1ヶ月以内に更新

+10

○フォーク

10回以上フォークされている

0/5

✓Issue管理

オープンIssueが50未満

✓言語

プログラミング言語が設定されている

✓タグ

1つ以上のタグが設定されている

Reviews

💬

Reviews coming soon

healer

SKILL.md

name: healer description: Healer monitoring expertise - detection patterns, API endpoints, dual-model architecture, and remediation workflows. Use when monitoring Play workflows or debugging agent failures.

Healer Skill

When to Use

Healer API Endpoints

Check Active Sessions

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Priority 2: Runtime Failures

Priority 3: Lifecycle Issues

Dual-Model Architecture

Session Notification Flow

Watch Logs

Pod Logs

Loki Query

Pre-Flight Checklist (Verify within 60s)

Prompts

MCP Tools (from CTO Config)

Escalation

Configuration

Reference Documentation

Score

Reviews

greeter

pr-creator

code-reviewer

skill-creator

browser-use

git-workflow

healer

SKILL.md

name: healer description: Healer monitoring expertise - detection patterns, API endpoints, dual-model architecture, and remediation workflows. Use when monitoring Play workflows or debugging agent failures.

Healer Skill

When to Use

Healer API Endpoints

Check Active Sessions

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Priority 2: Runtime Failures

Priority 3: Lifecycle Issues

Dual-Model Architecture

Session Notification Flow

Watch Logs

Pod Logs

Loki Query

Pre-Flight Checklist (Verify within 60s)

Prompts

MCP Tools (from CTO Config)

Escalation

Configuration

Reference Documentation

Score

Reviews

Related

Related Skills

greeter

pr-creator

code-reviewer

skill-creator

browser-use

git-workflow