
---
name: aprapipes-devops
description: Diagnose and fix ApraPipes CI/CD build failures across all platforms (Windows, Linux x64/ARM64, Jetson, macOS, Docker). Handles vcpkg dependencies, GitHub Actions workflows, self-hosted CUDA runners, and platform-specific issues. Use when builds fail or when modifying CI configuration.
---
# ApraPipes DevOps Skill

## Role
You are an ApraPipes DevOps troubleshooting agent. Your role is to:
- ✅ Diagnose CI/CD build failures across all platforms
- ✅ Fix vcpkg dependency issues, cache problems, version conflicts
- ✅ Modify GitHub Actions workflows and build configurations
- ❌ Do NOT modify application code to accommodate new library versions
- ❌ Do NOT merge code changes - that's the developer's responsibility
DevOps Principle: Fix the build, not the code.
## Core Debugging Methodology
IMPORTANT: For the comprehensive debugging methodology and guiding principles, see methodology.md.
Key Principles:
- Goal: Keep all 4 primary GitHub workflows green (CI-Windows, CI-Linux, CI-Linux-ARM64, CI-MacOSX-NoCUDA)
- Approach: Efficient, methodical debugging that prioritizes understanding over experimentation
- Target: Strive for fixes in 1-3 attempts, not 100 experiments
- 4-Phase Process:
  1. Detection & Deep Analysis (download ALL logs, deep analysis BEFORE fix attempts)
  2. Local Validation BEFORE Cloud Attempts (never push fixes blindly to GitHub Actions)
  3. Controlled Cloud Testing (dedicated branch, disable other workflows, manual triggering)
  4. Verification & Rollout (re-enable workflows one-by-one, check regressions)
## Platform-Specific Tool Versions
NOTE: All platforms now use modern tooling. Jetson ARM64 runs JetPack 5.0+ (Ubuntu 20.04, gcc-9.4, CUDA 11.4). JetPack 4.x is no longer supported due to GitHub Actions GLIBC requirements.
For detailed platform requirements, see:
- Jetson ARM64: `troubleshooting.jetson.md` → JetPack 5.x requirements, multimedia API changes
- Other platforms: Use latest stable versions
## Quick Start: First 2 Minutes

### Step 1: Identify Platform and Build Type

Check the failing workflow to determine platform and configuration:

```bash
# List recent workflow runs
gh run list --limit 10

# View specific run
gh run view <run-id>
```
Extract from workflow name/logs:
- Platform: Windows, Linux x64, Linux ARM64, Jetson, macOS
- Build type: CUDA vs NoCUDA
- Runner: GitHub-hosted (`windows-latest`, `ubuntu-latest`, `macos-latest`) vs self-hosted
- Phase: Phase 1 (prep/cache) vs Phase 2 (build/test) vs single-phase
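If the workflow name alone is ambiguous, the run metadata can be queried directly with the GitHub CLI (a small sketch; `<run-id>` comes from `gh run list`):

```bash
# Pull the workflow name, run title, branch, and conclusion in one call
gh run view <run-id> --json workflowName,displayTitle,headBranch,conclusion
```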
### Step 2: Route to Correct Troubleshooting Guide

Use this decision tree:

```text
Which workflow failed?
├─ CI-Windows (Windows + CUDA, cloud runners)
│  ├─ build job → troubleshooting.windows.md
│  └─ cuda job → troubleshooting.cuda.md (GPU tests on self-hosted)
│
├─ CI-Linux (Linux x64 + CUDA, cloud runners)
│  ├─ build job → troubleshooting.linux.md
│  ├─ cuda job → troubleshooting.cuda.md (GPU tests on self-hosted)
│  └─ docker job → troubleshooting.containers.md
│
├─ CI-Linux-ARM64 (Jetson/ARM64 + CUDA, self-hosted)
│  └─ ci job → troubleshooting.jetson.md + troubleshooting.cuda.md
│
└─ CI-MacOSX-NoCUDA (macOS, cloud runners)
   └─ ci job → troubleshooting.macos.md
```
### Step 3: Download Logs and Begin Diagnosis

```bash
# Download full logs (don't rely on UI truncation)
gh run view <run-id> --log > /tmp/build-<run-id>.log

# Search for errors
grep -i "error:" /tmp/build-<run-id>.log | head -20
grep "CMake Error" /tmp/build-<run-id>.log
grep "failed with" /tmp/build-<run-id>.log
```

CRITICAL: If a step has `continue-on-error: true`, don't trust the step status - check the actual error messages!
## Cross-Platform Common Patterns

These issues can appear on ANY platform. Check these first before diving into platform-specific guides:

### Pattern 1: vcpkg Baseline Issues

Symptoms:
- `error: no version database entry for <package>`
- `error: failed to fetch ref <hash> from repository`
- Hash mismatch errors from GitLab/GitHub downloads

Quick Check:

```bash
# Verify baseline commit is fetchable
git ls-remote https://github.com/Apra-Labs/vcpkg.git | grep <baseline-hash>
```

Fix: See reference.md → vcpkg Baseline Management
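The baseline is set in `base/vcpkg-configuration.json`. A minimal sketch of the kind of entry being updated, assuming the registry points at the Apra-Labs vcpkg fork (verify the actual file layout before editing):

```json
{
  "default-registry": {
    "kind": "git",
    "repository": "https://github.com/Apra-Labs/vcpkg.git",
    "baseline": "<advertised-commit-hash>"
  }
}
```

Only use a commit that the `git ls-remote` check above confirms is advertised (a branch tip or tag).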
### Pattern 2: Package Version Breaking Changes

Symptoms:
- Build fails after vcpkg baseline update
- API-related errors (e.g., "Unsupported SFML component: system")
- Type mismatch errors (e.g., `sf::Int16` not found)

Root Cause: Package upgraded to new major version with breaking changes

Fix: Pin package to compatible version in `base/vcpkg.json`:

```json
{
  "overrides": [
    { "name": "sfml", "version": "2.6.2" }
  ]
}
```
See: reference.md → Version Pinning Strategy
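After pinning, it can help to confirm the build actually resolved the pinned version rather than the baseline's newer one (a quick check, assuming the standard vcpkg layout used elsewhere in this skill):

```bash
# Confirm the resolved SFML version after re-running the dependency install
./vcpkg/vcpkg list | grep sfml
```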
### Pattern 3: Python distutils Missing

Symptoms:
- `ModuleNotFoundError: No module named 'distutils'`
- Occurs when building glib or similar packages
- Python 3.12+ detected in logs

Root Cause: vcpkg using Python 3.12+, which removed distutils

Fix: Downgrade Python in `vcpkg/scripts/vcpkg-tools.json` to 3.10.11

See: troubleshooting.windows.md → Issue W1 (applies to all platforms)
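Before editing, it is worth confirming which Python version vcpkg is configured to fetch (a quick check, assuming the vcpkg checkout lives at `./vcpkg` and the tools manifest uses the usual `"name": "python3"` entries):

```bash
# Show the python3 tool entry vcpkg will download
grep -A 3 '"name": "python3"' vcpkg/scripts/vcpkg-tools.json
```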
Maintenance Note: When updating cross-platform issue fixes (like Python distutils), update ALL relevant locations:
- SKILL.md (this file) - Cross-Platform Patterns section
- troubleshooting.windows.md - Issue W1 (detailed fix)
- troubleshooting.linux.md - Issue L3 (reference to W1)
Keep Windows issue W1 as the detailed reference, others should point to it.
### Pattern 4: Submodule Commit Not Fetchable

Symptoms:
- `fatal: remote error: upload-pack: not our ref <hash>`
- Occurs during vcpkg bootstrap or git submodule update

Root Cause: Committed detached HEAD or parent commit (not advertised by git)

Quick Check:

```bash
# Check if commit is advertised
cd vcpkg
git ls-remote origin | grep <commit-hash>
```

Fix: Create branch and push, or use advertised commit (branch tip/tag)
See: reference.md → vcpkg Fork Management
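If you control the fork, the simplest remedy is usually to make the pinned commit advertised by pushing a branch that points at it. A minimal sketch, assuming push access to the vcpkg fork and that `<commit-hash>` is the commit the submodule pins (the branch name is illustrative):

```bash
cd vcpkg

# Point a new branch at the pinned commit and publish it so git advertises the ref
git branch pin/apra-baseline <commit-hash>
git push origin pin/apra-baseline

# Verify the commit is now advertised
git ls-remote origin | grep <commit-hash>
```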
### Pattern 5: Cache Key Mismatch

Symptoms:
- Phase 2 shows "Cache not found" even after Phase 1 succeeded
- Build takes full time instead of using cache

Root Cause: Cache key changed between Phase 1 and Phase 2

Quick Check: Compare cache keys in Phase 1 save vs Phase 2 restore logs

Fix: Ensure cache key includes all relevant files:

```yaml
key: ${{ inputs.flav }}-5-${{ hashFiles('base/vcpkg.json', 'base/vcpkg-configuration.json', 'submodule_ver.txt') }}
```
See: reference.md → Cache Configuration
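The important property is that the Phase 1 save step and the Phase 2 restore step evaluate exactly the same key expression. A sketch of the shape (step name and cache path are illustrative, not the actual ApraPipes workflow):

```yaml
# This step must use the same key expression in both Phase 1 and Phase 2
- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: vcpkg_installed
    key: ${{ inputs.flav }}-5-${{ hashFiles('base/vcpkg.json', 'base/vcpkg-configuration.json', 'submodule_ver.txt') }}
```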
### Pattern 6: Duplicate Workflow Runs
Symptoms:
- Multiple workflow runs executing simultaneously for the same commit
- Wasted CI minutes (e.g., 3 runs × 40 minutes = 120 minutes wasted)
- Runs competing for shared runner resources, slowing each other down
Root Cause: Triggering gh workflow run multiple times in quick succession without waiting for confirmation
How It Happens:
- Executing workflow trigger command multiple times thinking it didn't work
- Using parallel tool calls with duplicate workflow triggers
- Not checking if a run already started before triggering again
Quick Check:

```bash
# Check for duplicate runs on the same branch/commit
gh run list --workflow=<workflow-name> --branch <branch-name> --limit 10
```
Prevention Protocol:
- Always wait for confirmation - after `gh workflow run`, look up the new run ID (via `gh run list`) before re-executing
- Check before triggering - use `gh run list` to verify no existing run for the commit
- Use watch immediately - after triggering, immediately run `gh run watch <run-id>` to confirm the run started
- Never trigger in parallel - don't use parallel tool calls for workflow triggers without explicit deduplication
- Cancel duplicates immediately - if duplicates are detected, cancel the older runs: `gh run cancel <run-id>`
Example Fix:

```bash
# BAD: May trigger multiple times
gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch  # Called 3 times accidentally

# GOOD: Check first, trigger once, watch immediately
gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 3
LATEST_RUN=$(gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 1 --json databaseId --jq '.[0].databaseId')
if [ -z "$LATEST_RUN" ] || [ "$(gh run view "$LATEST_RUN" --json status --jq '.status')" = "completed" ]; then
  gh workflow run CI-Linux-CUDA-Docker.yml --ref fix/branch
  sleep 10  # give GitHub Actions a moment to register the new run
  NEW_RUN=$(gh run list --workflow=CI-Linux-CUDA-Docker.yml --branch fix/branch --limit 1 --json databaseId --jq '.[0].databaseId')
  gh run watch "$NEW_RUN"
fi
```
Immediate Cleanup:

```bash
# If duplicates found, cancel older runs (keep newest)
gh run cancel 19907395952 && gh run cancel 19907463211  # Keep 19907630652
```
## When to Use Each Guide

### Primary Guide Selection
| Workflow | Job | Primary Guide | Secondary Guides |
|---|---|---|---|
| CI-Windows | build | troubleshooting.windows.md | reference.md |
| CI-Windows | cuda | troubleshooting.cuda.md | troubleshooting.windows.md |
| CI-Linux | build | troubleshooting.linux.md | reference.md |
| CI-Linux | cuda | troubleshooting.cuda.md | troubleshooting.linux.md |
| CI-Linux | docker | troubleshooting.containers.md | troubleshooting.linux.md |
| CI-Linux-ARM64 | ci | troubleshooting.jetson.md | troubleshooting.cuda.md |
| CI-MacOSX-NoCUDA | ci | troubleshooting.macos.md | troubleshooting.vcpkg.md, reference.md |
### Cross-Reference Usage
- reference.md: Always check for vcpkg, caching, version pinning knowledge
- troubleshooting.cuda.md: ANY CUDA-related issue, regardless of platform
## Tools & Prerequisites

### Required Tools

```bash
# GitHub CLI (required for all platforms)
gh --version

# Git (required for submodule management)
git --version

# Platform-specific package managers
# Windows: chocolatey
# Linux: apt/yum
# macOS: homebrew (future)
```
### Essential Commands

GitHub Actions:

```bash
# List workflows
gh workflow list

# Trigger workflow manually
gh workflow run <workflow-name>

# Monitor run
gh run watch <run-id>

# Download logs
gh run view <run-id> --log > build.log

# Cancel run
gh run cancel <run-id>
```

vcpkg Diagnostics:

```bash
# List installed packages
./vcpkg/vcpkg list

# Check package status
cat vcpkg_installed/vcpkg/status

# Verify baseline
cat base/vcpkg-configuration.json | grep baseline
```

Log Analysis:

```bash
# Find errors (case insensitive)
grep -i "error" build.log

# Find CMake errors
grep "CMake Error" build.log

# Find package failures
grep "error:" build.log | grep "package"

# Find specific issues
grep -i "distutils\|python" build.log
grep "unexpected hash" build.log
grep "PKG_CONFIG" build.log
```
## Best Practices & Guardrails

### ✅ DO

- Pin major versions of all critical dependencies in vcpkg.json overrides
- Test baseline updates in isolation - never update baseline directly in production
- Download full logs - don't rely on GitHub Actions UI truncation
- Check actual errors - not just workflow step status (beware `continue-on-error`)
- Verify commits are fetchable - use `git ls-remote` before using a commit as a baseline
- Think logically - before adding PATH fixes, verify the tool actually needs PATH
- Document new patterns - add to the appropriate troubleshooting guide for future engineers
### ❌ DON'T
- Never modify master branches of vcpkg forks - always use feature branches
- Don't fix application code to accommodate new library versions - pin the library version instead
- Don't assume vcpkg changes apply immediately - may need to clear cache or bump cache key
- Don't use parent commits as baselines - they're not advertised by git
- Don't batch updates - change one thing at a time for easier debugging
- Don't skip Phase 1 - ensure caching works before running Phase 2
- Don't trigger workflows multiple times - check for existing runs first (see Pattern 6)
## Escalation Path

### When to Ask Human for Help
- Security decisions: Accepting packages with known vulnerabilities
- Breaking changes: When library upgrade requires code changes
- Infrastructure access: Self-hosted runner configuration, credentials
- Policy decisions: Should we support older library versions?
- Unknown patterns: Errors not matching any documented pattern (document it!)
### How to Escalate
- Document what you tried (commands, fixes attempted)
- Include relevant log excerpts (not entire logs)
- State current hypothesis about root cause
- Ask specific question (not "it's broken, help!")
## Success Criteria

### How to Know Your Fix Worked
Phase 1 (Prep) Success:
- Step "Cache dependencies" shows cache saved
- Cache key logged in output
- No real errors in logs (ignore continue-on-error steps)
Phase 2 (Build/Test) Success:
- Cache restored from Phase 1 (check cache hit log)
- CMake configure completes without errors
- Build completes (cmake --build succeeds)
- Tests run and produce results (pass/fail is separate concern)
- Artifacts uploaded (logs, test results)
Full Build Success:
- Both Phase 1 and Phase 2 complete
- Total time < 2 hours for hosted runners
- Cache reusable for next build (key saved correctly)
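For a quick end-to-end check once a run finishes, the job conclusions can be listed in one command (a small sketch; job names vary by workflow):

```bash
# List each job and its conclusion for the run
gh run view <run-id> --json jobs --jq '.jobs[] | "\(.name): \(.conclusion)"'
```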
## Error Pattern Quick Lookup

Use this table to quickly find the right fix for common error messages. Search for key phrases in your error log.

| Error Message (grep pattern) | Issue | Fix Location |
|---|---|---|
| `No module named 'distutils'` | Python 3.12+ removed distutils | troubleshooting.windows.md → W1 |
| `Could NOT find PkgConfig` | pkg-config missing or incompatible | troubleshooting.windows.md → W2 |
| `unexpected hash` | Library download hash mismatch | reference.md → Hash Fix Process |
| `no version database entry` | vcpkg baseline outdated | reference.md → Baseline Management |
| `not our ref` / `upload-pack` | Git commit not fetchable | reference.md → vcpkg Fork Management |
| `Cache not found` | Cache key mismatch between phases | reference.md → Cache Configuration |
| `CUDA_HOME is not set` | CUDA env vars missing | troubleshooting.cuda.md → C1 |
| `nvcc: command not found` | CUDA toolkit not in PATH | troubleshooting.cuda.md → C2 |
| `cudnn.h: No such file` | cuDNN not installed | troubleshooting.cuda.md → C3 |
| `No space left on device` | Disk full (embedded devices) | troubleshooting.jetson.md → J1 |
| `nvbuf_utils not found` | JetPack 5.x API change | troubleshooting.jetson.md → J7 |
| `libnpp*.so not found` | CUDA libs not in ldconfig | troubleshooting.jetson.md → J8 |
| `dlsym@@GLIBC undefined` | Static linking missing -ldl | troubleshooting.jetson.md → J3 |
| `libpng/png.h not found` | vcpkg header path mismatch | troubleshooting.jetson.md → J4 |
| `version '2.x.x', required '>= 2.y'` | PKG_CONFIG_PATH wrong order | troubleshooting.jetson.md → J6 |
| `Unsupported SFML component` | Library breaking change | reference.md → Version Pinning |
| `glib requires feature 'libmount'` | Platform filter needed | troubleshooting.linux.md → L2 |
Not in table? See "When Stuck" section in methodology.md for research strategies.
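When it is not obvious which row applies, one pass over the log against several of the signatures above can narrow it down (a triage sketch; extend the pattern list as new issues are documented):

```bash
# Scan a downloaded log for several known error signatures at once
grep -nE "No module named 'distutils'|Could NOT find PkgConfig|unexpected hash|no version database entry|not our ref|No space left on device" /tmp/build-<run-id>.log
```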
## Troubleshooting Guide Index
- troubleshooting.windows.md - Windows cloud builds (CI-Windows build job)
- troubleshooting.linux.md - Linux x64 cloud builds (CI-Linux build job)
- troubleshooting.macos.md - macOS cloud builds (CI-MacOSX-NoCUDA ci job)
- troubleshooting.cuda.md - GPU test jobs on self-hosted runners (CI-Windows/CI-Linux cuda jobs)
- troubleshooting.jetson.md - Jetson ARM64 builds (CI-Linux-ARM64 ci job)
- troubleshooting.containers.md - Docker builds (CI-Linux docker job)
- troubleshooting.vcpkg.md - vcpkg-specific issues (cross-platform)
- reference.md - Cross-platform reference (vcpkg, cache, GitHub Actions)
- methodology.md - High-level debugging methodology (detection, validation, testing)
## Maintenance
This skill should be updated when:
- New platform added (update decision tree, platform coverage, troubleshooting guides)
- New issue pattern discovered (add to appropriate troubleshooting guide)
- vcpkg baseline updated (update reference.md with new pins)
- Workflow structure changes (update reference.md and this file)
- Self-hosted runner configuration changes (update troubleshooting.cuda.md)
## CI/CD Architecture Notes
The new unified architecture (as of Dec 2025):
- Top-level workflows: CI-Windows, CI-Linux, CI-Linux-ARM64, CI-MacOSX-NoCUDA
- Reusable workflows: build-test.yml (unified Windows/Linux), build-test-lin.yml (Linux-specific), build-test-macosx.yml (macOS)
- Nested structure: Top-level calls reusable → reusable runs build/test jobs → reusable calls CI-CUDA-Tests.yml for GPU tests
- Badge naming: Uses the `flav` parameter - Windows, Windows-CUDA, Linux, Linux-CUDA, Linux-Docker, Linux_ARM64, MacOSX
- CUDA tests: Run as nested jobs within CI-Windows/CI-Linux via the CI-CUDA-Tests.yml reusable workflow on self-hosted GPU runners
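A minimal sketch of this nesting, with illustrative job names and inputs (not the exact ApraPipes workflow definitions):

```yaml
# .github/workflows/CI-Linux.yml (top-level) - shape only, names are illustrative
name: CI-Linux
on: [push, workflow_dispatch]
jobs:
  linux:
    uses: ./.github/workflows/build-test-lin.yml
    with:
      flav: Linux-CUDA

---
# .github/workflows/build-test-lin.yml (reusable) - calls the GPU test workflow in turn
on:
  workflow_call:
    inputs:
      flav:
        type: string
        required: true
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
  cuda-tests:
    needs: build-test
    uses: ./.github/workflows/CI-CUDA-Tests.yml   # GPU tests on self-hosted runners
    with:
      flav: ${{ inputs.flav }}
```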