---
name: ai-ml-infra
description: KubeAI, GPU operators, and model serving patterns for AI/ML infrastructure on Kubernetes.
agents: [bolt]
triggers: [kubeai, gpu, model, inference, vllm, ollama, llm, ai, ml]
---

AI/ML Infrastructure

Model serving with KubeAI, GPU scheduling, and inference patterns.

Model Deployment Options

| Feature | KubeAI | Ollama Operator | LlamaStack |
|---|---|---|---|
| Backend | vLLM (GPU-optimized) | Ollama (easy) | Multi-backend |
| Scale from zero | Yes | No | No |
| OpenAI API | Native | Compatible | Compatible |
| Best for | Production GPU | CPU/mixed | Full AI stack |

KubeAI Setup

Model CRD

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3-8b
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "ollama://llama3.1:8b"
  engine: OLlama
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0      # Scale to zero
  maxReplicas: 3
  targetRequests: 10  # Scale up threshold
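
This example serves the model through KubeAI's Ollama engine. For production GPU serving, the same CRD can target vLLM instead; a minimal sketch, assuming the Hugging Face repo name below (gated repos also need an HF token configured):

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "hf://meta-llama/Llama-3.1-8B-Instruct"  # assumed repo name
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0
  maxReplicas: 3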

Resource Profiles

| Profile | GPUs | VRAM | Use Case |
|---|---|---|---|
| cpu | 0 | - | Embeddings, small models |
| nvidia-gpu-l4:1 | 1x L4 | 24GB | 8B models |
| nvidia-gpu-h100:1 | 1x H100 | 80GB | 70B models (quantized) |
| nvidia-gpu-h100:2 | 2x H100 | 160GB | Large models |
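
Profiles come from the KubeAI Helm values. To see what a given install actually configures (assuming a Helm release named kubeai in the kubeai namespace):

helm get values kubeai -n kubeai --all | grep -A 10 resourceProfiles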

Custom Resource Profile

resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      nvidia.com/gpu.product: "NVIDIA-L4"
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      nvidia.com/gpu: "1"
      cpu: "8"
      memory: "32Gi"

Accessing Models

OpenAI-Compatible API

# Port-forward
kubectl port-forward svc/kubeai -n kubeai 8000:80

# List models
curl http://localhost:8000/openai/v1/models

# Chat completion
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

In-Cluster Access

Workloads inside the cluster can reach KubeAI through its Service DNS name, so point the standard OpenAI environment variables at it:

env:
  - name: OPENAI_API_BASE
    value: "http://kubeai.kubeai.svc/openai/v1"
  - name: OPENAI_API_KEY
    value: "not-needed"  # KubeAI doesn't require auth

SDK Usage

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Hello!" }],
});
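
The same client supports streaming, which avoids waiting for the full completion on long generations; a sketch using the openai package's streaming interface:

const stream = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Write a haiku about GPUs." }],
  stream: true,  // tokens arrive as they are generated
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}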

GPU Operator

The NVIDIA GPU Operator installs and manages GPU drivers, the container toolkit, and the device plugin on GPU nodes.

Verify GPU Nodes

# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.product

# Check GPU allocations
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Check device plugin
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

GPU Pod Scheduling

spec:
  containers:
    - name: gpu-app
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-L4"
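
To confirm scheduling works end to end, a throwaway pod that just runs nvidia-smi is a quick check; a sketch, assuming the public CUDA base image tag below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

If the drivers and device plugin are healthy, kubectl logs gpu-smoke-test prints the usual GPU table.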

Model Selection Guide

| Model | Size | GPU Req | Best For |
|---|---|---|---|
| llama3.1:8b | 8B | 1x L4 | General, coding |
| llama3.1:70b | 70B | 2x H100 | Complex reasoning |
| qwen2.5-coder | 7B | 1x L4 | Code generation |
| nomic-embed-text | 137M | CPU | Embeddings |
| deepseek-r1 | 1.5B | CPU | Light reasoning |
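
The CPU-only embedding row maps to a small KubeAI Model; a minimal sketch, assuming the TextEmbedding feature and the ollama:// pull path work for this model in your KubeAI version:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: nomic-embed-text
  namespace: kubeai
spec:
  features: [TextEmbedding]
  url: "ollama://nomic-embed-text"
  engine: OLlama
  resourceProfile: cpu:1
  minReplicas: 1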

Ollama Operator (Alternative)

Simpler setup for Ollama models:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi4
  namespace: ollama-operator-system
spec:
  image: phi4
  resources:
    limits:
      nvidia.com/gpu: "1"

Access:

kubectl port-forward svc/ollama-model-phi4 -n ollama-operator-system 11434:11434
ollama run phi4
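
With the port-forward active, Ollama's native HTTP API is reachable too; for example:

curl http://localhost:11434/api/generate \
  -d '{"model": "phi4", "prompt": "Hello!", "stream": false}'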

Validation Commands

# Check KubeAI models
kubectl get models -n kubeai
kubectl describe model <name> -n kubeai

# Check a model's pods (KubeAI labels them with the model name)
kubectl get pods -n kubeai -l model=<name>

# Check GPU utilization
kubectl exec -n kubeai <pod> -- nvidia-smi

# Test API (from inside the cluster; use port-forward otherwise)
curl http://kubeai.kubeai.svc/openai/v1/models

Troubleshooting

Model not starting

# Check model status
kubectl describe model <name> -n kubeai

# Check pod events
kubectl get events -n kubeai --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n kubeai -l model=<name>

Out of memory (OOM)

Reduce the engine's context length and GPU memory headroom via spec.args (vLLM flags shown):

spec:
  args:
    - --max-model-len=4096      # Reduce from 8192
    - --gpu-memory-utilization=0.8  # Reduce from 0.9

Slow first response

Set minReplicas to keep the model warm:

spec:
  minReplicas: 1  # Always keep one running

Best Practices

  1. Use scale-from-zero - Set minReplicas: 0 to save resources
  2. Right-size GPU profiles - Don't over-allocate expensive GPUs
  3. Use vLLM for production - Better throughput than Ollama
  4. Monitor GPU memory - Set appropriate gpu-memory-utilization
  5. Keep frequently-used models warm - minReplicas: 1
  6. Use OpenAI-compatible API - Easy integration with existing code
