---
name: ai-ml-infra
description: KubeAI, GPU operators, and model serving patterns for AI/ML infrastructure on Kubernetes.
agents: [bolt]
triggers: [kubeai, gpu, model, inference, vllm, ollama, llm, ai, ml]
---

AI/ML Infrastructure

Model serving with KubeAI, GPU scheduling, and inference patterns.

Model Deployment Options

| Feature | KubeAI | Ollama Operator | LlamaStack |
|---|---|---|---|
| Backend | vLLM (GPU-optimized) | Ollama (easy) | Multi-backend |
| Scale from zero | Yes | No | No |
| OpenAI API | Native | Compatible | Compatible |
| Best for | Production GPU | CPU/mixed | Full AI stack |

KubeAI Setup

Model CRD

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3-8b
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "ollama://llama3.1:8b"
  engine: OLlama
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0      # Scale to zero
  maxReplicas: 3
  targetRequests: 10  # Scale up threshold
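
This example serves the model through KubeAI's Ollama engine. For production GPU serving, the same CRD can target vLLM instead; a minimal sketch, assuming the Hugging Face repo name below (gated repos also need an HF token configured):

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "hf://meta-llama/Llama-3.1-8B-Instruct"  # assumed repo name
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0
  maxReplicas: 3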

Resource Profiles

| Profile | GPUs | VRAM | Use Case |
|---|---|---|---|
| cpu | 0 | - | Embeddings, small models |
| nvidia-gpu-l4:1 | 1x L4 | 24GB | 8B models |
| nvidia-gpu-h100:1 | 1x H100 | 80GB | 70B models (quantized) |
| nvidia-gpu-h100:2 | 2x H100 | 160GB | Large models |
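
Profiles come from the KubeAI Helm values. To see what a given install actually configures (assuming a Helm release named kubeai in the kubeai namespace):

helm get values kubeai -n kubeai --all | grep -A 10 resourceProfiles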

Custom Resource Profile

resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      nvidia.com/gpu.product: "NVIDIA-L4"
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      nvidia.com/gpu: "1"
      cpu: "8"
      memory: "32Gi"

Accessing Models

OpenAI-Compatible API

# Port-forward
kubectl port-forward svc/kubeai -n kubeai 8000:80

# List models
curl http://localhost:8000/openai/v1/models

# Chat completion
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

In-Cluster Access

Workloads inside the cluster can reach KubeAI through its Service DNS name, so point the standard OpenAI environment variables at it:

env:
  - name: OPENAI_API_BASE
    value: "http://kubeai.kubeai.svc/openai/v1"
  - name: OPENAI_API_KEY
    value: "not-needed"  # KubeAI doesn't require auth

SDK Usage

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Hello!" }],
});
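
The same client supports streaming, which avoids waiting for the full completion on long generations; a sketch using the openai package's streaming interface:

const stream = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Write a haiku about GPUs." }],
  stream: true,  // tokens arrive as they are generated
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}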

GPU Operator

The NVIDIA GPU Operator installs and manages GPU drivers, the container toolkit, and the device plugin on GPU nodes.

Verify GPU Nodes

# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.product

# Check GPU allocations
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Check device plugin
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

GPU Pod Scheduling

spec:
  containers:
    - name: gpu-app
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-L4"
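
To confirm scheduling works end to end, a throwaway pod that just runs nvidia-smi is a quick check; a sketch, assuming the public CUDA base image tag below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

If the drivers and device plugin are healthy, kubectl logs gpu-smoke-test prints the usual GPU table.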

Model Selection Guide

| Model | Size | GPU Req | Best For |
|---|---|---|---|
| llama3.1:8b | 8B | 1x L4 | General, coding |
| llama3.1:70b | 70B | 2x H100 | Complex reasoning |
| qwen2.5-coder | 7B | 1x L4 | Code generation |
| nomic-embed-text | 137M | CPU | Embeddings |
| deepseek-r1 | 1.5B | CPU | Light reasoning |
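
The CPU-only embedding row maps to a small KubeAI Model; a minimal sketch, assuming the TextEmbedding feature and the ollama:// pull path work for this model in your KubeAI version:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: nomic-embed-text
  namespace: kubeai
spec:
  features: [TextEmbedding]
  url: "ollama://nomic-embed-text"
  engine: OLlama
  resourceProfile: cpu:1
  minReplicas: 1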

Ollama Operator (Alternative)

Simpler setup for Ollama models:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi4
  namespace: ollama-operator-system
spec:
  image: phi4
  resources:
    limits:
      nvidia.com/gpu: "1"

Access:

kubectl port-forward svc/ollama-model-phi4 -n ollama-operator-system 11434:11434
ollama run phi4
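
With the port-forward active, Ollama's native HTTP API is reachable too; for example:

curl http://localhost:11434/api/generate \
  -d '{"model": "phi4", "prompt": "Hello!", "stream": false}'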

Validation Commands

# Check KubeAI models
kubectl get models -n kubeai
kubectl describe model <name> -n kubeai

# Check a model's pods (KubeAI labels them with the model name)
kubectl get pods -n kubeai -l model=<name>

# Check GPU utilization
kubectl exec -n kubeai <pod> -- nvidia-smi

# Test API (from inside the cluster; use port-forward otherwise)
curl http://kubeai.kubeai.svc/openai/v1/models

Troubleshooting

Model not starting

# Check model status
kubectl describe model <name> -n kubeai

# Check pod events
kubectl get events -n kubeai --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n kubeai -l model=<name>

Out of memory (OOM)

Reduce the engine's context length and GPU memory headroom via spec.args (vLLM flags shown):

spec:
  args:
    - --max-model-len=4096      # Reduce from 8192
    - --gpu-memory-utilization=0.8  # Reduce from 0.9

Slow first response

Set minReplicas to keep the model warm:

spec:
  minReplicas: 1  # Always keep one running

Best Practices

  1. Use scale-from-zero - Set minReplicas: 0 to save resources
  2. Right-size GPU profiles - Don't over-allocate expensive GPUs
  3. Use vLLM for production - Better throughput than Ollama
  4. Monitor GPU memory - Set appropriate gpu-memory-utilization
  5. Keep frequently-used models warm - minReplicas: 1
  6. Use OpenAI-compatible API - Easy integration with existing code
