πŸš€ Virtual Kubelet for RunPod


This Virtual Kubelet implementation provides seamless integration between Kubernetes and RunPod, enabling dynamic, cloud-native GPU workload scaling across local and cloud resources: automatically extend your cluster with on-demand GPUs without managing infrastructure.

🌟 Key Features

  • Dynamic Cloud Bursting: Automatically extend Kubernetes GPU capabilities using RunPod
  • Virtual Node Integration: Leverages Virtual Kubelet to create a bridge between your Kubernetes cluster and RunPod
  • On-Demand GPU Resources: Schedule GPU workloads to RunPod directly from your Kubernetes cluster
  • Cost Optimization: Only pay for GPU time when your workloads need it
  • Transparent Integration: Appears as a native Kubernetes node with standard scheduling
  • Intelligent Resource Selection: Choose GPU types based on memory requirements and price constraints
  • Automatic Lifecycle Management: Handles pod creation, monitoring, and termination across platforms
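To illustrate the resource-selection feature, the sketch below picks the cheapest GPU offer that satisfies a memory requirement under a price ceiling. The `gpuOffer` type and field names are illustrative stand-ins, not the project's actual types or the RunPod API schema.

```go
package main

import "fmt"

// gpuOffer is a hypothetical shape for a RunPod GPU listing; the field
// names are illustrative, not the real API schema.
type gpuOffer struct {
	Name       string
	MemoryGB   int
	PricePerHr float64
}

// cheapestGPU returns the lowest-priced offer that meets the memory
// requirement and stays at or under the price ceiling, or "" if none does.
func cheapestGPU(offers []gpuOffer, minMemGB int, maxPrice float64) string {
	best := ""
	bestPrice := maxPrice
	for _, o := range offers {
		if o.MemoryGB >= minMemGB && o.PricePerHr <= bestPrice {
			best = o.Name
			bestPrice = o.PricePerHr
		}
	}
	return best
}

func main() {
	offers := []gpuOffer{
		{"RTX A4000", 16, 0.17},
		{"RTX 3090", 24, 0.22},
		{"A100 80GB", 80, 1.19},
	}
	// Cheapest card with at least 16 GB under $0.50/hr.
	fmt.Println(cheapestGPU(offers, 16, 0.5)) // RTX A4000
}
```

The same filter-then-minimize shape works for any additional constraint (e.g. datacenter allow-lists) by adding conditions to the `if`.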

πŸ”§ How It Works

The controller implements the Virtual Kubelet provider interface to represent RunPod as a node in your Kubernetes cluster. When pods are scheduled to this virtual node, they are automatically deployed to RunPod's infrastructure.

The Virtual Kubelet acts as a bridge between Kubernetes and RunPod, providing a fully integrated experience:

  1. Node Representation:

    • Creates a virtual node in the Kubernetes cluster
    • Exposes RunPod resources as schedulable compute capacity
  2. Pod Lifecycle Management:

    • Creates pods on RunPod matching Kubernetes specifications
    • Tracks and synchronizes pod statuses
    • Handles pod termination and cleanup
  3. Resource Mapping:

    • Translates Kubernetes pod specifications to RunPod configurations
    • Manages pod annotations and metadata
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Kubernetes Cluster     β”‚                 β”‚     RunPod Cloud    β”‚
β”‚                         β”‚                 β”‚                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    RunPod API   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚                   β”‚  β”‚                 β”‚  β”‚               β”‚  β”‚
β”‚  β”‚  Regular Nodes    β”‚  β”‚                 β”‚  β”‚  GPU Instance β”‚  β”‚
β”‚  β”‚                   β”‚  β”‚                 β”‚  β”‚               β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                 β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                         β”‚                 β”‚                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚     ◄─────►     β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  RunPod           β”‚  β”‚                 β”‚  β”‚               β”‚  β”‚
β”‚  β”‚  Virtual Node     │──┼─────────────────┼─►│  GPU Instance β”‚  β”‚
β”‚  β”‚                   β”‚  β”‚                 β”‚  β”‚               β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                 β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                         β”‚                 β”‚                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ Prerequisites

  • Kubernetes cluster (v1.19+)
  • RunPod account and API key with appropriate permissions
  • kubectl configured to access your cluster
  • Go 1.19+ (for building from source)

πŸš€ Installation

Using Helm (Recommended)

# Install from GitHub Container Registry
helm install runpod-kubelet oci://ghcr.io/bsvogler/charts/runpod-kubelet \
  --namespace kube-system \
  --create-namespace \
  --set runpod.apiKey=<your-runpod-api-key>

Using an existing secret

# Create secret with RunPod API key
kubectl create secret generic runpod-api-secret \
  --namespace kube-system \
  --from-literal=RUNPOD_API_KEY=<your-runpod-api-key>

# Install with existing secret
helm install runpod-kubelet oci://ghcr.io/bsvogler/charts/runpod-kubelet \
  --namespace kube-system \
  --set runpod.existingSecret=runpod-api-secret

Using kubectl

# Create secret with RunPod API key
kubectl create secret generic runpod-kubelet-secrets \
  --namespace kube-system \
  --from-literal=RUNPOD_API_KEY=<your-runpod-api-key>

# Apply controller deployment
kubectl apply -f https://raw.githubusercontent.com/bsvogler/k8s-runpod-kubelet/main/deploy/kubelet.yaml

Configuration

Helm Configuration

Customize your deployment by creating a values.yaml file:

kubelet:
  reconcileInterval: 30
  maxGpuPrice: 0.5
  namespace: kube-system

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

Then install with:

helm install runpod-kubelet oci://ghcr.io/bsvogler/charts/runpod-kubelet \
  --namespace kube-system \
  --values values.yaml

Command-line Configuration

When running the binary directly, configure using flags:

./k8s-runpod-kubelet \
  --nodename=virtual-runpod \
  --namespace=kube-system \
  --max-gpu-price=0.5 \
  --log-level=info \
  --reconcile-interval=30

Common configuration options:

  • --nodename: Name of the virtual node (default: "virtual-runpod")
  • --namespace: Kubernetes namespace (default: "kube-system")
  • --max-gpu-price: Maximum price per hour for GPU instances (default: 0.5)
  • --reconcile-interval: Frequency of status updates in seconds (default: 30)
  • --log-level: Logging verbosity (debug, info, warn, error)
  • --datacenter-ids: Comma-separated list of preferred datacenter IDs for pod placement
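As a rough illustration of how these options and their defaults could be parsed with Go's standard flag package, consider the sketch below. It is not the project's actual parsing code (which lives in cmd/main.go) and omits flags such as --datacenter-ids.

```go
package main

import (
	"flag"
	"fmt"
)

// config mirrors the documented flags; the real binary's config type may differ.
type config struct {
	NodeName          string
	Namespace         string
	MaxGPUPrice       float64
	ReconcileInterval int
	LogLevel          string
}

// parseFlags binds the documented options, with the documented defaults.
func parseFlags(args []string) (*config, error) {
	c := &config{}
	fs := flag.NewFlagSet("k8s-runpod-kubelet", flag.ContinueOnError)
	fs.StringVar(&c.NodeName, "nodename", "virtual-runpod", "name of the virtual node")
	fs.StringVar(&c.Namespace, "namespace", "kube-system", "Kubernetes namespace")
	fs.Float64Var(&c.MaxGPUPrice, "max-gpu-price", 0.5, "max price per hour for GPU instances")
	fs.IntVar(&c.ReconcileInterval, "reconcile-interval", 30, "status update interval in seconds")
	fs.StringVar(&c.LogLevel, "log-level", "info", "debug, info, warn, or error")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	c, _ := parseFlags([]string{"--max-gpu-price=0.8", "--log-level=debug"})
	fmt.Println(c.NodeName, c.MaxGPUPrice, c.LogLevel) // virtual-runpod 0.8 debug
}
```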

πŸ” Usage

Once installed, the controller will register a virtual node named virtual-runpod in your cluster. To schedule a pod to RunPod, add the appropriate node selector and tolerations:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.7.1-base-ubuntu22.04
      command: [ "nvidia-smi", "-L" ]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    type: virtual-kubelet
  tolerations:
    - key: "virtual-kubelet.io/provider"
      operator: "Equal"
      value: "runpod"
      effect: "NoSchedule"

Annotations

You can customize the RunPod deployment using annotations. Add annotations to the pod or to the controlling job.

metadata:
  annotations:
    runpod.io/required-gpu-memory: "16" # Minimum GPU memory in GB
    runpod.io/templateId: "your-template-id"  # Use a specific RunPod template, e.g. to reuse preregistered registry authentication
    runpod.io/datacenter-ids: "datacenter1,datacenter2" # Comma-separated list of allowed datacenter IDs

Experimental: the annotations below are implemented here but do not currently work reliably with the RunPod API.

    runpod.io/cloud-type: "COMMUNITY"  # SECURE or COMMUNITY. In theory COMMUNITY and STANDARD should both work, but in tests only STANDARD produced results via the API, so this defaults to STANDARD.
    runpod.io/container-registry-auth-id: "your-auth-id"  # For private registries. Supported in theory, but so far the only working approach is preregistering credentials in a template and referencing it via runpod.io/templateId.
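Inside the kubelet, annotations like these have to be translated into a RunPod deployment request. The sketch below shows one way that translation could look; `deployRequest` and its fields are hypothetical, not the project's actual types.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deployRequest is an illustrative structure the kubelet might build from
// pod annotations before calling RunPod; it is not the project's real type.
type deployRequest struct {
	MinGPUMemoryGB int
	TemplateID     string
	DatacenterIDs  []string
}

// fromAnnotations reads the runpod.io/* annotations documented above,
// leaving zero values when an annotation is absent or malformed.
func fromAnnotations(ann map[string]string) deployRequest {
	req := deployRequest{}
	if v, ok := ann["runpod.io/required-gpu-memory"]; ok {
		if n, err := strconv.Atoi(v); err == nil {
			req.MinGPUMemoryGB = n
		}
	}
	req.TemplateID = ann["runpod.io/templateId"]
	if v, ok := ann["runpod.io/datacenter-ids"]; ok {
		req.DatacenterIDs = strings.Split(v, ",")
	}
	return req
}

func main() {
	req := fromAnnotations(map[string]string{
		"runpod.io/required-gpu-memory": "16",
		"runpod.io/datacenter-ids":      "datacenter1,datacenter2",
	})
	fmt.Println(req.MinGPUMemoryGB, req.DatacenterIDs) // 16 [datacenter1 datacenter2]
}
```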

Monitoring

Monitor the controller using Kubernetes tools:

# Check logs (Helm installation)
kubectl logs -n kube-system -l app.kubernetes.io/name=runpod-kubelet

# Check logs (kubectl installation)
kubectl logs -n kube-system deployment/runpod-kubelet

# Health checks
kubectl port-forward -n kube-system deployment/runpod-kubelet 8080:8080
curl http://localhost:8080/healthz  # Liveness probe
curl http://localhost:8080/readyz   # Readiness probe

# Check virtual node status
kubectl get nodes
kubectl describe node virtual-runpod

πŸ› οΈ Development

Building from source

# Clone the repository
git clone https://github.com/bsvogler/k8s-runpod-kubelet.git
cd k8s-runpod-kubelet

# Build the binary
go build -o k8s-runpod-kubelet ./cmd/main.go

# Run locally
./k8s-runpod-kubelet --kubeconfig=$HOME/.kube/config

πŸ§ͺ Testing

The project includes integration tests that deploy actual resources to RunPod. To run the integration tests:

export RUNPOD_API_KEY=<your-runpod-api-key>
export KUBECONFIG=<path-to-kubeconfig>
go test -tags=integration ./...

For regular unit tests that don't communicate with RunPod:

go test ./...

πŸ”„ Architecture Details

The controller consists of several components:

  • RunPod Client: Handles communication with the RunPod API
  • Virtual Kubelet Provider: Implements the provider interface to integrate with Kubernetes
  • Health Server: Provides liveness and readiness probes for the controller
  • Pod Lifecycle Management: Handles pod creation, status updates, and termination

⚠️ Limitations

  • No direct container exec/attach support (RunPod doesn't provide this capability)
  • Container logs not directly supported by RunPod
  • Job-specific functionality is limited to what RunPod API exposes

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under a Non-Commercial License - see the LICENSE file for details.

  • Permitted: Private, personal, and non-commercial use
  • Prohibited: Commercial use without explicit permission
  • Required: License and copyright notice

For commercial licensing inquiries or to discuss custom development work as a freelancer, please contact me at [email protected] or via benediktsvogler.com.
