GPU Shield is a comprehensive runtime security monitoring system for GPU workloads, providing real-time telemetry collection, integrity verification, and anomaly detection for GPU-accelerated applications.
- Real-time GPU Telemetry: Collect GPU memory utilization, temperature, power consumption, and performance metrics
- Security Monitoring: Integrity verification, anomaly detection, and access control monitoring
- Kubernetes Native: Deploy as DaemonSet with proper RBAC and security contexts
- Multi-vendor Support: NVIDIA (via nvidia-smi and DCGM), AMD (planned), Intel (planned)
- Scalable Architecture: Distributed sensor-collector-alert architecture
- Standards Compliant: Protobuf/gRPC APIs, Prometheus metrics, SBOM generation
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Sensor    │────▶│  Collector   │────▶│ Alert Engine │
│ (DaemonSet)  │     │ (Deployment) │     │ (Deployment) │
└──────────────┘     └──────────────┘     └──────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   GPU Node   │     │  Telemetry   │     │    Alerts    │
│   Metrics    │     │   Database   │     │ & Responses  │
└──────────────┘     └──────────────┘     └──────────────┘
```
- Protobuf Schemas: Comprehensive telemetry and integrity message definitions
- Sensor Implementation: Functional GPU metrics collection using nvidia-smi
- Helm Chart: Complete Kubernetes deployment with DaemonSet, RBAC, and security contexts
- CI/CD Pipeline: GitHub Actions with linting, testing, security scanning, and SBOM generation
- Build System: Makefile with protobuf generation, building, and testing targets
- DCGM Integration: Enhanced GPU metrics collection
- Collector Service: Telemetry aggregation and storage
- Alert Engine: Security event processing and response
- AMD GPU Support: ROCm and rocprofiler integration
- Intel GPU Support: Level Zero and Intel GPU metrics
- Advanced Security: ML-based anomaly detection, behavioral analysis
- Dashboard: Grafana dashboards and visualization
- Go 1.24.3+
- Protocol Buffers compiler (`protoc`)
- Kubernetes cluster with GPU nodes
- Helm 3.x
- NVIDIA drivers and nvidia-smi (for NVIDIA GPUs)
- Clone the repository:

  ```bash
  git clone https://github.com/ShipKode/gpushield.git
  cd gpushield
  ```

- Install development tools:

  ```bash
  make install-tools
  ```

- Generate protobuf stubs:

  ```bash
  make proto
  ```

- Build the sensor:

  ```bash
  make build-sensor
  ```

- Test the sensor locally (requires nvidia-smi):

  ```bash
  ./bin/sensor --output=text --interval=10s
  ```

- Deploy with Helm:

  ```bash
  helm install gpu-shield ./helm/gpu-runtime-security
  ```

- Check deployment status:

  ```bash
  kubectl get daemonset -l app.kubernetes.io/name=gpu-runtime-security
  kubectl logs -l app.kubernetes.io/component=sensor
  ```

- View GPU metrics:

  ```bash
  kubectl logs -l app.kubernetes.io/component=sensor -f
  ```
The sensor supports the following configuration options:

```bash
./bin/sensor --help
```

Key options:

- `--interval`: Collection interval (default: 30s)
- `--log-level`: Logging level (debug, info, warn, error)
- `--output`: Output format (json, text)
- `--use-dcgm`: Use DCGM instead of nvidia-smi
- `--node-id`: Node identifier (defaults to hostname)
Key configuration values in `helm/gpu-runtime-security/values.yaml`:

```yaml
sensor:
  interval: 30        # Collection interval in seconds
  logLevel: info      # Log level
  useDCGM: false      # Use DCGM instead of nvidia-smi
nodeSelector:
  accelerator: nvidia # Target GPU nodes
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
```bash
# Build all components
make build

# Build individual components
make build-sensor
make build-collector
make build-alert

# Generate protobuf stubs
make proto-go
make proto-python

# Run Go tests
make test

# Run Python tests
make test-python

# Run linting
make lint
make lint-python

# Generate SBOM
make sbom

# Run security scans
make security-scan
```
The telemetry API is defined in `api/proto/telemetry.proto` and includes:
- TelemetryData: Complete GPU and system metrics
- GPUMetrics: Per-GPU device information
- MemoryMetrics: GPU memory utilization
- PerformanceMetrics: GPU performance counters
- SecurityMetrics: Security-related metrics
The integrity API is defined in `api/proto/integrity.proto` and includes:
- IntegrityReport: Comprehensive security assessment
- ComponentIntegrity: Per-component integrity verification
- SecurityEvent: Security incidents and alerts
- AttestationData: Hardware-based attestation
GPU Shield exposes Prometheus metrics for:
- GPU utilization and performance
- Memory usage and bandwidth
- Temperature and power consumption
- Security events and integrity status
Structured JSON logging with configurable levels:
- DEBUG: Detailed execution information
- INFO: General operational information
- WARN: Warning conditions
- ERROR: Error conditions requiring attention
Grafana dashboards are available for:
- GPU overview and performance
- Security events and alerts
- System health and status
The sensor runs with privileged access to:
- Access GPU devices and drivers
- Read system information from /proc and /sys
- Monitor container runtime sockets
- All communication uses gRPC with TLS
- RBAC controls limit Kubernetes API access
- Network policies can restrict traffic flow
- Sensitive configuration stored in Kubernetes secrets
- Metrics data encrypted in transit
- Optional data retention policies
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the full test suite
- Submit a pull request
- Go: Follow standard Go conventions, use `gofmt` and `golangci-lint`
- Python: Follow PEP 8, use `black` and `isort`
- Protobuf: Use consistent naming and documentation
This project is licensed under the Apache License 2.0. See LICENSE for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Project Wiki
- Complete collector and alert engine implementation
- DCGM integration for enhanced metrics
- Basic anomaly detection
- AMD GPU support via ROCm
- Advanced security features
- Performance optimizations
- Production-ready release
- Full multi-vendor GPU support
- Comprehensive security monitoring
- Enterprise features