Troubleshooting the "float16 is only supported on GPUs with compute capability at least xx" Error in vLLM
TOC
- Problem Description
  - Environment
  - Symptoms
  - Related Logs
- Root Cause
  - Primary Cause
  - Technical Analysis
- Troubleshooting
  - Step 1: Verify GPU Compute Capability
  - Step 2: Check Model Precision Requirements
  - Step 3: Validate Framework Compatibility
- Solution
  - Solution for Insufficient Compute Capability
    - Considerations
    - Prerequisites
    - Steps
- Preventive Measures
- Related Content
  - GPU Compute Capability Reference
  - Official References

Problem Description
Environment
- Hardware: NVIDIA GPUs with compute capability <8.0 (e.g., Tesla V100, T4)
- Model Types: LLMs requiring bfloat16/FP8 precision (e.g., LLaMA-2-70B, GPT-NeoX-20B)
Symptoms
- Explicit error message at startup reporting that the requested dtype is unsupported for the GPU's compute capability
- Failed kernel compilation during model loading
Related Logs
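The original log excerpt is not preserved here. A representative vLLM error on a pre-Ampere GPU looks like the following (exact wording and the reported device name vary by vLLM version and hardware):

```
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
Your Tesla V100-SXM2-16GB GPU has compute capability 7.0. You can use float16 instead
by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
```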
Root Cause
Primary Cause
Insufficient GPU Compute Capability. The GPU's compute capability (CC) does not meet the minimum requirement for specific data types:
- bfloat16/FP8: Requires CC ≥8.0 (Ampere architecture or newer)
- FP16 Tensor Core Optimization: Requires CC ≥7.0 (Volta architecture or newer)
Technical Analysis
- Architecture Limitations:
  - Pre-Ampere GPUs (CC <8.0) lack dedicated matrix math units for bfloat16 operations
  - Tensor Cores in Volta/Turing (CC 7.0-7.5) only support FP16/FP32 mixed precision
- Framework Enforcement: vLLM validates the requested dtype against the detected compute capability at startup and aborts before loading model weights if the combination is unsupported.
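The enforcement can be sketched as a small startup guard (a simplified illustration, not vLLM's actual code; `bf16_supported` and `check_bf16_support` are hypothetical names):

```python
def bf16_supported(major: int, minor: int) -> bool:
    """Ampere (CC 8.0) and newer GPUs expose native bfloat16 support."""
    return (major, minor) >= (8, 0)

def check_bf16_support() -> None:
    """Abort early, mirroring the kind of guard vLLM runs at startup."""
    import torch  # lazy import keeps bf16_supported testable without CUDA
    major, minor = torch.cuda.get_device_capability()
    if not bf16_supported(major, minor):
        raise ValueError(
            f"bfloat16 requires compute capability >= 8.0, but this GPU "
            f"reports {major}.{minor}; retry with --dtype=half"
        )
```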
Troubleshooting
Step 1: Verify GPU Compute Capability
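One way to query the capability, assuming PyTorch is available in the serving image (on recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` gives the same information from the shell):

```python
def parse_compute_cap(text: str) -> tuple:
    """Parse a compute capability string such as '7.0' into (major, minor)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)

def gpu_compute_caps() -> list:
    """Return the compute capability of every visible CUDA device."""
    import torch  # lazy import so parse_compute_cap stays testable without CUDA
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]
```

Any device reporting a capability below (8, 0) cannot serve bfloat16/FP8 models.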
Step 2: Check Model Precision Requirements
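Hugging Face model repositories declare their default precision in `config.json` under `torch_dtype`; a quick check (the helper names and the local-path assumption are illustrative):

```python
import json
from typing import Optional

def dtype_from_config(cfg: dict) -> Optional[str]:
    """Return the model's declared default dtype (e.g. 'bfloat16'), if present."""
    return cfg.get("torch_dtype")

def model_dtype(config_path: str) -> Optional[str]:
    """Read torch_dtype from a downloaded Hugging Face config.json."""
    with open(config_path) as f:
        return dtype_from_config(json.load(f))
```

If the declared dtype is `bfloat16` and the GPU is pre-Ampere, an explicit `--dtype=half` override (or a different model variant) is needed.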
Step 3: Validate Framework Compatibility
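Confirm that the CUDA build PyTorch ships with meets the prerequisite below (≥11.8); a minimal version comparison sketch:

```python
from typing import Optional

def cuda_version_ok(version: str, minimum: str = "11.8") -> bool:
    """Numerically compare dotted CUDA version strings (e.g. '12.1' >= '11.8')."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)

def installed_cuda_version() -> Optional[str]:
    """Report the CUDA version PyTorch was built against, if any."""
    import torch  # lazy import so cuda_version_ok stays testable anywhere
    return torch.version.cuda
```

Note the numeric comparison: a naive string comparison would wrongly rank "11.10" below "11.8".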
Solution
Solution for Insufficient Compute Capability
Considerations
- Performance degradation expected when downgrading precision
- Model accuracy may vary with different precision types
Prerequisites
- CUDA Toolkit ≥11.8
Steps
- Modify the InferenceService YAML: add `--dtype=half` to the container args to force FP16
- Wait for the deployment to roll out and the pods to restart
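A minimal KServe InferenceService sketch with the override in place (the service name, image tag, and model path are illustrative, not taken from the original environment):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-vllm            # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest   # illustrative tag
        args:
          - --model=/mnt/models          # illustrative model path
          - --dtype=half                 # force FP16 on pre-Ampere GPUs
```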
Preventive Measures
- Pre-Flight Checks: verify the target GPU's compute capability against the model's required dtype before deploying
- Cluster Configuration: schedule bfloat16/FP8 workloads onto Ampere-or-newer (CC ≥8.0) nodes, e.g., via node selectors or GPU labels
- Model Optimization: prefer FP16 or quantized model variants when targeting pre-Ampere GPUs
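The pre-flight check can be folded into a small dtype-selection helper run before rendering the deployment manifest (a sketch with a hypothetical `pick_dtype` name; some vLLM versions apply a similar fallback on their own when `dtype` is left at `auto`):

```python
def pick_dtype(major: int, minor: int, requested: str = "bfloat16") -> str:
    """Fall back to float16 when the GPU predates Ampere (CC < 8.0)."""
    if requested in ("bfloat16", "fp8") and (major, minor) < (8, 0):
        return "float16"
    return requested
```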