Troubleshooting the "bfloat16 is only supported on GPUs with compute capability at least 8.0" Error in vLLM

Problem Description

Environment

  • Hardware: NVIDIA GPUs with compute capability <8.0 (e.g., Tesla V100, T4)
  • Model Types: LLMs requiring bfloat16/FP8 precision (e.g., LLaMA-2-70B, GPT-NeoX-20B)

Symptoms

  1. Explicit error message:
    ValueError: float16/bfloat16 is only supported on GPUs with compute capability at least 8.0
  2. Model loading aborts during vLLM's capability check, with a stack trace such as:
    # vLLM error stack trace
    File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/__init__.py", line 37, in _verify_cuda_compute_capability
        raise ValueError(
    ValueError: bfloat16 is only supported on GPUs with compute capability at least 8.0. Current GPU: Tesla V100-PCIE-16GB, compute capability 7.0
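
For reference, the error can be reproduced with a minimal sketch like the one below on any pre-Ampere GPU; the model name is only illustrative, and any checkpoint that requests bfloat16 triggers the same check.

from vllm import LLM

# On a CC < 8.0 GPU (e.g., V100), requesting bfloat16 raises the ValueError shown above
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="bfloat16")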

Root Cause

Primary Cause

Insufficient GPU compute capability: the GPU's compute capability (CC) does not meet the minimum requirement for specific data types:

  • bfloat16: Requires CC ≥8.0 (Ampere architecture or newer); native FP8 additionally requires CC ≥8.9 (Ada Lovelace/Hopper)
  • FP16 Tensor Core optimization: Requires CC ≥7.0 (Volta architecture or newer)

Technical Analysis

  1. Architecture Limitations:

    • Pre-Ampere GPUs (CC <8.0) lack dedicated matrix math units for bfloat16 operations
    • Tensor Cores in Volta/Turing (CC 7.0-7.5) only support FP16/FP32 mixed precision
  2. Framework Enforcement:

    # vLLM's capability check (simplified)
    import torch

    def _verify_cuda_compute_capability(min_required_cc=(8, 0)):
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) < min_required_cc:
            raise ValueError(
                f"Requires compute capability >= {min_required_cc}, got {major}.{minor}")

Troubleshooting

Step 1: Verify GPU Compute Capability

import torch
print(f"Compute Capability: {torch.cuda.get_device_capability()}")

Step 2: Check Model Precision Requirements

cat model/config.json | grep "torch_dtype"
# Expected output: "bfloat16" or "float16"
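
Equivalently in Python, assuming the same local model directory:

import json

# Read the checkpoint's declared dtype from its Hugging Face config
with open("model/config.json") as f:
    print(json.load(f).get("torch_dtype"))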

Step 3: Validate Framework Compatibility

import torch
# Mirrors vLLM's internal verification: bfloat16 needs compute capability >= 8.0
print(f"bfloat16 supported: {torch.cuda.get_device_capability() >= (8, 0)}")

Solution

Solution for Insufficient Compute Capability

Considerations

  • float16 has a narrower dynamic range than bfloat16, so models trained in bfloat16 can overflow or lose accuracy after the switch
  • Model output quality may therefore vary between precision types; spot-check generations after changing the dtype (see the example below)
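
A minimal spot check after forcing FP16 (the model name is illustrative; any short prompt works):

from vllm import LLM, SamplingParams

# Load in FP16 on a pre-Ampere GPU and generate a short completion to sanity-check output
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="half")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)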

Prerequisites

  • CUDA Toolkit ≥11.8
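
To confirm the CUDA runtime your PyTorch/vLLM build was compiled against:

import torch

# Should print 11.8 or newer
print(torch.version.cuda)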

Steps

  1. Modify the InferenceService YAML to force FP16 by adding --dtype=half to the container args:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama-2-service
      annotations:
        serving.kserve.io/enable-prometheus-scraping: 'true'
    spec:
      predictor:
        containers:
          - name: kserve-container
            image: vllm/vllm-serving:0.3.2
            args:
              - --model=meta-llama/Llama-2-7b-chat-hf
              - --dtype=half # Force FP16 precision
              - --tensor-parallel-size=1
            resources:
              limits:
                nvidia.com/gpu: '1'
  2. Wait for the deployment to roll out and the pods to restart, then verify the service is healthy (see the smoke test below).
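
Once the pods are ready, a quick smoke test against the service confirms the model now loads in FP16. The URL below is a placeholder; substitute the external endpoint reported by `kubectl get inferenceservice llama-2-service`, and this assumes the container runs vLLM's OpenAI-compatible server.

import requests

# Placeholder endpoint; the OpenAI-compatible server exposes a /v1/completions route
resp = requests.post(
    "http://llama-2-service.default.example.com/v1/completions",
    json={"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello", "max_tokens": 8},
    timeout=30,
)
print(resp.status_code, resp.json())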

Preventive Measures

  1. Pre-Flight Checks:

    # Fail fast before deployment if the GPU cannot serve the requested dtype
    import torch
    assert torch.cuda.get_device_capability() >= (8, 0), \
        "bfloat16 needs compute capability >= 8.0; launch with --dtype=half instead"
  2. Cluster Configuration:

    # Pin vLLM pods to nodes whose GPUs meet the capability requirement.
    # The label below comes from NVIDIA GPU Feature Discovery; exact keys and values vary by cluster.
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  3. Model Optimization (see the flag-selection sketch after this list):

    # Apply AWQ quantization; this requires an AWQ-quantized checkpoint,
    # and vLLM's AWQ kernels need CC >= 7.5 (Turing or newer)
    from vllm import LLM
    llm = LLM(model="TheBloke/CodeLlama-34B-AWQ",
              quantization="awq")
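
Pulling these measures together, a small helper can emit suitable serving flags for the detected GPU; the ≥7.5 threshold for AWQ support is an assumption worth re-checking against your vLLM release.

import torch

# Pick serving flags for the detected GPU: FP16 instead of bfloat16 on pre-Ampere cards,
# plus AWQ quantization where the kernels are supported (assumed CC >= 7.5)
major, minor = torch.cuda.get_device_capability()
flags = ["--dtype=half" if (major, minor) < (8, 0) else "--dtype=bfloat16"]
if (major, minor) >= (7, 5):
    flags.append("--quantization=awq")
print(" ".join(flags))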

GPU Compute Capability Reference

Architecture    CC Range    Supported Precisions
Volta           7.0-7.2     FP16 (Tensor Cores)
Turing          7.5         FP16/INT8
Ampere          8.0-8.7     bfloat16/TF32
Ada Lovelace    8.9         bfloat16/TF32/FP8
Hopper          9.0         bfloat16/TF32/FP8 with dynamic scaling

Official References

  1. NVIDIA Compute Capability Table
  2. vLLM Hardware Requirements