Overview

Choose a causal language model (LM) whose memory footprint fits your GPU budget and whose generation speed meets your latency targets. The examples use models from the Qwen and Llama families.
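
As a rough way to check a candidate against a GPU budget, the sketch below estimates the memory needed for the weights alone from the parameter count and dtype. The helper names, the 20% headroom reserved for KV cache and activations, and the example sizes are all assumptions for illustration; real usage depends on the serving stack, batch size, and context length.

```python
# Bytes per parameter for common dtypes (weights only).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def fits_budget(num_params: float, dtype: str, gpu_gib: float,
                headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` (an assumed
    fraction of GPU memory) free for KV cache and activations."""
    return weight_memory_gib(num_params, dtype) <= gpu_gib * (1.0 - headroom)


if __name__ == "__main__":
    # Hypothetical example: a 7B-parameter model in bfloat16 on a 24 GiB GPU.
    print(f"{weight_memory_gib(7e9, 'bfloat16'):.2f} GiB")
    print(fits_budget(7e9, "bfloat16", gpu_gib=24))
```

This only bounds the weights; a longer context window or larger batch raises KV-cache memory, so treat a passing check as necessary rather than sufficient.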

Settings

  • model: HF repo id or internal identifier
  • dtype: numeric precision used for the model weights, e.g., bfloat16
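  • max_tokens: upper limit on the number of tokens generated per request
  • max_model_len: maximum context length the model is configured to handle
  • sampling_top_p: nucleus (top-p) sampling threshold
  • sampling_temperature: sampling temperature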
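
The settings above could be grouped and validated along the following lines. This is a minimal sketch, not the project's actual configuration class: the `Settings` name, the default values, the validation rules, and the placeholder model id `org/model-name` are assumptions. It also assumes `max_tokens` caps generated tokens and `max_model_len` is the total context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the settings listed above."""
    model: str                          # HF repo id or internal identifier
    dtype: str = "bfloat16"             # assumed default
    max_tokens: int = 256               # assumed default
    max_model_len: int = 4096           # assumed default
    sampling_top_p: float = 1.0         # assumed default
    sampling_temperature: float = 1.0   # assumed default

    def __post_init__(self) -> None:
        # Validation rules below are illustrative assumptions.
        if not self.model:
            raise ValueError("model must be a non-empty identifier")
        if self.max_tokens <= 0 or self.max_model_len <= 0:
            raise ValueError("max_tokens and max_model_len must be positive")
        if self.max_tokens > self.max_model_len:
            raise ValueError("max_tokens cannot exceed max_model_len")
        if not 0.0 < self.sampling_top_p <= 1.0:
            raise ValueError("sampling_top_p must be in (0, 1]")
        if self.sampling_temperature < 0.0:
            raise ValueError("sampling_temperature must be non-negative")


if __name__ == "__main__":
    # "org/model-name" is a placeholder, not a real repository.
    settings = Settings(model="org/model-name", max_tokens=512,
                        sampling_top_p=0.9, sampling_temperature=0.7)
    print(settings)
```

Validating at construction time means a bad value fails before any model is loaded, which is usually cheaper than discovering it mid-run.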