This tool helps you calculate the GPU memory required to serve large language models (LLMs) locally. Simply enter the number of model parameters (in billions), the quantization precision (in bits), and the overhead factor, then click "Calculate" to see the required GPU memory.
For example, for a model such as Llama3-8B-q4, enter 8 for the parameters and 4 for the quantization.
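The sketch below illustrates the kind of estimate this calculator performs, assuming memory (GB) ≈ parameters (billions) × (bits ÷ 8) × overhead; the function name, the default 1.2 overhead factor, and the exact formula are illustrative assumptions rather than the tool's published implementation.

```python
def estimate_gpu_memory(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough estimate of the GPU memory (GB) needed to serve a model."""
    bytes_per_param = quant_bits / 8                      # e.g. 4-bit -> 0.5 bytes per weight
    weight_memory_gb = params_billions * bytes_per_param  # billions of params * bytes each ~ GB
    return weight_memory_gb * overhead                    # overhead covers KV cache, activations, buffers

# Example from the text: Llama3-8B at 4-bit quantization
print(f"{estimate_gpu_memory(8, 4):.1f} GB")  # ~4.8 GB with an assumed 1.2x overhead factor
```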
The "8B" or "70B" in a model's name refers to the number of parameters (weights) in the model:
More parameters generally mean the model can capture more complex patterns and relationships, but they also require more memory and computational power.
Quantization is the process of reducing the precision of the numbers used to represent a model's parameters. This involves converting the model's weights from high precision (e.g., 32-bit floating point) to lower precision (e.g., 8-bit or even 4-bit integers). This reduction in precision can significantly decrease the amount of memory needed to store the model and the computational power required for inference.
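As a rough illustration of the savings (weights only, with the overhead factor excluded and figures that are purely illustrative), here is how an 8-billion-parameter model scales with precision:

```python
# Weight-only memory for an 8-billion-parameter model at common precisions
# (billions of parameters * bits / 8 ~ GB; overhead excluded).
params_billions = 8
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {params_billions * bits / 8:>5.1f} GB")
# 32-bit: 32.0 GB, 16-bit: 16.0 GB, 8-bit: 8.0 GB, 4-bit: 4.0 GB
```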
Quantization is particularly useful for deploying models on resource-constrained devices like mobile phones or embedded systems, where memory and computational power are limited. However, aggressive quantization can sometimes lead to a loss in model accuracy, so it is important to find a balance between efficiency and performance.
These notations indicate the level of quantization applied to the model's parameters: