This technique increases the number of parameters of a model while controlling cost and latency, because the model uses only a fraction of the total set of parameters per token. Concretely, Mixtral has 46.7B total parameters but uses only 12.9B parameters per token. It therefore processes input and generates output at the same speed and for the same cost as a 12.9B model. Mixtral is pre-trained on data extracted from the open Web – we train experts and routers simultaneously.
Source: Mixtral of experts | Mistral AI
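To make the "fraction of parameters per token" point concrete, here is a minimal PyTorch sketch of a sparse mixture-of-experts feed-forward block with a learned router that activates only the top-2 of 8 experts per token. The class name, dimensions, and expert architecture are illustrative assumptions for this sketch, not Mixtral's actual implementation.

```python
# Minimal sparse MoE sketch: only top_k of num_experts feed-forward networks
# run for each token, so compute per token scales with the active experts,
# not with the total parameter count. Illustrative, not Mixtral's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: a learned linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks; only top_k are used per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                               # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)   # choose top_k experts per token
        weights = F.softmax(top_vals, dim=-1)                 # normalize over the chosen experts
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = top_idx[:, slot] == expert_id           # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Each token passes through the shared router plus just 2 of the 8 expert FFNs,
# so per-token compute corresponds to the active subset of parameters.
layer = SparseMoELayer(d_model=512, d_ff=2048)
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

Because the router and the experts are trained together, the routing weights learn which experts should handle which tokens, which is the "we train experts and routers simultaneously" point in the quoted passage.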