The best open source large language model

Large language models (LLMs) are the defining category in generative AI. But with tens of thousands of options, it can be hard to feel confident you're making the right tradeoffs between output quality, speed, and cost, especially when models specialize in different tasks.

Drawing on technical specifications, customer conversations, and our own testing, we've put together this list of models to help you find the right starting point for building on open source text generation models for chat, code completion, retrieval-augmented generation (RAG), and other LLM use cases.

Best overall open source LLM: Llama 3.3 70B Instruct

Meta's Llama 3 family offers instruct-tuned models from 8B to 405B parameters with excellent benchmark performance. The latest release, Llama 3.3 70B Instruct, compares favorably to top closed-source models including GPT-4o. The models were trained on over 15 trillion tokens with an emphasis on code, and Llama 3.3 70B has a knowledge cutoff of December 2023 (versus March 2023 for the original Llama 3 8B).

What we love about Llama 3.3 70B Instruct:

  • 128k-token context window with excellent retrieval benchmarks for building RAG-type applications.

  • Along with Meta's systematic investments in safety, Llama 3 models have been instruct-tuned to reduce false refusal rates.

  • Strong code generation and mathematical reasoning capabilities in a general model.

  • New, more efficient tokenizer yields up to 15% fewer tokens than Llama 2's, reducing latency and cost per request.

What to watch out for with Llama 3.3 70B Instruct:

  • Llama 3.3 70B only supports eight languages, while many models support 2-3x as many.

  • Llama 3.3 models have a custom commercial license that also applies to any fine-tuned derivatives.

Get started with Llama 3.3 70B Instruct or try the smaller but still excellent Llama 3.1 8B Instruct.
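
If you're building on that 128k-token context window for RAG, the core pattern is to put retrieved passages into the prompt. Here's a minimal sketch against an OpenAI-compatible endpoint; the base URL, API key, model name, and passages are all placeholders:

```python
# Minimal RAG-style sketch: put retrieved passages in the prompt and call an
# OpenAI-compatible chat endpoint. URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_KEY")

retrieved_passages = [
    "Llama 3.3 70B Instruct supports a 128k-token context window.",
    "Instruct tuning reduces false refusal rates.",
]
context = "\n\n".join(retrieved_passages)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n\n{context}"},
        {"role": "user", "content": "What context window does the model support?"},
    ],
)
print(response.choices[0].message.content)
```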

The best big LLM: DeepSeek-V3

DeepSeek-V3 is a 671B-parameter open-source LLM that truly rivals closed-source heavyweights like Claude 3.5 Sonnet and GPT-4o. While other large models like Mistral Large 2 and Cohere Command R+ are also extremely powerful, DeepSeek-V3 is licensed for commercial use with some restrictions, including a prohibition on military use.

What we love about DeepSeek-V3:

  • Benchmarks favorably against the best closed-source models and backs up those scores with excellent observed real-world performance.

  • Massive 128k-token context window with excellent needle-in-the-haystack retrieval benchmarks across all input sequence lengths.

  • Additional multi-token prediction module to enable optimizations like speculative sampling.
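
That last point deserves a quick illustration. In speculative sampling, a cheap draft (a role DeepSeek-V3's multi-token prediction head can play) proposes tokens that the full model verifies in one pass, using an accept/reject rule that preserves the target distribution. A toy sketch of that rule, with plain probability vectors standing in for real model outputs:

```python
# Toy sketch of the speculative sampling accept/reject rule (not DeepSeek's
# actual implementation). p_target and p_draft are vocabulary-sized
# probability vectors from the target and draft models at one position.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, p_draft, drafted_token):
    """Accept the drafted token with probability min(1, p/q); otherwise
    resample from the renormalized residual max(0, p - q)."""
    ratio = p_target[drafted_token] / p_draft[drafted_token]
    if rng.random() < min(1.0, ratio):
        return drafted_token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False
```

Accepted tokens cost one verification pass for several positions, which is where the speedup comes from.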

What to watch out for with DeepSeek-V3:

  • The model is so large that it generally must be run in FP8 on H200 GPUs.

  • Mixture of Experts architecture can reduce throughput gains from batching, since tokens in a batch route to different experts.

  • Inference is expensive even with optimizations; for many use cases, a less powerful model like Llama 3.3 70B will suffice at a lower cost.

  • DeepSeek-V3 has a custom commercial license that also applies to any fine-tuned derivatives.

Contact us for access to DeepSeek-V3.

Best small LLM under 7 billion parameters: Phi 3 Mini

On the opposite end of the spectrum, Phi 3 Mini is an open source instruct-tuned LLM from Microsoft that achieves state-of-the-art performance for models of its size at just 3.8 billion parameters. Phi 3 Mini runs fast on cheap hardware, making it a strong option for low-cost inference.
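
As a quick sketch of what 'cheap hardware' looks like in practice, Phi 3 Mini runs with stock Hugging Face transformers on a single small GPU (the checkpoint name comes from Microsoft's release; adjust dtype and device for your setup):

```python
# Minimal local inference sketch for Phi 3 Mini with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # published checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain RAG in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```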

What we love about Phi 3 Mini:

  • Excellent output quality rivals 7B/8B LLMs from just a few months ago.

  • 128k-token context window variant allows for unprecedented use cases for models of this size class.

  • Permissive MIT license for unrestricted commercial use.

What to watch out for with Phi 3 Mini models:

  • While the LLM is outstanding for its class, output quality falls behind larger models, especially for factual recall.

  • The 4k-token context window variant consistently scores slightly higher on evals; only use the 128k-token variant when the increased context window is strictly necessary.

  • Phi 3 Mini is an English-only model.

Deploy Phi 3 Mini 4k (or the 128k variant) on a T4 GPU.

Another great LLM family: Mistral and Mixtral

Mistral AI is a foundation model lab founded in France that builds both open-source and proprietary language models. Their open-source models include three sizes of base and instruct-tuned foundation LLMs as well as vision models and domain-specific models for math and code.

What we love about Mistral models:

  • Apache 2.0 license on Mistral 7B, Mixtral 8x7B, and Mixtral 8x22B allows unrestricted commercial use.

  • Sparse Mixture of Experts architecture in the Mixtral models activates only a fraction of parameters per token, delivering large-model quality at lower inference cost.

  • Strong performance for each model's size across the family.

What to watch out for with Mistral models:

  • Batching model requests reduces efficiency gains from Mixture of Experts architecture for 8x7B and 8x22B models.

  • Newer models like Mistral Large are not licensed for commercial use.

  • Light-touch alignment may not be suitable for all use cases.

Choose a model from the Mistral family: 7B, 8x7B, 8x22B, or Pixtral!

Best ML model for code generation: Qwen 2.5 Coder

Qwen Coder is a project by Alibaba to finetune their Qwen 2.5 family of models to specialize in code generation tasks. The Qwen Coder family has three sizes for server-side inference (7B, 14B, and 32B) and three sizes for edge inference (0.5B, 1.5B, 3B) with both Base and Instruct variants.

What we love about Qwen Coder:

  • The 32B size matches Claude 3.5 Sonnet on some coding benchmarks.

  • Six sizes and two variants (Base and Instruct) for maximum flexibility.

  • Large context window (up to 128K tokens) is essential for working with code as context (code is much more token-dense than natural language).
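
That token density is easy to measure for your own inputs. A quick sketch using the Qwen 2.5 Coder tokenizer from Hugging Face (checkpoint name from Alibaba's release):

```python
# Compare tokens-per-character for prose vs. code with the Qwen 2.5 Coder tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

samples = {
    "prose": "The function returns the sum of the squares of the input list.",
    "code": "def sum_squares(xs):\n    return sum(x * x for x in xs)",
}
for label, text in samples.items():
    n_tokens = len(tok(text)["input_ids"])
    print(f"{label}: {len(text)} chars -> {n_tokens} tokens")
```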

What to watch out for:

  • The most powerful 72B Qwen 2.5 model does not yet have a Coder variant.

  • The 0.5B, 1.5B, and 3B edge models, while powerful for their size, are very limited in real-world use.

  • The Base variant is intended for finetuning, not for out-of-the-box code completion.

Deploy Qwen Coder 7B, 14B, or 32B optimized with TensorRT-LLM.

Best model for fine tuning: Llama 3.1

The Llama 3.1 family of LLMs offers the most flexibility for fine tuning projects across size (8B, 70B, 405B) and variant (base and instruct). Given Llama 3 models’ strong base performance, any model from the family is a powerful foundation to build on.
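
As a concrete starting point, here's a minimal LoRA sketch using Hugging Face peft. The checkpoint name, rank, and target modules are illustrative defaults rather than a tuned recipe, and the Llama 3.1 weights are gated behind Meta's license:

```python
# Minimal LoRA setup for a Llama 3.1 base model with Hugging Face peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B"  # gated checkpoint; requires license acceptance
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Attach low-rank adapters to the attention projections; only these weights train.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

From here you'd plug the wrapped model into your usual training loop or trainer of choice.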

What we love about Llama 3.1 for fine tuning:

  • Base models in three sizes (8B, 70B, and 405B) let you make tradeoffs between cost and performance.

  • The Llama 3.1 license explicitly allows building derivative models and using outputs for distillation (i.e., using Llama 3.1 as a teacher model).

  • Llama models have a history as popular choices for fine tuning work, so there’s plenty of research and tooling to build on.

What to watch out for:

  • Heavy-handed alignment in the instruct/chat variants of the models means you may need to start from scratch from the base variant.

  • Llama 3.1 models have a custom commercial license that also applies to any fine-tuned derivatives.

Experiment with Llama 3.1 8B Instruct on autoscaling infrastructure.

Can open source models replace OpenAI and ChatGPT?

Yes. Llama 3.3 70B compares favorably to GPT-4o on most benchmarks.

Newer open source LLMs like Llama 3.1 Nemotron 70B Instruct compare favorably to closed-source options like GPT-3.5, Gemini 1.5 Pro, and Claude 3 Sonnet. And with fine-tuning, open source models can match or beat the best closed-source models for specific tasks at much lower costs.

How much should I trust model evaluation benchmarks?

Model evaluation benchmarks measure an LLM’s performance on a fixed set of tasks. These benchmarks are designed to assess the accuracy and quality of the model’s output. While there is no universal standard benchmark, there are a number of popular options including ARC, HellaSwag, and MMLU.

There are legitimate concerns about the usefulness of evaluation benchmarks: they can be too narrow to fully capture a model’s performance, and more recently evaluation sets have been found leaking into models’ training data. These problems have partial solutions. It’s standard practice to look at a model’s average performance across several benchmarks to account for the limitations of any one benchmark, and researchers check for contamination of their training data before releasing models.

Benchmark performance is a solid signal when picking an LLM, but it isn’t the whole story. There’s no need to switch models every time a new variant comes out with a slight uptick in benchmark score; the most important thing is to evaluate model output for your exact use case.
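
A spot-check harness for your own use case can be just a few lines. In this hypothetical sketch, `generate` stands in for whatever model client you use, and the cases are your own prompts and expected answers:

```python
# Hypothetical spot-check eval: score model outputs against expected substrings.
def evaluate(generate, cases):
    hits = 0
    for prompt, expected in cases:
        output = generate(prompt)
        hits += int(expected.lower() in output.lower())
    return hits / len(cases)

cases = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]
# score = evaluate(my_model_client, cases)  # my_model_client is your own callable
```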

What about domain-specific language models?

In many cases, you can get better domain-specific performance for LLMs by fine-tuning open-source models or building custom models. Domain-specific models can rival frontier models on specific topics, like medicine or finance, with a fraction of the parameters, leading to more cost-efficient inference. For example, Writer built custom domain-specific LLMs for medicine and finance and optimized them for fast inference with Baseten.

The best open source LLM

There’s no one best open source LLM, only the LLM that’s best for you. This selection depends on capabilities, features, price point, and license. New models are released every day, and it can feel overwhelming to keep up. But finding the right model for your use case is possible with a bit of guidance and experimentation.

Deploy the best open source LLM for your use case in just a couple of clicks: