Philip Kiely

Lead Developer Advocate

About

Philip Kiely is a software developer and author based out of Chicago. Originally from Clive, Iowa, he graduated from Grinnell College with honors in Computer Science. Philip joined Baseten in January 2022 and works across documentation, technical content, and developer experience. Outside of work, he's a lifelong martial artist, voracious reader, and, unfortunately, a Bears fan.

Community

Building performant embedding workflows with Chroma and Baseten

Integrate Chroma’s open-source vector database with Baseten’s fast inference engine for efficient, real-time embedding inference in your AI-native apps.
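
As a rough sketch of the pattern that post covers: embed text through a Baseten-hosted model over HTTP, then store and query the vectors in Chroma. The model URL is a placeholder and the endpoint's request/response JSON shape is an assumption about the deployed model; only the Chroma calls follow that library's actual API.

```python
import os

import chromadb
import requests

# Placeholder endpoint; a real Baseten deployment has its own model ID.
BASETEN_URL = "https://model-XXXXXXX.api.baseten.co/production/predict"
API_KEY = os.environ["BASETEN_API_KEY"]


def embed(texts: list[str]) -> list[list[float]]:
    """Call a Baseten-hosted embedding model.

    The {"texts": ...} payload and "embeddings" response key are assumed;
    the actual shape depends on how the model was packaged.
    """
    resp = requests.post(
        BASETEN_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"texts": texts},
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]


# Store the embeddings in an in-memory Chroma collection.
client = chromadb.Client()
collection = client.create_collection("docs")

docs = ["Baseten serves ML models.", "Chroma stores vectors."]
collection.add(ids=["1", "2"], documents=docs, embeddings=embed(docs))

# Query with an embedding from the same model so both sides share a vector space.
results = collection.query(query_embeddings=embed(["model serving"]), n_results=1)
print(results["documents"])
```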

ML models

The best open-source embedding models

Discover the best open-source embedding models for search, RAG, and recommendations: curated picks for performance, speed, and cost-efficiency.

Model performance

How we built high-throughput embedding, reranker, and classifier inference with TensorRT-LLM

Discover how we optimized embedding, reranker, and classifier inference using TensorRT-LLM, doubling throughput and achieving ultra-low latency at scale.

Model performance

How multi-node inference works for massive LLMs like DeepSeek-R1

Running DeepSeek-R1 on H100 GPUs requires multi-node inference to connect the 16 H100s needed to hold the model weights.
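
The arithmetic behind that GPU count, as a quick sketch (assuming the weights-only footprint of DeepSeek-R1's 671B parameters at their native FP8 precision, i.e. one byte per weight, and standard 8-GPU H100 nodes):

```python
# Rough capacity math for serving DeepSeek-R1 on H100s (weights only;
# KV cache and activations need additional headroom on top of this).
params_billion = 671  # DeepSeek-R1 parameter count
bytes_per_param = 1   # native FP8 weights
weights_gb = params_billion * bytes_per_param  # ~671 GB of weights

h100_memory_gb = 80
gpus_per_node = 8
node_memory_gb = gpus_per_node * h100_memory_gb  # 640 GB per node

print(weights_gb > node_memory_gb)  # True: weights alone overflow one node
print(2 * node_memory_gb)           # 1280 GB across 16 H100s fits the model
```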

GPU guides

Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud

The NVIDIA GH200 Superchip combines an NVIDIA Hopper GPU with a Grace ARM CPU via the high-bandwidth NVLink-C2C interconnect.

ML models

Private, secure DeepSeek-R1 in production in US & EU data centers

Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI's API.

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Glossary

A quick introduction to speculative decoding

Speculative decoding reduces LLM inference latency by using a smaller model to generate draft tokens that the larger target model can verify and accept during inference.
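
A minimal sketch of the greedy draft-and-verify loop, with plain callables standing in for the two models (a production implementation such as TensorRT-LLM's verifies all draft tokens in one batched target forward pass and handles sampling, not just greedy matching):

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model, one token
    target_next: Callable[[List[int]], int],  # expensive target model, one token
    k: int = 4,
) -> List[int]:
    """One round of greedy speculative decoding.

    The draft model proposes k tokens; the target model checks each one and
    keeps the longest matching prefix, plus one token of its own. Every
    accepted draft token is a target-model step saved, and the output is
    identical to decoding with the target model alone.
    """
    # 1. Draft phase: propose k tokens autoregressively with the cheap model.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify phase: the target model scores the same positions (a real
    #    engine does this in one batched forward pass, not k sequential calls).
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_next(ctx))  # all k accepted: one bonus token

    return prefix + accepted


# Toy demo: draft and target agree here, so all 4 drafts plus a bonus are kept.
print(speculative_step([1, 2], lambda c: len(c) % 5, lambda c: len(c) % 5))
```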

