Welcome to the singularity. Tools and best practices in LLM engineering evolve so quickly that they are hard to keep up with unless you are building with them almost daily. This article surveys the state of the art as of early 2025.
At the core of LLM engineering are, of course, the models themselves. Frontier (or massive-scale) models are the paid, closed models behind services like ChatGPT. Open-source (more precisely, [[open weight]]) models such as Llama are also available.
There are multiple ways to use models:
- **Chat interfaces**: all you need is a text box. Type your message into the service's web interface and get a response. Start for free or upgrade to a paid monthly subscription.
- **Cloud APIs**: interact with LLMs running in the cloud through an API. Frameworks like [[LangChain]] wrap multiple APIs to provide a seamless experience.
- **Managed AI cloud services**: interact with cloud APIs through an interface managed by the service provider. Examples include [[Amazon Bedrock]], [[Google Vertex]], and [[Azure AI Studio]].
- **Direct inference**: run the LLM locally or on a [[virtual machine]]. Use [[ollama]] to run locally or [[HuggingFace]] to access models in the cloud.
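Whichever route you take, most cloud APIs converge on the same chat-completion request shape: a model name plus a list of role/content messages. A minimal sketch of that shape (the model name here is purely illustrative):

```python
def build_chat_request(model: str, system_prompt: str, user_prompt: str) -> dict:
    """Assemble the JSON payload used by OpenAI-style chat endpoints."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_chat_request(
    "gpt-4o-mini",  # illustrative model name
    "You are a helpful assistant.",
    "Summarize the state of LLM engineering in one sentence.",
)
print(payload)
```

Wrappers like [[LangChain]] exist largely because this shape (and small provider-specific deviations from it) can be abstracted behind one interface.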
## hardware requirements
- What it's like running on a Microsoft Surface 9; Dell Precision 5560
- Build your own machine
- NVIDIA CUDA GPU
- Apple Silicon
- alternative: cloud compute
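A quick way to judge whether a machine can run a given model: the weights alone need roughly parameter count × bytes per parameter of memory (activations and KV cache add more on top). A small back-of-the-envelope sketch, which also shows why [[quantization]] matters for local inference:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory (GB) needed just to hold a model's weights."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at different precisions:
for bits, label in [(16, "fp16"), (8, "int8"), (4, "4-bit")]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, 4-bit: 3.5 GB
```

At 4-bit quantization a 7B model fits comfortably on a 16 GB laptop, which is what makes the Apple Silicon and consumer-GPU options above practical.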
## running LLMs on a Mac
Apple Silicon's unified memory lets the GPU address the full system RAM, which makes modern Macs surprisingly capable machines for local inference.
https://www.youtube.com/watch?v=bp2eev21Qfo
[[prompting]]
[[LLamaIndex]]
[[ollama]]
[[OpenAI API]]
[[Anthropic API]]
[[open WebUI]]
[[model context protocol]]
[[Windsurf]]
[[Vellum]]
[[replit]]
[[structured outputs]]
[[webdev arena]]
[[Gradio]]
[[HuggingFace]]
[[tokenizer]]
[[chat template]]
[[LoRA]]
[[quantization]]
[[chinchilla scaling law]]
[[LLM benchmarks]]
[[Retrieval Augmented Generation]]
[[LangChain]]
[[Weights and Biases]]
[[Modal]]
## fine tuned model serverless API workflow
*Write out the steps to fine-tune a model with Hugging Face, upload it to the Hugging Face Hub, and run inference with Modal (or make a notebook).*
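A pseudocode-level sketch of that workflow, assuming a [[LoRA]] fine-tune via `transformers` + `peft`. All model and repo names are placeholders, an `HF_TOKEN` is assumed to be configured, and the Modal CLI is assumed to be set up; the dataset-preparation step is elided:

```python
# --- 1. Fine-tune (transformers + peft, on a local or rented GPU) ---
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

base = "meta-llama/Llama-3.2-1B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base),
                       LoraConfig(r=8, lora_alpha=16))
# ... tokenize your dataset, then:
Trainer(model=model, args=TrainingArguments("out"), train_dataset=...).train()

# --- 2. Upload the fine-tuned weights to the Hugging Face Hub ---
model.push_to_hub("your-username/my-finetune")      # placeholder repo id
tokenizer.push_to_hub("your-username/my-finetune")

# --- 3. Serve inference from Modal ---
import modal

app = modal.App("finetune-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "peft")

@app.function(image=image, gpu="T4",
              secrets=[modal.Secret.from_name("huggingface")])
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="your-username/my-finetune")
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]
```

Deploy with `modal deploy` and the function runs serverless: you pay only for GPU-seconds during inference, with no machine to keep warm between calls.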