In this blog, we share a practical approach to using the combination of HuggingFace, DeepSpeed, and Ray to build a system for fine-tuning and serving LLMs, in 40 minutes and for less than $7 for a 6-billion-parameter model. Using these three components, you can simply and quickly put together an open-source LLM fine-tuning and serving system. By taking advantage of Ray's distributed capabilities, we show how this can be both more cost-effective and faster than using a single large (and often unobtainable) machine.

In particular, we illustrate the following:

- Discussing why you might want to run your own LLM instead of using one of the new API providers.
- Showing you the evolving tech stack we are seeing for cost-effective LLM fine-tuning and serving, combining HuggingFace, DeepSpeed, PyTorch, and Ray.
- Showing you 40 lines of Python code that can enable you to serve a 6-billion-parameter GPT-J model.
- Showing you, for less than $7, how you can fine-tune the model to sound more medieval using the works of Shakespeare, by doing it in a distributed fashion on low-cost machines, which is considerably more cost-effective than using a single large, powerful machine.
- Showing how you can serve the fine-tuned 6B LLM as a compiled model binary.
- Showing how the fine-tuned model compares to a prompt engineering approach with large systems.

There are many, many providers of LLM APIs online. Why would you want to run your own? There are a few reasons:

- Cost, especially for fine-tuned inference: OpenAI, for example, charges 12 cents per 1,000 tokens (about 700 words) for a fine-tuned Davinci model. It's important to remember that many user interactions require multiple backend calls (e.g. one to help with prompt generation, another for post-generation moderation, etc.), so it's very possible that a single interaction with an end user could cost a few dollars. For many applications, this is cost-prohibitive.
- Latency: using these LLM APIs is often slow. A GPT-3.5 query, for example, can take up to 30 seconds. Combine a few round trips from your data center to theirs, and it is possible for a query to take minutes. Again, this makes many applications impossible. Bringing the processing in-house allows you to optimize the stack for your application, e.g. by using low-resolution models, tightly packing queries onto GPUs, and so on. We have heard from users that optimizing their workflow has often resulted in a 5x or greater latency improvement.
- Data security & privacy: for many applications, getting a response from these APIs requires sending them a lot of data (e.g. sending a few snippets of internal documents and asking the system to summarize them). Many of the API providers reserve the right to use those data instances for retraining. Given the sensitivity of organizational data, and frequent legal constraints like data residency, this is especially limiting. One particularly concerning recent development is the ability to regenerate training data from learned models, with people unintentionally disclosing secret information as a result.

The LLM space is evolving incredibly rapidly. What we are seeing is a particular technology stack emerging that combines multiple technologies: HuggingFace, DeepSpeed, PyTorch, and Ray. What we've also seen is a reluctance to go beyond a single machine for training. To make the stack concrete, simplified sketches of the serving and distributed fine-tuning pieces follow below.
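To make the serving side concrete, here is a minimal sketch of the pattern described above: a Ray Serve deployment wrapping a HuggingFace text-generation pipeline. It is not the blog's exact 40-line script; the model id, replica resources, and request format are illustrative assumptions, and it presumes a recent Ray 2.x install plus a GPU with enough memory to hold GPT-J in fp16.

```python
# Minimal sketch: serving GPT-J with Ray Serve + HuggingFace Transformers.
# Assumptions: ray[serve], transformers, and torch are installed, and one GPU
# with enough memory for the 6B model in fp16 is available.
import time

import torch
from starlette.requests import Request
from transformers import pipeline

from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1})
class GPTJServer:
    def __init__(self, model_id: str = "EleutherAI/gpt-j-6B"):
        # Load the 6-billion-parameter checkpoint once per replica.
        # fp16 roughly halves the GPU memory needed versus fp32.
        self.generator = pipeline(
            "text-generation",
            model=model_id,
            torch_dtype=torch.float16,
            device=0,
        )

    async def __call__(self, request: Request) -> str:
        # Expect a JSON body such as {"prompt": "...", "max_new_tokens": 64}.
        body = await request.json()
        output = self.generator(
            body["prompt"],
            max_new_tokens=body.get("max_new_tokens", 64),
            do_sample=True,
        )
        return output[0]["generated_text"]


if __name__ == "__main__":
    # Start Ray Serve locally and expose the deployment over HTTP
    # (port 8000 by default), then keep the driver alive.
    serve.run(GPTJServer.bind())
    while True:
        time.sleep(10)
```

Once it is running, a request like `curl -X POST http://127.0.0.1:8000/ -H "Content-Type: application/json" -d '{"prompt": "Once upon a time"}'` returns the generated continuation. Loading the pipeline in the deployment's `__init__` means the weights are loaded once per replica rather than on every request.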
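And here is a simplified sketch of the distributed fine-tuning side: Ray Train spreads a standard HuggingFace Trainer loop across many small GPU workers, while DeepSpeed ZeRO stage 3 with CPU offload shards the 6B model's parameters and optimizer state so that low-cost GPUs can each hold only their share. The dataset, hyperparameters, and worker count are illustrative assumptions rather than the blog's exact configuration, and it presumes ray[train], transformers, datasets, and deepspeed are installed on a GPU cluster.

```python
# Simplified sketch: distributed fine-tuning of GPT-J on Shakespeare with
# Ray Train + HuggingFace Transformers + DeepSpeed ZeRO-3 (CPU offload).
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    from datasets import Dataset, load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_id = "EleutherAI/gpt-j-6B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Illustrative corpus: tiny_shakespeare is a single long string, so split
    # it into lines to give the Trainer many short examples.
    raw = load_dataset("tiny_shakespeare", split="train")
    lines = [line for line in raw["text"][0].split("\n") if line.strip()]
    dataset = Dataset.from_dict({"text": lines}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
        batched=True,
        remove_columns=["text"],
    )

    # ZeRO stage 3 with CPU offload shards parameters and optimizer state,
    # which is what lets a 6B model train on modest GPUs.
    deepspeed_config = {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="gptj-shakespeare",
            per_device_train_batch_size=1,
            num_train_epochs=1,
            fp16=True,
            deepspeed=deepspeed_config,
            report_to="none",
        ),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


# Spread the job across 16 single-GPU workers instead of one large machine.
TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
).fit()
```

Because each worker only needs to hold a shard of the parameters and optimizer state, a pool of cheap single-GPU instances can stand in for the single large machine many teams assume they need, which is the cost argument made above.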