Deploy Your LLM API on CPU

LLAMA 2 is a powerful language model that has demonstrated remarkable capabilities in understanding and generating human-like text. In this article, we will guide you through the process of deploying the LLAMA-2-13b-chat Language Model (LLM) as an API using Python’s FastAPI framework. This will allow you to interact with your LLAMA 2 model over HTTP requests and receive a streaming response, enabling a wide range of applications such as chatbots, content generation, and more.
Prerequisites
Before we dive into the deployment process, ensure that you have the following components ready:
LLAMA-2-13b-chat LLM Model: You should have the LLAMA 2 language model downloaded in a format suitable for deployment, for example the quantised GGML file below:
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_1.bin
Python Environment: Set up a Python environment with the required packages. You can use virtual environments to manage dependencies cleanly (a quick sanity check of the setup is sketched after these commands):
pip install llama-cpp-python
pip install fastapi uvicorn sse-starlette requests
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
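Before wiring the model into an API, it can help to confirm that the downloaded file loads with llama-cpp-python. The short sketch below is only a sanity check: it assumes the file from the wget command above sits in the current directory, and the prompt is an arbitrary example.

from llama_cpp import Llama

# Load the quantised model downloaded above (adjust the path if it lives elsewhere).
llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")

# Run a single, non-streaming completion to confirm everything works.
output = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["\n"])
print(output["choices"][0]["text"])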
Understanding the Code
Let’s start by understanding the code snippet provided, which will serve as the foundation for deploying your LLAMA 2 LLM as an API.
import time
import copy
import asyncio
import requests
from fastapi import FastAPI, Request
from llama_cpp import Llama
from sse_starlette import EventSourceResponse

# Load the model
print("Loading model...")
llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")  # change based on the location of the model
print("Model loaded!")

app = FastAPI()


@app.get("/llama")
async def llama(request: Request, question: str):
    # Start a streaming completion for the incoming question.
    stream = llm(
        f"""{question}""",
        max_tokens=100,
        stop=["\n", " Q:"],
        stream=True,
    )

    async def async_generator():
        for item in stream:
            yield item

    async def server_sent_events():
        async for item in async_generator():
            if await request.is_disconnected():
                break
            result = copy.deepcopy(item)
            text = result["choices"][0]["text"]
            yield {"data": text}

    return EventSourceResponse(server_sent_events())
The code above does the following:
1. Imports the necessary libraries and modules, including FastAPI, llama_cpp, and EventSourceResponse for handling Server-Sent Events (SSE).
2. Creates a FastAPI app instance.
3. Defines a route (/llama) for the API. When a GET request is made to this endpoint, it executes the llama function. Inside the llama function:
- The LLAMA 2 model is used to generate text in a streaming fashion. The example initialises the model with a prompt and generates text while adhering to the specified constraints.
- An asynchronous generator is defined to yield items from the text generation stream.
- Another asynchronous function, server_sent_events, iterates over the items generated by the LLAMA 2 model and yields the generated text as SSE data (the structure of each streamed item is illustrated after this list).
4. The EventSourceResponse built from the server_sent_events generator is returned as the API response. This allows the client to receive a continuous stream of text generated by the LLAMA 2 model.
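To make the streaming logic concrete, the hedged sketch below runs the same kind of call outside FastAPI. Each item yielded by llm(..., stream=True) is a dictionary with a "choices" list, which is why the endpoint extracts result["choices"][0]["text"] before sending it to the client; the prompt and token limit here are illustrative only.

from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")

# Iterate over the streaming generator; each item carries one chunk of generated text.
for item in llm("Q: Name one use of server-sent events. A:", max_tokens=32, stop=["\n"], stream=True):
    print(item["choices"][0]["text"], end="", flush=True)
print()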
Deploying Your LLAMA 2 LLM as an API
Now that we understand the code, let’s proceed with deploying your LLAMA 2 LLM as an API using FastAPI.
To run the FastAPI server, execute the following command in your terminal:
uvicorn your_script_name:app --host 0.0.0.0 --port 8000
Replace your_script_name with the name of the Python script containing the code above.
Interact with the API: Once the server is running, you can interact with your LLAMA 2 LLM API by sending a GET request to http://localhost:8000/llama?question=<your question> using a web browser or a tool like curl. The question query parameter carries the prompt, and the generated text is streamed back as server-sent events.
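If you prefer a scripted client, the requests package installed earlier can consume the stream directly. The snippet below is a minimal sketch that assumes the server is running locally on port 8000; it reads the raw SSE frames line by line and prints the payload of each data: line as it arrives.

import requests

# Stream the response instead of waiting for the full body.
with requests.get(
    "http://localhost:8000/llama",
    params={"question": "Q: What is FastAPI? A:"},
    stream=True,
) as response:
    for line in response.iter_lines(decode_unicode=True):
        # SSE frames look like "data: <text>"; blank lines separate events.
        if line and line.startswith("data:"):
            chunk = line[len("data:"):]
            # A single space after "data:" is part of the SSE framing, not the payload.
            if chunk.startswith(" "):
                chunk = chunk[1:]
            print(chunk, end="", flush=True)
print()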
Conclusion
Congratulations! You have successfully deployed your own Language Model as an API using FastAPI. This deployment opens up a world of possibilities for integrating your LLAMA 2 model into various applications and services, enabling dynamic text generation and interaction. As you explore this API further, consider enhancing it with error handling, authentication, and additional features to meet your specific requirements.
Happy generating with your deployed LLM API!