Deploy Your LLM API on CPU

LLAMA 2 is a powerful language model that has demonstrated remarkable capabilities in understanding and generating human-like text. In this article, we will guide you through the process of deploying the LLAMA-2-13b-chat Language Model (LLM) as an API using Python’s FastAPI framework. This will allow you to interact with your LLAMA 2 model over HTTP requests and receive a streaming response, enabling a wide range of applications such as chatbots, content generation, and more.
Prerequisites
Before we dive into the deployment process, ensure that you have the following components ready:
LLAMA-2-13b-chat LLM Model: You should have the LLAMA 2 language model downloaded in a format suitable for deployment, for example the quantised GGML file below:
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_1.bin
Python Environment: Set up a Python environment with the required packages. You can use virtual environments to manage dependencies cleanly (a quick sanity check of the setup is sketched after these commands):
pip install llama-cpp-python
pip install fastapi uvicorn sse-starlette requests
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
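Before wiring the model into an API, it can help to confirm that the downloaded file loads with llama-cpp-python. The short sketch below is only a sanity check: it assumes the file from the wget command above sits in the current directory, and the prompt is an arbitrary example.

from llama_cpp import Llama

# Load the quantised model downloaded above (adjust the path if it lives elsewhere).
llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")

# Run a single, non-streaming completion to confirm everything works.
output = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["\n"])
print(output["choices"][0]["text"])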
Understanding the Code
Let’s start by understanding the code snippet provided, which will serve as the foundation for deploying your LLAMA 2 LLM as an API.
import time
import copy
import asyncio
import requests
from fastapi import FastAPI, Request
from llama_cpp import Llama
from sse_starlette import EventSourceResponse

# Load the model
print("Loading model...")
llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")  # change based on the location of the model
print("Model loaded!")

app = FastAPI()


@app.get("/llama")
async def llama(request: Request, question: str):
    # Start a streaming completion for the incoming question.
    stream = llm(
        f"""{question}""",
        max_tokens=100,
        stop=["\n", " Q:"],
        stream=True,
    )

    async def async_generator():
        for item in stream:
            yield item

    async def server_sent_events():
        async for item in async_generator():
            if await request.is_disconnected():
                break
            result = copy.deepcopy(item)
            text = result["choices"][0]["text"]
            yield {"data": text}

    return EventSourceResponse(server_sent_events())
The code above does the following:
1. Imports the necessary libraries and modules, including FastAPI, llama_cpp, and EventSourceResponse for handling Server-Sent Events (SSE).
2. Creates a FastAPI app instance.
3. Defines a route (/llama) for the API. When a GET request is made to this endpoint, it executes the llama function. Inside the llama function:
- The LLAMA 2 model is used to generate text in a streaming fashion. The example initialises the model with a prompt and generates text while adhering to the specified constraints.
- An asynchronous generator is defined to yield items from the text generation stream.
- Another asynchronous function, server_sent_events, iterates over the items generated by the LLAMA 2 model and yields the generated text as SSE data (the structure of each streamed item is illustrated after this list).
4. The EventSourceResponse built from the server_sent_events generator is returned as the API response. This allows the client to receive a continuous stream of text generated by the LLAMA 2 model.
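To make the streaming logic concrete, the hedged sketch below runs the same kind of call outside FastAPI. Each item yielded by llm(..., stream=True) is a dictionary with a "choices" list, which is why the endpoint extracts result["choices"][0]["text"] before sending it to the client; the prompt and token limit here are illustrative only.

from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_1.bin")

# Iterate over the streaming generator; each item carries one chunk of generated text.
for item in llm("Q: Name one use of server-sent events. A:", max_tokens=32, stop=["\n"], stream=True):
    print(item["choices"][0]["text"], end="", flush=True)
print()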
Deploying Your LLAMA 2 LLM as an API
Now that we understand the code, let’s proceed with deploying your LLAMA 2 LLM as an API using FastAPI.
To run the FastAPI server, execute the following command in your terminal:
uvicorn your_script_name:app --host 0.0.0.0 --port 8000
Replace your_script_name with the name of the Python script containing the code above.
Interact with the API: Once the server is running, you can interact with your LLAMA 2 LLM API by sending a GET request to http://localhost:8000/llama?question=<your question> using a web browser or a tool like curl. The question query parameter carries the prompt, and the generated text is streamed back as server-sent events.
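If you prefer a scripted client, the requests package installed earlier can consume the stream directly. The snippet below is a minimal sketch that assumes the server is running locally on port 8000; it reads the raw SSE frames line by line and prints the payload of each data: line as it arrives.

import requests

# Stream the response instead of waiting for the full body.
with requests.get(
    "http://localhost:8000/llama",
    params={"question": "Q: What is FastAPI? A:"},
    stream=True,
) as response:
    for line in response.iter_lines(decode_unicode=True):
        # SSE frames look like "data: <text>"; blank lines separate events.
        if line and line.startswith("data:"):
            chunk = line[len("data:"):]
            # A single space after "data:" is part of the SSE framing, not the payload.
            if chunk.startswith(" "):
                chunk = chunk[1:]
            print(chunk, end="", flush=True)
print()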
Conclusion
Congratulations! You have successfully deployed your own Language Model as an API using FastAPI. This deployment opens up a world of possibilities for integrating your LLAMA 2 model into various applications and services, enabling dynamic text generation and interaction. As you explore this API further, consider enhancing it with error handling, authentication, and additional features to meet your specific requirements.
Happy generating with your deployed LLM API!