
OpenAI Proxy Server

A local, fast, and lightweight OpenAI-compatible server to call 100+ LLM APIs.

info

We want to learn how we can make the proxy better! Meet the founders or join our Discord.

Usage

$ pip install litellm
$ litellm --model ollama/codellama

#INFO: Ollama running on http://0.0.0.0:8000

Test

In a new shell, run:

$ litellm --test

Replace openai base

import openai 

openai.api_base = "http://0.0.0.0:8000"

print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))

Other supported models:

# Assuming you're running vllm locally
$ litellm --model vllm/facebook/opt-125m


[Tutorial]: Use with Continue-Dev/Aider/AutoGen/Langroid/etc.

Here's how to use the proxy to test codellama/mistral/etc. models across different GitHub repos:

$ pip install litellm
$ ollama pull codellama # our local CodeLlama

$ litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048

Implementation for different repos

Continue-Dev brings ChatGPT to VSCode. See how to install it here.

In config.py, set this as your default model.

  default=OpenAI(
      api_key="IGNORED",
      model="fake-model-name",
      context_length=2048,  # customize if needed for your model
      api_base="http://localhost:8000"  # your proxy server url
  ),

Credits to @vividfog for this tutorial.

note

Contribute: Using this server with a project? Contribute your tutorial here!

Advanced

Configure Model

To save API keys, change the model prompt, etc., you'll need to create a local instance of the proxy:

$ litellm --create_proxy

This will create a local project called litellm-proxy in your current directory, which contains:

  • proxy_cli.py: Runs the proxy
  • proxy_server.py: Contains the API calling logic
    • /chat/completions: receives the openai.ChatCompletion.create() call.
    • /completions: receives the openai.Completion.create() call.
    • /models: receives the openai.Model.list() call.
  • secrets.toml: Stores your api keys, model configs, etc.

Run it with:

$ cd litellm-proxy
$ python proxy_cli.py --model ollama/llama # replace with your model name
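
Once the proxy is running, you can hit the endpoints listed above directly. A minimal sketch using curl, assuming the default address http://0.0.0.0:8000 and whichever model you started the proxy with (the request body follows the standard OpenAI chat format):

$ curl http://0.0.0.0:8000/models

$ curl http://0.0.0.0:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/llama", "messages": [{"role": "user", "content": "Hey!"}]}'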

To set the API base, temperature, and max tokens, add them to your CLI command:

litellm --model ollama/llama2 \
--api_base http://localhost:11434 \
--max_tokens 250 \
--temperature 0.5

Create a proxy for multiple LLMs

$ litellm

#INFO: litellm proxy running on http://0.0.0.0:8000

Send a request to your proxy

import openai 

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url

# call gpt-3.5-turbo
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])

print(response)

# call ollama/llama2
response = openai.ChatCompletion.create(model="ollama/llama2", messages=[{"role": "user", "content": "Hey"}])

print(response)
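
The proxy also exposes /completions (see above), so the older text-completion interface should work the same way. A minimal sketch, assuming the same proxy URL and model names as the snippet above:

import openai

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url

# call the /completions endpoint (text-completion interface)
response = openai.Completion.create(model="ollama/llama2", prompt="Hey")

print(response)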

Tracking costs

By default, the litellm proxy writes cost logs to litellm/proxy/costs.json.

How can the proxy be better? Let us know here

{
  "Oct-12-2023": {
    "claude-2": {
      "cost": 0.02365918,
      "num_requests": 1
    }
  }
}

You can view costs on the CLI with:

$ litellm --cost
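
If you'd rather inspect the log programmatically, here is a minimal sketch that reads the costs.json format shown above; the path is assumed to be relative to the directory the proxy runs from:

import json

# read the cost log written by the proxy (format shown above)
with open("litellm/proxy/costs.json") as f:
    costs = json.load(f)

# total cost and request count per day, summed across models
for day, models in costs.items():
    total_cost = sum(m["cost"] for m in models.values())
    total_requests = sum(m["num_requests"] for m in models.values())
    print(f"{day}: ${total_cost:.6f} across {total_requests} request(s)")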

Ollama Logs

Ollama calls can sometimes fail (out-of-memory errors, etc.).

To see your logs, run:

$ curl 'http://0.0.0.0:8000/ollama_logs'

This will return your logs from ~/.ollama/logs/server.log.

Deploy Proxy

Use this Docker image to deploy an OpenAI-compatible server for local models with Ollama.

It works for models like Mistral, Llama2, CodeLlama, etc. (any model supported by Ollama).

Usage

docker run --name ollama litellm/ollama

More details 👉 https://hub.docker.com/r/litellm/ollama
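
To reach the server from your host, you'll likely need to publish the container's port. A sketch, assuming the image serves on port 8000 like the local proxy (check the Docker Hub page above for the image's actual port and options):

docker run --name ollama -p 8000:8000 litellm/ollama

# then test it from your host
curl http://0.0.0.0:8000/models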

Support / talk with founders