OpenAI Proxy Server
A local, fast, and lightweight OpenAI-compatible server to call 100+ LLM APIs.
Usage
pip install litellm
$ litellm --model ollama/codellama
#INFO: Ollama running on http://0.0.0.0:8000
Test
In a new shell, run:
$ litellm --test
Replace openai base
import openai
openai.api_base = "http://0.0.0.0:8000"
print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))
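The proxy speaks the OpenAI wire format, so you can also hit it directly over HTTP. A minimal sketch using the requests library (assumes the default port 8000 and the /chat/completions route described under Advanced below):

# A rough sketch, not from the docs: calling the proxy's OpenAI-compatible
# /chat/completions route directly over HTTP.
import requests

response = requests.post(
    "http://0.0.0.0:8000/chat/completions",  # assumes the default proxy port
    json={
        "model": "test",
        "messages": [{"role": "user", "content": "Hey!"}],
    },
)
print(response.json())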
Other supported models:

- VLLM
  $ litellm --model vllm/facebook/opt-125m

- OpenAI Compatible Server
  $ litellm --model openai/<model_name> --api_base <your-api-base>

- Huggingface
  $ export HUGGINGFACE_API_KEY=my-api-key # [OPTIONAL]
  $ litellm --model huggingface/<huggingface-model-name>

- Anthropic
  $ export ANTHROPIC_API_KEY=my-api-key
  $ litellm --model claude-instant-1

- TogetherAI
  $ export TOGETHERAI_API_KEY=my-api-key
  $ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

- Replicate
  $ export REPLICATE_API_KEY=my-api-key
  $ litellm \
    --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

- Petals
  $ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

- Palm
  $ export PALM_API_KEY=my-palm-key
  $ litellm --model palm/chat-bison

- Azure OpenAI
  $ export AZURE_API_KEY=my-api-key
  $ export AZURE_API_BASE=my-api-base
  $ litellm --model azure/my-deployment-name

- AI21
  $ export AI21_API_KEY=my-api-key
  $ litellm --model j2-light

- Cohere
  $ export COHERE_API_KEY=my-api-key
  $ litellm --model command-nightly
[Tutorial]: Use with Continue-Dev/Aider/AutoGen/Langroid/etc.
Here's how to use the proxy to test CodeLlama/Mistral/etc. models with different GitHub repos.
pip install litellm
$ ollama pull codellama # our local CodeLlama
$ litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048
Implementation for different repos
- ContinueDev
- Aider
- AutoGen
- Langroid
- GPT-Pilot
- guidance
ContinueDev
Continue-Dev brings ChatGPT to VSCode. See how to install it here.
In config.py, set this as your default model:
default=OpenAI(
api_key="IGNORED",
model="fake-model-name",
context_length=2048, # customize if needed for your model
api_base="http://localhost:8000" # your proxy server url
),
Credits @vividfog for this tutorial.
Aider
$ pip install aider-chat
$ aider --openai-api-base http://0.0.0.0:8000 --openai-api-key fake-key
AutoGen
pip install pyautogen
from autogen import AssistantAgent, UserProxyAgent, oai
config_list=[
{
"model": "my-fake-model",
"api_base": "http://localhost:8000", #litellm compatible endpoint
"api_type": "open_ai",
"api_key": "NULL", # just a placeholder
}
]
response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response) # works fine
llm_config={
"config_list": config_list,
}
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)
Credits @victordibia for this tutorial.
Langroid
pip install langroid
from langroid.language_models.openai_gpt import OpenAIGPTConfig, OpenAIGPT
# configure the LLM
my_llm_config = OpenAIGPTConfig(
# format: "local/[URL where LiteLLM proxy is listening]"
chat_model="local/localhost:8000",
chat_context_length=2048, # adjust based on model
)
# create llm, one-off interaction
llm = OpenAIGPT(my_llm_config)
response = llm.chat("What is the capital of China?", max_tokens=50)
# Create an Agent with this LLM, wrap it in a Task, and
# run it as an interactive chat app:
from langroid.agent.base import ChatAgent, ChatAgentConfig
from langroid.agent.task import Task
agent_config = ChatAgentConfig(llm=my_llm_config, name="my-llm-agent")
agent = ChatAgent(agent_config)
task = Task(agent, name="my-llm-task")
task.run()
Credits @pchalasani and Langroid for this tutorial.
GPT-Pilot
In your .env, set the OpenAI endpoint to your local server:
OPENAI_ENDPOINT=http://0.0.0.0:8000
OPENAI_API_KEY=my-fake-key
guidance
NOTE: guidance sends additional params like stop_sequences, which can cause some models to fail if they don't support them.
Fix: start your proxy with the --drop_params flag:
litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
import guidance
# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:8000", api_key="anything")
experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}
{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}
{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)
result = experts(query='How can I be more productive?')
print(result)
Contribute
Using this server with a project? Contribute your tutorial here!
Advanced
Configure Model
To save API keys, change the model prompt, etc., you'll need to create a local instance of the proxy:
$ litellm --create_proxy
This will create a local project called litellm-proxy in your current directory, containing:
- proxy_cli.py: Runs the proxy
- proxy_server.py: Contains the API calling logic
  - /chat/completions: receives openai.ChatCompletion.create calls
  - /completions: receives openai.Completion.create calls
  - /models: receives openai.Model.list() calls
- secrets.toml: Stores your api keys, model configs, etc.
Run it by doing:
$ cd litellm-proxy
$ python proxy_cli.py --model ollama/llama # replace with your model name
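Once the local proxy is running, you can exercise each of the endpoints above with the same openai (<1.0) client used elsewhere in this doc. A minimal sketch, assuming the default port 8000 and whatever model name you started the proxy with:

import openai

openai.api_key = "any-string-here"       # the proxy doesn't check this
openai.api_base = "http://0.0.0.0:8000"  # your local proxy

# /models -> openai.Model.list()
print(openai.Model.list())

# /chat/completions -> openai.ChatCompletion.create
print(openai.ChatCompletion.create(
    model="ollama/llama",
    messages=[{"role": "user", "content": "Hey!"}],
))

# /completions -> openai.Completion.create
print(openai.Completion.create(model="ollama/llama", prompt="Hey!"))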
To set the API base, temperature, and max tokens, add them to your CLI command:
litellm --model ollama/llama2 \
--api_base http://localhost:11434 \
--max_tokens 250 \
--temperature 0.5
Create a proxy for multiple LLMs
$ litellm
#INFO: litellm proxy running on http://0.0.0.0:8000
Send a request to your proxy
import openai
openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url
# call gpt-3.5-turbo
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])
print(response)
# call ollama/llama2
response = openai.ChatCompletion.create(model="ollama/llama2", messages=[{"role": "user", "content": "Hey"}])
print(response)
Tracking costs
By default, the litellm proxy writes cost logs to litellm/proxy/costs.json. (How can the proxy be better? Let us know here.)
{
"Oct-12-2023": {
"claude-2": {
"cost": 0.02365918,
"num_requests": 1
}
}
}
You can view costs from the CLI with:
litellm --cost
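If you want totals rather than the raw log, a short script can aggregate the per-model entries. A minimal sketch, assuming the file lives at the litellm/proxy/costs.json path and has the shape shown above:

import json

# Rough helper (not part of litellm): sum the logged spend per day from costs.json.
with open("litellm/proxy/costs.json") as f:
    costs = json.load(f)

for day, models in costs.items():
    total = sum(entry["cost"] for entry in models.values())
    requests_made = sum(entry["num_requests"] for entry in models.values())
    print(f"{day}: ${total:.6f} across {requests_made} request(s)")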
Ollama Logs
Ollama calls can sometimes fail (out-of-memory errors, etc.).
To see your logs, just call:
$ curl 'http://0.0.0.0:8000/ollama_logs'
This will return your logs from ~/.ollama/logs/server.log.
Deploy Proxy
- Ollama/OpenAI Docker
- Self-Hosted
- LiteLLM-Hosted
Ollama/OpenAI Docker
This works for models like Mistral, Llama2, CodeLlama, etc. (any model supported by Ollama).
Usage:
docker run --name ollama litellm/ollama
More details 👉 https://hub.docker.com/r/litellm/ollama
Self-Hosted
Step 1: Clone the repo
git clone https://github.com/BerriAI/liteLLM-proxy.git
Step 2: Put your API keys in .env. Copy .env.template and fill in the relevant keys (e.g. OPENAI_API_KEY="sk-..").
Step 3: Test your proxy. Start your proxy server:
cd litellm-proxy && python3 main.py
Make your first call
import openai
openai.api_key = "sk-litellm-master-key"
openai.api_base = "http://0.0.0.0:8080"
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])
print(response)
LiteLLM-Hosted
Deploy the proxy to https://api.litellm.ai
$ export ANTHROPIC_API_KEY=sk-ant-api03-1..
$ litellm --model claude-instant-1 --deploy
#INFO: Uvicorn running on https://api.litellm.ai/44508ad4
This will host a ChatCompletions API at: https://api.litellm.ai/44508ad4
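You can then call the hosted endpoint exactly like the local proxy, by pointing the client's api_base at the printed URL. A minimal sketch (the /44508ad4 ID is just the example from above; use the one your deploy prints):

import openai

openai.api_key = "any-string-here"                   # placeholder, as with the local proxy
openai.api_base = "https://api.litellm.ai/44508ad4"  # replace with your deployment URL

response = openai.ChatCompletion.create(
    model="claude-instant-1",
    messages=[{"role": "user", "content": "Hey!"}],
)
print(response)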
Support / talk with founders
- Schedule Demo 👋
- Community Discord 💭
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai