Deploy your own LLaMa2 Service with Digital Ocean (Part II)

Create your own LLaMa2 API with Ollama

Digital Plumbers
Bootcamp
8 min read · Nov 10, 2023


Oh Yeah — it’s LLAMA TIME

Freedom. Privacy. Cost Savings. Model training and tuning.

If you’re interested in any of these as they relate to your LLM usage, don’t change that dial.

Welcome to part 2 of 2 of my exploration of LLaMa2 and the Ollama framework. Previously we discussed the basics of an Ollama and Digital Ocean integration. At the time of writing, Ollama and Digital Ocean represent one of the most cost-effective ways to run your own private LLM. If you opt for the orca-mini model, you could use a $20–50 USD droplet and save hundreds of dollars per month in your personal and production OpenAI API fees. Connecting your Ollama-Digital Ocean service to the outside world can be a little tricky, though, especially if you aren’t experienced with the Go programming language. In this article, we’ll explore how you can open up your service to public traffic.

Without further ado, let’s get started! For reference, I recommend following along with the same YouTube video from the first article. Ian’s walkthrough is great and helped me a lot during my deployment!

No affiliation, but highly recommend the video!

Ollama’s API

Ollama makes it pretty easy to communicate locally with an LLM right out of the box. I highly recommend checking out their API documentation page on GitHub after reading this article. You can get started with a simple POST request using the code below:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

The default port for the Go server that hosts the Ollama service is 11434, and our generation endpoint is aptly named api/generate. Nice and intuitive 😌. This request assumes you’ve already run the ollama pull llama2 command to load the llama2 model. The llama2 model requires at least 8 GB of RAM (though 16 GB is really the minimum for anything production grade), while the smaller orca-mini model should run comfortably within 8 GB. Check out the Ollama GitHub repo to learn more about available models!
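
If you haven’t pulled a model yet, the Ollama CLI makes that part painless (the GitHub repo lists the model names you can use here):

ollama pull llama2   # download the llama2 weights onto the droplet
ollama list          # confirm which models are available locally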

Failed to connect…

Chances are you received an error when you ran the curl request above against your droplet’s public IP (rather than from inside the droplet), and it probably looks something like this:

curl: (7) Failed to connect to 167.71.136.96 port 11434 after 30 ms: Couldn’t connect to server

Essentially, what we need to do at this step is allow traffic to our service from IP addresses other than that of our Digital Ocean droplet. By default, Ollama only listens on localhost (127.0.0.1), so requests from outside the droplet are refused. We can configure the Ollama service to accept outside traffic with the following commands:

mkdir -p /etc/systemd/system/ollama.service.d
echo "[Service]" >>/etc/systemd/system/ollama.service.d/environment.conf
echo "Environment=OLLAMA_HOST=0.0.0.0:11434" >>/etc/systemd/system/ollama.service.d/environment.conf

Reload systemd and restart Ollama:

systemctl daemon-reload
systemctl restart ollama

While these are the official Linux instructions, I was, interestingly enough, also able to set my configuration with the macOS command provided (as did Ian in the YouTube video above).

OLLAMA_HOST=0.0.0.0:11435 ollama serve
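
Either way, once Ollama is listening on 0.0.0.0 you can test from your own machine by swapping localhost for your droplet’s public IP (substitute your own address, and whichever port you passed to OLLAMA_HOST — 11434 by default):

curl -X POST http://your-droplet-ip:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'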

Once you have successfully stopped your service, changed the traffic permissions, and restarted your service, the original curl request should now work! Without any additional option parameters, your response might look something like this:

{"model":"llama2","created_at":"2023-11-05T19:49:34.592934837Z","response":" The","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:34.746374318Z","response":" sky","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:34.899569696Z","response":" appears","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.048110315Z","response":" blue","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.187261622Z","response":" because","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.362707203Z","response":" of","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.544913038Z","response":" a","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.690496415Z","response":" phenomenon","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.826209212Z","response":" called","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:35.974506161Z","response":" Ray","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:36.134257202Z","response":"leigh","done":false}
{"model":"llama2","created_at":"2023-11-05T19:49:36.325658087Z","response":" scattering","done":false}

The Ollama API allows for plenty of customization, though. You’ll definitely want to check out their API documentation on GitHub to get a better idea of everything you can do, but here’s a quick highlight:

  • model: (required) the model name
  • prompt: the prompt to generate a response for
  • format: the format to return a response in. Currently the only accepted value is json

Advanced parameters (optional):

  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • system: system prompt to use (overrides what is defined in the Modelfile)
  • template: the full prompt or prompt template (overrides what is defined in the Modelfile)
  • context: the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory
  • stream: if false the response will be returned as a single response object, rather than a stream of objects
  • raw: if true no formatting will be applied to the prompt and no context will be returned. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API, and are managing history yourself.

The options parameter functions similarly to the options object you may be used to from OpenAI’s API. Here’s an example of a more complete curl request exploring some of these parameters:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "system": "You are an expert physicist.",
  "stream": false,
  "options": {
    "temperature": 0.2
  }
}'

Your response will look something like the example below. One of the bigger things to note is that by setting stream to false, we can easily grab our response without any concatenation logic.
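
Roughly, expect a single JSON object of this shape (values here are illustrative and trimmed; your timings, token counts, and context array will differ):

{
  "model": "llama2",
  "created_at": "2023-11-05T19:52:07.718103Z",
  "response": "The sky appears blue because of a phenomenon called Rayleigh scattering...",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5589157167,
  "eval_count": 282,
  "eval_duration": 4535599000
}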

Getting ready for production

We now have our Ollama service ready to go, but I encountered an additional problem as I prepared to use my new Ollama service for a live Google Chrome extension (OpenAI Study Buddy). Google’s Chrome extension APIs don’t like communicating with non-HTTPS services, and my Digital Ocean droplet was still a plain HTTP server.

After some research, I realized the best long-term solution was to bite the bullet, buy a cheap domain, and attach an SSL certificate. $10 later, my droplet was all set and secure. Unfortunately, this created a new issue that proved a little difficult to solve.
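
The certificate paths in the server code further down point at /etc/letsencrypt, i.e. a free Let’s Encrypt certificate. If you go that route, certbot in standalone mode is probably the quickest way to issue one — a sketch, assuming your domain’s A record already points at the droplet (substitute your own domain):

apt install -y certbot
certbot certonly --standalone -d openaistudybuddy.com   # standalone mode needs port 80 free

The resulting privkey.pem and fullchain.pem land under /etc/letsencrypt/live/your-domain/, which is exactly where the Express server below reads them from.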

The Ollama service runs on an HTTP server, while my droplet now needs to speak HTTPS. I believe this mismatch could be solved with a proper Nginx reverse-proxy configuration, but I was a little too impatient to pursue it.

I instead opted for the solution outlined by ‘thesourmango’ in this GitHub issue, which involves setting up a Node.js Express server to bridge the HTTP/HTTPS mismatch.

You’ll want to install Node.js and npm on your droplet if you haven’t done so already.

curl -sL https://deb.nodesource.com/setup_18.x | bash -
apt install -y nodejs

And always double-check that things have installed smoothly!

node -v
npm -v

Live shot: contributors to GitHub Issues and Stack Overflow Questions

His code needed a little tweaking (mainly to address a CORS issue), but you can take a look at my server below that’s ready to serve live traffic:

const express = require("express");
const https = require("https");
const fs = require("fs");
const path = require("path");
const cors = require("cors");

const app = express();
const port = 443;
app.use(cors());
app.use(express.json());

// Routes
app.post("/api/", async (req, res) => {
  // Send message to Model API
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama2",
      prompt: req.body.prompt,
      system: "You are a Pokedex-AI, an expert in all things Pokemon",
      stream: false,
      options: {
        temperature: 0.2,
      },
    }),
  });
  // Return stream of Uint8Array chunks from the Ollama response
  console.log(response.body);
  for await (const chunk of response.body) {
    res.write(chunk); // took me ages to find this oh so obvious method!
  }
  res.end(); // End the response
});

// Read SSL certificate and key files
const options = {
  // Replace these with the location of your server's certificates
  key: fs.readFileSync("/etc/letsencrypt/live/openaistudybuddy.com/privkey.pem"),
  cert: fs.readFileSync("/etc/letsencrypt/live/openaistudybuddy.com/fullchain.pem"),
};

// Create HTTPS server
const server = https.createServer(options, app);

server.listen(port, () => {
  console.log(`App listening on https://localhost:${port}`);
});

Simply touch server.js (or whatever name you want) followed by nano server.js to do a little editing in your droplet. There are numerous other ways you can add this code (nano isn’t my favorite either), but I opted to use the nano editor in the interest of time. Save and exit, then run node server.js and you are ready to serve live traffic securely!
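
One thing the server above assumes: the express and cors packages need to be installed in the same directory as server.js (Node 18’s built-in fetch covers the rest). Once the server is up, a quick request against your own domain should confirm everything works end to end (openaistudybuddy.com is just my domain — substitute yours):

npm init -y
npm install express cors

curl -X POST https://openaistudybuddy.com/api/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is the sky blue?"}'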

But wait, there’s more!

We need to keep our services running even after our SSH session is terminated. To accomplish that, I recommend using pm2, a process manager for Node that also works with other runtimes (such as Ollama’s Go binary!). We need to keep both our Ollama service and our custom Node HTTP/HTTPS bridge running at all times. We’ll start with the Ollama service:

pm2 start "OLLAMA_HOST=0.0.0.0:11434 ollama serve" --name ollama
pm2 save
pm2 startup

Then we’ll do the same with our node server:

pm2 start server.js --name ollama-https-server
pm2 save
pm2 startup

Finally, run a quick pm2 list to verify your services have been registered and are running! And just like that, your private LLM service is ready for secure traffic from anyone, anywhere, at any time!

Closing Thoughts

This example is pretty basic (remember, you have quite a bit of freedom with your Ollama API options!), but you can see just how easy and cheap it is ($10 domain + $25–100/month Digital Ocean droplet) to create your own LLM service and maybe even the next big thing in AI! I think privately hosted LLMs will only increase in popularity as tools like Ollama make it easier for developers to experiment with affordable LLM integration and product/service creation.

As models continue to improve, developers will also be able to host LLMs on more cost-effective machines without sacrificing performance. On a political level, I believe tools like Ollama will make regulation of AI increasingly difficult, since you can stand up your own LLM SaaS product for under $50. I’m excited to see what everyone creates, and firmly believe that reducing barriers to entry and exploration (via Ollama and Digital Ocean) will result in some very cool products and software.

Now what are you waiting for? Get started with your own local LLM today!

This article is brought to you by OpenAI Study Buddy, a Google Chrome extension that helps users learn by providing a tutor that gives hints rather than answers. I recently migrated to my own Ollama LLM deployment and cut the cord from OpenAI’s expensive API. Check out the free trial and don’t be afraid to leave a rating or review 😃
