Serve your Ollama API on your Kubernetes cluster

Mar 19, 2024

Wouldn’t it be cool to send HTTP requests to an LLM running on your own infrastructure, and turn it into an API you can use for any purpose?

I am going to show you how to do it.

Requirements

  • A Kubernetes server
  • A machine with Helm installed
  • A domain
  • Some Kubernetes knowledge

That’s all we need. We will start by preparing our Kubernetes cluster: we need to install ingress-nginx to proxy requests to our Kubernetes services, and cert-manager to get a valid SSL certificate. The certificate part is optional, but personally I like having free certificates 🤗. And finally we need to install Ollama.

Installing Helm Packages

Please make sure that you have Helm installed and can connect to your cluster.
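
A quick way to check both, assuming your kubeconfig already points at the cluster:

helm version
kubectl get nodes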

We will first add the ingress-nginx repository to Helm and install the chart into our cluster.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace
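
Once the chart is installed, you can check that the controller is running and grab the external IP of its LoadBalancer service — that is the IP your domain should point to. (The service name below assumes the default release name ingress-nginx used above.)

kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx ingress-nginx-controller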

 

After that we need to add the cert-manager repository to Helm and install the chart into our cluster.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --set installCRDs=true
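
Before moving on, you can confirm that the cert-manager pods are up:

kubectl get pods -n cert-manager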

 

And now let’s install Ollama in our cluster. For this we need to create a values.yaml file to configure our Ollama installation, for example which models to pull and how many resources it should use.

ollama:
  gpu:
    enabled: false
    number: 1
  models:
    - mistral
    - llama2

persistentVolume:
  enabled: true
  size: 100Gi

resources:
  limits:
    cpu: '8000m'
    memory: '8192Mi'
  requests:
    cpu: '4000m'
    memory: '4096Mi'

 

We can now install Ollama with our values.yaml file by running:

helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama -f ./values.yaml
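
The first start can take a while, since the models listed in values.yaml have to be downloaded into the persistent volume. You can watch the pod and, once it is ready, list the pulled models (this assumes the chart’s default deployment name, ollama):

kubectl get pods -n ollama -w
kubectl -n ollama exec deploy/ollama -- ollama list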

 

K8S resources

letsencrypt.yaml

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt
  namespace: ollama
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    # Required by cert-manager: the Secret that stores the ACME account private key.
    # The name below is an arbitrary choice.
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
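
After saving the file, apply it and check that the issuer becomes ready (the manifest already carries namespace: ollama, so no -n flag is needed for the apply):

kubectl apply -f letsencrypt.yaml
kubectl -n ollama get issuer letsencrypt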

 

ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/use-regex: 'true'
    cert-manager.io/issuer: 'letsencrypt'
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.example.com
      secretName: tls-secret
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ollama
                port:
                  number: 11434
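
Make sure llm.example.com points to the external IP of the ingress-nginx controller, then apply the Ingress and watch cert-manager issue the certificate:

kubectl apply -f ingress.yaml -n ollama
kubectl -n ollama get certificate
kubectl -n ollama get ingress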

 

Testing

You can send a request from your terminal using curl:

curl https://llm.example.com/api/generate -d '{
 "model": "llama2",
 "prompt": "Why is the sky blue?",
 "stream": false
}'
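
Ollama also exposes a chat-style endpoint; here is a quick sketch, assuming the llama2 model has finished downloading:

curl https://llm.example.com/api/chat -d '{
 "model": "llama2",
 "messages": [
   {"role": "user", "content": "Why is the sky blue?"}
 ],
 "stream": false
}'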

 

Or you can send it through Postman

Postman Request

That’s all! Now you have a running API for your LLM models. You can scale it or optimize it by configuring your Kubernetes resources.

If you would like to install a home Kubernetes cluster, check out my previous post about it.

