Self-Host Your Own LLM Inference Farm with llm-dock: The Ultimate Local AI Dashboard

Introduction

Managing local large language models (LLMs) has traditionally meant wrestling with shell scripts, juggling runners like llama.cpp and vLLM, and endlessly tweaking parameters. For developers and AI enthusiasts who want the latest builds without vendor lock‑in, the friction is real. Enter llm‑dock: a self‑hosted dashboard that orchestrates LLM inference using Docker Compose, reuses your existing Hugging Face cache, and provides a unified interface through Open WebUI. This guide walks you through deploying your own private, customizable LLM environment on Linux—no cloud required.

Learning Objectives

  • Understand how llm‑dock combines llama.cpp, vLLM, and Open WebUI into a cohesive Docker‑based platform.
  • Successfully install and configure llm‑dock on a Linux host, including building custom images optimized for your hardware.
  • Manage local model downloads, run inference benchmarks, and tune parameters for performance and accuracy.

1. What is llm‑dock and Why Use It?

llm‑dock is a collection of scripts and Docker Compose definitions that let you spin up multiple inference backends with minimal effort. Instead of maintaining separate setups for GGUF models (via llama.cpp) and safetensors models (via vLLM), llm‑dock unifies them behind a single dashboard. Key benefits include:

  • Cache reuse – Models downloaded with `huggingface-cli` are automatically detected, avoiding duplicate storage.
  • Latest builds – Scripts compile llama.cpp and vLLM from the latest commits, giving you cutting‑edge features and performance.
  • Web UI – Open WebUI provides a ChatGPT‑like interface for your local models.
  • Parameter guidance – Built‑in help reduces confusion around the hundreds of llama.cpp options.

This approach is ideal for privacy‑conscious users, teams building internal AI tools, or anyone tired of cloud latency and costs.

2. Prerequisites and Initial Setup

llm‑dock is Linux‑only and relies on Docker and Docker Compose. Ensure your system meets these requirements:

  • A recent Linux distribution (Ubuntu 22.04+ or equivalent)
  • Docker Engine (20.10+) and Docker Compose (v2) installed
  • Git
  • Sufficient disk space for models and Docker images (50+ GB recommended)
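
The requirements above can be checked before running anything else. What follows is a minimal preflight sketch and not part of llm‑dock itself: the helper names `version_ge` and `check_tool` are made up for illustration, the 50 GB threshold simply mirrors the recommendation above, and the version comparison assumes GNU `sort -V` is available.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not shipped with llm-dock).

# Succeeds if version $1 is greater than or equal to version $2.
# Assumption: GNU sort with -V (version sort) is installed.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reports whether a command is on PATH; never aborts the script.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1"
  fi
}

check_tool git
check_tool docker

if command -v docker >/dev/null 2>&1; then
  # The client version can be read even when the daemon is unreachable.
  docker_version="$(docker version --format '{{.Client.Version}}' 2>/dev/null || true)"
  if [ -n "$docker_version" ] && version_ge "$docker_version" "20.10"; then
    echo "ok: Docker client $docker_version"
  else
    echo "warn: Docker client '${docker_version:-unknown}' is older than 20.10 or unreadable"
  fi
  if docker compose version >/dev/null 2>&1; then
    echo "ok: Docker Compose v2 plugin"
  else
    echo "missing: Docker Compose v2 plugin"
  fi
fi

# Free space on the current filesystem, in whole GB.
free_gb="$(df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }')"
if [ "${free_gb:-0}" -ge 50 ]; then
  echo "ok: ${free_gb} GB free in $(pwd)"
else
  echo "warn: only ${free_gb:-0} GB free in $(pwd); 50+ GB is recommended"
fi
```

Run it from the directory where you plan to clone the repository, since the disk check looks at the current filesystem.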

Step‑by‑step setup:

# Clone the repository
git clone https://github.com/teo-mateo/llm-dock.git
cd llm-dock

# Run the interactive setup script
./setup.sh

The `setup.sh` script checks dependencies, creates necessary directories, and guides you through initial configuration. If it stops with an error, the usual causes are missing Docker permissions or an outdated Docker version. Typical fixes:

  • Add your user to the `docker` group: `sudo usermod -aG docker $USER` (log out and back in)
  • Install Docker Compose plugin: `sudo apt install docker-compose-plugin`
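
To tell which of these fixes applies, it can help to look at the current session's groups and at whether the daemon answers without `sudo`. The sketch below is illustrative only; `in_group` is a hypothetical helper, and the check assumes the standard `docker` group name used above.

```shell
#!/bin/sh
# Diagnose Docker permission problems (hypothetical helper script).

# Succeeds if group $1 appears in the space-separated list $2.
in_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the group names of the current session.
session_groups="$(id -nG)"
if in_group docker "$session_groups"; then
  echo "this session is in the docker group"
else
  echo "this session is NOT in the docker group (log out and back in after usermod)"
fi

# docker info needs a reachable daemon, so it doubles as a permission test.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  echo "docker daemon is reachable without sudo"
else
  echo "docker daemon is not reachable as $(id -un)"
fi
```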

After setup, you’ll have a skeleton configuration ready for customization.

3. Building Custom Images for Your Hardware

llm‑dock includes scripts to build Docker images tailored to your CPU/GPU. By default, it uses pre‑built images, but building from source ensures you get the latest optimizations (e.g., CUDA, ROCm, or Vulkan support for llama.cpp).

Build llama.cpp with GPU support:

cd docker/llamacpp
./build.sh --cuda  # For NVIDIA GPUs
# or --rocm for AMD, --vulkan for universal

Build vLLM from source:

cd docker/vllm
./build.sh

These scripts clone the upstream repositories, apply any pending patches, and compile using your host’s Docker setup. After building, the images are tagged locally (e.g., llamacpp:latest-cuda) and ready for use.

Note: Building vLLM requires substantial memory (16+ GB recommended). If you face OOM errors, consider using the pre‑built images first.
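
After a build finishes, a quick way to confirm the images exist is to list local tags and look for the expected names. This is a rough sketch: `llamacpp:latest-cuda` and `vllm:latest` are the example tags used in this article and may differ from what your build produces, and `has_image` and `list_images` are hypothetical helpers.

```shell
#!/bin/sh
# Check that locally built images are present (tags are assumptions).

# Reads "repository:tag" lines on stdin; succeeds if $1 matches one exactly.
has_image() {
  grep -Fxq -- "$1"
}

# Prints every local image as repository:tag.
list_images() {
  docker image ls --format '{{.Repository}}:{{.Tag}}' 2>/dev/null
}

for image in llamacpp:latest-cuda vllm:latest; do
  if command -v docker >/dev/null 2>&1 && list_images | has_image "$image"; then
    echo "found: $image"
  else
    echo "not found: $image"
  fi
done
```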

4. Adding and Managing Models

One of llm‑dock’s strengths is its integration with the Hugging Face cache. Models you’ve already downloaded with `huggingface-cli` are automatically recognized.
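
To see what is already in the cache, you can inspect the cache directory directly. The sketch below assumes the standard Hugging Face layout, where the hub cache lives under `$HF_HUB_CACHE` or `$HF_HOME/hub` (defaulting to `~/.cache/huggingface/hub`) and each model sits in a folder named `models--<org>--<name>`. The function names are hypothetical, and how llm‑dock itself scans the cache is not shown here.

```shell
#!/bin/sh
# List models present in the Hugging Face hub cache (illustrative sketch).

# Prints the hub cache directory, honoring HF_HUB_CACHE and HF_HOME.
hf_cache_dir() {
  if [ -n "${HF_HUB_CACHE:-}" ]; then
    printf '%s\n' "$HF_HUB_CACHE"
  else
    printf '%s\n' "${HF_HOME:-$HOME/.cache/huggingface}/hub"
  fi
}

# Prints one model ID per cached model folder found in directory $1.
cached_models() {
  for cache_entry in "$1"/models--*; do
    [ -d "$cache_entry" ] || continue
    folder="${cache_entry##*/models--}"
    # Folder names use "--" where the model ID has "/".
    printf '%s\n' "$folder" | sed 's/--/\//'
  done
}

cached_models "$(hf_cache_dir)"
```

`huggingface-cli scan-cache` reports similar information together with sizes, if you prefer the official tool.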

Download a model (example):

huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models/llama2-7b

Place models in the `./models` directory (created by setup.sh). llm‑dock supports both GGUF (for llama.cpp) and safetensors (for vLLM). The service definitions in `docker-compose.yml` mount this directory, so containers can access the files.

Model naming convention:

  • For llama.cpp, use `.gguf` files.
  • For vLLM, point to a Hugging Face model ID or a local path containing safetensors.

You can edit the `docker-compose.override.yml` (or the main compose file) to set the exact model path via environment variables like MODEL_PATH.
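
When filling in a model path, remember that the container sees the files under its own mount point, not under the host path. The following sketch lists GGUF files on the host and prints the path each one would have inside a container, assuming the `./models:/models` volume mapping used in this article. `list_gguf` and `container_path` are hypothetical helper names.

```shell
#!/bin/sh
# Map host model files to in-container paths (assumes ./models:/models).

# Prints every .gguf file under directory $1 (default ./models), sorted.
list_gguf() {
  find "${1:-./models}" -type f -name '*.gguf' 2>/dev/null | sort
}

# Converts host path $1 under models dir $2 into the path under /models.
container_path() {
  models_dir="${2:-./models}"
  printf '/models/%s\n' "${1#"$models_dir"/}"
}

list_gguf ./models | while IFS= read -r model_file; do
  echo "$model_file -> $(container_path "$model_file" ./models)"
done
```

The right-hand side of each line is the value to use wherever the compose file expects a model path.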

5. Configuring and Running Inference Services

The core of llm‑dock is the Docker Compose stack. It typically includes three services:

  • `llamacpp` – Runs the llama.cpp server
  • `vllm` – Runs the vLLM inference server
  • `open-webui` – Provides the chat interface

Start all services:

docker compose up -d

This launches the containers in detached mode. The web UI is accessible at `http://localhost:3000` (default). You can configure which backend to use by setting the `OPENAI_API_BASE_URL` in the Open WebUI settings to point to either the llama.cpp or vLLM endpoint.

Example llama.cpp service snippet (from docker-compose.yml):

llamacpp:
  image: llamacpp:latest-cuda  # your built image
  command: --model /models/llama2-7b/llama-2-7b-chat.Q4_K_M.gguf --port 8080 --n-gpu-layers 99
  volumes:
    - ./models:/models
  ports:
    - "8080:8080"

Tweak parameters like `--n-gpu-layers` to offload layers to GPU, or `--ctx-size` for context length.
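
Large models can take a while to load, so the server may not answer right after `docker compose up -d`. A small polling loop like the one below can confirm that a backend is ready before you point Open WebUI at it. This is a sketch under assumptions: it relies on `curl` being installed and on the backend exposing a `/health` route, which the upstream llama.cpp server does; the port is the 8080 mapping from the snippet above, and `wait_for_url` is a hypothetical helper name.

```shell
#!/bin/sh
# Poll an HTTP endpoint until it answers with a success status.

# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  wait_url="$1"
  wait_attempts="${2:-30}"
  wait_delay="${3:-2}"
  wait_try=1
  while [ "$wait_try" -le "$wait_attempts" ]; do
    # -f makes curl fail on HTTP errors such as 503 while a model loads.
    if curl -fsS --max-time 5 -o /dev/null "$wait_url" 2>/dev/null; then
      echo "ready: $wait_url"
      return 0
    fi
    wait_try=$((wait_try + 1))
    sleep "$wait_delay"
  done
  echo "not ready after $wait_attempts attempts: $wait_url"
  return 1
}

# Example call (commented out so the script has no side effects):
# wait_for_url http://localhost:8080/health 30 2
```

If the loop times out, `docker compose logs llamacpp` is usually the fastest way to see whether the model path or a GPU flag is the problem.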

6. Benchmarking with Built‑in Tools

llm‑dock includes a benchmarking script for llama.cpp to help you measure tokens per second and latency. This is invaluable when comparing hardware configurations or parameter changes.

Run a benchmark:

cd scripts
./benchmark.sh --model /models/your-model.gguf --prompt "Once upon a time" --n-predict 256

The script outputs metrics like:

llama_print_timings: load time = X ms
llama_print_timings: sample time = Y ms / 256 runs
llama_print_timings: prompt eval time = Z ms / 5 tokens
llama_print_timings: eval time = W ms / 256 runs
llama_print_timings: total time = V ms

Use these numbers to compare batch sizes, GPU offloading, and thread counts. The eval line is the one to watch for generation speed: dividing its run count by its time in seconds gives tokens per second.
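
If you save benchmark output to a file, generation throughput can be pulled out of the timing lines with a few lines of awk. The sketch below assumes the `llama_print_timings` layout shown above (time in milliseconds, followed by the number of runs) and ignores the prompt eval line. `tokens_per_second` is a hypothetical helper, not part of llm‑dock.

```shell
#!/bin/sh
# Compute generation tokens/second from llama_print_timings output.

# Reads a log on stdin and prints tokens per second for the eval line.
tokens_per_second() {
  awk '
    /eval time/ && !/prompt eval time/ {
      ms = 0; runs = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "=") ms = $(i + 1)                      # value after "=" is milliseconds
        if ($i == "runs") { runs = $(i - 1); break }      # value before "runs" is the token count
      }
      if (ms > 0) printf "%.2f\n", runs / (ms / 1000)
    }'
}

# Example with made-up numbers: 256 tokens generated in 5120 ms.
printf '%s\n' \
  'llama_print_timings: prompt eval time = 100.00 ms / 5 tokens' \
  'llama_print_timings: eval time = 5120.00 ms / 256 runs' | tokens_per_second
```

The example input is invented for illustration and prints 50.00; with real logs, pipe the benchmark output into `tokens_per_second` instead.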

7. Advanced Tweaks: Multi‑GPU and Parameter Optimization

For users with multiple GPUs, both llama.cpp and vLLM can distribute the workload. In llama.cpp, use `--tensor-split` to set the proportion of the model placed on each GPU. For vLLM, pass `--tensor-parallel-size` on the command line.

Example vLLM command with tensor parallelism:

vllm:
  image: vllm:latest
  command: --model /models/mistral-7b --tensor-parallel-size 2 --gpu-memory-utilization 0.9
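
On NVIDIA hardware, starting values for both flags can be derived from what `nvidia-smi` reports. The sketch below is only a starting point under assumptions: it splits in proportion to total GPU memory, which ignores memory already in use, and it sets the tensor-parallel size to the GPU count. `split_from_mem` is a hypothetical helper name.

```shell
#!/bin/sh
# Suggest multi-GPU flags from nvidia-smi output (NVIDIA only).

# Reads one memory size in MiB per line; prints whole GiB values joined by commas.
split_from_mem() {
  awk '{ printf "%s%d", (NR > 1 ? "," : ""), int($1 / 1024) }
       END { if (NR > 0) print "" }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One line per GPU, total memory in MiB without a unit suffix.
  mem_list="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null)"
  gpu_count="$(printf '%s\n' "$mem_list" | grep -c . || true)"
  echo "GPUs detected: $gpu_count"
  echo "llama.cpp: --tensor-split $(printf '%s\n' "$mem_list" | split_from_mem)"
  echo "vLLM:      --tensor-parallel-size $gpu_count"
else
  echo "nvidia-smi not found; skipping GPU detection"
fi
```

Treat the output as a first guess. vLLM generally expects the model's attention head count to be divisible by the tensor-parallel size, so a value equal to the GPU count is not always valid.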

Parameter selection can be daunting. llm‑dock’s web UI provides tooltips for common llama.cpp options. For fine‑grained control, refer to the official documentation of each runner.

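If you encounter port conflicts (e.g., port 8080 already in use), change the host side of the port mapping. Compose appends `ports` entries from docker-compose.override.yml to those in the main file instead of replacing them, so either edit the mapping where it is defined or, on recent Compose releases, mark the list with `!override`:

services:
  llamacpp:
    ports: !override
      - "8081:8080"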

Then update the Open WebUI backend URL accordingly.
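
To find out whether a port is already taken before starting the stack, you can inspect the listening sockets. The sketch below parses `ss -ltn` output, so it assumes the `ss` tool from iproute2 is installed; `port_listed` is a hypothetical helper name and 8080 is the example port from this article.

```shell
#!/bin/sh
# Check whether a TCP port already has a listener.

# Reads `ss -ltn` output on stdin; succeeds if local port $1 is listed.
port_listed() {
  awk -v p="$1" '$4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

if command -v ss >/dev/null 2>&1; then
  if ss -ltn | port_listed 8080; then
    echo "port 8080 is in use; pick another host port"
  else
    echo "port 8080 looks free"
  fi
else
  echo "ss not found; skipping port check"
fi
```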

What Undercode Say:

llm‑dock lowers the barrier to running state‑of‑the‑art LLMs locally, but it’s not a plug‑and‑play appliance—expect to invest time in hardware tuning and parameter experimentation. The true value lies in its modularity: you can swap runners, rebuild with the latest commits, and keep your model cache intact. By embracing Docker, it also isolates dependencies, reducing “works on my machine” headaches. However, the Linux‑only limitation and need for Docker expertise may deter casual users. For teams and tinkerers, though, it’s a powerful foundation for private AI infrastructure. As more organizations demand data sovereignty, tools like llm‑dock will become essential.

Prediction:

The trend toward self‑hosted AI will accelerate as models become more efficient and hardware more capable. In the next 12–24 months, we’ll see integrated solutions that combine model management, fine‑tuning, and monitoring—much like what llm‑dock hints at. This shift will empower small teams to deploy custom AI without cloud dependencies, fundamentally changing how AI is consumed in enterprise and research environments.

Reported By: Teodor Bardici – Hackers Feeds