How to run an LLM locally on a PC or server

Here are the steps to get a model running on your not-so-powerful computer:

  1. Install llama.cpp (you can also build it from source with CMake):
brew install llama.cpp

Other installation methods are documented in the llama.cpp repository.
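
If you'd rather build from source, the usual CMake flow looks roughly like this (a minimal sketch; check the llama.cpp build documentation for platform-specific options and GPU backends such as CUDA or Metal):

# clone the repository and build the binaries (llama-cli, llama-server end up in build/bin)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release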

  2. Download the Gemma 3 model from Unsloth (https://huggingface.co/unsloth). The 1-billion-parameter version should work on most CPUs (it runs faster on a GPU, in which case you need llama.cpp built with GPU support):
llama-cli -hf unsloth/gemma-3-1b-it-GGUF

You can also choose a different weight quantization; the models are offered in 1- to 16-bit variants.
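
For example, to download a specific quantization you can append its tag to the -hf argument (Q4_K_M here is only an illustration; the available tags depend on what the unsloth repo actually publishes):

llama-cli -hf unsloth/gemma-3-1b-it-GGUF:Q4_K_M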

  3. Once the download is complete, you'll land in an interactive prompt. You can test the model there (a quick one-shot example is sketched below), but it's more fun to run it in server mode, accessible via an HTTP API.
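
As a quick sanity check from the command line, you can pass a prompt directly with -p; depending on your llama.cpp version you may also need -no-cnv to disable the interactive chat mode (a sketch, not the only way to do it):

llama-cli -hf unsloth/gemma-3-1b-it-GGUF -p "Explain quantization in one sentence." -no-cnv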

  4. To run the model behind an OpenAI-API-compatible server:

llama-server -hf unsloth/gemma-3-1b-it-GGUF
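
llama-server accepts the usual llama.cpp runtime flags, so you can tune it for your machine. A sketch with illustrative values: --host and --port control where it listens, -c sets the context size, and -ngl offloads layers to the GPU if you built with GPU support:

llama-server -hf unsloth/gemma-3-1b-it-GGUF --host 0.0.0.0 --port 8080 -c 4096 -ngl 99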

You should now be able to access the built-in web UI in your browser at http://localhost:8080.

The chat completions endpoint is available at http://localhost:8080/v1/chat/completions.
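
A minimal curl example against that endpoint (the "model" field is largely ignored because llama-server serves the single model it loaded, but OpenAI-style clients expect it):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }'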
