Here are the steps to get a model running on your not-so-powerful computer:
- Install llama.cpp (you can also build it from source with CMake):
brew install llama.cpp
Other installation options are described in the llama.cpp documentation.
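If you prefer building from source, a minimal CMake build looks roughly like this (assuming the ggml-org/llama.cpp repository and a plain CPU Release build; GPU backends need extra CMake flags):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release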
- Download the Gemma 3 model from unsloth (https://huggingface.co/unsloth). The 1-billion-parameter version should run on most CPUs; it runs faster on a GPU, in which case you need a llama.cpp build with GPU support:
llama-cli -hf unsloth/gemma-3-1b-it-GGUF
You can also choose a different level of weight quantization; the model is offered in 1- to 16-bit variants.
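To pick a specific quantization, you can append its name to the -hf argument; for example, assuming the repository offers a Q4_K_M variant:
llama-cli -hf unsloth/gemma-3-1b-it-GGUF:Q4_K_M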
- Once the download is complete, you'll land in a "waiting for prompt" area. You can test the model there, but it's more fun to run it in server mode, accessible via an HTTP API.
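For a quick smoke test in that interactive mode, you can also pass a prompt directly on the command line (the -p flag supplies the initial prompt; exact behavior can vary between llama.cpp versions):
llama-cli -hf unsloth/gemma-3-1b-it-GGUF -p "Explain what GGUF is in one sentence."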
- To run the model in an OpenAI API-compatible server:
llama-server -hf unsloth/gemma-3-1b-it-GGUF
You should now be able to access the UI in your browser at http://localhost:8080.
The chat completions endpoint is available at http://localhost:8080/v1/chat/completions.
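As a sanity check, you can hit that endpoint with a request in the OpenAI chat format; a minimal example might look like this (the "model" field is largely ignored, since the server hosts a single model):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'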