[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 #15313
Replies: 62 comments 100 replies
-
From this, IMO the only thing missing is a Linux+CUDA bundle to make it download-and-use. If we want better packaging on Linux, we could also work on a snap or a bash installer on top of the pre-built packages.
-
It's high time Hugging Face copied Ollama's packaging and GTM strategy, but this time gave credit to llama.cpp. Ideally, llama.cpp should remain the core component.
-
Is the barrier the installation process, or the need to use a complex command line to launch llama.cpp?
-
For me the biggest thing is I'd love to see more emphasis placed on […]. My ideal would be for the […]. Maybe include systray integration and a simple UI for selecting and downloading models too. At that point […]
-
It would be cool if 'llama-server' had an auto-configuration option for the machine/model, like 'ollama' does.
-
For Windows, maybe Chocolatey and the Microsoft Store would be a good idea? 🤔
-
I created an RPM spec to manage installation, though I think Flatpaks might be more user-friendly and distribution-agnostic.
-
The released Windows builds are available via Scoop, and updates happen automatically. Old installed versions are kept, and the current one is symlinked into a "current" folder, which puts the executables on the PATH.
-
Is it feasible to have a single release per OS that includes all the backends?
-
For Linux I just install the Vulkan binaries and run the server from there. Maybe we can have an install script like Ollama's that detects the system and launches the server, which can be controlled from an app as well as the CLI? The user then gets basic command-line utilities like run, start, stop, load, list, etc.
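An Ollama-style install script like the one suggested above would need a backend-detection step first. Here's a minimal sketch, assuming that probing for `nvidia-smi` and `vulkaninfo` is an acceptable heuristic; the backend names echoed below are illustrative labels, not actual release asset names:

```shell
#!/bin/sh
# Hypothetical sketch: pick the best available llama.cpp backend by
# probing for vendor tools on the PATH. Falls back to CPU-only.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v vulkaninfo >/dev/null 2>&1; then
    echo vulkan
  else
    echo cpu
  fi
}

backend=$(detect_backend)
echo "selected backend: $backend"
```

A real installer would then map the detected backend to the matching pre-built archive for the host OS and architecture, which is exactly the part that needs the Linux+CUDA bundles discussed earlier in the thread.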
-
On Mac, the easiest way (also arguably the safest way) from a user's perspective is to find it in the App Store and install it from there. Because App Store apps run in a sandbox, installing or uninstalling is simple and clean from a user's point of view. Creating a build and passing App Store review might take some effort (due to the sandbox constraints), but it should be a one-time thing.
-
It's my understanding that none of the automated installs support GPU acceleration. I might be wrong, but it's definitely the case for Windows, which makes it useless to install via winget.
-
To me, the biggest advantage Ollama currently has is that the optimal settings for a model are bundled. The GGUF spec is versatile enough to allow this too, as a metadata field inside the model: people could load the settings from a GGUF, and frontends could extract and adapt them as they see fit. I think that part is going to be more valuable than obtaining the binary, since downloading the binary from GitHub is not that hard.
-
My personal wishlist: […]
-
I think delegating installation to external projects is entirely legitimate. The end "product" of llama.cpp being the C headers + implementation is an acceptable place for this project to be. I would much rather have the maintainers of llama.cpp not have to deal with the deluge of installation and packaging idiosyncrasies. I understand a lot of people who are posting issues on this project are users of the server or CLI, but I think there's a sampling bias where most users of this repo are actually through llama-cpp-python, ollama, or some other wrapper (my own being llama-cpp-rs). I do not think this should be considered a problem to be solved.
-
Hi everyone! I've just done a major refactor of the build process and published all the work here: https://github.com/angt/installama.sh. All fixes and new features (like the new downloader or the OpenSSL/BoringSSL backend) are either merged or on their way upstream. Let's see how far we can go in this direction :) For now, let's use this new repo to track issues and feature requests.
-
Hey everyone, just want to make you aware that I started adding llama.cpp to Spack. See spack/spack-packages#2437
-
Great to see the continued activity on this topic. As an outside observer / newcomer, llama.cpp simply feels to me like hard-core development software. People like me who come from a development background but haven't seen a Makefile in decades are forced into LM Studio / Ollama because it takes way too long to retrain atrophied muscles, even when we prefer the elegance of minimally necessary software solutions (i.e. no shim layers). We also run into continued misinformation. For example, this is what Google returns when I search for "how to install llama.cpp with gpu support" (excerpted): […]
Even in this thread, a recent commenter thought (erroneously) that "none of the automated installs support GPU acceleration". Reading through this thread and others (like #8188), it seems like llama.cpp is already most of the way to a decent packaging strategy. From what I can tell, a few added lines in install.md plus some documentation restructuring would go a long way.
-
I've hated on Ollama for as long as it has existed, but let's be frank here: there's a reason Ollama caught on and llama.cpp/LM Studio/etc. didn't. I would never install Ollama on my machines for a variety of reasons, but I also find it insane that llama.cpp devs, despite being great at C++ and hardcore stuff, have not shown any sign of wanting to make the server accessible to more groups. It's been 2 hours since I started the build process of the current commit on my 40-core workstation (dual GPU) because […]
-
install-llama-cpp.zip
-
*** EDITED after reviewing previous comments regarding CLI-based installs. I fully agree with @ibehnam's comment. This is my main beef with llama.cpp as well. To install Ollama, all you need to do is go to ollama.com and click the "Download" button at the top right. If you are somewhat technical and don't want an app install, the docs page has clear and concise information for the major methods, most of which are about a page long with a LOT of whitespace. The longest is Linux (no surprise), but even that is about 3 pages. To my knowledge, all support GPU. IMHO this is the #1 reason why Ollama gets mindshare over llama.cpp. I believe llama.cpp is close with the existing methods, plus installama.sh or some other CLI-based method, but it requires testing and documentation, especially for the corner cases, rather than fragmented methods that can only be found in a 150-item-long thread. Hopefully this doesn't sound harsh. It comes from a strong desire to see mainstream adoption for all the good work the devs have put in, which I assume we all want.
-
Compiling llama.cpp with CUDA takes forever if you don't use a multithreaded build with -j.
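For reference, a parallel CUDA build looks roughly like this; `-DGGML_CUDA=ON` is the upstream CMake flag for the CUDA backend, and restricting `CMAKE_CUDA_ARCHITECTURES` to your GPU's architecture can also cut build time substantially. Wrapped in a function here so the heavy build only runs when invoked:

```shell
# Clone and build llama.cpp with the CUDA backend, using all CPU cores
# for the compile step (-j"$(nproc)").
build_llama_cuda() {
  git clone https://github.com/ggml-org/llama.cpp &&
  cd llama.cpp &&
  cmake -B build -DGGML_CUDA=ON &&
  cmake --build build --config Release -j"$(nproc)"
}
```

On a many-core machine the difference between this and a default single-job build is the difference between minutes and hours.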
-
We have had a GH build setup as part of https://github.com/hybridgroup/llama-cpp-builder for Ubuntu CUDA and Vulkan binaries for a couple of months now, if this is of any use to the community.
-
Ideally, llama.cpp should use libggml from its standalone repository, so distributions can ship libggml as one package and llama.cpp as another. Currently, llama.cpp needs to bundle its own version of libggml (in /usr/lib/llama.cpp), whisper.cpp bundles its own, and shipping a standalone libggml is kinda pointless since it's mostly unused. Would using libggml as a submodule not make your lives simpler too (i.e. no more syncing between repos)?
-
I don't think it's reasonable to expect end users to compile this software in order to install or use it. If llama.cpp can prepare more builds (e.g. CUDA on Linux), then more third-party packagers (homebrew, mise, aqua, asdf, etc.) can add a plugin to download and install them. I'm preparing an aqua plugin now so that I can […]
As far as release cadence and versioning go, fast development is fine, but it's not a great user experience. I think the vast majority of users want an occasional update with sane versioning. Development/alpha/beta/rc channels are the usual place for constant updates.
-
Hi! We've started a new discussion specifically for Debian and Ubuntu packaging. Testers and feedback welcome.
-
How can this problem be solved? I'm using llama-b8254-bin-910b-openEuler-aarch64-aclgraph.tar.gz with CANN 8.5.1.
-
I'm working on this; it's currently just a demo in development. Note that it heavily utilizes AI coding with human review, and it currently aims to provide edge inference for all platforms. So far it only supports Windows and has not been officially packaged yet. Anyone interested is welcome to participate.
-
compared to […]
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers, including me. The project has also consistently become faster over time, as has its coverage beyond LLMs to VLMs, AudioLMs, and more.
One piece of feedback we keep getting from the community is how difficult it is to use llama.cpp directly. Oftentimes users end up using Ollama or GUIs like LM Studio or Jan (there are many more that I'm missing). However, it'd be great to offer a friendlier, easier path to use llama.cpp for end consumers too.
Currently, if someone were to use llama.cpp directly:
- brew install llama.cpp works […]

This adds a barrier for non-technically-inclined people, especially since with all the above methods users would have to reinstall llama.cpp to get upgrades (and llama.cpp makes releases per commit: not a bad thing in itself, but it becomes an issue since you need to upgrade more frequently).
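On the upgrade-friction point: with Homebrew at least, staying current is a one-liner, though the user still has to remember to run it. A tiny sketch, assuming Homebrew is installed (the formula name llama.cpp is the real one; the wrapper name is made up):

```shell
# Hypothetical helper: upgrade llama.cpp if it's installed via brew,
# otherwise install it fresh. Fails cleanly when brew is absent.
update_llama() {
  if ! command -v brew >/dev/null 2>&1; then
    echo "Homebrew not found" >&2
    return 1
  fi
  brew upgrade llama.cpp 2>/dev/null || brew install llama.cpp
}
```

An installer that self-updates (the way Ollama's app does) would remove even this manual step, which is the gap the per-commit release cadence makes painful.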
Opening this issue to discuss what could be done to package llama.cpp better and allow users to maybe download an executable and be on their way.
More so, are there people in the community interested in taking this up?