AI Without a GPU: What a 16GB VPS Can Actually Run

You do not need an expensive GPU server to do real, useful AI. A practical, honest look at what a single modest machine with no GPU can run, what it cannot, and how I host a whole platform on one small box.

12 June 2026 · 5 min read

There is a quiet assumption in a lot of AI conversations that you need a wall of expensive GPUs to do anything real. You do not. A lot of genuinely useful AI runs perfectly well on a plain machine with no GPU at all.

I can say this with some confidence, because I run a whole platform of demos on a single modest VPS. One box, sixteen gigabytes of memory, no GPU. So this post is about what is actually possible on hardware like that, what is not, and where the honest line sits. If you are a small team, a department, or just someone tinkering, this is the practical version.

First, separate two very different jobs

Most confusion about hardware comes from mixing up two things that have wildly different costs.

Training a model from scratch is the heavy, expensive part. That genuinely wants serious GPUs, and unless you are a research lab, you almost never need to do it.

Running a model that someone else already trained, which is called inference, is far, far cheaper. This is what nearly everyone actually needs, and a lot of it runs comfortably on an ordinary machine.

Once you stop conflating the two, the whole thing gets much less intimidating. You are not training. You are running. And running is cheap.

What a 16GB box with no GPU runs comfortably

Here is the genuinely useful stuff that does not need a GPU at all.

The classic machine learning models. Everything from the earlier post in this series (decision trees, random forests, and gradient boosting) runs on a CPU in seconds on typical tabular datasets. These methods were designed before GPU computing was common. For anything that looks like a table of data, a plain machine is more than enough.
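To show why these models are so light, here is a minimal sketch of the simplest possible decision tree, a one-split "stump", in plain Python with no libraries. It is an illustration written for this post, not code from the earlier one, and the function names and toy data (hours studied against pass or fail) are hypothetical. Fitting amounts to trying thresholds and counting mistakes, which is exactly the kind of work a CPU does quickly.

```python
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-split decision tree (a "stump") on one numeric feature.

    Tries a threshold halfway between each pair of adjacent distinct values
    and keeps the split that misclassifies the fewest training examples when
    each side predicts its majority label.
    """
    pairs = sorted(zip(xs, ys))
    best = None  # (errors, threshold, left_label, right_label)
    for (x_prev, _), (x_next, _) in zip(pairs, pairs[1:]):
        if x_prev == x_next:
            continue  # no threshold can separate equal values
        threshold = (x_prev + x_next) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, left_label, right_label)

def predict_stump(stump, x):
    """Predict a label by comparing x against the learned threshold."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label

hours = [1, 2, 3, 6, 7, 8]
result = ["fail", "fail", "fail", "pass", "pass", "pass"]
stump = fit_stump(hours, result)
print(stump)                    # (4.5, 'fail', 'pass')
print(predict_stump(stump, 5))  # pass
```

Real libraries grow deeper trees and combine many of them into forests or boosted ensembles, but the core operation is still comparing numbers against thresholds.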

Search and retrieval, the engine behind RAG. Turning your documents into a searchable form and finding the right passage for a question are both light work. The retrieval half of a RAG system runs happily without a GPU. This is how you can build a useful “look it up in our documents” tool on modest hardware.
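To make "light work" concrete, here is a minimal retrieval sketch in plain Python. It swaps in simple TF-IDF keyword scoring with cosine similarity rather than the neural embedding models many RAG systems use, and the function names and sample documents are hypothetical, but the shape is the same: turn each document into a vector once, then score a question against every vector and return the best match.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Turn each document into a TF-IDF weighted term vector (a dict)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(question, docs, idf, vectors):
    """Return (score, document) for the best-matching document."""
    q_tf = Counter(tokenize(question))
    # Terms never seen in the documents carry no weight.
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scored = [(cosine(q_vec, v), doc) for v, doc in zip(vectors, docs)]
    return max(scored, key=lambda pair: pair[0])

docs = [
    "Leave requests must be submitted two weeks in advance.",
    "The office Wi-Fi password is changed every month.",
    "Expense claims need a receipt and manager approval.",
]
idf, vectors = build_index(docs)
score, best = search("When must leave requests be submitted?", docs, idf, vectors)
print(best)  # Leave requests must be submitted two weeks in advance.
```

A production system would usually use embeddings and a proper vector index, but the retrieval step stays modest enough for a CPU.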

Small and distilled models. A lot of models now come in compact versions made specifically to run cheaply. I pointed out earlier that the translation models behind Bhashini ship in distilled versions of two to three hundred million parameters, small enough to run on a CPU. Small speech recognition models are in the same boat. They are smaller and slightly less accurate than their full-size versions, but real and usable.

Small language models, if you are patient. You can run a compact, quantised language model on CPU. Quantised just means the model's numbers (its weights) are stored at lower precision, typically 8 or 4 bits instead of 16 or 32, so it uses less memory and runs faster, trading a sliver of quality for a big drop in cost. It will not be instant, and it will not match the giant hosted models, but for drafting and simple tasks on a private box, it works. I run a small Gemma model this way for one of my demos.
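To show what "squeezed" means, here is a minimal sketch of symmetric 8-bit quantisation in plain Python: each weight is divided by one shared scale, rounded to a whole number between -127 and 127, and stored in one byte instead of four. Real formats used for CPU inference are more elaborate (they typically use 4-bit values with a separate scale for each small block of weights), so treat this as an illustration of the idea rather than of any particular library. The weights are hypothetical.

```python
from array import array

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation, so that w ≈ q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Typecode "b" is a signed char: one byte per value, range -128..127.
    q = array("b", (round(w / scale) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights stored as 32-bit floats (typecode "f", 4 bytes each).
weights = array("f", [0.42, -1.27, 0.05, 0.9, -0.33])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(weights.itemsize * len(weights), "bytes before")  # 20 bytes before
print(q.itemsize * len(q), "bytes after")               # 5 bytes after
print(list(q))                                          # [42, -127, 5, 90, -33]
```

Each weight goes from four bytes to one, a 4x memory saving, and every restored value is within half a quantisation step of the original. That small rounding error is the sliver of quality traded for cost.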

What it does not run, and where to be honest

Now the limits, because pretending they do not exist helps nobody.

The big, top-tier language models will not fit. The ones with tens or hundreds of billions of parameters need far more memory than a small box has, and without a GPU they would be painfully slow even if they fit. For those, you either call a hosted service or rent a GPU when you actually need one.
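The memory side is simple arithmetic: the weights alone need roughly the number of parameters times the bytes per parameter, before counting the working memory a model uses while it generates. Here is a hypothetical helper that makes the comparison with a 16GB box explicit; the model sizes in it are illustrative, not specific products.

```python
def weight_memory_gb(params, bits_per_param):
    """Approximate memory for the model weights alone, in decimal gigabytes.

    Ignores the extra working memory used while generating, plus the OS and
    anything else running on the box, so real needs are higher.
    """
    return params * bits_per_param / 8 / 1e9

BOX_GB = 16  # total RAM on the VPS described in this post

# Illustrative model sizes and precisions, not specific products.
examples = [
    ("300M-parameter distilled model, 32-bit", 300e6, 32),
    ("2B-parameter compact model, 4-bit", 2e9, 4),
    ("70B-parameter model, 16-bit", 70e9, 16),
    ("70B-parameter model, 4-bit", 70e9, 4),
]
for name, params, bits in examples:
    gb = weight_memory_gb(params, bits)
    verdict = "weights fit" if gb < BOX_GB else "does not fit"
    print(f"{name}: ~{gb:.1f} GB -> {verdict} in {BOX_GB} GB")
```

Even squeezed to 4 bits, a 70-billion-parameter model needs about 35 GB for its weights alone, more than twice the whole box, while the distilled and compact models use a small fraction of it.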

Anything heavy in real time will struggle. A small model answering one request at a time is fine. The same model trying to serve many users at once, fast, is where a CPU starts to hurt and a GPU starts to earn its price.
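A rough way to see why is to treat a CPU-only model as a single queue. The sketch below assumes, hypothetically, that each request takes a fixed two seconds and that requests are handled strictly one at a time with no batching; real servers can do somewhat better, but the shape of the problem is the same.

```python
def worst_case_wait(users, seconds_per_request):
    """Seconds the last of `users` simultaneous requests waits for its answer,
    assuming requests are served one at a time with no batching or parallelism."""
    return users * seconds_per_request

for users in (1, 5, 20):
    wait = worst_case_wait(users, 2.0)
    print(f"{users} simultaneous users -> last answer after ~{wait:.0f} s")
```

One user waiting two seconds is fine; the twentieth user waiting forty seconds is not. A GPU shortens each request and can process many at once, which is where it earns its price.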

And training, as I said, is off the table. That is fine. You almost never need to.

The honest rule of thumb is this. If the work is occasional, internal, and can take a second or two, a plain box handles a lot. If it must be instant and serve a crowd, that is when you reach for a GPU.

A note on disk, learned the hard way

One practical warning that has nothing to do with the model and everything to do with running things yourself. On a small box, disk fills up faster than you expect, and not always for the reasons you think.

I once watched the leftover build files from repeatedly rebuilding my containers quietly grow to tens of gigabytes, far more than all of the actual data. The fix was one cleanup command, but the lesson stuck. When you self-host on modest hardware, keeping an eye on disk is part of the job. The model is rarely the thing that runs you out of space. The clutter around it is.
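Keeping an eye on disk can be as simple as a small scheduled check. The sketch below uses only Python's standard library (`shutil.disk_usage`) to warn when a filesystem passes a threshold; the 80% threshold and the function name are illustrative assumptions. If the containers are built with Docker, `docker system df` shows how much space images and build cache are using, and `docker builder prune` clears the build cache.

```python
import shutil

def disk_report(path="/", warn_at_percent=80.0):
    """Return (percent_used, warning_or_None) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = usage.used / usage.total * 100
    warning = None
    if percent_used >= warn_at_percent:
        free_gb = usage.free / 1e9
        warning = f"{path} is {percent_used:.0f}% full, only {free_gb:.1f} GB free"
    return percent_used, warning

if __name__ == "__main__":
    percent, warning = disk_report("/")
    print(warning or f"Disk OK: {percent:.0f}% used")
```

Run from cron every hour or so, a check like this gives early warning before the clutter becomes a problem.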

Why bother, when the cloud exists

Fair question. If a hosted service is easy, why run anything yourself on a small box?

The same reasons I gave for self-hosting Bhashini. Your data stays with you, which matters enormously for anything sensitive. You can run on a private network with no outside dependency. Your cost is fixed and small instead of growing with every request. And honestly, you learn far more about how these systems really work when you have to host them yourself.

It is not the right choice for everything. But for a small team that values control and privacy over raw scale, a single modest machine goes a lot further than people assume.

My take

The GPU wall is real for training and for serving huge models to crowds. For most other things, it is a myth that quietly stops people from starting.

A plain machine with no GPU will run the classic models, power a document search, serve small and distilled models, and even host a compact language model if you are not in a hurry. That is enough to build something genuinely useful. I know, because the whole thing you are reading this on runs on exactly that kind of box.

Start with the hardware you have. You will be surprised how far it goes before you ever need to spend on a GPU.