Local LLMs in 2026: What I Actually Run, and What I Don't

Okay, confession: I’ve reinstalled Ollama on the same machine four times this year. Not because it broke. Because every time a new model dropped, I’d talk myself into believing this was the one that would finally let me cancel my API bills and run everything on the GPU under my desk.

It mostly didn’t work out that way. But “mostly” is carrying a lot of weight in that sentence, and the gap between what a local model can’t do and what it quietly does well turned out to be the part worth writing about.

So this is the post I wish someone had handed me back in January. I’m not here to sell you on local models, and I’m not here to dunk on them. It’s an honest account of where a local LLM earned a spot in my actual workflow over the last few months, where I gave up and went back to a hosted API, and the one benchmark paper that finally got me to stop arguing from vibes and start measuring.

What “local LLM” actually means now

A local LLM is a model running on hardware you own. Your laptop, a desktop with a decent GPU, a little server in a closet. Nothing leaves the machine, there’s no API key, and there’s no per-token meter ticking away in the background.

The tooling got good. Ollama is what I reach for most days, because it turns “download and run a model” into one command. LM Studio is friendlier if you want a GUI. Under both of them, for most models, sits llama.cpp, which does the real work of running quantized models on consumer chips.

The thing that matters more than any brand name is weight class. Models come in rough sizes: 3 billion parameters, 7 to 8 billion, 14 billion, 30-plus billion. Most people, me included, run something in the 7-8B range, because that’s what fits comfortably in 8 to 16GB of VRAM once it’s quantized. A 7B model on your desk is not a shrunk-down frontier model. It’s a different animal with different strengths, and forgetting that is how people end up disappointed.
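If you want the arithmetic behind that, here’s a rough sketch. It assumes weight memory is just parameter count times bits per weight, and it ignores the KV cache and runtime overhead, so treat the numbers as a floor rather than a promise. The function name is mine, not from any library.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Same 7B model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB of weights")
```

At 4 bits the weights of a 7B model come to about 3.5GB, which is why it leaves headroom on an 8GB card. At 16 bits the same weights need about 14GB before anything else is loaded.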

Where a local model earned its place

Here’s the work I actually moved off an API and never moved back.

The boring, repetitive, high-volume stuff. Tagging support tickets. Pulling three fields out of a few thousand messy text blobs. Reshaping scraped junk into clean JSON. None of that needs a clever model. It needs a good-enough one that runs for free while I sleep. A friend’s team of three, building a small customer support tool, pushed their whole ticket-classification backlog through a local 7B overnight and paid nothing. The hosted-API quote for the same job had four digits in it.
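For tagging jobs like that, most of the code ends up being defensive parsing, since a small model will sometimes wrap its answer in extra chatter. Here’s a minimal sketch of that part. The label set, the `ask` callable, and `fake_model` are all placeholders I made up so the example runs without a model. In real use, `ask` would wrap whatever client call you already have.

```python
import json
from typing import Callable

ALLOWED_LABELS = {"billing", "bug", "account", "other"}  # hypothetical label set


def tag_ticket(text: str, ask: Callable[[str], str]) -> str:
    """Ask a model for a JSON label; fall back to 'other' on anything unexpected."""
    prompt = (
        "Classify this support ticket. Reply with JSON only, like "
        '{"label": "billing"}. Allowed labels: '
        + ", ".join(sorted(ALLOWED_LABELS))
        + "\n\nTicket: "
        + text
    )
    raw = ask(prompt)
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "other"  # not valid JSON at all
    label = parsed.get("label") if isinstance(parsed, dict) else None
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return "other"  # valid JSON, but not a label we accept


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return '{"label": "billing"}' if "invoice" in prompt else "not json"


tickets = ["My invoice is wrong", "App crashes on login"]
labels = [tag_ticket(t, fake_model) for t in tickets]
print(labels)
```

Sending anything unparseable to a catch-all label keeps an overnight batch from dying on row 3,000, and you can review the catch-all bucket by hand in the morning.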

Anything privacy-sensitive. If the text is a client’s internal docs, or something I’d rather not hand to another company’s logs, the decision makes itself. The model runs on my machine and the data stays there.

Offline or flaky-connection work. I drafted part of this post on a plane. The model didn’t care.

Small, latency-sensitive calls. Once the model is loaded into memory, a local 7B answering a quick classification prompt comes back faster than the network round trip to a hosted endpoint, mostly because there is no network round trip.

If you’re doing retrieval work, a local model is a fine engine for it too, though your chunking strategy will move the result more than the choice of which 7B you run.
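To make “chunking strategy” concrete, here’s one of the simplest versions: fixed-size word windows with some overlap, so a sentence that straddles a boundary shows up whole in at least one chunk. This is a sketch of one common approach, not a recommendation, and the default sizes are arbitrary numbers I picked for illustration.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of at most `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
```

Changing `size` and `overlap` and re-running the same questions is a cheap experiment, and it’s the kind of knob worth turning before you swap models.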

Where I gave up and went back to an API

Now the other half, because this is the part the cheerful guides skip.

Real reasoning on unfamiliar problems is where small local models fall down. I don’t mean that as a feeling. There’s a sharp little paper that uses chess as a controlled test of whether a model is reasoning or just recalling, “Disentangling generalization and memorization in large language models using chess”. The part that stuck with me: performance drops steadily as a position moves further from common, well-trodden ones, and for properly novel positions, base-model play falls back toward random. The everyday read is that a local 8B is strong on tasks that look like its training data and shaky on tasks that don’t. Frontier hosted models handle the unfamiliar stuff better, and on hard problems that gap is the whole game.

Long context is the second wall. A local model will happily accept a giant prompt. Whether it actually uses the middle of that prompt is a different question, and so is how slowly it grinds through the whole thing. On my hardware the answers were “not well” and “not fast.”

And anything where being wrong is expensive. When a mistake costs real money or real trust, I pay for the better model. That’s just arithmetic.

The benchmark that made me stop guessing

For months I argued about local models from feel. Then I read a paper that did the boring, useful thing and measured.

It’s called “GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval”. The researchers ran Microsoft’s GraphRAG pipeline over real electronic health record schema documentation, using four local models on a single 8GB consumer GPU through Ollama: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, and Phi-4-mini at 3.8B.

The results are worth sitting with. Llama 3.1 built the richest knowledge graph, roughly 1,172 entities. Qwen 2.5 produced the best answer quality. Phi-4-mini, the smallest of the four, struggled noticeably. So the models are not interchangeable, and the one that built the biggest graph wasn’t the one that gave the best answers.

But the number that reset my expectations was that answer-quality score. The best model in the test landed at about 3.3 out of 5. Read that plainly: the winner of a careful benchmark on consumer hardware scored about 66 percent, a C at best. That isn’t a reason to avoid local models. It’s a reason to measure them on your own data instead of trusting a leaderboard, or a blog post, this one included.

A setup that takes about ten minutes

Enough opinion. Here’s the actual thing, start to finish.

Install Ollama and pull a model sized for roughly 8GB of VRAM:

# Linux (on macOS, install the app from ollama.com or run: brew install ollama)
curl -fsSL https://ollama.com/install.sh | sh

# Qwen 2.5 7B had the best answer quality in the benchmark above
ollama pull qwen2.5:7b

# quick sanity check
ollama run qwen2.5:7b "Reply with one word: working?"

The part people miss is that you don’t have to rewrite your app to use it. Ollama exposes an OpenAI-compatible endpoint, so the swap is mostly a base URL.

Here’s the before, an ordinary hosted call that bills you per token:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)
print(resp.choices[0].message.content)

And here’s the after, the same client pointed at a model on your own machine:

from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434.
# The client requires an api_key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)
print(resp.choices[0].message.content)

Two lines changed. That’s the real unlock. You can route the cheap, boring, high-volume calls to localhost and let the genuinely hard requests fall through to a hosted model, all behind one client. I wire a fallback like that into most of the tools I build, and you can see a few of them in my work. If you’re putting a model behind a real user-facing feature, the streaming and UI side is its own headache, and I went through that in my notes on the Vercel AI SDK.
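The routing idea is easier to see in code. Here’s a minimal sketch, with the caveat that every name in it is hypothetical: `make_router`, the `is_hard` rule, and the stub backends are mine. The stubs return fixed strings so the example runs without any server. In practice `local` and `hosted` would each wrap a `client.chat.completions.create(...)` call on the two clients shown above.

```python
from typing import Callable

Model = Callable[[str], str]


def make_router(local: Model, hosted: Model, is_hard: Callable[[str], bool]) -> Model:
    """Return one callable that picks a backend for each prompt."""

    def route(prompt: str) -> str:
        if is_hard(prompt):
            return hosted(prompt)
        try:
            return local(prompt)
        except Exception:
            # Local server down or model missing: fall through to hosted.
            return hosted(prompt)

    return route


def local_stub(prompt: str) -> str:
    return "local answer"  # stand-in for the localhost call


def hosted_stub(prompt: str) -> str:
    return "hosted answer"  # stand-in for the hosted call


# Placeholder rule: treat anything explicitly flagged or very long as hard.
ask = make_router(
    local_stub,
    hosted_stub,
    is_hard=lambda p: p.startswith("REASON:") or len(p) > 2000,
)

print(ask("Classify this ticket: ..."))
print(ask("REASON: plan a migration for ..."))
```

The `is_hard` rule is where the real judgment lives, and a prefix check is only the crudest version of it. Starting crude and tightening it after reading the misrouted cases has worked better for me than guessing up front.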

So should you bother?

Here’s the rule I landed on. Run it locally when the task is repetitive, privacy-sensitive, or high enough volume that the API bill has started to bug you. Pay for a hosted model when the task needs real reasoning on unfamiliar problems, or when being wrong is costly. Most projects want both, and the localhost swap above means you don’t have to pick a side.

The thing you can do this week: take one task you already send to an API, ideally a dull, high-volume one, pull qwen2.5:7b, point your client at localhost, and run the same hundred inputs through both. Read the outputs yourself. Don’t trust my take, and don’t trust the paper’s score. The only benchmark that counts is your own data, and an afternoon is enough to get the answer.
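If it helps, here’s a small harness for that side-by-side run. It’s a sketch under a big assumption: the task has short outputs, like a label, where an exact match means something. The function names and the stub models are placeholders of mine, and the stubs only exist so the example runs offline.

```python
from typing import Callable, Iterable

Model = Callable[[str], str]


def compare(inputs: Iterable[str], local: Model, hosted: Model) -> list[dict]:
    """Run every input through both models and record whether they agree."""
    rows = []
    for text in inputs:
        a, b = local(text), hosted(text)
        rows.append({
            "input": text,
            "local": a,
            "hosted": b,
            "agree": a.strip().lower() == b.strip().lower(),
        })
    return rows


def agreement_rate(rows: list[dict]) -> float:
    """Fraction of inputs where both models gave the same answer."""
    return sum(r["agree"] for r in rows) / len(rows) if rows else 0.0


def local_stub(text: str) -> str:
    return "bug" if "crash" in text else "other"


def hosted_stub(text: str) -> str:
    return "Bug" if "crash" in text or "error" in text else "other"


rows = compare(["app crash", "error 500", "hello"], local_stub, hosted_stub)
print(f"agreement: {agreement_rate(rows):.0%}")
for row in rows:
    if not row["agree"]:
        print("DISAGREE:", row)
```

Agreement isn’t correctness, since both models can be wrong in the same way. The rows worth reading by hand are the disagreements, and that list is usually short enough to get through in one sitting.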