{"id":270,"date":"2026-05-23T13:02:54","date_gmt":"2026-05-23T13:02:54","guid":{"rendered":"https:\/\/abrarqasim.com\/blog\/local-llm-2026-what-i-actually-run-and-what-i-dont\/"},"modified":"2026-05-23T13:02:54","modified_gmt":"2026-05-23T13:02:54","slug":"local-llm-2026-what-i-actually-run-and-what-i-dont","status":"publish","type":"post","link":"https:\/\/abrarqasim.com\/blog\/local-llm-2026-what-i-actually-run-and-what-i-dont\/","title":{"rendered":"Local LLMs in 2026: What I Actually Run, and What I Don&#8217;t"},"content":{"rendered":"<p>Okay, confession: I&rsquo;ve reinstalled Ollama on the same machine four times this year. Not because it broke. Because every time a new model dropped, I&rsquo;d talk myself into believing this was the one that would finally let me cancel my API bills and run everything on the GPU under my desk.<\/p>\n<p>It mostly didn&rsquo;t work out that way. But &ldquo;mostly&rdquo; is carrying a lot of weight in that sentence, and the gap between what a local model can&rsquo;t do and what it quietly does well turned out to be the part worth writing about.<\/p>\n<p>So this is the post I wish someone had handed me back in January. I&rsquo;m not here to sell you on local models, and I&rsquo;m not here to dunk on them. It&rsquo;s an honest account of where a local LLM earned a spot in my actual workflow over the last few months, where I gave up and went back to a hosted API, and the one benchmark paper that finally got me to stop arguing from vibes and start measuring.<\/p>\n<h2 id=\"what-local-llm-actually-means-now\">What &ldquo;local LLM&rdquo; actually means now<\/h2>\n<p>A local LLM is a model running on hardware you own. Your laptop, a desktop with a decent GPU, a little server in a closet. Nothing leaves the machine, there&rsquo;s no API key, and there&rsquo;s no per-token meter ticking away in the background.<\/p>\n<p>The tooling got good. <a href=\"https:\/\/ollama.com\" rel=\"nofollow noopener\" target=\"_blank\">Ollama<\/a> is what I reach for most days, because it turns &ldquo;download and run a model&rdquo; into one command. LM Studio is friendlier if you want a GUI. Under both of them sits llama.cpp, which does the real work of running quantized models on consumer chips.<\/p>\n<p>The thing that matters more than any brand name is weight class. Models come in rough sizes: 3 billion parameters, 7 to 8 billion, 14 billion, 30-plus billion. Most people, me included, run something in the 7-8B range, because that&rsquo;s what fits comfortably in 8 to 16GB of VRAM once it&rsquo;s quantized. A 7B model on your desk is not a shrunk-down frontier model. It&rsquo;s a different animal with different strengths, and forgetting that is how people end up disappointed.<\/p>\n<h2 id=\"where-a-local-model-earned-its-place\">Where a local model earned its place<\/h2>\n<p>Here&rsquo;s the work I actually moved off an API and never moved back.<\/p>\n<p>The boring, repetitive, high-volume stuff. Tagging support tickets. Pulling three fields out of a few thousand messy text blobs. Reshaping scraped junk into clean JSON. None of that needs a clever model. It needs a good-enough one that runs for free while I sleep. A friend&rsquo;s team of three, building a small customer support tool, pushed their whole ticket-classification backlog through a local 7B overnight and paid nothing. The hosted-API quote for the same job had four digits in it.<\/p>\n<p>Anything privacy-sensitive. If the text is a client&rsquo;s internal docs, or something I&rsquo;d rather not hand to another company&rsquo;s logs, the decision makes itself. The model runs on my machine and the data stays there.<\/p>\n<p>Offline or flaky-connection work. I drafted part of this post on a plane. The model didn&rsquo;t care.<\/p>\n<p>Small, latency-sensitive calls. A local 7B answering a quick classification prompt comes back faster than the network round trip to a hosted endpoint, mostly because there is no network round trip.<\/p>\n<p>If you&rsquo;re doing retrieval work, a local model is a fine engine for it too, though your <a href=\"https:\/\/abrarqasim.com\/blog\/rag-chunking-strategies-2026-what-i-actually-use\" rel=\"noopener\">chunking strategy<\/a> will move the result more than the choice of which 7B you run.<\/p>\n<h2 id=\"where-i-gave-up-and-went-back-to-an-api\">Where I gave up and went back to an API<\/h2>\n<p>Now the other half, because this is the part the cheerful guides skip.<\/p>\n<p>Real reasoning on unfamiliar problems is where small local models fall down. I don&rsquo;t mean that as a feeling. There&rsquo;s a sharp little paper that uses chess as a controlled test of whether a model is reasoning or just recalling, <a href=\"https:\/\/arxiv.org\/abs\/2601.16823\" rel=\"nofollow noopener\" target=\"_blank\">&ldquo;Disentangling generalization and memorization in large language models using chess&rdquo;<\/a>. The part that stuck with me: performance drops steadily as a position moves further from common, well-trodden ones, and for properly novel positions, base-model play falls back toward random. The everyday read is that a local 8B is strong on tasks that look like its training data and shaky on tasks that don&rsquo;t. Frontier hosted models handle the unfamiliar stuff better, and on hard problems that gap is the whole game.<\/p>\n<p>Long context is the second wall. A local model will happily accept a giant prompt. Whether it actually uses the middle of that prompt, and how slowly it grinds through the whole thing, is a different question. On my hardware the answer was &ldquo;not well, and not fast.&rdquo;<\/p>\n<p>And anything where being wrong is expensive. When a mistake costs real money or real trust, I pay for the better model. That&rsquo;s just arithmetic.<\/p>\n<h2 id=\"the-benchmark-that-made-me-stop-guessing\">The benchmark that made me stop guessing<\/h2>\n<p>For months I argued about local models from feel. Then I read a paper that did the boring, useful thing and measured.<\/p>\n<p>It&rsquo;s called <a href=\"https:\/\/arxiv.org\/abs\/2605.20815\" rel=\"nofollow noopener\" target=\"_blank\">&ldquo;GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval&rdquo;<\/a>. The researchers ran Microsoft&rsquo;s GraphRAG pipeline over real electronic health record schema documentation, using four local models on a single 8GB consumer GPU through Ollama: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, and Phi-4-mini at 3.8B.<\/p>\n<p>The results are worth sitting with. Llama 3.1 built the richest knowledge graph, roughly 1,172 entities. Qwen 2.5 produced the best answer quality. Phi-4-mini, the smallest of the four, struggled noticeably. So the models are not interchangeable, and the one that built the biggest graph wasn&rsquo;t the one that gave the best answers.<\/p>\n<p>But the number that reset my expectations was that answer-quality score. The best model in the test landed at about 3.3 out of 5. Read that plainly: the winner of a careful benchmark on consumer hardware scored a solid C. That isn&rsquo;t a reason to avoid local models. It&rsquo;s a reason to measure them on your own data instead of trusting a leaderboard, or a blog post, this one included.<\/p>\n<h2 id=\"a-setup-that-takes-about-ten-minutes\">A setup that takes about ten minutes<\/h2>\n<p>Enough opinion. Here&rsquo;s the actual thing, start to finish.<\/p>\n<p>Install Ollama and pull a model sized for roughly 8GB of VRAM:<\/p>\n<pre><code class=\"language-bash\"># macOS or Linux\ncurl -fsSL https:\/\/ollama.com\/install.sh | sh\n\n# Qwen 2.5 7B was the strongest all-rounder in the benchmark above\nollama pull qwen2.5:7b\n\n# quick sanity check\nollama run qwen2.5:7b &quot;Reply with one word: working?&quot;\n<\/code><\/pre>\n<p>The part people miss is that you don&rsquo;t have to rewrite your app to use it. Ollama exposes an OpenAI-compatible endpoint, so the swap is mostly a base URL.<\/p>\n<p>Here&rsquo;s the before, an ordinary hosted call that bills you per token:<\/p>\n<pre><code class=\"language-python\">from openai import OpenAI\n\nclient = OpenAI(api_key=&quot;sk-...&quot;)\n\nresp = client.chat.completions.create(\n    model=&quot;gpt-4o-mini&quot;,\n    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Classify this ticket: ...&quot;}],\n)\nprint(resp.choices[0].message.content)\n<\/code><\/pre>\n<p>And here&rsquo;s the after, the same client pointed at a model on your own machine:<\/p>\n<pre><code class=\"language-python\">from openai import OpenAI\n\n# Ollama serves an OpenAI-compatible API on localhost:11434\nclient = OpenAI(base_url=&quot;http:\/\/localhost:11434\/v1&quot;, api_key=&quot;ollama&quot;)\n\nresp = client.chat.completions.create(\n    model=&quot;qwen2.5:7b&quot;,\n    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Classify this ticket: ...&quot;}],\n)\nprint(resp.choices[0].message.content)\n<\/code><\/pre>\n<p>Two lines changed. That&rsquo;s the real unlock. You can route the cheap, boring, high-volume calls to localhost and let the genuinely hard requests fall through to a hosted model, all behind one client. I wire a fallback like that into most of the tools I build, and you can see a few of them in my <a href=\"https:\/\/abrarqasim.com\/work\" rel=\"noopener\">work<\/a>. If you&rsquo;re putting a model behind a real user-facing feature, the streaming and UI side is its own headache, and I went through that in my <a href=\"https:\/\/abrarqasim.com\/blog\/vercel-ai-sdk-v5-in-production-what-usechat-replaced\" rel=\"noopener\">notes on the Vercel AI SDK<\/a>.<\/p>\n<h2 id=\"so-should-you-bother\">So should you bother?<\/h2>\n<p>Here&rsquo;s the rule I landed on. Run it locally when the task is repetitive, privacy-sensitive, or high enough volume that the API bill has started to bug you. Pay for a hosted model when the task needs real reasoning on unfamiliar problems, or when being wrong is costly. Most projects want both, and the localhost swap above means you don&rsquo;t have to pick a side.<\/p>\n<p>The thing you can do this week: take one task you already send to an API, ideally a dull, high-volume one, pull <code>qwen2.5:7b<\/code>, point your client at localhost, and run the same hundred inputs through both. Read the outputs yourself. Don&rsquo;t trust my take, and don&rsquo;t trust the paper&rsquo;s score. The only benchmark that counts is your own data, and an afternoon is enough to get the answer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I spent a month running local LLMs on my own GPU instead of an API. Here is where self-hosted models earned their place, and where they quietly did not.<\/p>\n","protected":false},"author":2,"featured_media":269,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"","rank_math_description":"I spent a month running local LLMs on my own GPU instead of an API. Here is where self-hosted models earned their place, and where they quietly did not.","rank_math_focus_keyword":"local llm","rank_math_canonical_url":"","rank_math_robots":"","footnotes":""},"categories":[4],"tags":[317,5,313,315,316,314],"class_list":["post-270","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai-tools","tag-llm","tag-local-llm-2","tag-ollama","tag-open-source-llm-2","tag-self-hosted-llm"],"_links":{"self":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/270","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/comments?post=270"}],"version-history":[{"count":0,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/270\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media\/269"}],"wp:attachment":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media?parent=270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/categories?post=270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/tags?post=270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}