Hardware I am using for Local LLMs
One of the questions I get asked most is what hardware I use when working with local AI models.
I have both Windows and Mac machines, and when working with foundation models via APIs I can switch between the two depending on where whatever I am working on is going to be deployed. Hardware speed has little real impact on cloud-based LLM interactions.
When I am on the move, however, and running local LLMs, I tend to use a 64GB GPD Win Max 2 (AMD Ryzen AI 9 HX 370) along with an external GPD G1 eGPU, which connects via an OcuLink cable.
I find that the cost / weight / portability / performance ratio stands up well against my MacBook Pro M3 Max.
I find local LLMs are mainly constrained by three things:
➡️ Sustained CPU throughput
➡️ Memory bandwidth
➡️ iGPU throughput (if offloading parts)
The WinMax can be easily tuned, through a software application, to raise the TDP, boost the CPU/GPU clocks, and increase sustained cooling (see picture).
Even without the external GPU, this shrinks the performance gap with a top-end MacBook Pro to something fairly negligible. In my experience a tuned HX 370 gets you to c. 80–90% of M3 Max speed for practical local LLM use, which is hugely impressive for a portable 10.1” handheld machine.
Memory bandwidth is still much lower than the M3 Max’s 400 GB/s, but LPDDR5X @ 7500 MT/s gives the HX 370 roughly 120–140 GB/s of effective bandwidth. For quantized 7B/13B/20B models this is more than enough.
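Why does bandwidth matter so much? Token generation for a quantized model is largely memory-bandwidth-bound, so you can estimate an upper bound on decode speed by dividing effective bandwidth by the bytes read per token (roughly the quantized weight size for a dense model). The sketch below is a back-of-envelope illustration, not a benchmark, and the 130 GB/s and 4.5 bits/weight figures are assumptions:

```python
# Back-of-envelope decode-speed estimate for a dense, quantized model,
# assuming generation is purely memory-bandwidth-bound (weights read once per token).
# All numbers are illustrative assumptions, not measurements.

def estimate_tokens_per_sec(params_billion: float, bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # weights-only footprint in GB
    return bandwidth_gb_s / weight_gb                 # upper bound: one full weight read per token

# Example: 7B model at ~4.5 bits/weight (a Q4_K_M-style quant)
# on ~130 GB/s effective bandwidth (assumed HX 370 figure from above).
print(round(estimate_tokens_per_sec(7, 4.5, 130), 1), "tok/s upper bound")  # ~33 tok/s
```

Real throughput lands below that bound once compute, KV-cache reads and sampling overhead are added, but it shows why bandwidth matters more than raw FLOPS for local inference. (Sparse MoE models only read their active experts per token, which is why a model like OpenAI’s 20B can beat this dense-model estimate.)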
When you add in the external GPU (GPD G1, Radeon RX 7600M XT) over the OcuLink cable, you get near-PCIe 4.0 x4 bandwidth.
Unlike USB4 or Thunderbolt, OcuLink provides:
➡️ Direct PCIe lanes
➡️ Lower latency
➡️ Higher sustained bandwidth
➡️ No Thunderbolt overhead
➡️ No packetisation penalties
With the eGPU attached, the CPU stops being the bottleneck in most local LLM scenarios and the GPU becomes the main driver.
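If you drive the card from code rather than the Ollama CLI, the offload split is just a parameter. A minimal sketch using llama-cpp-python, assuming a build with a non-CUDA backend (e.g. Vulkan or ROCm); the model path is a placeholder:

```python
# Minimal llama-cpp-python sketch: offload transformer layers to the eGPU.
# Assumes a non-CUDA build (Vulkan/ROCm) and a local GGUF file; the path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 = offload every layer; lower it if VRAM runs out
    n_ctx=4096,        # context window
)

out = llm("Explain OcuLink in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

With Ollama the equivalent knob is the num_gpu option on a request or model definition.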
The picture below shows the WinMax running OpenAI’s 20B OSS model (gpt-oss-20b), which is heavily quantized and extremely CPU-efficient. Even without the eGPU it’s running at c. 80 tokens per second, because it leans heavily on the CPU-side optimizations in llama.cpp/Ollama rather than on full GPU offload.
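If you want to reproduce that kind of number yourself, Ollama’s REST API returns token counts and timings with every response, so tokens per second falls out directly. A minimal sketch, assuming a local Ollama server; the model tag is whatever you have pulled:

```python
# Measure generation speed from Ollama's own response metadata.
# Assumes the Ollama server is running locally; the model tag is a placeholder.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",          # placeholder tag; use any model you have pulled
        "prompt": "Summarise why memory bandwidth matters for local LLMs.",
        "stream": False,
    },
    timeout=600,
).json()

tokens = resp["eval_count"]              # generated tokens
seconds = resp["eval_duration"] / 1e9    # eval_duration is reported in nanoseconds
print(f"{tokens / seconds:.1f} tokens/sec")
```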
The lack of CUDA is far less of an issue today than it used to be, especially with Ollama, llama.cpp, and the new generation of CPU-optimized OSS models such as OpenAI’s 20B. Most modern OSS models from OpenAI, Meta, Mistral etc. are now designed to run just as well without CUDA.
I think this is one of the best accelerated ultra-mobile setups possible today, and it works great for local models of 30B parameters and under. If I need to run local models larger than that, I revert to the MacBook.
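The 30B cut-off is mostly a memory-footprint question: the quantized weights (plus KV cache and runtime overhead) have to fit in memory that sits close to the compute. A rough weights-only estimate, with illustrative figures rather than exact GGUF sizes:

```python
# Weights-only size estimate for a quantized model; KV cache and runtime
# overhead come on top. Figures are illustrative, not exact GGUF file sizes.

def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

for params in (7, 13, 20, 30, 70):
    print(f"{params:>3}B @ ~4.5 bpw ≈ {quantized_weight_gb(params, 4.5):5.1f} GB")
```

Once a model no longer fits in fast memory near the compute, throughput drops sharply, which is where the M3 Max’s large unified-memory pool earns its keep for 30B+ models.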



Do you have more direct performance comparisons between the MacBook and the GPD machine?