768GB of cheap Intel Optane DIMM memory sticks used to run 1-trillion-parameter LLM on a system with a single GPU — local Kimi K2.5 install achieved roughly 4 tokens per second


(Image credit: Lenovo)


A Redditor has caused a stir by coaxing a workstation build using Optane PMem DIMMs as RAM to run a 1-trillion-parameter LLM. APFrisco explains in a mini tutorial/guide on the Local LLaMA subreddit how they used second-hand Intel Optane Persistent Memory, acquired relatively cheaply, to “run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second” on their Xeon workstation.

Source post: "Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec" from r/LocalLLaMA

Central to the headlining feat was the Redditor’s sourcing of six Optane PMem (DCPMM) sticks. The discontinued memory format was designed to bridge the DRAM-SSD divide. While the 768GB of Optane (6x 128GB) does indeed offer far lower latency than the best NVMe SSDs, it is still two or three times slower than DRAM. These characteristics are still rather sweet for LLM inference frameworks, and the second-hand price was “much less than what the equivalent DRAM capacity would cost.” But, alas, Optane is dead, so this is an exotic solution.

APFrisco’s hardware specs were given as follows:

Intel Xeon Gold 6246 CPU
Tyan S5630GMRE-CGN motherboard
Asus Dual GeForce RTX 3060 OC 12GB GPU
6x 32GB Samsung 2666MHz DDR4 ECC DRAM sticks
6x 128GB Intel Optane DCPMM PC4-2666 NMA1XBD128GQS persistent memory modules
Western Digital WD SN850X 2TB M.2 2280 NVMe SSD
ASRock Steel Legend SL-850G 850W 80 PLUS GOLD & Cybenetics Platinum Fully Modular Power Supply
Silverstone SST-GD08B (Black) Grandia Series Home Theater PC Case

The build was configured with the Optane in memory mode and the Samsung DDR4 as cache.

The software side of the equation relied on the aforementioned Kimi K2.5’s mixture-of-experts architecture. APFrisco used a hybrid GPU/CPU inference methodology with llama.cpp.
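For a sense of scale, the capacity figures quoted above put a rough ceiling on how wide the model's weights can be. The Python sketch below is back-of-the-envelope arithmetic only: it assumes each "GB" is a binary gibibyte, that the model has exactly one trillion parameters, and that nothing else (operating system, KV cache, buffers) needs memory. The function name `max_bits_per_weight` is invented for illustration and does not come from the source post.

```python
GIB = 2**30  # assumption: the quoted "GB" capacities are binary gibibytes


def max_bits_per_weight(capacity_bytes: int, n_params: float) -> float:
    """Upper bound on average bits per weight if the weights alone fill memory."""
    return capacity_bytes * 8 / n_params


# Figures from the article: 6x 128GB Optane modules and 6x 32GB DDR4 sticks.
optane_bytes = 6 * 128 * GIB
dram_bytes = 6 * 32 * GIB

# Assumption: "1 trillion parameters" is taken as exactly 1e12.
bits = max_bits_per_weight(optane_bytes, 1e12)

# In memory mode the DRAM serves as a cache in front of the Optane,
# so it is compared against the Optane capacity rather than added to it.
cache_ratio = dram_bytes / optane_bytes

print(f"Optane capacity: {optane_bytes // GIB} GiB")
print(f"Ceiling on average bits per weight: {bits:.2f}")
print(f"DRAM cache as a fraction of Optane: {cache_ratio:.2f}")
```

Under these assumptions the ceiling works out to about 6.6 bits per weight, so 16-bit weights would not fit, and the DRAM cache covers a quarter of the Optane capacity. Real-world headroom would be smaller once runtime overheads are counted.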
To optimize processing, the routing components were shoehorned into the 12GB GPU using llama.cpp’s 'override-tensor' flag.

The Redditor is rather proud of the resulting ~4 tokens per second performance. “Given the fact that this is a trillion-parameter frontier-class model running on such a limited hardware budget, I would consider it to be a great success,” writes APFrisco. They go on to lament Intel’s withdrawal of Optane products.

If you are interested in this rig rundown and what it achieved in terms of local LLM inference, you can find some more details about the configuration in the source post. Furthermore, APFrisco sticks around in the comments to answer questions. They also appear to benefit from recommendations about how to achieve even better results, given the foundation they have laid.

The bigger picture, though, seems to be that there is room for a memory product in the chasm between DRAM and SSDs, particularly for LLMs. Many expect that the gap will soon be bridged by the CXL (Compute Express Link) standard, which promises huge pools of affordable, byte-addressable memory for these kinds of workloads.
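The tensor placement mentioned earlier, with the routing components kept on the GPU, can be pictured as a set of name-matching rules. The Python sketch below illustrates the idea only. It is not llama.cpp's implementation, and the article does not give the pattern APFrisco actually used. The tensor names, the `ffn_.*_exps` pattern, the first-match-wins rule, and the `place` helper are all assumptions made for illustration.

```python
import re

# Hypothetical tensor names, written in the style of GGUF model files.
TENSORS = [
    "blk.3.attn_q.weight",         # attention weights
    "blk.3.ffn_gate_inp.weight",   # router (gating) weights
    "blk.3.ffn_gate_exps.weight",  # expert weights
    "blk.3.ffn_up_exps.weight",    # expert weights
    "blk.3.ffn_down_exps.weight",  # expert weights
]

# Assumed rule set: anything that looks like an expert tensor stays in
# system memory; everything else falls through to the default device.
OVERRIDES = [(r"ffn_.*_exps", "CPU")]


def place(name: str, overrides=OVERRIDES, default: str = "GPU") -> str:
    """Return the device for a tensor; the first matching rule wins."""
    for pattern, device in overrides:
        if re.search(pattern, name):
            return device
    return default


placement = {name: place(name) for name in TENSORS}

for name, device in placement.items():
    print(f"{device:>3}  {name}")
```

In this toy setup the bulky expert weights land in system memory while the small router and attention tensors stay on the GPU, which is the general shape of the split the article describes.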


Mark Tyson, News Editor

Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.

6 Comments


Intel's poor timing continues to impress years after the fact.

usertests said: "Intel's poor timing continues to impress years after the fact."

Nah, the key detail is that they were purchased used. When new, I think the GB/$ wasn't that much better than DRAM. However, DRAM is now much more expensive, which makes alternative solutions like Optane DIMMs much more attractive.

You really can't use this example to reason about Optane's market viability.

bit_user said: "Nah, the key detail is that they were purchased used. When new, I think the GB/$ wasn't that much better than DRAM. However, DRAM is now much more expensive, which makes alternative solutions like Optane DIMMs much more attractive. You really can't use this example to reason about Optane's market viability."

Agreed, the main reason being that fab usage would be about the same for both, so we would not get anything cheaper out of it. Optane, just like SSDs, would be more expensive anyway. Optane was cool tech that was just not cheap enough to make sense.

"You really can't use this example to reason about Optane's market viability."

The $/GB wasn't dramatically less than DRAM, but that could have improved as production scaled up. But they were losing money and didn't stay the course.

Since it's off the table, we see solutions like High Bandwidth Flash stepping into the empty tier instead. I guess you can disregard write endurance if the model isn't changing rapidly.

usertests said: "The $/GB wasn't dramatically less than DRAM, but that could have improved as production scaled up."

Not faster than DRAM, unless they managed to solve the issues that kept it from scaling in the Z-dimension like it was meant to.

usertests said: "Since it's off the table, we see solutions like High Bandwidth Flash stepping into the empty tier instead."

HBF has yet to be deployed in any solution, so it has yet to be a proven alternative.

HBF should have a structural price advantage over Optane, because it's true 3D NAND.

I looked into pdimms before. Interesting, because the PERSISTENT part of them worked on very few chipsets. They're so cheap because that part won't work on most systems, so most potential buyers take that to mean they won't work at all. But set to run without persistence, they just show up as somewhat slow DIMMs and work in far more machines, which is what it sounds like is being done here.
