South Korean AI chip startup FuriosaAI scored a major customer win this week after LG's AI Research division tapped its AI accelerators to power servers running its Exaone family of large language models.
But while floating point compute capability, memory capacity, and bandwidth all play a major role in AI performance, LG didn't choose Furiosa's RNGD (pronounced "renegade") inference accelerators for speeds and feeds. Rather, it was power efficiency.
"RNGD provides a compelling combination of benefits: excellent real-world performance, a dramatic reduction in our total cost of ownership, and a surprisingly straightforward integration," Kijeong Jeon, product unit leader at LG AI Research, said in a canned statement.
A quick peek at RNGD's spec sheet reveals what appears to be a rather modest chip, with floating point performance coming in at between 256 and 512 teraFLOPS depending on whether you opt for 16- or 8-bit precision. Memory capacity is also rather meager, with 48GB across a pair of HBM3 stacks, good for about 1.5TB/s of bandwidth.
Compared to AMD and Nvidia's latest crop of GPUs, RNGD doesn't look all that competitive until you consider that Furiosa has managed to do all this using just 180 watts of power. In testing, LG Research found the parts were as much as 2.25x more power efficient than GPUs for LLM inference on its homegrown family of Exaone models.
Before you get too excited, the GPUs in question are Nvidia's A100s, which are getting rather long in the tooth; they made their debut just as the pandemic was kicking off in 2020.
But as FuriosaAI CEO June Paik tells El Reg, while Nvidia's GPUs have certainly gotten more powerful in the five years since the A100's debut, that performance has come at the expense of higher energy consumption and die area.
While a single RNGD PCIe card can't compete with Nvidia's H100 or B200 accelerators on raw performance, in terms of efficiency (the number of FLOPS you can squeeze from each watt), the chips are more competitive than you might think.
Paik credits much of the company's efficiency advantage to RNGD's Tensor Contraction Processor architecture, which he says requires far fewer instructions to perform matrix multiplication than a GPU does, and minimizes data movement.
The chips also benefit from RNGD's use of HBM, which Paik says requires far less power than relying on GDDR, as seen on some of Nvidia's lower-end offerings like the L40S or RTX Pro 6000 Blackwell cards.
At roughly 1.4 teraFLOPS per watt, RNGD is actually closer to Nvidia's Hopper generation than to the A100.
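If you want to check our arithmetic, it only takes a few lines of Python. The RNGD figure comes from its spec sheet; the A100 and H100 numbers are Nvidia's published dense FP16/BF16 specs and board powers, which we've pulled in as reference points:

```python
# Rough FLOPS-per-watt comparison using published dense FP16/BF16 figures.
# GPU specs are Nvidia's public numbers; treat this as a sketch, not a benchmark.
chips = {
    "RNGD":     {"tflops_fp16": 256, "watts": 180},
    "A100 SXM": {"tflops_fp16": 312, "watts": 400},
    "H100 SXM": {"tflops_fp16": 989, "watts": 700},
}

for name, spec in chips.items():
    print(f"{name:>8}: {spec['tflops_fp16'] / spec['watts']:.2f} teraFLOPS/watt")

# RNGD:     1.42 teraFLOPS/watt
# A100 SXM: 0.78 teraFLOPS/watt
# H100 SXM: 1.41 teraFLOPS/watt
```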
RNGD's efficiency becomes even more apparent if we shift focus to memory bandwidth, which is arguably the more important factor when it comes to LLM inference. As a general rule, the more memory bandwidth you've got, the faster the chip will spit out tokens. Here again, at 1.5TB/s, RNGD's memory isn't particularly fast. Nvidia's H100 offers both higher capacity at 80GB and between 3.35TB/s and 3.9TB/s of bandwidth. However, that chip uses anywhere from 2 to 3.9 times the power.
For roughly the same wattage as an H100 SXM module, you could have four RNGD cards totaling 2 petaFLOPS of dense FP8 compute, 192GB of HBM, and 6TB/s of memory bandwidth. That's still a ways behind Nvidia's latest generation of Blackwell parts, but far closer than RNGD's raw speeds and feeds would have you believe.
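The math behind that four-for-one comparison is simple enough to sketch, again using Furiosa's spec sheet and Nvidia's published dense FP8 figures for the H100 SXM:

```python
# Four 180W RNGD PCIe cards vs one ~700W H100 SXM, dense (non-sparse) FP8.
rngd = {"fp8_tflops": 512, "hbm_gb": 48, "bw_tbs": 1.5, "watts": 180}
h100 = {"fp8_tflops": 1979, "hbm_gb": 80, "bw_tbs": 3.35, "watts": 700}

quad = {k: v * 4 for k, v in rngd.items()}
print(quad)  # {'fp8_tflops': 2048, 'hbm_gb': 192, 'bw_tbs': 6.0, 'watts': 720}
print(h100)  # {'fp8_tflops': 1979, 'hbm_gb': 80, 'bw_tbs': 3.35, 'watts': 700}
```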
And, since RNGD is designed solely with inference in mind, models can readily be spread across multiple accelerators using techniques like tensor parallelism, or even across multiple systems using pipeline parallelism.
Real-world testing
LG AI actually used four RNGD PCIe cards in a tensor-parallel configuration to run its in-house Exaone 32B model at 16-bit precision. According to Paik, LG had very specific performance targets in mind when validating the chip for use.
Notably, the constraints included a time to first token (TTFT), which measures how long you have to wait before the LLM begins generating a response, of roughly 0.3 seconds for more modest 3,000-token prompts, or 4.5 seconds for larger 30,000-token prompts.
If you're wondering, these tests are analogous to medium and large summarization tasks, which put more stress on the chip's compute subsystem than a shorter prompt would.
LG found that it was able to achieve this level of performance while churning out about 50-60 tokens a second at a batch size of 1.
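Those targets and results line up reasonably well with first-order estimates. Here's a hedged sanity check, assuming prefill is compute-bound and decode is bandwidth-bound, and ignoring attention's quadratic cost, KV-cache traffic, and interconnect overheads:

```python
# Back-of-envelope checks for Exaone 32B in FP16 across four RNGD cards.
params = 32e9            # Exaone 32B parameter count
peak_flops = 4 * 256e12  # four cards at 256 dense FP16 teraFLOPS each
peak_bw = 4 * 1.5e12     # four cards at 1.5TB/s of HBM3 bandwidth each

# Prefill needs roughly 2 FLOPs per parameter per prompt token.
for prompt_tokens in (3_000, 30_000):
    ttft_floor = 2 * params * prompt_tokens / peak_flops
    print(f"{prompt_tokens:>6}-token prompt: TTFT >= {ttft_floor:.2f} s")
# ->   3000-token prompt: TTFT >= 0.19 s  (LG's target: ~0.3 s)
# ->  30000-token prompt: TTFT >= 1.88 s  (LG's target: ~4.5 s)

# Decode at batch size 1 streams all 64GB of FP16 weights per token.
print(f"decode ceiling: {peak_bw / (params * 2):.0f} tok/s")
# -> ~94 tok/s; LG's observed 50-60 tok/s is a plausible real-world fraction
```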
According to Paik, these tests were conducted using FP16, since the A100s LG compared against don't natively support 8-bit floating-point activations. Presumably, dropping down to FP8 would essentially double the model's throughput and further reduce the TTFT.
Using multiple cards does come with some inherent challenges. In particular, the tensor parallelism that allows both the model's weights and computation to be spread across four or more cards is rather network-intensive.
Unlike Nvidia's GPUs, which often feature speedy proprietary NVLink interconnects that shuttle data between chips at more than a terabyte a second, Furiosa stuck with good old PCIe 5.0, which tops out at 128GB/s per card.
To avoid interconnect bottlenecks and the associated overheads, Furiosa says it optimized the chip's communication scheduling and compiler to overlap inter-chip direct memory access operations with compute.
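Furiosa hasn't detailed how that scheduling works, but the underlying pattern is the familiar one of hiding communication latency behind compute. Here's a toy illustration of the idea; the timings and chunking are invented, and on real hardware this is the compiler's job, not application code:

```python
# Toy sketch: overlap the transfer of the next chunk with compute on the
# current one, so communication time is hidden rather than added.
import threading
import time

def dma_transfer(chunk: int) -> None:
    time.sleep(0.002)  # stand-in for a 2ms inter-chip DMA over PCIe

def matmul(chunk: int) -> None:
    time.sleep(0.002)  # stand-in for 2ms of matrix math on a chunk

start = time.perf_counter()
for i in range(8):
    xfer = threading.Thread(target=dma_transfer, args=(i + 1,))
    xfer.start()  # move chunk i+1 while...
    matmul(i)     # ...computing on chunk i
    xfer.join()
elapsed = (time.perf_counter() - start) * 1e3
print(f"{elapsed:.0f} ms")  # ~16ms overlapped vs ~32ms run back to back
```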
But because Furiosa hasn't shared figures for higher batch sizes, it's hard to say just how well this approach scales. At a batch size of 1, the number of tensor-parallel operations is relatively small, Paik admitted.
According to Paik, individual performance should only drop by 20-30 percent at a batch size of 64. That suggests the same setup should be able to achieve close to 2,700 tokens a second of total throughput and support a fairly large number of concurrent users, but without hard details, we can only speculate.
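For what it's worth, that estimate is just Paik's figures multiplied out:

```python
# Extrapolating aggregate throughput at batch size 64 from Paik's numbers.
batch1_rate = 55          # midpoint of LG's 50-60 tok/s at batch size 1
slowdown = 0.25           # midpoint of the projected 20-30% per-user drop
per_user = batch1_rate * (1 - slowdown)    # ~41 tok/s per user
print(f"{per_user * 64:.0f} tok/s total")  # ~2,640 tok/s across 64 users
```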
Competitive landscape
In any case, Furiosa's chips are good enough that LG's AI Research division now plans to offer servers powered by RNGD to enterprises using its Exaone models.
"After extensively testing a wide range of options, we found RNGD to be a highly effective solution for deploying Exaone models," Jeon said.
Similar to Nvidia's RTX Pro Blackwell-based systems, LG's RNGD boxes will be available with up to eight PCIe accelerators. These systems will run what Furiosa describes as a highly mature software stack, which includes a version of vLLM, a popular model-serving runtime.
LG will also offer its agentic AI platform, called ChatExaone, which bundles up a bunch of frameworks for document analysis, deep research, data analysis, and retrieval-augmented generation (RAG).
Furiosa's powers of persuasion don't stop at LG, either. As you may recall, Meta reportedly made an $800 million bid to acquire the startup earlier this year, but ultimately failed to convince Furiosa's leaders to hand over the keys to the kingdom.
Furiosa also benefits from the growing demand for sovereign AI: models, software, and infrastructure designed and trained on homegrown hardware.
Still, to compete on a global scale, Furiosa faces some challenges. Most notably, Nvidia and AMD's latest crop of GPUs not only offer much higher performance, memory capacity, and bandwidth than RNGD, but by our estimate are a fair bit more energy-efficient. Nvidia's architectures also allow for higher degrees of parallelism thanks to its early investments in rack-scale designs, a design point we're only now seeing other chipmakers embrace.
Having said that, it's worth noting that the design process for RNGD began in 2022, before OpenAI's ChatGPT kicked off the AI boom. At the time, models like BERT were the mainstream in language models. Paik, however, bet that GPT was going to take off and that its underlying architecture would become the new norm, which informed decisions like using HBM rather than GDDR memory.
"In hindsight, I think I should have made an even more aggressive bet and had four HBM [stacks] and put more compute dies on a single package," Paik said.
We've seen numerous chip companies, including Nvidia, AMD, SambaNova, and others, embrace this approach in order to scale their chips beyond the reticle limit.
Hindsight being what it is, Paik says that now that Furiosa has managed to prove out its tensor contraction processor architecture, HBM integration, and software stack, the company simply needs to scale up its design.
"We have a very solid building block," he said. "We're quite confident that when you scale up this chip architecture, it will be quite competitive against all the latest GPU chips." ®