Andrew Feldman, Co-founder & CEO of Cerebras Systems – Interview Series


Andrew is co-founder and CEO of Cerebras Systems. He is an entrepreneur dedicated to pushing boundaries in the compute space. Prior to Cerebras, he co-founded and was CEO of SeaMicro, a pioneer of energy-efficient, high-bandwidth microservers. SeaMicro was acquired by AMD in 2012 for $357M. Before SeaMicro, Andrew was the Vice President of Product Management, Marketing and BD at Force10 Networks, which was later sold to Dell Computing for $800M. Prior to Force10 Networks, Andrew was the Vice President of Marketing and Corporate Development at RiverStone Networks from the company's inception through IPO in 2001. Andrew holds a BA and an MBA from Stanford University.

Cerebras Systems is building a new class of computer system, designed from first principles for the singular goal of accelerating AI and changing the future of AI work.

Could you share the genesis story behind Cerebras Systems?

My co-founders and I all worked together at a previous startup that my CTO Gary and I started back in 2007, called SeaMicro (which was sold to AMD in 2012 for $334 million). My co-founders are some of the leading computer architects and engineers in the industry – Gary Lauterbach, Sean Lie, JP Fricker and Michael James. When we got the band back together in 2015, we wrote two things on a whiteboard – that we wanted to work together, and that we wanted to build something that would transform the industry and be in the Computer History Museum, which is the equivalent of the Compute Hall of Fame. We were honored when the Computer History Museum recognized our achievements and added the WSE-2 processor to its collection last year, citing how it has transformed the artificial intelligence landscape.

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types who love doing fearless engineering. Our mission when we came together was to build a new class of computer to accelerate deep learning, which has emerged as one of the most important workloads of our time.

We realized that deep learning has unique, massive, and growing computational requirements. And it isn't well-matched by legacy machines like graphics processing units (GPUs), which were fundamentally designed for other work. As a consequence, AI today is constrained not by applications or ideas, but by the availability of compute. Testing a single new hypothesis – training a new model – can take days, weeks, or even months and cost hundreds of thousands of dollars in compute time. That's a major roadblock to innovation.

So the genesis of Cerebras was to build a new type of computer optimized exclusively for deep learning, starting from a clean sheet of paper. To meet the enormous computational demands of deep learning, we designed and manufactured the largest chip ever built – the Wafer-Scale Engine (WSE). In creating the world's first wafer-scale processor, we overcame challenges across design, fabrication and packaging – all of which had been considered impossible for the entire 70-year history of computers. Every element of the WSE is designed to enable deep learning research at unprecedented speeds and scale, powering the industry's fastest AI supercomputer, the Cerebras CS-2.

With every component optimized for AI work, the CS-2 delivers more compute performance in less space and with less power than any other system. It does this while radically reducing programming complexity, wall-clock compute time, and time to solution. Depending on the workload, from AI to HPC, the CS-2 delivers hundreds or thousands of times more performance than legacy alternatives. The CS-2 provides deep learning compute resources equivalent to hundreds of GPUs, while offering the ease of programming, management and deployment of a single device.

Over the past few months Cerebras seems to be all over the news. What can you tell us about the new Andromeda AI supercomputer?

We announced Andromeda in November of last year, and it is one of the largest and most powerful AI supercomputers ever built. Delivering more than 1 exaflop of AI compute and 120 petaflops of dense compute, Andromeda has 13.5 million cores across 16 CS-2 systems, and is the only AI supercomputer to ever demonstrate near-perfect linear scaling on large language model workloads. It is also dead simple to use.

By way of reminder, the largest supercomputer on Earth – Frontier – has 8.7 million cores. In raw core count, Andromeda is more than one and a half times as large. It does different work, obviously, but this gives an idea of the scope: nearly 100 terabits of internal bandwidth, nearly 20,000 AMD EPYC cores feeding it, and – unlike the giant supercomputers which take years to stand up – we stood Andromeda up in three days, and immediately thereafter it was delivering near-perfect linear scaling of AI.

Argonne National Labs was our first customer to use Andromeda, and they applied it to a problem that was breaking their 2,000-GPU cluster, called Polaris. The problem was running very large GPT-3XL generative models while placing the entire COVID genome in the sequence window, so that you could analyze each gene in the context of the entire COVID genome. Andromeda ran a unique genetic workload with long sequence lengths (MSL of 10K) across 1, 2, 4, 8 and 16 nodes, with near-perfect linear scaling. Linear scaling is among the most sought-after characteristics of a big cluster. Andromeda delivered 15.87X throughput across 16 CS-2 systems, compared to a single CS-2, and a reduction in training time to match.
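To put that 15.87X figure in perspective, here is a minimal sketch of how scaling efficiency is conventionally computed from the numbers quoted above (the figures come from the interview; the arithmetic is standard, not Cerebras-specific):

```python
# Scaling efficiency for a fixed workload spread across N systems.
# Numbers from the interview: 16 CS-2 systems delivered 15.87x the
# throughput of a single CS-2.
systems = 16
measured_speedup = 15.87

ideal_speedup = systems                        # perfect linear scaling
efficiency = measured_speedup / ideal_speedup  # fraction of ideal achieved

print(f"Scaling efficiency: {efficiency:.1%}")  # ~99.2% of ideal
```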

Could you tell us about the partnership with Jasper that was unveiled in late November and what it means for both companies?

Jasper is a really interesting company. They are a leader in generative AI content for marketing, and their products are used by more than 100,000 customers around the world to write copy for marketing, ads, books, and more. It's clearly a very exciting and fast-growing space right now. Last year, we announced a partnership with them to accelerate adoption and improve the accuracy of generative AI across enterprise and consumer applications. Jasper is using our Andromeda supercomputer to train its profoundly computationally intensive models in a fraction of the time. This will extend the reach of generative AI models to the masses.

With the power of the Cerebras Andromeda supercomputer, Jasper can dramatically advance AI work, including training GPT networks to fit AI outputs to all levels of end-user complexity and granularity. This improves the contextual accuracy of generative models and will enable Jasper to personalize content across multiple classes of customers quickly and easily.

Our partnership allows Jasper to invent the future of generative AI by doing things that are impractical or simply impossible with traditional infrastructure, and to accelerate the potential of generative AI, bringing its benefits to our rapidly growing customer base around the globe.

In a recent press release, the National Energy Technology Laboratory and Pittsburgh Supercomputing Center announced the first-ever computational fluid dynamics simulation on the Cerebras Wafer-Scale Engine. Could you describe what specifically a wafer-scale engine is and how it works?

Our Wafer-Scale Engine (WSE) is the revolutionary AI processor for our deep learning computer system, the CS-2. Unlike legacy, general-purpose processors, the WSE was built from the ground up to accelerate deep learning: it has 850,000 AI-optimized cores for sparse tensor operations, massive high-bandwidth on-chip memory, and interconnect orders of magnitude faster than a traditional cluster could possibly achieve. Altogether, it gives you the deep learning compute resources equivalent to a cluster of legacy machines, all in a single device that is as easy to program as a single node – radically reducing programming complexity, wall-clock compute time, and time to solution.

Our second-generation WSE-2, which powers our CS-2 system, can solve problems extremely fast – fast enough to allow real-time, high-fidelity models of engineered systems of interest. It's a rare example of successful "strong scaling," which is the use of parallelism to reduce solve time for a fixed-size problem.

And that's what the National Energy Technology Laboratory and Pittsburgh Supercomputing Center are using it for. We just announced some really exciting results of a computational fluid dynamics (CFD) simulation, made up of about 200 million cells, running at near real-time rates. This video shows the high-resolution simulation of Rayleigh-Bénard convection, which occurs when a fluid layer is heated from the bottom and cooled from the top. These thermally driven fluid flows are all around us – from windy days, to lake-effect snowstorms, to magma currents in the earth's core and plasma movement in the sun. As the narrator says, it's not just the visual beauty of the simulation that's important: it's the speed at which we're able to calculate it. For the first time, using our Wafer-Scale Engine, NETL is able to manipulate a grid of nearly 200 million cells in nearly real time.

What type of data is being simulated?

The workload tested was thermally driven fluid flows, also known as natural convection, which is an application of computational fluid dynamics (CFD). Fluid flows occur naturally all around us — from windy days, to lake-effect snowstorms, to tectonic plate movement. This simulation, made up of about 200 million cells, focuses on a phenomenon called "Rayleigh-Bénard" convection, which occurs when a fluid is heated from the bottom and cooled from the top. In nature, this phenomenon can lead to severe weather events like downbursts, microbursts, and derechos. It's also responsible for magma movement in the earth's core and plasma movement in the sun.
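As background (not from the interview itself), the strength of Rayleigh-Bénard convection in a heated fluid layer is conventionally characterized by the dimensionless Rayleigh number:

```latex
% Rayleigh number for a fluid layer of depth L heated from below:
%   g       - gravitational acceleration
%   \beta   - thermal expansion coefficient
%   \Delta T - temperature difference between bottom and top
%   \nu     - kinematic viscosity
%   \alpha  - thermal diffusivity
Ra = \frac{g\,\beta\,\Delta T\,L^{3}}{\nu\,\alpha}
```

Once Ra exceeds a critical threshold, the fluid stops conducting heat passively and begins to overturn in the characteristic rolling convection cells seen in the simulation.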

Back in November 2022, NETL launched a new field equation modeling API, powered by the CS-2 system, that was as much as 470 times faster than what was possible on NETL's Joule supercomputer. This means it can deliver speeds beyond what clusters of any number of CPUs or GPUs can achieve. Using a simple Python API that enables wafer-scale processing for much of computational science, the WFA delivers gains in performance and cost that cannot be obtained on conventional computers and supercomputers – in fact, it outperformed OpenFOAM on NETL's Joule 2.0 supercomputer by over two orders of magnitude in time to solution.
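To give a sense of the kind of computation a field-equation solver performs on every cell, every time step, here is a generic NumPy illustration of an explicit finite-difference stencil update for heat diffusion. This is purely illustrative and is not the WFA API (whose actual interface is documented by NETL), nor the full convection solver:

```python
import numpy as np

def diffuse_step(T, alpha=0.1):
    """One explicit finite-difference update of a 2D temperature field."""
    T_new = T.copy()
    T_new[1:-1, 1:-1] = T[1:-1, 1:-1] + alpha * (
        T[2:, 1:-1] + T[:-2, 1:-1] +    # neighbors above / below
        T[1:-1, 2:] + T[1:-1, :-2] -    # neighbors left / right
        4 * T[1:-1, 1:-1]
    )
    return T_new

# Tiny toy grid; the NETL simulation discussed above uses ~200 million cells.
T = np.zeros((64, 64))
T[0, :] = 1.0                  # heated boundary, everything else cold
for _ in range(100):
    T = diffuse_step(T)
```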

Because of the simplicity of the WFA API, the results were achieved in just a few weeks, and they continue the close collaboration between NETL, PSC and Cerebras Systems.

By transforming the speed of CFD (which has always been a slow, offline task) on our WSE, we can open up a whole raft of new, real-time use cases for this and many other core HPC applications. Our goal is that by enabling more compute power, our customers can perform more experiments and invent better science. NETL lab director Brian Anderson has told us that this will dramatically accelerate and improve the design process for some really big projects NETL is working on around mitigating climate change and enabling a secure energy future — projects like carbon sequestration and blue hydrogen production.

Cerebras is consistently outperforming the competition when it comes to releasing supercomputers. What are some of the challenges behind building state-of-the-art supercomputers?

Ironically, one of the hardest challenges of big AI isn't the AI. It's the distributed compute.

To train today's state-of-the-art neural networks, researchers typically use hundreds to thousands of graphics processing units (GPUs). And it isn't easy. Scaling large language model training across a cluster of GPUs requires distributing a workload across many small devices, dealing with device memory sizes and memory bandwidth constraints, and carefully managing communication and synchronization overheads.

We've taken a completely different approach to designing our supercomputers through the development of the Cerebras Wafer-Scale Cluster and the Cerebras Weight Streaming execution mode. With these technologies, Cerebras addresses a new way to scale, based on three key points (a conceptual sketch follows the list):

The replacement of CPU and GPU processing by wafer-scale accelerators such as the Cerebras CS-2 system. This change reduces the number of compute units needed to achieve an acceptable compute speed.

To meet the challenge of model size, we employ a system architecture that disaggregates compute from model storage. A compute service based on a cluster of CS-2 systems (providing sufficient compute bandwidth) is tightly coupled to a memory service (with large memory capacity) that provides subsets of the model to the compute cluster on demand. As usual, a data service serves up batches of training data to the compute service as needed.

An innovative model for the scheduling and coordination of training work across the CS-2 cluster that employs data parallelism, layer-at-a-time training with sparse weights streamed in on demand, and retention of activations in the compute service.
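Here is a minimal, purely conceptual Python sketch of that layer-at-a-time, weight-streaming flow as described above. The class and method names are hypothetical placeholders, not Cerebras APIs: weights live in a separate memory service and are streamed to the compute step one layer at a time, while activations stay local.

```python
import numpy as np

class MemoryService:
    """Holds all layer weights; streams them out and applies updates."""
    def __init__(self, layer_sizes, lr=0.01):
        self.lr = lr
        self.weights = [np.random.randn(m, n) * 0.1
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def fetch(self, i):
        return self.weights[i]                 # stream layer i's weights out

    def apply_update(self, i, grad):
        self.weights[i] -= self.lr * grad      # optimizer step stays here

def train_step(x, target, mem):
    # Forward pass: stream each layer's weights in, keep activations local.
    acts = [x]
    for i in range(len(mem.weights)):
        acts.append(np.maximum(acts[-1] @ mem.fetch(i), 0.0))   # ReLU layer

    # Backward pass: stream weights in again, send weight gradients back out.
    grad = acts[-1] - target                   # d(loss)/d(output) for MSE
    for i in reversed(range(len(mem.weights))):
        w = mem.fetch(i)                       # weights streamed in on demand
        grad = grad * (acts[i + 1] > 0)        # back through the ReLU
        mem.apply_update(i, acts[i].T @ grad)  # weight gradient streams out
        grad = grad @ w.T                      # activation gradient stays local

mem = MemoryService([8, 16, 4])
train_step(np.random.randn(32, 8), np.zeros((32, 4)), mem)
```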

There have been fears of the end of Moore's Law for close to a decade. How many more years can the industry squeeze out, and what kinds of innovations are needed?

I think the question we're all grappling with is whether Moore's Law – as written by Moore – is dead. It isn't taking two years to get more transistors. It's now taking four or five years. And these transistors aren't coming at the same price – they're coming in at vastly higher prices. So the question becomes, are we still getting the same benefits of moving from seven to five to three nanometers? The benefits are smaller and they cost more, and so the solutions become more complicated than merely the chip.

Jack Dongarra, a leading computer architect, gave a talk recently and said, "We've gotten much better at making FLOPs than at making I/O." That's really true. Our ability to move data off-chip lags our ability to increase the performance on a chip by a great deal. At Cerebras, we were happy when he said that, because it validates our decision to make a bigger chip and move less stuff off-chip. It also provides some guidance on future ways to make systems with chips perform better. There's work to be done, not just in wringing out more FLOPs but also in techniques to move the data from chip to chip — even from very big chip to very big chip.

Is there anything else that you would like to share about Cerebras Systems?

For better or worse, people often put Cerebras in this category of "the really big chip guys." We've been able to provide compelling solutions for very, very large neural networks, thereby eliminating the need to do painful distributed computing. I believe that's enormously interesting and at the heart of why our customers love us. The interesting area for 2023 will be how to do big compute to a higher level of accuracy, using fewer FLOPs.

Our work on sparsity provides an extremely interesting approach. We don't do work that doesn't move us toward the goal line, and multiplying by zero is a bad idea. We'll be releasing a really interesting paper on sparsity soon, and I think there's going to be more effort on how we get to these efficient points, and how we do so for less power. And not just for less power in training; how do we lower the cost and power used in inference? I think sparsity helps on both fronts.
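"Multiplying by zero is a bad idea" is the intuition behind sparse execution: if a weight is zero, the multiply and the memory traffic behind it can be skipped entirely. A toy illustration of the arithmetic saved (generic NumPy, not Cerebras software):

```python
import numpy as np

# With ~90% of weights pruned to zero, a sparse kernel only performs the
# surviving ~10% of multiply-accumulates; the rest is wasted work.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024))
weights[rng.random(weights.shape) < 0.9] = 0.0   # prune ~90% of the weights

dense_macs = weights.size                        # dense kernel touches everything
sparse_macs = np.count_nonzero(weights)          # only nonzeros do useful work

print(f"MACs skipped by exploiting sparsity: {1 - sparse_macs / dense_macs:.0%}")
```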

Thank you for these in-depth answers; readers who wish to learn more should visit Cerebras Systems.
