Cisco IT designed AI-ready infrastructure with Cisco compute, best-in-class NVIDIA GPUs, and Cisco networking that supports AI model training and inferencing across dozens of use cases for Cisco product and engineering teams.
It’s no secret that the pressure to implement AI across the business presents challenges for IT teams. It challenges us to deploy new technology faster than ever before and rethink how data centers are built to meet increasing demands across compute, networking, and storage. While the pace of innovation and business growth is exhilarating, it can also feel daunting.
How do you quickly build the data center infrastructure needed to power AI workloads and keep up with critical business needs? This is exactly what our team, Cisco IT, was facing.
The ask from the business
We were approached by a product team that needed a way to run AI workloads used to develop and test new AI capabilities for Cisco products. It would eventually support model training and inferencing for multiple teams and dozens of use cases across the business. And they needed it done quickly. Given the need for the product teams to get innovations to our customers as quickly as possible, we had to deliver the new environment in just three months.
The technology requirements
We began by mapping out the requirements for the new AI infrastructure. A non-blocking, lossless network was essential for the AI compute fabric to ensure reliable, predictable, and high-performance data transmission within the AI cluster. Ethernet was the first-class choice. Other requirements included:
- Intelligent buffering, low latency: Like any good data center, these are essential for maintaining smooth data flow and minimizing delays, as well as improving the responsiveness of the AI fabric.
- Dynamic congestion avoidance for various workloads: AI workloads can vary significantly in their demands on network and compute resources. Dynamic congestion avoidance would ensure that resources were allocated efficiently, prevent performance degradation during peak usage, maintain consistent service levels, and prevent bottlenecks that could disrupt operations.
- Dedicated front-end and back-end networks, non-blocking fabric: With a goal to build scalable infrastructure, a non-blocking fabric would guarantee sufficient bandwidth for data to flow freely and enable high-speed data transfer, which is crucial for handling the large data volumes typical of AI applications. By segregating our front-end and back-end networks, we could enhance security, performance, and reliability.
- Automation for Day 0 to Day 2 operations: From the day we deployed, configured, and tackled ongoing management, we wanted to reduce any manual intervention to keep processes quick and minimize human error.
- Telemetry and visibility: Together, these capabilities would provide insights into system performance and health, allowing for proactive management and troubleshooting.
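The non-blocking requirement above comes down to simple arithmetic: each leaf switch must have at least as much uplink capacity toward the spines as it has downlink capacity toward the GPU servers. A minimal sketch of that check, with illustrative port counts and speeds (not figures from the actual deployment):

```python
# Hypothetical sketch: check whether a leaf-spine design is non-blocking.
# A fabric is non-blocking when each leaf's spine-facing (uplink) bandwidth
# is at least equal to its server-facing (downlink) bandwidth.

def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to spine-facing bandwidth."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

def is_non_blocking(downlinks: int, downlink_gbps: int,
                    uplinks: int, uplink_gbps: int) -> bool:
    """True when the leaf is at 1:1 or better (no oversubscription)."""
    return oversubscription_ratio(downlinks, downlink_gbps,
                                  uplinks, uplink_gbps) <= 1.0

# 32 x 400G ports to GPU servers, 32 x 400G uplinks -> 1:1, non-blocking.
print(is_non_blocking(32, 400, 32, 400))   # True
# 48 x 400G down, only 8 x 400G up -> 6:1 oversubscribed.
print(is_non_blocking(48, 400, 8, 400))    # False
```

The same ratio is worth rechecking every time ports are re-allocated, since adding servers without adding uplinks silently erodes the 1:1 guarantee.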
The plan – with a few challenges to overcome
With the requirements in place, we began figuring out where the cluster could be built. The existing data center facilities weren’t designed to support AI workloads. We knew that building from scratch with a full data center refresh would take 18-24 months – which was not an option. We needed to deliver an operational AI infrastructure in a matter of weeks, so we leveraged an existing facility, making minor modifications to cabling and device distribution to accommodate it.
Our next considerations were around the data used to train models. Since some of that data wouldn’t be stored locally in the same facility as our AI infrastructure, we decided to replicate data from other data centers into our AI infrastructure storage systems to avoid performance issues related to network latency. Our network team had to ensure sufficient network capacity to handle this data replication into the AI infrastructure.
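Sizing that replication capacity is a back-of-the-envelope calculation: dataset size divided by the replication window, derated for protocol and link efficiency. A hedged sketch with illustrative numbers (the actual dataset sizes and windows were not published):

```python
# Hypothetical capacity-planning sketch: sustained bandwidth needed to
# replicate a training dataset into the AI facility within a time window.
# All figures are illustrative, not from the actual deployment.

def required_gbps(dataset_tb: float, window_hours: float,
                  efficiency: float = 0.7) -> float:
    """Sustained Gbit/s needed, derated by protocol/link efficiency."""
    bits = dataset_tb * 8e12          # terabytes -> bits (1 TB = 8e12 bits)
    seconds = window_hours * 3600
    return bits / seconds / efficiency / 1e9

# Replicating 500 TB overnight (8 hours) at 70% effective link utilization:
print(round(required_gbps(500, 8), 1))   # 198.4 (Gbit/s)
```

A result like this makes it concrete why the inter-data-center links had to be upgraded before the replication jobs, rather than discovered as a bottleneck afterward.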
Now, on to the actual infrastructure. We designed the heart of the AI infrastructure with Cisco compute, best-in-class GPUs from NVIDIA, and Cisco networking. On the networking side, we built a front-end Ethernet network and a back-end lossless Ethernet network. With this model, we were confident that we could quickly deploy advanced AI capabilities in any environment and continue to add them as we brought more facilities online.
Products:
Supporting a growing environment
After making the initial infrastructure available, the business added more use cases each week, and we added additional AI clusters to support them. We needed a way to make it all easier to manage, including managing the switch configurations and monitoring for packet loss. We used Cisco Nexus Dashboard, which dramatically streamlined operations and ensured we could grow and scale for the future. We were already using it in other parts of our data center operations, so it was easy to extend it to our AI infrastructure and didn’t require the team to learn an additional tool.
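The packet-loss monitoring the dashboard automates boils down to comparing successive interface counter snapshots and flagging any port whose drop rate crosses a threshold. A minimal sketch of that logic, with made-up port names and counter values (not the dashboard's actual API):

```python
# Hypothetical sketch of a packet-loss check: diff two interface counter
# snapshots and flag ports whose drop rate exceeds a threshold.
# Port names and counters below are illustrative.

def loss_rate(prev: dict, curr: dict) -> float:
    """Fraction of transmitted packets dropped between two snapshots."""
    tx = curr["tx_pkts"] - prev["tx_pkts"]
    drops = curr["drops"] - prev["drops"]
    return drops / tx if tx else 0.0

def flag_lossy_ports(prev_snap: dict, curr_snap: dict,
                     threshold: float = 1e-6) -> list:
    """Return names of ports whose drop rate exceeds the threshold."""
    return [port for port in curr_snap
            if loss_rate(prev_snap[port], curr_snap[port]) > threshold]

prev = {"eth1/1": {"tx_pkts": 1_000_000, "drops": 0},
        "eth1/2": {"tx_pkts": 2_000_000, "drops": 10}}
curr = {"eth1/1": {"tx_pkts": 2_000_000, "drops": 0},
        "eth1/2": {"tx_pkts": 4_000_000, "drops": 250}}
print(flag_lossy_ports(prev, curr))   # ['eth1/2']
```

On a lossless back-end fabric the threshold can be effectively zero: any sustained drops are a signal that PFC or ECN tuning needs attention.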
The results
Our team was able to move fast and overcome several hurdles in designing the solution. We were able to design and deploy the back end of the AI fabric in under three hours, and to deploy the entire AI cluster and fabrics in three months, which was 80% faster than the alternative of a full rebuild.
Today, the environment supports more than 25 use cases across the business, with more added each week. These include:
- Webex Audio: Improving codec development for noise cancellation and lower-bandwidth data prediction
- Webex Video: Model training for background replacement, gesture recognition, and face landmarks
- Custom LLM training for cybersecurity products and capabilities
Not only were we able to support the needs of the business today, but we’re also designing how our data centers need to evolve for the future. We are actively building out more clusters and will share additional details on our journey in future blogs. The modularity and flexibility of Cisco’s networking, compute, and security gives us confidence that we can keep scaling with the business.
Additional resources: