This post has been co-authored by Sheila Mueller, Senior GBB HPC+AI Specialist, Microsoft; Gabrielle Davelaar, Senior GBB AI Specialist, Microsoft; Gabriel Sallah, Senior HPC Specialist, Microsoft; Annamalai Chockalingam, Product Marketing Manager, NVIDIA; J Kent Altena, Principal GBB HPC+AI Specialist, Microsoft; Dr. Lukasz Miroslaw, Senior HPC Specialist, Microsoft; Uttara Kumar, Senior Product Marketing Manager, NVIDIA; Sooyoung Moon, Senior HPC + AI Specialist, Microsoft.
As AI emerges as a crucial tool in so many sectors, it’s clear that the need for optimized AI infrastructure is growing. Going beyond just GPU-based clusters, cloud infrastructure that provides low-latency, high-bandwidth interconnects and high-performance storage can help organizations handle AI workloads more efficiently and produce faster results.
HPCwire recently sat down with Microsoft Azure and NVIDIA’s AI and cloud infrastructure specialists and asked a series of questions to uncover AI infrastructure insights, trends, and advice based on their engagements with customers worldwide.
How are your most interesting AI use cases dependent on infrastructure?
Sheila Mueller, Senior GBB HPC+AI Specialist, Healthcare & Life Sciences, Microsoft: Some of the most interesting AI use cases are in patient health care, both clinical and research. Research in science, engineering, and health is creating significant improvements in patient care, enabled by high-performance computing and AI insights. Common use cases include molecular modeling, therapeutics, genomics, and health treatments. Predictive analytics and AI, coupled with cloud infrastructure purpose-built for AI, are the backbone for improvements and simulations in these use cases and can lead to a faster prognosis and the ability to research cures. See how Elekta brings hope to more patients around the world with the promise of AI-powered radiation therapy.
Gabrielle Davelaar, Senior GBB AI Specialist, Microsoft: Many manufacturing companies need to train inference models at scale while complying with strict local and European-level regulations. AI is placed on the edge with high-performance compute. Full traceability with strict rules on privacy and security is critical. This can be a tricky process because every step must be recorded for reproduction, from simple things like dataset versions to more complex things such as knowing which environment was used with which machine learning (ML) libraries and their specific versions. Machine learning operations (MLOps) for data and model auditability now make this possible. See how BMW uses machine learning-supported robots to provide flexibility in quality control for automotive manufacturing.
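As a rough illustration of the traceability described above, the following Python sketch records a dataset fingerprint together with the interpreter, platform, and library versions used for a run. It is only a minimal sketch using the standard library, not the tooling any of the companies mentioned actually use; the function names (`dataset_fingerprint`, `library_versions`, `build_run_manifest`) and the example package list are hypothetical.

```python
import hashlib
import json
import platform
from importlib import metadata


def dataset_fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies one exact dataset version."""
    return hashlib.sha256(data).hexdigest()


def library_versions(names):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_run_manifest(dataset: bytes, libraries):
    """Collect the facts an auditor would need to reproduce a training run."""
    return {
        "dataset_sha256": dataset_fingerprint(dataset),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "libraries": library_versions(libraries),
    }


if __name__ == "__main__":
    # Hypothetical inputs: a tiny in-memory dataset and two commonly used ML packages.
    manifest = build_run_manifest(b"example,rows\n1,2\n", ["numpy", "torch"])
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing a manifest like this next to each trained model is one simple way to answer later which data and which library versions produced it; a full MLOps platform would typically capture considerably more.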
Gabriel Sallah, Senior HPC Specialist, Automotive Lead, Microsoft: We’ve worked with car makers to develop advanced driver assistance systems (ADAS) and advanced driving systems (ADS) platforms in the cloud, using integrated services to build a highly scalable deep learning pipeline for creating AI/ML models. HPC techniques were applied to schedule, scale, and provision compute resources while ensuring effective monitoring, cost management, and data traceability. The result: faster simulation/training times than their existing solutions, due to the close integration of data inputs, compute simulation/training runs, and data outputs.
Annamalai Chockalingam, Product Marketing Manager, Large Language Models & Deep Learning Products, NVIDIA: Progress in AI has led to the explosion of generative AI, particularly with advancements in large language models (LLMs) and diffusion-based transformer architectures. These models now recognize, summarize, translate, predict, and generate languages, images, videos, code, and even protein sequences, with little to no training or supervision, based on massive datasets. Early use cases include improved customer experiences through dynamic virtual assistants, AI-assisted content generation for blogs, advertising, and marketing, and AI-assisted code generation. Infrastructure purpose-built for AI that can handle compute power and scalability demands is key.
What AI challenges are customers facing, and how does the right infrastructure help?
John Lee, Azure AI Platforms & Infrastructure Principal Lead, Microsoft: When companies try to scale AI training models beyond a single node to tens and hundreds of nodes, they quickly realize that AI infrastructure matters. Not all accelerators are alike. Optimized scale-up node-level architecture matters. How the host CPUs connect to groups of accelerators matters. When scaling beyond a single node, the scale-out architecture of your cluster matters. Selecting a cloud partner that provides AI-optimized infrastructure can be the difference between an AI project’s success or failure. Read the blog: AI and the need for purpose-built cloud infrastructure.
Annamalai Chockalingam: AI models are becoming increasingly powerful due to a proliferation of data, continued advancements in GPU compute infrastructure, and improvements in techniques across both training and inference of AI workloads. Yet combining the trifecta of data, compute infrastructure, and algorithms at scale remains challenging. Developers and AI researchers require systems and frameworks that can scale, orchestrate, crunch mountains of data, and manage MLOps to optimally create deep learning models. End-to-end tools for production-grade systems that incorporate fault tolerance for building and deploying large-scale models for specific workflows are scarce.
Kent Altena, Principal GBB HPC+AI Specialist, Financial Services, Microsoft: Customers are trying to decide on the best architecture, somewhere between the open flexibility of a true HPC environment and the robust MLOps pipeline and capabilities of machine learning. Traditional HPC approaches, whether scheduled by a legacy scheduler like HPC Pack or SLURM or a cloud-native scheduler like Azure Batch, are great when customers need to scale to hundreds of GPUs. In many cases, however, AI environments need the DevOps approach to AI model management and control over which models are authorized, or conversely need overall workflow management.
Dr. Lukasz Miroslaw, Senior HPC Specialist, Microsoft: AI infrastructure is not only the GPU-based clusters but also the low-latency, high-bandwidth interconnect between the nodes and high-performance storage. The storage requirement is often the limiting factor for large-scale distributed training, as the amount of data used for training in autonomous driving projects can grow to petabytes. The challenge is to design an AI platform that meets strict requirements in terms of storage throughput, capacity, support for multiple protocols, and scalability.
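To see why storage throughput can become the limiting factor at petabyte scale, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the dataset size and throughput figures are made-up assumptions rather than measurements from any project mentioned here, and the helper names are hypothetical.

```python
def epoch_read_hours(dataset_tb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to stream the whole dataset once at a sustained aggregate throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    return dataset_gb / throughput_gb_per_s / 3600


def required_throughput_gb_per_s(dataset_tb: float, target_hours: float) -> float:
    """Aggregate throughput (GB/s) needed to read the dataset once within target_hours."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return dataset_tb * 1000 / (target_hours * 3600)


if __name__ == "__main__":
    # Assumed figures for illustration: a 1 PB (1000 TB) dataset read at 10 GB/s.
    print(f"One pass over the data: {epoch_read_hours(1000, 10):.1f} hours")
    print(f"Throughput for a 24-hour pass: {required_throughput_gb_per_s(1000, 24):.1f} GB/s")
```

Under these assumed numbers, a single pass over the data takes more than a day, which is why throughput tends to be sized alongside capacity rather than after it.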
What are the most frequently asked questions about AI infrastructure?
John Lee: “Which platform should I use for my AI project/workload?” There is no single magic product or platform that’s right for every AI project. Customers usually have a good understanding of what answers they’re looking for but aren’t sure which AI products or platforms will get them that answer in the fastest, most economical, and most scalable way. A cloud partner with a wide portfolio of AI products, solutions, and expertise can help find the right solution for specific AI needs.
Uttara Kumar, Senior Product Marketing Manager, NVIDIA: “How do I select the right GPU for our AI workloads?” Customers want the flexibility to provision right-sized GPU acceleration for different workloads to optimize cloud costs (fractional GPU, single GPU, multiple GPUs, all the way up to multiple GPUs across multi-node clusters). Many also ask, “How do you make the most of the GPU instance/virtual machines and leverage it within applications/solutions?” Performance-optimized software is key to doing that.
Sheila Mueller: “How do I leverage the cloud for AI and HPC while ensuring data security and governance?” Customers want to automate the deployment of these solutions, often across multiple research labs with specific simulations. Customers want a secure, scalable platform that provides control over data access to provide insight. Cost management is also a focus in these discussions.
Kent Altena: “How best should we implement this infrastructure to run our GPUs?” We know what we need to run and have built the models, but we also need to understand the final mile. The answer is not always a straightforward, one-size-fits-all answer. It requires understanding their models, what they’re trying to solve, and what their inputs and outputs/workflow look like.
What have you learned from customers about their AI infrastructure needs?
John Lee: The majority of customers want to leverage the power of AI but are struggling to put an actionable plan in place to do so. They worry about what their competition is doing and whether they’re falling behind but, at the same time, are not sure what first steps to take on their journey to integrate AI into their business.
Annamalai Chockalingam: Customers are looking for AI solutions to improve operational efficiency and deliver innovative solutions to their end customers. Easy-to-use, performant, platform-agnostic, and cost-effective solutions across the compute stack are highly desirable to customers.
Gabriel Sallah: All customers want to reduce the cost of training an ML model. Thanks to the flexibility of cloud resources, customers can select the right GPU, storage I/O, and memory configuration for the given training model.
Gabrielle Davelaar: Costs are critical. With the current economic uncertainty, companies need to do more with less and want their AI training to be more efficient and effective. Something a lot of people still don’t realize is that training and inferencing costs can be optimized through the software layer.
What advice would you give to businesses looking to deploy AI or speed innovation?
Uttara Kumar: Invest in a platform that is performant, versatile, scalable, and can support the end-to-end workflow, start to finish, from importing and preparing data sets for training to deploying a trained network as an AI-powered service using inference.
John Lee: Not every AI solution is the same. AI-optimized infrastructure matters, so be sure to understand the breadth of products and solutions available in the marketplace. And just as importantly, make sure you engage with a partner that has the expertise to help navigate the complex menu of possible solutions and find the ones that best match what you need.
Sooyoung Moon, Senior HPC + AI Specialist, Microsoft: No amount of investment can guarantee success without thorough early-stage planning. Reliable and scalable infrastructure for continuous growth is critical.
Kent Altena: Understand your workflow first. What do you want to solve? Is it primarily a calculation-driven solution, or is it built upon a data graph-driven workload? Having that in mind will go a long way toward determining the best or optimal approach to start with.
Gabriel Sallah: What are the dependencies across the various teams responsible for creating and using the platform? Create an enterprise-wide architecture with common toolsets and services to avoid duplication of data, compute monitoring, and management.
Sheila Mueller: Involve stakeholders from IT and Lines of Business to ensure all parties agree to the business benefits, technical benefits, and assumptions made as part of the business case.