Efficiency meets sustainability to revolutionize AI compute: Empowering data centers of the future

While AI continues to transform the tech industry at unprecedented speed, companies are increasingly facing obstacles around energy consumption, rising costs and data center capacity.

At Rakuten Technology Conference 2024, a session on “The Future of Responsible AI Compute” addressed these critical issues and highlighted the urgent need for sustainable solutions. Ampere Computing Chief Evangelist and Vice President of Business Development Sean Varley gave a comprehensive overview of the current state of AI compute, and looked at the innovative technology improving its efficiency.

He was later joined by Qualcomm Marketing Director Hiroshi Izumi and Rakuten Group Executive Officer and Division CTO of the Technology Platforms Division Rohit Dewan to discuss instance density, domain-specific accelerators and the need for collaboration in achieving more efficient AI inference.

What is AI compute?

AI compute comprises two distinct workloads: training and inference. Training refers to the process of teaching a model to generate content by learning patterns from a large dataset, while inference is when the trained model uses these learned patterns to create new content or make predictions based on new inputs.

“Training involves large batch sizes, strict high-precision requirements, and a uniform compute stack that often runs for days and even months at a time. This process demands high-capacity compute resources,” Varley explained.

“On the other hand, inference operates on small batch sizes with relaxed precision requirements and relies on a diverse and often ‘lumpy’ compute stack. Inference is also heavily real-time biased, requiring low latency but high computational efficiency.”

While training has historically been the dominant focus of GenAI engineers, the landscape is shifting.

“Over the next two to three years, the majority of AI workloads will be in inference,” predicted Varley.

The legacy of inefficient processors

“[In the past], no one cared how much the chip actually burned,” noted Varley, encapsulating the inefficiencies of traditional processors.

Developed in an era when power consumption was not a primary concern, this aging infrastructure has exacerbated both power and capacity issues in data centers.

“Now, we have space and power constraints because all the computing that we require is starting to outstrip the available capacity. Many utilities are having trouble keeping up with the demand for power,” Varley remarked.

Adding to this challenge is the inflexible design of current data centers.

“Data centers today have racks and racks of GPUs that are only good for one purpose: AI training,” Varley explained.

This specialization limits adaptability, as these training-focused systems cannot be repurposed for inference or other workloads.

“The proliferation of AI, which is expected to triple future power requirements, has only amplified the demand for adequate solutions,” summed up Varley.

So how can the industry keep up? The solution lies in modernizing infrastructure.

Traditional processors’ inefficiencies have caused significant power and capacity issues in data centers, highlighting the need to update technology to support AI’s growing demands.

A paradigm of responsible AI compute


In the quest to improve efficiency and develop sustainable inference solutions, Varley highlighted two necessary ingredients: data center efficiency, and code and orchestration efficiency.

A few years ago, Ampere introduced its Gen4 DDR4-generation processors, which “offered 2-5x better power efficiency in terms of performance per watt” compared to industry standards. This led to the development of a new key metric: performance per rack.

“It really comes down to saving space and power, and how to maximize the amount of compute you can put in an inherently power-limited rack. The more compute you can fit into a rack, the more space and power you save for equal performance.”
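
To make the performance-per-rack idea concrete, here is a minimal sketch in Python with purely hypothetical numbers (the rack budget, server wattages and performance figures are illustrative assumptions, not Ampere or Rakuten measurements): under a fixed rack power budget, a server with better performance per watt lets you fit more servers, and therefore more total performance, into the same rack.

# Hypothetical illustration of the performance-per-rack metric.
# All numbers below are assumptions made for the sake of the example.

def performance_per_rack(rack_power_w, server_power_w, server_perf):
    """Total performance that fits into one power-limited rack."""
    servers_per_rack = rack_power_w // server_power_w  # rack is power-limited
    return servers_per_rack * server_perf

# Same 15 kW rack, two hypothetical server designs:
baseline = performance_per_rack(15_000, 750, 100)   # 20 servers -> 2000 units
efficient = performance_per_rack(15_000, 375, 90)    # 40 servers -> 3600 units
print(baseline, efficient)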

Code and orchestration efficiency also plays an important role in responsible AI compute at scale. Key practices such as optimizing containers, stateless execution and power-aware coding can contribute to higher processor utilization.

“Optimizing the size of containers helps utilize less memory and compute power, which is essential for densely packing digital services. We also need to have stateless execution because it makes systems more independent and easier to build and scale out,” emphasized Varley.
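
As a rough illustration of the stateless-execution point, the sketch below (the function and store names are hypothetical, not from the talk) keeps no state inside the process: everything a request needs is passed in or read from an external store, so an orchestrator can add or remove identical replicas freely and pack them densely.

# Minimal sketch of stateless execution for an inference service.
# The dict stands in for an external cache or database; in a real deployment
# session state would live outside the process entirely.

EXTERNAL_STORE = {}

def handle_inference(session_id, prompt, run_model):
    history = EXTERNAL_STORE.get(session_id, [])        # state fetched from outside
    reply = run_model(prompt, history)                  # pure compute, no hidden state
    EXTERNAL_STORE[session_id] = history + [(prompt, reply)]
    return reply

# Because calls do not depend on process-local state, any replica can serve
# any request, which is what makes scale-out and dense packing straightforward.
print(handle_inference("user-1", "Hello", lambda p, h: f"echo: {p}"))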

Rakuten and Ampere expand collaboration

Rakuten recently announced plans to expand its collaboration with Ampere to further reduce power consumption and improve data center efficiency. This makes Rakuten the first company in Japan to deploy Ampere-based products on a large scale.

Since 2023, the companies have collaborated to achieve a 36% energy saving and an 11% space reduction per rack for Rakuten Cloud’s services. Recent trials of load-balancing services on the Ampere-based platforms also showed a 22% reduction in power consumption.

The growing partnership underscores Rakuten’s commitment to further enhancing energy efficiency and supporting its expanding AI-driven services and initiatives.

Enhancing AI compute efficiency with domain-specific accelerators and partnerships

Later in the session, Varley was joined on stage by Qualcomm’s Izumi for a panel discussion moderated by Rakuten’s Dewan.

Varley introduced another key metric known as instance density per rack, which measures the number of AI instances that can run per rack. This metric is a modification of the performance per rack paradigm and is essential for understanding the efficiency of AI inference.

“The concept of instance density is particularly relevant in the context of AI inference, where numerous instances of AI models need to run simultaneously to handle various tasks such as user interactions, business operations, and multimodal AI applications. High instance density allows data centers to maximize their computational resources, leading to significant improvements in efficiency and cost-effectiveness,” he highlighted.
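
One simple way to reason about instance density per rack (again with hypothetical numbers rather than figures from the session) is to count how many model instances fit on each server, limited by cores and memory, and multiply by how many servers fit within the rack’s power budget.

# Hypothetical instance-density calculation; all figures are assumptions.

def instances_per_rack(rack_power_w, server_power_w,
                       server_cores, server_mem_gb,
                       instance_cores, instance_mem_gb):
    servers = rack_power_w // server_power_w              # power-limited server count
    per_server = min(server_cores // instance_cores,
                     server_mem_gb // instance_mem_gb)    # core- or memory-bound
    return servers * per_server

# Example: 15 kW rack, 375 W servers with 128 cores / 512 GB each,
# and model instances needing 16 cores / 32 GB apiece:
print(instances_per_rack(15_000, 375, 128, 512, 16, 32))  # 40 servers * 8 = 320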

High instance density is easier to achieve on CPUs, whereas GPUs are less efficient at running many instances in parallel because they are not as easily partitioned or virtualized.

This is where domain-specific accelerators come in.

Domain-specific accelerators are specialized hardware designed to handle specific types of computing tasks more efficiently than general-purpose processors.

“We are targeting a hybrid AI ecosystem, hybrid AI meaning orchestration between the cloud and the edge. To realize this ecosystem, power efficiency and TCO are important factors. Domain specific accelerators for AI inference will help accelerate this ecosystem,” Izumi explained.
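
A toy sketch of the cloud-edge orchestration Izumi describes might look like the routing policy below; the threshold and names are hypothetical assumptions, not Qualcomm’s implementation. Small, latency-sensitive requests stay on a local domain-specific accelerator, while larger ones fall back to cloud inference.

# Hypothetical cloud/edge routing policy for hybrid AI inference.

def route_request(prompt_tokens, latency_sensitive, edge_token_limit=2048):
    """Return 'edge' or 'cloud' for a single inference request."""
    if latency_sensitive and prompt_tokens <= edge_token_limit:
        return "edge"   # run on the local domain-specific accelerator
    return "cloud"      # fall back to data center inference

print(route_request(512, True))      # edge
print(route_request(16_384, False))  # cloud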

Varley and Izumi ended the session by talking about the importance of collaborating with various enterprises to further advance AI compute solutions.

In fact, Qualcomm and Ampere have been collaborating to pioneer the shift from inflexible GPU-dominated architectures to versatile setups that blend compute platforms with domain-specific accelerators. Ampere has also been leading the establishment of the AI Platform Alliance, a consortium aimed at making AI platforms more open, efficient, and sustainable.

“It takes a village,” Varley remarked. “It takes a lot of different software, hardware, systems, integrators, cloud providers and more to create a much more power-efficient, much more open and sustainable infrastructure for the future of AI inference.”
