Modern AI SuperClusters pack thousands of high-TDP GPUs into compact data halls, generating 30–40 kW per rack and requiring advanced cooling strategies. If you’re looking to upgrade your data center but don’t have the means to retrofit existing air-cooled infrastructure for liquid cooling, an air-cooled solution is the most suitable option. Well-engineered air-cooled designs use structured airflow control, optimized rack layouts, and intelligent containment methods to enable efficient, scalable AI operations without the complexity of liquid cooling.
For practical deployments, Supermicro's Air-Cooled AI SuperCluster solution delivers a proven reference architecture. It enables scalable AI workloads using standard rack infrastructure, optimized airflow control, and high-performance GPU nodes—all without switching to complex liquid cooling systems.
How often have you thought about upgrading your data center, given the revolutionary changes brought about by AI? In a world where artificial intelligence is rapidly transforming business processes, computing loads and infrastructure requirements have moved far beyond the ordinary. Traditional designs built around standard servers and legacy cooling systems can no longer deliver the necessary performance and energy efficiency. Today’s high-density AI superclusters require not only more computing power and better load distribution, but also a radically different approach to heat and power management.
Data center modernization is becoming a strategic priority for companies that want to stay competitive amid rapid technological change. Integrating optimized airflow, tight temperature control, and intelligent energy management creates an infrastructure that can effectively support large-scale AI workloads. Which questions should you ask before starting the modernization work?
The following table answers the most common ones: the challenges legacy infrastructure faces, how Supermicro's Air-Cooled AI SuperCluster solution addresses them, why precise airflow management matters, how cooling requirements differ between GPUs, how the PUE range affects operational efficiency, and how reliable the error margins used for planning are.
| Questions | Answers |
|---|---|
| Q1: Challenges for legacy infrastructures | Legacy data centers are typically designed for workloads generating 5–15 kW per rack. Modern AI applications demand 20–40 kW per rack, putting strain on cooling and power systems, leading to inefficient thermal management, increased energy consumption, thermal throttling, and higher operational costs. |
| Q2: Supermicro's Air-Cooled AI SuperCluster solution | Supermicro's solution leverages standard rack infrastructure with optimized airflow control and intelligent containment strategies. It uses variable-speed fans to prevent heat recirculation and maintain optimal GPU temperatures, balancing peak computational performance with energy efficiency. |
| Q3: Importance of precise airflow management | In high-density racks drawing 30–40 kW, achieving uniform intake temperatures prevents thermal hotspots. Optimized airflow ensures effective heat removal and prevents thermal throttling of GPUs. Intelligent containment and dynamic fan-speed adjustments maintain consistent cooling performance. (A first-order airflow check follows the Key Technical Parameters table.) |
| Q4: Implications of different cooling requirements | Cooling performance varies among configurations. For example, the NVIDIA® B200™ SXM benefits from liquid cooling, while the NVIDIA® H100™ Tensor Core GPU (SXM5) and NVIDIA® H200™ Tensor Core GPU (SXM) are optimized for dual-slot air cooling. Understanding these differences is crucial for designing cooling infrastructure to support each GPU’s requirements. |
| Q5: Impact of PUE range on operational efficiency | PUE measures data center energy efficiency. Modern AI SuperClusters target PUE ranges of 1.1–1.3 for most configurations, and 1.30–1.50 for setups with higher cooling demands. Lower PUE values reduce extra energy consumption for cooling, lowering operational costs and promoting sustainability. (A worked rack-power estimate follows this table.) |
| Q6: Reliability of error margins for planning | Error margins (±5 % for power, ±15 % for rack power, ±10 % for airflow, ±15 % for ΔT) account for variations in GPU TDP, fan performance, chassis resistance, and environmental conditions. These margins allow planners to design systems with a buffer, ensuring reliable performance under fluctuating conditions. |
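To make Q5 concrete, the sketch below (plain Python, with GPU counts and TDPs taken from the Key Technical Parameters table further down) shows how per-GPU power and PUE combine into rack-level figures. The `CONFIGS` dictionary and `rack_power_kw` helper are ours, purely for illustration; this is a first-order estimate, not Supermicro's sizing method.

```python
# First-order rack power estimate: IT load from GPU count and TDP,
# then facility-level draw by scaling with the PUE range.
# Illustrative only: only the GPU load is scaled; host CPUs, fans,
# and switches are not broken out separately.

CONFIGS = {
    # name: (GPUs per rack, watts per GPU, (PUE low, PUE high))
    "H100 SXM5":  (32,  700, (1.1, 1.3)),
    "H200 SXM":   (32,  700, (1.1, 1.3)),
    "MI300X OAM": (32,  750, (1.1, 1.3)),
    "B200 SXM":   (32, 1000, (1.3, 1.5)),
}

def rack_power_kw(gpus: int, watts_per_gpu: float, pue: float) -> float:
    """Facility power for one rack: GPU load scaled by PUE, in kW."""
    return gpus * watts_per_gpu * pue / 1000.0

for name, (gpus, w, (pue_lo, pue_hi)) in CONFIGS.items():
    it_kw = gpus * w / 1000.0
    lo, hi = rack_power_kw(gpus, w, pue_lo), rack_power_kw(gpus, w, pue_hi)
    print(f"{name}: GPU load {it_kw:.1f} kW, rack {lo:.1f}-{hi:.1f} kW")
```

The output reproduces the Total GPU Power and Total Rack Power rows within rounding, which suggests those rows simply scale the GPU load by PUE; real racks also draw power for CPUs, fans, and networking, so treat these figures as GPU-driven estimates rather than a full power budget.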
Key Technical Parameters
| Parameter | NVIDIA H100 SXM5 | NVIDIA H200 SXM | AMD Instinct MI300X OAM | NVIDIA B200 SXM |
|---|---|---|---|---|
| GPU Density (per rack) | 32 GPUs (4×8) | 32 GPUs (4×8) | 32 GPUs (4×8) | 32 GPUs (4×8) |
| Power per GPU | 700 W ± 5 % | 700 W ± 5 % | 750 W ± 5 % | 1 000 W ± 5 % |
| Total GPU Power | 22.4 kW ± 5 % | 22.4 kW ± 5 % | 24.0 kW ± 5 % | 32.0 kW ± 5 % |
| Total Rack Power (incl. PUE; see PUE Range row) | 24.6–29.1 kW (± 15 %) | 24.6–29.1 kW (± 15 %) | 26.4–31.2 kW (± 15 %) | 41.6–48.0 kW (± 15 %) |
| Airflow Volume (CFM) | 3 860–4 570 CFM (± 10 %) | 3 860–4 570 CFM (± 10 %) | 4 145–4 900 CFM (± 10 %) | 5 526–6 532 CFM (± 10 %) |
| Rack ΔT | 15 °C ± 15 % | 15 °C ± 15 % | 12 °C ± 15 % | 17 °C ± 15 % |
| Power Distribution | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 |
| Network Fabric | NVLink™ 4.0 (900 GB/s) + ConnectX-7 VPI (400 Gb/s) | NVLink™ 4.0 (900 GB/s) + ConnectX-7 VPI (400 Gb/s) | Infinity Fabric (896 GB/s) + PCIe Gen 5×16 (128 GB/s) | NVLink™ 5.0 (1.8 TB/s) + ConnectX-7 VPI (400 Gb/s) |
| Cooling | Air-cooled (dual-slot) | Air-cooled (dual-slot) | Air-cooled (OAM chassis) | Air cooling supported; liquid highly recommended |
| Exit Temperature | 30–35 °C | 30–35 °C | 30–40 °C | 35–40 °C |
| Temperature Control | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient |
| PUE Range | 1.1–1.3 | 1.1–1.3 | 1.1–1.3 | 1.30–1.50 (hyperscale < 1.4; enterprise ~ 1.5–1.6) |
| Key Takeaways | High density, mature NVLink ecosystem | Increased memory & efficiency | Massive memory capacity & bandwidth | Peak compute performance, specialized cooling |
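The Airflow Volume and Rack ΔT rows above are linked by the standard sensible-heat relation P = ρ · cp · V̇ · ΔT. The sketch below (illustrative Python; the constants and the `min_airflow_cfm` name are ours) computes the idealized minimum airflow for a given rack power and temperature rise. The vendor CFM figures in the table sit well above this floor, which is expected, since real designs need headroom for chassis impedance, recirculation, and fan tolerances.

```python
# Idealized airflow check from the sensible-heat relation P = rho * cp * Vdot * dT.
# This is a physical lower bound on airflow, not a design number: real racks
# need headroom for chassis impedance, recirculation, and fan tolerances,
# so vendor CFM figures (like those in the table above) are higher.

RHO_AIR = 1.2          # kg/m^3, air density near sea level at ~20 degC
CP_AIR = 1005.0        # J/(kg*K), specific heat of air
M3S_TO_CFM = 2118.88   # cubic metres per second -> cubic feet per minute

def min_airflow_cfm(power_w: float, delta_t_c: float) -> float:
    """Minimum airflow needed to carry power_w watts at a delta_t_c rise."""
    vdot_m3s = power_w / (RHO_AIR * CP_AIR * delta_t_c)
    return vdot_m3s * M3S_TO_CFM

# Example: an H100 rack at the low end of its power band (24.6 kW, 15 degC rise)
print(f"{min_airflow_cfm(24_600, 15):.0f} CFM")   # ~2,900 CFM idealized floor
```

Running the same check for the B200 band (41.6–48.0 kW at a 17 °C rise) gives roughly 4,300–5,000 CFM, again below the table's 5,526–6,532 CFM figure.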
Notes on Error Margins
- ±5 % power variation covers GPU TDP tolerances and workload differences.
- ±15 % rack power variation arises from the PUE range of 1.1–1.3.
- ±10 % airflow variation accounts for chassis resistance and fan performance.
- ±15 % ΔT variation reflects inlet-air, containment, and fan-curve differences.
All rack-level metrics are calculated for Supermicro GPU server chassis and may vary for other vendors. The sketch below shows one way these margins translate into planning bands.
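As a quick illustration of how the margins above turn into planning bands, here is a minimal helper (illustrative Python; the `band` function is ours, not a vendor tool), applied to the B200 column of the parameters table.

```python
# Turn a nominal value and a +/- percentage margin into a (low, high) planning band.
# Illustrative only; margins come from the notes above, nominal figures from the
# B200 column of the parameters table.

def band(nominal: float, margin_pct: float) -> tuple[float, float]:
    """Return (low, high) for a nominal value with a +/- percentage margin."""
    delta = nominal * margin_pct / 100.0
    return nominal - delta, nominal + delta

# B200 rack, using the upper end of each published range plus its margin:
print("Per-GPU power (W):", band(1_000, 5))    # (950.0, 1050.0)
print("Rack power (W):   ", band(48_000, 15))  # (40800.0, 55200.0)
print("Airflow (CFM):    ", band(6_532, 10))   # (5878.8, 7185.2)
print("Rack dT (degC):   ", band(17, 15))      # (14.45, 19.55)
```

Planning to the upper end of these bands provides the buffer described in Q6, so that GPU TDP tolerances, fan derating, and environmental swings do not push a deployed rack outside its power or cooling envelope.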