Modern AI SuperClusters pack thousands of high-TDP GPUs into compact data halls, generating 30–40 kW per rack and requiring advanced cooling strategies. If you’re looking to upgrade your data center but don’t have the means to retrofit existing air-cooled infrastructure for liquid cooling, an air-cooled solution is the most suitable option. Well-engineered air-cooled designs use structured airflow control, optimized rack layouts, and intelligent containment methods to enable efficient, scalable AI operations without the complexity of liquid cooling.
For practical deployments, Supermicro's Air-Cooled AI SuperCluster solution delivers a proven reference architecture. It enables scalable AI workloads using standard rack infrastructure, optimized airflow control, and high-performance GPU nodes—all without switching to complex liquid cooling systems.
How often have you thought about upgrading your data center, given the revolutionary changes brought about by AI? In a world where artificial intelligence is rapidly transforming business processes, computing loads and infrastructure requirements have moved far beyond the ordinary. Traditional designs built around standard servers and legacy cooling systems can no longer deliver the necessary performance and energy efficiency. Today’s high-density AI superclusters require not only more computing power and better load distribution, but also a radically different approach to heat and power management.
Data center modernization is becoming a strategic priority for companies that want to stay competitive amid rapid technological change. Integrating optimized airflow, tight temperature control, and intelligent energy management creates an infrastructure that can effectively support large-scale AI workloads. Which questions should you ask before starting the modernization work?
The following table answers the most common ones: the challenges legacy infrastructure faces, how Supermicro's Air-Cooled AI SuperCluster solution addresses them, why precise airflow management matters, how cooling requirements differ between GPUs, how the PUE range affects operational efficiency, and how reliable the error margins used for planning are.
| Questions | Answers |
|---|---|
| Q1: Challenges for legacy infrastructures | Legacy data centers are typically designed for workloads generating 5–15 kW per rack. Modern AI applications demand 20–40 kW per rack, putting strain on cooling and power systems, leading to inefficient thermal management, increased energy consumption, thermal throttling, and higher operational costs. |
| Q2: Supermicro's Air-Cooled AI SuperCluster solution | Supermicro's solution leverages standard rack infrastructure with optimized airflow control and intelligent containment strategies. It uses variable-speed fans to prevent heat recirculation and maintain optimal GPU temperatures, balancing peak computational performance with energy efficiency. |
| Q3: Importance of precise airflow management | In high-density racks drawing 30–40 kW, achieving uniform intake temperatures prevents thermal hotspots. Optimized airflow ensures effective heat removal and prevents thermal throttling of GPUs. Intelligent containment and dynamic fan-speed adjustments maintain consistent cooling performance. (A first-order airflow check follows the Key Technical Parameters table.) |
| Q4: Implications of different cooling requirements | Cooling performance varies among configurations. For example, the NVIDIA® B200™ SXM benefits from liquid cooling, while the NVIDIA® H100™ Tensor Core GPU (SXM5) and NVIDIA® H200™ Tensor Core GPU (SXM) are optimized for dual-slot air cooling. Understanding these differences is crucial for designing cooling infrastructure to support each GPU’s requirements. |
| Q5: Impact of PUE range on operational efficiency | PUE measures data center energy efficiency. Modern AI SuperClusters target PUE ranges of 1.1–1.3 for most configurations, and 1.30–1.50 for setups with higher cooling demands. Lower PUE values reduce extra energy consumption for cooling, lowering operational costs and promoting sustainability. (A worked rack-power estimate follows this table.) |
| Q6: Reliability of error margins for planning | Error margins (±5 % for power, ±15 % for rack power, ±10 % for airflow, ±15 % for ΔT) account for variations in GPU TDP, fan performance, chassis resistance, and environmental conditions. These margins allow planners to design systems with a buffer, ensuring reliable performance under fluctuating conditions. |
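To make Q5 concrete, the sketch below (plain Python, with GPU counts and TDPs taken from the Key Technical Parameters table further down) shows how per-GPU power and PUE combine into rack-level figures. The `CONFIGS` dictionary and `rack_power_kw` helper are ours, purely for illustration; this is a first-order estimate, not Supermicro's sizing method.

```python
# First-order rack power estimate: IT load from GPU count and TDP,
# then facility-level draw by scaling with the PUE range.
# Illustrative only: only the GPU load is scaled; host CPUs, fans,
# and switches are not broken out separately.

CONFIGS = {
    # name: (GPUs per rack, watts per GPU, (PUE low, PUE high))
    "H100 SXM5":  (32,  700, (1.1, 1.3)),
    "H200 SXM":   (32,  700, (1.1, 1.3)),
    "MI300X OAM": (32,  750, (1.1, 1.3)),
    "B200 SXM":   (32, 1000, (1.3, 1.5)),
}

def rack_power_kw(gpus: int, watts_per_gpu: float, pue: float) -> float:
    """Facility power for one rack: GPU load scaled by PUE, in kW."""
    return gpus * watts_per_gpu * pue / 1000.0

for name, (gpus, w, (pue_lo, pue_hi)) in CONFIGS.items():
    it_kw = gpus * w / 1000.0
    lo, hi = rack_power_kw(gpus, w, pue_lo), rack_power_kw(gpus, w, pue_hi)
    print(f"{name}: GPU load {it_kw:.1f} kW, rack {lo:.1f}-{hi:.1f} kW")
```

The output reproduces the Total GPU Power and Total Rack Power rows within rounding, which suggests those rows simply scale the GPU load by PUE; real racks also draw power for CPUs, fans, and networking, so treat these figures as GPU-driven estimates rather than a full power budget.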
Key Technical Parameters
| Parameter | NVIDIA H100 SXM5 | NVIDIA H200 SXM | AMD Instinct MI300X OAM | NVIDIA B200 SXM |
|---|---|---|---|---|
| GPU Density (per rack) | 32 GPUs (4×8) | 32 GPUs (4×8) | 32 GPUs (4×8) | 32 GPUs (4×8) |
| Power per GPU | 700 W ± 5 % | 700 W ± 5 % | 750 W ± 5 % | 1 000 W ± 5 % |
| Total GPU Power | 22.4 kW ± 5 % | 22.4 kW ± 5 % | 24.0 kW ± 5 % | 32.0 kW ± 5 % |
| Total Rack Power (incl. PUE; see PUE Range row) | 24.6–29.1 kW (± 15 %) | 24.6–29.1 kW (± 15 %) | 26.4–31.2 kW (± 15 %) | 41.6–48.0 kW (± 15 %) |
| Airflow Volume (CFM) | 3 860–4 570 CFM (± 10 %) | 3 860–4 570 CFM (± 10 %) | 4 145–4 900 CFM (± 10 %) | 5 526–6 532 CFM (± 10 %) |
| Rack ΔT | 15 °C ± 15 % | 15 °C ± 15 % | 12 °C ± 15 % | 17 °C ± 15 % |
| Power Distribution | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 | 415 VAC, 32 A, 3φ, N+1 |
| Network Fabric | NVLink™ 4.0 (900 GB/s) + ConnectX-7 VPI (400 Gb/s) | NVLink™ 4.0 (900 GB/s) + ConnectX-7 VPI (400 Gb/s) | Infinity Fabric (896 GB/s) + PCIe Gen 5×16 (128 GB/s) | NVLink™ 5.0 (1.8 TB/s) + ConnectX-7 VPI (400 Gb/s) |
| Cooling | Air-cooled (dual-slot) | Air-cooled (dual-slot) | Air-cooled (OAM chassis) | Air cooling supported; liquid highly recommended |
| Exit Temperature | 30–35 °C | 30–35 °C | 30–40 °C | 35–40 °C |
| Temperature Control | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient | Operates best within 5–30 °C ambient |
| PUE Range | 1.1–1.3 | 1.1–1.3 | 1.1–1.3 | 1.30–1.50 (hyperscale < 1.4; enterprise ~ 1.5–1.6) |
| Key Takeaways | High density, mature NVLink ecosystem | Increased memory & efficiency | Massive memory capacity & bandwidth | Peak compute performance, specialized cooling |
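The Airflow Volume and Rack ΔT rows above are linked by the standard sensible-heat relation P = ρ · cp · V̇ · ΔT. The sketch below (illustrative Python; the constants and the `min_airflow_cfm` name are ours) computes the idealized minimum airflow for a given rack power and temperature rise. The vendor CFM figures in the table sit well above this floor, which is expected, since real designs need headroom for chassis impedance, recirculation, and fan tolerances.

```python
# Idealized airflow check from the sensible-heat relation P = rho * cp * Vdot * dT.
# This is a physical lower bound on airflow, not a design number: real racks
# need headroom for chassis impedance, recirculation, and fan tolerances,
# so vendor CFM figures (like those in the table above) are higher.

RHO_AIR = 1.2          # kg/m^3, air density near sea level at ~20 degC
CP_AIR = 1005.0        # J/(kg*K), specific heat of air
M3S_TO_CFM = 2118.88   # cubic metres per second -> cubic feet per minute

def min_airflow_cfm(power_w: float, delta_t_c: float) -> float:
    """Minimum airflow needed to carry power_w watts at a delta_t_c rise."""
    vdot_m3s = power_w / (RHO_AIR * CP_AIR * delta_t_c)
    return vdot_m3s * M3S_TO_CFM

# Example: an H100 rack at the low end of its power band (24.6 kW, 15 degC rise)
print(f"{min_airflow_cfm(24_600, 15):.0f} CFM")   # ~2,900 CFM idealized floor
```

Running the same check for the B200 band (41.6–48.0 kW at a 17 °C rise) gives roughly 4,300–5,000 CFM, again below the table's 5,526–6,532 CFM figure.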
Notes on Error Margins
- ±5 % power variation covers GPU TDP tolerances and workload differences.
- ±15 % rack power variation arises from the PUE range of 1.1–1.3.
- ±10 % airflow variation accounts for chassis resistance and fan performance.
- ±15 % ΔT variation reflects inlet-air, containment, and fan-curve differences.
All rack-level metrics are calculated for Supermicro GPU server chassis and may vary for other vendors. The sketch below shows one way these margins translate into planning bands.
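As a quick illustration of how the margins above turn into planning bands, here is a minimal helper (illustrative Python; the `band` function is ours, not a vendor tool), applied to the B200 column of the parameters table.

```python
# Turn a nominal value and a +/- percentage margin into a (low, high) planning band.
# Illustrative only; margins come from the notes above, nominal figures from the
# B200 column of the parameters table.

def band(nominal: float, margin_pct: float) -> tuple[float, float]:
    """Return (low, high) for a nominal value with a +/- percentage margin."""
    delta = nominal * margin_pct / 100.0
    return nominal - delta, nominal + delta

# B200 rack, using the upper end of each published range plus its margin:
print("Per-GPU power (W):", band(1_000, 5))    # (950.0, 1050.0)
print("Rack power (W):   ", band(48_000, 15))  # (40800.0, 55200.0)
print("Airflow (CFM):    ", band(6_532, 10))   # (5878.8, 7185.2)
print("Rack dT (degC):   ", band(17, 15))      # (14.45, 19.55)
```

Planning to the upper end of these bands provides the buffer described in Q6, so that GPU TDP tolerances, fan derating, and environmental swings do not push a deployed rack outside its power or cooling envelope.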