According to a senior expert at Alphabet (Google’s parent company), the lifespan of a data center GPU may be only 1 to 3 years, depending on its utilization rate. Because GPUs handle almost all of the workload for AI training and inference, their performance degrades faster than any other component.
In data centers operated by cloud giants, the utilization rate of GPUs for AI workloads is between 60% and 70%. According to Tech Fund, citing a chief GenAI architect at Alphabet, at this level of utilization, a GPU’s lifespan is typically only one to two years, with a maximum of three years.
The architect’s comments, posted on the American social media platform X, sparked widespread discussion. While a GPU lifespan of only one to three years may sound exaggerated, it is plausible: data center GPUs used for AI and HPC applications have a thermal design power (TDP) of 700W or more, which places sustained thermal and electrical stress on the silicon.
The GenAI architect also noted that one way to extend a GPU’s lifespan is to lower its utilization rate. This would slow performance degradation, but it would also stretch the return-on-investment cycle, failing to meet the business demand for speed and agility. Cloud giants therefore typically choose to run their GPUs at higher utilization.
Coincidentally, Meta previously released a study (“AI training for 54 days, with a failure every 3 hours; GPU failure rate is 120 times that of CPU!”) detailing the failure rates of its AI cluster of 16,384 Nvidia H100 80GB GPUs while training the Llama 3 405B model. According to that data, the cluster’s utilization during training was approximately 38% (based on BF16 precision training). Of the 419 unexpected failures that interrupted training, 148 (30.1%) were attributed to various GPU faults (including NVLink failures), and 72 (17.2%) were caused by failures of the HBM3 high-bandwidth memory, itself a core component of the GPU. Taken together, GPU-related failures accounted for approximately 47.3% of unexpected interruptions at that utilization level.
Based on Meta’s data, the quality of the H100 appears quite good, with an annualized failure rate of about 9% and a cumulative three-year failure rate of roughly 27%, though the failure rate can be expected to climb as the GPUs age.
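The ~9% annualized figure can be reproduced from the cluster numbers reported above. The sketch below is a back-of-the-envelope estimate, not Meta’s own methodology: it counts GPU and HBM3 failures together, divides by the fleet size, and scales the 54-day window to a year; the three-year figure is a simple linear extrapolation, which is how the 27% in the article appears to be derived.

```python
# Back-of-the-envelope annualized GPU failure rate from Meta's Llama 3 run.
# All input figures come from the article; the linear scaling to one year
# and three years is an assumption, not Meta's published methodology.

gpus = 16_384        # H100 80GB GPUs in the training cluster
days = 54            # length of the training run
gpu_faults = 148     # interruptions attributed to GPU faults (incl. NVLink)
hbm3_faults = 72     # interruptions attributed to HBM3 memory failures

gpu_related = gpu_faults + hbm3_faults            # 220 GPU-related failures

# Fraction of the fleet that failed during the run, scaled to a full year.
annualized_rate = gpu_related / gpus * (365 / days)

# Naive three-year figure, assuming the rate stays constant (it likely rises).
three_year_rate = 3 * annualized_rate

print(f"annualized: {annualized_rate:.1%}")   # about 9%
print(f"three-year: {three_year_rate:.1%}")   # about 27%
```

Note that this treats every failure as hitting a distinct GPU and ignores repairs and swaps, so it is an upper-bound sketch; the point is simply that the article’s 9% and 27% figures fall straight out of the reported counts.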
It is also worth noting that Meta’s training cluster ran at roughly 30–40% utilization. By the reasoning of the GenAI architect from Alphabet, a GPU operating at 60–70% utilization, roughly double Meta’s, would see its failure rate roughly double as well.