Author:
Simondavide Tritto, WW Performance Digital COE Manager, Advantest
Date
10/22/2024
With the insatiable need for more power, high-performance computing (HPC) products create challenges for test cells in delivering and managing high power during manufacturing test. According to data from Market.us, the global computing power market is forecasted to grow from $45.7 billion in 2023 to $81.3 billion in 2033, experiencing a compound annual growth rate (CAGR) of 6.8% during the forecast period of 2024 to 2033. This article focuses on manufacturing-test power delivery, analyzing the factors that drive the demand for more power and explaining how ATE systems address those needs during a period of rapid market growth.
Testing HPC devices for AI applications presents significant power-delivery challenges during the manufacturing process. These challenges arise from the complex nature of these devices, which exhibit increased gate density per unit area, in the range of several hundred million transistors/mm², leading to large power consumption in most operating modes, including increased leakage current when all gates are in quiescent mode. In addition, server-grade systems that combine CPU/GPU chips with multiple cores and an increasing number of high-speed memory dies and HSIO interfaces are working at higher clock frequencies, resulting in higher dynamic power consumption because dynamic power scales with the square of the supply voltage and linearly with the clock frequency.
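The dependence of dynamic power on voltage and frequency can be made concrete with the standard first-order CMOS switching-power estimate, P = α·C·V²·f. The sketch below is illustrative only: the activity factor, switched capacitance, voltages, and frequencies are hypothetical values chosen for round numbers, not figures for any real device.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    vdd_v: float, freq_hz: float) -> float:
    """First-order CMOS switching power: P = alpha * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating points (assumptions for illustration only).
base = dynamic_power_w(activity=0.2, capacitance_f=500e-9, vdd_v=0.8, freq_hz=2.0e9)
faster = dynamic_power_w(activity=0.2, capacitance_f=500e-9, vdd_v=0.8, freq_hz=3.0e9)
higher_v = dynamic_power_w(activity=0.2, capacitance_f=500e-9, vdd_v=0.9, freq_hz=2.0e9)

print(f"baseline:        {base:.1f} W")
print(f"1.5x frequency:  {faster:.1f} W  (x{faster / base:.2f})")
print(f"0.8 V -> 0.9 V:  {higher_v:.1f} W  (x{higher_v / base:.2f})")
```

In this model, a 50% frequency increase raises dynamic power by 50%, while a 12.5% voltage increase raises it by about 27%, which is why supply-voltage accuracy matters so much at high current.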
Figure 2: All test cell components must provide relevant features for power management
Addressing these power-consumption challenges requires innovative design strategies plus precise and reliable power management to ensure performance and functionality are fully validated across the entire manufacturing process. To achieve tight power control, all test-cell components—the ATE system, the interfacing hardware (probe cards and load boards), and the handling equipment (probers or handlers)—must provide relevant features for power management.
Power-management challenges
The first issue to address is power integrity (PI), which is closely related to signal integrity (SI). PI requires sophisticated filtering/decoupling to minimize the noise and transients that a power distribution network (PDN) might inject due to its inductance and resistance. PI must be maintained from the power-supply instrument all the way through to the DUT power-supply pins or pads.
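One common way to turn a PI requirement into a design number is the target-impedance rule of thumb: the PDN impedance should stay below the allowed voltage ripple divided by the worst-case transient current. The sketch below assumes this rule of thumb (it is not described in the article itself), and the supply voltage, ripple tolerance, and current step are hypothetical.

```python
def target_impedance_ohm(vdd_v: float, ripple_fraction: float,
                         transient_current_a: float) -> float:
    """Rule-of-thumb PDN target impedance: Z = Vdd * ripple / I_transient."""
    if transient_current_a <= 0:
        raise ValueError("transient current must be positive")
    return vdd_v * ripple_fraction / transient_current_a

# Hypothetical core rail: 0.8 V, 3% allowed ripple, 100 A load step.
z_target = target_impedance_ohm(vdd_v=0.8, ripple_fraction=0.03,
                                transient_current_a=100.0)
print(f"target impedance: {z_target * 1e3:.2f} mOhm")  # prints 0.24 mOhm
```

A sub-milliohm target like this one illustrates why every element between the instrument and the DUT pads (cables, connectors, board planes, decoupling) contributes to the PI budget.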
Moreover, testing environments must be scalable to handle different types of HPC devices and their varying power requirements. Therefore, the tester must accommodate a wide range of power levels to support the multiple, independently controlled power domains that modern HPC devices require.
Consequently, the tester must manage multiple power supplies and control circuits to provide the necessary voltages accurately across all the power domains and must also allow precise and user-definable power-up and power-down sequences for the various domains to avoid damage and ensure proper initialization and operation.
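A user-definable sequence of this kind can be pictured as an ordered list of rails with a settling delay after each step. The sketch below is a plain-Python illustration, not a tester API: the `Rail` type, the rail names, voltages, and delays are all hypothetical, and powering down in the reverse of the power-up order is an assumption (a common convention, but the required order is device-specific).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Rail:
    name: str         # hypothetical power-domain name
    voltage_v: float  # target voltage once enabled
    delay_ms: float   # settling time to wait after switching this rail

Step = Tuple[str, str, float, float]  # (action, rail, voltage, delay)

def power_up_steps(rails: List[Rail]) -> List[Step]:
    """Enable rails in the listed order."""
    return [("enable", r.name, r.voltage_v, r.delay_ms) for r in rails]

def power_down_steps(rails: List[Rail]) -> List[Step]:
    """Disable rails in reverse order (assumed convention)."""
    return [("disable", r.name, 0.0, r.delay_ms) for r in reversed(rails)]

# Hypothetical three-domain device.
rails = [Rail("VDD_IO", 1.8, 1.0), Rail("VDD_MEM", 1.1, 0.5), Rail("VDD_CORE", 0.8, 0.5)]

for action, name, volts, delay in power_up_steps(rails) + power_down_steps(rails):
    print(f"{action:8s} {name:9s} -> {volts:.2f} V, then wait {delay} ms")
```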
Thermal management
Thermal management is another key issue. It affects the entire test cell, including test fixtures and the handling equipment, which require efficient heat-dissipation mechanisms, such as advanced cooling systems. In addition, test temperatures must be monitored accurately, and the power delivered must be balanced against its thermal impact, for the entire duration of a test flow.
The power demand of HPC devices often varies rapidly across different tests and even within one single test execution. Consequently, the tester must respond to fast changes in power demand without causing instability or delays, and the response of the power supplies must be adequately fast to suit the power variations. A fast response depends on the PDN and its relevant capacitor network, requiring an optimal combination of bulk vs. filtering and ceramic vs. polymer components.
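To see why the capacitor network matters, consider a first-order estimate of the bulk capacitance needed to hold the rail within a droop limit while the supply loop catches up with a load step: C ≥ I_step · t_response / ΔV. This is a simplified sketch that ignores capacitor ESR and ESL as well as PDN inductance, and all numbers are hypothetical.

```python
def min_bulk_capacitance_f(load_step_a: float, response_time_s: float,
                           max_droop_v: float) -> float:
    """Charge the capacitors must supply before the regulator responds,
    divided by the allowed droop: C = I * t / dV (ESR/ESL ignored)."""
    if max_droop_v <= 0:
        raise ValueError("allowed droop must be positive")
    return load_step_a * response_time_s / max_droop_v

# Hypothetical case: 50 A step, 10 us supply response, 40 mV allowed droop.
c_min = min_bulk_capacitance_f(load_step_a=50.0, response_time_s=10e-6,
                               max_droop_v=0.040)
print(f"minimum bulk capacitance: {c_min * 1e3:.1f} mF")  # prints 12.5 mF
```

Under this simplified model, halving the supply response time halves the capacitance required, which is one reason fast load-change response eases the demands on test-fixture real estate.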
At the same time, the handling equipment must react quickly to sharp temperature changes and peaks to ensure proper test setup and the integrity of the devices under test. Monitoring power consumption and thermal profiles is key both to correct test execution and to gaining insight into the device's performance and efficiency, especially during device validation and bring-up.
Power and multi-site considerations for ATE
For ATE systems, architectural aspects usually define the available power budget for testing power-hungry devices. Each test system has global maximum ratings for the power it can handle and supply, as well as local ratings for each individual instrument. These ratings account for several aspects of the tester hardware: the overall tester supply infrastructure, the cooling capabilities, the components, and the current-carrying capacity of cables and connectors up to the docking mechanical interface with the test fixture.
Delivering the needed power to the DUT has several implications for a test setup’s maximum site count. Focusing on the ATE system, the increasing number of power pins/pads calls for a growing number of device-power-supply (DPS) instruments, whose presence in the tester infrastructure is subject to some boundary conditions. The above-mentioned power ratings apply to any system resource, including digital, mixed-signal, and RF resources. In many cases, ATE vendors provide a power-budget calculator to make sure the tester infrastructure supports the target parallelism. With the increasing number of power domains in current DUTs, a flexible ganging feature (the option of connecting together several DPS channels to source the combined current through a single pin) across the ATE DPS resources enables better resource allocation.
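The interplay between ganging, channel count, and site count can be sketched with a toy budget check. This is not a vendor power-budget calculator: the per-channel current limit, the total channel count, the system power rating, and the domain currents and voltages below are hypothetical, and real calculators account for many more constraints (per-instrument ratings, cooling, digital and RF resources).

```python
import math
from typing import Sequence

def channels_needed(domain_current_a: float, channel_current_a: float) -> int:
    """DPS channels to gang so their combined current covers one domain."""
    return math.ceil(domain_current_a / channel_current_a)

def max_sites(domain_currents_a: Sequence[float], domain_voltages_v: Sequence[float],
              channel_current_a: float, total_channels: int,
              system_power_w: float) -> int:
    """Site count limited by both channel count and total power (toy model)."""
    per_site_channels = sum(channels_needed(i, channel_current_a)
                            for i in domain_currents_a)
    per_site_power_w = sum(i * v for i, v in zip(domain_currents_a, domain_voltages_v))
    by_channels = total_channels // per_site_channels
    by_power = int(system_power_w // per_site_power_w)
    return min(by_channels, by_power)

# Hypothetical DUT with three domains and a hypothetical tester configuration.
currents = [120.0, 30.0, 8.0]   # A per domain
voltages = [0.8, 1.1, 1.8]      # V per domain
sites = max_sites(currents, voltages, channel_current_a=10.0,
                  total_channels=128, system_power_w=1000.0)
print(f"achievable sites: {sites}")
```

In this example the channel count alone would allow eight sites (16 channels per site), but the assumed power rating limits the setup to six, showing how the global and local ratings jointly bound parallelism.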
Another aspect is the resource layout in the ATE infrastructure. With the increasing number of digital pins and power connections, routing all needed signals becomes a challenge, especially with increasing site count. To make the routing as simple as possible and the test fixture easier to design, the tester resource layout and density have a key role.
Another important parameter is the available real estate for load board/probe-card components. With the growing pin count and DPS connections, signal routing complexity and the increased number of components (e.g., switches and capacitors) call for more available space on the test fixture; this space can strongly limit the maximum achievable site count, so it is key that the ATE system be conceived to maximize this space.
Last but not least are the heat-dissipation capabilities of the ATE system, which can also influence the maximum achievable parallelism. The thermal characteristics of the test fixture determine the maximum site count, as do the cooling effectiveness and the thermal stability of the ATE instruments.
Power handling for probe cards and load boards
The growing complexity of leading-edge HPC devices entails a similar level of complexity in probe-card and load-board design. The need for more power handling, together with the increasing pin count and die size (many HPC devices have reached the reticle size), drives a series of necessary and sometimes competing enhancements to probe-card manufacturing technology. One example is the need for more and bigger/longer power probes alongside shorter high-speed needles, leading to hybrid probing schemes where needles of different lengths and forces co-exist in the same probe card, complicating the planarity checks and increasing the overall contact force. A relevant figure of merit is the ratio of power pins to total pins for a typical HPC device, which has grown from roughly 10 to 20% in earlier generations to more than 60% in recent applications.
For these reasons, power delivery significantly influences the design of probe cards, not only for the needles but for all the components, presenting both mechanical and thermal issues. Regarding the former, increasing pin count (power and signal pins) yields a higher total contact force, which in turn calls for lower-force needles (the current trend is to go below 2 g/needle, with 1 g/needle on the horizon) with increased stiffness. Probe cards with a total needle count of several tens of thousands are common, with new technologies enabling more than 100k needles.
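The mechanical impact of these numbers is easy to underestimate. The short calculation below multiplies needle count by per-needle force, assuming that "g/needle" means gram-force per needle and that every needle is loaded equally; both are simplifying assumptions, and the needle counts are round illustrative values taken from the ranges mentioned above.

```python
GRAM_FORCE_TO_NEWTON = 9.80665e-3  # 1 gf expressed in newtons

def total_contact_force_kgf(needle_count: int, force_per_needle_gf: float) -> float:
    """Total probing force in kilogram-force, assuming equal load per needle."""
    return needle_count * force_per_needle_gf / 1000.0

for needles, gf in [(50_000, 2.0), (100_000, 2.0), (100_000, 1.0)]:
    kgf = total_contact_force_kgf(needles, gf)
    newtons = needles * gf * GRAM_FORCE_TO_NEWTON
    print(f"{needles:>7} needles at {gf} gf: {kgf:.0f} kgf ({newtons:.0f} N)")
```

At 100k needles, even 2 gf per needle adds up to about 200 kgf on the probe-card stack, which is why the move toward 1 gf per needle matters for the prober and the card's mechanical design.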
Regarding thermal considerations for the probe card, power means heat, so the heat-dissipation capabilities of the whole stack are pushed to the limits. ATE vendors are pursuing new methodologies and new materials with better dissipation coefficients to enhance the passive cooling of the probe-card structure. Active cooling options are also being evaluated. Finally, thermal simulation of the probe-card assembly, combined with power profiling of the DUT, helps to ensure that no extreme hot spots arise during normal operation.
Similar considerations apply to load-board design, where the thermal and power-handling requirements apply to the DUT socket structure. An accurate PDN design, followed by a thermal simulation of the final structure, can ensure PI and DPS stability and avoid the risk of thermal hot spots, with potential damage to the DUT or load-board components.
Addressing the power requirements for HPC/AI devices
To support power-hungry HPC/AI applications, Advantest offers the V93000 SOC Test Platform, now in its latest EXA Scale generation. Platform instruments include the newly launched DC Scale XHC32 power supply, which provides full application compatibility with existing cards, enabling a seamless transition to the new generation of high-power supplies while reusing the existing DUT boards efficiently. Together with the DC Scale XPS256, the power supply with the highest instrument density and fastest load-change response on the market, and the DUT Scale Duo device interface extension, which enables over 50% more real estate and up to 3x component height, the V93000 EXA Scale platform provides best-in-class power delivery and integrity across the entire manufacturing test process for HPC/AI devices.
Figure 3: Advantest’s DC Scale XHC32 power supply for the V93000 EXA Scale SoC test platform