
How to Evaluate Deep Neural Network Processors
Vivienne Sze

Why the FOPS/W metric is not enough
A common metric for measuring hardware efficiency is FOPS/W (floating-point operations per second per watt), or TOPS/W (tera operations per second per watt). However, TOPS/W alone is not enough. It is often reported alongside the peak performance in TOPS, which gives the maximum efficiency, since it assumes maximum utilization and thus maximum amortization of overhead. This does not tell the complete story, because processors typically do not operate at their peak TOPS, and their efficiency degrades at lower utilization. The following six metrics, and their interplay, must be considered:
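The gap between peak and effective efficiency can be sketched with a toy model. All numbers below are hypothetical, chosen only to illustrate how utilization scales the advertised figure:

```python
# Back-of-envelope model: effective TOPS/W falls with utilization.
# peak_tops and power_w below are hypothetical example values.

def effective_tops_per_watt(peak_tops, utilization, power_w):
    """Effective TOPS/W at a given utilization (0..1).

    Assumes total power stays roughly constant, since static power,
    clocking, and memory traffic do not scale down with idle MACs.
    """
    return (peak_tops * utilization) / power_w

peak = 100.0   # advertised peak TOPS (hypothetical)
power = 50.0   # watts (hypothetical)

print(effective_tops_per_watt(peak, 1.0, power))  # 2.0 TOPS/W at peak
print(effective_tops_per_watt(peak, 0.3, power))  # 0.6 TOPS/W at 30% utilization
```

The datasheet number corresponds to the first call; real workloads often look more like the second.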

🎯1. The accuracy determines whether the system can perform the given task. To evaluate it, several benchmarks have been proposed, among them MLPerf.

2. The latency and throughput determine whether it can run fast enough and in real time. Throughput is the number of inferences per second, and latency is the time between the arrival of an input sample and the generation of its result. Batching improves throughput but degrades latency. Thus, achieving low latency and high throughput simultaneously can sometimes be at odds depending on the approach, and both metrics should be reported.
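The batching tradeoff can be made concrete with a toy cost model: a fixed per-batch overhead plus a per-sample compute time. Both constants below are hypothetical:

```python
# Toy model of the batching tradeoff: larger batches amortize fixed
# overhead (better throughput) but every sample waits for the whole
# batch to finish (worse latency). Constants are hypothetical.

def batch_metrics(batch_size, fixed_s=0.010, per_sample_s=0.001):
    """Return (throughput in inferences/s, latency in s) for one batch.

    Latency is the time until the whole batch completes, which is what
    the first sample placed into the batch experiences.
    """
    batch_time = fixed_s + per_sample_s * batch_size
    throughput = batch_size / batch_time
    latency = batch_time
    return throughput, latency

for b in (1, 8, 64):
    tput, lat = batch_metrics(b)
    print(f"batch={b:3d}  throughput={tput:7.1f} inf/s  latency={lat * 1000:5.1f} ms")
```

Running this shows throughput and latency both growing with batch size, which is why reporting only one of the two metrics can be misleading.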

🏋🏼3. The energy and power consumption primarily dictate the form factor of the device on which the processing can run. Memory reads and writes, not arithmetic, remain the main consumers of power: a 32-bit DRAM read costs about 640 pJ, while a 32-bit floating-point multiply costs about 0.9 pJ.
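A quick calculation with the two per-operation energies quoted above shows why off-chip memory traffic dominates. The `layer_energy_pj` helper is an illustrative sketch, not a full energy model:

```python
# Per-operation energies quoted above, in picojoules.
DRAM_READ_32B_PJ = 640.0   # 32-bit DRAM read
FP32_MULT_PJ = 0.9         # 32-bit floating-point multiply

def layer_energy_pj(num_mults, dram_reads):
    """Rough energy estimate: arithmetic vs. off-chip memory traffic.

    Counts only multiplies and DRAM reads; a real estimate would also
    include adds, SRAM accesses, and writes. Illustrative sketch only.
    """
    return num_mults * FP32_MULT_PJ + dram_reads * DRAM_READ_32B_PJ

# One DRAM read costs as much as ~700 multiplies:
print(DRAM_READ_32B_PJ / FP32_MULT_PJ)
```

This 700x ratio is why DNN processors spend so much design effort on data reuse and on keeping operands in on-chip memory.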

🤑4. The cost, which is primarily dictated by the chip area and the external memory bandwidth requirements, determines how much one would pay for the solution. Custom DNN processors have a higher design cost (after amortization) than off-the-shelf CPUs and GPUs. Anything beyond this, e.g., the economics of the semiconductor business, including how to price platforms, is outside the scope of this article. Considering the hardware cost of a design is important from both an industry and a research perspective, as it dictates whether a system is financially viable.

🐍5. The flexibility determines the range of tasks it can support. The hardware should not rely on specific properties of DNN models to achieve efficiency, as those properties are diverse and evolve rapidly. For instance, a DNN processor that efficiently supports the case where the entire DNN model (i.e., all of the weights) fits on chip may perform extremely poorly when the model grows larger, which is likely, given that the size of DNN models continues to increase over time.

🚡6. The scalability determines whether the same design effort can be amortized for deployment in multiple domains (e.g., in the cloud and at the edge) and if the system can efficiently be scaled with DNN model size.

👨‍👩‍👧‍👦7. Interplay Among Different Metrics
All of the metrics must be accounted for in order to fairly evaluate the design tradeoffs.

🐭Case 1. A tiny binarized NN accelerator with very low power consumption and high throughput, but unacceptable accuracy.
🐘Case 2. A full floating-point DNN chip with high throughput and moderate chip power consumption. However, it is a pure arithmetic chip containing only MACs: all data reads, writes, and storage are performed off chip, so the total system power consumption will be very high.


