AI Solution Brief

Benchmarking Configuration

TensorFlow benchmarks were performed on bare metal, single-socket servers with equivalent memory, networking, and storage configurations for the x86 platforms shown. Processors tested include the AMD EPYC 7763 “Milan” with TF 2.7 ZenDNN, the Intel Xeon 7375 “Cascade Lake” with TF 2.7 DNNL, the Intel Xeon 8380 “Ice Lake” with TF 2.7 DNNL, and the Ampere Altra Max M128-80 with Ampere Optimized TF 2.7. The Arm64-based “Graviton 2”, available exclusively through AWS (c6g shape), was tested in a 64-core configuration. Benchmarks were performed with Ampere’s internal testing software, which is based on the Ampere Model Library. This software is written entirely in Python and complies with the MLCommons Inference (a.k.a. MLPerf) methodology for calculating latency and throughput. It uses the standard APIs of each framework in the conventional way, replicating usage in real-life applications. For the latency benchmarks, a single system process was executed at a time for each configuration listed below. Each process, following a warm-up run, ran workloads of batch size 1 in a loop for a minimum of 60 seconds. The final latency value was then calculated from the collected net inference time of each pass through the network.

  • Intel Xeon 8380 “Ice Lake” - number of threads: 1, 4, 16, 32, 64, 80
  • Intel Xeon 7375 “Cascade Lake” - number of threads: 1, 4, 16, 32, 64
  • AMD EPYC 7763 “Milan” - number of threads: 1, 4, 16, 32, 64, 128
  • Ampere Altra Max M128-80 – number of threads: 1, 4, 16, 32, 64, 128
  • AWS Graviton 2nd gen (c6g) - number of threads: 1, 4, 16, 24, 32, 48, 64
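
The single-stream latency procedure described above (a warm-up run, then batch-size-1 passes in a loop for at least 60 seconds, with the final value derived from per-pass inference times) can be sketched as follows. This is a minimal illustration of the methodology, not Ampere's actual tooling; the `run_inference` callable stands in for a real TensorFlow model invocation:

```python
import time
import statistics

def benchmark_latency(run_inference, min_duration_s=60.0, warmup_runs=5):
    """Measure single-stream latency: warm up, then run batch-size-1
    inferences in a loop for at least min_duration_s, timing each pass."""
    for _ in range(warmup_runs):
        run_inference()                    # warm-up, excluded from results
    latencies = []
    start = time.perf_counter()
    while time.perf_counter() - start < min_duration_s:
        t0 = time.perf_counter()
        run_inference()                    # one pass at batch size 1
        latencies.append(time.perf_counter() - t0)
    return {
        "mean_latency_ms": 1000 * statistics.mean(latencies),
        "p50_latency_ms": 1000 * statistics.median(latencies),
        "passes": len(latencies),
    }

# Example with a dummy model call in place of a real TF session:
stats = benchmark_latency(lambda: time.sleep(0.001), min_duration_s=0.1)
```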

For the multi-process throughput benchmarks, a search space of different batch sizes and thread counts per process was covered. Final throughput values were estimated from the 50th-percentile (median) latencies observed during 60-second multi-process runs. All systems were benchmarked running workloads of the following batch sizes per each of n parallel processes: [1, 4, 16, 32, 64, 128, 256]. The number of threads per process vs. the total number of processes was, respectively:

  • Intel Xeon 8380 “Ice Lake” - 1x80, 2x40, 4x20, 16x5, 32x2, 64x1, 80x1
  • Intel Xeon 7375 “Cascade Lake” - 1x64, 4x16, 16x4
  • AMD EPYC 7763 “Milan” - 1x128, 2x64, 4x32, 16x8, 32x4, 64x2, 128x1
  • Ampere Altra Max M128-80 – 1x128, 2x64, 4x32, 16x8, 32x4, 64x2, 128x1
  • AWS Graviton 2nd gen (c6g) - 1x64, 2x32, 4x16, 16x4, 32x2, 64x1
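
Under this methodology, an aggregate throughput figure follows from the median per-pass latency: with n parallel processes each completing one batch per pass, items per second is n × batch size ÷ median latency. A minimal sketch of that calculation, as we read it from the text (not Ampere's published code):

```python
import statistics

def estimate_throughput(per_pass_latencies_s, batch_size, n_processes):
    """Estimate aggregate throughput in items/s for n parallel processes,
    each completing one batch per pass, from the 50th-percentile latency."""
    p50 = statistics.median(per_pass_latencies_s)
    return n_processes * batch_size / p50

# e.g. 4 processes at batch size 16 with a median pass latency of 80 ms:
ips = estimate_throughput([0.081, 0.080, 0.079], batch_size=16, n_processes=4)
```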

Benchmarks on all platforms were run using the same scripting, the same datasets, and the same representation of the models. All platforms ran the same workloads, applying identical pre- and post-processing and making uniform inference calls. In the case of the fp16 Altra data, values were obtained with the same scripting, while the AI model representations differed from their fp32 counterparts only in the precision of the weights – the model quantization process consisted solely of casting to the lower float precision.
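
The fp16 "quantization" described here is a pure precision cast, with no calibration or integer conversion. The rounding each weight undergoes can be illustrated with the standard library's IEEE-754 half-precision (`"e"`) struct format; this is a stand-alone illustration of the cast, not the actual model-conversion code:

```python
import struct

def cast_to_fp16(fp32_weights):
    """Round each fp32 weight to its nearest IEEE-754 binary16 value by
    packing it to the half-precision "e" format and unpacking it again."""
    return [struct.unpack("e", struct.pack("e", w))[0] for w in fp32_weights]

weights = [0.1, -1.5, 3.14159]
fp16_weights = cast_to_fp16(weights)  # values now representable in fp16
```

Values such as -1.5 are exactly representable in fp16 and survive the cast unchanged, while values like 0.1 pick up a small rounding error from the 10-bit mantissa.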

Across all systems tested, the TensorFlow library was used in the latest version available for a given platform:

  • Intel CPUs – TF 2.7 DNNL, available as Docker hub image: intel/intel-optimized-tensorflow:2.7.0
  • AMD CPUs – TF 2.7 Zen-DNN, available at https://developer.amd.com/zendnn/#download as TF_v2.7_ZenDNN_v3.2_Python_v3.8.zip
  • AWS Graviton 2nd gen – TF 2.7 (native aarch64 build), available at https://github.com/KumaTea/tensorflow-aarch64/releases/download/v2.7/tensorflow-2.7.0-cp38-cp38-linux_aarch64.whl
  • Ampere Altra Max – TF 2.7 Ampere Optimized, available at https://solutions.amperecomputing.com/solutions/ampere-ai as AIO for TensorFlow

All benchmarks were run with Python 3.8 in Linux-based environments of the following flavors:

  • Intel Xeon 8380 “Ice Lake” - Ubuntu 20.04, kernel: 5.11
  • Intel Xeon 7375 “Cascade Lake” - Ubuntu 20.04, kernel: 5.11.0-1022-aws
  • AMD EPYC 7763 “Milan” - CentOS 8, kernel: 4.18.0-305.3.1.el8.x86_64
  • Ampere Altra Max M128-80 – Fedora 35, kernel: 5.16.9-200.THP_NO_FIE.fc35.aarch64
  • AWS Graviton 2nd gen (c6g) - Ubuntu 20.04, kernel: 5.11.0-1022-aws
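
A threads-per-process grid of the kind listed earlier is typically enforced by pinning each worker's thread count before the framework initializes, commonly via the OMP_NUM_THREADS environment variable, which OpenMP-backed CPU builds of TensorFlow honor. A hedged sketch of such a launcher (the worker script path and function name are placeholders, not Ampere's harness):

```python
import os
import subprocess
import sys

def launch_workers(n_processes, threads_per_process, worker_script):
    """Start n parallel worker processes, each with its thread count
    pinned via OMP_NUM_THREADS, and return their exit codes."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads_per_process))
    procs = [subprocess.Popen([sys.executable, worker_script], env=env)
             for _ in range(n_processes)]
    return [p.wait() for p in procs]   # wait for all workers to finish
```

For example, the "16x8" Altra Max configuration would correspond to `launch_workers(8, 16, "worker.py")`: eight processes of sixteen threads each.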
Created At : May 25th 2022, 5:58:55 pm
Last Updated At : August 29th 2022, 3:47:54 pm

© 2022 Ampere Computing LLC. All rights reserved. Ampere, Altra and the A and Ampere logos are registered trademarks or trademarks of Ampere Computing.