Cache MX On-chip memory expansion
Overview
The Cache MX IP compresses on-chip L2 and L3 SRAM caches, enabling 2x effective capacity. SRAM caches can take up to 30-50% of an SoC or xPU's silicon real estate and a significant share of the power budget, which grows with physical dimensions. While digital logic scales effectively with process-node shrinks, SRAM essentially stopped scaling from the 5nm to the 3nm node. Growing compute-core counts demand higher SRAM capacity to scale IPC performance effectively, yet increasing SRAM area negatively impacts both die cost and die yield. Cache MX offers a power-, area-, and cost-effective alternative that enables performance scaling at single-digit cycle latency.
Standards
Z-Trainless (proprietary)
Z-ZID (proprietary)
Architecture
Modular architecture enables seamless scalability: multiple independent Cache MX instances can coexist within an SoC without requiring coordination
Architectural configuration parameters are accessible to fine-tune performance
HDL Source Licenses
Synthesizable System Verilog RTL (encrypted)
Implementation constraints
UVM testbench (self-checking)
Vectors for testbench and expected results
User Documentation
Features
On-the-fly compression / decompression of cache lines
Optional secure training on metadata capability
Silicon-verified in TSMC N5
On-the-fly multi-algorithm switching without recompression
Deliverables
Performance evaluation license: C++ compression model for integration into customer performance simulation models
FPGA evaluation license
Encrypted IP delivery (Xilinx)
Applications
Server xPUs, smart devices, and embedded systems deal with a wide range of workload data sets from diverse applications. Cache MX has been evaluated across a wide range of workload benchmarks, including high-performance compute benchmarks such as SPEC2017 INT and SPEC2017 FP, AI/ML benchmarks such as MLPerf Training, and database benchmarks including Renaissance and MonetDB + TPC-H. Cache MX delivers 2x compression on average across this suite at single-digit cycle latencies.
Integration
The Cache MX IP contains the compression and decompression accelerators as an integrated block that can be easily integrated into the SoC cache controller design. The tag array is decoupled from the data array and doubled in size to accommodate the additional tags needed to address more blocks; the data array remains unchanged. Optional custom integration into existing designs is available.
Benefits
The Cache MX compression solution increases L2$, L3$, and SLC cache capacity by 2x at an 80% area and power saving compared to equivalent SRAM capacity. It provides real-time compression, compaction, and transparent memory management, operating at cache speed and throughput.
Performance / KPI
| Feature | Performance |
| --- | --- |
| Compression ratio | 2x across diverse data sets |
| Latency, read & write (cycles) | 5 cycles (ZSD algorithm) |
| Added latency, read & write (%) | 7% (L3$, SLC), 33% (L2$) |
| Performance acceleration | 15-30% |
| Frequency | L2$, L3$, SLC speed |
| IP area | Starting at 0.1 mm² (TSMC 5nm), excluding customer-required tag array modifications |
| Memory technologies supported | On-chip SRAM and VCACHE |
System integration of Cache MX
There are three steps to integrate Cache MX:
1. The cache controller is enhanced with compression / decompression engines packaged in an IP block.
2. The tag array is decoupled from the data array, and the number of tags is increased to address more blocks.
3. The tags receive a slight modification (extra metadata) to support more logical blocks per physical frame; the data array remains unchanged.
Cache MX
The Cache MX compression solution increases the cache capacity by 2x at an 80% area and power saving to comparable SRAM capacity.
Ziptilion™ MX
High-performance, low-latency hardware-accelerated compression at unmatched power efficiency.
Ziptilion™ BW
Delivers up to 25% more (LP)DDR bandwidth at nominal frequency and power, enabling a significantly more performant and energy-efficient SoC.
DenseMem
Double the CXL-connected memory capacity with DenseMem.
NVMe expansion
Extend NVMe storage capacity 2-4x with LZ4 or zstd hardware-accelerated compression.
SphinX
High-performance, low-latency AES-XTS industry-standard encryption / decryption, with independent non-blocking encryption and decryption channels.