Valkey 9.0 ElastiCache: T3 vs T4g across micro, small, and medium

This is one fixed Valkey 9.0 load read across six finished Amazon ElastiCache reports: cache.t3.micro and cache.t4g.micro, cache.t3.small and cache.t4g.small, cache.t3.medium and cache.t4g.medium. Two axes at once: Intel (T3) versus Graviton (T4g), and micro to small to medium scaling.

The result is not a flat winner. Micro and small nodes reached maxmemory, so they read as behavior under memory pressure. Only the medium pair kept real headroom. Underneath the headline ops/s sit three signals most comparisons skip: burstable CPU credits, engine-side latency versus client-observed latency, and how much each node swapped.

Result table

Node	Avg ops/s	Peak ops/s	Client p99 (ms)	CV	Hit rate	Evictions	Max memory	Headroom	Peak engine CPU
cache.t3.micro	5,983.8	16,947.5	1.001	35.85%	62.28%	102,861	100.00%	0.00%	24.26%
cache.t4g.micro	5,429.3	11,576.5	1.198	19.06%	56.64%	85,894	100.00%	0.00%	11.75%
cache.t3.small	9,133.3	11,229.0	0.677	8.45%	44.32%	43,497	100.00%	0.00%	22.52%
cache.t4g.small	11,498.4	17,744.7	0.515	11.93%	51.57%	148,935	100.00%	0.00%	29.17%
cache.t3.medium	13,868.4	16,467.7	0.432	5.85%	35.17%	0	81.89%	18.11%	23.65%
cache.t4g.medium	11,337.4	14,240.2	0.524	6.97%	31.88%	0	70.53%	29.47%	24.01%

Reading by size: where the cliff is

The cleanest way to read this set is by memory state, not by raw ops/s. Four of the six nodes ended the hour at 100% memory and evicting. Only the medium pair finished with headroom.

Micro (pressure): both at maxmemory, the highest eviction counts relative to capacity, and by far the most swap. This is the memory cliff, not a capacity read.
Small (pressure): still at 100% memory, but the engine had more room, with lower latency and steadier throughput than micro.
Medium (headroom): zero evictions on both, 18-29% memory headroom. This is the only pair that measures sustained capacity rather than eviction behavior.
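
A minimal sketch of that memory-state reading, using only the max-memory and eviction columns from the result table. The function names and the rule "100% memory or any eviction counts as pressure" are my own illustration, not something the reports define.

```python
# Values copied from the result table: (max memory %, evictions).
RUNS = {
    "cache.t3.micro": (100.00, 102_861),
    "cache.t4g.micro": (100.00, 85_894),
    "cache.t3.small": (100.00, 43_497),
    "cache.t4g.small": (100.00, 148_935),
    "cache.t3.medium": (81.89, 0),
    "cache.t4g.medium": (70.53, 0),
}


def memory_regime(max_mem_pct: float, evictions: int) -> str:
    """Label a run 'pressure' or 'headroom'.

    Assumption: reaching 100% memory or evicting any key means pressure.
    """
    if max_mem_pct >= 100.0 or evictions > 0:
        return "pressure"
    return "headroom"


def headroom_pct(max_mem_pct: float) -> float:
    """Headroom is what is left of 100% after the memory peak."""
    return round(100.0 - max_mem_pct, 2)


for node, (max_mem, evictions) in RUNS.items():
    regime = memory_regime(max_mem, evictions)
    print(f"{node:17s} {regime:9s} headroom={headroom_pct(max_mem):.2f}%")
```

Sorting by this label before looking at ops/s keeps the two medium runs from being ranked against four eviction-regime runs.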

T3 versus T4g, size by size

At micro, t3.micro posted higher average throughput (5,983.8 vs 5,429.3 ops/s) and a lower client p99 (1.001 vs 1.198 ms), but it did so by running far burstier, at CV 35.85% against t4g.micro's 19.06%. t4g.micro was the steadier engine.

At small, t4g.small led on served load (11,498.4 vs 9,133.3 ops/s) and p99 (0.515 vs 0.677 ms), but paid for it with 148,935 evictions against t3.small's 43,497, and roughly 3x the swap.

At medium, the order flips: t3.medium was faster (13,868.4 vs 11,337.4 ops/s, 0.432 vs 0.524 ms p99), while t4g.medium kept more memory headroom (29.47% vs 18.11%) and used less load-generator CPU. No architecture wins every size.

The burstable-credit caveat

T3 and T4g are burstable: they earn CPU credits at a baseline rate and spend them to run above it. Every node here held a near-floor credit balance for the whole hour, so none of these numbers describe burst. They describe sustained operation at baseline. The micro nodes were the most credit-starved of all.

Node	EngineCPU avg/peak	Credit balance avg	Credit balance min	Credit usage avg
cache.t3.micro	8.26% / 24.26%	0.01	0.01	1.0212
cache.t4g.micro	8.26% / 11.75%	0.01	0.01	1.0095
cache.t3.small	17.82% / 22.52%	0.06	0.03	2.0235
cache.t4g.small	20.01% / 29.17%	0.02	0.02	1.9981
cache.t3.medium	19.80% / 23.65%	0.04	0.03	2.0414
cache.t4g.medium	19.23% / 24.01%	0.06	0.04	1.9689

Engine CPU peaked at 29.17% (cache.t4g.small) and stayed under 25% on every other node, so CPU was not the wall. Memory was, everywhere except the medium pair.
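
The same check can be sketched against the credit table. The 0.10 cutoff for "near-floor" is a hypothetical threshold I picked for illustration. It is not an AWS-defined value, and the helper names are mine.

```python
# Copied from the credit table: (engine CPU peak %, credit balance avg).
CREDIT_ROWS = {
    "cache.t3.micro": (24.26, 0.01),
    "cache.t4g.micro": (11.75, 0.01),
    "cache.t3.small": (22.52, 0.06),
    "cache.t4g.small": (29.17, 0.02),
    "cache.t3.medium": (23.65, 0.04),
    "cache.t4g.medium": (24.01, 0.06),
}

# Hypothetical cutoff for "near-floor"; an assumption, not an AWS constant.
NEAR_FLOOR_BALANCE = 0.10


def hottest_node(rows):
    """Return the node with the highest engine CPU peak and that peak."""
    node = max(rows, key=lambda name: rows[name][0])
    return node, rows[node][0]


def all_near_floor(rows, cutoff=NEAR_FLOOR_BALANCE):
    """True if every node's average credit balance is at or below the cutoff."""
    return all(balance <= cutoff for _peak, balance in rows.values())


print(hottest_node(CREDIT_ROWS))
print(all_near_floor(CREDIT_ROWS))
```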

Engine speed versus client-observed latency

This is the result that resists a simple ranking. At the engine level (per-command service time in microseconds), Graviton (t4g) was faster at every size. Yet at micro and medium the client still saw lower p99 on the Intel (t3) node. Engine efficiency and client tail latency are different axes, and here they point in opposite directions.

Node	Engine GET (us)	Engine SET (us)	Engine string (us)	Client p99 ms
cache.t3.micro	3.477	7.749	3.865	1.001
cache.t4g.micro	1.546	3.547	1.728	1.198
cache.t3.small	2.131	5.794	2.464	0.677
cache.t4g.small	1.470	3.200	1.627	0.515
cache.t3.medium	1.326	3.688	1.540	0.432
cache.t4g.medium	1.253	3.636	1.469	0.524

The starkest case is micro: t4g's engine answered a GET in 1.546 us against t3's 3.477 us, more than twice as fast per op, yet t3.micro still delivered higher throughput and a lower client p99, because it ran far burstier. Client-observed latency is shaped by throughput, queueing, and the credit picture above, not engine service time alone.
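
To make the split between the two axes concrete, here is a small sketch that compares each size pair on engine GET time and on client p99, using the figures from the table above. The `compare` helper is illustrative only.

```python
# Copied from the engine/client table.
ENGINE_GET_US = {
    "cache.t3.micro": 3.477, "cache.t4g.micro": 1.546,
    "cache.t3.small": 2.131, "cache.t4g.small": 1.470,
    "cache.t3.medium": 1.326, "cache.t4g.medium": 1.253,
}
CLIENT_P99_MS = {
    "cache.t3.micro": 1.001, "cache.t4g.micro": 1.198,
    "cache.t3.small": 0.677, "cache.t4g.small": 0.515,
    "cache.t3.medium": 0.432, "cache.t4g.medium": 0.524,
}


def compare(size: str):
    """Return (engine winner, client winner, t3/t4g engine GET time ratio)."""
    t3, t4g = f"cache.t3.{size}", f"cache.t4g.{size}"
    engine_winner = "t4g" if ENGINE_GET_US[t4g] < ENGINE_GET_US[t3] else "t3"
    client_winner = "t4g" if CLIENT_P99_MS[t4g] < CLIENT_P99_MS[t3] else "t3"
    ratio = round(ENGINE_GET_US[t3] / ENGINE_GET_US[t4g], 2)
    return engine_winner, client_winner, ratio


for size in ("micro", "small", "medium"):
    print(size, compare(size))
```

The engine winner is t4g in all three rows, while the client winner is t3 at micro and medium. That is the disagreement this section describes.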

How steady was the throughput?

Averages hide shape. Coefficient of variation (CV) and the peak-to-average ratio say how steady each run actually was. Steadiness improves almost monotonically with size: the medium nodes are calm, the micro nodes are violent. t3.micro is the extreme: it peaked at 2.83x its own average and posted the highest CV in the set.

Node	Avg ops/s	Peak ops/s (peak/avg)	CV %
cache.t3.micro	5,983.8	16,947.5 (2.83x)	35.85
cache.t4g.micro	5,429.3	11,576.5 (2.13x)	19.06
cache.t3.small	9,133.3	11,229.0 (1.23x)	8.45
cache.t4g.small	11,498.4	17,744.7 (1.54x)	11.93
cache.t3.medium	13,868.4	16,467.7 (1.19x)	5.85
cache.t4g.medium	11,337.4	14,240.2 (1.26x)	6.97
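
For readers who want to reproduce the two steadiness metrics from their own throughput samples, a minimal sketch follows. I am assuming CV is the population standard deviation divided by the mean. The reports do not state which estimator they use, so treat that choice as an assumption. The second half recomputes the peak/avg column from the table.

```python
import statistics


def steadiness(samples):
    """Return (CV in percent, peak-to-average ratio) for ops/s samples.

    Assumption: CV uses the population standard deviation.
    """
    mean = statistics.fmean(samples)
    cv_pct = statistics.pstdev(samples) / mean * 100
    peak_to_avg = max(samples) / mean
    return cv_pct, peak_to_avg


# (avg ops/s, peak ops/s) copied from the table above.
TABLE = {
    "cache.t3.micro": (5983.8, 16947.5),
    "cache.t4g.micro": (5429.3, 11576.5),
    "cache.t3.small": (9133.3, 11229.0),
    "cache.t4g.small": (11498.4, 17744.7),
    "cache.t3.medium": (13868.4, 16467.7),
    "cache.t4g.medium": (11337.4, 14240.2),
}


def peak_ratio(node: str) -> float:
    """Peak ops/s divided by average ops/s, rounded as in the table."""
    avg, peak = TABLE[node]
    return round(peak / avg, 2)


for node in TABLE:
    print(f"{node:17s} peak/avg = {peak_ratio(node):.2f}x")
```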

Swap: the cost of the cliff

Swap is the clearest single number for how hard a node was pushed past its memory limit, and it falls sharply as node size grows. The micro nodes swapped roughly 300x more than the medium nodes (76.4 and 72.7 MB against 0.25 MB).

Micro: cache.t3.micro 76.4 MB and cache.t4g.micro 72.7 MB, both deep into the cliff.
Small: cache.t4g.small 10.6 MB versus cache.t3.small 3.2 MB. t4g.small pushed about 3x harder for its throughput edge.
Medium: both 0.25 MB, effectively no swap, consistent with zero evictions.
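
The ratios behind those statements are simple division. A short sketch, with the swap figures copied from the list above and a helper name of my own:

```python
# Swap per node in MB, copied from the list above.
SWAP_MB = {
    "cache.t3.micro": 76.4,
    "cache.t4g.micro": 72.7,
    "cache.t3.small": 3.2,
    "cache.t4g.small": 10.6,
    "cache.t3.medium": 0.25,
    "cache.t4g.medium": 0.25,
}


def swap_ratio(node_a: str, node_b: str) -> float:
    """How many times more swap node_a used than node_b."""
    return SWAP_MB[node_a] / SWAP_MB[node_b]


print(round(swap_ratio("cache.t3.micro", "cache.t3.medium"), 1))
print(round(swap_ratio("cache.t4g.micro", "cache.t4g.medium"), 1))
print(round(swap_ratio("cache.t4g.small", "cache.t3.small"), 1))
```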

Run reading

cache.t3.micro: highest micro throughput and the lower micro client p99, but the burstiest run in the set (CV 35.85%, 2.83x peak) and the most swap.
cache.t4g.micro: the steadier micro engine, with less than half t3.micro's engine-level per-op latency, but lower served throughput under the same cliff.
cache.t3.small: fewer evictions and less swap than t4g.small, at the cost of throughput and p99.
cache.t4g.small: best small-tier served load and p99, but the heaviest eviction churn in the whole set (148,935).
cache.t3.medium: the fastest report overall and the steadiest, with less headroom left over.
cache.t4g.medium: most memory headroom and lowest load-generator CPU. A comfortable, if not the fastest, medium node.

Practical read

If you only carry one idea away: read these by memory state first. Micro and small are eviction-regime reads, and Graviton's engine efficiency translated into a client win at small but not at micro. Medium is the headroom regime, where Intel's t3.medium turned in the fastest, steadiest run while t4g.medium kept the most room to grow. A same-load comparison should not flatten "capacity with headroom" and "engine behavior at 100% memory" into one ranking.

Where I stop reading

This is one fixed one-hour load on Valkey 9.0, the same plan4 profile across all six nodes. It is not a scaling curve, not a forced-eviction benchmark, and not a universal T3-versus-T4g claim. Because every node ran at a near-floor CPU credit balance, none of these numbers describe burst behavior.

I am also not making an ops-per-dollar statement here, because current us-east-1 pricing was not checked on the same basis. And the micro and small results are memory-pressure reads by construction. They reached maxmemory, so treat their throughput as "served under eviction," not "capacity."

Frequently asked questions

Is t4g faster than t3 for Valkey 9.0 on ElastiCache?

Not as a flat rule. Across these same-load runs, t4g (Graviton) had lower engine-level GET/SET latency at every size, but client-observed throughput and p99 depended on the tier: t4g led at small, while t3 led at micro and medium.

Why do the micro and small nodes evict keys but the medium nodes do not?

Micro and small reached 100% memory under this load, so Valkey evicted to stay under maxmemory. Both medium nodes finished with 18-29% headroom and zero evictions.

Do burstable CPU credits affect these results?

Yes. T3 and T4g are burstable. All six runs held a near-floor credit balance (0.01-0.06), meaning they ran at baseline rather than bursting. Engine CPU peaked at about 29%, so memory, not CPU, was the limiting factor outside the medium pair.

Which node had the most memory headroom?

cache.t4g.medium, at 29.47% headroom (70.53% peak memory), versus 18.11% on cache.t3.medium. Every micro and small node finished at 0% headroom.

Why does t3.micro beat t4g.micro on throughput despite a slower engine?

t3.micro ran much burstier, at CV 35.85% and a peak 2.83x its average. Those bursts raised its hourly average throughput, and it also posted the lower client p99, even though t4g.micro answered each individual command in less than half the time at the engine level.

Tags:

Amazon ElastiCache, AWS, valkey, Valkey 9.0, load testing