arb8020

emergent outlier features in 2025



tl;dr: re-running Dettmers’ 2022 “6.7B outlier-feature phase-transition” experiment on 7 recent MoE models and 5 Qwen3 dense models shows no sudden jump in layer coordination. the activation signature behind the anecdotal 7B “intelligence threshold” seems to have vanished, probably thanks to post-2022 architecture/data changes, but further study is required

exploration

reading RL rollouts while post-training small models, I noticed a huge gap in intelligence between models above and below ~7B parameters. twitter seemed to corroborate this, and it seems to be generally accepted folklore that models undergo emergent step changes in intelligence at certain parameter counts.

tweets from vikhyatk and _xjdr

naturally my first thought upon finding this was ‘well is it 7B total or active’. I think the tweets I saw were talking about dense models, so it wasn’t exactly clear what the answer would be here. digging through the threads, I found references to Tim Dettmers’ 2022 paper on int8 quantization, and decided to read it.

in the process of getting int8 quantization to work, dettmers found what he called ‘outlier features’ in model residual streams. the residual stream acts as a communication channel that lets deep models stack processing layers more effectively. rather than computing y = g_2(g_1(g_0(x))), where each layer implicitly has to recompute the output so far, models instead compute h_1 = x + g_0(x), h_2 = h_1 + g_1(h_1), y = h_2 + g_2(h_2), where each layer reads from and writes to the shared ‘working memory’ accumulated so far.
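as a quick illustration (not the actual architecture - just toy linear blocks standing in for real attention/MLP sublayers), the residual stacking looks something like this:

```python
# toy residual stream: each block reads the accumulated stream and adds its contribution back
import torch
import torch.nn as nn

class ToyResidualStack(nn.Module):
    def __init__(self, hidden_dim: int, n_layers: int):
        super().__init__()
        # stand-ins for real attention/MLP sublayers
        self.blocks = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        stream = x  # the shared 'working memory'
        for block in self.blocks:
            stream = stream + block(stream)  # read the stream, write a contribution into it
        return stream

x = torch.randn(2, 3, 4)           # [batch, sequence, hidden_dim]
y = ToyResidualStack(4, 3)(x)      # three residual updates on top of x
```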

outlier features are essentially specific dimension indices within the residual stream that take on extreme magnitudes (>= 6.0, compared to typical values of roughly +-3.5) and that show up systematically across layers and tokens in the residual stream tensor.

the intuitive explanation is that transformers have two processing streams: one that processes inputs, and one that suppresses noisy features that aren’t relevant to the current context. dedicating specific hidden dimensions to feature removal means that layers can coordinate, and know which dimensions can be multiplied by large values to suppress other features. (more on this here)

dettmers studied the presence of outlier features both across layers and across a given sequence’s tokens. specifically, he defined an outlier feature as one with magnitude at least 6.0 that appears in at least 25% of the layers and at least 6% of the sequence positions.

here’s a quick visualization of outlier features persisting through a layer’s computations. these tensors are in [batch, sequence, hidden_dim] format (shape 2,3,4), with typical values ranging from -3 to +3, but outlier dimensions showing values of 6-9:

input residual stream (dimension 2 has outliers):
-3   0   6  -2     0   3   6   1
-2  -3   6   1     3  -1   9  -3
-2   0   8   3     0  -2   6  -2

           +

layer contribution:
-1   1   1  -1     2  -1   0  -1
-1   1   1  -1    -2   1   0   1
 1   1   1   2     2  -2   0   2

           =

output residual stream (outlier persists in dimension 2):
-4   1   7  -3     2   2   6   0
-3  -2   7   0     1   0   9  -2
-1   1   9   5     2  -4   6   0

notice how dimension 2 (third column, hidden dimension) maintains high values of 6-9 across all sequence positions in both batches, even after adding the layer contribution. this systematic appearance across layers and tokens is what ‘layer coordination’ refers to.
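as a rough sketch (assuming per-layer hidden states have already been stacked into one tensor; the actual sweep code in the repo may differ), the detection criteria above look something like this:

```python
# sketch of dettmers-style outlier detection: magnitude >= 6.0,
# present in >= 25% of layers and >= 6% of sequence positions
import torch

def find_outlier_dims(hidden_states: torch.Tensor,
                      mag_thresh: float = 6.0,
                      layer_frac: float = 0.25,
                      seq_frac: float = 0.06) -> torch.Tensor:
    # hidden_states: [layers, batch, sequence, hidden_dim]
    hit = hidden_states.abs() >= mag_thresh                    # bool, same shape
    # fraction of layers where each hidden dim crosses the threshold somewhere
    layer_cov = hit.any(dim=2).any(dim=1).float().mean(dim=0)  # [hidden_dim]
    # fraction of sequence positions where each hidden dim crosses the threshold in some layer/batch
    seq_cov = hit.any(dim=1).any(dim=0).float().mean(dim=0)    # [hidden_dim]
    return ((layer_cov >= layer_frac) & (seq_cov >= seq_frac)).nonzero(as_tuple=True)[0]

# toy check: plant an outlier in dimension 2 of random data
hs = torch.randn(4, 2, 16, 8)   # 4 layers, batch 2, 16 tokens, hidden_dim 8
hs[..., 2] += 7.0
print(find_outlier_dims(hs))    # should print tensor([2])
```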

dettmers found that these extreme magnitude values in model residual streams started to systematically coordinate across layers around ~6.7B parameters. he called this a phase transition and claimed that ‘transformers after the phase shift at 6.7B parameters behave very different to transformers before the phase shift … one should not try to generalize from <6.7B to beyond’. this is a pretty strong claim, so I thought it would be important for my own experiments to find out how it held up for MoE models.

figure 3 from Dettmers et al. (2022): percentage of layers and sequence dimensions affected by large-magnitude outlier features across transformers, by (a) model size or (b) C4 perplexity. shows the phase transition at ~6.7B parameters.

so i ran the experiment Dettmers did. i swept over 7 MoE models that have come out recently, trying to pinpoint a phase transition point based on either active or total parameters, or perhaps even the sparsity. i suspected it would be based on total params, since that’s closer to ‘total model capacity’, or that we wouldn’t find it at all due to the router.
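the collection step is conceptually simple: grab the residual stream at every layer and feed it into the detector. a rough sketch using the huggingface transformers API (the actual code in the linked repo handles prompts, batching, and precision more carefully; the model name below is just an example):

```python
# sketch: collect per-layer residual streams for one model and one prompt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def collect_hidden_states(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (n_layers + 1) tensors, each [batch, sequence, hidden_dim]
    return torch.stack(out.hidden_states)  # -> [layers, batch, sequence, hidden_dim]

# e.g. hs = collect_hidden_states("Qwen/Qwen3-0.6B", "some evaluation text ...")
#      outlier_dims = find_outlier_dims(hs.float())
```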

the MoE results were messy - basically no pattern. outlier counts ranged from 49 to 4,635, and layer coordination (L%) was all over the place, from 29% to 63%.

figure 1: MoE models, total parameters vs layer & sequence coverage. no clear phase transition - both layer % (blue) and sequence % (orange) show erratic patterns

figure 2: MoE models, active parameters vs layer & sequence coverage - similarly messy results

so i figured my routing hypothesis was most likely right. but then i remembered these were all 2024-2025 models, and Dettmers was using models from 2022, basically an eternity ago. to be rigorous, i decided to run another sweep on some frontier open dense models, specifically the Qwen3 series.

here’s where it gets interesting.

figure 3: dense models, parameters vs layer & sequence coverage. qwen3 dense models show stable ~30% layer coordination (blue) and high ~66-79% sequence coverage (orange) across all sizes - no phase transition

instead of seeing the phase transition Dettmers found at 6.7B, i found ~30% layer coordination across ALL model sizes. no phase transition. interestingly, the sequence coverage is consistently high (~66-79%), meaning outliers still affect most tokens when they appear.

it’s not clear what’s changed between 2022 and 2025. one hypothesis is architectural: components like RoPE, RMSNorm, and gated FFNs/activation functions like SwiGLU have become standard since the models dettmers studied, so it’s hard to say what exactly might have changed the need for these outlier features.

additionally, training datasets have become much more curated and filtered, and are potentially higher signal. dettmers also found that the emergence correlated with model perplexity, not just model size, so higher-quality data might have allowed models to learn feature representations that don’t require the outlier mechanism.

figure 4: dense models, perplexity vs layer & sequence coverage. plotting by perplexity (ranging from 8.97 to 22.28) still shows the same flat patterns - layer % (blue) remains ~30-35% and sequence % (orange) stays high at ~66-79%. no correlation between perplexity and outlier emergence

finally, it might just be the case that the 7B intelligence threshold doesn’t really hold up at all. the stark difference in capabilities between models like Moondream or Phi at similar total parameter counts suggests that architecture and training data are more relevant to study than arbitrary parameter thresholds. perhaps in the past, when architectures and training data were largely the same, the parameter count alone was enough to make the distinction. but now post-training and divergent architectures make this threshold much harder to claim.

i’m still very interested in the meta question here of ‘what’s the least amount of compute i can use for a given task’, and in more mechanistic understandings of ‘why does that model feel smarter than that one’. further research might include the perplexity study that dettmers performed, or a more rigorous look at benchmarks to find a better way to disambiguate this fuzzy idea of an ‘intelligence gap’. i think circuit analysis will also be relevant here, as described in https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-addition.

below are some more details on methodology, and the code is available here for scrutiny. i’d love to get feedback/more thoughts in DMs/replies at x.com/arb8020. always happy to chat!

limitations

methodology

detection criteria

following dettmers: a hidden dimension counts as an outlier feature if its magnitude is at least 6.0 and it appears in at least 25% of layers and at least 6% of sequence positions.

MoE Models Results

| Model | Total Params | Active Params | Experts | Top-K | Routing | Outliers | Mean L% | Mean S% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OLMoE-1B-7B | 7B | 1.3B | 64 | 8 | Token-based (dropless) | 49 | 29.5% | 13.6% |
| GPT-OSS-20B | 21B | 3.6B | 128* | 4* | Standard top-k | 1,465 | 38.1% | 45.4% |
| Qwen3-30B | 30.5B | 3.3B | 128 | 8 | Standard top-k | 110 | 35.5% | 45.1% |
| Mixtral-8x7B | 47B | 12.9B | 8 | 2 | Standard top-k | 4,635 | 50.2% | 37.4% |
| Qwen3-Next-80B | 80B | 3.0B | 512+1 shared | 10 | Standard top-k | 504 | 57.5% | 35.1% |
| GLM-4.5-Air | 106B | 12.0B | 128+1 shared | 8 | Sigmoid gating (loss-free balance) | 459 | 63.3% | 45.7% |
| GPT-OSS-120B | 117B | 5.1B | 128 | 4 | Softmax-weighted top-k | 1,695 | 33.1% | 50.0% |

Note: Perplexity not reported for MoE models due to instruct-tuning and response format considerations.

Metric Definitions:

Outliers: raw count of outlier features detected. Mean L%: average percentage of layers an outlier feature appears in. Mean S%: average percentage of sequence positions it appears in.

Dense Models Results

| Model | Total Params | Perplexity | Outliers | Mean L% | Mean S% |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 0.6B | 22.28 | 9,212 | 32.7% | 66.5% |
| Qwen3-1.7B | 1.7B | 17.08 | 16,563 | 30.3% | 78.4% |
| Qwen3-4B | 4.0B | 13.59 | 1,042 | 34.8% | 68.8% |
| Qwen3-8B | 8.0B | 10.19 | 777 | 32.3% | 70.8% |
| Qwen3-14B | 14.0B | 8.97 | 985 | 31.4% | 79.0% |
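for reference, the perplexity column can be computed with a standard next-token cross-entropy; the exact evaluation text and windowing aren’t specified here, so treat this as an illustrative sketch rather than the precise setup:

```python
# sketch: standard causal-LM perplexity on a single text
import math
import torch

def perplexity(model, tokenizer, text: str, max_len: int = 2048) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len].to(model.device)
    with torch.no_grad():
        # passing labels=ids makes transformers return the mean next-token cross-entropy loss
        loss = model(input_ids=ids, labels=ids).loss
    return math.exp(loss.item())
```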

Model Precision Notes

| Model | Native Precision | Analysis Precision | Notes |
| --- | --- | --- | --- |
| OLMoE-1B-7B | float32 | float32 | Native precision |
| GPT-OSS-20B | MXFP4 (MoE weights) | MXFP4 | Native precision |
| Qwen3-30B | bfloat16 | bfloat16 | Native precision |
| Mixtral-8x7B | bfloat16 | bfloat16 | Native precision |
| Qwen3-Next-80B | bfloat16 | bfloat16 | Native precision |
| GLM-4.5-Air | bfloat16 | bfloat16 | Native precision |
| GPT-OSS-120B | MXFP4 (MoE weights) | MXFP4 | Native precision |
