New Attention vs MHA
Diffora Attention vs. Standard Multi-Head Attention

**New Attention** (0.16M params) vs. **Multi-Head Attention** (0.42M params)

🤖 Environment: Humanoid-v5 · 🎯 Algorithm: REINFORCE
- ⚡ **New Attention** — 0.16M parameters (2.6× fewer than MHA)
- 🧠 **Multi-Head Attention (MHA)** — 0.42M parameters (standard transformer attention)
- 📊 **Parameter efficiency** — 2.6×: New Attention uses 62% fewer parameters
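The headline ratio follows directly from the two parameter counts; a quick check of the arithmetic:

```python
# Verify the 2.6x / 62% figures from the 0.16M vs 0.42M parameter counts.
new_params, mha_total = 0.16e6, 0.42e6

ratio = mha_total / new_params            # how many times larger MHA is
reduction = 1 - new_params / mha_total    # fraction of parameters saved

print(round(ratio, 1))              # 2.6
print(round(reduction * 100))       # 62
```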
🏗️ Architecture Details

| Setting | New Attention Policy (novel) | MHA Policy (baseline) |
|---|---|---|
| action_dim | 17 | 17 |
| d_model | 256 | 256 |
| d_k | 256 | 256 |
| d_v | 256 | 256 |
| N (heads) | 1 | 1 |
| Parameters | 0.16M | 0.42M |
| Param ratio | 38% | 100% |
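As a rough sanity check on the table, the Q/K/V and output projections of a single-head attention block with these dimensions account for about 0.26M of the MHA policy's 0.42M parameters. This is a back-of-envelope sketch only: biases and the policy's remaining layers (e.g. an observation embedding and action head) are not listed in the report, so they are not counted here.

```python
# Back-of-envelope weight count for a single-head attention block using
# the d_model / d_k / d_v values from the table above. Biases and the
# surrounding policy layers are omitted (the report does not show them),
# so this explains only part of the 0.42M total.

def mha_param_count(d_model, d_k, d_v, heads=1):
    """Weights for the Q, K, V projections plus the output projection."""
    w_q = heads * d_model * d_k
    w_k = heads * d_model * d_k
    w_v = heads * d_model * d_v
    w_o = heads * d_v * d_model
    return w_q + w_k + w_v + w_o

print(mha_param_count(256, 256, 256))  # 262144, i.e. ~0.26M from attention alone
```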
📈 Training Performance

Chart panels (run: smooth-wave-1):

- Return — New Attention: reward per episode, steps 30–59
- Return — MHA: reward per episode, steps 0–29
- Loss — New Attention: policy loss, steps 30–59
- Loss — MHA: policy loss, steps 0–29
- Episode Count — New Attention: cumulative episodes, steps 30–59
- Episode Count — MHA: cumulative episodes, steps 0–29
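The charts track per-episode return and policy loss under REINFORCE training. A minimal sketch of the discounted-return computation that weights the log-probability gradient in REINFORCE; the discount factor `gamma` is an assumed hyperparameter, not stated in the report:

```python
# REINFORCE weights each log-prob gradient by the return-to-go G_t.
# gamma is illustrative; the report does not state hyperparameters.

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```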
🖥️ System Metrics

| Metric | Value |
|---|---|
| Network traffic (sent) | ~35 MB |
| Disk I/O (written) | ~15 GB |
| Disk utilization | ~21.21 GB (~19.7%) |
| Process memory available | ~10 GB |
| Process memory in use | ~1,750 MB (~12%) |
| System memory utilization | ~20% |
| Process CPU utilization | peak 100% |