ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

Transactions on Machine Learning Research (TMLR), 2026
Queen's University, School of Computing

Abstract

Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition. However, existing methods rely on data augmentations that predominantly mask high-motion frames and high-degree joints, yielding biased and incomplete feature representations. We propose ASMa, an asymmetric combination of masking strategies that captures the full spectrum of spatio-temporal dynamics in human actions. ASMa employs two complementary masking strategies: one selectively masks high-degree joints and low-motion frames, while the other masks low-degree joints and high-motion frames. Together, these ensure balanced and comprehensive skeleton representation learning.

We introduce a learnable feature alignment module to effectively align representations learned from both masked views. To facilitate deployment in resource-constrained settings, we compress the learned representation into a lightweight model using knowledge distillation. Extensive experiments demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7–4.4% in fine-tuning and up to 5.9% in transfer learning. Our distilled model achieves 91.4% parameter reduction and 3× faster inference on edge devices while maintaining competitive accuracy.

Motivation

Key Observation: Low-degree joints (e.g., hands, feet) exhibit high motion across actions, while high-degree joints (e.g., spine) act as structural stabilizers with minimal motion. A balanced approach should incorporate both types of motion dynamics.

Method Overview

ASMa Framework

ASMa Framework: (a) Asymmetric masking strategies for joints and frames. (b) Three-stream architecture with anchor, spatial, and temporal streams. (c) Encoders trained with Barlow Twins loss. (d) Feature alignment module for downstream tasks. (e) Knowledge distillation for efficient deployment.
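As a concrete reference for step (c), the sketch below implements the Barlow Twins objective used to train the two encoders. This is a minimal numpy version; the redundancy weight `lamb` and the batch-standardization details are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lamb=5e-3, eps=1e-9):
    """Barlow Twins objective on two batches of view embeddings.

    z_a, z_b: (N, D) embeddings of the same samples under the two masked views.
    """
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    n, _ = z_a.shape
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix between views
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # invariance: diag -> 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy: off-diag -> 0
    return on_diag + lamb * off_diag
```

Driving the diagonal toward 1 makes the two views agree per dimension, while the off-diagonal penalty decorrelates dimensions so the embedding does not collapse.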

Key Components

1. Asymmetric Spatio-temporal Masking

Two encoders learn complementary representations:

  • Encoder fθ: Masks low-degree joints (hands, feet) + high-motion frames
  • Encoder fφ: Masks high-degree joints (spine, torso) + low-motion frames

Why asymmetric? Different body parts exhibit different motion patterns. By training separate encoders on complementary views, the model learns a more complete representation of human actions.
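A minimal sketch of how the two complementary views could be constructed, assuming a toy 5-joint skeleton (one spine hub connected to four limb leaves). The degree/motion heuristics and the masking ratios here are illustrative stand-ins for the paper's procedure.

```python
import numpy as np

# Toy 5-joint skeleton: hub joint 0 (spine) connects to four leaves (head/hands/feet).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]
NUM_JOINTS = 5

def joint_degrees(edges, num_joints):
    """Degree of each joint in the skeleton graph."""
    deg = np.zeros(num_joints, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def frame_motion(x):
    """Per-frame motion: mean joint displacement between consecutive frames.

    x: (T, J, C) sequence of J joints with C coordinates.
    """
    vel = np.linalg.norm(np.diff(x, axis=0), axis=-1)  # (T-1, J)
    return np.concatenate([[0.0], vel.mean(axis=1)])   # pad first frame with 0

def asymmetric_masks(x, edges, joint_ratio=0.4, frame_ratio=0.4):
    """Return (joints, frames) to mask for each of the two encoders."""
    T, J, _ = x.shape
    deg, mot = joint_degrees(edges, J), frame_motion(x)
    kj, kf = max(1, int(joint_ratio * J)), max(1, int(frame_ratio * T))
    # View for f_theta: mask low-degree joints and high-motion frames.
    theta = (np.argsort(deg)[:kj], np.argsort(mot)[-kf:])
    # View for f_phi: mask high-degree joints and low-motion frames.
    phi = (np.argsort(deg)[-kj:], np.argsort(mot)[:kf])
    return theta, phi
```

Each encoder is thus forced to reconstruct its view from the signal the other view keeps, covering both the mobile extremities and the structural core.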

2. Feature Alignment Module

A bi-directional cross-attention mechanism fuses the diverse representations from both encoders for downstream classification tasks.

Key insight: Rather than simply averaging features, cross-attention allows each encoder to selectively attend to relevant information from the other, creating a richer combined representation.
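To make the fusion concrete, here is a single-head numpy sketch of bi-directional cross-attention. The random projections stand in for learned weights, and sharing one set of projections across both directions is a simplifying assumption, not the paper's design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats, wq, wk, wv):
    """Single-head cross-attention: queries from one stream, keys/values from the other."""
    q, k, v = q_feats @ wq, kv_feats @ wk, kv_feats @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Tq, Tkv) attention weights
    return attn @ v

def align_features(za, zb, rng):
    """Fuse the two encoders' features in both directions and concatenate."""
    d = za.shape[-1]
    # Hypothetical random projections standing in for learned parameters.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    a2b = cross_attend(za, zb, wq, wk, wv)  # f_theta features attend to f_phi
    b2a = cross_attend(zb, za, wq, wk, wv)  # f_phi features attend to f_theta
    return np.concatenate([a2b, b2a], axis=-1)
```

Unlike averaging, the attention weights let each stream pull in only the parts of the other stream that are relevant to it.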

3. Knowledge Distillation

A lightweight student model learns from the combined teacher representation, achieving a 91.4% parameter reduction with minimal accuracy drop.

Practical benefit: Enables deployment on resource-constrained devices like Raspberry Pi with 3× faster inference while maintaining competitive accuracy.
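A minimal sketch of the temperature-scaled soft-target distillation loss (Hinton-style KL divergence). The default `tau=8.0` echoes the τ = 8–9 range from the ablations; treating this term as the whole objective is a simplification, since a full training loss would typically also include a supervised term.

```python
import numpy as np

def softmax(x, t=1.0):
    z = x / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, tau=8.0):
    """Soft-target KL divergence, scaled by tau^2 so gradients stay
    comparable across temperatures."""
    p_t = softmax(teacher_logits, tau)  # softened teacher distribution
    p_s = softmax(student_logits, tau)  # softened student distribution
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean()
    return tau ** 2 * kl
```

The high temperature flattens both distributions, so the student is supervised by the teacher's full similarity structure over classes rather than only its argmax.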

💡 Key Takeaway

ASMa demonstrates that what you mask matters. By strategically masking complementary spatio-temporal patterns, we achieve better representations than random or uniform masking strategies.

Datasets

We evaluate ASMa on three standard skeleton action recognition benchmarks:

NTU RGB+D 60

  • 56,578 sequences
  • 60 action classes
  • 40 subjects
  • Evaluated on cross-subject (xsub) and cross-view (xview) splits

NTU RGB+D 120

  • 113,945 sequences
  • 120 action classes
  • 106 subjects
  • Evaluated on cross-subject (xsub) and cross-setup (xset) splits

PKU-MMD

  • 28,443 sequences
  • 51 action classes
  • Part I: Clean data for pretraining
  • Part II: Noisy data for transfer learning evaluation

Results

Main Results

| Method | NTU-60 xsub (Lin. / FT.) | NTU-60 xview (Lin. / FT.) | NTU-120 xsub (Lin. / FT.) | NTU-120 xset (Lin. / FT.) |
|---|---|---|---|---|
| 3s-PSTL | 79.1 / 87.1 | 83.8 / 93.9 | 69.2 / 81.3 | 70.3 / 82.6 |
| SCD-Net | 86.6 / - | 91.7 / - | 76.9 / - | 80.1 / - |
| STJD-CL | - / 89.3 | - / 94.8 | - / 83.5 | - / 86.8 |
| 3s-ASMa (Ours) | 87.3 / 92.0 | 91.9 / 96.8 | 80.1 / 87.9 | 81.0 / 88.8 |
| 3s-ASMa-Distill | - / 91.7 | - / 95.9 | - / 86.9 | - / 88.3 |

Top-1 accuracy (%); Lin. = linear evaluation, FT. = fine-tuning.

Edge Device Performance

| Model | Params (M) | FLOPs (G) | Time (ms) | Memory (MB) | FPS |
|---|---|---|---|---|---|
| 3s-ASMa (Teacher) | 6.3 | 4.5 | 59.16 | 314.47 | 16.90 |
| 3s-ASMa-Distill | 0.54 | 1.26 | 21.38 | 173.28 | 46.52 |
| Improvement | ↓ 91.4% | ↓ 72% | ↓ 63.8% | ↓ 44.8% | ↑ 175.3% |

Performance on Raspberry Pi 4B (2GB RAM, CPU only)
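Latency and FPS figures like those above can be reproduced with a small timing harness; this sketch uses only the standard library, and `model_fn` is a hypothetical stand-in for a loaded forward pass.

```python
import time

def benchmark(model_fn, sample, warmup=5, iters=50):
    """Measure average per-inference latency and throughput on CPU.

    model_fn: callable running one forward pass; sample: one input.
    """
    for _ in range(warmup):       # warm-up runs let caches/allocators settle
        model_fn(sample)
    t0 = time.perf_counter()
    for _ in range(iters):
        model_fn(sample)
    elapsed = time.perf_counter() - t0
    ms = 1000.0 * elapsed / iters
    return {"time_ms": ms, "fps": 1000.0 / ms}
```

Averaging over many iterations after a warm-up phase matters on small devices like the Raspberry Pi, where first-call overhead and frequency scaling can dominate a single measurement.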

Transfer Learning to Noisy Datasets

| Method | NTU-60 → PKU-II | NTU-120 → PKU-II | PKU-I → PKU-II |
|---|---|---|---|
| SkeletonMAE | 58.4 | 61.0 | 62.5 |
| S-JEPA | 71.4 | 74.2 | 70.9 |
| 3s-PSTL | 72.4 | 70.1 | 69.1 |
| 3s-ASMa (Ours) | 77.2 | 77.0 | 76.8 |

Key Findings

🎯 Asymmetric Masking Works

Complementary masking strategies (high-degree + low-motion vs. low-degree + high-motion) consistently outperform symmetric or random masking by 2–4%.

🔄 Feature Alignment Matters

Cross-attention-based feature alignment improves accuracy by 1–2% over simple averaging, showing the importance of adaptive fusion.

📱 Efficient Deployment

Knowledge distillation achieves a 91.4% parameter reduction and a 3× speedup with only a 0.3–1% accuracy drop, enabling real-time edge inference.

🎓 Linear-Probed Distillation

Students distilled from linear-probed teachers outperform their teachers by 5.2%, producing more compact and generalizable representations.

🌐 Strong Transfer Learning

ASMa achieves a 4–6% improvement on noisy datasets (PKU-MMD Part II), demonstrating superior generalization to challenging real-world scenarios.

🔧 General Augmentation Strategy

ASMa masking improves existing SSL methods (MS²L, AimCLR, CrossCLR) by 1–2.3%, showing it works as a plug-and-play augmentation.

Ablation Studies

Masking Strategy Combinations


Asymmetric pairings (high-degree + low-motion, low-degree + high-motion) consistently outperform symmetric or random masking.

Distillation Sensitivity

Left: Temperature sensitivity. Right: Model depth sensitivity. A 5-layer ST-GCN student with τ = 8–9 provides the best trade-off.

Linear-Probed vs Fine-tuned Distillation

t-SNE Visualization

The student distilled from a linear-probed teacher forms more compact clusters and outperforms its teacher by 5.2% on average.

Citation


@article{anand2026asma,
  title={{ASM}a: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning},
  author={Aman Anand and Amir Eskandari and Elyas Rashno and Farhana Zulkernine},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=kIFo1q3VMS}
}