Paper ID: 2303.04244

A Light-Weight Contrastive Approach for Aligning Human Pose Sequences

Robert T. Collins

We present a simple unsupervised method for learning an encoder mapping short 3D pose sequences into embedding vectors suitable for sequence-to-sequence alignment by dynamic time warping. Training samples consist of temporal windows of frames containing 3D body points such as mocap markers or skeleton joints. A light-weight, 3-layer encoder is trained using a contrastive loss function that encourages embedding vectors of augmented sample pairs to have cosine similarity 1, and similarity 0 with all other samples in a minibatch. When multiple scripted training sequences are available, temporal alignments inferred from an initial round of training are harvested to extract additional, cross-performance match pairs for a second phase of training to refine the encoder. In addition to being simple, the proposed method is fast to train, making it easy to adapt to new data using different marker sets or skeletal joint layouts. Experimental results illustrate ease of use, transferability, and utility of the learned embeddings for comparing and analyzing human behavior sequences.
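
The two computational steps named in the abstract, a contrastive objective over augmented pairs and alignment by dynamic time warping, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names (`l2_normalize`, `identity_target_loss`, `dtw_align`) are hypothetical. The squared-error form of the loss is an assumption: it is one plausible reading of "cosine similarity 1 for augmented pairs, 0 for all other samples in a minibatch". The use of cosine distance as the DTW frame cost is likewise an assumption. The encoder itself is omitted; the functions operate on embedding vectors that are already computed.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def identity_target_loss(za, zb):
    """Assumed loss form: push the cosine-similarity matrix toward identity.

    za[i] and zb[i] are embeddings of two augmentations of sample i, so
    diagonal entries should approach 1 and off-diagonal entries 0.
    """
    sim = l2_normalize(za) @ l2_normalize(zb).T
    return float(np.mean((sim - np.eye(len(za))) ** 2))


def dtw_align(ea, eb):
    """Classic DTW over cosine distance (assumed cost) between two
    embedding sequences. Returns (total cost, list of (i, j) matches)."""
    cost = 1.0 - l2_normalize(ea) @ l2_normalize(eb).T
    n, m = cost.shape

    # acc[i, j] = cheapest cost of aligning the first i frames of ea
    # with the first j frames of eb.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to the first to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return float(acc[n, m]), path[::-1]
```

As a usage example, calling `dtw_align(np.eye(3), np.eye(3)[[0, 0, 1, 2]])` aligns a three-frame sequence with a copy whose first frame is repeated. Matched index pairs of this kind are what the abstract describes harvesting across performances as additional training pairs for the second phase.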

Submitted: Mar 7, 2023