Paper ID: 2310.10845

CoTFormer: More Tokens With Attention Make Up For Less Depth

Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi

The race to continually develop ever larger and deeper foundational models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this work, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve capacity comparable to a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, as they significantly outperform larger standard transformers.
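The abstract describes replacing extra depth with repeated passes whose intermediate representations can be attended to, analogous to attending to chain-of-thought tokens. Below is a minimal, illustrative sketch of that general idea, not the authors' released architecture or code: the class name `CoTFormerSketch`, the hyperparameters (`n_repeats`, `d_model`, etc.), and the interleaving scheme are all assumptions made for the example; causal masking and positional encodings are omitted for brevity.

```python
# Hedged sketch of an "implicit CoT" transformer: a weight-tied block is applied
# several times, and each pass attends to both the original embeddings and the
# previous pass's intermediate token representations (more tokens with attention
# standing in for more unique layers). Not the paper's exact method.
import torch
import torch.nn as nn


class CoTFormerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_repeats=2, vocab_size=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        # A small stack of layers reused (weight-tied) across repeats, so repeated
        # application adds attended tokens rather than new parameters.
        self.block = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.n_repeats = n_repeats
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        x = self.embed(input_ids)            # (batch, seq, d_model)
        current = x
        for _ in range(self.n_repeats):
            # Interleave the original embeddings with the latest intermediate
            # representations so the block can attend to both, loosely mimicking
            # attending to generated chain-of-thought tokens.
            batch, seq, d = x.shape
            interleaved = torch.stack([x, current], dim=2).reshape(batch, 2 * seq, d)
            out = self.block(interleaved)
            current = out[:, 1::2, :]        # keep the "intermediate thought" positions
        return self.lm_head(current)         # (batch, seq, vocab_size)


# Usage: token ids of shape (batch, seq) -> next-token logits (batch, seq, vocab).
logits = CoTFormerSketch()(torch.randint(0, 32000, (2, 16)))
```

The design choice illustrated here is the trade-off named in the title: each repeat roughly doubles the number of tokens the block attends over instead of adding new layers, which is how a shallower weight-tied model can approach the capacity of a deeper one.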

Submitted: Oct 16, 2023