Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition [2111.12351]