Abstract: To address the problems of inaccurate and underutilized prior information and incompletely recovered text edges in scene text image super-resolution, a scene text image super-resolution method guided by text semantics is proposed. The network consists of a super-resolution reconstruction module and a text semantic-aware module. To further improve the representational capacity of the super-resolution network, a recurrent criss-cross attention mechanism is used to capture global contextual information, making the model focus more on text regions during training. In addition, to generate sharp edges, a soft-edge loss and a gradient loss are proposed to constrain the reconstruction process. The performance of the proposed model is evaluated on the public scene text image super-resolution dataset TextZoom against eight mainstream deep network models. Compared with TSRN, the proposed model improves average recognition accuracy by 2.06%, 1.80%, and 2.89% with three different recognizers, respectively, and it also shows advantages in PSNR and SSIM.
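
The abstract does not give the exact formulations of the soft-edge and gradient losses. As a rough illustrative sketch only, the code below assumes an L1 penalty between Sobel gradient-magnitude maps for the gradient loss, a blurred ("soft") gradient map as a stand-in for the soft-edge map, and hypothetical weights alpha and beta; the function names and all specifics are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the two edge-oriented losses named in the abstract.
_SOBEL_X = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def _gradients(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude of an (N, C, H, W) image."""
    c = img.size(1)
    kx = _SOBEL_X.to(img).repeat(c, 1, 1, 1)
    ky = _SOBEL_Y.to(img).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def gradient_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """L1 distance between gradient-magnitude maps of SR and HR images (assumed form)."""
    return F.l1_loss(_gradients(sr), _gradients(hr))

def soft_edge_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """L1 distance between smoothed ("soft") edge maps (assumed form)."""
    blur = lambda x: F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
    return F.l1_loss(blur(_gradients(sr)), blur(_gradients(hr)))

def total_loss(sr, hr, alpha=0.1, beta=0.1):
    """Pixel loss plus the two edge constraints; alpha and beta are hypothetical weights."""
    return F.l1_loss(sr, hr) + alpha * soft_edge_loss(sr, hr) + beta * gradient_loss(sr, hr)
```

The point of such constraints is that a pixel-wise loss alone tends to average over stroke boundaries; penalizing discrepancies in edge and gradient maps pushes the reconstruction toward sharper character contours.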