Focus Longer to See Better: Recursively Refined Attention for Fine-Grained Image Classification

code: https://github.com/TAMU-VITA/Focus-Longer-to-See-Better
paper: https://arxiv.org/abs/2005.10979

Abstract

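Marginal visual differences between classes make fine-grained classification difficult.

Based on this observation, the authors focus on these marginal differences to extract more representative features.
In addition, visualization is used to verify how the model's focus changes from coarse to fine details.
A simple attention model aggregates (by weighting) the most discriminative parts of the image.
Because the model is simple, it works as an easy plug-n-play module.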

Advantages:

Interpretable.
Accuracy improves by up to 2% over the baseline model.

3. Our Proposal

image $I$; label $c$

A two-stream feature extractor is used to extract global and object-level feature representations to boost the classification accuracy.

3.1. Two-Stream Architecture

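For each image $I$, a set of patches $P$ is obtained, and one patch $P_i$ is selected from it at random (the patches $P$ are obtained with the method of a referenced paper). $P_i$ is represented by a pair of coordinates $[(x_i^{tl}, y_i^{tl}), (x_i^{br}, y_i^{br})]$, where $tl$ and $br$ denote the top-left and bottom-right corners.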
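As a minimal sketch of the patch selection described above, the snippet below picks one patch at random and crops it from an image. The helper names `crop_patch` and `sample_patch`, the nested-list image layout, and the choice to treat the bottom-right corner as exclusive are all assumptions made for illustration; the paper's repository may handle these details differently.

```python
import random

def crop_patch(image, patch):
    """Crop a patch from an image stored as a list of rows.

    `patch` is ((x_tl, y_tl), (x_br, y_br)). The bottom-right corner is
    treated as exclusive here, which is an assumption of this sketch.
    """
    (x_tl, y_tl), (x_br, y_br) = patch
    return [row[x_tl:x_br] for row in image[y_tl:y_br]]

def sample_patch(patches, rng=random):
    """Pick one patch P_i uniformly at random from the set P."""
    return rng.choice(patches)

# Toy 4x4 "image" whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(4)] for r in range(4)]
patches = [((0, 0), (2, 2)), ((1, 1), (4, 3))]

patch = sample_patch(patches, random.Random(0))  # seeded for reproducibility
crop = crop_patch(image, patch)
```

Both the full image and the cropped region would then be passed to their respective streams.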

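The network input is formed from the image $I$ and the patch $P_i$.

The top stream is an ordinary classification network: the original image goes through a CNN to extract features, which are then passed to a classification layer and softmax.
The second stream feeds the patch through a CNN to extract features, then through LSTMs to obtain progressively refined feature representations, which are combined by weighted fusion into a single, most discriminative feature representation.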

Global Stream

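This corresponds to the top stream. The CNN module is a model pretrained on ImageNet. The feature extracted from the image by the CNN is written $W_g * I$, where $W_g$ denotes the weights of the entire network and $*$ stands for all of the convolution, pooling, and non-linear activation operations. The feature is then passed to softmax to obtain the probability of each class:

$$G_I = F(W_g * I)$$

$G_I$ is the global representation of the image, and $F(\cdot)$ denotes global average pooling followed by softmax.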
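To make the role of $F(\cdot)$ concrete, here is a small sketch of global average pooling followed by a classification layer and softmax, applied to a toy feature map. The function names, the nested-list layout `(h, w, c)`, and the explicit linear layer with `weights` and `bias` are assumptions for illustration; the notes only state that pooling and softmax are applied after the CNN.

```python
import math

def global_avg_pool(feature_map):
    """Average a feature map of shape (h, w, c) over its spatial dimensions."""
    h = len(feature_map)
    w = len(feature_map[0])
    c = len(feature_map[0][0])
    pooled = [0.0] * c
    for row in feature_map:
        for pixel in row:
            for k, value in enumerate(pixel):
                pooled[k] += value
    return [s / (h * w) for s in pooled]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def global_head(feature_map, weights, bias):
    """Hypothetical F(.): pooling, a linear classification layer, then softmax."""
    pooled = global_avg_pool(feature_map)
    logits = [sum(w_k * p_k for w_k, p_k in zip(w_row, pooled)) + b
              for w_row, b in zip(weights, bias)]
    return softmax(logits)

# Toy 2x2 feature map with 2 channels and a 3-class linear layer.
fmap = [[[1.0, 0.0], [3.0, 2.0]],
        [[1.0, 2.0], [3.0, 4.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
probs = global_head(fmap, weights, bias)
```

In a real model the feature map would come from the pretrained CNN rather than a hand-written list.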

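Roles of the global stream:

It provides global information, because the patch fed to the network only yields object-level information focused on the object itself.
It provides a baseline, so the contribution of the local stream can be verified once that stream is added.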

Local Stream

Weakly supervised patches: $P = [P_1, P_2, P_3, \dots, P_n]$

The set of cropped image regions shown in Figure 2 can be written as $I(P) = I(P_1), I(P_2), I(P_3), \dots, I(P_n)$.

The $i$-th patch is fed into a pretrained CNN:

$$F_i = (W_g * I(P_i))$$ with size $w \times h \times c$

$W_g$ denotes all the parameters of the CNN, and $*$ stands for the convolution, pooling, and non-linear activation operations.

Note that the two CNN modules do not share parameters.

The outputs of the stacked LSTMs over all time steps are written $[\phi(F_i^1), \phi(F_i^2), \dots, \phi(F_i^T)]$, $t = 1, 2, 3, \dots, T$, with $\phi(F_i^t) \in \mathbb{R}^D$.

Experiment 4.2 verifies the hypothesis about how the features change over the time steps to attend to finer details of the part.

The finer details obtained from the LSTMs are fed into the attention mechanism, whose output is denoted $A_i$; its parameters are $W^t \in \mathbb{R}^D$.
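The exact expression for $A_i$ is not reproduced in these notes, so the following is only a sketch of one common way to realise such a weighted fusion, not the paper's formula. It assumes that each time step has a score given by the dot product of $W^t$ with $\phi(F_i^t)$, that the scores are normalised over $t$ with softmax, and that $A_i$ is the resulting weighted sum. The function name `attention_pool` is hypothetical.

```python
import math

def attention_pool(step_features, step_weights):
    """Softmax-weighted sum of T feature vectors of dimension D.

    ASSUMPTION: score_t = <W^t, phi(F_i^t)>, normalised over t with softmax.
    This is an illustrative form, not taken from the paper.
    """
    scores = [sum(w * f for w, f in zip(w_t, f_t))
              for w_t, f_t in zip(step_weights, step_features)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(step_features[0])
    pooled = [sum(a * f_t[d] for a, f_t in zip(alphas, step_features))
              for d in range(dim)]
    return pooled, alphas

# T = 3 time steps, D = 2 features per step.
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero weights give uniform attention
A_i, alphas = attention_pool(phi, W)
```

With learned, non-zero `W`, the weights `alphas` would indicate which time steps contribute most to the fused representation.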