1. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches

Code: https://github.com/PRIS-CV/PMG-Progressive-Multi-Granularity-Training

Paper: https://arxiv.org/abs/2003.03836

Abstract

Fine-grained visual classification is harder than conventional classification.

Existing approaches mainly focus on locating the most discriminative parts, more complementary parts, and parts of various granularities.

However, little work has examined which granularities are the most discriminative and how to fuse information across multiple granularities.

The paper proposes:

a novel progressive training strategy;
a simple jigsaw puzzle generator that forms images containing information at different granularity levels.

Approach

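The model first learns stable fine-grained information in the shallower layers and, as training progresses, gradually shifts its attention to the abstract, larger-granularity information in the deeper layers.

Training uses 4 steps, each with a different input:

Step 1: the most heavily shuffled input. The image is split into an $n \times n$ grid of patches with $n = 2^{L-(L-2)+1} = 8$, and the 2 later stages are frozen.

Step 2: the image is split with $n = 2^{L-(L-1)+1} = 4$, and the 1 later stage is frozen.

Step 3: the image is split with $n = 2^{L-L+1} = 2$.

Step 4: the original image is fed into the network, and the feature maps from the last three stages are concatenated and passed to the classifier.

Every step has its own loss, and all steps share the same label.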
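The four steps above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `training_schedule` is hypothetical, the values $L = 5$ and $S = 3$ are taken from the implementation details later in these notes, and writing the unshuffled image as $n = 1$ is a convention chosen here, not something the paper states.

```python
def training_schedule(num_stages, num_trained_stages):
    """List the progressive training steps as (step, stage, n) tuples.

    Stages L-S+1 .. L each get one step whose input is a jigsaw image with
    granularity n = 2 ** (L - l + 1). A final step uses the unshuffled image
    (written here as n = 1, an assumed convention) and trains the classifier
    on the concatenated features (stage is None).
    """
    L, S = num_stages, num_trained_stages
    steps = []
    for step, l in enumerate(range(L - S + 1, L + 1), start=1):
        steps.append((step, l, 2 ** (L - l + 1)))
    steps.append((S + 1, None, 1))
    return steps


for step, stage, n in training_schedule(5, 3):
    target = "concatenated features" if stage is None else f"stage {stage}"
    print(f"step {step}: {n} x {n} patches, loss on {target}")
```

With $L = 5$ and $S = 3$ this reproduces the values 8, 4, 2 listed for steps 1 to 3.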

Network Architecture

$F$: the backbone feature extractor, which has $L$ stages.

Output feature map of each stage: $F^l \in \mathbb{R}^{H_l \times W_l \times C_l}$, $l \in \{1, 2, 3, \dots, L\}$

Goal: impose a classification loss on the feature maps extracted at different intermediate stages.

$F^l$ is passed through a convolution block $H_{conv}^l$ and reduced to a vector representation $V^l = H_{conv}^l(F^l)$.
It is followed by a classification module $H_{class}^l$, consisting of two fully connected layers with BN and ELU, which predicts the class probability distribution $y^l = H_{class}^l(V^l)$.

For the last $S$ stages, i.e. $l = L, L-1, \dots, L-S+1$:

The $S$ vectors are concatenated into $V^{concat} = \mathrm{concat}[V^{L-S+1}, \dots, V^{L-1}, V^L]$.
The final classification is $y^{concat} = H_{class}^{concat}(V^{concat})$.

Progressive Training

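Train the low stages first, then progressively add new stages for training.

Shallow stages have a limited receptive field and limited representation ability, so they are forced to pick up local details first.

Compared with training the whole network directly, this progressive scheme allows the model to locate discriminative information **from local details to global structures**.

Two kinds of output:

the outputs from each stage
the output from the concatenated features

The loss is cross-entropy.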
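In the notation of these notes, the two cross-entropy terms can be written out as follows. The symbols $m$ (number of classes) and $y_i$ (the $i$-th entry of the one-hot ground-truth label $y$) are introduced here for illustration and do not appear elsewhere in the notes; $y^l_i$ is the predicted probability of class $i$ at stage $l$.

```latex
\mathcal{L}_{CE}(y^l, y) = -\sum_{i=1}^{m} y_i \log\left(y^l_i\right),
\qquad
\mathcal{L}_{CE}(y^{concat}, y) = -\sum_{i=1}^{m} y_i \log\left(y^{concat}_i\right)
```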

Jigsaw Puzzle Generator

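Input image: $d \in \mathbb{R}^{3 \times W \times H}$

The image is split into $n \times n$ patches, each of size $3 \times \frac{W}{n} \times \frac{H}{n}$; $W$ and $H$ should be integer multiples of $n$.

The patches are then randomly shuffled and merged into a new image $P(d, n)$. The patch granularity is controlled by the hyperparameter $n$.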
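The split-shuffle-merge operation $P(d, n)$ can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it handles a single channel stored as a list of rows (a real version would operate on a $3 \times W \times H$ tensor), and the function name `jigsaw_generator` and the optional `rng` parameter are assumptions made for this example.

```python
import random


def jigsaw_generator(image, n, rng=None):
    """Return a copy of `image` with its n x n grid of patches shuffled.

    `image` is a list of H rows, each a list of W pixel values
    (one channel only, for brevity). H and W must be multiples of n.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height % n or width % n:
        raise ValueError("H and W must be integer multiples of n")
    ph, pw = height // n, width // n  # patch height and width

    # order[dst] is the source patch that lands in destination slot dst.
    order = list(range(n * n))
    rng.shuffle(order)

    shuffled = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        dst_r, dst_c = divmod(dst, n)
        src_r, src_c = divmod(src, n)
        for dy in range(ph):
            src_row = image[src_r * ph + dy]
            dst_row = shuffled[dst_r * ph + dy]
            # Copy one row of the source patch into the destination slot.
            dst_row[dst_c * pw:(dst_c + 1) * pw] = src_row[src_c * pw:(src_c + 1) * pw]
    return shuffled
```

Pixels inside each patch keep their relative positions; only the patch locations change. With $n = 8$ on a 448×448 input, each patch would be 56×56 pixels.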

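$n$ must satisfy the following conditions:

the patch size must be smaller than the receptive field of the corresponding stage, otherwise performance suffers
the patch size should grow in proportion to the receptive field of the stages

In general, the receptive field of each stage is roughly twice that of the preceding stage. Therefore, for the $l$-th stage, $n = 2^{L-l+1}$.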

Inference

Input: the original image; the jigsaw puzzle generator is not needed (no shuffling at inference).

Predicting from $y^{concat}$ alone removes the computation of the FC layers attached to the other stages. In this case the final result is $C_1 = \arg\max(y^{concat})$.

Combining multiple outputs, however, gives better performance. In this case the final result is $C_2 = \arg\max\left(\sum_{l=L-S+1}^{L} y^l + y^{concat}\right)$.
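The two prediction rules can be compared on a toy example. The sketch below is illustrative only: the function names are hypothetical, outputs are plain Python lists of per-class scores, and the numbers are made up to show that $C_1$ and $C_2$ can disagree.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def predict_concat(y_concat):
    """C_1: use only the prediction from the concatenated features."""
    return argmax(y_concat)


def predict_combined(stage_outputs, y_concat):
    """C_2: sum the S per-stage predictions and y_concat, then take argmax."""
    summed = list(y_concat)
    for y_l in stage_outputs:
        for i, score in enumerate(y_l):
            summed[i] += score
    return argmax(summed)


# Toy scores for 3 classes and S = 3 stages (made-up values).
stage_outputs = [[0.2, 0.7, 0.1], [0.1, 0.6, 0.3], [0.3, 0.4, 0.3]]
y_concat = [0.5, 0.4, 0.1]
print(predict_concat(y_concat), predict_combined(stage_outputs, y_concat))
```

Here $y^{concat}$ alone favors class 0, while the summed scores favor class 1.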

4 Experiment Results and Discussion

Datasets: Caltech UCSD-Birds (CUB), Stanford Cars (CAR), FGVC-Aircraft (AIR)

4.1 Implementation Details

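Backbone: VGG16 and ResNet50

Number of stages: $L = 5$

$S = 3$, $\alpha = 1$, $\beta = 2$ (where do $\alpha$ and $\beta$ enter? These notes never define them). Experiments show that $S = 3$ works best.

Training: resize to 550×550, then randomly crop to 448×448, with horizontal flipping for data augmentation.

Testing: resize to 550×550, then center-crop to 448×448.

Optimizer: SGD

For the newly added conv and FC layers, the initial learning rate is 0.002, decayed with a cosine annealing schedule.

For the pre-trained convolutional layers, the learning rate is $\frac{1}{10}$ of that of the newly added layers.

Epochs = 300, batch size = 16, weight decay = 0.0005, momentum = 0.9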
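To make the learning-rate settings concrete, here is a small sketch of a cosine annealing schedule. The notes do not give the exact formula, so the standard half-cosine form without restarts is assumed here; the function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr):
    """Half-cosine decay from base_lr (epoch 0) towards 0 (epoch total_epochs).

    Assumed form: base_lr / 2 * (cos(pi * epoch / total_epochs) + 1).
    """
    return base_lr / 2 * (math.cos(math.pi * epoch / total_epochs) + 1)


NEW_LAYER_LR = 0.002  # initial rate for the newly added conv and FC layers

for epoch in (0, 150, 299):
    new_lr = cosine_annealed_lr(epoch, 300, NEW_LAYER_LR)
    pretrained_lr = new_lr / 10  # pre-trained layers use 1/10 of that rate
    print(f"epoch {epoch}: new layers {new_lr:.6f}, pre-trained {pretrained_lr:.6f}")
```

Under this assumed form, the rate for the new layers starts at 0.002, reaches 0.001 at the halfway point, and approaches 0 by the last epoch.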

4.2 Comparisons with State-of-the-Art Methods

Combined accuracy denotes the accuracy obtained by combining the four outputs.

4.3 Ablation Study

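Dataset: CUB

Backbone: ResNet50, evaluated without and with the jigsaw puzzle generator.

A larger $S$ is not always better: once $S$ goes beyond 4, the patches become too small to retain meaningful information, and the model gets worse.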

4.4 Visualization

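The baseline model attends to only one or two parts, and only at the last stage, whereas the authors' model covers the whole object. This is because the jigsaw puzzle generator forces the network to learn more discriminative parts at different granularity levels.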
