However, due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps (convolutional layer activations) in a CNN are not actually invariant to large transformations of the input data [6, 22]. This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data.

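Problem statement: because max-pooling has small spatial support, spatial invariance is only realised over a deep hierarchy of pooling and convolution layers. The intermediate feature maps are therefore not actually invariant to large transformations of the input data.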
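A minimal NumPy sketch of this limitation, assuming non-overlapping 2 × 2 max-pooling on a toy 8 × 8 image with a single bright pixel. The helper `max_pool_2x2` and the toy image are illustrative and not from the paper. A shift that stays inside one pooling window leaves the pooled map unchanged, while a larger shift does not.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2D array with even height and width."""
    h, w = x.shape
    # Split each axis into (block index, position inside block), then reduce inside blocks.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy input: one bright pixel at row 2, column 2.
img = np.zeros((8, 8))
img[2, 2] = 1.0

# Shift by 1 column: the pixel moves to column 3, still in the same 2x2 window.
shift_small = np.roll(img, 1, axis=1)
# Shift by 3 columns: the pixel moves to column 5, which is a different window.
shift_large = np.roll(img, 3, axis=1)

same_small = np.array_equal(max_pool_2x2(img), max_pool_2x2(shift_small))
same_large = np.array_equal(max_pool_2x2(img), max_pool_2x2(shift_large))
print(same_small, same_large)
```

A single pooling layer absorbs only shifts that stay within its window, which is why invariance to larger transformations needs many stacked layers.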

In this work we introduce a Spatial Transformer module, that can be included into a standard neural network architecture to provide spatial transformation capabilities. The action of the spatial transformer is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question (without extra supervision). Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample.

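The spatial transformer gives a network the ability to perform spatial transformations, and it learns during training to act appropriately for each data sample. Whereas a pooling layer's receptive fields are fixed and local, the spatial transformer is a dynamic mechanism that produces an appropriate transformation for each input sample.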

1. Spatial Transformers


The spatial transformer mechanism is split into three parts, shown in Fig. 2.

localisation network, grid generator, sampler

First, a localisation network (Sect. 3.1) takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map; this gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output.

The localisation network passes the feature map through its hidden layers and outputs the parameters of the spatial transformation. The predicted transformation parameters are then used to generate the sampling grid.
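A sketch of how predicted parameters could become a sampling grid, in NumPy. It assumes a 2D affine transformation with six parameters (a 2 × 3 matrix) and output coordinates normalised to [-1, 1]. These assumptions and the name `affine_sampling_grid` are illustrative choices, not something these notes specify.

```python
import numpy as np

def affine_sampling_grid(theta, out_h, out_w):
    """Map each output pixel to the input location it should be sampled from.

    theta: six affine parameters, interpreted as a 2x3 matrix (assumption).
    Returns an array of shape (out_h, out_w, 2) holding (x_source, y_source).
    """
    theta = np.asarray(theta, dtype=float).reshape(2, 3)
    # Regular grid of target coordinates, normalised to [-1, 1] on both axes.
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, out_h),
        np.linspace(-1.0, 1.0, out_w),
        indexing="ij",
    )
    # Homogeneous target coordinates (x_t, y_t, 1) for every output pixel.
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (out_h, out_w, 3)
    # Apply the affine map to every grid point at once.
    return target @ theta.T  # (out_h, out_w, 2)

# Identity parameters reproduce the regular grid.
identity_grid = affine_sampling_grid([1, 0, 0, 0, 1, 0], 3, 3)
# Scaling by 0.5 samples only the central half of the input (a zoom-in).
zoom_grid = affine_sampling_grid([0.5, 0, 0, 0, 0.5, 0], 3, 3)
```

The grid only says where to read from. Producing the output map still requires the sampler, which interpolates the input at these points.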

1.1 Localisation Network

The localisation network function f_loc() can take any form, such as a fully-connected network or a convolutional network, but should include a final regression layer to produce the transformation parameters θ.

The localisation network function f_loc() can take any form: a convolutional or a fully-connected network both work, as long as it ends in a final regression layer that produces the transformation parameters θ.
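One possible shape for f_loc(), sketched in NumPy as a forward pass only. It assumes a fully-connected network with a single hidden layer and a six-parameter affine θ. The class name `TinyLocalisationNet`, the layer sizes, and the identity initialisation of the regression layer are illustrative assumptions, not details given in these notes.

```python
import numpy as np

class TinyLocalisationNet:
    """Hypothetical f_loc: flatten -> hidden ReLU layer -> regression layer giving theta."""

    def __init__(self, in_features, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.01, size=(in_features, hidden))
        self.b1 = np.zeros(hidden)
        # Final regression layer: zero weights and an identity bias (assumption),
        # so the untrained network predicts the identity affine transform.
        self.w2 = np.zeros((hidden, 6))
        self.b2 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])

    def __call__(self, feature_map):
        x = np.asarray(feature_map, dtype=float)
        x = x.reshape(x.shape[0], -1)                 # (N, in_features)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)    # hidden layer with ReLU
        return h @ self.w2 + self.b2                  # (N, 6): one theta per sample

# A batch of four 28x28 single-channel feature maps (random toy data).
U = np.random.default_rng(1).normal(size=(4, 28, 28, 1))
f_loc = TinyLocalisationNet(in_features=28 * 28)
theta = f_loc(U)
```

Each input sample gets its own row of θ, which is what makes the transformation conditional on the input. In this sketch, training would move the regression layer away from the identity start.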

1.2 Spatial Transformer Networks