[Paper Review] Deformable DETR: Deformable Transformers for End-to-End Object Detection


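Problems with the original DETR
1. Convergence is slow (training takes a long time).
2. Performance on small objects is very low.
Deformable DETR is a model that addresses both problems.
It takes the idea of Deformable Convolution and applies it to the attention structure.
Instead of attending to every pixel, each query predicts sampling-point offsets and attends only to the K points sampled around its reference point. The attention module can therefore process large image feature maps at low cost, which shortens training.
Because it considers multi-scale feature maps, it gets an effect similar to FPN and improves performance on small objects.
In short, the paper points out the limitations of DETR and resolves them.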

Abstract

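Problems with DETR
1. It needs a very large number of training epochs to converge.
  → Attention weights are nearly uniform at initialization, and it takes a long time to learn to focus them on meaningful locations. (With 160 keys, every weight starts near 1/160, so the gradients are small at the start as well.)
2. Performance on small objects is very low.
Deformable DETR detects small objects on high-resolution, multi-scale feature maps, which DETR cannot use because the complexity of its encoder self-attention grows quadratically with the number of pixels.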
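
The 1/160 figure and the complexity argument above can be checked with a few lines of plain Python. This is only a sketch: the feature-map sizes are made up for illustration, and `dense_attention_stats` is a hypothetical helper name, not something from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention_stats(height, width):
    """Stats for dense (DETR-style) self-attention over an H x W feature map.

    Assumption: all attention logits are equal at initialization,
    so every key gets the same weight.
    """
    num_keys = height * width
    initial_weight = softmax([0.0] * num_keys)[0]  # equals 1 / num_keys
    query_key_pairs = num_keys * num_keys          # grows quadratically
    return num_keys, initial_weight, query_key_pairs

# Hypothetical feature-map sizes; each step doubles the resolution.
for height, width in [(10, 16), (20, 32), (40, 64)]:
    keys, weight, pairs = dense_attention_stats(height, width)
    print(f"{height}x{width}: keys={keys}, "
          f"initial weight={weight:.6f}, query-key pairs={pairs}")
```

Doubling the resolution multiplies the number of keys by 4 and the number of query-key pairs by 16, while the initial weight per key shrinks by 4.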

The paper proposes a deformable attention module to replace multi-head attention.
Like deformable convolution, it chooses sampling locations and attends only to them.
It aggregates multi-scale features without FPN.

Deformable Attention Module
- Each query attends only to a small set of keys, sampled at points around its reference point.
- Notation
  q: query element index
  Pq: reference point of query element q
  k: sampling point index
  delta_P_mqk: offset added to the reference point
  m: attention head index
  Zq: content feature of query element q
  x: input feature map
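
To make the notation concrete, here is a minimal sketch of one attention head for one query. It assumes the offsets and attention logits are already given (in the paper both are predicted from the query feature by linear projections) and leaves out the value and output projections, so it is an illustration and not the authors' implementation. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_sample(x, px, py):
    """Sample feature map x (H x W x C nested lists) at fractional pixel (px, py).

    Assumption of this sketch: corners outside the map contribute zero.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0
    out = [0.0] * channels
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= xi < width and 0 <= yi < height:
                for c in range(channels):
                    out[c] += wx * wy * x[yi][xi][c]
    return out

def deform_attn_head(x, p_q, offsets, attn_logits):
    """One head, one query: weighted sum of K features sampled around p_q.

    p_q:         reference point (px, py) in pixel coordinates
    offsets:     K offsets (dx, dy) added to the reference point
    attn_logits: K unnormalized attention scores, softmaxed over the K points
    """
    weights = softmax(attn_logits)
    channels = len(x[0][0])
    out = [0.0] * channels
    for (dx, dy), a in zip(offsets, weights):
        sampled = bilinear_sample(x, p_q[0] + dx, p_q[1] + dy)
        for c in range(channels):
            out[c] += a * sampled[c]
    return out

# Toy 4x4 single-channel map and K = 4 sampling points (all values made up).
x = [[[float(row * 4 + col)] for col in range(4)] for row in range(4)]
offsets = [(0.5, 0.0), (0.0, 0.5), (-1.0, 0.0), (1.0, 1.0)]
print(deform_attn_head(x, (1.0, 1.0), offsets, [0.0, 0.0, 0.0, 0.0]))
```

The cost depends on K, not on the size of the feature map, which is the reason the module stays cheap on large inputs.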
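Multi-scale Deformable Attention Module
- It takes multi-scale feature maps from several stages of the backbone network, the same inputs an FPN would use.
- The single-scale module has each attention head sample K points from one feature map; here each attention head samples K points from every feature level, and the attention weights are normalized over all of those points.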
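
A rough sketch of the multi-scale version for one head and one query is below. It assumes single-channel feature maps for brevity, a reference point in normalized [0, 1] coordinates, and a particular normalized-to-pixel mapping (`p * size - 0.5`); that mapping and the function names are choices of this sketch, not details confirmed by the notes above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_sample(fmap, px, py):
    """Sample a single-channel H x W map at fractional pixel (px, py)."""
    height, width = len(fmap), len(fmap[0])
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0
    value = 0.0
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= xi < width and 0 <= yi < height:
                value += wx * wy * fmap[yi][xi]
    return value

def ms_deform_attn_head(levels, p_hat, offsets, attn_logits):
    """One head, one query, L feature levels with K points per level.

    levels:      list of L single-channel maps with different resolutions
    p_hat:       reference point (x, y) in normalized [0, 1] coordinates
    offsets:     offsets[l][k] = (dx, dy) in pixels of level l
    attn_logits: attn_logits[l][k]; softmax runs over all L * K points
    """
    flat_logits = [v for level_logits in attn_logits for v in level_logits]
    weights = softmax(flat_logits)
    out, idx = 0.0, 0
    for fmap, level_offsets in zip(levels, offsets):
        height, width = len(fmap), len(fmap[0])
        # Assumed convention: rescale the normalized point to this level's grid.
        cx = p_hat[0] * width - 0.5
        cy = p_hat[1] * height - 0.5
        for dx, dy in level_offsets:
            out += weights[idx] * bilinear_sample(fmap, cx + dx, cy + dy)
            idx += 1
    return out

# Two toy levels (4x4 and 2x2) with K = 2 points each; all values made up.
fine = [[1.0] * 4 for _ in range(4)]
coarse = [[3.0] * 2 for _ in range(2)]
zero_offsets = [[(0.0, 0.0)] * 2, [(0.0, 0.0)] * 2]
zero_logits = [[0.0, 0.0], [0.0, 0.0]]
print(ms_deform_attn_head([fine, coarse], (0.5, 0.5), zero_offsets, zero_logits))
```

Because the same normalized reference point is rescaled per level, one query can mix fine and coarse features in a single weighted sum.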
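Deformable Transformer Encoder
- The encoder takes feature maps from several backbone stages as input.
- Each feature map is passed through a 1x1 convolution so that all levels have the same channel dimension, and the encoder output keeps the same resolutions as its input.
- Because multi-scale deformable attention already exchanges information across feature levels, the top-down FPN structure is not used. (The effect is similar to FPN, but the fusion is learned by attention rather than built from a fixed top-down pathway.)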
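
A 1x1 convolution is just the same linear map applied at every pixel, so it changes the channel count and leaves the resolution alone. The sketch below shows this with toy sizes; the channel counts, the weights, and the `conv1x1` helper are made up for illustration, and the bias term is omitted.

```python
def conv1x1(fmap, weight):
    """Apply a 1x1 convolution (no bias) to an H x W x C_in feature map.

    weight is a C_out x C_in matrix; the result is H x W x C_out.
    """
    return [
        [
            [sum(w * v for w, v in zip(out_row, pixel)) for out_row in weight]
            for pixel in row
        ]
        for row in fmap
    ]

def shape(fmap):
    """Return (H, W, C) of a nested-list feature map."""
    return len(fmap), len(fmap[0]), len(fmap[0][0])

# Two toy backbone stages with different resolutions and channel counts.
stage_a = [[[1.0] * 3 for _ in range(8)] for _ in range(8)]   # 8 x 8 x 3
stage_b = [[[1.0] * 5 for _ in range(4)] for _ in range(4)]   # 4 x 4 x 5

# One hypothetical projection per stage, both mapping to 2 channels.
proj_a = [[0.1] * 3, [0.2] * 3]
proj_b = [[0.1] * 5, [0.2] * 5]

encoder_inputs = [conv1x1(stage_a, proj_a), conv1x1(stage_b, proj_b)]
for fmap in encoder_inputs:
    print(shape(fmap))
```

After the projection every level has the same channel dimension, so queries from all levels can share one attention module.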
Deformable Transformer Decoder
- Each decoder layer consists of a cross-attention module and a self-attention module.
- Self-attention keeps the standard MHA, while cross-attention is replaced by the multi-scale deformable attention module.

Two-stage Deformable DETR
- First stage: an encoder-only Deformable DETR generates high-recall region proposals. These proposals become the object queries for the decoder.
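
As a rough illustration of the first stage, the sketch below keeps the highest-scoring boxes out of a set of per-pixel predictions. It assumes each encoder output position already comes with a box and an objectness score; the data, the `select_proposals` helper, and the value of `top_k` are all hypothetical.

```python
def select_proposals(boxes, scores, top_k):
    """Keep the top_k boxes with the highest scores.

    boxes:  list of [x1, y1, x2, y2] predictions, one per encoder position
    scores: list of objectness scores, same length as boxes
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:top_k]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

# Made-up predictions for three encoder positions.
boxes = [[0, 0, 1, 1], [1, 1, 2, 2], [2, 2, 3, 3]]
scores = [0.2, 0.9, 0.5]

proposals, proposal_scores = select_proposals(boxes, scores, top_k=2)
print(proposals, proposal_scores)
```

The selected proposals would then be handed to the decoder in place of learned, input-independent object queries.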

