
[Paper] Exploiting the Distortion-Semantic Interaction in Fisheye Data

by sk_victoria 2023. 5. 21.

Source: https://arxiv.org/pdf/2305.00079.pdf


Abstract

  • Fisheye Data
    • ADVANTAGE
      • Wide FOV (Field of View)
    • DISADVANTAGE
      • Severe radial distortion
      • Neural network performance degrades due to radial distortion
      • Harder to identify the semantic context of objects that are farther from the camera center
  • Improvements
    • 1.1% higher mAP than the baseline
    • 0.6% higher mAP than the previous SOTA

Introduction

  • Fisheye camera sensor: critical to autonomous vehicles
    • Captures a more holistic representation of the entire scene
    • Effective field of view of roughly 180 degrees
    • But suffers radial distortion that grows with distance from the center of the image
    • Undistorting the pixels introduces artifacts at the edge pixels & reduces the overall FOV (see the sketch after this list)
  • Previous Studies with Fisheye camera
    • Model Centric Strategies
      • Change certain architectural features
      • Better conform to fisheye-specific features
    • Data Centric Strategies
      • Manipulate the data (distort/undistort)
      • Augment fisheye data
    • But neither strategy guides the model towards learning a representation space that reflects the interaction between semantic context and distortion.
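
For reference, a minimal sketch of the undistortion step using OpenCV's fisheye model; the intrinsics K, distortion coefficients D, and file names are placeholder assumptions, not values from the paper:

```python
# Minimal sketch: rectifying a fisheye image with OpenCV's fisheye model.
# K (intrinsics) and D (distortion coefficients) are placeholders -- in
# practice they come from a calibration step.
import cv2
import numpy as np

img = cv2.imread("fisheye_frame.png")  # hypothetical input frame
h, w = img.shape[:2]

K = np.array([[350.0, 0.0, w / 2],     # assumed intrinsic matrix
              [0.0, 350.0, h / 2],
              [0.0, 0.0, 1.0]])
D = np.array([0.1, -0.05, 0.01, 0.0])  # assumed k1..k4 fisheye coefficients

# Rectification straightens lines, but edge pixels get stretched
# (artifacts) and content near the border falls outside the output,
# reducing the effective FOV -- the trade-off noted above.
undistorted = cv2.fisheye.undistortImage(img, K, D, Knew=K)
cv2.imwrite("undistorted_frame.png", undistorted)
```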

Previous study: only semantic class labels are used in the training paradigm. Both cars above are labeled as Y_c.

  • Therefore, this study models a representation space that links the distortion profile of fisheye data with its semantic context.
  • The underlying distribution of fisheye data reflects a complex interaction between semantic context and distortion.

Proposed study: add a loss term based on distortion.

 

Fisheye Image Analysis

  • Specifically, the proposed model extracts distortion class labels based on an object's distance from the center of the image.
    • WHY Distance?
      • First, the authors estimate the mAP of objects in different regions. Even though the model is not biased towards center objects, and object size does not differ between edge and center objects, the mAP of edge objects is much lower than that of center objects.
      • This is because the distortion manifolds of the two objects are different, even though their semantic class is the same.

Edge objects have lower mAP than those in the center.
There are more uncentered objects than centered ones, so the network is not biased towards centered objects.
Centered objects are no bigger than uncentered ones, so size does not explain the higher mAP of centered objects.

  • Can we quantify the distortion?

Distortion increases with distance from the center of the image.

  • BRISQUE ('No-Reference Image Quality Assessment in the Spatial Domain', 2012) measures the amount of distortion in a single image without a reference. A natural image's locally normalized (MSCN) pixel statistics are usually Gaussian-shaped, but distortion deforms this histogram.
  • BRISQUE helps quantify the distortion by estimating the amount of this change (see the sketch below).
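
A minimal sketch of the MSCN (mean-subtracted contrast-normalized) normalization that BRISQUE builds its features on, assuming a grayscale input and SciPy's Gaussian filter; this illustrates the Gaussian-histogram idea, not the paper's exact pipeline:

```python
# MSCN coefficients: for a natural, undistorted image their histogram is
# close to Gaussian; distortion (e.g., radial stretching at the fisheye
# edge) deforms that shape, which BRISQUE measures.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(image: np.ndarray, sigma: float = 7 / 6, c: float = 1.0) -> np.ndarray:
    """Return MSCN coefficients of a grayscale image."""
    image = image.astype(np.float64)
    mu = gaussian_filter(image, sigma)                  # local mean
    var = gaussian_filter(image * image, sigma) - mu * mu
    sigma_map = np.sqrt(np.clip(var, 0.0, None))        # local std
    return (image - mu) / (sigma_map + c)

# BRISQUE then fits generalized Gaussian distributions to these
# coefficients (and to products of neighboring coefficients) and uses the
# fitted parameters as its quality features.
```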

Center objects and edge objects differ in terms of their BRISQUE features.

 

Methodology

  • Enabling the model to recognize both semantic and distortion information.
  • 3 distinct steps: Regional Class Label Extraction -> Contrastive Pre-Training of ResNet-18 -> Fine-Tuning

 

1. Regional Class Label Extraction

The entire process to define the distortion-based labels

  • Use the class information of the open dataset to acquire semantic labels.
  • Use the bounding-box information of the open dataset to acquire distortion labels.
  • If the center of the bounding box lies in the inscribed box with upper-left coordinate (.25, .25) and lower-right coordinate (.75, .75), the object is labeled as the lower-distortion version of its class; otherwise, as the higher-distortion version (see the sketch below).
  • In total, 10 possible distortion classes: two variants of each of the 5 semantic classes.
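
A minimal sketch of this labeling rule, assuming bounding-box coordinates normalized to [0, 1]; the function name and signature are illustrative, not the paper's code:

```python
def distortion_label(box, semantic_class: int, num_classes: int = 5) -> int:
    """Map a (semantic class, bounding box) pair to one of
    2 * num_classes distortion class labels."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Low distortion if the box center lies in the inscribed central square
    # from (.25, .25) to (.75, .75); high distortion otherwise.
    low = 0.25 <= cx <= 0.75 and 0.25 <= cy <= 0.75
    return semantic_class if low else semantic_class + num_classes

# Example: semantic class 2 near the image edge -> distortion class 7.
print(distortion_label((0.80, 0.40, 0.95, 0.60), semantic_class=2))  # 7
```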

 

2. Contrastive Pre-Training

  • Perform a contrastive learning objective that constrains the representation to consider both semantic and distortion concepts.
    • Contrastive learning is a self-supervised visual representation learning method.
    • The output of the model is the representation vector of the corresponding input image.
    • The model minimizes the distance between representations of similar (positive) pairs, so their representation vectors become similar.
    • The model maximizes the distance between representations of dissimilar (negative) pairs, so their representation vectors become different.
    • Find the encoder parameters that minimize the contrastive loss over pairs of vector representations.
    • The contrastive loss quantifies the similarity of two vectors.
    • Contrastive Loss = Positive Loss + Negative Loss
  • Shape the backbone's representation space (w/ a weighted contrastive loss, sketched below)
    • Samples with the same semantic class and distortion class are pulled close to each other
    • The backbone is trained with both semantic & distortion information
    • It is then fine-tuned within an object detection model
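
A minimal PyTorch sketch of such a weighted, SupCon-style contrastive loss over semantic and distortion labels; the weighting scheme (w_sem, w_dist) and the exact formulation are illustrative assumptions, not the paper's precise objective:

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(z, sem, dist, tau=0.1, w_sem=1.0, w_dist=1.0):
    """z: (N, d) embeddings; sem, dist: (N,) semantic / distortion labels."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                          # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    # Positive pairs share a label; pairs matching on both semantics and
    # distortion get the largest weight, pulling them closest together.
    pos_w = (w_sem * (sem[:, None] == sem[None, :]).float()
             + w_dist * (dist[:, None] == dist[None, :]).float())
    pos_w = pos_w.masked_fill(eye, 0.0)            # exclude self-pairs

    # Row-wise log-softmax over all other samples in the batch.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)      # avoid 0 * (-inf)

    # Weighted mean of positive log-probabilities (SupCon-style).
    denom = pos_w.sum(dim=1).clamp(min=1e-8)
    return (-(pos_w * log_prob).sum(dim=1) / denom).mean()
```

In this sketch, samples that agree on both the semantic and the distortion label are pulled closest together, which matches the goal of shaping the representation space around both concepts.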
