Paper Reading: Exploiting Temporal Relations on Radar Perception for Autonomous Driving (CVPR 2022)

Z.H. Shen

Published in

馬鈴薯獵人的狂想曲

49 min read · Jul 7, 2022

Paper

Abstract

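Radar has two advantages: it is cheaper than LiDAR, and it perceives in all weather conditions. Its drawback is low angular resolution and precision when recognizing surrounding objects.
To strengthen radar for autonomous driving, the authors exploit temporal information across successive ego-centric bird-eye-view (BEV) radar image frames for radar object recognition. Starting from the premise that the same object keeps the same attributes (size, orientation) across consecutive frames, they propose a temporal relational layer that models the relations between objects in successive radar images. On both object detection and multiple object tracking, the method outperforms several baselines.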

1. Introduction

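Autonomous driving relies on sensing technologies that provide robust perception of dynamic objects in order to build a reliable and safe decision-making system [40]. Among these technologies, cameras and LiDAR are the dominant ways to perceive surrounding objects: cameras provide semantically rich views of the traffic scene, and LiDAR provides high-resolution point clouds from light reflected off objects. Compared with cameras and LiDAR, radar plays a unique role in autonomous driving. It transmits electromagnetic waves at 77/79 GHz (millimeter waves, wavelength of about 4 mm) to estimate an object's range, velocity, and angle relative to the radar.
At millimeter wavelengths, radar waves can pass through or diffract around tiny particles such as rain, fog, snow, and dust, so radar provides long-range sensing in these conditions [41]. LiDAR operates at a much shorter wavelength, so its light may bounce off these tiny particles, which reduces its operating range. Compared with cameras, radar is also more resilient to lighting conditions, for example at night or under strong daylight glare. Radar is therefore a cost-effective and reliable option that complements other sensors and offers all-weather perception. Even an aggressive estimate puts the cost of a LiDAR unit in 2022 at about 500 to 1,000 USD [1], whereas a radar costs less than 100 USD [10], so radar has a clear cost advantage over LiDAR.
However, radar has low angular resolution when recognizing surrounding objects, whereas high angular resolution in azimuth and elevation is indispensable for radar-assisted automotive perception. In recently released automotive radar datasets, an azimuth resolution of about 1 degree has become available, but elevation resolution still lags behind. With 1-degree azimuth resolution, semantic features (such as corners and shapes) of objects at short range can be perceived. At long range the semantics remain blurry because of cross-range resolution: the same angular resolution covers a wider lateral extent as distance grows. In summary, the localization capability of radar is still insufficient for full-level autonomous driving.
Recent work has improved automotive radar perception from the algorithmic side. [17] proposes a deep-learning approach on range-azimuth-Doppler measurements. [20] detects objects with synchronized radar and LiDAR signals. Similarly, [15] and [36] explore multi-modal sensing fusion. Beyond deep learning, Bayesian learning has been applied to extended object tracking with radar point clouds [34], [37]. Most of these contributions focus on multi-modal sensing fusion for robust perception [15], [20], [36]. In contrast, this paper uses radar signals alone, which requires fewer sensing inputs and avoids the complicated synchronization of multi-modal sensors.
The paper renders the ego-centric BEV radar point cloud in a Cartesian frame, where each pixel value is the radar reflection intensity, and uses temporal information to strengthen radar perception. From the observation in Figure 1, the authors assume that objects detected by the radar keep and share the same attributes across consecutive frames, such as existence, length, and orientation. Detection in one frame can therefore be aided by object-level correlations with past or future frames.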

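To compensate for the blurriness caused by low angular resolution, the authors exploit temporal correlation through a customized temporal relational layer that handles object-level relations across consecutive frames. The layer takes feature vectors at the centers of potential objects and applies temporal self-attention over these object features together with their locations. Put simply, the temporal relational layer links similar object representations across time, which acts like feature smoothing. It thereby injects an inductive bias of temporal object consistency, and the updated feature representation is then used to infer the object heatmap (which marks object centers) and the associated attributes.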

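The authors argue that radar is a key sensing technology for object recognition in autonomous driving, with unique advantages. They list the following main contributions: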

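Temporal information is used to boost radar perception and compensate for the blurriness and low resolution of radar sensors.
A customized temporal relational layer inserts an inductive bias into the network: the same object should share consistent appearance and attributes across consecutive frames.
The method is evaluated on object detection and multiple object tracking on a radar dataset, and shows consistent improvements over the baselines.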

2. Radar Perception: Background

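This section introduces the characteristics of radar. All radars follow a similar processing chain and differ mainly in how the final data is presented, so it can be skipped.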

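Automotive radars mainly use frequency modulated continuous waveform (FMCW) signals to detect objects and generate point clouds over multiple physical domains. As shown in Figure 2(a), the radar transmits a sequence of FMCW pulses through its M transmit antennas: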

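Here m and q index the transmit antenna and the pulse, T_PRI is the pulse repetition interval, f_c is the carrier frequency (e.g., 79 GHz), and S_p(t) is the baseband FMCW waveform (drawn as sinusoids in Figure 2(a)). An object at range R_0, moving with velocity V_t at a far-field spatial angle (i.e., azimuth, elevation, or both), induces amplitude attenuation and phase modulation on the FMCW signal received at each of the N receiver RF chains (which include the low noise amplifier (LNA), local oscillator (LO), and analog-to-digital converter (ADC), see Figure 2(b)). The modulation induced by the target is captured by the baseband signal processing block (which includes fast Fourier transforms (FFTs) over the range, Doppler, and spatial domains, see Figure 2(b)), and this processing yields a multi-dimensional spectrum.
In the constant false alarm rate (CFAR) detection step, the spectrum is compared against an adaptive threshold to generate radar point clouds in the range, Doppler, and angle domains [4], [13], [30]. Because of computation and cost constraints, automotive radar manufacturers may define the point cloud in a subset of the four dimensions. For example, conventional automotive radars produce detection points in the range-Doppler domain, whereas some produce them in the range-Doppler-azimuth plane [21].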

Range-Doppler map (source: Fast implementation for modified adaptive multi-pulse compression)

Range-Doppler-Azimuth map(source: RADDet)

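In the Radiate dataset [25] used in this paper, the radar point cloud is defined on the range-azimuth plane with a 360-degree field of view. The resulting polar-coordinate point cloud is transformed into an ego-centric Cartesian coordinate system, and standard voxelization then converts the point cloud into an image.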
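The polar-to-Cartesian conversion and rasterization described above can be sketched as follows. This is a minimal illustration, not the dataset's actual tooling: the function name `polar_to_bev`, the convention that azimuth 0 points ahead of the vehicle, and keeping the strongest return per pixel are all assumptions.

```python
import math

def polar_to_bev(points, image_size, max_range_m):
    """Rasterize polar radar returns into an ego-centric BEV image.

    points: iterable of (range_m, azimuth_rad, intensity).
    Returns image_size rows of image_size pixels; the ego vehicle sits at
    the image center and each pixel keeps the strongest return (assumed).
    """
    image = [[0.0] * image_size for _ in range(image_size)]
    metres_per_pixel = 2.0 * max_range_m / image_size
    centre = image_size / 2.0
    for rng, azimuth, intensity in points:
        if rng > max_range_m:
            continue
        # Polar -> Cartesian; azimuth 0 is assumed to point "up" (ahead).
        x = rng * math.sin(azimuth)
        y = rng * math.cos(azimuth)
        col = math.floor(centre + x / metres_per_pixel)
        row = math.floor(centre - y / metres_per_pixel)
        if 0 <= row < image_size and 0 <= col < image_size:
            image[row][col] = max(image[row][col], intensity)
    return image

# Two returns at 50 m: one straight ahead, one to the right.
bev = polar_to_bev([(50.0, 0.0, 0.8), (50.0, math.pi / 2, 0.5)], 200, 100.0)
```

With a 200 x 200 image and a 100 m maximum range, each pixel covers 1 m, so the return straight ahead lands 50 rows above the center.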

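The Radiate dataset was recorded with a Navtech CTS350-X radar, which provides 360-degree high-resolution range-azimuth images at 4 Hz. The maximum operating range is 100 m, the range resolution is 0.175 m, and the azimuth and elevation resolution is 1.8 degrees. It currently provides no Doppler data.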
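As a rough worked example of why distant objects look blurrier, the lateral (cross-range) extent covered by one angular cell can be approximated by the arc length. The small-angle approximation and the symbol names below are assumptions for illustration; only the 1.8-degree and 100 m figures come from the specifications above.

```latex
\begin{aligned}
\Delta_{\text{cross}} &\approx R \,\Delta\theta, \qquad
\Delta\theta = 1.8^\circ = \frac{1.8\,\pi}{180}\ \text{rad} \approx 0.0314\ \text{rad} \\
R = 10\ \text{m}:&\quad \Delta_{\text{cross}} \approx 0.31\ \text{m} \\
R = 100\ \text{m}:&\quad \Delta_{\text{cross}} \approx 3.14\ \text{m}
\end{aligned}
```

At the maximum range, one angular cell is therefore roughly as wide as a car, while the range resolution stays at 0.175 m.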

Specifications of the radar dataset used in this paper (Radiate dataset [25]) (source)

3. Radar Perception with Temporality

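Figure 3 shows the network architecture. The following subsections introduce temporal feature extraction from two frames, the temporal relational layer, the learning method, and the extension to multiple object tracking.

Notation is defined first.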

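1. θ: the learnable parameters of the neural network. For simplicity, the authors use θ for the parameters of all modules.
2. A bracket following a three-dimensional matrix denotes gathering features at the given coordinates.
3. Z denotes a feature representation, Z ∈ ℝ^{C×H×W}, where C, H, and W are the channel, height, and width.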

4. Let P denote an (x, y) coordinate or a set of K two-dimensional coordinates {(x, y)}_K, where x and y are real numbers.

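5. Z[P] denotes the features gathered at the coordinates in P. The result is in ℝ^C for a single coordinate or in ℝ^{K×C} for a set of K coordinates.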

3.1. Temporal Feature Extraction

A single radar frame is denoted I ∈ ℝ^{1×H×W}.

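The authors concatenate two consecutive frames, the previous one and the current one, along the channel dimension to form a temporal input. The two orderings are written I_{c+p} and I_{p+c}, where p stands for previous, c stands for current, and the subscript order gives the order of the frames along the channel dimension. Both are real tensors of size 2×H×W.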

I_{c+p} and I_{p+c} are fed to the backbone network F_θ(·), which outputs the feature representations of the two frames:
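The equation is not reproduced in this post. A plausible reconstruction from the surrounding definitions is given here; which input ordering maps to which output is an assumption.

```latex
Z_c := F_\theta\left(I_{c+p}\right), \qquad
Z_p := F_\theta\left(I_{p+c}\right)
```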

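The backbone F_θ(·) is a standard deep convolutional neural network (e.g., ResNet), and its parameters are shared between the two inputs (I_{p+c} and I_{c+p}). To fuse high-level semantics with low-level finer details in the feature representation, the authors build skip connections between different scales of the network. For each skip connection, the pooled features from the deeper layer are up-sampled by bilinear interpolation to match the size of the shallow-layer features. The up-sampled features then pass through convolution, non-linear activation, and batch normalization, and are concatenated with the shallow-layer features along the channel dimension. Three skip connections are inserted, which brings together features from four semantic levels.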

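The final features from the backbone are Z_c, Z_p ∈ ℝ^{C×(H/s)×(W/s)}, where s is the down-sampling ratio over the spatial dimensions. According to Appendix A, s = 4.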

3.2. Modeling Object Temporal Relations

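The authors design a temporal relational layer to model the correlation and consistency of potential objects across consecutive frames. The layer receives multiple feature vectors from the two frames, each representing a potential object in the radar image.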

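A filtering module is first applied to the features Z_c and Z_p to select the top-K potential object features for relation modeling, which yields the coordinate sets P_c and P_p. The filtering module reduces the multi-channel feature representation to a single-channel score map, so selection works on one score per location. P_c is defined as follows: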

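The selection threshold is the K-th largest value of the score map over the spatial space (H/s × W/s). The subscript x, y denotes the value at coordinate (x, y).
Clearly, |P_c| = K.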

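That is, the number of coordinates equals the chosen K, so the features of K potential objects per frame are enhanced. When the model is later applied to detection and tracking, it is perfectly normal to obtain more than K objects. K is best set close to the average number of objects per radar image. In the ablation study, K = 8 gives the best results.
P_p is obtained from Z_p in the same way. The authors do not feed the features at all coordinates into the temporal relational layer, because the computational complexity of the subsequent attention mechanism grows quadratically with the number of selected features (on the order of K^2, not 2^K).
Substituting the coordinate sets P_c and P_p into the feature representations gives the selective feature matrices: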
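The top-K filtering and the gathering operation Z[P] can be sketched in plain Python. This is a toy illustration under stated assumptions: `top_k_coordinates` and `gather_features` are hypothetical helper names, and the score map stands in for the output of the filtering module.

```python
def top_k_coordinates(score_map, k):
    """Return the (x, y) coordinates of the k highest scores in a 2D map."""
    scored = [
        (value, x, y)
        for y, row in enumerate(score_map)
        for x, value in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(x, y) for _, x, y in scored[:k]]

def gather_features(feature_map, coords):
    """Z[P]: pick the C-dimensional vector at each (x, y); result is K x C."""
    return [[channel[y][x] for channel in feature_map] for x, y in coords]

score_map = [[0.1, 0.9, 0.2],
             [0.3, 0.0, 0.8]]
# Two channels (C = 2) over the same 2 x 3 grid.
feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]
P_c = top_k_coordinates(score_map, k=2)
H_c = gather_features(feature_map, P_c)
```

Here `P_c` holds the two highest-scoring locations and `H_c` is the corresponding 2 x 2 (K x C) selective feature matrix.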

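The top-K selective feature matrices of the two frames are then concatenated to form the input of the temporal relational layer. H_{c+p} is defined as the transpose of the concatenation of H_c and H_p, a real matrix of size 2K×C, where C is the number of channels and the factor 2 comes from using two radar frames.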

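Before H_{c+p} is passed to the temporal relational layer, positional encoding is added. CNNs have the translational invariance property, so the output features of a CNN do not carry absolute positions. Position, however, is crucial for temporal object relations: objects that lie within a certain spatial distance of each other in two consecutive frames are likely to be the same object and to share similar attributes. The spatial distance between two occurrences of the same object depends on the frame rate and the vehicle motion, and can be learned in a data-driven way.
The positional encoding is attached to the selective feature matrix H_{c+p} by feature concatenation. Let D_{pos} denote the dimension of the positional encoding. The encoding is projected by linear mappings from the normalized coordinates (x, y), whose values lie in [0, 1].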
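A minimal sketch of this step, assuming a single linear layer per encoding dimension: the `weights` and `bias` values stand in for learned parameters, and the helper names are hypothetical.

```python
def positional_encoding(coords, width, height, weights, bias):
    """Map normalized (x, y) in [0, 1] to D_pos values with a linear layer.

    weights: D_pos pairs (w_x, w_y); bias: D_pos values. Both are
    placeholders for learned parameters.
    """
    encodings = []
    for x, y in coords:
        nx = x / (width - 1)
        ny = y / (height - 1)
        encodings.append(
            [wx * nx + wy * ny + b for (wx, wy), b in zip(weights, bias)]
        )
    return encodings

def append_encoding(features, encodings):
    """Feature concatenation: (2K x C) with (2K x D_pos) -> 2K x (C + D_pos)."""
    return [f + e for f, e in zip(features, encodings)]

coords = [(0, 0), (4, 2)]
pos = positional_encoding(
    coords, width=5, height=3,
    weights=[(1.0, 0.0), (0.0, 1.0)], bias=[0.0, 0.0],
)
H_pos = append_encoding([[2, 20], [6, 60]], pos)
```

Concatenation keeps the original C channels untouched, so the value projection can still read features without position.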

With the above, the core operation for cross-frame relation modeling can be written down. For the l-th temporal relational layer, the superscript l marks the input feature and l+1 marks the output feature.
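The equation itself is not reproduced in this post. Based on the descriptions of q, k, v, d, and M that follow, it presumably has the form of masked scaled dot-product attention; the exact notation below, including the "pos" superscript for features with positional encoding attached, is an assumption.

```latex
H^{\,l+1}_{c+p} = \operatorname{softmax}\!\left(
  M + \frac{q\!\left(H^{\,l,\mathrm{pos}}_{c+p}\right)\,
            k\!\left(H^{\,l,\mathrm{pos}}_{c+p}\right)^{\top}}{\sqrt{d}}
\right) v\!\left(H^{\,l}_{c+p}\right)
```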

q(·), k(·), and v(·) are linear transformation layers that produce the query, keys, and values. d is the dimension of the query and keys, used to scale their dot product. The masking matrix M ∈ ℝ^{2K×2K} is defined as follows:

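Here 1_{K,K} is the all-one matrix of size K×K, 0_{K,K} is the all-zero matrix of size K×K, 1_{2K} is the identity matrix of size 2K, and σ is a negative constant, set by the authors to −(1e+10), i.e., −1 × 10^10, so that the corresponding outputs of the softmax are close to zero. The diagonal blocks 1_{K,K} disable attention between features of the same frame, the off-diagonal blocks 0_{K,K} allow cross-frame attention, and the identity matrix 1_{2K} unlocks each object's attention to itself.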
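The effect of the mask can be checked with a small sketch. The block structure below (σ on same-frame entries except the diagonal, zero elsewhere) is inferred from the description above rather than copied from the paper, and `build_mask` is a hypothetical helper.

```python
import math

def build_mask(k, sigma=-1e10):
    """2K x 2K mask: sigma on same-frame pairs (i != j), zero elsewhere."""
    size = 2 * k
    mask = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            same_frame = (i < k) == (j < k)
            if same_frame and i != j:
                mask[i][j] = sigma
    return mask

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

M = build_mask(k=2)
# Attention weights of object 0 when all raw scores are equal (zero):
weights = softmax(M[0])
```

With K = 2, object 0 spreads its attention over itself and the two objects of the other frame, while the weight on the other object of its own frame is driven to zero.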
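The logic behind keeping self-attention is that the same object is not guaranteed to appear in both consecutive frames, since objects can move out of range. When an object is missing from the other frame, attending to itself is the desirable fallback. Note that the positional encoding is attached only to the keys and the query, not to the values, so the output features carry no locality. Other technical details follow the Transformer design [29] and are omitted by the authors for simplicity.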
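In Equation 5, after object temporal attention is performed across frames, a feed-forward function is applied, consisting of two linear layers, layer normalization, and a shortcut on the features. Relation modeling is built from multiple temporal relational layers of identical design. At the end, the output H_{c+p} is split back into H_c and H_p, and the feature vectors are refilled into Z_c and Z_p at the spatial coordinates given by P_c and P_p, from which the predicted heatmap is obtained.
The regressions in the next subsection are carried out on the refilled feature representations.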
Discussion
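The operations above resemble the Transformer [29]. The Transformer was designed for language representation learning: if two words are correlated in the training corpus, through co-occurrence, word position, or semantics, they are mapped to similar latent representations.
Multi-head attention in a stacked architecture can be understood as smoothing the features of semantically similar words [6], [8], [14].
In the radar context, features of objects with the same ID in consecutive frames should share a similar latent representation. This matters because the latent representation stores all object-related attributes and is used for decoding afterwards, as described in Section 3.3.
Smoothing the two feature vectors of the same object in consecutive frames satisfies the basic temporal consistency assumption, and can restore object information that is partially lost in a single frame because of radar blurriness.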

3.3. Learning

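The authors pick object center coordinates from the heatmap and learn the object attributes (width, length, orientation, and center-coordinate offset) by regression on the feature representation.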

Heatmap

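To localize objects, the 2D coordinates of peaks in the heatmap are taken as object centers. The heatmap is obtained from the features computed in the previous steps.
The ground-truth heatmap is generated by placing a 2D radial basis function (RBF) kernel at the center of every ground-truth object, where the kernel parameter σ is proportional to the object's width and length.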
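A minimal sketch of ground-truth heatmap generation, assuming an isotropic Gaussian as the RBF kernel and an element-wise maximum where kernels overlap. Both choices, and the way σ is passed in directly instead of being derived from the box size, are assumptions.

```python
import math

def gaussian_heatmap(height, width, objects):
    """objects: list of (cx, cy, sigma); returns a height x width heatmap.

    Overlapping kernels are merged with an element-wise max (assumed).
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy, sigma in objects:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

# One object centered at (2, 2) on a 5 x 5 grid.
hm = gaussian_heatmap(5, 5, [(2, 2, 1.0)])
```

The center cell equals exactly 1, which is the value the focal loss treats as a positive sample; the surrounding cells decay smoothly toward 0.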
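Because objects are sparse in radar images, the authors use focal loss [16] to balance the regression of ground-truth centers and background.
Let h_i and \hat{h}_i denote the ground-truth and predicted values at the i-th coordinate, and N the total number of values in the heatmap. The focal loss is written as: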

α and β are hyper-parameters, set to 2 and 4 following prior work [38].
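The loss formula is not reproduced in this post. The sketch below uses the penalty-reduced focal loss that is common in center-based detectors, with α = 2 and β = 4 as stated. Treating this as the form used in the paper, and normalizing by the number of heatmap values, are assumptions.

```python
import math

def focal_loss(gt, pred, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flattened heatmaps gt and pred."""
    total = 0.0
    for h, h_hat in zip(gt, pred):
        h_hat = min(max(h_hat, eps), 1.0 - eps)  # keep log() finite
        if h == 1.0:
            # Positive location: penalize low confidence at the center.
            total += (1.0 - h_hat) ** alpha * math.log(h_hat)
        else:
            # Negative location: down-weighted near centers by (1 - h)^beta.
            total += (1.0 - h) ** beta * h_hat ** alpha * math.log(1.0 - h_hat)
    return -total / len(gt)

gt = [1.0, 0.0]
good = focal_loss(gt, [1.0, 0.0])
bad = focal_loss(gt, [0.1, 0.9])
```

A near-perfect prediction gives a loss close to zero, while a confident wrong prediction is penalized heavily.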
The same loss function is also applied to the following term, to correct the feature selection used for relation modeling:

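At inference, a threshold is set on the heatmap to separate object centers from the background, and non-maximum suppression is applied to avoid redundant bounding boxes.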
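One simple way to realize this step is to keep only cells that exceed the threshold and are the maximum of their 3 x 3 neighborhood. The post only states that non-maximum suppression is applied, so the local-maximum variant and the function name `extract_centres` are assumptions.

```python
def extract_centres(heatmap, threshold):
    """Return (x, y, score) for cells above threshold that are 3x3 local maxima."""
    height, width = len(heatmap), len(heatmap[0])
    centres = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if score >= max(neighbours):
                centres.append((x, y, score))
    return centres

heatmap = [[0.1, 0.2, 0.1],
           [0.2, 0.9, 0.6],
           [0.1, 0.2, 0.1]]
centres = extract_centres(heatmap, threshold=0.5)
```

The cell with score 0.6 passes the threshold but is suppressed because its neighbor scores 0.9, so only one center survives.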

Width & Length

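Through another regression head, the authors predict the width and length of the oriented bounding box from the feature vector located at the center coordinate in the feature map.

Let P^{k}_{gt} denote the (x, y) center coordinate of the k-th ground-truth object, b^k the ground-truth vector containing the width and length of the k-th object, and Z a unified notation for Z_c and Z_p.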

Orientation

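Every vehicle in a bird-eye-view image has an orientation, an angle in [0°, 360°) measured between the object's heading and the boresight direction of the ego vehicle.
The authors regress the sine and cosine values of the angle ϑ with the following formula: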
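Regressing sine and cosine avoids the discontinuity between 359 degrees and 0 degrees that a direct angle target would have. A minimal sketch of the encoding and of recovering the angle with `atan2` (the helper names are hypothetical):

```python
import math

def encode_orientation(angle_deg):
    """Angle in degrees -> (sin, cos) regression target."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)

def decode_orientation(sin_value, cos_value):
    """(sin, cos) prediction -> angle in [0, 360) degrees."""
    return math.degrees(math.atan2(sin_value, cos_value)) % 360.0

target = encode_orientation(350.0)
angle = decode_orientation(*target)
```

Because `atan2` depends only on the ratio and signs of its arguments, the decoding also works when the predicted pair is not exactly on the unit circle.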

Offset

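Down-sampling in the backbone network can shift the center coordinates. Center coordinates in the heatmap are integers, while the true coordinates may fall off the heatmap grid because of spatial down-sampling. To compensate for this shift, the authors compute the ground-truth offset of the k-th object as follows:

C^{k}_{x} and C^{k}_{y} are the center coordinates of the k-th object, s is the down-sampling ratio, and the bracket [·] denotes rounding to an integer.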
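A small worked example of the offset, assuming the offset is the difference between the scaled coordinate and its rounded value, and that ties round half up (the post does not specify the tie-breaking rule):

```python
import math

def centre_offset(cx, cy, s):
    """Return the integer heatmap cell and the sub-cell offset of a centre."""
    fx, fy = cx / s, cy / s
    # Round half up (assumed); Python's built-in round() rounds half to even.
    gx, gy = math.floor(fx + 0.5), math.floor(fy + 0.5)
    return (gx, gy), (fx - gx, fy - gy)

# Centre at pixel (103, 58) in the input image, down-sampling ratio s = 4.
cell, offset = centre_offset(103.0, 58.0, 4)
```

Adding the offset back to the cell index and multiplying by s recovers the original coordinate: (26 - 0.25) * 4 = 103 and (15 - 0.5) * 4 = 58.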
Given the following,

the regression for the center positional offset can be expressed as:

Training

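All the regression losses above are combined linearly into the final training objective.

For simplicity, the authors omit the balance factor of each term. At every training step, the loss L is computed and back-propagated for both the current and the previous frame simultaneously. For the current frame, objects receive information from the past to aid recognition. From the viewpoint of the previous frame, objects use temporal information from the immediate future. The optimization can therefore be seen as bi-directional, forward-and-backward training over two consecutive frames. The authors do not currently extend the framework to more frames, because an intermediate frame has no proper temporal order for concatenating the input images (neither past-to-future nor future-to-past), which would reduce training efficiency.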

3.4. Extending to Multiple Object Tracking

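The framework extends easily to multiple object tracking using a tracking procedure similar to [42].
For multiple object tracking, the authors add a regression head on the center feature vector that predicts the 2D moving offset between the centers of the object holding the same tracking ID in the current and previous frames. Association in tracking decoding then only needs the Euclidean distance. A detailed illustration and the algorithm for multiple object tracking are given in Appendix B.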
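The association step can be sketched as a greedy nearest-neighbor match. This is not the algorithm from Appendix B, which is not reproduced here: the greedy order, the distance gate `max_distance`, and the sign convention of the offset are all assumptions.

```python
import math

def associate(prev_tracks, detections, max_distance):
    """Greedy matching of detections to tracks by Euclidean distance.

    prev_tracks: {track_id: (x, y)} centres in the previous frame.
    detections: list of ((x, y), (dx, dy)), where the predicted offset is
    assumed to move the current centre back to its previous position.
    Returns one track ID per detection; unmatched detections get new IDs.
    """
    next_id = max(prev_tracks, default=-1) + 1
    free = dict(prev_tracks)
    ids = []
    for (x, y), (dx, dy) in detections:
        px, py = x + dx, y + dy
        best = min(free, key=lambda t: math.dist(free[t], (px, py)), default=None)
        if best is not None and math.dist(free[best], (px, py)) <= max_distance:
            ids.append(best)
            del free[best]
        else:
            ids.append(next_id)
            next_id += 1
    return ids

prev_tracks = {0: (10.0, 10.0), 1: (50.0, 50.0)}
detections = [((12.0, 10.0), (-2.0, 0.0)), ((80.0, 80.0), (0.0, 0.0))]
ids = associate(prev_tracks, detections, max_distance=5.0)
```

The first detection is mapped back onto track 0, while the second is too far from any remaining track and starts a new one.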

4. Experiment

4.1. Experimental Setup

Dataset
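The authors use the radar dataset Radiate [25] for the following reasons. First, it contains high-resolution radar images. Second, it provides well-annotated oriented bounding boxes together with tracking IDs for the objects. Third, it records diverse real driving scenarios, from highways to urban streets, under varied and adverse conditions, including sun, night, rain, fog, and snow.
The data takes the form of point-cloud-based radar images, in which the pixel value is the intensity of the reflected radar signal.
Radiate was collected with a Navtech CTS350-X scanning radar, which provides 360-degree high-resolution range-azimuth images at 4 Hz. The radar currently provides no Doppler or velocity information.
Radiate contains 61 sequences in total. The authors follow the official split into three parts: a training set in good weather (31 sequences, 22383 frames, only in good weather, sunny or overcast), a training set in good and bad weather (12 sequences, 9749 frames, both good and bad weather conditions), and a test set (18 sequences, 11305 frames, all kinds of weather conditions).
The authors train a model on each of the two training sets separately, evaluate both on the test set, and report results per training set. They also surveyed other radar datasets and discuss in Section 5 why those are not feasible for their experiments.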

Baseline

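The authors implement several well-established detectors from visual object detection as baselines for comparison: Faster-RCNN [22], RetinaNet [16], CenterPoint [43], and BBAVectors [38], combined with different backbone networks [9], [27]. Conventional detectors, however, are not designed for oriented objects.
To adapt them to oriented object detection, the authors manually add an extra dimension to the anchors or the regression to predict the orientation angle of the object.
In Table 1, these detectors are marked with the suffix OBB (oriented bounding box).
To highlight the benefit of temporal modeling, the authors also feed temporal input to the baselines, where T marks an input of two consecutive frames. 'Ours-w/o TRL' is architecturally equivalent to the CenterPoint model with temporal input.
For multiple object tracking, the authors compare against CenterTrack [42] adapted to oriented objects, using the same tracking heuristics as this paper.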

4.2. Result and Analysis

[25]

Detection

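The results are reported in Table 1 and Table 2. The proposed method achieves better results on both training splits and at different IoU thresholds.
Moreover, the gap between the models with and without the temporal relational layer further confirms the importance of temporal object consistency across consecutive frames.
Regarding the training splits, more varied weather conditions in training can improve the robustness of detection and tracking, because the test set covers more weather conditions. For radar, however, adverse weather makes no obvious difference, so the margin between the two training splits comes mainly from the number of training samples. Regarding image size, performance drops somewhat when detecting over a larger area. This is mainly due to cross-range resolution: more distant objects tend to be blurrier.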

Tracking

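Multiple object tracking results are reported in Table 3, where the method again outperforms the baseline.
The baseline, CenterTrack, also adds temporal information by taking the previous frame and the previous-frame heatmap as extra input. During training it uses the ground-truth heatmap, while at inference it uses the heatmap predicted for the previous frame.
This training scheme works well for RGB video tracking, where most detections are accurate. Radar detection can hardly reach the same accuracy, so the heatmaps seen in training and at inference are no longer aligned. The difference in tracking performance with and without the temporal relational layer shows the effectiveness of modeling temporal object-level relations.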

Visualization

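Visualized results for object detection and multiple object tracking are shown in Figure 4, with more in Appendix 4. The authors note slight deviations between predictions and annotations and, besides the correct predictions, some false positive predictions. On closer inspection, these false positives very likely enclose clusters of reflections inside the box, which can be regarded as ghost objects (artifacts), and this is probably their main cause. The model also misses some objects, because low angular resolution lets objects be submerged in the static environment. How to handle ghost objects and blurriness better is an interesting open question. An additional experiment in Appendix D analyzes the best number of selected features in the temporal relational layers, and the empirically best value is adopted as the heuristic setting for K.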

5. Related Work

Radar Perception in Autonomous Driving

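Radar applications in autonomous driving are attracting growing attention. The authors review recent algorithmic work on radar. [17] proposes a deep-learning approach to object detection for autonomous driving using range-azimuth-Doppler measurements.
[20] proposes object detection based on sensor fusion of radar and LiDAR.
[15] and [36] likewise develop multi-modal sensing for autonomous driving.
Beyond deep learning, Bayesian learning has also been applied to extended object tracking with radar [34], [37].
The authors use radar signals only and strengthen object recognition through the temporal consistency of objects, which previous work did not exploit.
Appendix E gives a brief review of the radar datasets relevant to this paper.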

Appendix E
— — —
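- Besides algorithm design, the emergence of radar datasets is essential for machine learning research. These datasets come as radio frequency heatmaps, radar reflection images, or point clouds.
- The RadarScenes dataset provides radar point clouds with Doppler annotations for automotive radar, but has no bounding box annotations for objects.
- The Carrada dataset records range-angle and range-Doppler heatmaps, but mainly at controlled test sites such as parking lots rather than in real driving environments.
- The CRUW dataset provides radio frequency images with camera-based annotations.
- The nuScenes dataset provides multi-modal data from cameras, LiDAR, and radar, but its radar data consists of sparse point clouds only. Cameras and LiDAR are the main strength of this dataset.
- The MulRan and Oxford datasets provide high-resolution radar images of urban driving scenes, but have no object annotations.
- The Radiate dataset used in this paper supports object detection and tracking experiments on point-cloud-based radar images under adverse weather. Every object of interest has a bounding box and a tracking ID, which can be used for training.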

Detection with Temporality

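Consecutive video frames provide spatial-temporal information for object recognition. [32] uses a feature bank to extend the temporal horizon for spatial-temporal action localization. [26] and [3] insert short-term or long-term object-level relations into Faster-RCNN [22] to capture spatial-temporal information about objects. Other techniques, such as video pixel flow or 3D convolution [35], [44], [45], are applied to visually rich video streams, but are too heavy and inefficient for radar images. The authors' approach shares the same idea of exploiting object-level spatial-temporal relations within a temporal window. However, all of these studies focus on RGB video rather than oriented objects. In camera views, the size and scale of an object may not stay constant as it moves away from or toward the camera. In contrast, the authors focus on radar data for autonomous driving and use bird-eye-view point-cloud-based images, in which object attributes such as size stay consistent across frames, unlike in RGB video. The authors design a temporal, anchor-free, one-stage detector that does not rely on pre-defined anchor parameters. A center-based detector also suits the bird-eye-view representation, because objects do not overlap in this view, so the center feature can fully represent an object. The authors do not model longer temporal dependencies and restrict themselves to consecutive frames, because vehicles may move out of range over longer periods, or no further temporal dependency information is available.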

Multiple Object Tracking

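The well-established paradigm for visual multiple object tracking [18] is tracking-by-detection [11], [23], [28]. An external detector provides the bounding boxes of detected objects, and an association technique based on object appearance or motion then links the candidate boxes of the same object across consecutive frames.
Recent work on multiple object tracking converts detectors into trackers that detect and track objects jointly [7], [39], [42]. The authors follow the same simple tracking scheme, based on a Euclidean distance cost [39], [42], to extend the framework to multiple object tracking. Unlike [39] and [42], which only stack frames from multiple time steps as input, the authors' network explicitly models object-level consistency.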

6. Conclusion

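The authors studied object recognition with radar for autonomous driving. Under the premise that objects keep consistent and shared attributes across consecutive frames, they use successive radar frames to improve temporally informed radar prediction.
They design an architecture with pluggable temporal relational layers that makes the model aware of object consistency, and demonstrate its effectiveness through experiments on object detection and multiple object tracking.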

Acknowledgement

The authors thank Petros T. Boufounos, Toshiaki Koike-Akino, Hassan Mansour, and Philip V. Orlik for helpful discussions.

Reference

Paper references

[1] Alan Ohnsman. Luminar Surges On Plan To Supply Laser Sensors For Nvidia’s Self-Driving Car Platform, 2021. 1
[2] Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman, and Ingmar Posner. The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset. In 2020 IEEE International Conference on Robotics and Automation, pages 6433–6438, 2020. 12
[3] Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, and Jonathan Huang. Context r-cnn: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13075–13085, 2020. 7
[4] I. Bilik, O. Longman, S. Villeval, and J. Tabrikian. The rise of radar for autonomous vehicles: Signal processing solutions and future research directions. IEEE Signal Processing Magazine, 36(5):20–31, Sep. 2019. 2
[5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 11
[6] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. arXiv preprint arXiv:2103.03404, 2021. 4
[7] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017. 8
[8] Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Improve vision transformers training by suppressing over-smoothing. arXiv preprint arXiv:2104.12753, 2021. 4
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6
[10] Jessie Lin and Hana Hu. Digitimes Research: 79GHz to replace 24GHz for automotive millimeter-wave radar sensors, 2017. 1
[11] Xiaolong Jiang, Peizhao Li, Yanjing Li, and Xiantong Zhen. Graph neural based end-to-end data association framework for online multiple-object tracking. arXiv preprint arXiv:1907.05315, 2019. 8
[12] Giseop Kim, Yeong Sang Park, Younghun Cho, Jinyong Jeong, and Ayoung Kim. Mulran: Multimodal range dataset for urban place recognition. In 2020 IEEE International Conference on Robotics and Automation, pages 6246–6253, 2020. 12
[13] J. Li and P. Stoica. MIMO Radar Signal Processing. John Wiley & Sons, 2008. 2
[14] Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, and Hongfu Liu. Selfdoc: Self-supervised document representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5652–5660, June 2021. 4
[15] Teck-Yian Lim, Amin Ansari, Bence Major, Daniel Fontijne, Michael Hamilton, Radhika Gowaikar, and Sundar Subramanian. Radar and camera early fusion for vehicle detection in advanced driver assistance systems. In Machine Learning for Autonomous Driving Workshop at the 33rd Conference on Neural Information Processing Systems, 2019. 2, 7
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 5, 6
[17] Bence Major, Daniel Fontijne, Amin Ansari, Ravi Teja Sukhavasi, Radhika Gowaikar, Michael Hamilton, Sean Lee, Slawomir Grzechnik, and Sundar Subramanian. Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019. 1, 7
[論文閱讀] DAY 1 Vehicle Detection With Automotive Radar Using Deep Learning on Range-Azimuth-Doppler_Dr. Qing的博客-CSDN博客
[18] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016. 7, 8, 11
[19] Arthur Ouaknine, Alasdair Newson, Julien Rebut, Florence Tupin, and Patrick Pérez. Carrada dataset: camera and automotive radar with range-angle-doppler annotations. In 2020 25th International Conference on Pattern Recognition, pages 5068–5075, 2021. 11
[20] Kun Qian, Shilin Zhu, Xinyu Zhang, and Li Erran Li. Robust multimodal vehicle detection in foggy weather using complementary lidar and radar signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 444–453, 2021. 1, 2, 7
[21] Karthik Ramasubramanian and Brian Ginsburg. AWR1243 sensor: Highly integrated 76–81-GHz radar front-end for emerging ADAS applications. In Texas Instruments Technical Report, pages 1–12, 2017. 2
[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015. 6, 7
[23] Samuel Schulter, Paul Vernaza, Wongun Choi, and Manmohan Chandraker. Deep network flow for multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6951–6960, 2017. 8
[24] Ole Schumann, Markus Hahn, Nicolas Scheiner, Fabio Weishaupt, Julius F Tilly, Jürgen Dickmann, and Christian Wöhler. Radarscenes: A real-world radar point cloud data set for automotive applications. arXiv preprint arXiv:2104.02493, 2021. 11
[25] Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, and Andrew Wallace. Radiate: A radar dataset for automotive perception. arXiv preprint arXiv:2010.09076, 2020. 1, 2, 5, 6, 12
[26] Mykhailo Shvets, Wei Liu, and Alexander C Berg. Leveraging long-range temporal relationships between proposals for video object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9756–9764, 2019. 7
[27] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019. 6
[28] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3539–3548, 2017. 8
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 4
[30] Pu Wang, Petros Boufounos, Hassan Mansour, and Philip V. Orlik. Slow-time MIMO-FMCW automotive radar detection with imperfect waveform separation. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8634–8638, 2020. 2
[31] Yizhou Wang, Gaoang Wang, Hung-Min Hsu, Hui Liu, and Jenq-Neng Hwang. Rethinking of radar’s role: A camera-radar dataset and systematic annotator via coordinate alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815–2824, 2021. 11
[32] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019. 7
[33] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3974–3983, 2018. 6
[34] Yuxuan Xia, Pu Wang, Karl Berntorp, Lennart Svensson, Karl Granström, Hassan Mansour, Petros Boufounos, and Philip V Orlik. Learning-based extended object tracking using hierarchical truncation measurement model with automotive radar. IEEE Journal of Selected Topics in Signal Processing, 15(4):1013–1029, 2021. 2, 7
[35] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision, pages 305–321, 2018. 7
[36] Bin Yang, Runsheng Guo, Ming Liang, Sergio Casas, and Raquel Urtasun. Radarnet: Exploiting radar for robust perception of dynamic objects. In European Conference on Computer Vision, pages 496–512, 2020. 2, 7
[37] Gang Yao, Perry Wang, Karl Berntorp, Hassan Mansour, P Boufounos, and Philip V Orlik. Extended object tracking with automotive radar using b-spline chained ellipses model. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8408–8412, 2021. 2, 7
[38] Jingru Yi, Pengxiang Wu, Bo Liu, Qiaoying Huang, Hui Qu, and Dimitris Metaxas. Oriented object detection in aerial images with box boundary-aware vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2150–2159, 2021. 5, 6
[39] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11784–11793, 2021. 8
[40] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE access, 8:58443–58469, 2020. 1
[41] Shuqing Zeng and James N. Nickolaou. Automotive radar. In Gregory L. Charvat, editor, Small and Short-Range Radar Systems, chapter 9. CRC Press, Inc., 2014. 1
[42] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490, 2020. 5, 6, 8, 11
[43] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 6
[44] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017. 7
[45] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017. 7

Other references

毫米波雷達是什麼？自動駕駛、智慧家庭都少不了它！｜大和有話說 — 大和有話說
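- The 77 GHz radar band spans 76~77 GHz and the 79 GHz radar band spans 77~81 GHz, but the two are usually not distinguished and are referred to together as 77/79 GHz radar. Roughly speaking, 77 GHz radar can detect at longer range, but used alone its bandwidth is too narrow. The TI IWR 1843, for example, covers 76~81 GHz with 4 GHz of available bandwidth.
- For the narrow-bandwidth issue, see the TI documentation.
- The radar detection angle is 10~70 degrees.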
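TI training, Chinese-language videos
- Mostly introductions to applications.
- The difference between TI's IWR and AWR radar series lies in the safety grade of the chips.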
圖解傅立葉分析 — HackMD
如何理解Inductive bias？ — 知乎
CNN的平移不变性是什么？_ytusdc的博客-CSDN博客
ClearWay Technical Specifications — Navtech Radar
LRR stands for long range radar, MRR for medium range radar, and SRR for short range radar.