LOFT: Learned Open Flow-based Tracking
Qingyuan Li Alexandru Luchianov
Final project for MIT 6.7960
Outline

Preface

Motivation

Related works

Model inputs

U-Net

Dataset Collection

Metrics

Evaluation

Conclusion

Codebase

References

Appendix

Example frame from our algorithm. Outputs from top left, clockwise: RGB image with annotated dynamic objects, depth image, raw model output mask, post-processed model output mask.

Preface

One of Qingyuan Li's previous projects involved dynamic object tracking using optical flow residual information; however, the results were lacking. This section provides context on what the previous approach did, why it fell short, and why deep learning is the solution.

In very succinct terms, the core pipeline of the previous project could be described as such:

  1. Gather data for the dynamic object tracking experiment.
  2. Extract the optical flow from the images using a known model (RAFT).
  3. Process the optical flow and the original frame to obtain the objects.
The idea seems simple in principle, so simple, in fact, that this project could also be described by a similar list of steps. However, "processing the optical flow" hides a lot of abstraction underneath. In the previous project, this processing meant tagging every pixel moving faster than a certain velocity as a "dynamic object" and then post-processing the resulting binary mask with OpenCV morphological operations to remove the noise. This game of post-processing was akin to Whac-A-Mole: opening an image would remove small points of noise, closing would remove small holes, but the noise remained noticeable. (See Video 1 for an example; a minimal sketch of this baseline is shown after Video 1.) The winning strategy of this game was, obviously, not to play at all. For this project, we decided that a model is much better suited to learning which set of transformations works best.
Video 1. Example output sequence from the evaluation dataset with the non-learned algorithm
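As a concrete point of reference, the following is a minimal sketch of that thresholding-plus-morphology baseline (not the exact original code). It assumes a per-pixel speed estimate in meters per second; the 0.1 m/s threshold and 9x9 kernel match the values reported in the evaluation settings later in this report.

```python
import cv2
import numpy as np

def threshold_and_clean(speed: np.ndarray, v_thresh: float = 0.1,
                        kernel_size: int = 9) -> np.ndarray:
    """Tag fast-moving pixels as dynamic, then clean the mask with morphology.

    speed: (H, W) estimated speed of the point seen at each pixel, in m/s.
    Returns a uint8 binary mask (0 or 255).
    """
    # Tag every pixel moving faster than the velocity threshold.
    mask = (speed > v_thresh).astype(np.uint8) * 255

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    # Opening removes small isolated specks of noise...
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # ...and closing fills small holes inside the detected regions.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```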

Unfortunately, replacing the manual transformation of the data with a learned model subtly (and significantly) changes the other steps of the pipeline too. The most significant change concerns the ground truth that is necessary. In the previous iteration of the project, our evaluation metric was merely whether the detected location of the object was close enough to where we knew the object to be. However, our segmentation model requires the full segmentation of the object as ground truth.

We could not collect this ground truth ourselves, as we were only able to record the approximate pose of the objects. Note that if we had the shapes of the objects, we could have generated the ground truth by simply projecting them. Our solution is to generate the ground truth using the Segment Anything model ([2]) and then train our smaller model on it. Training on this synthetic data makes our model inherently less accurate than SAM; however, our model is much lighter.

Motivation

Unmanned aerial vehicles and unmanned ground vehicles have numerous applications in fields such as delivery, photography, and monitoring. However, being unmanned leaves them dependent on signal coverage and a stable connection in order to be teleoperated. To overcome this limitation, much effort has been directed toward autonomous control.

Unfortunately, autonomous control has its own set of limitations. Where a human could operate with only limited signal, an autonomous device often requires significantly more sensors. For example, a Waymo car employs a total of 29 cameras in addition to lidar. This extensive sensor payload is one reason why adoption remains limited even for low-risk applications.

We aim to provide a lightweight solution for performing real-time dynamic obstacle recognition.

Model inputs

Since optical flow is one of the main ingredients of our pipeline, it is natural that we dedicate some attention to it. Because it is a relatively common problem, performant solutions such as RAFT ([6]) already exist. In fact, RAFT is not merely one solution among many: it is the primary one, sitting at the base of most of the related works. Even though optical flow can be summarized roughly as "how did a pixel move between frames" and then used as a black box, some context about this black box helps us make an informed decision about the rest of our architecture.

Formally, optical flow solves the following optimization problem: \[ E = \min_{\mu} \left[ \left(\nabla^{T}\mathbf{I}_1\,\mu - (\mathbf{I}_2-\mathbf{I}_1)\right)^2+\alpha\left(|\nabla \mu_x|^2+|\nabla \mu_y|^2\right) \right] \]

There are two core assumptions that motivate the above formulation:

  1. Color constancy: Pixels won't spontaneously change colors.
  2. Velocity averaging: Nearby pixels move together.

Obviously, these assumptions are not universal truths, and one can construct scenarios in which they do not hold, ranging from the contrived (we are trying to follow a chameleon, or every pixel has a mind of its own and moves independently) to the quite mundane (lighting conditions change, or it is raining). In our case, we performed our data collection in an indoor environment where such anomalies can be controlled and minimized.

A pitfall of the optimization-problem formulation is that we often do not have just two adjacent frames, but rather an entire video. Solving an optimization problem for every pair of adjacent frames would be extremely computationally expensive. RAFT circumvents this by extracting per-pixel features and building multi-scale 4D correlation volumes. Aside from the efficiency gains, this representation confers some additional robustness against lighting changes and blur.
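For context on how off-the-shelf RAFT flow can be obtained in practice, here is a small sketch using the pretrained RAFT model shipped with torchvision. Our experiments used a Sintel-trained RAFT checkpoint; the weights and preprocessing shown here are illustrative, not our exact setup.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

# Two consecutive RGB frames, (N, 3, H, W) uint8; H and W must be multiples of 8.
frame1 = torch.randint(0, 256, (1, 3, 368, 496), dtype=torch.uint8)
frame2 = torch.randint(0, 256, (1, 3, 368, 496), dtype=torch.uint8)
frame1, frame2 = preprocess(frame1, frame2)

with torch.no_grad():
    # RAFT is recurrent and returns one flow estimate per refinement iteration.
    flow_predictions = model(frame1, frame2)
flow = flow_predictions[-1]  # (N, 2, H, W): per-pixel (du, dv) in pixels
```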

For the sake of brevity, we only list the model inputs here (for their derivation, please refer to the appendix): the RGB and depth images of the current frame (and, for some models, of the previous frame), the 2D and 3D RAFT flow, the 2D and 3D geometric flow, and the 2D and 3D flow residuals (RAFT flow minus geometric flow).

These inputs are later fed into the U-Net in the form of a single image with enough channels to hold all of the above.

U-Net

Image segmentation is a notoriously hard problem to annotate data for. Rather than annotating an ever-increasing amount of data, some projects have opted to make more efficient use of the existing annotated data. This led to the creation of the U-Net ([7]). Aside from their efficiency in terms of training data, these models are small, fast, and reliable, which makes them a natural base on which to build our custom model.

Succinctly, the U-Net architecture consists of a contracting path and an expansive path. The contracting path reduces the resolution of the image by downsampling while adding channels in order to propagate context. The expansive path does the exact opposite: it upsamples while reducing the number of channels. This gives the network a "U" shape, hence the name. The details of the paths are more easily apparent in Figure 1 below:

Traditional U-Net architecture diagram showing the encoder, decoder, and skip connections.
Figure 1: The U-Net Architecture.
The main distinction between the traditional architecture and ours is that instead of a standard three-channel RGB image, our input carries additional channels that provide extra context, mostly depth information and optical flow information. Note that the optical flow itself can be presented in several ways: 2-dimensional or 3-dimensional, and as RAFT flow, geometric flow, or their residual.
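To make this concrete, here is a minimal sketch of how the extra channels are assembled into the network input. Channel counts and tensor names are illustrative; in the implementation we build on [8], only the number of input channels of the U-Net changes.

```python
import torch

# Illustrative inputs for a "2xRGBD + 2D Residuals"-style model, batch size 1.
rgb_cur     = torch.rand(1, 3, 480, 640)   # current RGB
depth_cur   = torch.rand(1, 1, 480, 640)   # current depth
rgb_prev    = torch.rand(1, 3, 480, 640)   # previous RGB
depth_prev  = torch.rand(1, 1, 480, 640)   # previous depth
residual_2d = torch.rand(1, 2, 480, 640)   # RAFT flow minus geometric flow

# Stack everything along the channel dimension: (1, 10, 480, 640).
x = torch.cat([rgb_cur, depth_cur, rgb_prev, depth_prev, residual_2d], dim=1)

# The U-Net is then constructed with a matching number of input channels, e.g.
#   net = UNet(n_channels=x.shape[1], n_classes=1)
#   logits = net(x)   # (1, 1, 480, 640) per-pixel segmentation logits
```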

We use the Dice coefficient to evaluate the outputs of the U-Net and to train it. Note that this metric only measures the quality of the segmentation, which is merely an intermediate product; the overall metric that we use for dynamic object detection is different. As a reminder, the Dice formula is $$DSC = \frac{2|X \cap Y|}{|X| + |Y|}$$ If our collected data were free of noise, X would be the set of segmented pixels in the ground truth and Y the set of segmented pixels in the output of our model. However, due to imperfect data collection, some of our depth pixels were NaN. To account for these equipment errors, we only include the unaffected pixels in this metric.
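A minimal sketch of such a masked Dice score is given below; the exact training loss in our code may differ in details such as smoothing.

```python
import torch

def masked_dice(pred: torch.Tensor, target: torch.Tensor,
                valid: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice coefficient computed only over pixels with valid depth.

    pred, target: (N, H, W) predicted and ground-truth masks in [0, 1].
    valid: (N, H, W) boolean mask that is False wherever the depth was NaN.
    """
    pred = pred * valid
    target = target * valid
    intersection = (pred * target).sum(dim=(1, 2))
    total = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return ((2 * intersection + eps) / (total + eps)).mean()

# A Dice-based training loss is then simply 1 - masked_dice(pred, target, valid).
```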

The specific U-Net implementation on which we based our work can be found here at [8].

Dataset Collection

We collected all of the training and evaluation data at the Highbay of the Aerospace Controls Laboratory at MIT. This controlled environment provided stable lighting and weather conditions, so that the optical flow could be estimated accurately, and contains a ground-truth motion capture system for evaluation. We remind the reader that our aim is to measure the efficacy of the techniques applied to process the optical flow and image data, not to optimize optical flow estimation itself.

In terms of hardware, we used a RealSense D455 depth camera on a Clearpath Husky equipped with an Ouster lidar and a MicroStrain IMU running lidar-inertial odometry. The moving objects are a teleoperated hexacopter and a second ground robot.

The ground-truth pose data was collected using a Vicon motion capture system and recorded in a ROS 2 (Robot Operating System) bag, while the Husky recorded a ROS 1 bag with depth camera and pose data. We use our own ROS bag processing software for dataloading. Note that the ground truth collected above only provides pose information. The ground-truth poses, combined with a rough estimate of the shapes of the robots, allowed us to compute approximate bounding boxes, which we then used as box prompts for SAM ([2]) to generate ground truth for the image segmentation.
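A sketch of this labeling step with the segment_anything package is shown below. The checkpoint path and helper name are illustrative; in practice, the bounding boxes come from projecting the motion-capture poses and rough robot shapes into the image.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load an official SAM checkpoint (ViT-H shown here).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def box_prompted_mask(image_rgb: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Generate a pseudo ground-truth mask for one object from its bounding box.

    image_rgb: (H, W, 3) uint8 RGB image; box_xyxy: [x0, y0, x1, y1] in pixels.
    """
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
    return masks[0]  # (H, W) boolean mask used as training ground truth
```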

Metrics

We used the standard precision and recall metrics to quantify the performance of our model. Since the ground-truth data and the visual data are collected by different machines, there may be gaps or discrepancies between their timestamps; to account for this, we only compute the metrics for frames for which a close enough ground-truth measurement exists. Let the model return a list of detected dynamic objects and their poses.

We consider an object relevant only if we could feasibly detect it, meaning that it is moving with high enough velocity (at least 0.1 m/s) and its centroid lies within the boundary of the current image frame. A detection counts as correct if its position is within one meter of the true location of a relevant object. Based on this definition of relevance, we report precision (the fraction of detections that correspond to a relevant object) and recall (the fraction of relevant objects that are detected).

All evaluations were run with the RAFT Sintel model. Opening and closing with a 9x9 kernel were applied as morphological post-processing on all binary masks to reduce point noise. Detected objects with a maximum standard deviation below 0.05 m or above 1.0 m are ignored, as are objects beyond the depth range.
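For clarity, the per-frame bookkeeping behind these precision and recall numbers can be sketched as follows. This is a simplified illustration of the matching logic, not our exact evaluation code.

```python
import numpy as np

def frame_precision_recall(detected: np.ndarray, relevant: np.ndarray,
                           dist_thresh: float = 1.0):
    """Count TP/FP/FN for one frame.

    detected: (D, 3) centroids of detected dynamic objects (meters).
    relevant: (G, 3) ground-truth objects that are in-frame and moving >= 0.1 m/s.
    A detection is a true positive if it lies within dist_thresh of an
    unmatched relevant object.
    """
    matched = np.zeros(len(relevant), dtype=bool)
    tp = 0
    for det in detected:
        if len(relevant) == 0:
            break
        dists = np.linalg.norm(relevant - det, axis=1)
        dists[matched] = np.inf            # each ground-truth object matches once
        j = int(np.argmin(dists))
        if dists[j] <= dist_thresh:
            matched[j] = True
            tp += 1
    fp = len(detected) - tp
    fn = len(relevant) - tp
    return tp, fp, fn

# Precision = sum(tp) / (sum(tp) + sum(fp)); recall = sum(tp) / (sum(tp) + sum(fn)),
# accumulated over all frames with a close-enough ground-truth timestamp.
```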

Evaluation

Table 1. Comparison of the inputs included in each model. Candidate inputs (columns): 2D and 3D RAFT flow, 2D and 3D geometric flow, 2D and 3D residual flow, and the depth and RGB images of the current and previous frames. Models (rows): Non-learned, 1xRGBD Only, 1xRGBD + 2D Residuals, 1xRGBD + 3D Residuals, 2xRGBD Only, 2xRGBD + 2D Residuals, 2xRGBD + 3D Residuals, 2xRGBD Full.

Table 2. Comparison of model performance

| Model | Dice coefficient | Recall | Precision |
| --- | --- | --- | --- |
| Non-learned | 0.475 | 0.713 | 0.320 |
| 1xRGBD Only | 0.440 | 0.973 | 0.510 |
| 1xRGBD + 2D Residuals | 0.499 | 0.934 | 0.499 |
| 1xRGBD + 3D Residuals | 0.478 | 0.886 | 0.424 |
| 2xRGBD Only | 0.491 | 0.908 | 0.523 |
| 2xRGBD + 2D Residuals | 0.474 | 0.936 | 0.468 |
| 2xRGBD + 3D Residuals | 0.464 | 0.940 | 0.462 |
| 2xRGBD Full | 0.528 | 0.904 | 0.538 |


Video 2. Example output sequence from the evaluation dataset (2xRGBD Full)
The full evaluation video for 2xRGBD Full can be found here.

Conclusion

We observed significantly better results with the model-based approaches when compared to the original non-learned algorithm in both precision and recall, achieving our objective of real-time dynamic object detection with high accuracy.

We note that the recall of the models was significantly higher than their precision. After examining the videos, we concluded that this is due to noisy frames with many false positives, in which the RAFT and geometric flow diverge significantly, most likely because of noisy odometry or depth. When this happens, the whole frame becomes noisy and many spurious objects may be detected. In our training and evaluation data there are at most two objects, so only a limited number of false negatives can be incurred per frame, and our models tend to be aggressive in their detections.

Interestingly, though perhaps not surprisingly, the model that uses only RGBD data without any flow information already outperforms the non-learned method by a large margin. Much of this improvement can therefore probably be attributed to the model learning to detect robots (moving or not) from the RGB and depth data alone. Still, the model that used all of the inputs outperformed all other methods in Dice coefficient and precision, suggesting that the optical flow residuals do provide useful information for dynamic object detection. These particular improvements suggest that the model is better able to distinguish between moving and static robots, slightly reducing false positives (though not dramatically, given the high noise ratio), which reflects the value of the optical flow information. To better understand this behavior, we would need to collect and train on data in which the robots are stationary a majority of the time.

As seen in Figure 2, the model also generalizes to humans, despite never having seen any human data during training, suggesting highly promising generalization capabilities.

Human being detected.
Figure 2: a moving human being detected

We believe that the performance of our models could be further improved by providing them with additional training data and more training time, especially due to the high false positive rate. Given how hard it is to generate meaningful training data, data augmentation methods should be explored in order to allow for creating larger datasets from pre-existing data. To leverage the additional data, a more complex model, possibly using attention mechanisms, could be explored as well. However, model complexity would likely come at the cost of inference speed.

Another area of improvement is temporal consistency. Currently, each pair of frames is treated independently; in practice, enforcing temporal consistency could provide significant improvements. Methods such as LSTMs or 3D convolutions could be explored for this purpose, though they too would likely slow inference down significantly.

In conclusion, we believe that our method provides a lightweight and efficient solution for dynamic object detection. Future work could explore more complex models and temporal consistency in order to further improve performance, but over 90% recall and over 50% precision is a promising result for a model trained from scratch with limited data.

Codebase

See [dynamic-object-detection on Github] for our implementation. Instructions for downloading our data and running the algorithm and evaluations are in the README. Note that this repository only contains the code responsible for the dynamic object detection side of things. Our fork of the U-Net repository is linked as a submodule and contains more training and evaluation information.

Acknowledgements

We would like to thank Professor Jonathan How for his guidance throughout this project and for generously providing access to the resources that we utilized. Special thanks as well to Lucas Jia for assisting us with physical experiments, data collection and providing us with hardware powerful enough to train our model.

References

[1] Moving Object Segmentation: All You Need Is SAM (and Flow), Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman, 2024

[2] Segment Anything, Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick, 2023

[3] SAM 2: Segment Anything in Images and Videos, Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, and 14 others, 2024

[4] SAM 3: Segment Anything with Concepts, Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, and 13 others, 2025

[5] DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation, Suhwan Cho, Minhyeok Lee, Jungho Lee, Donghyeong Kim, and Sangyoun Lee, 2025

[6] RAFT: Recurrent All-Pairs Field Transforms for Optical Flow, Zachary Teed and Jia Deng, 2020

[7] U-Net: Convolutional Networks for Biomedical Image Segmentation, Olaf Ronneberger, Philipp Fischer, and Thomas Brox, 2015

[8] Pytorch-UNet: PyTorch implementation of the U-Net for image semantic segmentation with high quality images, Milesial, 2024

[9] P. L. et al., "RST-MODNet: Real-time Spatio-temporal Moving Object Detection for Autonomous Driving," 2019.

[10] N. Huang, W. Zheng, C. Xu, K. Keutzer, S. Zhang, A. Kanazawa, and Q. Wang, "Segment Any Motion in Videos," *Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2025.

[11] M. Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," *Trans. Mach. Learn. Res. (TMLR)*, 2024.

Appendix

This analysis of the difference between the RAFT optical flow and geometric optical flow is reproduced from the previous project.
We obtain our flow residuals by computing the difference between the RAFT optical flow and the "geometric optical flow", or optical flow resulting solely from camera-induced scene flow.

For each pixel \( [u, v] \) with corresponding depth \( d \) in the first frame, we first unproject the pixel coordinates into 3D space using the depth data and camera intrinsics: \[ p_1=d K^{-1} [u, v, 1]^T \] We assume that \( p_1 \) is static in the 3D scene. We transform the 3D point \( p_1 \), which is in the reference frame of the camera at frame 1, into the reference frame of the camera at frame 2. Given that the robot pose is \( T_{world}^{r1} \) at frame 1 and \( T_{world}^{r2} \) at frame 2, and that the transform between the odometry frame and camera frame is \( T_{odom}^{cam} \), we can transform the point as follows: \[ T_{r2}^{r1} = (T_{world}^{r2})^{-1} \cdot T_{world}^{r1} \] \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \]

Finally, we project the point \( p_2 \) into the image plane to get its pixel location in the second frame: \[ [u', v'] = [f_x \frac{p_{2,x}}{p_{2,z}} + c_x, f_y \frac{p_{2,y}}{p_{2,z}} + c_y] \] where \( f_x, f_y \) are the focal lengths of the camera and \( c_x, c_y \) are the coordinates of the camera center in the image plane. Then, \( [u', v'] - [u, v] \) gives us the geometric optical flow at \( [u, v] \) in frame 1.

The key observation is the assumption that \( p_1 \) is static in the 3D scene. If \( p_1 \) moves \( \delta = [\delta_x, \delta_y, \delta_z] \) (in the camera reference frame at frame 2) between frame 1 and frame 2, we get the following image projection at frame 2: \[ [u'_{dynamic}, v'_{dynamic}] = [f_x \frac{p_{2,x} + \delta_x}{p_{2,z} + \delta_z} + c_x, f_y \frac{p_{2,y} + \delta_y}{p_{2,z} + \delta_z} + c_y] \] Since RAFT estimates the actual observed optical flow from frame 1 to frame 2, it will return an optical flow close to \( [u'_{dynamic}, v'_{dynamic}] - [u, v] \) at \( [u, v] \). Then, the residual (RAFT minus geometric flow) at \( [u, v] \) is \( [u'_{dynamic}, v'_{dynamic}] - [u', v'] \).
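This derivation maps almost directly to code. Below is a vectorized sketch of the geometric flow computation (the function and argument names are ours, not from our codebase); the pose arguments are 4x4 homogeneous transforms and K is the 3x3 intrinsic matrix.

```python
import numpy as np

def geometric_flow(depth1: np.ndarray, K: np.ndarray,
                   T_world_r1: np.ndarray, T_world_r2: np.ndarray,
                   T_odom_cam: np.ndarray) -> np.ndarray:
    """Optical flow induced purely by camera motion, per pixel of frame 1.

    depth1: (H, W) depth image of frame 1 in meters.
    Returns flow of shape (H, W, 2) containing (u' - u, v' - v).
    """
    H, W = depth1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N

    # Unproject frame-1 pixels into 3D camera coordinates: p1 = d * K^{-1} [u, v, 1]^T
    p1 = (np.linalg.inv(K) @ pix) * depth1.reshape(1, -1)               # 3 x N

    # Relative robot motion between the frames, moved into the camera frame.
    T_r2_r1 = np.linalg.inv(T_world_r2) @ T_world_r1
    T = np.linalg.inv(T_odom_cam) @ T_r2_r1 @ T_odom_cam
    p1_h = np.vstack([p1, np.ones((1, p1.shape[1]))])                   # homogeneous
    p2 = (T @ p1_h)[:3]                                                  # 3 x N

    # Project into frame 2: [u', v'] = [fx*x/z + cx, fy*y/z + cy]
    u2 = K[0, 0] * p2[0] / p2[2] + K[0, 2]
    v2 = K[1, 1] * p2[1] / p2[2] + K[1, 2]

    flow = np.stack([u2 - pix[0], v2 - pix[1]], axis=-1)                 # N x 2
    return flow.reshape(H, W, 2)
```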

Because we have the depth data for both frame 1 and frame 2, in theory we can compute the full 3D movement of each pixel between frames. Given RAFT flow \( [\delta u_{raft}, \delta v_{raft}] \) at pixel \( [u, v] \), we can compute the 3D residual as follows: \[ p_2 = (T_{odom}^{cam})^{-1} \cdot T_{r2}^{r1} \cdot T_{odom}^{cam} \cdot [p_1^T \; 1]^T \] \[ p_{2, raft} = d_2 K^{-1} [u + \delta u_{raft}, v + \delta v_{raft}, 1]^T \] \[ \delta \approx p_{2, raft} - p_2 \] where \( d_2 \) is the depth of pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2. This method works by assuming that RAFT correctly predicts pixel correspondences, in which case the 3D point corresponding to pixel \( [u + \delta u_{raft}, v + \delta v_{raft}] \) in frame 2 is the same as the 3D point corresponding to pixel \( [u, v] \) in frame 1. Since \( p_2 \) accounts for the camera motion between frames but not the motion of the point itself, while RAFT accounts for both, the difference is the approximate 3D motion of the point between frames.
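The same idea at a single pixel, as a sketch (reusing the composite camera-frame transform from the previous snippet and ignoring boundary checks and NaN handling for brevity):

```python
import numpy as np

def residual_3d(u, v, d1, du_raft, dv_raft, depth2, K, T_cam2_cam1):
    """Approximate 3D motion of the point seen at pixel (u, v) in frame 1.

    d1: depth at (u, v) in frame 1; depth2: (H, W) depth image of frame 2;
    T_cam2_cam1: 4x4 camera-frame transform composed as described above.
    """
    Kinv = np.linalg.inv(K)
    # Camera-motion-compensated prediction of the point in frame-2 coordinates.
    p1 = d1 * (Kinv @ np.array([u, v, 1.0]))
    p2 = (T_cam2_cam1 @ np.append(p1, 1.0))[:3]

    # Where RAFT says the pixel actually went, unprojected with frame-2 depth.
    u2, v2 = u + du_raft, v + dv_raft
    d2 = depth2[int(round(v2)), int(round(u2))]
    p2_raft = d2 * (Kinv @ np.array([u2, v2, 1.0]))

    return p2_raft - p2   # approximate 3D motion delta of the point
```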

Although we could theoretically recover the depth of \( [u'_{dynamic}, v'_{dynamic}] \) from the depth data for frame 2, in practice, noisy depth data and imperfect pixel associations make this a noisy process. Instead, we assume that the object maintains constant depth between frames (\( \delta_z = 0 \)) and use the residual to recover only \( [\delta_x, \delta_y] \): \[ [\delta_x, \delta_y] = p_{2,z} ([u'_{dynamic}, v'_{dynamic}] - [u', v']) \cdot [\frac{1}{f_x}, \frac{1}{f_y}] \] This assumption works reasonably well in practice, although it is not effective at detecting objects moving along the camera Z axis.

Finally, given \( v=\|[\delta_x, \delta_y]\| \) as an estimate for the velocity of the 3D point corresponding to each pixel, we can mask areas of the image with \( v \) higher than a set threshold to obtain masks of dynamic objects in frame 1.
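A sketch of this final step under the stated constant-depth assumption. The threshold value and the conversion to meters per second via the frame interval are illustrative; the derivation above works directly with per-frame displacements.

```python
import numpy as np

def dynamic_mask_from_residual(residual_px: np.ndarray, z2: np.ndarray,
                               K: np.ndarray, dt: float,
                               v_thresh: float = 0.1) -> np.ndarray:
    """Threshold the flow residual to obtain a dynamic-object mask for frame 1.

    residual_px: (H, W, 2) RAFT flow minus geometric flow, in pixels.
    z2: (H, W) predicted depth of each point in the frame-2 camera (p_{2,z}).
    dt: time between the two frames, in seconds.
    """
    fx, fy = K[0, 0], K[1, 1]
    # Constant-depth approximation: recover metric in-plane motion from pixels.
    dx = z2 * residual_px[..., 0] / fx
    dy = z2 * residual_px[..., 1] / fy
    speed = np.hypot(dx, dy) / dt          # meters per second
    return speed > v_thresh                # boolean dynamic-object mask
```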