Object Tracking For Dummies

Mar 28, 2022

In this article, we will delve into the computer vision task of object tracking. We’ll cover what object tracking is and how it is used, why it is needed, the challenges tracking algorithms face, and some popular approaches such as DeepSORT. Before we begin, it’s recommended that you have a basic understanding of object detection; if you don’t, read an introduction to object detection first to get up to speed.

What Is Object Tracking?

Object tracking is a computer vision task that takes an initial set of object detections, builds a visual model of each object, and follows the objects as they move around in a video. Simply put, object tracking automatically identifies objects in a video, assigns each one a unique ID, and maps its path of motion.

We know videos are just sequences of images, and that object detectors can locate objects in images. This raises the question: “Why not just use object detection?” There are a few problems with that. Take, for instance, a video with multiple objects. If you use only object detection, you have no way of connecting the objects in the current frame to those in previous or later frames. If the object you are tracking leaves the camera’s view for a few frames and then comes back, you have no way of knowing whether it’s the same object. Basically, detection works on one image at a time and has no contextual information about the current and past motion of the objects.

Another issue with using object detection frame-by-frame is the higher computation cost associated with it. As detection algorithms have no contextual information, they scan the whole image for the object. Object tracking approaches, on the other hand, use contextual information about the direction and velocity of motion to narrow down the search space for the object. This can make object tracking algorithms more computationally efficient and more feasible to deploy on edge devices with limited resources.

Object tracking has a wide range of applications in computer vision, such as surveillance, human-computer interaction, traffic flow monitoring, and human activity recognition.

How Does Object Tracking Work?

Although there is a wide range of techniques and algorithms that approach the object tracking task in different ways, most of them rely on two key components:

Visual Appearance Model

Most tracking algorithms need to understand and model the visual appearance of the target object before they can track it. If the goal is to track only one object, the visual appearance model alone can be enough. However, if there are multiple target objects, the algorithm will need something extra. Visual appearance models are usually initialized using an object detection model, which is why you’ll often see object detectors used in conjunction with tracking models.
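To make the idea of an appearance model concrete, here is a minimal sketch of one of the simplest possible choices: a color histogram of the object’s image patch, compared by histogram intersection. This is an illustration only, not the model used by any tracker covered in this article (modern trackers typically rely on learned features), and the function names and the bin count are assumptions made for the example.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized per-channel histogram of an HxWx3 uint8 image patch."""
    # One histogram per color channel, concatenated into a single descriptor.
    hists = [
        np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; higher means the two patches look more alike."""
    return float(np.minimum(h1, h2).sum())

# Hypothetical patches: two reddish crops and one bluish crop.
red_a = np.zeros((16, 16, 3), dtype=np.uint8)
red_a[..., 0] = 200
red_b = red_a.copy()
blue = np.zeros((16, 16, 3), dtype=np.uint8)
blue[..., 2] = 200

same = histogram_intersection(color_histogram(red_a), color_histogram(red_b))
diff = histogram_intersection(color_histogram(red_a), color_histogram(blue))
print(same > diff)
```

A tracker built this way would store the histogram of the detected object and look for the candidate patch in the next frame that scores highest against it.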

Motion Model

The differentiating component of good tracking models is the ability to understand and estimate the motion of the object. The motion model captures the dynamic behavior of the objects and estimates their potential position in future frames. This helps reduce the search space for the visual appearance models. Kalman filter, optical flow, Kanade-Lucas-Tomasi (KLT) feature tracker, and mean-shift are some examples of motion modeling techniques.
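As a rough illustration of a motion model, the sketch below implements a small constant-velocity Kalman filter over the state [x, y, vx, vy] with NumPy. The class name and the noise values (the initial uncertainty, Q, and R) are assumptions picked for the example, not tuned settings from any particular tracker.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter over [x, y, vx, vy] that observes position only."""

    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0          # initial uncertainty (assumed)
        self.F = np.array([[1, 0, dt, 0],  # position += velocity * dt
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],   # we measure x and y only
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01          # process noise (assumed)
        self.R = np.eye(2) * 1.0           # measurement noise (assumed)

    def predict(self):
        """Advance the state one frame and return the predicted (x, y)."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2].copy()

    def update(self, measurement):
        """Correct the prediction with an observed (x, y) position."""
        z = np.asarray(measurement, dtype=float)
        residual = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ residual
        self.P = (np.eye(4) - K @ self.H) @ self.P

# An object moving 2 px right and 1 px down per frame.
kf = ConstantVelocityKalman(0.0, 0.0)
for k in range(1, 11):
    kf.predict()
    kf.update((2.0 * k, 1.0 * k))
print(kf.predict())  # estimated position in the next frame
```

The predicted position is where the tracker would center its search in the next frame, which is how a motion model shrinks the search space.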

Challenges Faced By Object Tracking Algorithms

Confusing Background

The background of the video frames used to train tracking models can have a severe impact on their accuracy. If the background is cluttered, constantly changing, or similar to the object, it becomes harder for the model to accurately track small objects. When the background is blurry, out-of-focus, or static, it’s easier for tracking models to do their job.


Occlusion

Occlusion is one of the most prevalent obstacles to seamless object tracking. It happens when the object being tracked gets hidden by other objects, for instance, when two people walk past each other or a ball is caught by a player. The challenge here is deciding what the tracking model should do when an object disappears and then reappears.
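One common way to cope with short occlusions is to keep a track alive for a few frames after it stops being matched, so it can pick up its old ID if the object reappears. The sketch below shows that bookkeeping only; the `Track` class, the `update_track_ages` helper, and the `max_age` value are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    missed_frames: int = 0  # consecutive frames without a matching detection

def update_track_ages(tracks, matched_ids, max_age=3):
    """Return the tracks that are still alive after this frame.

    A track that was matched has its miss counter reset. An unmatched
    track is kept until it has been missing for more than max_age
    consecutive frames.
    """
    survivors = []
    for track in tracks:
        if track.track_id in matched_ids:
            track.missed_frames = 0
        else:
            track.missed_frames += 1
        if track.missed_frames <= max_age:
            survivors.append(track)
    return survivors

# Track 2 is hidden for two frames, then reappears with the same ID.
tracks = [Track(1), Track(2)]
for matched in [{1, 2}, {1}, {1}, {1, 2}]:
    tracks = update_track_ages(tracks, matched)
print([t.track_id for t in tracks])
```

A larger `max_age` tolerates longer occlusions but also keeps stale tracks around longer, which raises the risk of handing an old ID to a different object.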

Multiple Camera Angles

Depending on the use case, tracking methods often have to work with multiple perspectives captured by multiple cameras. Take for example the use of tracking in sports analytics. These changing perspectives can significantly change what the object looks like. In cases like these, the features used to identify and track the object need to be invariant to changing perspective for the tracking model to perform well.

Popular Object Tracking Models

Recurrent YOLO (ROLO)

Recurrent YOLO (ROLO) is a spatially supervised recurrent convolutional neural network that combines YOLO with an LSTM. ROLO uses a YOLO module to collect visual features, along with location inference priors, and an LSTM network to find the trajectory of target objects. For each frame, the LSTM uses an input feature vector of length 4096 (obtained by concatenating the high-level visual features and YOLO’s detection) to infer the location of the target object.

Simple Online And Realtime Tracking (SORT)

The central idea of SORT is to use both the position and size of the bounding boxes for motion estimation and for data association across frames. Faster R-CNN is used as the object detector. The displacement of each object between consecutive frames is estimated by a linear constant-velocity model that is independent of other objects and of camera motion. The state of each target is defined as $x = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$, where (u, v) is the center of the bounding box, and s and r are its scale and aspect ratio. The remaining variables, $\dot{u}$, $\dot{v}$, and $\dot{s}$, are the respective velocities.
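The conversion between a corner-format bounding box and the (u, v, s, r) part of this state can be sketched as follows. The sketch assumes that scale s means the box area and that aspect ratio r means width divided by height; the function names are hypothetical.

```python
import math

def bbox_to_state(x1, y1, x2, y2):
    """Convert corners (x1, y1, x2, y2) to (u, v, s, r).

    Assumes s is the box area and r is width / height.
    """
    w = x2 - x1
    h = y2 - y1
    u = x1 + w / 2.0
    v = y1 + h / 2.0
    return u, v, w * h, w / h

def state_to_bbox(u, v, s, r):
    """Invert bbox_to_state: recover the corners from (u, v, s, r)."""
    w = math.sqrt(s * r)  # since s = w * h and r = w / h, s * r = w ** 2
    h = s / w
    return u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0

state = bbox_to_state(10, 20, 50, 100)
print(state)
print(state_to_bbox(*state))
```

Tracking area and aspect ratio rather than width and height lets the motion model give the area its own velocity while treating the aspect ratio as constant, which matches the state vector having no velocity term for r.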

For ID assignment, i.e., the data association task, each target’s state is used to predict its bounding box in the current frame, and the predicted boxes are then compared with the detected boxes. The intersection-over-union (IoU) metric and the Hungarian algorithm are used to choose the best-matching box to pass each identity on to.
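Here is a minimal sketch of this association step, assuming boxes in (x1, y1, x2, y2) format. It uses SciPy’s `linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm does. The `associate` helper and the `iou_threshold` value of 0.3 are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(predicted, detected, iou_threshold=0.3):
    """Match predicted track boxes to detections.

    Returns (track_index, detection_index) pairs whose IoU is at least
    iou_threshold; everything else is left unmatched.
    """
    if len(predicted) == 0 or len(detected) == 0:
        return []
    # The solver minimizes total cost, so use 1 - IoU as the cost.
    cost = np.array([[1.0 - iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]

predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(predicted, detected))
```

Detections left unmatched would typically start new tracks, and tracks left unmatched are candidates for removal.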


DeepSORT

Despite its effectiveness, SORT fails in many of the challenging scenarios we mentioned earlier, such as occlusions and different camera angles. To overcome this limitation, DeepSORT introduces another distance metric based on the “deep appearance” of the object. The core principle is to obtain a vector that can be used to represent a given image. To do this, DeepSORT trains a classifier and then strips its final classification layer, leaving a dense layer that produces a single feature vector.
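Once every detection has such a feature vector, comparing appearances comes down to a distance between vectors. The sketch below assumes cosine distance and a per-track gallery of past feature vectors, taking the smallest distance to any stored vector. The tiny three-dimensional vectors and the function names are made up for illustration; real embeddings come from the trained network and are much longer.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def appearance_distance(gallery, embedding):
    """Smallest cosine distance between a detection and a track's gallery."""
    return min(cosine_distance(g, embedding) for g in gallery)

# Hypothetical galleries of past feature vectors for two tracks.
track_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
track_b = [[0.0, 1.0, 0.0]]
detection = [0.95, 0.05, 0.0]

print(appearance_distance(track_a, detection) <
      appearance_distance(track_b, detection))
```

Because this distance does not depend on where the object is in the frame, it can re-identify an object after an occlusion, when position-based matching alone would fail.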

Furthermore, DeepSORT adds extra dimensions to its motion model. The state of each target is defined on the eight-dimensional state space $(u, v, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$, which contains the bounding box center position (u, v), aspect ratio γ, height h, and their respective velocities in image coordinates. These additions help DeepSORT handle challenging scenarios and, according to its authors, reduce the number of identity switches by 45%.


