For some years now I have been using the terms object recognition, object detection and object tracking interchangeably as if they described the same concept or class of algorithms.
This bad habit, or lack of knowledge, even seeped into my naming conventions for classes and methods when developing software. In the back of my head I knew that recognition, detection and tracking had to be separate concepts, but I never took the time to look them up and correct myself, until I recently stumbled upon a distinction between the three while browsing the internet. In this article I will treat object recognition, object detection and object tracking on a conceptual level and describe how they differ from and relate to each other. I will not delve into algorithmic details or recommend any specific algorithms.
An object recognition algorithm takes an image as input (a still image or a video frame, gray-scale or color, depending on the specifics of the algorithm) and produces an output consisting of (label, probability) pairs. The labels describe what objects are in the image according to a pre-trained classifier: “dog”, “cat”, “cow” etc. The probabilities describe how confident the algorithm is in each label; for instance, an object in an image could be scored as 89% dog, but also 75% cow. In conclusion, an object recognition algorithm attempts to answer the question: “What objects are in the image?”
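To make the shape of that output concrete, here is a minimal sketch of the last step of a recognizer: turning raw per-class scores into (label, probability) pairs. The function name `recognize`, the threshold and the score values are my own assumptions for illustration, and I use an independent sigmoid per class so that several labels can score high at once, as in the dog/cow example above. A real classifier would produce the scores from the image.

```python
import math


def recognize(scores, labels, threshold=0.5):
    """Turn raw per-class scores into (label, probability) pairs.

    Hypothetical helper: a real recognizer would compute `scores`
    from an image with a pre-trained classifier.
    """
    pairs = []
    for label, score in zip(labels, scores):
        # Independent sigmoid per class, so probabilities need not sum to 1.
        probability = 1.0 / (1.0 + math.exp(-score))
        if probability >= threshold:
            pairs.append((label, probability))
    # Most confident label first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


labels = ["dog", "cat", "cow"]
scores = [2.1, -3.0, 1.1]  # made-up values, not from a real model
result = recognize(scores, labels)
for label, probability in result:
    print(f"{label}: {probability:.0%}")
```

With these made-up scores, “cat” falls below the threshold and is dropped, while “dog” and “cow” are both reported.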
Typically, the first tutorials you’ll encounter when venturing into deep neural networks for computer vision will have you recognize objects, namely the handwritten digits 0 to 9, in small images of 28×28 pixels. For training the deep neural network you’ll most likely be using the MNIST dataset, a dataset of small images of handwritten digits. As you progress in these tutorials, you’ll learn how to build simple architectures based on convolutional neural networks that will let you recognize and classify more sophisticated real-world objects in images, like “person” and “cat”.
An object detection algorithm takes an image as input and produces an output consisting of bounding boxes for the objects, if any, and corresponding (label, probability) pairs. For instance, you could input an image of a dog and a cat, and the object detection algorithm would then, provided it works correctly, return the bounding boxes and (label, probability) pairs for both the dog and the cat. In conclusion, an object detection algorithm attempts to answer the question: “What objects are in the image, and where are the objects?” By now you may have realized that object recognition is a prerequisite for object detection. In fact, object detection can be seen as running an object recognition algorithm on different patches of an image. For instance, imagine sliding a window from left to right, top to bottom across an image and running an object recognition algorithm on the image patch inside the window at each position. The image patches that yield a high probability of containing an object are then the outputs of the object detection algorithm.
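The sliding-window idea can be sketched in a few lines. This is only an illustration under my own assumptions: the image is a plain 2D list of pixel values, `sliding_window_detect` is a hypothetical name, and `toy_classify` is a stand-in for a real recognizer that simply reports the fraction of bright pixels as its “probability”. Practical detectors also vary the window size and merge overlapping boxes, which I leave out here.

```python
def sliding_window_detect(image, window, stride, classify, threshold=0.5):
    """Run a recognizer on every window position and keep confident patches.

    image:    2D list of pixel rows
    window:   (height, width) of the sliding window
    classify: hypothetical callable, patch -> (label, probability)
    Returns a list of ((left, top, width, height), label, probability).
    """
    height, width = len(image), len(image[0])
    win_h, win_w = window
    detections = []
    for top in range(0, height - win_h + 1, stride):        # top to bottom
        for left in range(0, width - win_w + 1, stride):    # left to right
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            label, probability = classify(patch)
            if probability >= threshold:
                detections.append(((left, top, win_w, win_h), label, probability))
    return detections


def toy_classify(patch):
    """Stand-in recognizer: the share of bright pixels acts as the probability."""
    pixels = [pixel for row in patch for pixel in row]
    return "blob", sum(pixels) / len(pixels)


image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
boxes = sliding_window_detect(
    image, window=(2, 2), stride=2, classify=toy_classify, threshold=0.9
)
print(boxes)
```

In this toy image only the bottom-right window is fully bright, so it is the only patch that passes the threshold and is reported together with its bounding box.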
For practical experience with object detection, I recommend using Google’s Cloud Vision API or their Object Detection API for TensorFlow. The latter will allow you to use a pre-trained neural network to detect predefined classes of objects, or to re-train the neural network to detect object classes of your own in images and video.
Finally, let’s talk about object tracking in video. Often object detection is a prerequisite for object tracking, because an object detection algorithm must provide the bounding box of an object to the object tracking algorithm before it can start. A good object tracking algorithm will use the information inside the bounding box (the appearance and location of the object) to track the object as it moves from frame to frame in the video. An object tracking algorithm is typically faster than an object detection algorithm, because it is given the current location of the object and can therefore constrain its search in the next frame to an area around that location. After a few frames of successful tracking, the tracking algorithm may also estimate the speed and direction of the object to constrain the search further. However, the bounding box of the object tracking algorithm tends to drift away from the object it is attempting to track, and therefore you will typically see a mix of object detection and object tracking, where object detection is used every x frames to re-establish a precise bounding box around the object. Another advantage of using object tracking instead of relying solely on object detection is that an object tracking algorithm tracks a specific instance of an object from frame to frame. Imagine tracking two moving soccer balls in a video. The object tracking algorithm will know the identity of “soccer ball 1” and “soccer ball 2” in each successive frame, whereas the object detection algorithm will detect two soccer balls and their locations, but cannot tell which is which because they belong to the same class.
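The mix of detection and tracking described above, with detection run every x frames, might look roughly like the loop below. Everything here is an assumption made for illustration: `follow_object`, `detect` and `track` are hypothetical names, the “frames” are just the true x-position of the object, and the toy tracker deliberately underestimates the motion so that the drift becomes visible.

```python
def follow_object(frames, detect, track, redetect_every=10):
    """Return one bounding box per frame, mixing detection and tracking.

    detect(frame) -> box                (slow: searches the whole frame)
    track(frame, previous_box) -> box   (fast: searches near previous_box)
    Both callables are hypothetical placeholders for real algorithms.
    """
    boxes = []
    box = None
    for index, frame in enumerate(frames):
        if box is None or index % redetect_every == 0:
            box = detect(frame)       # re-establish a precise bounding box
        else:
            box = track(frame, box)   # cheap update that may drift over time
        boxes.append(box)
    return boxes


# Toy setup: each "frame" is simply the true x-position of the object.
def toy_detect(frame):
    return (frame, 0, 10, 10)         # exact box: (x, y, width, height)


def toy_track(frame, previous_box):
    x, y, w, h = previous_box
    return (x + 1, y, w, h)           # object moves 2 px per frame, tracker assumes 1


frames = [0, 2, 4, 6, 8, 10]
boxes = follow_object(frames, toy_detect, toy_track, redetect_every=3)
print([box[0] for box in boxes])
```

In the toy run the tracked x-position falls behind the true position on the frames in between, and the detection on every third frame snaps the box back onto the object.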
So, with this short article, I hope I have clarified the differences and interrelations between object recognition, object detection and object tracking, and that you will find the distinctions helpful in your studies or work.
 
				