Intro to Object Recognition
When I get home after an exciting day of working at Foundation AI, I sit down on my couch, grab the remote, point it at the TV, and press the power button to turn it on. This is only possible because (and I don’t mean to toot my own horn) I am great at object recognition. I can distinguish my couch from the wall and identify that it is a couch. I can locate the right remote among the pile of other remotes (yes, I have too many). I know where my TV is, and I know which button on the remote is the power button. This is a skill we all learned as children and for the most part take for granted.
It can be exceptionally useful for a computer to do this kind of work instead of a person. If I install a security camera in my house, I don’t want to have to watch the footage 24 hours a day to see if someone breaks in. If I want to find a picture of a cute puppy to share with the CEO, I don’t want to have to look through every picture on the internet to find one of a puppy cute enough for his attention.
Object Recognition is one of the most active areas of research in AI. It isn’t yet close to matching the general human ability to recognize objects, but it does approach human-level performance on narrow tasks (for example, identifying whether a picture contains a dog).
Object Recognition is used to identify discrete objects in images and videos. To a large degree, the techniques used for images and video are the same, because video is at its core a sequence of still frames. Video adds complexity when you need to track an object from frame to frame. Images and video are unstructured data. This means the raw pixels aren’t broken out into features, such as which objects are present, how big they are, or what color they are.
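To make the point that video is a collection of still frames concrete, here is a minimal Python sketch. The frame representation (nested lists of grayscale values from 0 to 255) and the names contains_bright_object and detect_in_video are assumptions made for illustration. The toy brightness check only stands in for a real recognition model.

```python
from typing import Callable, List

# Hypothetical frame format: rows of grayscale pixel values, 0-255.
Frame = List[List[int]]


def contains_bright_object(frame: Frame, threshold: int = 200) -> bool:
    """Toy stand-in for an image model: True if any pixel exceeds the threshold."""
    return any(pixel > threshold for row in frame for pixel in row)


def detect_in_video(frames: List[Frame], detector: Callable[[Frame], bool]) -> List[int]:
    """Run a single-image detector on every frame and return the indexes that hit."""
    return [i for i, frame in enumerate(frames) if detector(frame)]


dark = [[10, 12], [11, 9]]
bright = [[10, 250], [11, 9]]
video = [dark, bright, dark, bright]

hits = detect_in_video(video, contains_bright_object)
print(hits)
```

The same detector function works on a single photo or on every frame of a video, which is the sense in which the techniques are largely shared.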
Given only the pixels, an algorithm has to work out for itself whether an image contains a dog and whether that dog is big or small, brown or black. Because images and video can contain a great deal of variability (they can show almost any object on earth), Object Recognition algorithms require very large amounts of data to be trained effectively. Object Recognition algorithms are typically trained using supervised learning. This means that they are fed a large number of photographs that have been labeled with what they contain. From those examples, the algorithm learns how to identify the same objects in new pictures.
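The idea of learning from labeled examples can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not how production systems work: it uses a nearest-centroid classifier, and it assumes each photo has already been reduced to a short list of numeric features (the two-number vectors below are made up for illustration). Real object recognition models usually learn their features directly from pixels.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (hypothetical feature vector, label)


def train(examples: List[Example]) -> Dict[str, List[float]]:
    """Average the feature vectors for each label into one centroid per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the label whose centroid is closest to the new feature vector."""
    return min(model, key=lambda label: dist(model[label], features))


# Made-up labeled training data: each photo is summarized by two numbers.
labeled_photos = [
    ([0.9, 0.1], "dog"),
    ([0.8, 0.2], "dog"),
    ([0.1, 0.9], "cat"),
    ([0.2, 0.8], "cat"),
]

model = train(labeled_photos)
print(predict(model, [0.7, 0.3]))
```

The more varied the objects and the images, the more labeled examples a model like this needs before its learned averages generalize to new pictures.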
There are pre-trained object recognition algorithms like YOLO, which have already been trained to detect a limited set of objects. If, however, you need to detect objects not included in the pre-trained model, increase accuracy, or determine what direction an object is moving in a video after it has been identified, you will need to re-train the off-the-shelf algorithm with new data, develop a new algorithm, or assemble an ensemble of algorithms that each handle a different task.
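As one example of a task layered on top of detection, here is a rough Python sketch of estimating which way an object is moving. It assumes a detector has already produced one bounding box per frame for the same object. The box format, the function names, and the min_shift threshold are all hypothetical, and matching detections to the right object across frames is not handled here.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def direction(boxes: List[Box], min_shift: float = 1.0) -> str:
    """Describe the net movement between the first and last box of one object."""
    if len(boxes) < 2:
        return "unknown"
    (x0, y0), (x1, y1) = center(boxes[0]), center(boxes[-1])
    dx, dy = x1 - x0, y1 - y0
    if max(abs(dx), abs(dy)) < min_shift:
        return "stationary"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    # In image coordinates, y grows downward.
    return "down" if dy > 0 else "up"


# Made-up detections of one object across three frames.
track = [(10, 50, 30, 90), (20, 50, 40, 90), (35, 52, 55, 92)]
print(direction(track))
```

This only compares the first and last positions, so it reports net movement and ignores any back-and-forth in between.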
Object Recognition is used in several different ways today. It can identify what is in an image. This is called tagging and is used by Google’s image search. It can determine if an image contains a dog or a cat, identify age-sensitive content, and group photos with similar characteristics. Object Recognition can also find images that are similar to other images. Google’s reverse image search is an example of this. In this case, the algorithm doesn’t need to be trained to name what’s in the image. It just needs to identify other images with similar features.
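Finding similar images without naming their contents can be sketched as a nearest-neighbor search over feature vectors. The Python below is an illustration under stated assumptions: the file names and three-number feature vectors are invented, how the features are extracted from pixels is left out entirely, and cosine similarity is just one common choice of comparison.

```python
from math import sqrt
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float], index: Dict[str, Sequence[float]],
                 top_k: int = 2) -> List[str]:
    """Rank indexed images by similarity to the query and return the top names."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical index of precomputed image features.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "desert.jpg": [0.8, 0.2, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
}
query = [1.0, 0.1, 0.0]
print(most_similar(query, index))
```

Nothing in this search knows what a beach or a forest is. It only measures how close the feature vectors are to one another.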
Object Recognition can also find the differences between images. This is often used in medical imaging, where an object recognition system can scan images of the body and identify anomalies. In this approach an algorithm is trained on a set of images, for example X-rays, that are already tagged as healthy or unhealthy. New images can then be evaluated by the trained algorithm to determine if the patient is at risk of developing an illness. The Foundation AI team has extensive experience developing solutions with this type of functionality.