Intro to Clustering and Classification

Have you ever bought something online and then been offered a complementary product? Have you received a special offer in the mail from your cable company just as you were thinking of switching providers? Has your credit card company ever called because your spending looked unusual? All of these are made possible by the sheer amount of data available to companies today, and by the classification and clustering techniques used to make sense of it.

Humans are notoriously bad at dealing with large quantities of data. When faced with millions of data points with no apparent pattern, we often go with our gut. We create rules based on our prior experience and personal preferences. These rules can serve us well: we reason that people buy more at grocery stores when they are hungry because it makes sense and because we personally exhibit this behavior. In business today, though, these kinds of insights aren’t enough to put us ahead of our competitors. We need to be able to identify, for example, that men over the age of 30 from a certain part of town buy more of a product directly after a sports game. This is where AI can help.

Clustering and Classification algorithms can quickly examine huge data sets looking for patterns. When an algorithm processes data, it is simply looking for patterns in the numbers it has been fed. It doesn’t know that those numbers refer to features in the real world. As a result, it can identify patterns that humans might dismiss because they don’t fit with our prior experience.

Clustering and Classification are some of the more basic elements of Machine Learning. These algorithms are most often applied to structured data, meaning data that has been organized and formatted into a repository (usually a database). Examples of structured data include numbers, dates, and groups of words and numbers. (This contrasts with unstructured data like a free-form email, where none of the information contained in the email has been broken out or classified.)

Clustering uses unsupervised learning to analyze a data set and identify groups (clusters) of similar data. This means that if you feed a clustering algorithm a bunch of data about your customers, it can group those customers into categories based on their behavior or features. These groups will be based entirely on the data provided to it and may be entirely different from the groups that humans like you and me would select. As a result, Clustering can identify interesting and previously unknown patterns in consumer behavior and business processes.
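To make the idea concrete, here is a minimal sketch of one common clustering method, k-means, written in plain Python. The article does not name a specific algorithm, so k-means is an assumption chosen for illustration, and the customer numbers (visits per month, average basket size in dollars) are made up. Notice that the algorithm is never told what the groups mean; it only sees numbers.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial cluster centers.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(2, 15), (3, 18), (2, 20), (12, 95), (14, 110), (13, 100)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

On this toy data the first three customers end up in one group and the last three in the other. Which group is numbered 0 and which is numbered 1 depends on the random starting points, which is a reminder that the cluster numbers carry no meaning of their own.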

Classification, unlike Clustering, uses supervised learning. Instead of blindly looking for new patterns in data, a Classification algorithm is trained on labeled examples to recognize certain patterns. You might have a large data set with pictures of moles, some labeled as cancerous and others labeled as benign. A Classification algorithm can be trained on this data set and then fed new images. It can then predict whether a new picture shows a cancerous or a benign mole.
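The train-then-predict idea can be sketched with one of the simplest classifiers, k-nearest neighbors, in plain Python. This is an illustration under stated assumptions rather than how a real image classifier works: instead of pictures, each mole is reduced to two hypothetical numeric features (diameter in millimeters and a border-irregularity score), and all values and labels below are invented toy data.

```python
from collections import Counter

def knn_predict(training, new_point, k=3):
    """Return the majority label among the k training examples nearest to new_point."""
    # Sort labeled examples by squared distance to the new point.
    by_distance = sorted(
        training,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], new_point)),
    )
    # Let the k closest examples vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples: ((diameter_mm, irregularity), label).
training = [
    ((2.0, 0.10), "benign"),
    ((3.0, 0.20), "benign"),
    ((2.5, 0.15), "benign"),
    ((7.0, 0.80), "cancerous"),
    ((8.0, 0.90), "cancerous"),
    ((6.5, 0.70), "cancerous"),
]

print(knn_predict(training, (2.8, 0.20)))
print(knn_predict(training, (7.2, 0.85)))
```

Unlike the clustering case, the labels here come from people, not from the algorithm. The classifier can only ever answer with a label it saw during training.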

Clustering and Classification are the basis of Predictive Analytics. To make predictions or identify anomalies, patterns must first be identified. Clustering and Classification are the main tools we can use to identify patterns and to determine whether new data fits them. Armed with this information, we can much more accurately forecast future outcomes.
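As a rough sketch of what "does new data fit the pattern" can look like in code, the example below learns a very simple pattern (the average and spread of past purchases) and flags a new purchase that falls far outside it. The three-standard-deviation threshold and the purchase amounts are assumptions made for illustration; a real fraud system would be considerably more involved.

```python
from statistics import mean, stdev

def is_anomaly(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard deviations from the mean of history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values are identical, so anything different stands out.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Hypothetical past card purchases in dollars.
past_purchases = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_anomaly(past_purchases, 44.0))
print(is_anomaly(past_purchases, 900.0))
```

A typical purchase passes quietly, while the 900-dollar purchase is flagged as unusual, which is the kind of signal that might prompt a phone call from a card company.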

In a future article, we will discuss in more detail how Clustering and Classification work and how these statistical models can produce inaccurate results, for example through “overfitting”, if not used properly.