Clustering in Machine Learning

Feb 01, 2019

Machine learning is the field of computer science that gives computer systems the ability to learn from data, and it’s one of the hottest topics in the industry right now.

We will now take a closer look at one of the techniques used in unsupervised learning, which is referred to as clustering. So, what exactly is clustering? This is a technique where your data is divided into a number of logical groups. This grouping is done based on the characteristics of the data, and all of the data within a particular cluster will have certain similarities.

To understand this, consider that you have a huge dataset where you have millions of data points, and there are no labels for you to work with. Before you use the data for anything, you want to see if there are any patterns which can help you identify groups in your data. For that, you can make use of a clustering algorithm, which will examine all of the attributes in your dataset and determine that there are several data points which share similar attributes or characteristics.

Based on these similarities, a number of different groups may be formed in your dataset, and each of these groups will form a cluster of data points. So the task of a clustering algorithm is to examine the data which is presented to it, and then break it up into a number of logical clusters. All the data points within a cluster will have far more in common with each other than with the points outside the cluster.

So why exactly would you want to employ the technique of clustering? Well, consider that entities in the real world are in fact very complex and may not be very easy to categorize. For example, you may have a lot of data about the products which are sold on an e-commerce site, but how many different categories should you divide them into? And what exactly should those categories be? You're faced with similar questions when categorizing the users who are on a social media platform, or even in the case of readers of an online newspaper.

Do you classify people by their age, their gender, or their geographical location? And if you do opt for geographical location, then what exactly is the definition of a single location? Do you divide people by state, by city, or by neighborhood? So there is often no single objective way to divide your data into groups. This is where clustering comes into the picture.

But before you can apply it you will need to ensure that the defining characteristics of each of your data points are represented using numbers. For example, in the case of products, one of the attributes of the data can be a rating which is given to the product by the users. It can also be an overall review sentiment. So for all of the reviews made for the product, you may come up with an aggregate sentiment score which could range somewhere between 0 and 1.

[Video description begins] The following bullet point is shown on the screen: Review sentiment (1 - positive and 0 - negative). [Video description ends]

The attributes in fact need not be entirely numeric, but could be categorical with numeric labels representing each of the categories. For example, the number 1 could represent electronic products, and the number 2 fashion accessories. Some other attributes for a product could include its dimensions, such as the size, the weight and so on. You may even have a color for the product, and this again, because it's a categorical value, will need to be represented by some numeric label.
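The encoding described above can be sketched in a few lines of Python. This is only an illustration: the field names, the label codes and the sample products are hypothetical and are not taken from any real dataset.

```python
# Hypothetical label maps; the specific codes are arbitrary illustrations.
CATEGORY_LABELS = {"electronics": 1, "fashion": 2}
COLOR_LABELS = {"blue": 1, "red": 2, "black": 3}


def encode_product(product):
    """Turn one product record (a dict) into a list of numbers."""
    return [
        float(product["rating"]),                     # user rating, already numeric
        float(product["sentiment"]),                  # aggregate review sentiment, 0 to 1
        float(product["weight_kg"]),                  # a physical dimension
        float(CATEGORY_LABELS[product["category"]]),  # categorical -> numeric label
        float(COLOR_LABELS[product["color"]]),        # categorical -> numeric label
    ]


# Made-up sample records, for illustration only.
products = [
    {"rating": 4.5, "sentiment": 0.9, "weight_kg": 0.2,
     "category": "electronics", "color": "blue"},
    {"rating": 3.0, "sentiment": 0.4, "weight_kg": 0.1,
     "category": "fashion", "color": "red"},
]

encoded = [encode_product(p) for p in products]
for row in encoded:
    print(row)
# [4.5, 0.9, 0.2, 1.0, 1.0]
# [3.0, 0.4, 0.1, 2.0, 2.0]
```

One thing to keep in mind is that integer labels imply an ordering between categories, which a distance-based algorithm will take at face value. One-hot encoding is a common alternative when that ordering is not meaningful.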

So for fields such as the product category and the color, these are some of the predefined categories you come up with. But your clustering algorithm may help you identify certain other categories, for example, highly rated electronic items which are available in blue. Let's move along now to the defining characteristics of users, whether they are on a social media platform or are the readers of some online newspaper. You may have some kind of rating for the posts made by the user as well as the comments, likes and shares. You may even have some kind of categorical label for each of their posts, and this could be by the topic of the post itself.

[Video description begins] The following bullet point is shown on the screen: Score every post by topic (music lovers, sports lovers). [Video description ends]

You may rate each user or reader by their activity on the platform. So 100 can represent a highly active user and 0 a user who is not active at all. In the case of social media, you may rate each user by the number of connections which they have. And you may also give a score for each user depending on how complete their profile is. So sticking with the example of users on a social media platform, based on factors such as connections, activity, and profile completeness, you can represent each user by a point on a three-dimensional plot, such as this one.

[Video description begins] The plot comprises three arrows. Two of these form a right angle and the third starts at the intersection of the first 2 arrows, pointing away from them at a 45 degree angle. The base of the right angle has the label Profile complete %. The perpendicular arm of this right angle has the label Connections. The arrow at 45 degree angle has the label Activity. There is a big dot in between the Connections and the Profile complete % space. [Video description ends]

Rather than just considering three attributes, you can extend this to N attributes, where you'll have an N-dimensional space in which you will represent your users as points. Once each user is represented as a point in this space, there will be a number of points which happen to be close to each other in terms of overall distance.
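To make "close to each other in terms of overall distance" concrete, here is a minimal sketch that assumes Euclidean (straight-line) distance. That is one common choice, not necessarily the one a given clustering algorithm uses. The three users and their scores are made up, and all three attributes are assumed to be on comparable scales.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Each user is (connections, activity, profile complete %); values are made up.
alice = (80.0, 90.0, 100.0)
bob = (77.0, 94.0, 100.0)
carol = (5.0, 10.0, 20.0)

print(euclidean_distance(alice, bob))    # 5.0, so these two users are similar
print(euclidean_distance(alice, carol))  # much larger, so these two are dissimilar
```

The same function works unchanged for any number of attributes, because it only relies on both points having the same length.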

So there may be one group of users who are very highly active, but also have very few connections. As we can imagine, there is likely to be a cluster where the profile completeness is close to zero, as is the activity and probably the connections as well. It is these kinds of clusters which a clustering algorithm will help you find. Moving to another example, consider that you have plotted in an N-dimensional space all the readers of a newspaper such as The New York Times. Well, if you run some kind of clustering algorithm on these points, you may discover that one cluster represents users who spend a lot of time in the technology section of the newspaper.

So that is one common attribute for all the members in one cluster. On the other hand, there will be different clusters, and these may represent readers of different sections of the paper, such as current affairs or sports. So when you examine the users within a particular cluster, the distance between the users is an indication of how similar they are.
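The discussion so far does not name a specific clustering algorithm. As one possible illustration, the sketch below uses k-means, a widely used choice, written with only the standard library (Python 3.8 or later for `math.dist`). The reader data, the two attributes and the starting centroids are all hypothetical.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Each reader is (technology articles per week, sports articles per week); made-up data.
readers = [(9.0, 1.0), (8.0, 0.0), (10.0, 2.0), (1.0, 9.0), (0.0, 8.0), (2.0, 10.0)]

# Starting centroids are picked by hand here so the result is repeatable.
labels, centroids = k_means(readers, initial_centroids=[readers[0], readers[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [(9.0, 1.0), (1.0, 9.0)]
```

Here cluster 0 groups the readers who mostly read technology articles and cluster 1 those who mostly read sports. In practice the starting centroids are usually chosen at random, so the cluster numbering and sometimes the clusters themselves can vary between runs.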

So if you have one cluster where most of the readers read articles from the technology section, then those points which are close to each other may represent users who read multiple technology articles in a single day, whereas other points within the same cluster may represent users who read technology articles just once a week. When dividing any dataset into a number of clusters, the goal of the clustering algorithm is to ensure that all of the data points within the same cluster are as similar as possible. This may involve creating clusters based on multiple attributes rather than just a single one.

So maximizing intra-cluster similarity is one of the goals of clustering, the other one being to minimize inter-cluster similarity. In other words, data points which are in different clusters should be as far apart from each other as possible. This is where the number of clusters your data will be split into comes into the picture. If you create too few clusters, then you may minimize the similarity between different clusters.

However, in that case data points within the same cluster may in fact be rather far apart. Conversely, if you create too many clusters, then the data points within each cluster will be similar. However, two different clusters may in fact be very close to each other without there being any significant distinguishing factors.
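This trade-off can be seen on a toy example. The sketch below is only an illustration: the nine points and the three hand-made groupings are hypothetical, and the two measures used (average distance to the cluster's own centroid, and the smallest distance between any two centroids) are just one simple way to quantify how tight and how well separated the clusters are.

```python
import math
from itertools import combinations


def centroids_of(points, labels):
    """Mean position of the points in each cluster, keyed by cluster label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(dim) / len(members) for dim in zip(*members))
        for label, members in groups.items()
    }


def mean_within_distance(points, labels):
    """Average distance from each point to its own centroid (lower means tighter)."""
    centroids = centroids_of(points, labels)
    total = sum(math.dist(p, centroids[label]) for p, label in zip(points, labels))
    return total / len(points)


def min_centroid_separation(points, labels):
    """Smallest distance between two centroids (needs at least two clusters)."""
    centroids = centroids_of(points, labels).values()
    return min(math.dist(a, b) for a, b in combinations(centroids, 2))


# Nine made-up points that visibly fall into three groups along one axis.
points = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (11.0, 0.0), (12.0, 0.0), (13.0, 0.0),
          (21.0, 0.0), (22.0, 0.0), (23.0, 0.0)]

# Hand-made groupings: too few clusters, a natural number, and too many.
groupings = {
    "2 clusters": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "3 clusters": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "6 clusters": [0, 0, 1, 2, 2, 3, 4, 4, 5],
}

results = {
    name: (mean_within_distance(points, labels), min_centroid_separation(points, labels))
    for name, labels in groupings.items()
}
for name, (within, separation) in results.items():
    print(f"{name}: within={within:.2f}, separation={separation:.2f}")
# 2 clusters: within=3.56, separation=15.00
# 3 clusters: within=0.67, separation=10.00
# 6 clusters: within=0.33, separation=1.50
```

With two clusters the centroids are far apart, but points inside the first cluster are spread out. With six clusters every cluster is tight, but some centroids sit almost next to each other. Three clusters gives a reasonable balance for this particular data.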