A Walk-through of Principal Component Analysis

Derek Leung
Data Science Student Society @ UC San Diego
6 min read · Nov 18, 2019


What a cool picture. I bet a lot of people have used it.

The residence halls strike again. You sat up too fast and have begun to wonder what makes certain bumps in the popcorn ceiling above your bunk bed so incredibly unbearable. Time to collect data.

Unfortunately, we encounter a problem (other than your headache): we can’t plot the width, height, sharpness, and shade of white for all twenty popcorn ceiling bumps that smashed into your head when you got up this morning, because we, like the rest of the world, lack an effective four-dimensional graph.

Well, congratulations, we have unlocked the perfect opportunity to implement the dimension reduction strategy known as principal component analysis or PCA! While you may have, like an absolute madman, mapped out the width, height, sharpness, and shade of white for each little painful bump above you, what we really want to figure out is the key factor, or factors, that make different groups of popcorn bumps special and so very painful. So then we do math.

High Level Explanation of PCA

So what purpose does PCA serve? While categorizing bumps in a popcorn ceiling does seem to be an extremely pressing issue in today’s society, there are situations where using PCA may be more applicable: like determining the best kind of rock for skipping. Yes. Long nights in the CS dungeon, without even the hope to see a glimmer of sunlight from the world above, have left countless souls yearning to revisit the time when their eyes still shined. Skipping rocks may fulfill this desire.

But again, there is a problem. It has been what felt like a lifetime and you no longer remember what kinds of rocks are good for skipping. You don’t even have the rocks grouped by type! So you collect data to categorize them: volume, weight, smoothness, how thin they are relative to their volume, and how circular they are. However, these five attributes cannot be graphed concurrently for every rock, because once again we are past the three dimensions we can actually plot.

Therefore, we have to make a decision and cut off some attributes, like you did with your high school friends upon entering college! Remember, the purpose of this process is to see what makes different groups of rocks special. So what we want to keep in the end result are the attributes, or rather the combinations of attributes, with the greatest variance.

Different components can be ranked in order of variance, or importance, like in the scree plot below. Each point on the x-axis represents a different principal component (a combination of the original attributes), and the y-axis, computed from that component’s eigenvalue, represents the share of the total variation it accounts for. In this specific instance, whatever the first principal component represents accounts for over 45% of the variation and the second accounts for over 25%. In other words, the first principal component has the greatest variance, and the ones after it follow in decreasing order.
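To make that concrete, here is a minimal sketch of how those numbers come out of a PCA fit in Python. The rock measurements are simulated stand-ins (your real measurements would replace the random matrix), and standardizing first keeps any single attribute from dominating just because of its units.

```python
# A minimal sketch of reading explained variance off a PCA fit.
# The rock data here is simulated, not real measurements.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 20 rocks x 5 attributes: volume, weight, smoothness, thinness, roundness
rocks = rng.normal(size=(20, 5))

# Standardize so no attribute dominates just because of its units.
X = StandardScaler().fit_transform(rocks)

pca = PCA()
pca.fit(X)

# Each entry is the share of total variance a principal component explains;
# plotted as bars per component, this is exactly the scree plot.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))  # running total, e.g. "over 70%"
```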

Given that these two components account for over 70% of the variation between subjects, it is reasonable to assume they capture most of the differences needed to tell groups of subjects apart. So we can drop the other components like the class you failed the first midterm of. PCA tells us that those leftover directions barely matter for our purposes, because they barely make groups of subjects different from one another (assuming they account for very little of the variance).

Ideally, the principal components we choose will be enough to classify subjects into meaningful clusters (though it may be useful to see what happens when we pick other components). The end result should be something like the figure below, with the axes representing our chosen components.

The “PC 1” and “PC 2” lines are the directions of best fit through the data: PC 1 is the line along which the rocks vary the most, and PC 2 is the perpendicular direction that captures the next-most variation. The graph can also be redrawn with those two lines as the axes, since that makes for a cleaner visualization and the important part is where the points sit relative to each other. Now that we can see which combinations of variables make rocks the most different from one another, we can test a few from each group or extreme to find out how they perform and predict the performance of all the others from those results.
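Continuing the sketch above, projecting each rock onto PC 1 and PC 2 produces the kind of two-dimensional picture described here. The data is again simulated, so any clusters it shows are meaningless; the point is only the mechanics.

```python
# A sketch of projecting data onto the first two principal components
# and plotting it; the rock values are again simulated stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
rocks = rng.normal(size=(20, 5))          # 20 rocks x 5 attributes
X = StandardScaler().fit_transform(rocks)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)             # each rock's coordinates along PC 1 and PC 2

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

# The loadings show how strongly each original attribute contributes to each PC.
print(pca.components_)
```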

How Industries Use PCA

Face Recognition

Figure: the two most discriminant principal components for separating female faces from male faces.

When a program tries to recognize a face, there are a lot of variables that can be collected because a single image can contain so much information. Even though pretty much everyone has the same basic facial structure (eyes, nose, mouth, etc.), there are enough subtle variations that we can tell everyone we know apart from each other.

Given the abundant yet subtle differences people’s faces exhibit, a kind of balancing act arises: we need the machine to process enough information to tell people apart, yet not so much that it takes too long to run or starts treating two images of the same person as different people. PCA is therefore well suited to determine the most important traits for differentiating people. It lets the machine run faster, since it checks fewer variables, and run more effectively, since it finds the traits that vary the most between different people and uses those distinguishing traits to categorize faces, or to separate male from female faces, as seen above.
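As a rough illustration of the idea (not the exact pipeline any particular system uses), here is a sketch of the classic “eigenfaces” recipe with scikit-learn’s bundled Labeled Faces in the Wild dataset; the choice of 150 components is an arbitrary assumption for the example.

```python
# A sketch of the "eigenfaces" idea: compress face images to a few hundred
# principal components before handing them to a classifier.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

faces = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
X = faces.data                     # each row is a flattened grayscale image

pca = PCA(n_components=150, whiten=True)   # 150 is an arbitrary choice here
X_reduced = pca.fit_transform(X)           # thousands of pixels -> 150 features per face

print(X.shape, "->", X_reduced.shape)
# X_reduced (not the raw pixels) is what a classifier would then train on.
```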

Detection and Visualization of Computer Network Attacks

There is a lot of data involved in analyzing network attacks, and dozens of variables can be measured for each kind of attack. Because some of us have lives, though rare for UCSD STEM majors like us, we are not about to waste all our time staring at every little piece of traffic. Instead, PCA can help us find what is important in all this data so we can more quickly determine whether a network attack is in progress, typically in one of two ways: anomaly detection, which attempts to identify abnormal activity, and signature detection, which attempts to match behaviors from previous attacks to current activity.

In one study where PCA was used to learn about defending against such attacks, there were seven data sets with three hundred feature vectors each (far too much to analyze exhaustively). Thanks to PCA, the researchers were able to interpret their data graphically using bi-plots. As they note, “Future work includes testing this model to work in a real-time environment at which network traffic is collected, processed and analyzed for intrusions dynamically.”
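The study’s actual datasets and bi-plots are not reproduced here, but the anomaly-detection flavor can be sketched as follows: fit PCA on features from normal traffic, then flag new traffic that the retained components reconstruct poorly. The traffic matrices below are random stand-ins, and the threshold is an arbitrary assumption for the example.

```python
# A sketch of PCA-based anomaly detection on network traffic features.
# The "traffic" here is simulated; real feature vectors would replace it.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
normal_traffic = rng.normal(size=(1000, 300))   # 300 features per connection

scaler = StandardScaler().fit(normal_traffic)
pca = PCA(n_components=10).fit(scaler.transform(normal_traffic))

def reconstruction_error(samples):
    """Distance between a sample and its projection onto the 'normal' subspace."""
    Z = scaler.transform(samples)
    reconstructed = pca.inverse_transform(pca.transform(Z))
    return np.linalg.norm(Z - reconstructed, axis=1)

# Traffic that doesn't fit the normal principal components reconstructs
# poorly, so a large error flags a possible attack.
threshold = np.percentile(reconstruction_error(normal_traffic), 99)
new_traffic = rng.normal(size=(5, 300)) + 3.0    # shifted, i.e. suspicious
print(reconstruction_error(new_traffic) > threshold)
```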

Aftermath

Principal component analysis is a great way to save time, reduce computational cost, and increase efficiency when processing large amounts of data. Consequently, it remains a widely used dimension-reduction method across modern industries as the amount of data technology lets us collect continues to grow. The time has come. Gather the rocks.
