Exploratory Data Visualization with Pandas Tutorial


Starting a Data Science project can be really tough, especially if you’ve never done one before.

Sure, there are a ton of methods, data structures, and types of tests to learn, and learning them properly takes time. But the hardest part, if we don't want to end up with a jumbled mess of statistics, is finding proper direction: truly understanding our dataset so we can determine a specific question (or hypothesis) to explore.

This article will be about using Pandas to find the clues that will help us understand our dataset — the step before our inspired hypothesis.

Step 1: Finding the Dataset and Our First Clue

To start, we better have something we're interested in; it can be hard to find something when you aren't looking for anything. Fortunately, we college students are savants when it comes to low-effort meals. The task today? Cereal. And hopefully this latest box can last the rest of the week.

There are several dataset repositories out there to help us start looking, but since it was the first link on Google, we're going with Kaggle.

Wonderfully user-friendly, Kaggle has given us our first clue: a description of the columns. Make sure to read through these when looking through any dataset, to leave as little room for misinterpretation as possible.

https://www.kaggle.com/crawford/80-cereals/data

Thanks, Kaggle!

Note: If one dataset doesn't have all the information we want, say the price (which this dataset does not have), we aren't stuck with it; we can look for another dataset, or even combine several.

Step 2: Univariate Analysis

We have a dataset — how exciting!

If we take a quick peek in Jupyter Notebook, we’ll see some of the values and the dimensions of the dataset.
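Here's a minimal sketch of that first peek, assuming the Kaggle file is saved locally as cereal.csv (the column names used throughout come from the dataset's description page):

```python
import pandas as pd

# Load the 80 Cereals data (assumed saved locally as cereal.csv)
df = pd.read_csv('cereal.csv')

print(df.shape)  # dimensions: (number of rows, number of columns)
df.head()        # the first five rows
```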

Note: Examples below are all just one way of using these functions. Please read the documentation for a comprehensive description of the use cases of each function, as well as alternative functions.

Another note: There are multiple packages offering many more (and similar) visualizations. Check out matplotlib, Plotly, and seaborn (popular at the moment) for possibilities that might fit your dataset best.

Duplicated

This is just a quick line to check whether any values are duplicated. Here, we see that there are no duplicate cereal names, so they are all unique.
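A sketch of that line, assuming the cereal names live in a 'name' column as described on the Kaggle page:

```python
# Count how many cereal names are exact duplicates; 0 means all names are unique
df['name'].duplicated().sum()
```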

Value counts

For categorical/qualitative data, this is probably the function I use most in my own projects when trying to get a general idea of how the data is distributed. We seem to have varying numbers of cereals coming from different manufacturers.
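Something like the following, assuming the manufacturer column is named 'mfr' as on the Kaggle page:

```python
# How many cereals does each manufacturer contribute?
df['mfr'].value_counts()
```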

Bar graph

A convenient way to visualize the previous numbers.
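Pandas can plot those counts directly (it draws through matplotlib under the hood, so in a plain script you may also need matplotlib.pyplot's show()):

```python
# Bar graph of the manufacturer counts from value_counts()
df['mfr'].value_counts().plot(kind='bar')
```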

Describe

For numerical/quantitative data, this one is my favorite. The function can be used on a single column or on a full dataset. While individual functions for mean, median, max, etc. do exist, this is an easy way to get them all at once. Note that pandas ignores missing values in calculations like mean and median. It looks like our calorie values are centered around 110.
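For example, on the calorie column, or on the whole frame at once:

```python
# Count, mean, std, min, quartiles, and max in one shot
df['calories'].describe()

# Or summarize every numeric column at once
df.describe()
```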

Box plot

Another convenient way to visualize the previous numbers. Wondering where the median went? So did I. The describe() output above shows that the median and third quartile share the value 110, so the two lines overlap.
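One way to draw it straight from pandas:

```python
# Box plot of the calorie distribution
df['calories'].plot(kind='box')
```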

Histogram

Another way to visualize the previous numbers. For this one I'd especially emphasize reading the documentation, since it offers a multitude of useful options (the number of bins, for example).
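A minimal version, with bins as one example of those options:

```python
# Histogram of calories; bins is just one of many parameters worth exploring
df['calories'].plot(kind='hist', bins=10)
```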

Apply

This is one way we could modify every value in a column. A couple of uses might be converting strings to DateTimes or converting temperatures from Fahrenheit to Celsius. In nutrition, a "calorie" is actually a kilocalorie: the energy it takes to raise the temperature of a kilogram of water by one degree Celsius. Here, we're going to change the kilocalories to calories by multiplying by 1000.
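A sketch of that conversion:

```python
# Multiply every value by 1000 to convert kilocalories to calories
df['calories'] = df['calories'].apply(lambda kcal: kcal * 1000)
```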

A note about missingness: Missing values can be important, so they shouldn't always just be dropped or ignored when viewing our visualizations. For instance, how would it impact our results if a column had values left missing whenever the true value was 0? Or if information was provided voluntarily, in a way that made the missingness disproportional to the true distribution?

Step 3: Bivariate Analysis

In the univariate analysis, we focused on the 'manufacturer' and 'calories' columns separately. Now we can look at the relationship between them. The truth is that these were simply the first qualitative and first quantitative columns listed, and I was too lazy to look past them; but let's say it was all part of the plan, so that a bivariate analysis can tell us whether the question "Do any manufacturers typically produce cereals with more calories than other manufacturers?" is reasonable.

Group-by

Here we can group the calorie values by manufacturer and check a few statistics. While we won't go into the testing itself, this is an example of how we could get the observed values for a permutation/significance test. If you're wondering about the NaN in the first row below, our value_counts() output shows only one cereal from manufacturer 'A', so a standard deviation cannot be calculated.
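A sketch of that grouping, using count, mean, and std as the few statistics to check:

```python
# Calorie statistics per manufacturer; 'A' has only one cereal, so its std is NaN
df.groupby('mfr')['calories'].agg(['count', 'mean', 'std'])
```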

From these results, we may find it reasonable to construct a test with the alternative hypothesis that cereals from manufacturer 'R' have, on average, more calories than those from other manufacturers, as opposed to the null hypothesis that mean calories do not differ between manufacturers.

Scatterplot

Could the amount of sugar and sodium in our cereals be proportionally related? Inversely proportional? Unrelated? At last we can find out from a visual inspection. Note that heatmaps are another interesting way to view this kind of correlation, but scatter plots will suffice for this number of data points.
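Drawn straight from pandas, assuming 'sugars' and 'sodium' columns as on the Kaggle page:

```python
# Scatter plot: does sugar content move with sodium content?
df.plot(kind='scatter', x='sugars', y='sodium')
```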

Correlation

Well, it looks to be a pretty weak relationship, but how can we quantify that? Easily answered by this correlation function.
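In pandas that's a one-liner:

```python
# Pearson correlation coefficient (pandas's default) between sugars and sodium
df['sugars'].corr(df['sodium'])
```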

Closing Thoughts

Starting a project can be pretty daunting, especially the major steps like finding a dataset and forming a hypothesis. Fortunately, we have many functions to help us out and streamline the process, even just within the pandas package. Still, we shouldn't forget that there are many other ways and variations to help visualize our data and findings. So the best thing to do is to try things out for yourself, learn as much as possible, and figure out the best solution case by case.
