Cumulative Distribution Functions, Part I
Data scientists care about the mean and other measures of central tendency. But they also care about distributions. Although not as well known as some other methods for examining distributions, the cumulative distribution function (CDF) deserves to be in the toolkit of every aspiring data scientist. It also deserves to be in the working vocabulary of every literate data interpreter.
In a series of posts I examine the CDF. My approach is inductive. First, I generate sample CDF plots to give an initial footing for our intuitions. Paradoxically, data science is not about data. At its core, data science is reasoning about probabilities. But probabilities are not reclusive creatures. They roam in packs, reinforcing each other additively and cumulatively. CDFs thus provide a way to study and understand probabilities in one of their native habitats. Second, I consider the CDF as a mathematical function alongside more familiar functions such as the histogram and the probability mass function (PMF). Here the goal is to use a mix of code and visualization to probe the CDF’s inner workings. Finally, I state the mathematical formalism of CDFs at a sufficient level of generality to accommodate, for example, differences between discrete and continuous variables and to keep our more formally minded friends at bay.
Note: The Python code for the series “Topics in Data Science” is available in the form of Jupyter notebooks at my GitHub site. Direct link: Jupyter Notebook. As time permits I will also make versions available in R.
Gender Earnings Disparity: Elite Institutions
The first dataset we examine comes from the U.S. Department of Education and is called “College Scorecard Data.” The data reports earnings of students 10 years after entry to college. Here we view a sample of the data that compares earnings for men and women at elite institutions:
Now let’s draw some quick plots, beginning with the histogram. We can see immediately that there is a salary difference between men and women. The median salaries for men and women are $110K and $80K, respectively.
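To make the histogram idea concrete, here is a minimal sketch in plain Python that bins salaries and computes medians. The salary lists are invented for illustration (they are not the College Scorecard values, only chosen so that the medians match the figures quoted above), and `histogram_counts` is a hypothetical helper, not part of any library.

```python
from statistics import median

# Hypothetical salaries in dollars, invented for illustration only.
# These are NOT the College Scorecard values.
men = [72_000, 95_000, 104_000, 110_000, 118_000, 131_000, 150_000]
women = [55_000, 68_000, 74_000, 80_000, 88_000, 97_000, 120_000]


def histogram_counts(values, bin_width, start, stop):
    """Count the values falling in each half-open bin [left, left + bin_width)."""
    n_bins = (stop - start) // bin_width
    counts = [0] * n_bins
    for v in values:
        if start <= v < stop:
            counts[(v - start) // bin_width] += 1
    return counts


bin_width, start, stop = 25_000, 50_000, 175_000
men_counts = histogram_counts(men, bin_width, start, stop)
women_counts = histogram_counts(women, bin_width, start, stop)

# Print a crude text histogram: one row per bin, one mark per observation.
for i, (m, w) in enumerate(zip(men_counts, women_counts)):
    left = start + i * bin_width
    print(f"{left:>7,} to {left + bin_width:>7,}  men: {'#' * m:<4} women: {'#' * w}")

print("median (men):  ", median(men))
print("median (women):", median(women))
```

A histogram is nothing more than these per-bin counts drawn as bars. Notice that the picture depends on the choice of bin width and starting point.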
Kernel Density Estimation
We can look at a smoothed out version of the same distribution using what’s called kernel density estimation (KDE). A KDE plot nicely reveals the shape of the distributions.
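For readers curious about what the smoothing does, the following is a rough sketch of a Gaussian KDE written from scratch. It places a small bell curve on each observation and averages them. The data are again invented, the bandwidth of $10,000 is an arbitrary assumption, and `gaussian_kde` here is a hypothetical helper of my own, not a call into any plotting library.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, each centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical salaries in dollars, invented for illustration only.
women = [55_000, 68_000, 74_000, 80_000, 88_000, 97_000, 120_000]

# The bandwidth is an assumed value; larger values give smoother curves.
women_density = gaussian_kde(women, bandwidth=10_000)

# A density should integrate to roughly 1; check with a simple Riemann sum.
step = 1_000
grid = range(0, 200_001, step)
area = sum(women_density(x) for x in grid) * step
print(f"approximate area under the curve: {area:.4f}")
```

The bandwidth plays the same role for a KDE that bin width plays for a histogram: too small and the curve is spiky, too large and real features are smoothed away.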
Next, let’s look at a violin plot, which combines a box plot with a kernel density plot.
Cumulative Distribution Function
Finally, let’s plot the CDF. Note that the x-axis in this case is salary and the y-axis is cumulative percent. Using the CDF we can pose and easily answer the following type of question: What percentage of women and men make less than $100,000? The vertical line at $100,000 shows us that 83% of women and 29% of men make less than $100,000. Put another way, only 17% of women make $100,000 or more, compared to 71% of men.
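Reading a value off the CDF amounts to counting the share of observations at or below a threshold. Here is a small sketch of an empirical CDF in plain Python. The salary lists are invented for illustration, so the shares it prints are not the 83% and 29% read from the plot, and `make_ecdf` is a hypothetical helper name.

```python
from bisect import bisect_right


def make_ecdf(samples):
    """Return a function giving the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


# Hypothetical salaries in dollars, invented for illustration only.
men = [72_000, 95_000, 104_000, 110_000, 118_000, 131_000, 150_000]
women = [55_000, 68_000, 74_000, 80_000, 88_000, 97_000, 120_000]

men_cdf = make_ecdf(men)
women_cdf = make_ecdf(women)

threshold = 100_000
print(f"women at or below {threshold:,}: {women_cdf(threshold):.1%}")
print(f"men at or below {threshold:,}:   {men_cdf(threshold):.1%}")

# The complement gives the share above the threshold.
print(f"women above {threshold:,}: {1 - women_cdf(threshold):.1%}")
print(f"men above {threshold:,}:   {1 - men_cdf(threshold):.1%}")
```

Plotting `ecdf(x)` against `x` for each group gives the step-shaped curves of a CDF plot, and the vertical line at the threshold is simply one evaluation of each function.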
Let’s conclude Part I with a working definition of the CDF. A CDF for a random variable X, evaluated at some value x, is the probability that X will take a value less than or equal to x. In our example, if the random variable X is women’s salaries, the probability at the value $100,000 is .83, whereas for men’s salaries the probability at the same value is .29.
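In symbols, the working definition and the complement used earlier can be written as follows. The notation F_X is my own choice here, a common convention rather than something fixed by this post. The numbers are the women’s values read from the plot.

```latex
\begin{align*}
F_X(x) &= P(X \le x) \\
P(X > x) &= 1 - F_X(x) \\
F_X(100{,}000) = 0.83 \;&\Rightarrow\; P(X > 100{,}000) = 1 - 0.83 = 0.17
\end{align*}
```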
In the next part we will consider some other examples, unpack the CDF as a mathematical function, and look at code in Python to understand and generate CDF plots.