CS 2120: Class #11
Plotting with Python
Now that you know how to load and manipulate data, we’re going to spend some time learning how to visualize our data (and the results of processing it).
If you’re running Python from the command line, try this:
% ipython --pylab
otherwise, make sure you do this before trying anything below:
>>> from pylab import *
Anyone remember what that’s doing?
If you’re like me and like the namespace, do this:
>>> import matplotlib.pylab as plt
- Just, if you do this, be sure to add plt. before every plotting call.
  - Ex. plt.plot(x) or plt.clf() or plt.title('something')
Before we start: If you need to clear your plot at any time, just type:
>>> clf()
Let’s get some data
Activity
Download this Google Trends CSV.
- Each row is a week, starting in 2004 and running up to 2012-ish
- The columns are the search terms ‘vampire’, ‘zombie’, ‘flu’, ‘ice cream’
- The numbers are “search volume index” (normalized to ‘flu’).
Now get this data into Python!
- Open a csv.reader (look at last class’ notes)
- Read each row into a list
You should now have a ‘list of lists’. Convert it to a NumPy array of floats, and switch rows for columns. e.g., if it’s in the variable ‘data’, do this:
>>> data = numpy.array(data).astype(float).transpose()
Make sure you understand how that line works!
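Here is a minimal sketch of the whole load-and-convert step. It assumes the file contains only numbers (no header row, no date column), and it reads from a made-up in-memory stand-in rather than the real download, so the values below are placeholders and not real Google Trends data.

```python
import csv
import io

import numpy

# Made-up stand-in for the downloaded file: three weeks, four search terms.
# These numbers are placeholders, NOT real Google Trends values.
sample_csv = io.StringIO(
    "10,20,30,40\n"
    "11,21,31,41\n"
    "12,22,32,42\n"
)

rows = []
for row in csv.reader(sample_csv):
    rows.append(row)              # each row is a list of strings

# Strings -> floats, then transpose so each search term gets its own row.
data = numpy.array(rows).astype(float).transpose()

print(data.shape)                 # (4, 3): 4 search terms, 3 weeks
print(data[0])                    # the first search term, week by week
```

With the real file you would pass an open file object to csv.reader instead of the StringIO object.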
Simple plots
We can do a simple line plot of 1D data with the plot() command. Try this:
>>> plot(data[0])
OR, try this if you’re like me and import matplotlib.pylab as plt:
>>> plt.plot(data[0])
Activity
What did we just plot? How could you do a similar plot for the popularity of the search term ‘zombie’?
Can you plot both the search volumes for ‘vampire’ and ‘zombie’ on the same graph?
Activity
Experiment with the following commands. What do they do to your plot?
grid()
xlabel('This is a label!')
ylabel('Another label!')
title('My title')
axvline(100)
Save your plot to disk as an image.
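One way these commands might fit together in a script is sketched below. It uses matplotlib.pyplot with the namespace style instead of from pylab import *, draws off-screen with the Agg backend, and plots a made-up sine wave in place of data[0]. The output filename is a hypothetical choice.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")             # draw off-screen; drop this for interactive use
import matplotlib.pyplot as plt
import numpy

weeks = numpy.arange(200)
series = numpy.sin(weeks / 10.0)  # made-up stand-in for data[0]

plt.clf()                         # start from a clean figure
plt.plot(series)
plt.grid(True)                    # background grid lines
plt.xlabel('This is a label!')
plt.ylabel('Another label!')
plt.title('My title')
plt.axvline(100)                  # vertical line at x = 100

# Hypothetical output location; the file extension picks the image format.
out_path = os.path.join(tempfile.gettempdir(), 'my_plot.png')
plt.savefig(out_path)
```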
There are a crazy number of options that you can pass to plot(). Like these:
>>> plot(data[0],':')
>>> plot(data[1],'--')
>>> plot(data[2],'r--')
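To see what those format strings actually set, you can hold on to the line object that plot() returns and ask it. This is a small sketch on made-up data, again using matplotlib.pyplot and the off-screen Agg backend.

```python
import matplotlib
matplotlib.use("Agg")             # draw off-screen; drop this for interactive use
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy

y = numpy.arange(10)              # made-up data

plt.clf()
# plot() returns a list of lines; the trailing comma unpacks the single line.
dotted, = plt.plot(y, ':')            # dotted line, default colour
dashed, = plt.plot(y + 1, '--')       # dashed line, default colour
red_dashed, = plt.plot(y + 2, 'r--')  # 'r' picks red, '--' picks dashed

print(dotted.get_linestyle())
print(dashed.get_linestyle())
print(red_dashed.get_color(), red_dashed.get_linestyle())
```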
Activity
- Plot search volume for ‘flu’ (data[2]) against ‘ice cream’ (data[3]).
  - Don’t forget about clf()
Use different line types for the two plots. Use the ‘zoom tool’ to magnify the portion of the graph below y==20.
See any trends worth noting? Visual inspection is a power tool for data analysis.
I wonder if any of our keywords have search volumes that are linearly related to each other?
Pearson Correlation is a good way to check this.
We could compute r-values, for each pair, like this:
>>> import scipy.stats
>>> scipy.stats.pearsonr(data[1],data[0])
(0.7604487911797595, 1.0173257365818087e-87)
...
Or we could be lazy, and compute the full correlation matrix with one command:
>>> cor = numpy.corrcoef(data)
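If you want to convince yourself of what ends up in that matrix, here is a sketch that computes Pearson's r by hand for one pair and compares it with the matching entry. The three series are randomly generated stand-ins, not the search data.

```python
import numpy

rng = numpy.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(size=200)      # related to a, plus noise
c = rng.normal(size=200)          # unrelated to both
fake = numpy.array([a, b, c])     # made-up data: one variable per row

cor = numpy.corrcoef(fake)        # entry [i, j] is r between row i and row j

# Pearson's r for (a, b), straight from the definition.
da = a - a.mean()
db = b - b.mean()
r_ab = (da * db).sum() / numpy.sqrt((da ** 2).sum() * (db ** 2).sum())

print(cor.shape)                              # (3, 3)
print(numpy.allclose(cor[0, 1], r_ab))        # True: same number
print(numpy.allclose(numpy.diag(cor), 1.0))   # True: each row vs. itself
print(numpy.allclose(cor, cor.T))             # True: the matrix is symmetric
```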
Activity
Build the correlation matrix for data. Look at it. What does it tell you?
2D Plots
Let’s look at our correlation matrix visually.
>>> matshow(cor)
Each square is one entry in the 2D array. Pretty intuitive.
We can change colour schemes, too. E.g.:
>>> gray()
>>> hot()
And, if the axis labels are annoying us, or we need a colour scale:
>>> axis('off')
>>> colorbar()
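Put together in the namespace style, those commands might look like the sketch below. It runs off-screen with the Agg backend, and the small random array is a made-up stand-in for cor.

```python
import matplotlib
matplotlib.use("Agg")             # draw off-screen; drop this for interactive use
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(1)
small = rng.random((4, 4))        # made-up 4x4 array standing in for cor

image = plt.matshow(small)        # new figure, one coloured square per entry
plt.hot()                         # switch the colour scheme to 'hot'
plt.axis('off')                   # hide the axis ticks and labels
bar = plt.colorbar()              # add a colour scale for the current image
```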
Activity
Start with a bigger array: r = numpy.random.rand(50,50). Plot this array, using matshow with a colour bar and no axis labels. What happens if you use imshow instead of matshow? (Try zooming WAAAY in).
Histograms and boxplots
Sometimes you want to see the distribution of the values in your data, rather than the values themselves.
Consider these data:
>>> u = numpy.random.rand(1000)
>>> g = numpy.random.normal(size=1000)
If I just plot them, what intuitions do I get? (Assume I don’t know where they came from!)
>>> plot(u)
>>> plot(g)
What about if I plot the distributions of values in u and g?
>>> hist(u)
>>> hist(g)
As usual, hist() has a lot of options.
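Two of those options are sketched here on made-up uniform data: bins takes an explicit list of bin edges, and cumulative=True makes each bar include everything to its left. The bin edges are an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")             # draw off-screen; drop this for interactive use
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(2)
values = rng.random(1000)         # made-up data, uniform on [0, 1)

edges = [0.0, 0.25, 0.5, 0.75, 1.0]   # four bins, chosen by hand

plt.clf()
# hist() returns the bar heights, the bin edges, and the drawn bars.
counts, used_edges, bars = plt.hist(values, bins=edges)

plt.clf()
running, _, _ = plt.hist(values, bins=edges, cumulative=True)

print(counts)                     # how many values fell in each bin
print(running)                    # running total; last bar covers all the data
```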
Activity
Plot a histogram of the data in g, with bins from -2 to -1, -1 to 0, 0 to 1 and 1 to 2.
Plot a cumulative histogram of the data in g (with the default automatically chosen bins) and u. How do they differ?
Let’s create 3 fake sets of experimental data:
>>> d1 = numpy.random.normal(0,10,size=1000)
>>> d2 = numpy.random.normal(5,10,size=1000)
>>> d3 = numpy.random.poisson(size=1000)
Activity
Compare the histograms of d1, d2 and d3.
Scatter plots
Earlier, we used Pearson correlation to investigate relationships in time series data.
A more visual way to investigate this is with a scatter plot:
>>> scatter(d1,d2)
For every pair of datapoints (d1,d2)... we just plot them as if they were the (x,y) co-ordinates of a point.
Let’s fake some correlated data:
>>> d4 = d2 + 1.0 + numpy.random.normal(1,2,size=1000)
- d4 = d2 + a constant offset + some noise
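The same recipe (a signal plus noise) is sketched below on separate made-up data, so you can see how the r-value lines up with the shape of the scatter. The variable names and noise levels are illustrative choices, and r is read from numpy.corrcoef.

```python
import matplotlib
matplotlib.use("Agg")             # draw off-screen; drop this for interactive use
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(42)
x = rng.normal(0, 10, size=500)
related = x + rng.normal(0, 2, size=500)   # x plus a little noise
unrelated = rng.normal(0, 10, size=500)    # drawn independently of x

plt.clf()
points = plt.scatter(x, related)  # each (x[i], related[i]) becomes one dot

# corrcoef on two 1D arrays returns a 2x2 matrix; [0, 1] is r for the pair.
r_related = numpy.corrcoef(x, related)[0, 1]
r_unrelated = numpy.corrcoef(x, unrelated)[0, 1]

print(r_related)                  # close to 1: the dots hug a straight line
print(r_unrelated)                # close to 0: a shapeless cloud
```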
Activity
Scatterplot d2 against d1.
Now scatterplot d2 against d4.
What conclusions can you draw? Back up your conclusions with scipy.stats.pearsonr() on both pairs.
Onward
We’ve barely even scratched the surface of what’s available with Python.
The types of plots that are of interest to you will depend heavily on what your needs are.
You’ve now got the fundamentals to go forth and steal examples wholesale from the internet.
- Yes, I’m advocating this methodology for practical visualization:
- Find an existing visualization in Python that looks close to what you want
- Get the code
- Spend some time figuring out how it works
- Modify it to suit your purposes
- PROFIT!!!
This kleptoprogramming approach is enabled nicely by the Python community’s strong tradition of publishing source.
- Good places to steal ideas (and code) from:
- Matplotlib gallery (click the picture to get the code!)
- Matplotlib cookbook
- Mayavi gallery
- Scipy cookbook (look under “Graphics”)
Activity
Pick an attractive looking plot from one of the galleries above.
Get the code for the plot working on your machine (100% cut and paste).
Now modify the code to visualize one of the variables we worked with in class today.