Correlation Plot#

The corr_plot builder takes a dataframe (either a pandas DataFrame or a plain Python dict of columns) as input and builds a correlation plot.

The builder lets you combine ‘tile’, ‘point’ and ‘label’ layers in a matrix of ‘full’, ‘lower’ or ‘upper’ type.

A call to the terminal build() method creates the resulting ‘plot’ object. This object can be refined further with the regular Lets-Plot (ggplot) API, for example + ggtitle(), + ggsize() and so on.
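To make concrete what a correlation plot displays, the sketch below computes pairwise correlation coefficients for a plain Python dict of columns. It assumes the Pearson coefficient and uses only the standard library. The helpers `pearson` and `corr_matrix` and the sample values are hypothetical and are not part of Lets-Plot, and this is not a description of how corr_plot computes its values internally.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def corr_matrix(data):
    """Map every (column, column) pair of a dict of columns to its coefficient."""
    return {(a, b): pearson(data[a], data[b]) for a in data for b in data}

# Hypothetical sample: a plain dict of equal-length numeric columns.
sample = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with x
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as x rises
}
matrix = corr_matrix(sample)
```

Each cell of the plot corresponds to one entry of such a matrix: here the pair ("x", "y") is perfectly positively correlated and ("x", "z") perfectly negatively correlated.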

The Ames Housing dataset for this demo was downloaded from House Prices - Advanced Regression Techniques (train.csv), (c) Kaggle.

import numpy as np
import pandas as pd

from lets_plot import *
from lets_plot.bistro.corr import *
LetsPlot.setup_html()
mpg_df = pd.read_csv('https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/mpg.csv')\
    .drop(columns=['Unnamed: 0']).select_dtypes(include=np.number)
print(mpg_df.shape)
mpg_df.head()
(234, 5)
   displ  year  cyl  cty  hwy
0    1.8  1999    4   18   29
1    1.8  1999    4   21   29
2    2.0  2008    4   20   31
3    2.0  2008    4   21   30
4    2.8  1999    6   16   26

Combining ‘tile’, ‘point’ and ‘label’ layers.#

When combining layers, corr_plot chooses an acceptable plot configuration by default.

gggrid([
    corr_plot(mpg_df).tiles().build() + ggtitle("Tiles"),
    corr_plot(mpg_df).points().build() + ggtitle("Points"), 
    corr_plot(mpg_df).tiles().labels().build() + ggtitle("Tiles and labels"),
    corr_plot(mpg_df).points().labels().tiles().build() + ggtitle("Tiles, points and labels")
], ncol=2)

The default plot configuration adapts to the options you set: compare the ‘Tiles and labels’ plots above and below.

You can also override the default plot configuration with the ‘type’ parameter: compare the ‘Tiles, points and labels’ plots above and below.

gggrid([
    corr_plot(mpg_df).tiles().labels(color="white").build() + ggtitle("Tiles and labels"),
    (corr_plot(mpg_df)
     .tiles(type="upper")
     .points(type="lower")
     .labels(type="full").build() + ggtitle("Tiles, points and labels"))
], ncol=2)
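The difference between the ‘full’, ‘lower’ and ‘upper’ matrix types can be sketched by listing which cells each one keeps. This is a conceptual illustration only: the `cells` helper is hypothetical, and the convention that ‘lower’ means "row index greater than column index" is an assumption about orientation, not something taken from the Lets-Plot source.

```python
def cells(variables, matrix_type="full", diag=True):
    """Return the (row, column) name pairs kept for a given matrix type."""
    kept = []
    for i, row in enumerate(variables):
        for j, col in enumerate(variables):
            if i == j:
                keep = diag              # diagonal: a variable against itself
            elif matrix_type == "lower":
                keep = i > j             # assumed orientation of the lower triangle
            elif matrix_type == "upper":
                keep = i < j             # assumed orientation of the upper triangle
            else:                        # "full"
                keep = True
            if keep:
                kept.append((row, col))
    return kept

columns = ["displ", "year", "cyl", "cty", "hwy"]
lower = cells(columns, "lower", diag=False)
upper = cells(columns, "upper", diag=False)
full = cells(columns, "full")
```

Because the two triangles never share a cell, layers of type ‘upper’ and ‘lower’ can sit side by side in one plot without overlapping, while a ‘full’ layer covers both.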

Customizing colors.#

Instead of the default blue-grey-red gradient, you can define your own lower-middle-upper colors or choose one of the available ‘Brewer’ diverging palettes.

Let’s create a gradient resembling one of the Seaborn gradients.

bld = corr_plot(mpg_df).points().labels().tiles()

# Configure gradient resembling one of Seaborn gradients.
gradient = (bld
            .palette_gradient(low='#417555', mid='#EDEDED', high='#963CA7')
            .build()) + ggtitle("Custom gradient")

# Configure Brewer 'BrBG' palette.
brewer = (bld
            .palette_BrBG()
            .build()) + ggtitle("Brewer")
gggrid([
    gradient,
    brewer
], ncol=2)
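As a rough picture of what a low-mid-high gradient does, the sketch below maps a correlation value in [-1, 1] to a color by linear interpolation in RGB. The function names are hypothetical, and plain RGB interpolation is an assumption made for illustration: Lets-Plot may interpolate differently, so intermediate colors in the actual plot can differ.

```python
def hex_to_rgb(color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of ints to '#RRGGBB'."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)

def diverging_color(r, low, mid, high):
    """Map r = -1 to `low`, r = 0 to `mid`, r = 1 to `high`, linearly in between."""
    end = high if r >= 0 else low
    t = abs(r)  # distance from the midpoint, 0..1
    a, b = hex_to_rgb(mid), hex_to_rgb(end)
    return rgb_to_hex(tuple(round(x + (y - x) * t) for x, y in zip(a, b)))

# The same three colors as in the custom gradient above.
palette = dict(low="#417555", mid="#EDEDED", high="#963CA7")
strong_negative = diverging_color(-1.0, **palette)
uncorrelated = diverging_color(0.0, **palette)
strong_positive = diverging_color(1.0, **palette)
```

The endpoints reproduce the three configured colors exactly; values in between blend from the middle color toward one of the two ends, which is what makes a diverging palette suitable for signed correlations.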

Correlation plot with large number of variables in dataset.#

The Kaggle House Prices dataset contains 81 variables. Only the numeric ones are kept below, which leaves 38 columns.

housing_df = pd.read_csv("https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/Ames_house_prices_train.csv")\
    .select_dtypes(include=np.number)
print(housing_df.shape)
housing_df.head()
(1460, 38)
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
0 1 60 65.0 8450 7 5 2003 2003 196.0 706 ... 0 61 0 0 0 0 0 2 2008 208500
1 2 20 80.0 9600 6 8 1976 1976 0.0 978 ... 298 0 0 0 0 0 0 5 2007 181500
2 3 60 68.0 11250 7 5 2001 2002 162.0 486 ... 0 42 0 0 0 0 0 9 2008 223500
3 4 70 60.0 9550 7 5 1915 1970 0.0 216 ... 0 35 272 0 0 0 0 2 2006 140000
4 5 60 84.0 14260 8 5 2000 2000 350.0 655 ... 192 84 0 0 0 0 0 12 2008 250000

5 rows × 38 columns

A correlation plot that shows every correlation in this dataset is too large to be of much use.

corr_plot(housing_df).tiles(type='lower').palette_BrBG().build()
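A quick count shows why the plot gets crowded. The arithmetic below uses the 38 numeric columns reported by the shape printed earlier; the variable names are just for illustration.

```python
n_vars = 38  # numeric columns in housing_df, per the printed shape (1460, 38)

full_cells = n_vars * n_vars               # every cell of a 'full' matrix
diagonal_cells = n_vars                    # each variable against itself
unique_pairs = n_vars * (n_vars - 1) // 2  # one triangle, diagonal excluded

print(full_cells, diagonal_cells, unique_pairs)  # 1444 38 703
```

Even a single triangle without the diagonal holds 703 distinct pairs, far more than can be read comfortably in one figure.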

The threshold parameter.#

The ‘threshold’ parameter lets us specify a minimum absolute correlation value: correlations weaker than this are dropped, and variables left with no remaining correlations are not shown.

(corr_plot(housing_df, threshold=.5).tiles(diag=False).palette_BrBG().build() 
 + ggtitle("Threshold: 0.5")
 + ggsize(550, 400))
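The effect of a threshold can be sketched in plain Python. The `apply_threshold` helper and the coefficients below are hypothetical (they are not computed from the housing data), and treating the boundary as inclusive (`>=`) is an assumption rather than documented behavior.

```python
def apply_threshold(correlations, threshold):
    """Keep pairs with |r| >= threshold and list the variables that still appear."""
    kept = {pair: r for pair, r in correlations.items() if abs(r) >= threshold}
    variables = sorted({name for pair in kept for name in pair})
    return kept, variables

# Hypothetical coefficients for four variables.
correlations = {
    ("a", "b"): 0.85,
    ("a", "c"): -0.62,
    ("b", "c"): 0.10,
    ("c", "d"): 0.30,
}
kept_05, vars_05 = apply_threshold(correlations, 0.5)
kept_08, vars_08 = apply_threshold(correlations, 0.8)
```

With a threshold of 0.5 the strong negative pair ("a", "c") survives because the comparison uses the absolute value, while "d" disappears entirely; raising the threshold to 0.8 leaves only "a" and "b".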

Let’s further increase our threshold in order to see only highly correlated variables.

(corr_plot(housing_df, threshold=.8)
 .tiles(diag=False)
 .palette_BrBG().build() 
 + ggtitle("Threshold: 0.8")
 + ggsize(550, 400))