Introduction to Canonical Correlation Analysis (CCA) in Python

Increasingly, we obtain several high-dimensional datasets from the same samples. Canonical Correlation Analysis, also known as CCA, is ideal for scenarios where you have two high-dimensional datasets from the same samples and want to examine the datasets at the same time. A classic example is audio and video recordings of the same people. One could also think of CCA as another procedure for reducing dimensionality, like Principal Component Analysis (PCA). Unlike PCA, with CCA you are dealing with two sets of data instead of one.

If you’re interested in a little history: CCA was originally developed in the 1930s by Harold Hotelling, the same person who developed PCA.

What is Canonical Correlation Analysis?

In this post we look at an example of how to perform CCA using the Palmer penguin dataset. We will use scikit-learn to perform Canonical Correlation Analysis (CCA). We will not go into the computation behind CCA; instead we will work through a practical example of how CCA is done and build an intuition for interpreting the results.

As explained in a previous article on the use of CCA with R, the idea of CCA can be understood as follows.

Suppose there are one or more latent variables that generate two high-dimensional datasets X and Y. We observe only the X and Y datasets, and we know nothing about the hidden variable or variables behind them. Since both datasets depend on the latent variable, there will be a lot of common or shared variation in the two datasets. With CCA we can identify, within this shared variation, the canonical covariates that are strongly correlated with the unknown latent variable.

In general, the two datasets may also show variation unrelated to the main latent variable. CCA helps us remove this dataset-specific variation, or noise, in both datasets and arrive at canonical covariates that capture the latent variable.
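To make this intuition concrete, here is a minimal sketch (a made-up toy setup, not part of the penguin example below) in which a single hidden variable z generates two noisy datasets, and CCA recovers covariates that track z:

import numpy as np
from sklearn.cross_decomposition import CCA

# Toy example: a single latent variable z drives both observed datasets.
rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)
X_toy = np.column_stack([z + rng.normal(scale=0.5, size=n),
                         -z + rng.normal(scale=0.5, size=n)])
Y_toy = np.column_stack([2*z + rng.normal(scale=0.5, size=n),
                         z + rng.normal(scale=0.5, size=n)])

ca_toy = CCA(n_components=1)
Xc_toy, Yc_toy = ca_toy.fit_transform(X_toy, Y_toy)

# The first canonical covariates are highly correlated with each other and
# with the hidden z, even though z was never given to CCA.
print(np.corrcoef(Xc_toy[:, 0], Yc_toy[:, 0])[0, 1])
print(abs(np.corrcoef(Xc_toy[:, 0], z)[0, 1]))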

Palmer Penguin Data Set for Canonical Correlation Analysis

To start with, let us load Pandas, Matplotlib, NumPy and Seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Let’s load the Palmer penguin data and clean it up a bit by removing the rows with missing values.

link2data = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(link2data)
df = df.dropna()
df.head()

Since canonical correlation analysis involves two high-dimensional datasets, we split the penguin data into two datasets X and Y and use them for the canonical correlation analysis. The X dataset has two variables corresponding to bill length and bill depth.

X = df[["bill_length_mm", "bill_depth_mm"]]
X.head()

   bill_length_mm  bill_depth_mm
0            39.1           18.7
1            39.5           17.4
2            40.3           18.0
4            36.7           19.3
5            39.3           20.6

We also standardize the variables by subtracting the mean and dividing by the standard deviation.

X_mc = (X-X.mean())/(X.std())
X_mc.head()

bill_length_mm bill_depth_mm
0 -0.894695 0.779559
1 -0.821552 0.119404
2 -0.675264 0.424091
4 -1.333559 1.084246
5 -0.858123 1.744400
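As an aside, you could get essentially the same standardized data with scikit-learn's StandardScaler; a small sketch (note that StandardScaler divides by the population standard deviation, while pandas' std() uses the sample standard deviation, so the values differ very slightly):

from sklearn.preprocessing import StandardScaler

# Center and scale each column; StandardScaler uses ddof=0 vs. ddof=1 in pandas' std().
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_scaled.head()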

The second dataset Y contains the flipper length and the body mass.

Y = df[["flipper_length_mm", "body_mass_g"]]
Y.head()

Let’s also standardize the Y dataset.

Y_mc = (Y-Y.mean())/(Y.std())
Y_mc.head()
flipper_length_mm body_mass_g
0 -1.424608 -0.567621
1 -1.067867 -0.505525
2 -0.425733 -1.188572
4 -0.568429 -0.940192
5 -0.782474 -0.691811

Canonical Correlation Analysis with Scikit-learn on Python

We now have two datasets from the same penguins. As we know, the characteristics of penguins differ a lot between species. In our toy example, the species is a latent variable that is common to both the X and Y datasets.

We will use the CCA module of sklearn.cross_decomposition to perform CCA in Python.

from sklearn.cross_decomposition import CCA

We start by instantiating a CCA object and then use fit() and transform() with the two standardized matrices to perform the CCA.

ca = CCA()
ca.fit(X_mc, Y_mc)
X_c, Y_c = ca.transform(X_mc, Y_mc)

And our result is two canonical covariate matrices.

print(X_c.shape)
print(Y_c.shape)
(333, 2)
(333, 2)
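Note that scikit-learn's CCA uses n_components=2 by default, which is why each transformed matrix has two columns here. If you want to be explicit, or if your datasets have more variables, you can pass the number of components yourself; a small sketch equivalent to the call above:

# Equivalent to the default in this example; with wider datasets you could ask
# for more components, up to the smaller number of columns in X and Y.
ca = CCA(n_components=2)
ca.fit(X_mc, Y_mc)
X_c, Y_c = ca.transform(X_mc, Y_mc)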

Understanding the Results of Canonical Correlation Analysis

Let’s dig a little deeper to understand the results of the canonical correlation analysis and build some intuition. First we take the pairs of canonical covariates and see how they relate to each other.

Let’s add the CCA results, together with the metadata from the penguin data, to a Pandas data frame.

cc_res = pd.DataFrame({"CCX_1":X_c[:, 0],
                       "CCY_1":Y_c[:, 0],
                       "CCX_2":X_c[:, 1],
                       "CCY_2":Y_c[:, 1],
                       "Species":df.species.tolist(),
                       "Island":df.island.tolist(),
                       "sex":df.sex.tolist()})

cc_res.head()

      CCX_1     CCY_1     CCX_2     CCY_2 Species     Island     sex
0 -1.186252 -1.408795 -0.010367  0.682866  Adelie  Torgersen    MALE
1 -0.709573 -1.053857 -0.456036  0.429879  Adelie  Torgersen  FEMALE
2 -0.790732 -0.130809 -0.839620       ...  Adelie  Torgersen  FEMALE
3 -1.718663 -0.542888 -0.073623 -0.458571  Adelie  Torgersen  FEMALE
4 -1.772295 -0.763548  0.736248 -0.014204  Adelie  Torgersen    MALE

Let’s examine the correlation between the first pair of canonical covariates. We use NumPy’s corrcoef() function to calculate the correlation. And we see that the first pair of canonical covariates is highly correlated.

import numpy as np
np.corrcoef(X_c[:, 0], Y_c[:, 0])
array([[1.        , 0.78763151],
       [0.78763151, 1.        ]])

We can also calculate the correlation between the second pair of covariates and see that the correlation is not so high.

np.corrcoef(X_c[:, 1], Y_c[:, 1])

array([[1.        , 0.08638695],
       [0.08638695, 1.        ]])
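Since scikit-learn's CCA object does not report the canonical correlations directly, a small helper like the following sketch (using the X_c and Y_c computed above) collects them all at once:

# Correlation between each pair of canonical covariates.
canonical_corrs = [np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]
                   for k in range(X_c.shape[1])]
print(canonical_corrs)  # roughly [0.79, 0.09] for this example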

To better understand the relationship between the pairs of canonical covariates, let’s make a scatterplot with the first pair of canonical covariates.

sns.set_context("talk", font_scale=1.2)
plt.figure(figsize=(10,8))
sns.scatterplot(x="CCX_1",
                y="CCY_1",
                data=cc_res)
plt.title('comp. 1, corr = %.2f' %
          np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])

We find a strong correlation between the first pair of canonical covariates.
Scatterplot of the first pair of canonical covariates

Interpreting Canonical Covariates with a Heatmap

In this example we already know that the species variable in the dataset is the hidden or latent variable. Let’s see how the latent variable relates to the first pair of canonical covariates. First we make a boxplot of each covariate in the first pair against the hidden variable.

plt.figure(figsize=(10,8))
sns.boxplot(x="Species",
            y="CCX_1",
            data=cc_res)
sns.stripplot(x="Species",
              y="CCX_1",
              data=cc_res)

Boxplot of the first canonical covariate of X against the latent variable

plt.figure(figsize=(10,8))
sns.boxplot(x="Species",
            y="CCY_1",
            data=cc_res)
sns.stripplot(x="Species",
              y="CCY_1",
              data=cc_res)

Boxplot of the first canonical covariate of Y against the latent variable
By colouring the scatterplot of the first pair of canonical covariates with the species variable, we can see how well the canonical covariates have captured the main latent variable behind our datasets.

plt.figure(figsize=(10,8))
sns.scatterplot(x="CCX_1",
                y="CCY_1",
                hue="Species", data=cc_res)
plt.title('First pair of canonical covariates, corr = %.2f' %
          np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])

Scatterplot of the first pair of canonical covariates coloured by the latent variable

The correspondence between the first pair of canonical covariates and the species variable in the dataset shows that our canonical correlation analysis has captured the shared variation between the two datasets. In this example, the common or latent variable behind the first pair of canonical covariates is the species variable.

If we look closely at the relationship between the canonical covariates and the species variable, here the latent variable, we can make sense of the results of our CCA. Let’s go one step further and make a heatmap of the correlations between the canonical covariates of each dataset and the input dataset itself, including the corresponding metadata.

Let’s start by creating a data frame with the original data and the canonical covariates from the first dataset X. To calculate the correlations, we convert the character variables into categorical variables and encode them as 0/1/2 codes.

ccX_df = pd.DataFrame({"CCX_1":X_c[:, 0],
                       "CCX_2":X_c[:, 1],
                       "Species":df.species.astype('category').cat.codes,
                       "Island":df.island.astype('category').cat.codes,
                       "sex":df.sex.astype('category').cat.codes,
                       "bill_length":X_mc.bill_length_mm,
                       "bill_depth":X_mc.bill_depth_mm})

Using the corr() function of Pandas, we can calculate the correlation of all variables in the data frame.

corr_X_df = ccX_df.corr(method='pearson')
corr_X_df.head()

Let’s make a heatmap with the lower triangular correlation matrix. For this we mask the upper triangle of the correlation matrix using NumPy’s tril() function.

plt.figure(figsize=(10,8))
X_df_lt = corr_X_df.where(np.tril(np.ones(corr_X_df.shape)).astype(bool))

Using Seaborn’s heatmap function, we can create a heatmap of the lower triangular correlation matrix.

sns.heatmap(X_df_lt, cmap="coolwarm", annot=True, fmt='.1g')
plt.tight_layout()
plt.savefig("Heatmap_Canonical_Correlates_from_X_en_data.jpg",
            format='jpeg',
            dpi=100)

The heatmap showing the correlations of the canonical covariates from the X dataset reveals many interesting details. We see that, as expected, there is no correlation between the first and second canonical covariates of the X dataset. Note that the first canonical covariate is strongly correlated with both variables in the X dataset, positively with bill length and negatively with bill depth.

As we have already seen, the first canonical covariate is strongly correlated with the species variable, the latent or hidden variable in this example. The first canonical covariate of X is also correlated with another latent variable behind the dataset, the island variable. We also see that the first canonical covariate is not correlated with the sex variable. However, the second canonical covariate is moderately correlated with the sex variable.

Heatmap of correlations between the canonical covariates from X and the X dataset
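Besides correlating the covariates with the original columns by hand, the fitted CCA object also exposes loadings (x_loadings_ and y_loadings_ in scikit-learn) that show how much each original variable contributes to each canonical covariate; a quick sketch using the ca object fitted above:

# Loadings of the standardized variables on the two canonical covariates.
print(pd.DataFrame(ca.x_loadings_, index=X_mc.columns, columns=["CC1", "CC2"]))
print(pd.DataFrame(ca.y_loadings_, index=Y_mc.columns, columns=["CC1", "CC2"]))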

Let us construct a similar lower triangular correlation heatmap with the canonical covariates from the Y dataset and the Y dataset itself, including the associated metadata.

# canonical covariates from the Y dataset together with the dataset itself
ccY_df = pd.DataFrame({"CCY_1":Y_c[:, 0],
                       "CCY_2":Y_c[:, 1],
                       "Species":df.species.astype('category').cat.codes,
                       "Island":df.island.astype('category').cat.codes,
                       "sex":df.sex.astype('category').cat.codes,
                       "flipper_length":Y_mc.flipper_length_mm,
                       "body_mass":Y_mc.body_mass_g})

# Calculate the correlation with pandas corr()
corr_Y_df = ccY_df.corr(method='pearson')

# Determine the lower triangular correlation matrix
Y_df_lt = corr_Y_df.where(np.tril(np.ones(corr_Y_df.shape)).astype(bool))

# Create a lower triangular correlation heatmap with Seaborn
plt.figure(figsize=(10,8))
sns.heatmap(Y_df_lt, cmap="coolwarm", annot=True, fmt='.1g')
plt.tight_layout()
plt.savefig("Heatmap_Canonical_Correlates_from_Y_en_data.jpg",
            format='jpeg',
dpi=100)

We observe a pattern similar to what we saw in the correlation heatmap of the canonical covariates from the X dataset. As before, the first canonical covariate of the Y dataset is strongly correlated with the species variable. And the second canonical covariate of the Y dataset is correlated with the sex variable, suggesting that we can capture the effects of two different latent variables using canonical correlation analysis.

Heatmap of correlations between the canonical covariates from Y and the Y dataset

Second Pair of Canonical Covariates Reveals a Second Latent Variable

The two heatmaps showing the correlations between the canonical covariates and the datasets suggest that sex is another variable that affects both datasets, and that CCA can account for it. To see the effect of sex, we can make a scatterplot with the second pair of canonical covariates and colour it by the sex variable.

plt.figure(figsize=(10,8))
sns.scatterplot(x="CCX_2",
                y="CCY_2",
                hue="sex", data=cc_res)
plt.title('Second pair of canonical covariates, corr = %.2f' %
          np.corrcoef(X_c[:, 1], Y_c[:, 1])[0, 1])

Scatterplot of the second pair of canonical covariates coloured by sex

Python CCA Example Summary

To summarize what we have seen, canonical correlation analysis is an excellent tool for understanding pairs of high-dimensional datasets from the same samples. Using the penguin data as a toy example, this post showed how to perform CCA with scikit-learn in Python. We also saw how to interpret and understand the pairs of canonical covariates obtained by examining the two datasets at the same time.

In this post we have not discussed the mathematics or the algorithm behind CCA; it would be interesting to cover that in a future post. Moreover, while the penguin dataset was ideal for illustrating CCA, it would be even more fun to apply CCA to more complex/realistic high-dimensional datasets. Great ideas for a few more posts.

