Multivariate Analysis Techniques for Exploring Data
Most problems we deal with have multiple variables. To analyze these variables before they can be used in training a machine learning framework, we need to analytically explore the data. A fast and easy way to do this is bivariate analysis, wherein we simply compare two variables against each other. This can be in the form of simple two-dimensional plots and t-tests.
However, comparing only two variables at a time does not give deep insights into the nature of variables and how they interact with each other. Consider the Curiosity rover recently launched by NASA. It is using laser-induced breakdown spectroscopy (LIBS) to analyze the chemical composition of the rocks in the Gale Crater region of Mars. Now, this data is highly multivariate in nature with over 6000 variables per sample. Imagine plotting two-dimensional graphs to understand the patterns in the data! This is where the need to understand and implement multivariate analysis techniques comes in.
We now look at some of these techniques in detail.
Pairwise plots are a great way to look at multi-dimensional data, and at the same time maintain the simplicity of a two-dimensional plot. As shown in the figure below, it allows the analysts to view all combinations of the variables, each in a two-dimensional plot. In this way, they can visualize all the relations and interactions among the variables on one single screen.
While there are various ways of visualizing multi-dimensional data, spider plots are one of the easiest ways to decipher the meaning of data. From the figure below, we can see how easily we can compare three mobile phones based on attributes such as their speed, screen, camera, memory and apps.
Often, data sets contain variables that are either related to each other or derived from each other. It is important to understand these relations that exist in the data. In statistical terms, correlation can be defined as the degree to which a pair of variables are linearly related. In some cases, it is easy for the analyst to understand that the variables are related, but in most cases, it isn’t. Thus, performing a correlation analysis is very critical while examining any data. Furthermore, feeding data which has variables correlated to one another is not a good statistical practice, since we are providing multiple weightage to the same type of data. To prevent such issues, correlation analysis is a must.
In many business scenarios, the data belongs to different types of entities; and fitting all of them into a single model might not be the best thing to do. For example, in a bank dataset, the customers might belong to multiple income groups which leads to different spending behaviors. If we use the data having all these customers into a single model, we would be comparing apples to oranges. In that regard, clustering provides analysts a good way to segment their data and therefore avoid this problem. Clustering also allows us to visually understand and therefore compare the different attributes of the segments formed.
K-means clustering is a well-renowned approach used by a lot of data analysts and scientists. This separates the data points into clusters such that the inter-cluster distances are maximized. What this means is that each point in a particular cluster is similar to every other point in that cluster; and, points in a particular cluster are very different from every point in any other cluster. Other popular approaches for clustering include the hierarchical clustering algorithm, the DBSCAN algorithm, Partitioning Around Medoids (PAM) algorithm, etc.
MANOVA (Multivariate Analysis of Variance)
This technique is best suited for use when we have multiple categorical independent variables; and two or more metric dependent variables. While the simple ANOVA (Analysis of Variance) examines the difference between groups by using t-tests for two means and F-test otherwise, MANOVA assesses the relationship between the set of dependent features across a set of groups. For example, this technique is suitable when we want to compare two or more dishes in a restaurant against each other, in terms of the level of spiciness, the time taken to cook and value for money, etc.