Basic Concepts involved in Multivariate Analysis
Data Analysis, Data Science, Machine Learning: I am fairly confident that nowadays everyone has come across these terms, and many people actually work in one of these fields.
But what exactly is this craze for ‘dealing’ with data?
It all started not too long ago, when scientists observed data growing at an extraordinary, roughly exponential pace. The volume of data has increased so much over the past few decades that it became very difficult for scientists to ‘handle’ it by themselves, which helped drive the rise of fields such as Machine Learning, Data Science and Artificial Intelligence. With these technologies, organisations like Google, Netflix and Facebook, which have to deal with tons of data, can handle it properly and excel at their jobs.
The growth in data increased the demand for Data Scientists and Data Analysts, and this might be the reason you are reading this article 🙂
Now a question you might ask: what does Data Analysis have to do with all these AI-driven fields?
The answer is ‘EVERYTHING’.
Data Analysis forms the basis of Data Science: Data Science techniques can only be applied on top of sound Data Analysis.
Look at it this way: we have a lot of data available, but in reality most of it is junk and cannot be fed directly into a machine learning algorithm.
We must first ‘rectify’ the data and then use it.
In other words, Data Analysis helps us convert data into ‘knowledge’.
DATA ANALYSIS
Definition : Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
In this article I will discuss Multivariate Analysis and the basic concepts that constitute it.
MULTIVARIATE ANALYSIS
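Definition : Multivariate analysis deals with the statistical analysis of data in which several variables are measured on each individual or object; in the stricter sense, it refers to analysis involving more than one dependent variable.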
Multivariate techniques are popular because they help organizations to turn data into knowledge and thereby improve their decision making.
Most multivariate analysis techniques are extensions of univariate analysis (analysis of a single variable) and bivariate analysis (techniques used to analyze two variables). Other multivariate techniques, such as discriminant analysis or factor analysis, were created solely to deal with multivariate problems.
To be considered truly multivariate, all the variables must be random and interrelated in such a way that their different effects cannot meaningfully be interpreted separately.
The Basic concepts of Multivariate Analysis
There are some key concepts that constitute multivariate analysis:
The Variate
The building block of multivariate analysis is the variate. It is defined as a weighted sum (a linear combination) of the variables, where the weights are determined by the multivariate technique. The variate of n weighted variables (X1 to Xn) can be written as:
Variate = X1*W1 + X2*W2 + X3*W3 + … + Xn*Wn
where X1, X2, …, Xn are the observed variables and
W1, W2, …, Wn are the weights.
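To make the formula concrete, here is a minimal sketch of the weighted sum in Python. The function name `variate`, the observation and the weights are all made up for illustration; in a real analysis the weights would come out of the multivariate technique itself.

```python
def variate(values, weights):
    """Return the weighted sum X1*W1 + X2*W2 + ... + Xn*Wn."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights))


# Hypothetical observation with three variables and illustrative weights.
observation = [2.0, 5.0, 1.0]
weights = [0.5, 0.2, 3.0]

score = variate(observation, weights)
print(score)  # 2.0*0.5 + 5.0*0.2 + 1.0*3.0
```

Each observation in the data set gets one such score, so the variate collapses many variables into a single value per observation.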
But what are these variates actually used for?
These variates capture the multivariate character of the analysis; thus, in each technique, the variate acts as the focal point of the analysis.
For example, in multiple regression, the weights are chosen so that the correlation between the variate (the weighted combination of the independent variables) and the dependent variable is maximized.
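The regression case can be checked numerically. The sketch below fits ordinary least squares weights for two predictors by solving the normal equations by hand, then compares the resulting variate with one built from arbitrary weights. The data and the helper names (`pearson`, `ols_weights`, `build_variate`) are invented for this illustration and are not part of any library.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def ols_weights(x1, x2, y):
    """Least-squares weights (w1, w2) for y ~ w1*x1 + w2*x2 + intercept."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 * s12  # zero if x1 and x2 are perfectly collinear
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


def build_variate(x1, x2, w1, w2):
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


# Made-up data: two independent variables and one dependent variable.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.9]

w1, w2 = ols_weights(x1, x2, y)
r_ols = pearson(build_variate(x1, x2, w1, w2), y)
r_guess = pearson(build_variate(x1, x2, 1.0, 0.1), y)  # arbitrary weights

print(f"regression variate: r = {r_ols:.4f}")
print(f"arbitrary variate:  r = {r_guess:.4f}")
```

No other choice of weights can produce a variate that correlates more strongly with y than the least-squares one, which is why the regression variate is the focal point of that technique.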
Measurement Scales
Multivariate analysis deals with data that has multiple variables, and the entries in these variables might be measured on different scales. Hence ‘measurement’ of the data becomes essential. Measurement is important for accurately representing the concept of interest and is instrumental in selecting the appropriate multivariate technique.
Data can be classified into 2 categories :
non-metric(qualitative)
metric(quantitative)
Determining whether each variable is non-metric or metric is essential, as it can change the whole analysis. This identification must be done by the analyst, because to a computer everything is just numbers.
Non-Metric Measurement Scales
Non-metric data describes differences in type or kind by indicating presence or absence of a characteristic or property.
These properties are discrete: having one particular feature excludes all the other features.
For example: gender. If a person is male, then he cannot be female.
Performing arithmetic operations on non-metric data is meaningless, because the values do not express quantities that can be compared; they only indicate the presence or absence of a feature.
Non-metric measurements can be made on nominal or ordinal scales.
Nominal Scales
These scales are also called categorical scales.
A nominal scale assigns numbers to objects in order to label or identify them. The numbers assigned have no quantitative meaning beyond indicating the presence or absence of the attribute.
For example: for gender we can assign the value 0 to male and 1 to female, so that when we see 0 in the gender column we know the person is male. But performing any arithmetic operation is meaningless: we cannot compute 0 + 1 = 1 and conclude that male + female = female.
Nominal scales do not provide a measure for comparing the data, as they only indicate presence or absence.
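A quick way to see why arithmetic on nominal codes is unsafe: relabel the categories with different numbers and watch the "average" change, while the counts stay the same. The sample and the second coding below are hypothetical; only the 0/1 coding comes from the example above.

```python
from collections import Counter

# Hypothetical sample of a nominal variable.
genders = ["male", "female", "female", "male", "female"]

coding_a = {"male": 0, "female": 1}  # the coding used in the text
coding_b = {"male": 7, "female": 3}  # an equally valid set of labels

codes_a = [coding_a[g] for g in genders]
codes_b = [coding_b[g] for g in genders]

# The arithmetic mean depends entirely on the arbitrary labels we picked.
mean_a = sum(codes_a) / len(codes_a)
mean_b = sum(codes_b) / len(codes_b)
print(mean_a, mean_b)

# Counting category membership is independent of the coding.
counts = Counter(genders)
most_common = counts.most_common(1)[0][0]
print(counts, most_common)
```

Counts and the most frequent category survive any relabeling, which is why they are the natural summaries for nominal data.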
Ordinal Scales
With ordinal scales we can order or rank the values of a variable according to the amount of the attribute they signify.
Here the objects can be compared only with the relational operators greater than (>), less than (<) and equal to (=).
Ordinal scales only give the relative order of the objects; they do not give the exact amount or magnitude in absolute terms.
For example, consider the figure:
Let us say we have three cars (A, B and C) marked on a scale that represents their speed and ranges from slow to fast.
When viewed as ordinal data, we can say that C is faster than B and B is faster than A, or
speed of C > speed of B and speed of B > speed of A.
The speeds of the cars can be compared using relational operators, but we cannot infer the actual magnitude of the difference between them, i.e., we cannot say that the difference between the speeds of A and B is more than the difference between the speeds of B and C.
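The same point can be sketched in code. Below, the three cars receive two different sets of rank numbers that both respect the order A < B < C. The specific numbers are arbitrary choices for illustration.

```python
# Two order-preserving ways to number the same ranking of cars A, B and C.
ranks_1 = {"A": 1, "B": 2, "C": 3}
ranks_2 = {"A": 1, "B": 2, "C": 10}


def ordering(ranks):
    """Return the car names sorted from slowest to fastest."""
    return sorted(ranks, key=ranks.get)


def gaps(ranks):
    """Return the rank differences (B - A, C - B)."""
    return (ranks["B"] - ranks["A"], ranks["C"] - ranks["B"])


# Comparisons agree under both numberings...
same_order = ordering(ranks_1) == ordering(ranks_2)
print(ordering(ranks_1), same_order)

# ...but the differences between ranks do not, so they carry no information.
print(gaps(ranks_1), gaps(ranks_2))
```

Since both numberings are equally legitimate for ordinal data, anything that depends on the size of the gaps is not a property of the data.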
We cannot perform arithmetic operations on non-metric data, which limits its use in estimating model coefficients. For this reason, many multivariate techniques are devised solely to deal with non-metric data (e.g., correspondence analysis) or to use non-metric data as independent variables. Thus we must identify all non-metric data to ensure that they are used appropriately in the multivariate techniques.
Metric Measurement Scales
Metric data is used when the objects differ in amount or degree on a particular attribute. Metric data represents relative quantity or degree and is appropriate for attributes involving magnitude or amount, such as the speed of a car in km/h or the prices of different houses.
The two different metric measurement scales are :
Interval Scale
Ratio Scale
Both these scales provide the most precise level of measurement, permitting nearly any mathematical operation to be performed.
Interval Scale
An interval scale is used to measure metric data (duh!). It expresses values in terms of magnitude rather than just order.
Consider the previous example where we had 3 cars with different speeds.
Using the interval scale we can now find the exact values of the speeds of each car and also determine the exact difference between the speeds of any 2 cars.
The above example clearly shows that the measurements have become precise using the interval scale.
The difference between the interval scale and the ratio scale is that a ratio scale has an absolute zero point, whereas an interval scale has an arbitrary zero point. The most common interval scales are the Celsius and Fahrenheit temperature scales.
Neither of the two scales has an absolute zero point: zero on the Celsius or Fahrenheit scale does not represent absolute zero temperature (which is 0 Kelvin, not 0 °C or 0 °F), and zero on one scale is not equal to zero on the other. We also cannot state that any value on an interval scale is a multiple of some other point on the scale.
For example, we cannot say that 100 °C is twice as hot as 50 °C.
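We can verify this with the standard temperature conversion formulas. In the sketch below (the function names are my own), the ratio between the two temperatures changes with the scale, while the difference between them only changes by the fixed factor of the unit conversion.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


hot_c, warm_c = 100.0, 50.0

# The ratio depends on where each scale puts its zero.
ratio_c = hot_c / warm_c
ratio_f = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(warm_c)
ratio_k = celsius_to_kelvin(hot_c) / celsius_to_kelvin(warm_c)
print(f"C: {ratio_c:.3f}  F: {ratio_f:.3f}  K: {ratio_k:.3f}")

# The difference is meaningful: 50 Celsius degrees is always 90 Fahrenheit degrees.
diff_c = hot_c - warm_c
diff_f = celsius_to_fahrenheit(hot_c) - celsius_to_fahrenheit(warm_c)
print(diff_c, diff_f)
```

Only the Kelvin ratio reflects a physical statement, because Kelvin starts at absolute zero; the apparent "twice as hot" in Celsius is an artifact of the arbitrary zero point.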
Ratio Scale
A ratio scale provides the most precise measurements because it has an absolute zero point. All mathematical operations are permissible on measurements made on ratio scales.
Height, age, money and weight are some examples of ratio scales, because any measured value can be written as a multiple of some other measured value: 120 inches on a height scale is twice 60 inches. These scales also possess an absolute zero.
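A small check of the height example: on a ratio scale, changing the unit does not change the ratio between two measurements. The sketch uses the standard conversion of 2.54 cm per inch; the variable names are illustrative.

```python
import math

CM_PER_INCH = 2.54

tall_in, short_in = 120.0, 60.0
tall_cm, short_cm = tall_in * CM_PER_INCH, short_in * CM_PER_INCH

ratio_in = tall_in / short_in
ratio_cm = tall_cm / short_cm

# "Twice as tall" holds in inches and in centimetres alike.
print(ratio_in, ratio_cm, math.isclose(ratio_in, ratio_cm))
```

This works because both units share the same zero point: zero inches and zero centimetres both mean no height at all.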
Importance of choosing the right measurement scale
The following points show the importance of choosing the right scale:
- The type of scale differentiates non-metric data from metric data. If we incorrectly treat non-metric data as metric, the technique will perform arithmetic on arbitrary codes and the results will be meaningless.
- The measurement scale often also indicates which multivariate technique is most appropriate to use.
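As a small-scale illustration of the second point, even the choice of a summary statistic depends on the scale. The mapping below follows a common rule of thumb (mode for nominal, median for ordinal, mean for interval and ratio). It is an assumption added for this sketch rather than something prescribed by a specific multivariate technique, and the names `SUMMARY_BY_SCALE` and `summarize` are hypothetical.

```python
from statistics import mean, median_low, mode

# Rule-of-thumb summary statistic for each measurement scale (assumed mapping).
SUMMARY_BY_SCALE = {
    "nominal": mode,         # most frequent category
    "ordinal": median_low,   # middle rank, always an observed value
    "interval": mean,
    "ratio": mean,
}


def summarize(values, scale):
    """Summarize values with a statistic that suits the given scale."""
    if scale not in SUMMARY_BY_SCALE:
        raise ValueError(f"unknown scale: {scale!r}")
    return SUMMARY_BY_SCALE[scale](values)


# Made-up samples, one per scale.
print(summarize(["red", "blue", "red"], "nominal"))
print(summarize([1, 3, 2, 3], "ordinal"))
print(summarize([20.0, 22.0, 24.0], "interval"))
print(summarize([60.0, 120.0], "ratio"))
```

Declaring the scale explicitly, instead of letting the code infer it from the data type, mirrors the earlier point that the analyst has to make this decision because the computer only sees numbers.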
Credit: BecomingHuman By: Shubhankar Rawat