
Contingency Table | Cross Classification: Introduction

A bivariate relationship is defined by the joint distribution of the two associated random variables.

Contingency Tables

Let X and Y be two categorical response variables, where X has I levels and Y has J levels. There are I\times J possible combinations of classifications. The response (X, Y) of a subject randomly chosen from some population has a probability distribution, which can be displayed in a rectangular table having I rows (for the categories of X) and J columns (for the categories of Y). The cells of this table represent the IJ possible outcomes. The cell probability \pi_{ij} denotes the probability that (X, Y) falls in the cell in row i and column j. When the cells contain frequency counts of outcomes, the table is called a contingency or cross-classification table and is referred to as an I by J (I \times J) table.

The probability distribution \{\pi_{ij}\} is the joint distribution of X and Y. The marginal distributions are the row and column totals obtained by summing the joint probabilities. For the row variable (X) the marginal probability is denoted by \pi_{i+} and for the column variable (Y) by \pi_{+j}, where the subscript "+" denotes the sum over the index it replaces; that is, \pi_{i+}=\sum_j \pi_{ij} and \pi_{+j}=\sum_i \pi_{ij}, satisfying

\sum_{i} \pi_{i+} =\sum_{j} \pi_{+j} = \sum_i \sum_j \pi_{ij}=1

Note that the marginal distributions contain single-variable information only; they do not describe the association between the variables.

In many contingency tables, one variable (say, Y) is a response and the other (X) is an explanatory variable. When X is fixed rather than random, the notion of a joint distribution for X and Y is no longer meaningful. However, for a fixed level of X, the variable Y has a probability distribution, and it is germane to study how this conditional distribution of Y changes as the level of X changes.
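To make the notation concrete, here is a small Python sketch that computes the marginal and conditional distributions from a table of joint probabilities \pi_{ij}. The 2 × 3 joint distribution is made up purely for illustration.

```python
# Hypothetical joint distribution pi[i][j] for a 2 x 3 table (rows = X, cols = Y)
pi = [[0.10, 0.20, 0.10],
      [0.25, 0.25, 0.10]]

# Marginal distributions: sum over the index the "+" replaces
pi_row = [sum(row) for row in pi]                         # pi_{i+}
pi_col = [sum(row[j] for row in pi) for j in range(3)]    # pi_{+j}

# Row totals, column totals, and the grand total all sum to 1
total = sum(pi_row)

# Conditional distribution of Y given X = 1 (first row): pi_{1j} / pi_{1+}
cond_Y_given_X0 = [p / pi_row[0] for p in pi[0]]

print([round(p, 2) for p in pi_row])   # row marginals
print([round(p, 2) for p in pi_col])   # column marginals
print(round(total, 10))                # grand total
```

The conditional probabilities in the last step are exactly the "probability distribution of Y for a fixed level of X" discussed above.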

Matrix Introduction

Matrices are everywhere. If you have used a spreadsheet program such as MS-Excel or Lotus, written a table (such as in MS-Word), or used mathematical or statistical software such as Mathematica, Matlab, Minitab, SAS, SPSS, or EViews, you have used a matrix.

Matrices make the presentation of numbers clearer and make calculations easier to program. For example, the matrix below records the sale of tires in a particular store by quarter and make of tires.

             Q1  Q2  Q3  Q4
Tirestone    21  20   3   2
Michigan      5  11  15  24
Copper        6  14   7  28

It is called a matrix because the information is stored in a particular order and different computations can be performed on it. For example, to find how many Michigan tires were sold in Quarter 3, go along the row 'Michigan' to the column 'Q3' and find that it is 15.

Similarly, the total sales of 'Michigan' tires can be found by adding all the elements from Q1 to Q4 in the Michigan row; the sum is 55. So, a matrix is a rectangular array of elements. The elements of a matrix can be symbolic expressions or numbers. Matrix [A] is denoted by

[A]=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}

Row i of the matrix [A] has n elements and is [a_{i1}, a_{i2}, \cdots, a_{in}], and column j of [A] has m elements and is [a_{1j}, a_{2j}, \cdots, a_{mj}]^T.

The size (order) of a matrix is defined by the number of rows and columns it contains. If a matrix [A] has m rows and n columns, its size is denoted by (m \times n). The matrix can also be written as [A]_{m \times n} to show that [A] has m rows and n columns.

Each entry in the matrix is called an element (or entry) of the matrix and is denoted by a_{ij}, where i is the row number and j is the column number of the element.

The tire-sales information arranged above can be denoted by the matrix [A]. This matrix has 3 rows and 4 columns, so the order (size) of the matrix is 3 \times 4. Note that the element a_{23} indicates the sales of 'Michigan' tires in quarter 3 (Q3).
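As a sketch, the tire-sales table can be stored as a nested list in Python (rows in the order Tirestone, Michigan, Copper), and the element a_{23} is then just an index lookup, remembering that Python indexes from 0.

```python
# Tire sales by quarter: rows = makes, columns = Q1..Q4
A = [[21, 20,  3,  2],    # Tirestone
     [ 5, 11, 15, 24],    # Michigan
     [ 6, 14,  7, 28]]    # Copper

a23 = A[1][2]               # row 2 (Michigan), column 3 (Q3) -> 15
michigan_total = sum(A[1])  # total Michigan sales, Q1..Q4 -> 55

rows, cols = len(A), len(A[0])  # order of the matrix: 3 x 4
print(a23, michigan_total, rows, cols)  # 15 55 3 4
```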


Number System

In early civilizations, people kept track of the number of animals (sheep, goats, camels, etc.) or children they had using various methods, such as matching the number of animals with a number of stones. Similarly, they counted children with notches tied on a string or marks on a piece of wood, leather, or a wall. As humans developed, other uses for numerals were found, which led to the invention of the number system.

Natural Numbers

Natural numbers are used to count the number of subjects or objects. Natural numbers are also called counting numbers. The numbers 1, 2, 3, \cdots are all natural numbers.

Whole Numbers

The numbers 0, 1, 2, \cdots are called whole numbers. All whole numbers except 0 are natural numbers.

Number Line

Whole numbers can be represented by points on a line called the number line. To construct one, draw a straight line, choose a point on it, and label it 0. Starting at 0, mark off equal intervals of any suitable length and label the marked points 1, 2, \cdots as shown in the figure below. The figure actually shows the real number line, since it also includes negative numbers (the numbers to the left of 0).

[Figure: the real number line] The arrow at each end of the line indicates that the numbers continue indefinitely in the same way.

A whole number can be even or odd. An even number is one that can be divided by 2 without leaving any remainder; the numbers 0, 2, 4, 6, 8, \cdots are all even. An odd number is one that cannot be divided by 2 without leaving a remainder; the numbers 1, 3, 5, 7, 9, \cdots are all odd.

It is interesting to know that any two numbers can be added in any order without affecting the result. For example, 3+5 = 5+3. This is called the commutative law of addition. Similarly, the order of grouping the numbers does not affect the result. For example, 2+3+5=(2+3)+5 = 2+(3+5). This is called the associative law of addition. Subtraction and division of numbers are not commutative, since 5-7\ne7-5 and 6\div2 \ne 2\div 6 in general.

Like addition, multiplication of whole numbers is commutative, for example, 2\times 8 = 8 \times 2; this is the commutative law of multiplication. Multiplication is likewise associative, for example, 2 \times (3 \times 4) = (2 \times 3) \times 4. Finally, multiplication is distributive over addition and subtraction, for example, (i) 5\times (6 + 7) = (5 \times 6) + (5 \times 7) and (6+7) \times 5=(6 \times 5)+(7 \times 5); (ii) 3 \times (6-2) = (3 \times 6) - (3 \times 2) and (6-2) \times 3 = (6 \times 3) - (2 \times 3).

Here is a curiosity: take any two-digit number whose digits differ, say 57, and reverse the digits to obtain 75. Subtract the smaller number from the bigger one: 75-57=18. Now reverse the digits of 18 and add the reverse (81) to it: 18+81=99. You will always get 99.
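A short Python check confirms that the trick yields 99 for every two-digit number whose digits differ (one-digit differences such as 9 are treated as "09" when reversed):

```python
def reverse_two_digits(n):
    # Reverse n written with two digits, e.g. 57 -> 75, 9 ("09") -> 90
    return int(str(n).zfill(2)[::-1])

def trick(n):
    diff = abs(n - reverse_two_digits(n))    # e.g. 75 - 57 = 18
    return diff + reverse_two_digits(diff)   # e.g. 18 + 81 = 99

# Holds for every two-digit number whose digits differ
results = {trick(n) for n in range(10, 100) if n % 10 != n // 10}
print(results)  # {99}
```

The reason: the difference of a two-digit number and its reverse is always 9 times the difference of the digits, and every such multiple of 9 plus its two-digit reverse is 99.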

The Z-score

The Z-score, also referred to as the standardized raw score, is a useful statistic because it not only permits computing the probability (chance or likelihood) of a raw score occurring within a normal distribution, but also helps compare two raw scores from different normal distributions. The Z-score is a dimensionless measure, since it is derived by subtracting the population mean from an individual raw score and then dividing this difference by the population standard deviation. This computational procedure is called standardizing the raw score, and it is often used in the Z-test of hypothesis testing.

Any raw score can be converted to a Z-score by

Z=\frac{\text{raw score} - \text{mean}}{\sigma}

Example 1:

If the mean = 100 and the standard deviation = 10, what would be the Z-scores of the following raw scores?

Raw Score Z-Score
90 \frac{90-100}{10}=-1
110 \frac{110-100}{10}=1
70 \frac{70-100}{10}=-3
100 \frac{100-100}{10}=0

Note that:

  • If the Z-score is zero, the raw score equals the population mean.
  • If the Z-score is positive, the raw score is above the population mean.
  • If the Z-score is negative, the raw score is below the population mean.
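The conversion in Example 1 can be sketched in Python with a small helper function (mean 100, standard deviation 10, as above):

```python
def z_score(x, mu, sigma):
    """Standardize a raw score: subtract the mean, divide by the standard deviation."""
    return (x - mu) / sigma

raw_scores = [90, 110, 70, 100]
z = [z_score(x, mu=100, sigma=10) for x in raw_scores]
print(z)  # [-1.0, 1.0, -3.0, 0.0]
```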

Example 2:

Suppose you scored 80 marks on one exam of a class and 70 marks on another exam of the same class, and you want to know on which exam you performed better. Suppose also that the mean and standard deviation of exam 1 are 90 and 10, and of exam 2 are 60 and 5, respectively. Converting both exam marks (raw scores) into standard scores (Z-scores), we get

Z_1=\frac{80-90}{10} = -1

The result (Z_1=-1) shows that 80 marks is one standard deviation below the mean of the first exam.

Z_2=\frac{70-60}{5} = 2

The result (Z_2=2) shows that 70 marks is two standard deviations above the mean of the second exam.

Comparing Z_1 and Z_2 shows that the student performed better on the second exam. Another way to interpret Z_1=-1 is that, assuming normality, about 34.13% of the scores lie between 80 marks and the first exam's mean (so only about 15.87% of students scored below 80). Similarly, Z_2=2 implies that about 47.72% of the scores lie between the second exam's mean and 70 marks.
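These areas under the normal curve follow from the standard normal CDF, which can be computed in plain Python via math.erf. This sketch verifies the areas between the mean and Z = -1 and Z = 2:

```python
import math

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Area between the raw score and the mean
area_z1 = phi(0) - phi(-1)   # approx 0.3413 (34.13%)
area_z2 = phi(2) - phi(0)    # approx 0.4772 (47.72%)

print(round(area_z1, 4), round(area_z2, 4))  # 0.3413 0.4772
```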

Multicollinearity in Linear Regression Models

The objective of multiple regression analysis is to approximate the relationship between a dependent variable and individual regressors, that is, a relationship of dependency rather than interdependency. It is assumed that the dependent variable y and the regressors X's are linearly related to each other (Graybill, 1980; Johnston, 1963; Malinvaud, 1968). The inferences drawn from a regression model therefore serve to

(i) identify the relative influence of the regressors,
(ii) predict and/or estimate, and
(iii) select an appropriate set of regressors for the model.

Among these, one purpose of a regression model is to ascertain to what extent the dependent variable can be predicted by the regressors in the model. To draw suitable inferences, however, the regressors should be orthogonal, i.e., there should be no linear dependencies among them. In most applications of regression analysis the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear. This condition of non-orthogonality is referred to as the problem of multicollinearity or collinear data (for example, see Gunst and Mason, 1977; Mason et al., 1975; Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the X'X matrix.

The presence of interdependence, or the lack of independence, is signified by high inter-correlations (R=X'X) within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme: it can easily be detected and resolved by dropping the regressor(s) causing it (Belsley et al., 1980). Under perfect multicollinearity the regression coefficients remain indeterminate and their standard errors are infinite; perfectly collinear regressors also destroy the uniqueness of the least squares estimators (Belsley et al., 1980; Belsley, 1991). When explanatory variables (regressors/predictors) are highly but not perfectly collinear, it becomes very difficult to infer the separate influence of each collinear regressor on the response variable y; estimation of the regression coefficients becomes difficult because each coefficient measures the effect of the corresponding regressor while holding all other regressors constant. Near (non-perfect) multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006), as it is not a specification or modeling error but a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and its associated statistics, such as R^2, the F-ratio, and the p-value; nor does multicollinearity lessen the predictive power or reliability of the regression model as a whole, as it only affects the estimates for individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to linear relationships among the regressors; it does not rule out nonlinear relationships among them.

To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested as an initial step when examining a data set for multiple regression analysis. High collinearity is rare, but some degree of collinearity always exists.

A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship among regressors, while collinearity refers to the existence of a single linear relationship. Nowadays, however, multicollinearity is used for both cases.

There are many methods for the detection/testing of (multi)collinearity among regressors. However, remedies based on these methods can destroy the usefulness of the model, since relevant regressor(s) may be removed. Note that if there are only two regressors, the pairwise correlation is sufficient to detect collinearity. To check the severity of the collinearity problem, the VIF/TOL, eigenvalues, or other diagnostic measures can be used.
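As a sketch of the two-regressor case mentioned above (with made-up data), the pairwise correlation gives the variance inflation factor directly as VIF = 1/(1 - r^2) = 1/TOL; a VIF above 10 is a common rule-of-thumb signal of severe collinearity:

```python
import math

def pearson_r(x, y):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical regressors: x2 is nearly an exact multiple of x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.9, 14.2, 16.0, 18.1, 19.9]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)   # VIF = 1/TOL for the two-regressor case
print(round(r, 4), round(vif, 1))
```

With more than two regressors, each VIF is computed from the R^2 of regressing that regressor on all the others; packages such as the mctest R package cited below automate these diagnostics.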

For further detail see

  • Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. Chap. 3.
  • Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33–50.
  • Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
  • Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207–211.
  • Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
  • Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249–260.
  • Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
  • Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
  • Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
  • Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
  • Malinvaud, E. (1968). Statistical Methods of Econometrics. North Holland, Amsterdam. pp. 187–192.
  • Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277–292.
  • Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Instituut. Publ. No. 5.

Levels of Measurement (Scale of Measure)

Levels of measurement (scales of measure) are classified into four categories. It is important to understand these levels, since they play an important part in determining the arithmetic operations and statistical tests that can be carried out on the data. A scale of measure is a classification that describes the nature of the information within the numbers assigned to a variable. In simple words, the level of measurement determines how data should be summarized and presented, and it indicates the types of statistical analysis that can be performed. The four levels of measurement are described below:

1) Nominal Level of Measurement (Nominal Scale)

At the nominal level of measurement, numbers are used merely to classify the data into mutually exclusive (unordered) categories. In other words, at the nominal level, observations of a qualitative variable are measured and recorded as labels or names.

2) Ordinal Level of Measurement (Ordinal Scale)

At the ordinal level of measurement, numbers are used to classify the data into mutually exclusive ordered categories; however, the scale does not convey the relative degree of difference between them. In other words, at the ordinal level, observations of a qualitative variable are ranked or rated on a relative scale and recorded as labels or names.

3) Interval Level of Measurement (Interval Scale)

For data recorded at the interval level of measurement, the interval or the distance between values is meaningful. The interval scale is based on a scale with a known unit of measurement.

4) Ratio Level of Measurement (Ratio Scale)

Data recorded at the ratio level of measurement are based on a scale with a known unit of measurement and a meaningful interpretation of zero on the scale. Almost all quantitative variables are recorded at the ratio level of measurement.

Examples of Levels of Measurement

Examples of Nominal Level of Measurement

  • Religion (Muslim, Hindu, Christian, Buddhist)
  • Race (Hispanic, African, Asian)
  • Language (Urdu, English, French, Punjabi, Arabic)
  • Gender (Male, Female)
  • Marital Status (Married, Single, Divorced)
  • Number plates on Cars/ Models of Cars (Toyota, Mehran)
  • Parts of Speech (Noun, Verb, Article, Pronoun)

Examples of Ordinal Level of Measurement

  • Rankings (1st, 2nd, 3rd)
  • Marks Grades (A, B, C, D)
  • Evaluation such as High, Medium, Low
  • Educational level (Elementary School, High School, College, University)
  • Movie Ratings (1 star, 2 stars, 3 stars, 4 stars, 5 stars)
  • Pain Ratings (more, less, no)
  • Cancer Stages (Stage 1, Stage 2, Stage 3)
  • Hypertension Categories (Mild, Moderate, Severe)

Examples of Interval Level of Measurement

  • Temperature with Celsius scale/ Fahrenheit scale
  • Level of happiness rated from 1 to 10
  • Education (in years)
  • Standardized tests in the psychological, sociological, and educational disciplines use interval scales.
  • SAT scores

Examples of Ratio Level of Measurement

  • Height
  • Weight
  • Age
  • Length
  • Volume
  • Number of home computers
  • Salary

For further details visit: Levels of Measurement

Variance: A Measure of Dispersion

Variance is a measure of the dispersion of the distribution of a random variable. The term variance was introduced by R. A. Fisher in 1918. The variance of a set of observations (data set) is defined as the mean of the squares of the deviations of all observations from their mean. When computed for an entire population, it is called the population variance, usually denoted by \sigma^2; for sample data, it is called the sample variance and denoted by S^2 to distinguish it from the population variance. Variance is also denoted by Var(X) when we speak of the variance of a random variable. The symbolic definitions of the population and sample variance are

\sigma^2=\frac{\sum (X_i - \mu)^2}{N}; \quad \text{for population data}

S^2=\frac{\sum (X_i - \overline{X})^2}{n-1}; \quad \text{for sample data}

It should be noted that the variance is expressed in the square of the units in which the observations are measured, and it can be a large number compared to the observations themselves. Because of some nice mathematical properties, the variance assumes an extremely important role in statistical theory.

The variance can also be computed from the standard deviation, since the variance is the square of the standard deviation, i.e., \text{Variance} = (\text{Standard Deviation})^2.

Variance can be used to compare the dispersion in two or more sets of observations. Variance can never be negative, since every term in the variance sum is a squared quantity, either positive or zero.
To calculate the standard deviation, follow these steps:

  1. Find the mean of the data.
  2. Take the difference of each observation from the mean. The sum of these differences should be zero (or near zero, due to rounding of numbers).
  3. Square the values obtained in step 2; each squared value is greater than or equal to zero.
  4. Sum all the squared quantities obtained in step 3. Call this the sum of squared differences.
  5. Divide the sum of squared differences by the total number of observations to obtain the population variance (\sigma^2); for the sample variance (S^2), divide by the number of observations minus one, i.e., the degrees of freedom.
  6. Take the square root of the quantity obtained in step 5. The result is the standard deviation (\sigma or S) of the given data set.
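The steps above can be sketched in Python using a small made-up data set:

```python
import math

data = [4, 6, 8, 10]

# Step 1: mean
mean = sum(data) / len(data)                 # 7.0

# Step 2: deviations from the mean (they sum to zero)
deviations = [x - mean for x in data]

# Steps 3-4: squared deviations and their sum
sum_sq = sum(d ** 2 for d in deviations)     # 9 + 1 + 1 + 9 = 20

# Step 5: divide by N for the population variance, by n-1 for the sample variance
pop_variance = sum_sq / len(data)            # 5.0
sample_variance = sum_sq / (len(data) - 1)   # 20/3

# Step 6: the square root gives the standard deviation
pop_std = math.sqrt(pop_variance)

print(pop_variance, round(sample_variance, 4), round(pop_std, 4))
```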

The major characteristics of the variance are:
a)    All of the observations are used in the calculation.
b)    The variance can be unduly influenced by extreme observations, since the deviations are squared.
c)    The variance is not in the same units as the observations; it is expressed in the square of the units in which the observations are measured.

Qualitative and Quantitative Research

Research can be classified into two groups: qualitative and quantitative research.

  1. Qualitative Research
    Qualitative research involves collecting data from in-depth interviews, observations, field notes, and open-ended questions in questionnaires, etc. The researcher himself is the primary data-collection instrument, and the data could be collected in the form of words, images, and patterns. For qualitative research, data analysis involves searching for patterns, themes, and holistic features. Results of such research are likely to be context-specific, and reporting takes the form of a narrative with contextual descriptions and direct quotations.
  2. Quantitative Research
    Quantitative research involves collecting quantitative data based on precise measurement using structured, reliable, and validated collection instruments (questionnaires) or through archival data sources. Quantitative data take the form of variables, and their analysis involves establishing statistical relationships. If properly done, the results of such research are generalizable to the entire population. Quantitative research can be classified into two groups depending on the data-collection methodology:

    1. Experimental Research
      The main purpose of experimental research is to establish a cause-and-effect relationship. The defining characteristics of experimental research are the active manipulation of independent variables and the random assignment of participants to the conditions being manipulated; everything else should be kept as similar and as constant as possible. The way experiments are conducted is described by the design of experiment. There are two main types of experimental design.

      • Within-Subjects Design
        In a within-subjects design, the same group of subjects serves in more than one treatment.
      • Between-Subjects Design
        In a between-subjects (between-group) design, two or more groups of subjects are each tested under a different condition simultaneously.
    2. Non-Experimental Research
      Non-experimental research is commonly used in sociology, political science, and management disciplines. This kind of research is often done with the help of a survey. There is no random assignment of participants to a particular group, nor do we manipulate the independent variables. As a result, one cannot establish a cause-and-effect relationship through non-experimental research. There are two approaches to analyzing such data:

      1. Tests for significant differences between groups, such as comparing the IQ levels of participants from different ethnic backgrounds.
      2. Tests for significant association between two factors such as firm sales and advertising expenditure.

Chi-Square Test of Independence

The chi-square test is a non-parametric test: the assumption of a normal distribution in the population is not required. The chi-square technique can be used to find the association (dependence) between sets of two or more categorical variables by comparing how close the observed frequencies are to the expected frequencies. In other words, the chi-square (\chi^2) statistic is used to investigate whether the distributions of categorical variables differ from one another. Note that the responses on the categorical variables should be independent of each other. We use the chi-square test to examine the relationship between two nominal-scaled variables.

The chi-square statistic is used both in tests of goodness of fit and in tests of independence. In a test of goodness of fit, we check whether the observed frequency distribution differs from a theoretical distribution, while in a test of independence we assess whether paired observations on two variables, summarized in a contingency table, are independent of each other.

Example: A social scientist sampled 140 people and classified them according to income level and whether or not they played a state lottery in the last month. The sample information is reported below. Is it reasonable to conclude that playing the lottery is related to income level? Use the 0.05 significance level.

              Low  Middle  High  Total
Played         46      28    21     95
Did not play   14      12    19     45
Total          60      40    40    140

The step-by-step procedure for testing the hypothesis of association between these two variables is described below.

Step 1: State the Hypotheses
H_0: There is no relationship between income and whether the person played the lottery.
H_1: There is a relationship between income and whether the person played the lottery.

Step 2: Level of Significance: \alpha = 0.05

Step 3: Test statistics (calculations)

Observed Frequencies (f_o)   Expected Frequencies (f_e)       \frac{(f_o - f_e)^2}{f_e}
46                           95\times 60/140 = 40.71          \frac{(46-40.71)^2}{40.71}
28                           95\times 40/140 = 27.14          \frac{(28-27.14)^2}{27.14}
21                           95\times 40/140 = 27.14          \frac{(21-27.14)^2}{27.14}
14                           45\times 60/140 = 19.29          \frac{(14-19.29)^2}{19.29}
12                           45\times 40/140 = 12.86          \frac{(12-12.86)^2}{12.86}
19                           45\times 40/140 = 12.86          \frac{(19-12.86)^2}{12.86}

\chi^2=\sum\left[\frac{(f_o-f_e)^2}{f_e}\right] = 6.544

Step 4: Critical Region:
The tabulated chi-square value at the 0.05 level of significance with (r-1) \times (c-1)=(2-1)\times(3-1)=2 degrees of freedom is 5.991.

Step 5: Decision
Since the calculated chi-square value (6.544) is greater than the tabulated value (5.991), we reject H_0; there is a relationship between income level and playing the lottery.

Note that there are several variants of the chi-square test (such as Yates' correction, the likelihood-ratio test, and the portmanteau test in time series), which depend on the way the data were collected and on the hypothesis being tested.
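The calculation in the example above can be sketched in plain Python: each expected frequency is (row total \times column total) / grand total, and the statistic is the usual sum over cells:

```python
observed = [[46, 28, 21],   # Played
            [14, 12, 19]]   # Did not play

row_totals = [sum(row) for row in observed]                       # [95, 45]
col_totals = [sum(row[j] for row in observed) for j in range(3)]  # [60, 40, 40]
grand = sum(row_totals)                                           # 140

chi2 = 0.0
for i, row in enumerate(observed):
    for j, f_o in enumerate(row):
        f_e = row_totals[i] * col_totals[j] / grand  # expected frequency
        chi2 += (f_o - f_e) ** 2 / f_e

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1) = 2
print(round(chi2, 3), df)  # 6.544 2
```

Comparing 6.544 with the tabulated value 5.991 at 2 degrees of freedom reproduces the decision in Step 5.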

Mean: Measure of Central Tendency

The mean (also known as the average or arithmetic mean) is a measure of central tendency used to describe a data set by a single number (value) representing the middle (center) of the data, that is, the average measure (performance, behaviour, etc.) of the data. A measure of central tendency is also known as a measure of central location or measure of center.

Mathematically, the mean is defined as the sum of all the values in a given data set divided by the number of observations in that data set.

Example: Consider the following data set consisting of the marks of 15 students in a certain examination.

50, 55, 65, 43, 78, 20, 100, 5, 90, 23, 40, 56, 70, 88, 30

The mean of the above data values is computed by adding all these values, which gives 813, and then dividing by the number of observations added (15), which equals 54.2 marks; that is,

\frac{50 + 55 + 65 + 43 + 78 + 20 + 100 + 5 + 90 + 23 + 40 + 56 + 70 + 88 + 30 }{15}=\frac{813}{15}=54.2

The above procedure of calculating the mean can be represented mathematically

\mu= \frac{\sum_{i=1}^N X_i}{N}

The Greek symbol \mu (pronounced "mu") represents the population mean in statistics, and N is the number of observations in the population data set.

The above formula is known as the population mean, as it is computed for the whole population. The sample mean is computed in the same manner; only the notation differs, that is,

\overline{X}= \frac{\sum_{i=1}^n X_i}{n} .

Here \overline{X} represents the sample mean and n is the number of observations in the sample.

The mean is used for numeric data only. Statistically, the data for calculating a mean should be quantitative (variables measured on either the ratio or interval scale); the numbers in the data set can be continuous and/or discrete in nature.

Note that the mean should not be computed for alphabetic or categorical data (data on the nominal or ordinal scale). The mean is influenced by extreme values in the data, i.e., very large or very small values can change the mean drastically.
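The population-mean formula above is a one-liner in Python; applied to the 15 marks it reproduces the 54.2 computed earlier:

```python
marks = [50, 55, 65, 43, 78, 20, 100, 5, 90, 23, 40, 56, 70, 88, 30]

# Sum of all values divided by the number of observations
mean = sum(marks) / len(marks)
print(sum(marks), mean)  # 813 54.2
```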

For other measures of central tendency visit: Measures of Central Tendencies
