Author Archive
A bivariate relationship is defined by the joint distribution of the two associated random variables.
Contingency Tables
Let $X$ and $Y$ be two categorical response variables. Let variable $X$ have $I$ levels and variable $Y$ have $J$ levels. The number of possible combinations of classifications for both variables is $I \times J$. The response $(X, Y)$ of a subject randomly chosen from some population has a probability distribution, which can be shown in a rectangular table having $I$ rows (for the categories of $X$) and $J$ columns (for the categories of $Y$). The cells of this rectangular table represent the $IJ$ possible outcomes. Their probability (say $\pi_{ij}$) denotes the probability that $(X, Y)$ falls in the cell in row $i$ and column $j$. When these cells contain frequency counts of outcomes, the table is called a contingency or cross-classification table, and it is referred to as an $I$ by $J$ ($I \times J$) table.
The probability distribution $\{\pi_{ij}\}$ is the joint distribution of $X$ and $Y$. The marginal distributions are the row and column totals obtained by summing the joint probabilities. For the row variable ($X$) the marginal probability is denoted by $\pi_{i+}$, and for the column variable ($Y$) it is denoted by $\pi_{+j}$, where the subscript “+” denotes the sum over the index it replaces; that is, $\pi_{i+} = \sum_{j} \pi_{ij}$ and $\pi_{+j} = \sum_{i} \pi_{ij}$, satisfying

$$\sum_{i} \pi_{i+} = \sum_{j} \pi_{+j} = \sum_{i} \sum_{j} \pi_{ij} = 1.$$

Note that the marginal distributions are single-variable information and do not pertain to association linkages between the variables.
In many contingency tables, one variable (say, $Y$) is a response and the other ($X$) is an explanatory variable. When $X$ is fixed rather than random, the notion of a joint distribution for $X$ and $Y$ is no longer meaningful. However, for a fixed level of $X$, the variable $Y$ has a probability distribution. It is germane to study how this probability distribution of $Y$ changes as the level of $X$ changes.
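As a rough illustration of the joint, marginal, and conditional distributions described above, here is a minimal Python sketch. The 2 x 3 table of counts and all variable names are hypothetical and chosen only for illustration; the joint probabilities are estimated as cell counts divided by the grand total.

```python
# Hypothetical 2 x 3 contingency table of counts (rows: levels of X, columns: levels of Y).
counts = [
    [20, 30, 10],
    [10, 20, 10],
]

n = sum(sum(row) for row in counts)  # grand total of the table

# Estimated joint distribution: pi_ij = n_ij / n
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: pi_i+ sums over columns, pi_+j sums over rows
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Conditional distribution of Y at each fixed level of X: pi_ij / pi_i+
conditional = [
    [p / row_marginal[i] for p in row]
    for i, row in enumerate(joint)
]

print("row marginals:", row_marginal)
print("column marginals:", col_marginal)
print("conditional distributions of Y given X:", conditional)
```

Comparing the rows of `conditional` shows how the distribution of the response changes from one level of the explanatory variable to another.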
Matrices are everywhere. If you have used a spreadsheet program such as MS Excel or Lotus, written a table (for example in MS Word), or used mathematical or statistical software such as Mathematica, Matlab, Minitab, SAS, SPSS, or EViews, you have used a matrix.
Matrices make the presentation of numbers clearer and make calculations easier to program. For example, the matrix below gives the sales of tires in a particular store by quarter and by make of tire.

| Make | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Tirestone | 21 | 20 | 3 | 2 |
| Michigan | 5 | 11 | 15 | 24 |
| Copper | 6 | 14 | 7 | 28 |
This arrangement is called a matrix because the information is stored in a particular order and different computations can be performed on it. For example, if you want to know how many Michigan tires were sold in Quarter 3, you can go along the row ‘Michigan’ and the column ‘Q3’ and find that it is 15.
Similarly, the total sales of ‘Michigan’ tires can be found by adding all the elements from Q1 to Q4 in the Michigan row. It sums to 55. So, a matrix is a rectangular array of elements. The elements of a matrix can be symbolic expressions or numbers. Matrix $[A]$ is denoted by

$$[A] = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

Row $i$ of the matrix $[A]$ has $n$ elements and is $[a_{i1}, a_{i2}, \ldots, a_{in}]$, and column $j$ of $[A]$ has $m$ elements and is $[a_{1j}, a_{2j}, \ldots, a_{mj}]^T$.
The size (order) of any matrix is defined by the number of rows and columns in the matrix. If a matrix $[A]$ has $m$ rows and $n$ columns, the size of the matrix is denoted by $(m \times n)$. The matrix $[A]$ can also be denoted by $[A]_{m \times n}$ to show that $[A]$ is a matrix with $m$ rows and $n$ columns.
Each entry in the matrix is called an element or entry of the matrix and is denoted by $a_{ij}$, where $i$ is the row number and $j$ is the column number of the element.
The information about the sales and makes of tires arranged above can be denoted by the matrix $[A]$, that is,

$$[A] = \begin{bmatrix} 21 & 20 & 3 & 2 \\ 5 & 11 & 15 & 24 \\ 6 & 14 & 7 & 28 \end{bmatrix}$$

This matrix has 3 rows and 4 columns, so the order (size) of the matrix is $3 \times 4$. Note that the element $a_{23}$ indicates the sales of ‘Michigan’ tires in quarter 3 (Q3).
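The lookups described above can be sketched in a few lines of Python. This is only an illustration using nested lists; the variable names are assumptions, and Python indexes from 0, so the element $a_{23}$ is `A[1][2]`.

```python
# Tire sales matrix: rows are makes, columns are quarters Q1..Q4.
makes = ["Tirestone", "Michigan", "Copper"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
A = [
    [21, 20, 3, 2],
    [5, 11, 15, 24],
    [6, 14, 7, 28],
]

# Order (size) of the matrix: m rows by n columns
m, n = len(A), len(A[0])

# Element a_23 (row 2, column 3): Michigan sales in Q3
a_23 = A[makes.index("Michigan")][quarters.index("Q3")]

# Row total: all Michigan sales from Q1 to Q4
michigan_total = sum(A[makes.index("Michigan")])

# A column of the matrix: Q3 sales for every make
q3_column = [row[quarters.index("Q3")] for row in A]

print(f"order: {m} x {n}, a_23 = {a_23}, Michigan total = {michigan_total}")
```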
In early civilizations, people kept track of the number of animals (sheep, goats, camels, etc.) or children they had by different methods, such as matching the number of animals with a number of stones. Similarly, they counted their children with knots tied on a string or with marks on a piece of wood, leather, or a wall. As human societies developed, other uses for numerals were found, and this led to the invention of the number system.
Natural Numbers
Natural numbers are used to count the number of subjects or objects. Natural numbers are also called counting numbers. The numbers $1, 2, 3, \ldots$ are all natural numbers.
Whole Numbers
The numbers $0, 1, 2, 3, \ldots$ are called whole numbers. It can be observed that all whole numbers except 0 are natural numbers.
Number Line
Whole numbers can be represented by points on a line called the number line. For this purpose, a straight line is drawn and a point is chosen on the line and labeled 0. Starting with 0, mark off equal intervals of any suitable length. The marked points are labeled $1, 2, 3, \ldots$ as shown in the figure. The figure also extends to the left of 0; the numbers on the left of 0 in this diagram are called negative numbers, and they are not whole numbers.
The arrow at the extreme end of the line (the right-hand side in the case of whole numbers, the left-hand side in the case of negative numbers) indicates that the list of numbers continues in the same way indefinitely.
A whole number can be even or odd. An even number is a number that can be divided by 2 without leaving any remainder. The numbers $0, 2, 4, 6, \ldots$ are all even numbers. An odd number is a number that cannot be divided by 2 without leaving a remainder. The numbers $1, 3, 5, 7, \ldots$ are all odd numbers.
It is interesting to know that any two numbers can be added in any order without affecting the result. For example, $3 + 5 = 5 + 3$. This is called the commutative law of addition. Similarly, the order of grouping the numbers does not affect the result. For example, $(2 + 3) + 4 = 2 + (3 + 4)$. This is called the associative law of addition. Subtraction and division are not commutative, as $a - b \ne b - a$ and $a \div b \ne b \div a$ in general.
Like addition, multiplication of whole numbers follows the commutative law, called the commutative law of multiplication; for example, $3 \times 5 = 5 \times 3$. Multiplication of whole numbers also follows the associative law; for example, $(2 \times 3) \times 4 = 2 \times (3 \times 4)$. Similarly, multiplication is distributive over addition and subtraction, for example, (i) $a(b + c) = ab + ac$ or $(a + b)c = ac + bc$, and (ii) $a(b - c) = ab - ac$ or $(a - b)c = ac - bc$.
Take any two-digit number, say 57, and reverse the digits to obtain 75. Now subtract the smaller number from the bigger number: $75 - 57 = 18$. Now reverse the digits of 18 and add 18 to its reverse (81), that is, $18 + 81$; you will get 99.
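The laws and the digit-reversal trick above can be checked with a short Python sketch. The function names are hypothetical. One assumption is made explicit here: a single-digit difference such as 9 is treated as the two-digit string "09", so its reverse is 90, and numbers with two equal digits (11, 22, ...) are excluded because their difference is 0.

```python
def reverse_two_digit(n):
    """Reverse the digits of n (0..99), treating it as a two-digit string, e.g. 9 -> '09' -> 90."""
    return int(f"{n:02d}"[::-1])


def digit_trick(n):
    """Subtract the smaller of n and its reverse from the bigger, then add the difference to its own reverse."""
    difference = abs(n - reverse_two_digit(n))
    return difference + reverse_two_digit(difference)


# Commutative, associative, and distributive laws for a few arbitrary whole numbers
a, b, c = 4, 7, 9
laws_hold = (
    a + b == b + a
    and (a + b) + c == a + (b + c)
    and a * b == b * a
    and (a * b) * c == a * (b * c)
    and a * (b + c) == a * b + a * c
)
subtraction_commutes = (a - b == b - a)  # False for these values

# Every two-digit number with two different digits ends at 99
results = {digit_trick(n) for n in range(10, 100) if n % 11 != 0}
print(digit_trick(57), results, laws_hold, subtraction_commutes)
```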
The Z-Score
The Z-score, also referred to as the standardized raw score, is a useful statistic because it not only permits computing the probability (chance or likelihood) of a raw score occurring within a normal distribution, but also helps to compare two raw scores from different normal distributions. The Z-score is a dimensionless measure, since it is derived by subtracting the population mean from an individual raw score and then dividing this difference by the population standard deviation. This computational procedure is called standardizing the raw score, and it is often used in the Z-test in hypothesis testing.
Any raw score can be converted to a Z-score by

$$Z = \frac{X - \mu}{\sigma}$$

where $X$ is the raw score, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.
Example 1:
If the mean = 100 and the standard deviation = 10, what would be the Z-scores of the following raw scores?

| Raw Score | Z-Score |
|---|---|
| 90 | (90 − 100)/10 = −1 |
| 110 | (110 − 100)/10 = 1 |
| 70 | (70 − 100)/10 = −3 |
| 100 | (100 − 100)/10 = 0 |
Note that:
 If the Z-score is zero, the raw score is equal to the population mean.
 If the Z-score is positive, the raw score is above the population mean.
 If the Z-score is negative, the raw score is below the population mean.
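The conversion in Example 1 can be sketched in Python as follows. The function name `z_score` is hypothetical; the mean of 100 and standard deviation of 10 are the values given in the example.

```python
def z_score(raw, mean, sd):
    """Standardize a raw score: subtract the mean, then divide by the standard deviation."""
    return (raw - mean) / sd


mean, sd = 100, 10
raw_scores = [90, 110, 70, 100]
z_scores = [z_score(x, mean, sd) for x in raw_scores]

for x, z in zip(raw_scores, z_scores):
    if z > 0:
        position = "above the mean"
    elif z < 0:
        position = "below the mean"
    else:
        position = "equal to the mean"
    print(f"raw score {x}: Z = {z:+.1f} ({position})")
```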
Example 2:
Suppose you got 80 marks in one exam of a class and 70 marks in another exam of that class. You are interested in finding out in which exam you performed better. Also suppose that the mean and standard deviation of exam 1 are 90 and 10, and those of exam 2 are 60 and 5, respectively. Converting both exam marks (raw scores) into standard scores (Z-scores), we get

$$Z_1 = \frac{80 - 90}{10} = -1$$

The Z-score result ($Z_1 = -1$) shows that 80 marks are one standard deviation below the class mean.

$$Z_2 = \frac{70 - 60}{5} = 2$$

The Z-score result ($Z_2 = 2$) shows that 70 marks are two standard deviations above the class mean.
Comparing $Z_1$ and $Z_2$ shows that the student performed better in the second exam than in the first. Another way to interpret the Z-score of −1 is that, if the marks are normally distributed, about 34.13% of the students got marks between 80 and the class average. Similarly, the Z-score of 2 implies that about 47.72% of the students got marks between the class average and 70.
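The areas under the standard normal curve used in this interpretation can be checked with the `statistics.NormalDist` class from the Python standard library. This is a sketch that assumes the marks in both exams are normally distributed; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

z1 = (80 - 90) / 10  # exam 1
z2 = (70 - 60) / 5   # exam 2

# Area between a Z-score and the mean (Z = 0)
area_z1_to_mean = std_normal.cdf(0) - std_normal.cdf(z1)
area_mean_to_z2 = std_normal.cdf(z2) - std_normal.cdf(0)

# Share of students scoring below each mark
below_exam1 = std_normal.cdf(z1)
below_exam2 = std_normal.cdf(z2)

print(f"Z1 = {z1}, Z2 = {z2}")
print(f"area between Z1 and the mean: {area_z1_to_mean:.4f}")
print(f"area between the mean and Z2: {area_mean_to_z2:.4f}")
print(f"share below 80 in exam 1: {below_exam1:.4f}; share below 70 in exam 2: {below_exam2:.4f}")
```

The larger Z-score, and hence the larger share of students scoring below the mark, belongs to the second exam.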
Multicollinearity in Linear Regression Models
The objective of multiple regression analysis is to approximate the relationship of individual parameters of a dependency, but not of an interdependency. It is assumed that the dependent variable $y$ and the regressors $X$'s are linearly related to each other (Graybill, 1980; Johnston, 1963; Malinvaud, 1968). Therefore, the inferences drawn from any regression model are to
(i) identify the relative influence of the regressors,
(ii) predict and/or estimate, and
(iii) select an appropriate set of regressors for the model.
Among these inferences, one purpose of the regression model is to ascertain to what extent the dependent variable can be predicted by the regressors in the model. However, to draw suitable inferences, the regressors should be orthogonal, i.e., there should be no linear dependencies among the regressors. In most applications of regression analysis the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear with each other. The condition of non-orthogonality is also referred to as the problem of multicollinearity or collinear data (for example, see Gunst and Mason, 1977; Mason et al., 1975; Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the $X'X$ matrix.
The presence of interdependence, or the lack of independence, is signified by high-order intercorrelation within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme, and it can easily be detected and resolved by dropping one of the regressors causing the multicollinearity (Belsley et al., 1980). In the case of perfect multicollinearity, the regression coefficients remain indeterminate and their standard errors are infinite. Similarly, perfectly collinear regressors destroy the uniqueness of the least squares estimators (Belsley et al., 1980; Belsley, 1991). Many explanatory variables (regressors/predictors) are highly collinear, making it very difficult to infer the separate influence of collinear regressors on the response variable ($y$); that is, estimation of the regression coefficients becomes difficult because each coefficient measures the effect of the corresponding regressor while holding all other regressors constant. Imperfect multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006), as it is not a specification or modeling error; rather, it is a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and associated statistics such as $R^2$, the $F$-ratio, and the $p$-value. Multicollinearity also does not lessen the predictive power or reliability of the regression model as a whole; it only affects the individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to linear relationships among the regressors; it does not rule out nonlinear relationships among them.
To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested when examining a data set, as an initial step in multiple regression analysis. High collinearity is rare, but some degree of collinearity always exists.
A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity usually refers to the existence of more than one exact linear relationship among the regressors, while collinearity refers to the existence of a single linear relationship. Nowadays, however, multicollinearity is used for both cases.
There are many methods for the detection/testing of (multi)collinearity among regressors. However, these methods can destroy the usefulness of the model, since relevant regressor(s) may be removed by them. Note that if there are only two predictors, pairwise correlation is sufficient to detect the collinearity problem. However, to check the severity of the collinearity problem, VIF/TOL, eigenvalues, or other diagnostic measures can be used.
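For the two-predictor case mentioned above, the pairwise correlation, tolerance (TOL), and variance inflation factor (VIF) can be sketched in plain Python. The data are hypothetical and deliberately nearly collinear. The sketch relies on the standard definitions TOL = 1 − R² and VIF = 1/TOL, where R² comes from regressing one regressor on the others; with only two regressors this R² equals the squared pairwise correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical regressors: x2 is roughly twice x1, so they are nearly collinear
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(x1, x2)
tol = 1 - r ** 2  # tolerance
vif = 1 / tol     # variance inflation factor

print(f"r = {r:.4f}, TOL = {tol:.4f}, VIF = {vif:.1f}")
```

A common rule of thumb treats a VIF above about 10 as a sign of serious collinearity; that threshold is a convention rather than a formal test.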
For further details see
 Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. Chap. 3.
 Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33-50.
 Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
 Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207-211.
 Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
 Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249-260.
 Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
 Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
 Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
 Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
 Malinvaud, E. (1968). Statistical Methods of Econometrics. Amsterdam, North Holland. pp. 187-192.
 Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277-292.
 Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Institutt. Publ. No. 5.
Levels of Measurement (Scale of Measure)
Levels of measurement (scales of measure) are classified into four categories. It is important to understand these levels of measurement, since they play an important part in determining the arithmetic and the statistical tests that can be carried out on the data. The scale of measure is a classification that describes the nature of the information within the numbers assigned to a variable. In simple words, the level of measurement determines how data should be summarized and presented. It also indicates the type of statistical analysis that can be performed. The four levels of measurement are described below:
1) Nominal Level of Measurement (Nominal Scale)
In nominal level of measurement, the numbers are used to classify the data (unordered group) into mutually exclusive categories. In other words, for nominal level of measurement, observations of a qualitative variable are measured and recorded as labels or names.
2) Ordinal Level of Measurement (Ordinal Scale)
In ordinal level of measurement, the numbers are used to classify the data (ordered group) into mutually exclusive categories. However, it does not allow for relative degree of difference between them. In other words, for ordinal level of measurement, observations of a qualitative variable are either ranked or rated on a relative scale and recorded as labels or names.
3) Interval Level of Measurement (Interval Scale)
For data recorded at the interval level of measurement, the interval or the distance between values is meaningful. The interval scale is based on a scale with a known unit of measurement.
4) Ratio Level of Measurement (Ratio Scale)
Data recorded at the ratio level of measurement are based on a scale with a known unit of measurement and a meaningful interpretation of zero on the scale. Almost all quantitative variables are recorded at the ratio level of measurement.
Examples of Levels of Measurement
Examples of Nominal Level of Measurement
 Religion (Muslim, Hindu, Christian, Buddhist)
 Race (Hispanic, African, Asian)
 Language (Urdu, English, French, Punjabi, Arabic)
 Gender (Male, Female)
 Marital Status (Married, Single, Divorced)
 Number plates on Cars/ Models of Cars (Toyota, Mehran)
 Parts of Speech (Noun, Verb, Article, Pronoun)
Examples of Ordinal Level of Measurement
 Rankings (1st, 2nd, 3rd)
 Marks Grades (A, B, C, D)
 Evaluation such as High, Medium, Low
 Educational level (Elementary School, High School, College, University)
 Movie Ratings (1 star, 2 stars, 3 stars, 4 stars, 5 stars)
 Pain Ratings (more, less, no)
 Cancer Stages (Stage 1, Stage 2, Stage 3)
 Hypertension Categories (Mild, Moderate, Severe)
Examples of Interval Level of Measurement
 Temperature with Celsius scale/ Fahrenheit scale
 Level of happiness rated from 1 to 10
 Education (in years)
 Standardized tests of psychological, sociological and educational discipline use interval scales.
 SAT scores
Examples of Ratio Level of Measurement
 Height
 Weight
 Age
 Length
 Volume
 Number of home computers
 Salary
For further details visit: Levels of Measurement
Variance is a measure of the dispersion of the distribution of a random variable. The term variance was introduced by R. A. Fisher in 1918. The variance of a set of observations (data set) is defined as the mean of the squares of the deviations of all the observations from their mean. When it is computed for an entire population, the variance is called the population variance, usually denoted by $\sigma^2$, while for sample data it is called the sample variance and denoted by $S^2$, in order to distinguish between the population variance and the sample variance. Variance is also denoted by $Var(X)$ when we speak about the variance of a random variable $X$. The symbolic definitions of the population and sample variance are

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$

It should be noted that the variance is expressed in the square of the units in which the observations are expressed, so it can be a large number compared to the observations themselves. Because of its nice mathematical properties, the variance assumes an extremely important role in statistical theory.
Variance can be computed if we have the standard deviation, as the variance is the square of the standard deviation, i.e. $\text{Variance} = (\text{Standard Deviation})^2$.
Variance can be used to compare the dispersion in two or more sets of observations. Variance can never be negative, since every term in the variance is a squared quantity, either positive or zero.
To calculate the standard deviation, one has to follow these steps:
 1) Find the mean of the data.
 2) Take the difference of each observation from the mean of the given data set. The sum of these differences should be zero, or near zero because of rounding.
 3) Square the values obtained in step 2; each should be greater than or equal to zero, i.e. a non-negative quantity.
 4) Sum all the squared quantities obtained in step 3. We call it the sum of squares of differences.
 5) Divide this sum of squares of differences by the total number of observations if the population standard deviation ($\sigma$) is required. For the sample standard deviation ($S$), divide the sum of squares of differences by the total number of observations minus one, i.e. the degrees of freedom. The quantity obtained at this step is the variance.
 6) Find the square root of the quantity obtained in step 5. The result is the standard deviation of the given data set.
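The steps above can be sketched in Python as follows. The data set is hypothetical and chosen so that the arithmetic is easy to follow; the result is cross-checked against the `statistics` module of the standard library.

```python
import statistics
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)

# Find the mean of the data
mean = sum(data) / n

# Differences of each observation from the mean (these sum to zero)
deviations = [x - mean for x in data]

# Square the differences and add them up: the sum of squares of differences
sum_of_squares = sum(d ** 2 for d in deviations)

# Divide by N for the population variance, or by n - 1 for the sample variance
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# The square root of the variance is the standard deviation
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

print(f"mean = {mean}, sum of squares = {sum_of_squares}")
print(f"population variance = {population_variance}, population SD = {population_sd}")
print(f"sample variance = {sample_variance:.4f}, sample SD = {sample_sd:.4f}")
```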
The major characteristics of the variance are:
a) All of the observations are used in the calculation.
b) The variance is strongly influenced by extreme observations, because the deviations from the mean are squared.
c) The variance is not in the same units as the observations; it is in the square of the units in which the observations are expressed.
Research can be classified into two groups: qualitative and quantitative research.
 Qualitative Research
Qualitative research involves collecting data from in-depth interviews, observations, field notes, open-ended questions in questionnaires, etc. The researcher is the primary data collection instrument, and the data may be collected in the form of words, images, patterns, etc. For qualitative research, data analysis involves searching for patterns, themes, and holistic features. The results of such research are likely to be context specific, and reporting takes the form of a narrative with contextual description and direct quotations from research participants.
 Quantitative Research
Quantitative research involves collecting quantitative data based on precise measurement, using structured, reliable, and validated collection instruments (questionnaires) or archival data sources. Quantitative data take the form of variables, and the data analysis involves establishing statistical relationships. If properly done, the results of such research are generalizable to the entire population. Quantitative research can be classified into two groups depending on the data collection methodology.
 Experimental Research
The main purpose of experimental research is to establish a cause-and-effect relationship. The defining characteristics of experimental research are the active manipulation of independent variables and the random assignment of participants to the conditions to be manipulated; everything else should be kept as similar and as constant as possible. The way experiments are conducted is described by the term design of experiment. There are two main types of experimental design.
 Within-Subject Design
In a within-subject design, the same group of subjects serves in more than one treatment.
 Between-Subjects Design
In a between-subjects design, two or more groups of subjects are each tested under a different treatment simultaneously.
 Non-Experimental Research
Non-experimental research is commonly used in sociology, political science, and management disciplines. This kind of research is often done with the help of a survey. There is no random assignment of participants to a particular group, nor are the independent variables manipulated. As a result, one cannot establish a cause-and-effect relationship through non-experimental research. There are two approaches to analyzing such data:
 Tests for significant differences across groups, such as the IQ levels of participants from different ethnic backgrounds.
 Tests for significant association between two factors, such as firm sales and advertising expenditure.
The chi-square test is a non-parametric test. The assumption of a normal distribution in the population is not required for this test. The chi-square technique can be used to find the association (dependency) between two or more categorical variables by comparing how close the observed frequencies are to the expected frequencies. In other words, a chi-square ($\chi^2$) statistic is used to investigate whether the distributions of categorical variables differ from one another. Note that the responses of the categorical variables should be independent of each other. We use the chi-square test for the relationship between two nominal-scaled variables.
The chi-square test is used both as a test of goodness of fit and as a test of independence. In a test of goodness of fit, we check whether or not the observed frequency distribution differs from a theoretical distribution, while in a test of independence we assess whether paired observations on two variables (arranged in a contingency table) are independent of each other.
Example: A social scientist sampled 140 people and classified them according to income level and whether or not they played a state lottery in the last month. The sample information is reported below. Is it reasonable to conclude that playing the lottery is related to income level? Use the 0.05 significance level.

| | Low income | Middle income | High income | Total |
|---|---|---|---|---|
| Played | 46 | 28 | 21 | 95 |
| Did not play | 14 | 12 | 19 | 45 |
| Total | 60 | 40 | 40 | 140 |
The step-by-step procedure for testing the hypothesis about the association between these two variables is described below.
Step 1:
$H_0$: There is no relationship between income and whether the person played the lottery.
$H_1$: There is a relationship between income and whether the person played the lottery.
Step 2: Level of significance $\alpha = 0.05$
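Step 3: Test statistic (calculations)

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is an observed frequency and $f_e$ is the corresponding expected frequency (row total × column total / grand total).

| Observed Frequencies ($f_o$) | Expected Frequencies ($f_e$) | $(f_o - f_e)^2 / f_e$ |
|---|---|---|
| 46 | 95*60/140 = 40.71 | 0.686 |
| 28 | 95*40/140 = 27.14 | 0.027 |
| 21 | 95*40/140 = 27.14 | 1.390 |
| 14 | 45*60/140 = 19.29 | 1.449 |
| 12 | 45*40/140 = 12.86 | 0.057 |
| 19 | 45*40/140 = 12.86 | 2.935 |
| | $\chi^2$ = | 6.544 |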
Step 4: Critical region
The tabular chi-square value at the 0.05 level of significance and $(r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$ degrees of freedom is 5.991.
Step 5: Decision
As the calculated chi-square value (6.544) is greater than the tabular chi-square value (5.991), we reject $H_0$, which means that there is a relationship between income level and playing the lottery.
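The calculation in this example can be reproduced with a short Python sketch. The variable names are illustrative, and the critical value 5.991 is simply taken from the chi-square table as in Step 4 rather than computed.

```python
# Observed frequencies: rows are Played / Did not play, columns are Low / Middle / High income
observed = [
    [46, 28, 21],
    [14, 12, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell: row total * column total / grand total
expected = [
    [rt * ct / grand_total for ct in col_totals]
    for rt in row_totals
]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 5.991  # tabular value for alpha = 0.05 and 2 degrees of freedom
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject_h0}")
```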
Note that there are several types of chi-square tests (such as Yates, likelihood ratio, and the portmanteau test in time series); which one applies depends on the way the data were collected and on the hypothesis being tested.
Mean: Measure of Central Tendency
The mean (also known as the average or arithmetic mean) is a measure of central tendency used to describe a data set by a single number (value) that represents the middle (center) of the data, that is, the average measure (performance, behaviour, etc.) of the data. A measure of central tendency is also known as a measure of central location or a measure of center.
Mathematically, the mean is defined as the sum of all the values in a given data set divided by the number of observations in that data set.
Example: Consider the following data set, consisting of the marks of 15 students in a certain examination.
50, 55, 65, 43, 78, 20, 100, 5, 90, 23, 40, 56, 70, 88, 30
The mean of the above data values is computed by adding all these values (50 + 55 + 65 + 43 + 78 + 20 + 100 + 5 + 90 + 23 + 40 + 56 + 70 + 88 + 30 = 813) and then dividing by the number of observations added (15), which gives 54.2 marks, that is, $\frac{813}{15} = 54.2$.
The above procedure for calculating the mean can be represented mathematically as

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

The Greek symbol $\mu$ (pronounced “mu”) represents the population mean in statistics, and $N$ is the number of observations in the population data set.
The above formula is known as the population mean, as it is computed for the whole population. The sample mean is computed in the same manner as the population mean; the only difference is in the notation of the formula, that is,

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Here $\bar{X}$ represents the sample mean and $n$ is the number of observations in the sample.
The mean is used for numeric data only. Statistically, the data type for calculating the mean should be quantitative (variables should be measured on either the ratio or the interval scale); therefore, the numbers in the data set can be continuous and/or discrete in nature.
Note that the mean should not be computed for alphabetic or categorical data (data on the nominal or ordinal scale). The mean is influenced by extreme values in the data, i.e. very large or very small values change the mean drastically.
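The calculation of the mean for the marks in the example, and the effect of an extreme value on it, can be sketched in Python. The extra observation of 1000 is a hypothetical outlier added only to illustrate the sensitivity of the mean.

```python
marks = [50, 55, 65, 43, 78, 20, 100, 5, 90, 23, 40, 56, 70, 88, 30]

# Mean = sum of all values / number of observations
total = sum(marks)
mean = total / len(marks)

# Hypothetical extreme value appended to show how strongly it shifts the mean
marks_with_outlier = marks + [1000]
mean_with_outlier = sum(marks_with_outlier) / len(marks_with_outlier)

print(f"sum = {total}, n = {len(marks)}, mean = {mean}")
print(f"mean with one extreme value of 1000 = {mean_with_outlier}")
```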
For other measures of central tendency visit: Measures of Central Tendencies