In simple linear regression, we assess the relationship between one dependent (regressand) and one independent (regressor) variable. The goal is to fit a line through the scatterplot of observations and find the line that best describes the data.
Suppose you are a marketing research analyst at a music label and your task is to suggest, on the basis of past data, a marketing plan for the next year that will maximize product sales. The data set that is available to you includes information on the sales of music downloads (thousands of units), advertising expenditures (in Euros), the number of radio plays an artist received per week (airplay), the number of previous releases of an artist (starpower), repertoire origin (country; 0 = local, 1 = international), and genre (1 = rock, 2 = pop, 3 = electronic). Let’s load and inspect the data first:
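A minimal sketch of how the data could be loaded (the file name is an assumption; the data frame name "regression" and the variable names follow the text):

```r
# load the data set (hypothetical file name) and inspect its structure
regression <- read.csv("music_sales_regression.csv")
head(regression)
str(regression)
```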
As stated above, regression analysis may be used to relate a quantitative response (“dependent variable”) to one or more predictor variables (“independent variables”). In a simple linear regression, we have one dependent and one independent variable.
Here are a few important questions that we might seek to address based on the data:
We may use linear regression to answer these questions. Let’s start with the first question and investigate the effect of advertising on sales.
A simple linear regression model only has one predictor and can be written as:
\[\begin{equation} Y=\beta_0+\beta_1X+\epsilon \tag{6.5} \end{equation}\]
In our specific context, let’s consider only the influence of advertising on sales for now:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\epsilon \tag{6.6} \end{equation}\]
The word “adspend” represents data on advertising expenditures that we have observed and \(\beta_1\) (the “slope”) represents the unknown relationship between advertising expenditures and sales. It tells you by how much sales will increase for an additional Euro spent on advertising. \(\beta_0\) (the “intercept”) is the number of sales we would expect if no money is spent on advertising. Together, \(\beta_0\) and \(\beta_1\) represent the model coefficients or parameters. The error term (\(\epsilon\)) captures everything that we miss by using our model, including (1) misspecification (the true relationship might not be linear), (2) omitted variables (other variables might drive sales), and (3) measurement error (our measurement of the variables might be imperfect).
Once we have used our training data to produce estimates for the model coefficients, we can predict future sales on the basis of a particular value of advertising expenditures by computing:
\[\begin{equation} \hat{Sales}=\hat{\beta_0}+\hat{\beta_1}*adspend \tag{6.7} \end{equation}\]
We use the hat symbol, ^ , to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response (sales). In practice, \(\beta_0\) and \(\beta_1\) are unknown and must be estimated from the data to make predictions. In the case of our advertising example, the data set consists of the advertising budget and product sales (n = 200). Our goal is to obtain coefficient estimates such that the linear model fits the available data well. In other words, we fit a line through the scatterplot of observations and try to find the line that best describes the data. The following graph shows the scatterplot for our data, where the black line shows the regression line. The grey vertical lines show the differences between the predicted values (the regression line) and the observed values. These differences are referred to as the residuals (“e”).
Figure 6.5: Ordinary least squares (OLS)
Estimation of the regression function is based on the method of least squares (OLS = ordinary least squares). The first step is to calculate the residuals by subtracting the predicted values from the observed values.
\(e_i = Y_i-(\beta_0+\beta_1X_i)\)
The coefficients are then chosen such that the sum of the squared residuals is minimized:
\[\begin{equation} \sum_{i=1}^{N} e_i^2= \sum_{i=1}^{N} [Y_i-(\beta_0+\beta_1X_i)]^2\rightarrow min! \tag{6.8} \end{equation}\]
where
\(e_i\): residuals (i = 1, 2, …, N)
\(Y_i\): values of the dependent variable (i = 1, 2, …, N)
\(\beta_0\): intercept
\(\beta_1\): regression coefficient / slope parameter
\(X_i\): values of the independent variable (i = 1, 2, …, N)
\(N\): number of observations
This is also referred to as the residual sum of squares (RSS). Now we need to choose the values for \(\beta_0\) and \(\beta_1\) that minimize the RSS. So how can we derive these values for the regression coefficients? The equation for \(\hat{\beta_1}\) is given by:
\[\begin{equation} \hat{\beta_1}=\frac{COV_{XY}}{s_x^2} \tag{6.9} \end{equation}\]
The exact mathematical derivation of this formula is beyond the scope of this script, but the intuition is to take the first derivative of the sum of squared residuals with respect to \(\beta_1\) and set it to zero, thereby finding the \(\beta_1\) that minimizes the term. Using the above formula, you can easily compute \(\beta_1\) using the following code:
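A minimal sketch of this computation, assuming the data frame and variable names introduced above:

```r
# slope estimate from equation (6.9): covariance of X and Y divided by the variance of X
cov(regression$adspend, regression$sales) / var(regression$adspend)
```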
The interpretation of \(\beta_1\) is as follows:
For every extra Euro spent on advertising, sales can be expected to increase by 0.096 units. Or, in other words, if we increase our marketing budget by 1,000 Euros, sales can be expected to increase by 96 units.
Using the estimated coefficient for \(\beta_1\), it is easy to compute \(\beta_0\) (the intercept) as follows:
\[\begin{equation} \hat{\beta_0}=\overline{Y}-\hat{\beta_1}\overline{X} \tag{6.10} \end{equation}\]
The R code for this is:
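A minimal sketch, reusing the slope estimate from above:

```r
# intercept estimate from equation (6.10): mean of Y minus slope times mean of X
beta_1 <- cov(regression$adspend, regression$sales) / var(regression$adspend)
mean(regression$sales) - beta_1 * mean(regression$adspend)
```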
The interpretation of \(\beta_0\) is as follows:
If we spend no money on advertising, we would expect to sell 134.14 units.
You may also verify this based on a scatterplot of the data. The following plot shows the scatterplot including the regression line, which is estimated using OLS.
Figure 6.6: Scatterplot
You can see that the regression line intersects the y-axis at 134.14, which corresponds to the expected sales level when advertising expenditure (on the x-axis) is zero (i.e., the intercept \(\beta_0\)). The slope coefficient (\(\beta_1\)) tells you by how much sales (on the y-axis) increase if advertising expenditures (on the x-axis) are increased by one unit.
In a next step, we assess whether the effect of advertising on sales is statistically significant. This means that we test the null hypothesis \(H_0\): “There is no relationship between advertising and sales” versus the alternative hypothesis \(H_1\): “There is some relationship between advertising and sales”. Or, stated mathematically:
\[H_0:\beta_1=0\] \[H_1:\beta_1\ne0\]
How can we test if the effect is statistically significant? Recall the generalized equation to derive a test statistic:
\[\begin{equation} test\ statistic = \frac{effect}{error} \tag{6.11} \end{equation}\]
The effect is given by the \(\beta_1\) coefficient in this case. To compute the test statistic, we need to come up with a measure of uncertainty around this estimate (the error). This is because we use information from a sample to estimate the least squares line and then make inferences regarding the regression line in the entire population. Since we only have access to one sample, the regression line will be slightly different every time we take a different sample from the population. This is sampling variation and it is perfectly normal! It just means that we need to take into account the uncertainty around the estimate, which is achieved by the standard error. Thus, the test statistic for our hypothesis is given by:
\[\begin{equation} t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \tag{6.12} \end{equation}\]
After calculating the test statistic, we compare its value to the values that we would expect to find if there was no effect, based on the t-distribution. In a regression context, the degrees of freedom are given by N - p - 1, where N is the sample size and p is the number of predictors. In our case, we have 200 observations and one predictor. Thus, the degrees of freedom are 200 - 1 - 1 = 198. In the regression output below, R provides the exact probability of observing a t-value of this magnitude (or larger) if the null hypothesis were true. This probability is the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the outcome variable due to chance in the absence of any real association between them.
To estimate the regression model in R, you can use the lm() function. Within the function, you first specify the dependent variable (“sales”) and independent variable (“adspend”) separated by a ~ (tilde). As mentioned previously, this is known as formula notation in R. The data = regression argument specifies that the variables come from the data frame named “regression”. Strictly speaking, you use the lm() function to create an object called “simple_regression,” which holds the regression output. You can then view the results using the summary() function:
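A sketch of the call described above:

```r
# estimate the simple linear regression and inspect the results
simple_regression <- lm(sales ~ adspend, data = regression)
summary(simple_regression)
```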
Note that the estimated coefficients for \(\beta_0\) (134.14) and \(\beta_1\) (0.096) correspond to the results of our manual computation above. The associated t-values and p-values are given in the output. The t-values are larger than the critical t-values for the 95% confidence level, since the associated p-values are smaller than 0.05. In the case of the coefficient for \(\beta_1\), this means that the probability of observing an association between advertising and sales of the observed magnitude (or larger) would be smaller than 0.05 if the true value of \(\beta_1\) was, in fact, 0.
The coefficients associated with the respective variables represent point estimates. To get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for \(\beta_1\), the confidence interval can be computed as:
\[\begin{equation} CI = \hat{\beta_1}\pm(t_{1-\frac{\alpha}{2}}*SE(\beta_1)) \tag{6.13} \end{equation}\]
It is easy to compute confidence intervals in R using the confint() function. You just have to provide the name of your estimated model as an argument:
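For example, using the model object from above:

```r
# 95% confidence intervals for the intercept and the slope
confint(simple_regression)
```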
For our model, the 95% confidence interval for \(\beta_0\) is [119.28, 149], and the 95% confidence interval for \(\beta_1\) is [0.08, 0.12]. Thus, we can conclude that when we do not spend any money on advertising, sales will be somewhere between 119 and 149 units on average. In addition, for each increase in advertising expenditures by one Euro, there will be an average increase in sales of between 0.08 and 0.12 units.
Once we have rejected the null hypothesis in favor of the alternative hypothesis, the next step is to investigate to what extent the model represents (“fits”) the data. How can we assess the model fit?
Similar to ANOVA, the calculation of model fit statistics relies on the different sum of squares values. \(SS_T\) (the total variation) is the sum of the squared differences between the observed data and the mean value of Y. In the absence of any other information, the mean value of Y represents the best guess of where an observation at a given level of advertising will fall:
\[\begin{equation} SS_T= \sum_{i=1}^{N} (Y_i-\overline{Y})^2 \tag{6.14} \end{equation}\]
The following graph shows the total sum of squares:
Figure 6.7: Total sum of squares
Based on our linear model, the best guess about the sales level at a given level of advertising is the predicted value. The model sum of squares (\(SS_M\)) has the mathematical representation:
\[\begin{equation} SS_M= \sum_{i=1}^{N} (\hat{Y}_i-\overline{Y})^2 \tag{6.15} \end{equation}\]
The model sum of squares represents the improvement in prediction resulting from using the regression model rather than the mean of the data. The following graph shows the model sum of squares for our example:
Figure 6.8: Ordinary least squares (OLS)
The residual sum of squares (\(SS_R\)) is the sum of the squared differences between the observed data and the predicted values along the regression line (i.e., the variation not explained by the model):
\[\begin{equation} SS_R= \sum_{i=1}^{N} (Y_i-\hat{Y}_i)^2 \tag{6.16} \end{equation}\]
The following graph shows the residual sum of squares for our example:
Figure 6.9: Ordinary least squares (OLS)
The \(R^2\) statistic represents the proportion of variance that is explained by the model and is computed as:
\[\begin{equation} R^2= \frac{SS_M}{SS_T} \tag{6.16} \end{equation}\]
It takes values between 0 (very bad fit) and 1 (very good fit). Note that when the goal of your model is to predict future outcomes, a “too good” model fit can pose severe challenges. The reason is that the model might fit your specific sample so well, that it will only predict well within the sample but not generalize to other samples. This is called overfitting and it shows that there is a trade-off between model fit and out-of-sample predictive ability of the model, if the goal is to predict beyond the sample.
You can get a first impression of the fit of the model by inspecting the scatter plot as can be seen in the plot below. If the observations are highly dispersed around the regression line (left plot), the fit will be lower compared to a data set where the values are less dispersed (right plot).
Figure 6.10: Good vs. bad model fit
The \(R^2\) statistic is reported in the regression output (see above). However, you could also extract the relevant sum of squares statistics from the regression object using the anova() function and compute it manually:
Now we can compute \(R^2\) in the same way that we computed \(\eta^2\) in the last section:
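A sketch of the manual computation (the anova() output for an lm object contains the sum of squares for the predictor and the residuals):

```r
anova(simple_regression)
ss_m <- anova(simple_regression)["adspend", "Sum Sq"]    # model sum of squares
ss_r <- anova(simple_regression)["Residuals", "Sum Sq"]  # residual sum of squares
ss_m / (ss_m + ss_r)                                     # R^2 = SS_M / SS_T
```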
Due to the way the \(R^2\) statistic is calculated, it will never decrease if a new explanatory variable is introduced into the model. This means that every new independent variable either leaves the \(R^2\) unchanged or increases it, even if there is no real relationship between the new variable and the dependent variable. Hence, one could be tempted to just add as many variables as possible to increase the \(R^2\) and thus obtain a “better” model. However, this actually only leads to more noise and therefore a worse model.
To account for this, there exists a statistic closely related to the \(R^2\), the adjusted \(R^2\). It can be calculated as follows:
\[\begin{equation} \overline{R^2} = 1 - (1 - R^2)\frac{n-1}{n - k - 1} \tag{6.17} \end{equation}\]
where n is the total number of observations and k is the total number of explanatory variables. The adjusted \(R^2\) is equal to or less than the regular \(R^2\) and can be negative. It will only increase if the added variable adds more explanatory power than one would expect by pure chance. Essentially, it contains a “penalty” for including unnecessary variables and therefore favors more parsimonious models. As such, it is useful for comparing different models and is particularly helpful in the model selection stage of a project. In R, the standard lm() function automatically reports the adjusted \(R^2\) as well.
Another significance test is the F-test. It tests the null hypothesis:
\[H_0:R^2=0\]
This is equivalent to the following null hypothesis:
\[H_0:\beta_1=\beta_2=...=\beta_k=0\]
The F-test statistic is calculated as follows:
\[\begin{equation} F=\frac{\frac{SS_M}{k}}{\frac{SS_R}{(n-k-1)}}=\frac{MS_M}{MS_R} \tag{6.16} \end{equation}\]
which follows an F distribution with k (the number of predictors) and n - k - 1 degrees of freedom. In other words, you divide the systematic (“explained”) variation due to the predictor variables by the unsystematic (“unexplained”) variation.
The result of the F-test is provided in the regression output. However, you might manually compute the F-test using the ANOVA results from the model:
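A sketch of the manual computation, following the formula above:

```r
ss_m <- anova(simple_regression)["adspend", "Sum Sq"]
ss_r <- anova(simple_regression)["Residuals", "Sum Sq"]
k <- 1                 # number of predictors
n <- nrow(regression)  # number of observations
f_stat <- (ss_m / k) / (ss_r / (n - k - 1))
f_stat
pf(f_stat, k, n - k - 1, lower.tail = FALSE)  # associated p-value
```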
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising. Suppose you want to predict sales for a new product, and the company plans to spend 800 Euros on advertising. How much will it sell? You can easily compute this either by hand:
\[\hat{sales}=134.134 + 0.09612*800=211\]
… or by extracting the estimated coefficients from the model summary:
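For example:

```r
# prediction for adspend = 800, using the estimated coefficients
coef(simple_regression)[1] + coef(simple_regression)[2] * 800
# equivalently, via the predict() function
predict(simple_regression, newdata = data.frame(adspend = 800))
```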
The predicted value of the dependent variable is 211 units, i.e., the product will (on average) sell 211 units.
The following video summarizes how to conduct simple linear regression in R
Multiple linear regression is a statistical technique that simultaneously tests the relationships between two or more independent variables and an interval-scaled dependent variable. The general form of the equation is given by:
\[\begin{equation} Y=(\beta_0+\beta_1*X_1+\beta_2*X_2+\ldots+\beta_n*X_n)+\epsilon \tag{6.5} \end{equation}\]
Again, we aim to find the linear combination of predictors that correlates maximally with the outcome variable. Note that if you change the composition of predictors, the partial regression coefficient of an independent variable will generally differ from the corresponding bivariate regression coefficient. This is because the regressors are usually correlated, and in the bivariate regression any variation in Y that was shared by \(X_1\) and \(X_2\) was attributed to \(X_1\). The interpretation of a partial regression coefficient is the expected change in Y when the corresponding X is changed by one unit and all other predictors are held constant.
Let’s extend the previous example. Say, in addition to the influence of advertising, you are interested in estimating the influence of airplay on the number of album downloads. The corresponding equation would then be given by:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\epsilon \tag{6.6} \end{equation}\]
The words “adspend” and “airplay” represent data that we have observed on advertising expenditures and the number of radio plays, and \(\beta_1\) and \(\beta_2\) represent the unknown relationships between sales and advertising expenditures and radio airplay, respectively. The coefficients tell you by how much sales will increase for an additional Euro spent on advertising (when radio airplay is held constant) and by how much sales will increase for an additional radio play (when advertising expenditures are held constant). Thus, we can make predictions about album sales based not only on advertising spending, but also on radio airplay.
With several predictors, the partitioning of sum of squares is the same as in the bivariate model, except that the model is no longer a 2-D straight line. With two predictors, the regression line becomes a 3-D regression plane. In our example:
Figure 6.11: Regression plane
Like in the bivariate case, the plane is fitted to the data with the aim of predicting the observed data as well as possible. The deviations of the observations from the plane represent the residuals (the error we make in predicting the observed data from the model). Note that this is conceptually the same as in the bivariate case, except that the computation is more complex (we won’t go into details here). The model is fairly easy to plot using a 3-D scatterplot, because we only have two predictors. While multiple regression models with more than two predictors are not as easy to visualize, you may apply the same principles when interpreting the model outcome:
Estimating multiple regression models is straightforward using the lm() function. You just need to separate the individual predictors on the right hand side of the equation using the + symbol. For example, the model:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\beta_3*starpower+\epsilon \tag{6.6} \end{equation}\]
could be estimated as follows:
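A sketch of the call (the name of the model object is an assumption):

```r
multiple_regression <- lm(sales ~ adspend + airplay + starpower, data = regression)
summary(multiple_regression)
```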
The interpretation of the coefficients is as follows:
The associated t-values and p-values are also given in the output. You can see that the p-values are smaller than 0.05 for all three coefficients. Hence, all effects are “significant”. This means that if the null hypothesis was true (i.e., there was no effect between the variables and sales), the probability of observing associations of the estimated magnitudes (or larger) is very small (e.g., smaller than 0.05).
Again, to get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals:
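For example, using the model object from above:

```r
confint(multiple_regression)
```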
What does this tell you? Recall that a 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for \(\beta_3\), the confidence interval is [6.28, 15.89]. Thus, although we have computed a point estimate of 11.09 for the effect of starpower on sales based on our sample, the effect might just as well take any other value within this range, considering the sample size and the variability in our data.
The output also tells us that 66.47% of the variation can be explained by our model. You may also visually inspect the fit of the model by plotting the predicted values against the observed values. We can extract the predicted values using the predict() function. So let’s create a new variable yhat , which contains those predicted values.
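For example:

```r
# store the predicted values in a new variable
regression$yhat <- predict(multiple_regression)
```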
We can now use this variable to plot the predicted values against the observed values. In the following plot, the model fit would be perfect if all points fell on the diagonal line. The larger the distance between the points and the line, the worse the model fit.
Figure 6.12: Model fit
Partial plots
In the context of a simple linear regression (i.e., with a single independent variable), a scatter plot of the dependent variable against the independent variable provides a good indication of the nature of the relationship. If there is more than one independent variable, however, things become more complicated. The reason is that although the scatter plot still shows the relationship between the two variables, it does not take into account the effect of the other independent variables in the model. Partial regression plots show the effect of adding another variable to a model that already controls for the remaining variables in the model. In other words, a partial regression plot is a scatterplot of the residuals of the outcome variable and of a predictor when both are regressed separately on the remaining predictors. As an example, consider the effect of advertising expenditures on sales. In this case, the partial plot would show the effect of adding advertising expenditures as an explanatory variable while controlling for the variation that is explained by airplay and starpower in both variables (sales and advertising). Think of it as the purified relationship between advertising and sales that remains after controlling for other factors. The partial plots can easily be created using the avPlots() function from the car package:
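A sketch of the call, assuming the multiple regression object from above:

```r
library(car)
avPlots(multiple_regression)
```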
Figure 6.13: Partial plots
Using the model
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising, airplay, and starpower. Suppose you would like to predict sales for a new music album with advertising expenditures of 800, airplay of 30 and starpower of 5. How much will it sell?
\[\hat{sales}=−26.61 + 0.084 * 800 + 3.367*30 + 11.08 ∗ 5= 197.74\]
… or by extracting the estimated coefficients:
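For example:

```r
# prediction for adspend = 800, airplay = 30, starpower = 5
coefs <- coef(multiple_regression)
coefs[1] + coefs[2] * 800 + coefs[3] * 30 + coefs[4] * 5
# equivalently, via the predict() function
predict(multiple_regression, newdata = data.frame(adspend = 800, airplay = 30, starpower = 5))
```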
The predicted value of the dependent variable is 198 units, i.e., the product will sell 198 units.
Comparing effects
Using the output from the regression model above, it is difficult to compare the effects of the independent variables because they are all measured on different scales (Euros, radio plays, releases). Standardized regression coefficients can be used to judge the relative importance of the predictor variables. Standardization is achieved by multiplying the unstandardized coefficient by the ratio of the standard deviations of the independent and dependent variables:
\[\begin{equation} B_{k}=\beta_{k} * \frac{s_{x_k}}{s_y} \tag{6.18} \end{equation}\]
Hence, the standardized coefficient will tell you by how many standard deviations the outcome will change as a result of a one standard deviation change in the predictor variable. Standardized coefficients can be easily computed using the lm.beta() function from the lm.beta package.
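For example:

```r
library(lm.beta)
lm.beta(multiple_regression)
```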
The results show that for adspend and airplay , a change by one standard deviation will result in a 0.51 standard deviation change in sales, whereas for starpower , a one standard deviation change will only lead to a 0.19 standard deviation change in sales. Hence, while the effects of adspend and airplay are comparable in magnitude, the effect of starpower is less strong.
The following video summarizes how to conduct multiple regression in R
Once you have built and estimated your model it is important to run diagnostics to ensure that the results are accurate. In the following section we will discuss common problems.
The following video summarizes how to handle outliers in R
Outliers are data points that differ vastly from the trend. They can introduce bias into a model due to the fact that they alter the parameter estimates. Consider the example below. A linear regression was performed twice on the same data set, except during the second estimation the two green points were changed to be outliers by being moved to the positions indicated in red. The solid red line is the regression line based on the unaltered data set, while the dotted line was estimated using the altered data set. As you can see the second regression would lead to different conclusions than the first. Therefore it is important to identify outliers and further deal with them.
Figure 6.14: Effects of outliers
One quick way to visually detect outliers is by creating a scatterplot (as above) to see whether anything seems off. Another approach is to inspect the studentized residuals. If there are no outliers in your data, about 95% will be between -2 and 2, as per the assumptions of the normal distribution. Values well outside of this range are unlikely to happen by chance and warrant further inspection. As a rule of thumb, observations whose studentized residuals are greater than 3 in absolute values are potential outliers.
The studentized residuals can be obtained in R with the function rstudent() . We can use this function to create a new variable that contains the studentized residuals. The music sales regression from before yields the following residuals:
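A minimal sketch (the name of the new variable is an assumption):

```r
# studentized residuals of the multiple regression model
regression$stud_resid <- rstudent(multiple_regression)
head(regression$stud_resid)
```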
A good way to visually inspect the studentized residuals is to plot them in a scatterplot and roughly check if most of the observations are within the -3, 3 bounds.
Figure 6.15: Plot of the studentized residuals
To identify potentially influential observations in our data set, we can apply a filter to our data:
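For example, using the studentized residuals created above:

```r
# observations with studentized residuals larger than 3 in absolute value
subset(regression, abs(stud_resid) > 3)
```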
After a detailed inspection of the potential outliers, you might decide whether or not to delete the affected observations from the data set. If an outlier has resulted from an error in data collection, you might simply remove the observation. However, even though observations may have extreme values, they might not be influential in determining the regression line. That is, the results wouldn’t be much different whether we include or exclude them from the analysis. This means that the decision of whether to exclude an outlier is closely related to the question of whether this observation is an influential observation, as will be discussed next.
Related to the issue of outliers is that of influential observations, meaning observations that exert undue influence on the parameters. You can determine whether or not the results are driven by an influential observation by calculating how far the predicted values for your data would move if the model were fitted without this particular observation. This calculated total distance is called Cook’s distance. To identify influential observations, we can inspect the respective plots created from the model output. A rule of thumb is to classify an observation as influential if its Cook’s distance is greater than 1 (although opinions vary on this). The following plot can be used to see the Cook’s distance associated with each data point:
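The plot can be created from the fitted model object; a sketch:

```r
# plot 4 of the standard lm() diagnostics shows Cook's distance
plot(multiple_regression, 4)
```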
Figure 6.16: Cook’s distance
It is easy to see that none of the Cook’s distance values is close to the critical value of 1. Another useful plot to identify influential observations is plot number 5 from the output:
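For example:

```r
# plot 5 of the standard lm() diagnostics: residuals vs. leverage
plot(multiple_regression, 5)
```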
Figure 6.17: Residuals vs. Leverage
In this plot, we look for cases outside of the dashed lines, which represent thresholds for Cook’s distance. Lines for Cook’s distance thresholds of 0.5 and 1 are included by default. In our example, these lines are not even visible, since the Cook’s distance values are far away from the critical values. Generally, you would watch out for outlying values in the upper right or lower right corner of the plot. Those are the places where cases can be influential against a regression line. In our example, there are no influential cases.
To see how influential observations can impact your regression, have a look at this example .
An important underlying assumption for OLS is that of linearity, meaning that the relationship between the dependent and the independent variable can be reasonably approximated in linear terms. One quick way to assess whether a linear relationship can be assumed is to inspect the added variable plots that we already came across earlier:
Figure 6.18: Partial plots
In our example, it appears that linear relationships can be reasonably assumed. Please note, however, that the assumption of linearity implies two things:
These assumptions may not be justifiable in certain contexts and you might have to transform your data (e.g., using log-transformations) in these cases, as we will see below.
The following video summarizes how to identify non-constant error variance in R
Another important assumption of the linear model is that the error terms have a constant variance (i.e., homoscedasticity). The following plot from the model output shows the residuals (the vertical distance from an observed value to the predicted values) versus the fitted values (the predicted value from the regression model). If all the points fell exactly on the dashed grey line, it would mean that we have a perfect prediction. The residual variance (i.e., the spread of the values on the y-axis) should be similar across the scale of the fitted values on the x-axis.
Figure 6.19: Residuals vs. fitted values
In our case, this appears to be the case. You can identify non-constant variances in the errors (i.e., heteroscedasticity) from the presence of a funnel shape in the above plot. When the assumption of constant error variances is not met, this might be due to a misspecification of your model (e.g., the relationship might not be linear). In these cases, it often helps to transform your data (e.g., using log-transformations). The red line also helps you to identify potential misspecification of your model. It is a smoothed curve that passes through the residuals, and if it lies close to the grey dashed line (as in our case), it suggests a correct specification. If the line deviated substantially from the dashed grey line (e.g., a U-shape or inverse U-shape), it would suggest that the linear specification is not reasonable and you should try different specifications.
If OLS is performed despite heteroscedasticity, the coefficient estimates will still be correct on average. However, the estimator is inefficient, meaning that the standard errors are wrong, which will impact the significance tests (i.e., the p-values will be wrong). There are, however, robust regression methods that you can use to estimate your model despite the presence of heteroscedasticity.
Another assumption of OLS is that the error term is normally distributed. This can be a reasonable assumption for many scenarios, but we still need a way to check if it is actually the case. As we can not directly observe the actual error term, we have to work with the next best thing - the residuals.
A quick way to assess whether a given sample is approximately normally distributed is by using Q-Q plots. These plot the theoretical position of the observations (under the assumption that they are normally distributed) against the actual position. The plot below is created by the model output and shows the residuals in a Q-Q plot. As you can see, most of the points roughly follow the theoretical distribution, as given by the straight line. If most of the points are close to the line, the data is approximately normally distributed.
Figure 6.20: Q-Q plot
Another way to check for normal distribution of the data is to employ statistical tests that test the null hypothesis that the data is normally distributed, such as the Shapiro–Wilk test. We can extract the residuals from our model using the resid() function and apply the shapiro.test() function to it:
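For example:

```r
shapiro.test(resid(multiple_regression))
```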
As you can see, we cannot reject the \(H_0\) that the residuals are normally distributed, which means that we can assume the residuals to be approximately normally distributed.
When the assumption of normally distributed errors is not met, this might again be due to a misspecification of your model, in which case it might help to transform your data (e.g., using log-transformations).
The assumption of independent errors implies that for any two observations the residual terms should be uncorrelated. This is also known as a lack of autocorrelation. In theory, this could be tested with the Durbin-Watson test, which checks whether adjacent residuals are correlated. However, be aware that the test is sensitive to the order of your data. Hence, it only makes sense if there is a natural order in the data (e.g., time-series data), in which case the presence of dependent errors indicates autocorrelation. Since there is no natural order in our data, we don’t need to apply this test.
If you are confronted with data that has a natural order, you can perform the test using the command durbinWatsonTest() , which takes the object that the lm() function generates as an argument. The test statistic varies between 0 and 4, with values close to 2 being desirable. As a rule of thumb, values below 1 and above 3 are causes for concern.
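A sketch of the call (shown here for our model, even though the test is not meaningful for unordered data):

```r
library(car)
durbinWatsonTest(multiple_regression)
```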
Linear dependence of regressors, also known as multicollinearity, is when there is a strong linear relationship between the independent variables. Some correlation will always be present, but severe correlation can make proper estimation impossible. When present, it affects the model in several ways:
A quick way to find obvious multicollinearity is to examine the correlation matrix of the data. Any correlation above 0.8 to 0.9 should be cause for concern. You can, for example, create a correlation matrix using the rcorr() function from the Hmisc package.
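A sketch (rcorr() expects a numeric matrix; the selection of columns is an assumption):

```r
library(Hmisc)
rcorr(as.matrix(regression[, c("adspend", "airplay", "starpower")]))
```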
The bivariate correlations can also be shown in a plot:
Figure 6.21: Bivariate correlation plots
However, this only spots bivariate multicollinearity. Variance inflation factors (VIF) can be used to spot more subtle multicollinearity arising from multivariate relationships. The VIF is calculated by regressing \(X_i\) on all other predictors and using the resulting \(R_i^2\) to calculate
\[\begin{equation} \begin{split} \frac{1}{1 - R_i^2} \end{split} \tag{6.19} \end{equation}\]
VIF values of over 4 are certainly cause for concern and values over 2 should be further investigated. If the average VIF is over 1 the regression may be biased. The VIF for all variables can easily be calculated in R with the vif() function.
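For example:

```r
library(car)
vif(multiple_regression)
```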
As you can see the values are well below the cutoff, indicating that we do not have to worry about multicollinearity in our example.
If a variable that influences the outcome is left out of the model (“omitted”), a bias in other variables’ coefficients might be introduced. Specifically, the other coefficients will be biased if the corresponding variables are correlated with the omitted variable. Intuitively, the variables left in the model “pick up” the effect of the omitted variable to the degree that they are related. Let’s illustrate this with an example.
Consider the following data on the number of people visiting concerts of smaller bands.
The data set contains three variables: concert_visitors (the number of tickets sold for a concert), avg_rating (the average rating of the band), and followers (the number of followers the band has).
If we estimate a model to explain the number of tickets sold as a function of the average rating and the number of followers, the results would look as follows:
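A minimal sketch (the names of the data frame and the model object are assumptions; the variable names follow the text):

```r
full_model <- lm(concert_visitors ~ avg_rating + followers, data = concert_data)
summary(full_model)
```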
Now assume we don’t have data on the number of followers a band has, but we still have information on the average rating and want to explain the number of tickets sold. Fitting a linear model with just the avg_rating variable included yields the following results:
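A sketch, using the assumed names from above:

```r
reduced_model <- lm(concert_visitors ~ avg_rating, data = concert_data)
summary(reduced_model)
```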
What happens to the coefficient of avg_rating ? Because avg_rating and followers are not independent (e.g. one could argue that bands with a higher average rating probably have more followers) the coefficient will be biased. In our case we massively overestimate the effect that the average rating of a band has on ticket sales. In the original model, the effect was about 20.5. In the new, smaller model, the effect is approximately 3.1 times higher.
We can also work out intuitively what the bias will be. The marginal effect of followers on concert_visitors is captured by avg_rating to the degree that avg_rating is related to followers . There are two coefficients of interest:
The former is just the coefficient of followers in the original regression.
The latter is the coefficient of avg_rating obtained from a regression of followers on avg_rating , since this coefficient shows how avg_rating and followers relate to each other.
Now we can calculate the bias induced by omitting followers :
To calculate the biased coefficient, simply add the bias to the coefficient from the original model.
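A sketch of this computation, using the assumed object names from above:

```r
# coefficient of followers in the full model
beta_followers <- coef(full_model)["followers"]
# coefficient of avg_rating when regressing followers on avg_rating
delta <- coef(lm(followers ~ avg_rating, data = concert_data))["avg_rating"]
bias <- beta_followers * delta
# biased coefficient = coefficient from the full model + bias
coef(full_model)["avg_rating"] + bias
coef(reduced_model)["avg_rating"]  # (approximately) the same value
```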
6.4.1 Two categories
Suppose you wish to investigate the effect of the variable “country” on sales, which is a categorical variable that can only take two levels (i.e., 0 = local artist, 1 = international artist). Categorical variables with two levels are also called binary predictors. It is straightforward to include these variables in your model as “dummy” variables. Dummy variables are indicator variables that can only take the values 0 and 1. For our “country” variable, we can create a new predictor variable of the form:
\[\begin{equation} x_4 = \begin{cases} 1 & \quad \text{if } i \text{th artist is international}\\ 0 & \quad \text{if } i \text{th artist is local} \end{cases} \tag{6.20} \end{equation}\]
This new variable is then added to our regression equation from before, so that the equation becomes
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international+\epsilon \end{align}\]
where “international” represents the new dummy variable and \(\beta_4\) is the coefficient associated with this variable. Estimating the model is straightforward - you just need to include the variable as an additional predictor variable. Note that the variable needs to be specified as a factor variable before including it in your model. If you haven’t converted it to a factor variable before, you could also use the wrapper function as.factor() within the equation.
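A sketch of the call (the name of the model object is an assumption):

```r
multiple_regression_bin <- lm(sales ~ adspend + airplay + starpower + as.factor(country), data = regression)
summary(multiple_regression_bin)
```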
You can see that we now have an additional coefficient in the regression output, which tells us the effect of the binary predictor. The dummy variable can generally be interpreted as the average difference in the dependent variable between the two groups (similar to a t-test). In this case, the coefficient tells you the difference in sales between international and local artists, and whether this difference is significant. Specifically, it means that international artists on average sell 45.67 units more than local artists, and this difference is significant (i.e., p < 0.05).
Predictors with more than two categories, like our “genre” variable, can also be included in your model. However, in this case one dummy variable cannot represent all possible values, since there are three genres (i.e., 1 = Rock, 2 = Pop, 3 = Electronic). Thus, we need to create additional dummy variables. For example, for our “genre” variable, we create two dummy variables as follows:
\[\begin{equation} x_5 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Pop genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.21} \end{equation}\]
\[\begin{equation} x_6 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Electronic genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.22} \end{equation}\]
We would then add these variables as additional predictors in the regression equation and obtain the following model
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*Pop\\ &+\beta_6*Electronic+\epsilon \end{align}\]
where “Pop” and “Electronic” represent our new dummy variables, and \(\beta_5\) and \(\beta_6\) represent the associated regression coefficients.
The interpretation of the coefficients is as follows: \(\beta_5\) is the difference in average sales between the genres “Rock” and “Pop”, while \(\beta_6\) is the difference in average sales between the genres “Rock” and “Electronic”. Note that the level for which no dummy variable is created is also referred to as the baseline . In our case, “Rock” would be the baseline genre. This means that there will always be one fewer dummy variable than the number of levels.
You don’t have to create the dummy variables manually as R will do this automatically when you add the variable to your equation:
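A sketch (the name of the model object is an assumption):

```r
multiple_regression_ext <- lm(sales ~ adspend + airplay + starpower + as.factor(country) + as.factor(genre), data = regression)
summary(multiple_regression_ext)
```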
How can we interpret the coefficients? It is estimated based on our model that products from the “Pop” genre will on average sell 47.69 units more than products from the “Rock” genre, and that products from the “Electronic” genre will sell on average 27.62 units more than the products from the “Rock” genre. The p-value of both variables is smaller than 0.05, suggesting that there is statistical evidence for a real difference in sales between the genres.
The level of the baseline category is arbitrary. As you have seen, R simply selects the first level as the baseline. If you would like to use a different baseline category, you can use the relevel() function and set the reference category using the ref argument. The following would estimate the same model using the second category as the baseline:
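A sketch of this approach:

```r
# make the second genre category ("Pop") the baseline and re-estimate the model
regression$genre <- relevel(as.factor(regression$genre), ref = 2)
summary(lm(sales ~ adspend + airplay + starpower + as.factor(country) + genre, data = regression))
```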
Note that while your choice of the baseline category impacts the coefficients and the significance level, the prediction for each group will be the same regardless of this choice.
The standard linear regression model provides results that are easy to interpret and is useful for addressing many real-world problems. However, it makes rather restrictive assumptions that might be violated in many cases. Notably, it assumes that the relationship between the response and the predictor variables is additive and linear . The additive assumption states that the effect of an independent variable on the dependent variable is independent of the values of the other independent variables included in the model. The linear assumption means that the effect of a one-unit change in the independent variable on the dependent variable is the same, regardless of the value of the independent variable. This is also referred to as constant marginal returns . For example, an increase in ad-spend from 10€ to 11€ yields the same increase in sales as an increase from 100,000€ to 100,001€. This section presents alternative model specifications for cases in which these assumptions do not hold.
Regarding the additive assumption, it might be argued that the effect of some variables is not fully independent of the values of other variables. In our example, one could argue that the effect of advertising depends on the type of artist. For example, advertising might be more effective for local artists. We can investigate whether this is the case using a grouped scatterplot:
Figure 6.22: Effect of advertising by group
The scatterplot indeed suggests that there is a difference in advertising effectiveness between local and international artists. You can see this from the two different regression lines. We can incorporate this interaction effect by including an interaction term in the regression equation as follows:
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*(adspend*international)\\ &+\epsilon \end{align}\]
You can see that the effect of advertising now depends on the type of artist. Hence, the additive assumption is removed. Note that if you decide to include an interaction effect, you should always include the main effects of the variables that are part of the interaction (even if the associated p-values do not suggest significant effects). It is easy to include an interaction effect in your model by adding an additional variable that has the format `var1:var2`. In our example, this could be achieved using the following specification:
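A sketch of the call (the name of the model object is an assumption):

```r
interaction_model <- lm(sales ~ adspend + airplay + starpower + as.factor(country) + adspend:as.factor(country), data = regression)
summary(interaction_model)
```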
How can we interpret the coefficients? The adspend main effect tells you the effect of advertising for the reference group, which has the factor level zero. In our example, it is the advertising effect for local artists. This means that for local artists, spending an additional 1,000 Euros on advertising will result in approximately 89 additional unit sales. The interaction effect tells you by how much the effect differs for the other group (i.e., international artists) and whether this difference is significant. In our example, it means that the effect for international artists can be computed as: 0.0885 - 0.0347 = 0.0538. This means that for international artists, spending an additional 1,000 Euros on advertising will result in approximately 54 additional unit sales. Since the interaction effect is significant (p < 0.05), we can conclude that advertising is less effective for international artists.
The above example showed the interaction between a categorical variable (i.e., “country”) and a continuous variable (i.e., “adspend”). However, interaction effects can be defined for different combinations of variable types. For example, you might just as well specify an interaction between two continuous variables. In our example, you might suspect that there are synergy effects between advertising expenditures and radio airplay. It could be that advertising is more effective when an artist receives a large number of radio plays. In this case, we would specify our model as:
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*(adspend*airplay)\\ &+\epsilon \end{align}\]
In this case, we can interpret \(\beta_4\) as the increase in the effectiveness of advertising for a one unit increase in radio airplay (or vice versa). This can be translated to R using:
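A sketch of the call (the name of the model object is an assumption):

```r
interaction_model_2 <- lm(sales ~ adspend + airplay + starpower + adspend:airplay, data = regression)
summary(interaction_model_2)
```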
However, since the p-value of the interaction is larger than 0.05, there is little statistical evidence for an interaction between the two variables.
In our example above, it appeared that linear relationships could be reasonably assumed. In many practical applications, however, this might not be the case. Let’s review the implications of a linear specification again:
In many marketing contexts, these might not be reasonable assumptions. Consider the case of advertising. It is unlikely that the return on advertising is independent of the level of advertising expenditures. It is rather likely that saturation occurs at some level, meaning that the return from an additional Euro spent on advertising decreases with the level of advertising expenditures (i.e., decreasing marginal returns). In other words, at some point the advertising campaign has achieved a certain level of penetration and an additional Euro spent on advertising won’t yield the same return as in the beginning.
Let’s use an example data set, containing the advertising expenditures of a company and the sales (in thousand units).
Now we inspect if a linear specification is appropriate by looking at the scatterplot:
Figure 6.23: Non-linear relationship
It appears that a linear model might not represent the data well. It rather appears that the effect of an additional Euro spent on advertising decreases with increasing levels of advertising expenditures. Thus, we have decreasing marginal returns. We could put this to a test and estimate a linear model:
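A minimal sketch (the names of the data frame and the model object are assumptions; the variables are "advertising" and "sales" as described above):

```r
linear_reg <- lm(sales ~ advertising, data = non_linear_reg)
summary(linear_reg)
```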
Advertising appears to be positively related to sales, with an additional Euro spent on advertising resulting in 0.0005 additional sales. The \(R^2\) statistic suggests that approximately 51% of the total variation can be explained by the model.
To test if the linear specification is appropriate, let’s inspect some of the plots that are generated by R. We start by inspecting the residuals plot.
Figure 6.24: Residuals vs. Fitted
The plot suggests that the assumption of homoscedasticity is violated (i.e., the spread of values on the y-axis is different for different levels of the fitted values). In addition, the red line deviates from the dashed grey line, suggesting that the relationship might not be linear. Finally, the Q-Q plot of the residuals suggests that the residuals are not normally distributed.
Figure 6.25: Q-Q plot
To sum up, a linear specification might not be the best model for this data set.
In this case, a multiplicative model might be a better representation of the data. The multiplicative model has the following formal representation:
\[\begin{equation} Y =\beta_0 *X_1^{\beta_1}*X_2^{\beta_2}*...*X_J^{\beta_J}*\epsilon \tag{6.23} \end{equation}\]
This functional form can be linearized by taking the logarithm of both sides of the equation:
\[\begin{equation} log(Y) =log(\beta_0) + \beta_1*log(X_1) + \beta_2*log(X_2) + ...+ \beta_J*log(X_J) + log(\epsilon) \tag{6.24} \end{equation}\]
This means that taking logarithms of both sides of the equation makes linear estimation possible. Let’s see how the scatterplot looks if we use the logarithms of our variables (using the log() function) instead of the original values.
Figure 6.26: Linearized effect
It appears that now, with the log-transformed variables, a linear specification is a much better representation of the data. Hence, we can log-transform our variables and estimate the following equation:
\[\begin{equation} log(sales) = log(\beta_0) + \beta_1*log(advertising) + log(\epsilon) \tag{6.25} \end{equation}\]
This can be easily implemented in R by transforming the variables using the log() function:
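A sketch, reusing the assumed names from above:

```r
log_reg <- lm(log(sales) ~ log(advertising), data = non_linear_reg)
summary(log_reg)
```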
Note that this specification implies decreasing marginal returns (i.e., the returns of advertising decrease with the level of advertising), which appears to be more consistent with the data. The specification is also consistent with proportional changes in advertising being associated with proportional changes in sales (i.e., advertising does not become more effective with increasing levels). This has important implications for the interpretation of the coefficients. In our example, you would interpret the coefficient as follows: a 1% increase in advertising leads to a 0.3% increase in sales . Hence, the interpretation is in proportional terms and no longer in units. This means that the coefficients in a log-log model can be directly interpreted as elasticities, which also makes communication easier. We can also inspect the \(R^2\) statistic to see that the model fit has increased compared to the linear specification (i.e., \(R^2\) has increased to 0.681 from 0.509). However, please note that the variables are now measured on a different scale, which means that the model fit is, in theory, not directly comparable. Also, we can use the residuals plot to confirm that the revised specification is more appropriate:
Figure 6.27: Residuals plot
Figure 6.28: Q-Q plot
Finally, we can plot the predicted values against the observed values to see that the results from the log-log model (red) provide a better prediction than the results from the linear model (blue).
Figure 6.29: Comparison of model fit
Another way of modelling non-linearities is to include a squared term if there are decreasing or increasing effects. In fact, we can model non-constant slopes as long as the specification is a linear combination of powers (i.e. squared, cubed, …) of the explanatory variables. Usually we do not expect many turning points, so squared or third-power terms suffice. Note that a polynomial of degree n can accommodate up to n - 1 turning points.
When using squared terms we can model diminishing and eventually negative returns. Think about advertisement spending. If a brand is not well known, spending on ads will increase brand awareness and have a large effect on sales. In a regression model this translates to a steep slope for spending at the origin (i.e. for lower spending). However, as more and more people will already know the brand we expect that an additional Euro spent on advertisement will have less and less of an effect the more the company spends. We say that the returns are diminishing. Eventually, if they keep putting more and more ads out, people get annoyed and some will stop buying from the company. In that case the return might even get negative. To model such a situation we need a linear as well as a squared term in the regression.
lm(...) can take squared (or any power) terms as input by adding I(X^2) as an explanatory variable. In the example below we see a clear quadratic relationship with a maximum at around 70. If we try to model this using the level of the covariate without the quadratic term, we do not get a very good fit.
The graph above clearly shows that advertising spending between 0 and 50 increases sales. However, the marginal increase (i.e. the slope of the data curve) is decreasing. Around 70 the curve reaches its maximum; beyond that point additional ad-spending actually decreases sales (e.g. people get annoyed). Notice that the prediction line is straight, that is, the marginal increase in sales due to additional spending on advertising is the same for any amount of spending. This shows the danger of basing business decisions on wrongly specified models. But even in the area in which the sign of the prediction is correct, we are quite far off. Let’s take a look at the top 5 sales values and the corresponding predictions:
By including a quadratic term we can fit the data very well. This is still a linear model since the outcome variable is still explained by a linear combination of regressors even though one of the regressors is now just a non-linear function of the same variable (i.e. the squared value).
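A sketch of such a model (the data frame and object names are assumptions):

```r
# the quadratic term is wrapped in I() so that it is treated as advertising^2
quad_reg <- lm(sales ~ advertising + I(advertising^2), data = quad_data)
summary(quad_reg)
```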
Now the prediction of the model is very close to the actual data and we could base our production decisions on that model.
When interpreting the coefficients of the predictor in this model we have to be careful. Since we included the squared term, the slope is now different at each level of advertising (this can be seen in the graph above). That is, we no longer have a single coefficient to interpret as the slope. This can easily be shown by calculating the derivative of the model with respect to advertising.
\[ \text{Sales} = \alpha + \beta_1 \text{ Advertising} + \beta_2 \text{ Advertising}^2 + \varepsilon\\ {\delta \text{ Sales} \over \delta \text{ Advertising}} = \beta_1 + 2 \beta_2 \text{ Advertising} \equiv \text{Slope} \]
Intuitively, this means that the change in sales due to an additional Euro spent on advertising depends on the current level of advertising. \(\alpha\) , the intercept, can still be interpreted as the expected value of sales given that we do not advertise at all (set advertising to 0 in the model). The sign of the squared term ( \(\beta_2\) ) can be used to determine the curvature of the function. If the sign is positive, the function is convex (curvature is upwards); if it is negative, it is concave (curvature is downwards). We can interpret \(\beta_1\) and \(\beta_2\) separately in terms of their influence on the slope . By setting advertising to \(0\) we observe that \(\beta_1\) is the slope at the origin. By taking the derivative of the slope with respect to advertising we see that the change of the slope due to additional spending on advertising is two times \(\beta_2\) .
\[ {\delta Slope \over \delta Advertising} = 2\beta_2 \]
At the maximum predicted value the slope is close to \(0\) (theoretically it is equal to \(0\) , but this would require decimals and we can only sell whole pieces). Above we only calculated predictions for the observed data, so let’s first predict sales for all possible advertising levels between \(1\) and \(200\) to find the optimal level of advertising according to our model.
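A sketch of this prediction, using the assumed quadratic model from above:

```r
# predict sales for advertising levels 1 to 200 and find the level with the highest prediction
ad_grid <- data.frame(advertising = 1:200)
ad_grid$predicted_sales <- predict(quad_reg, newdata = ad_grid)
ad_grid[which.max(ad_grid$predicted_sales), ]
```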
For any other level of advertising, we insert the advertising expenditure into the formula to obtain the slope at that point. In the following example you can choose the level of advertising.
The following video summarizes how to visualize log-transformed regressions in R
In the last section we saw how to predict continuous outcomes (sales, height, etc.) via linear regression models. Another interesting case is that of binary outcomes, i.e. when the variable we want to model can only take two values (yes or no, group 1 or group 2, dead or alive, etc.). To this end we would like to estimate how our predictor variables change the probability of a value being 0 or 1. In this case we can technically still use a linear model (e.g. OLS). However, its predictions will most likely not be particularly useful. A more useful method is the logistic regression. In particular we are going to have a look at the logit model. In the following dataset we are trying to predict whether a song will be a top-10 hit on a popular music streaming platform. In a first step we are going to use only the danceability index as a predictor. Later we are going to add more independent variables.
Below are two attempts to model the data. The left panel assumes a linear probability model (calculated with the same methods as in the last chapter), while the right panel shows a logistic regression model. As you can see, the linear probability model produces probabilities above 1 and below 0, which are not valid probabilities, while the logistic model stays between 0 and 1. Notice that songs with a higher danceability index (towards the right of the x-axis) cluster more at \(1\) and those with a lower index more at \(0\), so we expect a positive influence of danceability on the probability of a song becoming a top-10 hit.
Figure 6.30: The same binary data explained by two models: a linear probability model (left) and a logistic regression model (right)
A key insight at this point is that the connection between \(\mathbf{X}\) and \(Y\) is non-linear in the logistic regression model. As we can see in the plot, the probability of success is most strongly affected by danceability around values of \(0.5\) , while higher and lower values have a smaller marginal effect. This obviously also has consequences for the interpretation of the coefficients later on.
As the name suggests, the logistic function is an important component of the logistic regression model. It has the following form:
\[ f(\mathbf{X}) = \frac{1}{1 + e^{-\mathbf{X}}} \] This function transforms all real numbers into the range between 0 and 1. We need this to model probabilities, as probabilities can only be between 0 and 1.
The logistic function on its own is not very useful yet, as we want to be able to determine how predictors influence the probability of a value to be equal to 1. To this end we replace the \(\mathbf{X}\) in the function above with our familiar linear specification, i.e.
\[ \mathbf{X} = \beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i}\\ f(\mathbf{X}) = P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}} \]
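Rearranging this expression shows that the model is linear in the log odds, which is the basis for the interpretation of the coefficients further below:
\[ \log\left(\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right) = \beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... + \beta_m * x_{m,i} \]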
In our case we only have \(\beta_0\) and \(\beta_1\) , the coefficient associated with danceability.
In general we now have a mathematical relationship between our predictor variables \((x_1, ..., x_m)\) and the probability of \(y_i\) being equal to one. The last step is to estimate the parameters of this model \((\beta_0, \beta_1, ..., \beta_m)\) to determine the magnitude of the effects.
We are now going to show how to perform logistic regression in R. Instead of lm() we now use glm(Y~X, family=binomial(link = 'logit')) to use the logit model. We can still use the summary() command to inspect the output of the model.
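The tutorial itself works in R; purely as an illustration, an analogous model could be estimated in Python with the statsmodels package (the data frame and column names below are hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: a binary outcome (top10) and one predictor (danceability)
chart_data = pd.DataFrame({
    "top10":        [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
    "danceability": [0.31, 0.45, 0.58, 0.62, 0.66, 0.71, 0.80, 0.55, 0.50, 0.91],
})

# logit model: analogous to glm(Y ~ X, family = binomial(link = 'logit')) in R
logit_model = smf.logit("top10 ~ danceability", data=chart_data).fit()
print(logit_model.summary())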
Noticeably this output does not include an \(R^2\) value to assess model fit. Multiple “pseudo \(R^2\)” measures, similar to the \(R^2\) used in OLS, have been developed. There are packages that return the \(R^2\) given a logit model (see rcompanion or pscl). The calculation by hand is also fairly simple. We define the function logisticPseudoR2s() that takes a logit model as an input and returns three popular pseudo \(R^2\) values.
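The three measures typically reported are the following, where \(LL_{model}\) and \(LL_{null}\) denote the log-likelihoods of the fitted model and of an intercept-only model, and \(n\) is the number of observations (Hosmer & Lemeshow, Cox & Snell, and Nagelkerke, respectively):
\[ R^2_{HL} = \frac{-2LL_{null} - (-2LL_{model})}{-2LL_{null}}, \quad R^2_{CS} = 1 - e^{-\frac{2}{n}\left(LL_{model} - LL_{null}\right)}, \quad R^2_{N} = \frac{R^2_{CS}}{1 - e^{\frac{2}{n}LL_{null}}} \]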
The coefficients of the model give the change in the log odds of the dependent variable due to a unit change in the regressor. This makes the exact interpretation of the coefficients difficult, but we can still interpret the signs and the p-values which will tell us if a variable has a significant positive or negative impact on the probability of the dependent variable being \(1\) . In order to get the odds ratios we can simply take the exponent of the coefficients.
Notice that the coefficient is extremely large. That is (partly) due to the fact that the danceability variable is constrained to values between \(0\) and \(1\) and the coefficients are for a unit change. We can make the “unit-change” interpretation more meaningful by multiplying the danceability index by \(100\) . This linear transformation does not affect the model fit or the p-values.
We observe that danceability positively affects the likelihood of becoming a top-10 hit. To get the confidence intervals for the coefficients we can use the same function as with OLS.
In order to get a rough idea about the magnitude of the effects we can calculate the partial effects at the mean of the data (that is the effect for the average observation). Alternatively, we can calculate the mean of the effects (that is the average of the individual effects). Both can be done with the logitmfx(...) function from the mfx package. If we set logitmfx(logit_model, data = my_data, atmean = FALSE) we calculate the latter. Setting atmean = TRUE will calculate the former. However, in general we are most interested in the sign and significance of the coefficient.
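Both variants are based on the same expression for the marginal (partial) effect of a predictor \(x_k\) in a logit model, evaluated either at the sample means (atmean = TRUE) or averaged over all observations (atmean = FALSE):
\[ \frac{\partial P(y_i = 1)}{\partial x_{k,i}} = \beta_k \, P(y_i = 1)\left(1 - P(y_i = 1)\right) \]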
This now gives the average partial effects in percentage points. An additional point on the danceability scale (from \(1\) to \(100\)), on average, makes it \(1.57\) percentage points more likely for a song to become a top-10 hit.
To get the effect of an additional point at a specific value, we can calculate the odds ratio by predicting the probability at that value and at the value \(+1\). For example, if we are interested in how much more likely a song with a danceability of 51 is to become a hit compared to one with 50, we can simply calculate the following:
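That is, we compute the ratio of the odds at the two danceability levels:
\[ \text{Odds ratio} = \frac{P(\text{hit} \mid \text{danceability}=51) \big/ \left(1 - P(\text{hit} \mid \text{danceability}=51)\right)}{P(\text{hit} \mid \text{danceability}=50) \big/ \left(1 - P(\text{hit} \mid \text{danceability}=50)\right)} \]
For a logit model with a single linear danceability term, this ratio equals \(e^{\hat{\beta}_{danceability}}\) for any one-unit step.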
So the odds are 20% higher at 51 than at 50.
Of course we can also use multiple predictors in logistic regression, as shown in the formula above. We might want to add Spotify followers (in millions) and weeks since the release of the song.
Again, the familiar formula interface can be used with the glm() function. All the model summaries shown above still work with multiple predictors.
The question remains whether a variable should be added to the model. We will present two methods for model selection for logistic regression. The first is based on the Akaike Information Criterion (AIC). It is reported with the summary output for logit models. The value of the AIC is relative, meaning that it has no interpretation by itself. However, it can be used to compare and select models. The model with the lowest AIC value is the one that should be chosen. Note that the AIC does not indicate how well the model fits the data, but is merely used to compare models.
For example, consider the following model, where we exclude the followers covariate. Since that variable contributed significantly to the explanatory power of the model, the AIC increases when it is excluded, indicating that the model including followers is better suited to explain the data. We always want the lowest possible AIC.
As a second measure for variable selection, you can use the pseudo \(R^2\)s as shown above. The fit is distinctly worse according to all three values presented here when excluding the Spotify followers.
We can predict the probability given an observation using the predict(my_logit, newdata = ..., type = "response") function. Replace ... with the observed values for which you would like to predict the outcome variable.
The prediction indicates that a song with a danceability of \(50\) from an artist with \(10\) million Spotify followers has a \(66\%\) chance of being in the top 10 one week after its release.
Perfect prediction occurs whenever a linear function of \(X\) can perfectly separate the \(1\)s from the \(0\)s in the dependent variable. This is problematic when estimating a logit model as it will result in biased estimates (also check the p-values in the example!). R will return the following message if this occurs:
glm.fit: fitted probabilities numerically 0 or 1 occurred
Given this error, one should not use the output of the glm(...) function for the analysis. There are various ways to deal with this problem, one of which is to use Firth’s bias-reduced penalized-likelihood logistic regression with the logistf(Y~X) function in the logistf package.
In this example data \(Y = 0\) if \(x_1 <0\) and \(Y=1\) if \(x_1>0\) and we thus have perfect prediction. As we can see the output of the regular logit model is not interpretable. The standard errors are huge compared to the coefficients and thus the p-values are \(1\) despite \(x_1\) being a predictor of \(Y\) . Thus, we turn to the penalized-likelihood version. This model correctly indicates that \(x_1\) is in fact a predictor for \(Y\) as the coefficient is significant.
Linear regression is one of the most powerful and most fundamental concepts for getting started in Marketing Analytics. If you are looking to start learning Machine Learning to support your marketing education, then linear regression is the topic to begin with.
As a marketer, you know that Machine Learning and Data Science have a significant impact on decision making in Marketing. Marketing Analytics provides conclusive reasoning for many decisions that, for years and years, have run on the golden gut of marketers.
Learning regression for Marketing Analytics gives you the ability to predict various marketing variables which may or may not have any visible pattern to them. In this discussion, I will give you an introduction to what linear regression is and how it can transform your marketing and sales analytics.
Therefore, if you are willing to get started with Machine Learning or Marketing Analytics, linear regression is the place to begin. I assure you that after this discussion, linear regression as a concept will be crystal clear to you.
You may already have the understanding of the fact that machine learning algorithms can be broadly divided into two categories: Supervised and Unsupervised Learning .
In Supervised Learning, the dataset that you work with contains past observed values of the variable that you are looking to predict.
For example, you could be required to create a model for predicting the Sales of a product based on the Advertising Expenditure and Sales Expenditure for a given quarter.
You would begin by asking the Sales Manager for the data for the previous quarters, expecting that he would share a well laid out Excel sheet with Quarter, Advertising Expenditure, Sales Expenditure, and the Sales for that quarter in different columns.
With such a dataset in your possession, the job of the predictive algorithm that you create will be to find the relationship between these variables. The relationship should be general enough so that when you enter the advertising and sales expenditures for the coming quarter, it can give the predicted sales for that quarter.
Unsupervised Learning is when this observed variable is not made available to you. In that case you will find yourself solving a different kind of marketing problem altogether and not that of prediction of a variable. Unsupervised Learning is not a part of this discussion.
Supervised Learning has two sub-categories of problem: Regression and Classification .
If you are pressed for time, you can go ahead and watch my video first in which I have explained all the concepts that I have shared below.
I shared a common marketing use case above. In that example, you had to predict the sales for the quarter using two different kinds of expenditure variables. Now, we know that the value of sales can be any number - arguably a positive one. The sales could, therefore, range from anywhere between 0 to some really high number.
Such a variable is a continuous variable which can take any value in a very wide range. And this, in fact, is the simplest way to understand what regression is.
A prediction problem in which the variable to be predicted is a continuous variable is a Regression problem.
Let’s look at an entirely different marketing use case to understand what is not a regression problem.
You have been recruited in the marketing team of a large private bank (a common placement for many business students).
You are given the data of the bank’s list of prospects of the last year with details like Age, Job, Marital Status, No. of Children, Previous Loans, Previous Defaults etc. Along with that you are also provided with the information whether the person took the loan from the bank or not (0 for did not take the loan and 1 for did take the loan).
Your job as a young analytical marketer is to predict whether a prospect who comes in in the future will take a loan from your bank or not.
Now, please note that in such a prediction problem your task is simply to classify the prospects based on your algorithm's understanding of whether the prospect will buy or not. This means that the possible values for the outcome are discrete (0 or 1) and not continuous.
A prediction problem in which the variable to be predicted is a discrete variable is a Classification problem.
There are a variety of problems across industries that are prediction problems. My objective of this discussion is to equip you with the intuition and some hands-on coding of Linear Regression so that you can appreciate the use cases irrespective of the industry.
Before I dive straight into what Linear Regression is, let me help you in forming an understanding of the vocabulary used when explaining regression. I will link it to the use cases mentioned above. So just reading through it will give you a complete understanding of what is what.
Target Variable: The variable to be predicted is called the Target Variable. When you had to predict the Sales for the quarter using the Advertising Expenditure and Sales Expenditure, the Sales is the target variable.
Naturally, the target variable can also be referred to as the Dependent Variable, as its value depends on the other variables in the system. In our marketing use case, the Sales obviously depends on how much you have spent on Advertisements and on Sales Promotions.
The target variable is commonly denoted as y.
Feature Variable: All the other variables that are used to predict the target variable are called the Feature variables.
The feature variables can also be called Independent Variables . In the examples, the Advertising and Sales Promotion expenditures are the independent variables. Also, imagine that there is a different machine learning problem altogether of image recognition. In such a problem, each of the pixels is a feature variable.
The feature variable is commonly denoted as x. Multiple feature variables get denoted as x0, x1, x2,...., xn.
Finally, keep in mind that these two kinds of variables go by several names: the target variable is also called the dependent, response, or outcome variable, and the feature variables are also called independent, predictor, or explanatory variables.
There are many Regression Models out of which the most basic regression model is the Linear Regression . For Nonlinear Regression, there are different models like Generalized Additive Models (GAMS) and tree-based models which can be used for regression.
Since you are starting off with Marketing Analytics, my objective in this discussion is to take you only through Linear Regression for Marketing Analytics and develop your understanding with that as a base.
Note: What I am going to share with you in the remaining part of the article tends to tune out a lot of people who are frightened by anything that even looks like Math. You would see some mathematical equations with unique notations, and some formulas as well.
None of it is a mathematical concept which you would not have studied in school. If you just manage to sit through it, you will realize that it is nothing but plain English written in a jazzed up manner with equations, which by the way are equally important.
For you as an Analytical Marketer, the intuition is important so please just focus on that.
A linear regression model for a use case which has just one independent/feature variable would look like:
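ŷ = β0 + β1x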
When you use more than one feature variable in your model then the linear regression model will look like:
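ŷ = β0 + β1x1 + β2x2 + … + βnxn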
Let me quickly decipher what this equation means.
The symbols that you see, β0, β1, β2, are called Model Parameters. These are constants which determine the predicted value of the target variable. They can also be referred to as Model Coefficients.
Specifically, β0 is referred to as the Bias Parameter . You will notice that this is not multiplied with any variable/feature in the model and is a standalone parameter that adjusts the model.
Notice carefully that I referred to these Model Parameters (β) as Constants and the features (x) as Variables. This difference needs to be understood and proper usage of the terms makes a lot of difference in understanding the topic.
As I had mentioned above, y is the target variable, which is the variable we are trying to predict. Now, while y represents the actual value of the variable to be predicted, ŷ represents the predicted value of the variable.
Since, there is always some error in the prediction that is why the predicted value is represented with a different notation from the actual variable.
This is a simple concept straight from your class 10th textbook. If you look at the equation again, you will see that each of the independent variables (x) appears with the power of 1 (degree 1). This means that the variables are not raised to a higher power (i.e. x², x³, ...).
Such a model will always be represented by a straight line when plotted on a graph, as I would show in the later part of this discussion.
I briefly touched upon a use case above in which you were to predict the Sales of a quarter based on the Advertisement Expenditure and the Sales Expenditure. Since there are two features in this model, the structure of this model will be like the equation below:
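Sales = β0 + (β1 × Advertisement Expenditure) + (β2 × Sales Expenditure)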
However, for simplicity, let us assume that the Sales Manager could only provide the data for the Advertisement Expenditure, and therefore there will only be a single feature in our model. In this situation, this is how our model equation is going to look:
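Sales = β0 + (β1 × Advertisement Expenditure)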
Here is the exact data that you received from the sales manager for you to work on.
This data shows that for Quarter 1, when the Advertisement Expenditure was 24,000, the sales were 724,000. I'm ignoring the units of currency for the time being. It could be Indian Rupees (INR), United States Dollars (USD), or anything else.
Now, I went ahead and plotted both of these variables on a scatter plot with the Advertisement Expenditure on the x-axis and the Sales on the y-axis.
In this scatter plot, each dot represents one quarter given in the table. For that particular quarter, we will be able to determine the Advertisement Expenditure and the resulting Sales from the x and y axis, respectively.
You would remember that the objective for this exercise is to be able to predict the sales of the future quarters based on the features that we have. And in this case, we have just one feature variable, i.e. the Ad_Exp .
In order to know where the next dot will lie on the scatter plot you need to find the equation of a straight line which passes through these points hence representing a trend.
Now, through Python I have drawn three lines which pass through these points. Each of these three lines is represented by a different equation. Just by looking at the three, you can say that the one in the middle seems to be passing through the points just 'perfectly'. How do we determine whether this line passes through the points perfectly or not?
Now obviously, you don’t need to make these three lines on your scatter plot every time you do linear regression. It is just for me to explain to you the intuition behind how we choose the best fitting linear line.
The best trendline which passes through the scatter plots is the one which minimizes the difference between the actual value and the predicted value across all the points.
If you zoom in on one of the points, you will see exactly what this difference between the actual and the predicted value is. A metric that is used to capture the error of the entire model across all the points is called the Residual Sum of Squares (RSS), which will be discussed in my next article on errors.
But to explain briefly, each of these distances (of each of the points) from the best fitting line is squared and added. What we finally get is the Residual Sum of Squares.
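In equation form: RSS = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)², where yi is the actual value and ŷi is the predicted value for the i-th point.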
As we had already understood from our intuition, out of the three lines that I had plotted, the one at the center seems to be the one with the least difference across all the points. And if we run the curve fitting in Python, it indeed turns out to be the best fitting line for the scatter plot.
For this part of the discussion, my purpose was to just give you the intuition of Linear Regression for Marketing Analytics.
And with this you should be able to understand what is the objective of a linear regression problem. From what you have seen above, you can simply say that the objective of a linear regression problem is to determine the regression model parameters (β0, β1, β..) that minimize the error of the model.
Notice again that this is a linear model, i.e. a straight line, and it is not at all necessary that your trendline should be straight.
Non-linear regression is something that I will discuss later in the series once I have helped you develop an understanding for regression.
This is the section where you will learn how to perform the regression in Python, continuing with the same data that the Sales Manager had shared with you.
Sales is the target variable that needs to be predicted. Now, based on this data, your objective is to create a predictive model (just like the equation above), an equation in which you can plug in the Ad_exp value for a future quarter and predict the Sales for that quarter.
Let us straightaway get down to some hands-on coding to get this prediction done. Please do not feel left out if you do not have experience with Python. You will not require any pre-requisite knowledge. In fact the best way to learn is to get your hands dirty by solving a problem - like the one we are doing.
The first step is to fire up your Jupyter notebook and load all the prerequisite libraries in your Jupyter notebook. Here are the important libraries that we will be needing for this linear regression.
In order to load these, just start with these few lines of codes in your first cell:
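Based on the libraries used later in this article (pandas for the DataFrame, NumPy for the curve fitting, and Matplotlib for the plots), a first cell along these lines will do:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline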
The last line of code helps in displaying all the graphs that we will be making within the Jupyter notebook.
Let me now import my data into a DataFrame. A DataFrame is a data type in Python. The simplest way to understand it would be that it stores all your data as a table. And it will be on this table where we will perform all of our Python operations.
Now, I am saving my table (which you saw above) in a variable called ‘data’. Further, after the equal to ‘=’ sign, I have used a command pd.read_csv.
This ensures that the .csv file which I have on my laptop at the file location mentioned in the path, gets loaded onto my Jupyter notebook. Please note that you will need to enter the path of the location where the .csv is stored in your laptop.
By running just the variable name ‘data’, as I have done in the second line of code, you will see the entire table loaded as a DataFrame.
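Putting those two lines together (with a placeholder file path that you should replace with your own), the cell might look like this:

data = pd.read_csv("path/to/your_file.csv")   # replace with the location of your .csv file
data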
You already know that the Ad_exp is the feature variable, or the independent variable. Based on this variable, the target variable, i.e. the Sales, needs to be predicted.
Therefore, just like a classic mathematical equation, let me store the Ad_exp values in a variable x and the Sales values in a variable y. This notation also makes sense because in a mathematical equation y is the output variable and x is the input variable. Same is the case here.
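Assuming the columns are named Ad_exp and Sales (as in the text above), this step plus the plot could look like this:

x = data['Ad_exp']
y = data['Sales']
plt.scatter(x, y)   # scatter plot of advertising expenditure against sales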
The last line of code will display a scatter-plot on your Jupyter notebook which will look like this:
Please note, this is the same plot that you saw above in the intuition section.
Let me tell you that till now you have not done any machine learning. This was just some basic level data cleaning/data preparation.
The glamorous Machine Learning part of the code starts here and also ends with this one line of code.
From the NumPy library that you imported, you will now be using the polyfit() method to find the coefficients of the straight line that best fits the data.
You already know from your school level math that the equation of a straight line is given by:
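y = mx + c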
Here, the m is the slope of the line and c is the y-intercept. This trendline that we are trying to find here is no different. It follows the same equation and with this code we will be able to find the m and c values for it.
This method needs three parameters: the previously defined input and output variables (x, y), plus an integer: 1. This last number defines the degree of the polynomial you want to fit.
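With the variables defined earlier, the call (storing the result in a variable called model, as referenced further below) would be:

model = np.polyfit(x, y, 1)   # fit a degree-1 polynomial, i.e. a straight line
print(model)                  # array with the slope first and the intercept second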
You would have understood that if you changed that number from 1 to 2, 3, 4 and so on, it would become a higher degree of regression also referred to as Polynomial Regression.
That is also something that I will be discussing with you in the coming weeks.
But, as soon as you run this code, you see an output which is an array of two numbers. These two numbers are nothing but the values of m and c from the equation of a straight line.
Therefore, we now know that the best trendline that describes our data is:
y = 633.9931736 + (4.68585196 * x)
If you realize, we are actually done with our prediction problem. With this equation given above, you can just plug in the value of x, which you should remember is Advertising Expenditure, and you will get the value of y, i.e. the Sales that you are likely to make in that quarter.
But, since we are already doing some interesting stuff in Python here, why should we have to find the value of Sales manually? Let's make this better in our last and final step.
Instead of doing the calculations manually with that equation, you can use another method from the NumPy library that we imported. The method is called poly1d().
Please follow the code given below.
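A minimal version of that code, reusing the model coefficients from above, would be:

predict = np.poly1d(model)   # turn the coefficient array into a callable polynomial
predict(51)                  # predicted sales for an advertising expenditure of 51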
We had stored the values of our equation coefficients in 'model'. I created a variable predict which carries all of this model data and can also predict values, courtesy of the NumPy method poly1d().
Now, when I entered an Advertising Expenditure of 51, the predicted sales came out as 872.971.
By executing these simple lines of code, you have successfully taken the first step towards learning Marketing Analytics. This is big!
Let me tell you that Linear Regression is a fundamental concept in Marketing Analytics and in Data Science in general. Therefore, you should definitely spend all that time that you need to understand it really well.
Things get interesting from here. I have not yet spoken about how to measure the accuracy of your system. I have also not mentioned how to perform this regression if there were more than one feature or independent variable. That is called Multiple Regression .
Gradually as we proceed in this journey, I will take you through all of these concepts and also through higher order regression i.e. the polynomial regression.
Having covered the most fundamental concept in machine learning, you are now ready to implement it on some of your datasets.
Whatever you learned in this discussion is more than sufficient for you to pick a simple dataset from your work and go ahead to create a linear regression model on it.
If you are not able to find a dataset for practice, rest assured: you can download a practice dataset for Linear Regression. This is a toy dataset that I have created for your practice so that you can gain the necessary confidence.
Further, if you want to speed up the process of learning Marketing Analytics you can consider taking up this Data Scientist with Python career track on DataCamp. In order to help you get started with the career track, I have crafted a study plan for you so that you can sail through the course with ease.
Let’s learn some more Marketing Analytics in our next discussion.
Regression analysis is perhaps one of the most widely used statistical methods for investigating or estimating the relationship between a set of independent and dependent variables. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.
Regression analysis is also used as a blanket term for various data analysis techniques utilized in quantitative research for modeling and analyzing numerous variables. In the regression method, the independent variable is a predictor or an explanatory element, and the dependent variable is the outcome or a response to a specific query.
Regression analysis is often used to model or analyze data. Most survey analysts use it to understand the relationship between the variables, which can be further utilized to predict the precise outcome.
For example, suppose a soft drink company wants to expand its manufacturing unit to a newer location. Before moving forward, the company wants to analyze its revenue generation model and the various factors that might impact it. Hence, the company conducts an online survey with a specific questionnaire.
After using regression analysis, it becomes easier for the company to analyze the survey results and understand the relationship between different variables like electricity and revenue – here, revenue is the dependent variable.
In addition, understanding the relationship between different independent variables like pricing, number of workers, and logistics with the revenue helps the company estimate the impact of varied factors on sales and profits.
Survey researchers often use this technique to examine and find a correlation between different variables of interest. It provides an opportunity to gauge the influence of different independent variables on a dependent variable.
Overall, regression analysis saves the survey researchers’ additional efforts in arranging several independent variables in tables and testing or calculating their effect on a dependent variable. Different types of analytical research methods are widely used to evaluate new business ideas and make informed decisions.
Researchers usually start by learning linear and logistic regression first. Due to the widespread knowledge of these two methods and ease of application, many analysts think there are only two types of models. Each model has its own specialty and ability to perform if specific conditions are met.
This blog explains seven commonly used types of regression analysis methods that can be used to interpret data in various formats.
Linear regression is one of the most widely known modeling techniques, as it is among the first regression analysis methods people pick up when learning predictive modeling. Here, the dependent variable is continuous, and the independent variable is usually continuous or discrete, with a linear regression line.
Please note that multiple linear regression has more than one independent variable, unlike simple linear regression, which has only one. Thus, linear regression is best used only when there is a linear relationship between the independent and the dependent variable.
A business can use linear regression to measure the effectiveness of the marketing campaigns, pricing, and promotions on sales of a product. Suppose a company selling sports equipment wants to understand if the funds they have invested in the marketing and branding of their products have given them substantial returns or not.
Linear regression is the best statistical method to interpret the results. The best thing about linear regression is it also helps in analyzing the obscure impact of each marketing and branding activity, yet controlling the constituent’s potential to regulate the sales.
If the company is running two or more advertising campaigns simultaneously, one on television and two on radio, then linear regression can easily analyze the independent and combined influence of running these advertisements together.
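As a rough illustration of such an analysis in Python (with entirely hypothetical spending and sales figures), one could fit a model like this:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical campaign data: TV spend, radio spend, and resulting sales
campaign_data = pd.DataFrame({
    "tv_spend":    [10, 15, 20, 25, 30, 35],
    "radio_spend": [5, 8, 6, 10, 12, 9],
    "sales":       [50, 62, 68, 85, 96, 100],
})

# linear regression of sales on both advertising channels
ols_model = smf.ols("sales ~ tv_spend + radio_spend", data=campaign_data).fit()
print(ols_model.summary())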
Logistic regression is commonly used to determine the probability of event success and event failure. It is used whenever the dependent variable is binary, like 0/1, True/False, or Yes/No. Thus, logistic regression is typically used to analyze close-ended survey questions with categorical (e.g. yes/no) responses.
Please note that, unlike linear regression, logistic regression does not need a linear relationship between the dependent and the independent variables. Logistic regression applies a non-linear log transformation to predict the odds ratio; therefore, it easily handles various types of relationships between a dependent and an independent variable.
Logistic regression is widely used to analyze categorical data, particularly for binary response data in business data modeling. More often, logistic regression is used when the dependent variable is categorical, like to predict whether the health claim made by a person is real(1) or fraudulent, to understand if the tumor is malignant(1) or not.
Businesses use logistic regression to predict whether the consumers in a particular demographic will purchase their product or will buy from the competitors based on age, income, gender, race, state of residence, previous purchase, etc.
Polynomial regression is commonly used to analyze curvilinear data when an independent variable's power is more than 1. In this regression analysis method, the best-fit line is never a 'straight line' but always a 'curved line' that fits the data points.
Please note that polynomial regression is better to use when two or more variables have exponents and a few do not.
Additionally, it can model non-linearly separable data offering the liberty to choose the exact exponent for each variable, and that too with full control over the modeling features available.
When combined with response surface analysis, polynomial regression is considered one of the sophisticated statistical methods commonly used in multisource feedback research. Polynomial regression is used mostly in finance and insurance-related industries where the relationship between dependent and independent variables is curvilinear.
Suppose a person wants to budget expense planning by determining how long it would take to earn a definitive sum. Polynomial regression, by taking into account his/her income and predicting expenses, can easily determine the precise time he/she needs to work to earn that specific sum amount.
Stepwise regression is a semi-automated process in which a statistical model is built by adding or removing independent variables based on the t-statistics of their estimated coefficients.
If used properly, stepwise regression will give you more insight into your data than many other methods. It works well when you are working with a large number of independent variables, and it fine-tunes the model by iteratively adding or removing candidate variables.
Stepwise regression analysis is recommended to be used when there are multiple independent variables, wherein the selection of independent variables is done automatically without human intervention.
Please note, in stepwise regression modeling, the variable is added or subtracted from the set of explanatory variables. The set of added or removed variables is chosen depending on the test statistics of the estimated coefficient.
Suppose you have a set of independent variables like age, weight, body surface area, duration of hypertension, basal pulse, and stress index based on which you want to analyze its impact on the blood pressure.
In stepwise regression, the best subset of the independent variable is automatically chosen; it either starts by choosing no variable to proceed further (as it adds one variable at a time) or starts with all variables in the model and proceeds backward (removes one variable at a time).
Thus, using regression analysis, you can calculate the impact of each or a group of variables on blood pressure.
Ridge regression is based on an ordinary least square method which is used to analyze multicollinearity data (data where independent variables are highly correlated). Collinearity can be explained as a near-linear relationship between variables.
Whenever there is multicollinearity, the least-squares estimates remain unbiased, but their variances are large, so they may be far away from the true value. Ridge regression reduces these standard errors by adding some degree of bias to the regression estimates, with the aim of providing more reliable estimates.
Please note that the assumptions of ridge regression are similar to those of least-squares regression, except that normality is not assumed. Although the coefficient values are shrunk in ridge regression, they never reach exactly zero, which means ridge regression cannot perform variable selection.
Suppose you are crazy about two guitarists performing live at an event near you, and you go to watch their performance with a motive to find out who is a better guitarist. But when the performance starts, you notice that both are playing black-and-blue notes at the same time.
Is it possible to find out the best guitarist having the biggest impact on sound among them when they are both playing loud and fast? As both of them are playing different notes, it is substantially difficult to differentiate them, making it the best case of multicollinearity, which tends to increase the standard errors of the coefficients.
Ridge regression addresses multicollinearity in cases like these and includes bias or a shrinkage estimation to derive results.
Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression; however, it penalizes the absolute values of the coefficients (an L1 penalty) instead of the squared values used in ridge regression (an L2 penalty).
It was developed in the 1990s as an alternative to the traditional least-squares estimate, with the intention of reducing the problems related to overfitting when the data has a large number of independent variables.
Lasso has the capability to perform both – selecting variables and regularizing them along with a soft threshold. Applying lasso regression makes it easier to derive a subset of predictors from minimizing prediction errors while analyzing a quantitative response.
Please note that regression coefficients that reach zero after shrinkage are excluded from the lasso model. By contrast, coefficients that remain non-zero are strongly associated with the response variable; the explanatory variables can be quantitative, categorical, or both.
Suppose an automobile company wants to perform a research analysis on average fuel consumption by cars in the US. For samples, they chose 32 models of car and 10 features of automobile design – Number of cylinders, Displacement, Gross horsepower, Rear axle ratio, Weight, ¼ mile time, v/s engine, transmission, number of gears, and number of carburetors.
As you can see, the response variable mpg (miles per gallon) is strongly correlated with some variables like weight, displacement, number of cylinders, and horsepower. The problem can be analyzed by using the glmnet package in R and lasso regression for feature selection.
It is a mixture of ridge and lasso regression models trained with L1 and L2 norms. The elastic net brings about a grouping effect wherein strongly correlated predictors tend to be in/out of the model together. Using the elastic net regression model is recommended when the number of predictors is far greater than the number of observations.
Please note that the elastic net regression model came into existence as an alternative to the lasso regression model, since lasso's variable selection was too dependent on the data, making it unstable. By using elastic net regression, statisticians became capable of bridging the penalties of ridge and lasso regression to get the best out of both models.
A clinical research team having access to a microarray data set on leukemia (LEU) was interested in constructing a diagnostic rule based on the expression level of presented gene samples for predicting the type of leukemia. The data set they had, consisted of a large number of genes and a few samples.
Apart from that, they were given a specific set of samples to be used as training samples, out of which some were infected with type 1 leukemia (acute lymphoblastic leukemia) and some with type 2 leukemia (acute myeloid leukemia).
Model fitting and tuning parameter selection by tenfold CV were carried out on the training data. Then they compared the performance of those methods by computing their prediction mean-squared error on the test data to get the necessary results.
A market research survey focuses on three major metrics: Customer Satisfaction, Customer Loyalty, and Customer Advocacy. Remember, although these metrics tell us about customer health and intentions, they fail to tell us ways of improving the position. Therefore, an in-depth survey questionnaire intended to ask consumers the reason behind their dissatisfaction is definitely a way to gain practical insights.
However, it has been found that people often struggle to put forth their motivation or demotivation or describe their satisfaction or dissatisfaction. In addition to that, people always give undue importance to some rational factors, such as price, packaging, etc. Overall, it acts as a predictive analytic and forecasting tool in market research.
When used as a forecasting tool, regression analysis can determine an organization’s sales figures by taking into account external market data. A multinational company conducts a market research survey to understand the impact of various factors such as GDP (Gross Domestic Product), CPI (Consumer Price Index), and other similar factors on its revenue generation model.
Obviously, regression analysis in consideration of forecasted marketing indicators was used to predict a tentative revenue that will be generated in future quarters and even in future years. However, the further you go into the future, the more unreliable the data becomes, leaving a wider margin of error.
A water purifier company wanted to understand the factors leading to brand favorability. The survey was the best medium for reaching out to existing and prospective customers. A large-scale consumer survey was planned, and a discreet questionnaire was prepared using the best survey tool .
A number of questions related to the brand, favorability, satisfaction, and probable dissatisfaction were effectively asked in the survey. After getting optimum responses to the survey, regression analysis was used to narrow down the top ten factors responsible for driving brand favorability.
All ten derived attributes, in one way or another, highlighted their importance in impacting the favorability of that specific water purifier brand.
It is easy to run a regression analysis using Excel or SPSS, but while doing so, the importance of four numbers in interpreting the data must be understood.
In a few cases, the simple coefficient is replaced by a standardized coefficient demonstrating the contribution from each independent variable to move or bring about a change in the dependent variable.
Do you know utilizing regression analysis to understand the outcome of a business survey is like having the power to unveil future opportunities and risks?
For example, businesses can use past data on a particular television advertisement slot to predict the sales it is likely to generate and, from that, estimate a maximum bid for that slot. The finance and insurance industry as a whole depends a lot on regression analysis of survey data to identify trends and opportunities for more accurate planning and decision-making.
Do you know businesses use regression analysis to optimize their business processes?
For example, before launching a new product line, businesses conduct consumer surveys to better understand the impact of various factors on the product’s production, packaging, distribution, and consumption.
A data-driven foresight helps eliminate the guesswork, hypothesis, and internal politics from decision-making. A deeper understanding of the areas impacting operational efficiencies and revenues leads to better business optimization.
Business surveys today generate a lot of data related to finance, revenue, operation, purchases, etc., and business owners are heavily dependent on various data analysis models to make informed business decisions.
For example, regression analysis helps enterprises to make informed strategic workforce decisions. Conducting and interpreting the outcome of employee surveys like Employee Engagement Surveys, Employee Satisfaction Surveys, Employer Improvement Surveys, Employee Exit Surveys, etc., boosts the understanding of the relationship between employees and the enterprise.
It also helps get a fair idea of certain issues impacting the organization’s working culture, working environment, and productivity. Furthermore, intelligent business-oriented interpretations reduce the huge pile of raw data into actionable information to make a more informed decision.
By knowing how to use regression analysis for interpreting survey results, one can easily provide factual support to management for making informed decisions. But did you know that it also helps in keeping faults out of judgment?
For example, a mall manager thinks that if he extends the closing time of the mall, it will result in more sales. Regression analysis may contradict this belief by showing that the predicted increase in revenue from additional sales would not cover the increased operating expenses arising from longer working hours.
Regression analysis is a useful statistical method for modeling and comprehending the relationships between variables. It provides numerous advantages to various data types and interactions. Researchers and analysts may gain useful insights into the factors influencing a dependent variable and use the results to make informed decisions.
With QuestionPro Research, you can improve the efficiency and accuracy of regression analysis by streamlining the data gathering, analysis, and reporting processes. The platform’s user-friendly interface and wide range of features make it a valuable tool for researchers and analysts conducting regression analysis as part of their research projects.
Sign up for the free trial today and let your research dreams fly!
A Concise Guide to Market Research: The Process, Data, and Methods Using IBM SPSS Statistics, by Marko Sarstedt and Erik Mooi (Springer Texts in Business and Economics).
This book offers an easily accessible and comprehensive guide to the entire market research process, from asking market research questions to collecting and analyzing data by means of quantitative methods. It is intended for all readers who wish to know more about the market research process, data management, and the most commonly used methods in market research. The book helps readers perform analyses, interpret the results, and make sound statistical decisions using IBM SPSS Statistics. Hypothesis tests, ANOVA, regression analysis, principal component analysis, factor analysis, and cluster analysis, as well as essential descriptive statistics, are covered in detail. Highly engaging and hands-on, the book includes many practical examples, tips, and suggestions that help readers apply and interpret the data analysis methods discussed.
The new edition uses IBM SPSS version 25.
Marko Sarstedt is chaired professor of Marketing at the Otto-von-Guericke-University Magdeburg (Germany). His main research is in the application and advancement of structural equation modeling methods to further the understanding of consumer behavior and to improve marketing decision-making. His research has been published in journals such as Journal of Marketing Research, Journal of the Academy of Marketing Science, Organizational Research Methods, MIS Quarterly, and International Journal of Research in Marketing. Marko has co-edited several special issues of leading journals and co-authored several widely adopted textbooks, including “A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM)” (together with Joe F. Hair, G. Tomas M. Hult, and Christian M. Ringle).
Erik Mooi is senior lecturer at the University of Melbourne (Australia). His main interest is in business-to-business marketing, and he works on topics such as outsourcing, inter-firm contracting, innovation, technology licensing, and franchising using advanced econometrics. His research has been published in journals such as the Journal of Marketing, the Journal of Marketing Research, the International Journal of Research in Marketing, and the Journal of Business Research. He is also program director at the Centre for Workplace Leadership, a fellow at the EU Centre for Shared Complex Challenges, as well as a fellow at the Centre for Business Analytics at Melbourne Business School.
Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.
The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).
Regression model, basically, specifies the relation of dependent variable (Y) to a function combination of independent variables (X) and unknown parameters (β)
Y ≈ f (X, β)
A regression equation can be used to predict the values of 'y' if the value of 'x' is given, where 'y' and 'x' are two sets of measures for a sample of size 'n'. The formulae for the regression equation are:
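y = a + bx, where
b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
a = (Σy − b Σx) / n
These are the standard least-squares estimates for a simple regression with one independent variable.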
Do not be intimidated by the visual complexity of the correlation and regression formulae above. You do not have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS, and others.
Linear regression analysis is based on the following set of assumptions:
1. Assumption of linearity . There is a linear relationship between dependent and independent variables.
2. Assumption of homoscedasticity. The variance of the residuals (errors) is constant across all levels of the independent variables.
3. Assumption of absence of collinearity or multicollinearity . There is no correlation between two or more independent variables.
4. Assumption of normal distribution. The residuals (errors) of the model are normally distributed.
My e-book, The Ultimate Guide to Writing a Dissertation in Business Studies: a step by step assistance offers practical assistance to complete a dissertation with minimum or no stress. The e-book covers all stages of writing a dissertation starting from the selection to the research area to submitting the completed version of the work within the deadline. John Dudovskiy
Regression analysis plays a vital role in contemporary market research, offering a powerful tool for making accurate forecasts and addressing intricate interdependencies within challenges and decisions. It enables us to predict user behavior and gain valuable insights for optimising business strategies. This article aims to elucidate the concept of regression analysis, delve into its working principles, and explore its applications in the field of market research.
Regression analysis serves as a statistical method and acts as a translator within the realm of market research, enabling the conversion of ambiguous or complex data into concise and understandable information.
By investigating the relationship between two or more variables, regression analysis sheds light on crucial interactions, such as the correlation between user behavior and screen time in smartphone applications.
Regression analysis serves multiple purposes.
Regression analysis traces its roots back to the late 19th century when it was pioneered by the renowned British statistician, Sir Francis Galton. Galton explored variables within human genetics and introduced the concept of regression.
By examining the relationship between parental height and the height of their offspring, Galton laid the foundation for linear regression analysis. Since then, this methodology has found extensive applications not only in market research but also in diverse fields such as psychology, sociology, medicine, and economics.
Precise market analyses with Appinio
Appinio leverages a variety of market research methods to get you the best results for your market research needs. Do you want to determine the potential of a new product or service before launching it onto the market? Then the TURF analysis can help.
Conjoint analysis, on the other hand, collects consumer feedback during the development phase to optimise an idea.
Contact Appinio now and together we will find the optimal approach to your challenge!
Regression analysis encompasses various regression models, each serving specific purposes depending on the research objectives and data availability.
Employing a combination of these techniques allows for in-depth insights into complex phenomena. Here are the key regression models:
Simple linear regression, the classic model, examines the relationship between a dependent variable and a single independent variable, revealing their association. For instance, it can explore how daily coffee consumption (independent variable) impacts daily energy levels (dependent variable).
Multiple linear regression expands upon simple linear regression by incorporating multiple independent variables, such as price, advertising, competition, or sales figures. In the context of energy levels, variables like sleep duration and exercise can be added alongside coffee consumption.
When the relationship between variables deviates from a straight line, non-linear regression comes into play. This is particularly useful for phenomena like exponential growth in app downloads or user numbers, where traditional linear models may not be suitable.
For complex correlations or patterns characterised by ups and downs, quadratic regression is utilised.
It fits data that follows non-linear trends, such as seasonal sales fluctuations. For instance, it can help determine market saturation points, where growth typically plateaus after an initial rapid expansion.
Hierarchical regression allows the researcher to control the order of variables in a model, enabling the assessment of each independent variable's contribution to predicting the dependent variable.
For example, in demographic-based analyses, variables like age, gender, or education can be entered first, with further predictors added in later steps, so that each step's contribution can be read off separately.
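In practice this usually means comparing nested models and looking at the change in R-squared from one block to the next. The sketch below uses simulated data and hypothetical variable names.

```python
# Hierarchical regression: predictors entered in blocks, incremental R-squared per block
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
age       = rng.uniform(18, 65, n)
education = rng.integers(1, 6, n)       # 1 = low ... 5 = high
adspend   = rng.uniform(0, 1000, n)
purchases = 5 + 0.05*age + 0.8*education + 0.01*adspend + rng.normal(0, 2, n)

# Block 1: demographics only
m1 = sm.OLS(purchases, sm.add_constant(np.column_stack([age, education]))).fit()
# Block 2: demographics plus advertising spend
m2 = sm.OLS(purchases, sm.add_constant(np.column_stack([age, education, adspend]))).fit()

print("R-squared, block 1:", round(m1.rsquared, 3))
print("R-squared, block 2:", round(m2.rsquared, 3))
print("incremental R-squared from adspend:", round(m2.rsquared - m1.rsquared, 3))
```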
Multinomial logistic regression examines the probabilities of outcomes that have more than two possible categories, making it valuable for complex questions.
For instance, a music app may predict users' favourite genres based on their previous preferences, listening habits, and other factors like age, gender, or listening time, enabling personalised recommendations.
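A small sketch of this idea, with a simulated data set in which a user's favourite genre depends loosely on age; the genre labels, cut-offs, and listening-time variable are all hypothetical.

```python
# Multinomial logistic regression: predicting favourite genre (0 = rock, 1 = pop, 2 = electronic)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
age = rng.uniform(15, 60, n)
listening_hours = rng.uniform(1, 30, n)

# Simulated labels: genre shifts with age (purely for illustration)
genre = np.digitize(age + rng.normal(0, 5, n), bins=[25, 40])

X = np.column_stack([age, listening_hours])
clf = LogisticRegression(max_iter=1000).fit(X, genre)   # handles three classes natively

# Predicted class probabilities for a 25-year-old who listens 20 hours a week
print(clf.predict_proba([[25, 20]]))
```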
When multiple dependent variables and their interactions with independent variables need to be explored, multivariate regression analysis is employed.
For instance, in the context of fitness data, it can assess how factors such as diet, sleep, or exercise intensity influence variables like weight and health status.
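Here is a sketch with two dependent variables (weight change and a health score) regressed on the same predictors at once; all names and numbers are hypothetical.

```python
# Multivariate regression: two outcome variables modelled from the same predictors
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 400
diet_quality  = rng.uniform(1, 10, n)    # 1 = poor ... 10 = excellent
sleep_hours   = rng.uniform(5, 9, n)
exercise_mins = rng.uniform(0, 90, n)
X = np.column_stack([diet_quality, sleep_hours, exercise_mins])

weight_change = -0.3*diet_quality - 0.05*exercise_mins + rng.normal(0, 1, n)
health_score  = 2 + 0.5*diet_quality + 0.4*sleep_hours + 0.03*exercise_mins + rng.normal(0, 1, n)
Y = np.column_stack([weight_change, health_score])

model = LinearRegression().fit(X, Y)
print(model.coef_)   # one row of coefficients per dependent variable
```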
Binary logistic regression comes into play when the dependent variable has only two possible outcomes, such as yes or no. It can be used to predict whether a specific product will be purchased by a target group, and factors like age, income, or gender can further segment the buyer groups.
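A minimal sketch of such a purchase model on simulated data; the coefficients used to generate the data, and the example customer, are made up.

```python
# Binary logistic regression: buy (1) versus not buy (0), simulated data
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
age    = rng.uniform(18, 70, n)
income = rng.uniform(20, 120, n)                     # thousands of euros
logit  = -4 + 0.02*age + 0.03*income                 # true log-odds used for simulation
buy    = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, income])
clf = LogisticRegression().fit(X, buy)

# Estimated purchase probability for a 35-year-old earning 60k
print(clf.predict_proba([[35, 60]])[:, 1])
```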
The versatility of regression analysis is reflected in its diverse applications within the field of market research. Here are selected examples of how regression analysis is utilised:
Suppose a company aims to determine the relationship between advertising spending and product sales, requiring a simple linear regression analysis. A typical analysis moves through a handful of steps: collect the data on budgets and sales, prepare and inspect it, fit the regression model, evaluate the fit (for example via R-squared), and use the estimated equation to predict sales for planned budgets, as the sketch below illustrates.
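The following Python sketch walks through those steps on simulated data; the budgets, sales figures, and effect size are invented for illustration.

```python
# Advertising spend versus sales: fit, inspect, and predict (simulated data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
adspend = rng.uniform(0, 1000, n)                    # advertising budget in euros
sales   = 50 + 0.1*adspend + rng.normal(0, 20, n)    # product sales in units

# Fit Sales = b0 + b1 * adspend by ordinary least squares
X = sm.add_constant(adspend)
fit = sm.OLS(sales, X).fit()
print(fit.summary())                                 # coefficients, R-squared, p-values

# Predict expected sales for a planned budget of 500 euros
b0, b1 = fit.params
print("predicted sales at a 500-euro budget:", round(b0 + b1 * 500, 1))
```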
Regression analysis stands as a powerful and versatile tool in the realm of market research. It offers a range of regression models, varying in complexity depending on the research question or objective at hand. Whether investigating the relationship between advertising spend and sales, analysing usage behavior, or identifying market trends, regression analysis provides data-driven insights that empower informed and sound decision-making.
Interested in running your own regression analysis?
Then register directly on our platform and get in touch with our experts.
Regression analysis is also a staple of customer satisfaction research: from overall satisfaction to satisfaction with product quality and price, it measures the strength of the relationship between different variables.
While correlation analysis provides a single numeric summary of a relation (“the correlation coefficient”), regression analysis results in a prediction equation, describing the relationship between the variables. If the relationship is strong – expressed by the Rsquare value – it can be used to predict values of one variable given the other variables have known values. For example, how will the overall satisfaction score change if satisfaction with product quality goes up from 6 to 7?
Regression analysis can be used in customer satisfaction and employee satisfaction studies to answer questions such as: “Which product dimensions contribute most to someone’s overall satisfaction or loyalty to the brand?” This is often referred to as Key Drivers Analysis.
It can also be used to simulate the outcome when actions are taken. For example: “What will happen to the satisfaction score when product availability is improved?”
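A sketch of such a key drivers analysis on simulated survey data; the drivers, rating scales, and coefficients are hypothetical.

```python
# Key drivers analysis: overall satisfaction regressed on quality, price, and availability
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 500
quality      = rng.integers(1, 11, n)    # 1-10 satisfaction ratings
price        = rng.integers(1, 11, n)
availability = rng.integers(1, 11, n)
overall = 1 + 0.5*quality + 0.2*price + 0.3*availability + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([quality, price, availability]))
fit = sm.OLS(overall, X).fit()
print(fit.params)    # the largest coefficient points to the strongest driver

# Simulation question: how much does overall satisfaction move if quality goes from 6 to 7?
b_quality = fit.params[1]
print("expected gain in overall satisfaction:", round(b_quality, 2))
```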