In simple linear regression, we assess the relationship between one dependent (regressand) and one independent (regressor) variable. The goal is to fit a line through the scatterplot of observations and find the line that best describes the data.
Suppose you are a marketing research analyst at a music label and your task is to suggest, on the basis of past data, a marketing plan for the next year that will maximize product sales. The data set that is available to you includes information on the sales of music downloads (thousands of units), advertising expenditures (in Euros), the number of radio plays an artist received per week (airplay), the number of previous releases of an artist (starpower), repertoire origin (country; 0 = local, 1 = international), and genre (1 = rock, 2 = pop, 3 = electronic). Let’s load and inspect the data first:
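A minimal sketch of how the data could be loaded (the file name is an assumption; the data frame name "regression" and the variable names follow the text):

```r
# load the data set (hypothetical file name) and inspect its structure
regression <- read.csv("music_sales_regression.csv")
head(regression)
str(regression)
```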
As stated above, regression analysis may be used to relate a quantitative response (“dependent variable”) to one or more predictor variables (“independent variables”). In a simple linear regression, we have one dependent and one independent variable.
Here are a few important questions that we might seek to address based on the data:
We may use linear regression to answer these questions. Let’s start with the first question and investigate the effect of advertising on sales.
A simple linear regression model only has one predictor and can be written as:
\[\begin{equation} Y=\beta_0+\beta_1X+\epsilon \tag{6.5} \end{equation}\]
In our specific context, let’s consider only the influence of advertising on sales for now:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\epsilon \tag{6.6} \end{equation}\]
The word “adspend” represents data on advertising expenditures that we have observed and \(\beta_1\) (the “slope”) represents the unknown relationship between advertising expenditures and sales. It tells you by how much sales will increase for an additional Euro spent on advertising. \(\beta_0\) (the “intercept”) is the number of sales we would expect if no money is spent on advertising. Together, \(\beta_0\) and \(\beta_1\) represent the model coefficients or parameters. The error term (\(\epsilon\)) captures everything that we miss by using our model, including (1) misspecification (the true relationship might not be linear), (2) omitted variables (other variables might drive sales), and (3) measurement error (our measurement of the variables might be imperfect).
Once we have used our training data to produce estimates for the model coefficients, we can predict future sales on the basis of a particular value of advertising expenditures by computing:
\[\begin{equation} \hat{Sales}=\hat{\beta_0}+\hat{\beta_1}*adspend \tag{6.7} \end{equation}\]
We use the hat symbol, ^ , to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response (sales). In practice, \(\beta_0\) and \(\beta_1\) are unknown and must be estimated from the data to make predictions. In the case of our advertising example, the data set consists of the advertising budget and product sales (n = 200). Our goal is to obtain coefficient estimates such that the linear model fits the available data well. In other words, we fit a line through the scatterplot of observations and try to find the line that best describes the data. The following graph shows the scatterplot for our data, where the black line shows the regression line. The grey vertical lines show the differences between the predicted values (the regression line) and the observed values. These differences are referred to as the residuals (“e”).
Figure 6.5: Ordinary least squares (OLS)
Estimation of the regression function is based on the method of least squares (OLS = ordinary least squares). The first step is to calculate the residuals by subtracting the predicted values from the observed values.
\(e_i = Y_i-(\beta_0+\beta_1X_i)\)
The coefficients are then chosen such that the sum of the squared residuals is minimized:
\[\begin{equation} \sum_{i=1}^{N} e_i^2= \sum_{i=1}^{N} [Y_i-(\beta_0+\beta_1X_i)]^2\rightarrow min! \tag{6.8} \end{equation}\]
where
\(e_i\): residuals (i = 1, 2, …, N)
\(Y_i\): values of the dependent variable (i = 1, 2, …, N)
\(\beta_0\): intercept
\(\beta_1\): regression coefficient / slope parameter
\(X_i\): values of the independent variable (i = 1, 2, …, N)
\(N\): number of observations
This is also referred to as the residual sum of squares (RSS). Now we need to choose the values for \(\beta_0\) and \(\beta_1\) that minimize the RSS. So how can we derive these values for the regression coefficients? The equation for \(\hat{\beta_1}\) is given by:
\[\begin{equation} \hat{\beta_1}=\frac{COV_{XY}}{s_x^2} \tag{6.9} \end{equation}\]
The exact mathematical derivation of this formula is beyond the scope of this script, but the intuition is to take the first derivative of the sum of squared residuals with respect to \(\beta_1\) and set it to zero, thereby finding the \(\beta_1\) that minimizes the term. Using the above formula, you can easily compute \(\beta_1\) using the following code:
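A minimal sketch of this computation, assuming the data frame and variable names introduced above:

```r
# slope estimate from equation (6.9): covariance of X and Y divided by the variance of X
cov(regression$adspend, regression$sales) / var(regression$adspend)
```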
The interpretation of \(\beta_1\) is as follows:
For every extra Euro spent on advertising, sales can be expected to increase by 0.096 units. Or, in other words, if we increase our marketing budget by 1,000 Euros, sales can be expected to increase by 96 units.
Using the estimated coefficient for \(\beta_1\), it is easy to compute \(\beta_0\) (the intercept) as follows:
\[\begin{equation} \hat{\beta_0}=\overline{Y}-\hat{\beta_1}\overline{X} \tag{6.10} \end{equation}\]
The R code for this is:
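A minimal sketch, reusing the slope estimate from above:

```r
# intercept estimate from equation (6.10): mean of Y minus slope times mean of X
beta_1 <- cov(regression$adspend, regression$sales) / var(regression$adspend)
mean(regression$sales) - beta_1 * mean(regression$adspend)
```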
The interpretation of \(\beta_0\) is as follows:
If we spend no money on advertising, we would expect to sell 134.14 units.
You may also verify this based on a scatterplot of the data. The following plot shows the scatterplot including the regression line, which is estimated using OLS.
Figure 6.6: Scatterplot
You can see that the regression line intersects the y-axis at 134.14, which corresponds to the expected sales level when advertising expenditure (on the x-axis) is zero (i.e., the intercept \(\beta_0\)). The slope coefficient (\(\beta_1\)) tells you by how much sales (on the y-axis) increase if advertising expenditures (on the x-axis) are increased by one unit.
In a next step, we assess whether the effect of advertising on sales is statistically significant. This means that we test the null hypothesis \(H_0\): “There is no relationship between advertising and sales” versus the alternative hypothesis \(H_1\): “There is some relationship between advertising and sales”. Or, stated mathematically:
\[H_0:\beta_1=0\] \[H_1:\beta_1\ne0\]
How can we test if the effect is statistically significant? Recall the generalized equation to derive a test statistic:
\[\begin{equation} test\ statistic = \frac{effect}{error} \tag{6.11} \end{equation}\]
The effect is given by the \(\beta_1\) coefficient in this case. To compute the test statistic, we need to come up with a measure of uncertainty around this estimate (the error). This is because we use information from a sample to estimate the least squares line and then make inferences regarding the regression line in the entire population. Since we only have access to one sample, the regression line will be slightly different every time we take a different sample from the population. This is sampling variation and it is perfectly normal! It just means that we need to take into account the uncertainty around the estimate, which is achieved by the standard error. Thus, the test statistic for our hypothesis is given by:
\[\begin{equation} t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \tag{6.12} \end{equation}\]
After calculating the test statistic, we compare its value to the values that we would expect to find if there was no effect, based on the t-distribution. In a regression context, the degrees of freedom are given by N - p - 1, where N is the sample size and p is the number of predictors. In our case, we have 200 observations and one predictor. Thus, the degrees of freedom are 200 - 1 - 1 = 198. In the regression output below, R provides the exact probability of observing a t-value of this magnitude (or larger) if the null hypothesis were true. This probability is the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the outcome variable due to chance in the absence of any real association between them.
To estimate the regression model in R, you can use the lm() function. Within the function, you first specify the dependent variable (“sales”) and independent variable (“adspend”) separated by a ~ (tilde). As mentioned previously, this is known as formula notation in R. The data = regression argument specifies that the variables come from the data frame named “regression”. Strictly speaking, you use the lm() function to create an object called “simple_regression,” which holds the regression output. You can then view the results using the summary() function:
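A sketch of the call described above:

```r
# estimate the simple linear regression and inspect the results
simple_regression <- lm(sales ~ adspend, data = regression)
summary(simple_regression)
```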
Note that the estimated coefficients for \(\beta_0\) (134.14) and \(\beta_1\) (0.096) correspond to the results of our manual computation above. The associated t-values and p-values are given in the output. The t-values are larger than the critical t-values for the 95% confidence level, since the associated p-values are smaller than 0.05. In the case of the coefficient for \(\beta_1\), this means that the probability of observing an association between advertising and sales of the observed magnitude (or larger) would be smaller than 0.05 if the true value of \(\beta_1\) was, in fact, 0.
The coefficients associated with the respective variables represent point estimates. To get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for \(\beta_1\), the confidence interval can be computed as:
\[\begin{equation} CI = \hat{\beta_1}\pm(t_{1-\frac{\alpha}{2}}*SE(\beta_1)) \tag{6.13} \end{equation}\]
It is easy to compute confidence intervals in R using the confint() function. You just have to provide the name of your estimated model as an argument:
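For example, using the model object from above:

```r
# 95% confidence intervals for the intercept and the slope
confint(simple_regression)
```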
For our model, the 95% confidence interval for \(\beta_0\) is [119.28, 149], and the 95% confidence interval for \(\beta_1\) is [0.08, 0.12]. Thus, we can conclude that when we do not spend any money on advertising, sales will be somewhere between 119 and 149 units on average. In addition, for each increase in advertising expenditures by one Euro, there will be an average increase in sales of between 0.08 and 0.12 units.
Once we have rejected the null hypothesis in favor of the alternative hypothesis, the next step is to investigate to what extent the model represents (“fits”) the data. How can we assess the model fit?
Similar to ANOVA, the calculation of model fit statistics relies on the different sum of squares values. \(SS_T\) (the total variation) is the sum of the squared differences between the observed data and the mean value of Y. In the absence of any other information, the mean value of Y represents the best guess of where an observation at a given level of advertising will fall:
\[\begin{equation} SS_T= \sum_{i=1}^{N} (Y_i-\overline{Y})^2 \tag{6.14} \end{equation}\]
The following graph shows the total sum of squares:
Figure 6.7: Total sum of squares
Based on our linear model, the best guess about the sales level at a given level of advertising is the predicted value. The model sum of squares (\(SS_M\)) has the mathematical representation:
\[\begin{equation} SS_M= \sum_{i=1}^{N} (\hat{Y}_i-\overline{Y})^2 \tag{6.15} \end{equation}\]
The model sum of squares represents the improvement in prediction resulting from using the regression model rather than the mean of the data. The following graph shows the model sum of squares for our example:
Figure 6.8: Ordinary least squares (OLS)
The residual sum of squares (\(SS_R\)) is the sum of the squared differences between the observed data and the predicted values along the regression line (i.e., the variation not explained by the model):
\[\begin{equation} SS_R= \sum_{i=1}^{N} (Y_i-\hat{Y}_i)^2 \tag{6.16} \end{equation}\]
The following graph shows the residual sum of squares for our example:
Figure 6.9: Ordinary least squares (OLS)
The \(R^2\) statistic represents the proportion of variance that is explained by the model and is computed as:
\[\begin{equation} R^2= \frac{SS_M}{SS_T} \tag{6.16} \end{equation}\]
It takes values between 0 (very bad fit) and 1 (very good fit). Note that when the goal of your model is to predict future outcomes, a “too good” model fit can pose severe challenges. The reason is that the model might fit your specific sample so well, that it will only predict well within the sample but not generalize to other samples. This is called overfitting and it shows that there is a trade-off between model fit and out-of-sample predictive ability of the model, if the goal is to predict beyond the sample.
You can get a first impression of the fit of the model by inspecting the scatter plot as can be seen in the plot below. If the observations are highly dispersed around the regression line (left plot), the fit will be lower compared to a data set where the values are less dispersed (right plot).
Figure 6.10: Good vs. bad model fit
The \(R^2\) statistic is reported in the regression output (see above). However, you could also extract the relevant sum of squares statistics from the regression object using the anova() function and compute it manually:
Now we can compute \(R^2\) in the same way that we computed \(\eta^2\) in the last section:
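A sketch of the manual computation (the anova() output for an lm object contains the sum of squares for the predictor and the residuals):

```r
anova(simple_regression)
ss_m <- anova(simple_regression)["adspend", "Sum Sq"]    # model sum of squares
ss_r <- anova(simple_regression)["Residuals", "Sum Sq"]  # residual sum of squares
ss_m / (ss_m + ss_r)                                     # R^2 = SS_M / SS_T
```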
Due to the way the \(R^2\) statistic is calculated, it will never decrease if a new explanatory variable is introduced into the model. This means that every new independent variable either leaves the \(R^2\) unchanged or increases it, even if there is no real relationship between the new variable and the dependent variable. Hence, one could be tempted to just add as many variables as possible to increase the \(R^2\) and thus obtain a “better” model. However, this actually only leads to more noise and therefore a worse model.
To account for this, there exists a statistic closely related to the \(R^2\), the adjusted \(R^2\). It can be calculated as follows:
\[\begin{equation} \overline{R^2} = 1 - (1 - R^2)\frac{n-1}{n - k - 1} \tag{6.17} \end{equation}\]
where n is the total number of observations and k is the total number of explanatory variables. The adjusted \(R^2\) is equal to or less than the regular \(R^2\) and can be negative. It will only increase if the added variable adds more explanatory power than one would expect by pure chance. Essentially, it contains a “penalty” for including unnecessary variables and therefore favors more parsimonious models. As such, it is useful for comparing different models and is particularly helpful in the model selection stage of a project. In R, the standard lm() function automatically reports the adjusted \(R^2\) as well.
Another significance test is the F-test. It tests the null hypothesis:
\[H_0:R^2=0\]
This is equivalent to the following null hypothesis:
\[H_0:\beta_1=\beta_2=...=\beta_k=0\]
The F-test statistic is calculated as follows:
\[\begin{equation} F=\frac{\frac{SS_M}{k}}{\frac{SS_R}{(n-k-1)}}=\frac{MS_M}{MS_R} \tag{6.16} \end{equation}\]
which follows an F distribution with k (the number of predictors) and n - k - 1 degrees of freedom. In other words, you divide the systematic (“explained”) variation due to the predictor variables by the unsystematic (“unexplained”) variation.
The result of the F-test is provided in the regression output. However, you might manually compute the F-test using the ANOVA results from the model:
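A sketch of the manual computation, following the formula above:

```r
ss_m <- anova(simple_regression)["adspend", "Sum Sq"]
ss_r <- anova(simple_regression)["Residuals", "Sum Sq"]
k <- 1                 # number of predictors
n <- nrow(regression)  # number of observations
f_stat <- (ss_m / k) / (ss_r / (n - k - 1))
f_stat
pf(f_stat, k, n - k - 1, lower.tail = FALSE)  # associated p-value
```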
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising. Suppose you want to predict sales for a new product, and the company plans to spend 800 Euros on advertising. How much will it sell? You can easily compute this either by hand:
\[\hat{sales}=134.134 + 0.09612*800=211\]
… or by extracting the estimated coefficients from the model summary:
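For example:

```r
# prediction for adspend = 800, using the estimated coefficients
coef(simple_regression)[1] + coef(simple_regression)[2] * 800
# equivalently, via the predict() function
predict(simple_regression, newdata = data.frame(adspend = 800))
```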
The predicted value of the dependent variable is 211 units, i.e., the product will (on average) sell 211 units.
The following video summarizes how to conduct simple linear regression in R
Multiple linear regression is a statistical technique that simultaneously tests the relationships between two or more independent variables and an interval-scaled dependent variable. The general form of the equation is given by:
\[\begin{equation} Y=(\beta_0+\beta_1*X_1+\beta_2*X_2+\ldots+\beta_n*X_n)+\epsilon \tag{6.5} \end{equation}\]
Again, we aim to find the linear combination of predictors that correlates maximally with the outcome variable. Note that if you change the composition of predictors, the partial regression coefficient of an independent variable will generally differ from the corresponding bivariate regression coefficient. This is because the regressors are usually correlated, and in the bivariate regression any variation in Y that was shared by \(X_1\) and \(X_2\) was attributed to \(X_1\). The interpretation of a partial regression coefficient is the expected change in Y when the corresponding X is changed by one unit and all other predictors are held constant.
Let’s extend the previous example. Say, in addition to the influence of advertising, you are interested in estimating the influence of airplay on the number of album downloads. The corresponding equation would then be given by:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\epsilon \tag{6.6} \end{equation}\]
The words “adspend” and “airplay” represent data that we have observed on advertising expenditures and the number of radio plays, and \(\beta_1\) and \(\beta_2\) represent the unknown relationships between sales and advertising expenditures and radio airplay, respectively. The coefficients tell you by how much sales will increase for an additional Euro spent on advertising (when radio airplay is held constant) and by how much sales will increase for an additional radio play (when advertising expenditures are held constant). Thus, we can make predictions about album sales based not only on advertising spending, but also on radio airplay.
With several predictors, the partitioning of sum of squares is the same as in the bivariate model, except that the model is no longer a 2-D straight line. With two predictors, the regression line becomes a 3-D regression plane. In our example:
Figure 6.11: Regression plane
Like in the bivariate case, the plane is fitted to the data with the aim of predicting the observed data as well as possible. The deviations of the observations from the plane represent the residuals (the error we make in predicting the observed data from the model). Note that this is conceptually the same as in the bivariate case, except that the computation is more complex (we won’t go into details here). The model is fairly easy to plot using a 3-D scatterplot, because we only have two predictors. While multiple regression models with more than two predictors are not as easy to visualize, you may apply the same principles when interpreting the model outcome:
Estimating multiple regression models is straightforward using the lm() function. You just need to separate the individual predictors on the right hand side of the equation using the + symbol. For example, the model:
\[\begin{equation} Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\beta_3*starpower+\epsilon \tag{6.6} \end{equation}\]
could be estimated as follows:
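A sketch of the call (the name of the model object is an assumption):

```r
multiple_regression <- lm(sales ~ adspend + airplay + starpower, data = regression)
summary(multiple_regression)
```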
The interpretation of the coefficients is as follows:
The associated t-values and p-values are also given in the output. You can see that the p-values are smaller than 0.05 for all three coefficients. Hence, all effects are “significant”. This means that if the null hypothesis was true (i.e., there was no effect between the variables and sales), the probability of observing associations of the estimated magnitudes (or larger) is very small (e.g., smaller than 0.05).
Again, to get a better feeling for the range of values that the coefficients could take, it is helpful to compute confidence intervals:
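For example, using the model object from above:

```r
confint(multiple_regression)
```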
What does this tell you? Recall that a 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for \(\beta_3\), the confidence interval is [6.28, 15.89]. Thus, although we have computed a point estimate of 11.09 for the effect of starpower on sales based on our sample, the effect might just as well take any other value within this range, considering the sample size and the variability in our data.
The output also tells us that 66.47% of the variation can be explained by our model. You may also visually inspect the fit of the model by plotting the predicted values against the observed values. We can extract the predicted values using the predict() function. So let’s create a new variable yhat , which contains those predicted values.
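For example:

```r
# store the predicted values in a new variable
regression$yhat <- predict(multiple_regression)
```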
We can now use this variable to plot the predicted values against the observed values. In the following plot, the model fit would be perfect if all points fell on the diagonal line. The larger the distance between the points and the line, the worse the model fit.
Figure 6.12: Model fit
Partial plots
In the context of a simple linear regression (i.e., with a single independent variable), a scatter plot of the dependent variable against the independent variable provides a good indication of the nature of the relationship. If there is more than one independent variable, however, things become more complicated. The reason is that although the scatter plot still shows the relationship between the two variables, it does not take into account the effect of the other independent variables in the model. Partial regression plots show the effect of adding another variable to a model that already controls for the remaining variables in the model. In other words, a partial regression plot is a scatterplot of the residuals of the outcome variable and of a predictor when both are regressed separately on the remaining predictors. As an example, consider the effect of advertising expenditures on sales. In this case, the partial plot would show the effect of adding advertising expenditures as an explanatory variable while controlling for the variation that is explained by airplay and starpower in both variables (sales and advertising). Think of it as the purified relationship between advertising and sales that remains after controlling for other factors. The partial plots can easily be created using the avPlots() function from the car package:
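A sketch of the call, assuming the multiple regression object from above:

```r
library(car)
avPlots(multiple_regression)
```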
Figure 6.13: Partial plots
Using the model
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising, airplay, and starpower. Suppose you would like to predict sales for a new music album with advertising expenditures of 800, airplay of 30 and starpower of 5. How much will it sell?
\[\hat{sales}=−26.61 + 0.084 * 800 + 3.367*30 + 11.08 ∗ 5= 197.74\]
… or by extracting the estimated coefficients:
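For example:

```r
# prediction for adspend = 800, airplay = 30, starpower = 5
coefs <- coef(multiple_regression)
coefs[1] + coefs[2] * 800 + coefs[3] * 30 + coefs[4] * 5
# equivalently, via the predict() function
predict(multiple_regression, newdata = data.frame(adspend = 800, airplay = 30, starpower = 5))
```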
The predicted value of the dependent variable is 198 units, i.e., the product will sell 198 units.
Comparing effects
Using the output from the regression model above, it is difficult to compare the effects of the independent variables because they are all measured on different scales (Euros, radio plays, releases). Standardized regression coefficients can be used to judge the relative importance of the predictor variables. Standardization is achieved by multiplying the unstandardized coefficient by the ratio of the standard deviations of the independent and dependent variables:
\[\begin{equation} B_{k}=\beta_{k} * \frac{s_{x_k}}{s_y} \tag{6.18} \end{equation}\]
Hence, the standardized coefficient will tell you by how many standard deviations the outcome will change as a result of a one standard deviation change in the predictor variable. Standardized coefficients can be easily computed using the lm.beta() function from the lm.beta package.
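For example:

```r
library(lm.beta)
lm.beta(multiple_regression)
```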
The results show that for adspend and airplay , a change by one standard deviation will result in a 0.51 standard deviation change in sales, whereas for starpower , a one standard deviation change will only lead to a 0.19 standard deviation change in sales. Hence, while the effects of adspend and airplay are comparable in magnitude, the effect of starpower is less strong.
The following video summarizes how to conduct multiple regression in R
Once you have built and estimated your model it is important to run diagnostics to ensure that the results are accurate. In the following section we will discuss common problems.
The following video summarizes how to handle outliers in R
Outliers are data points that differ vastly from the trend. They can introduce bias into a model due to the fact that they alter the parameter estimates. Consider the example below. A linear regression was performed twice on the same data set, except during the second estimation the two green points were changed to be outliers by being moved to the positions indicated in red. The solid red line is the regression line based on the unaltered data set, while the dotted line was estimated using the altered data set. As you can see the second regression would lead to different conclusions than the first. Therefore it is important to identify outliers and further deal with them.
Figure 6.14: Effects of outliers
One quick way to visually detect outliers is by creating a scatterplot (as above) to see whether anything seems off. Another approach is to inspect the studentized residuals. If there are no outliers in your data, about 95% will be between -2 and 2, as per the assumptions of the normal distribution. Values well outside of this range are unlikely to happen by chance and warrant further inspection. As a rule of thumb, observations whose studentized residuals are greater than 3 in absolute values are potential outliers.
The studentized residuals can be obtained in R with the function rstudent() . We can use this function to create a new variable that contains the studentized residuals. The music sales regression from before yields the following residuals:
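A minimal sketch (the name of the new variable is an assumption):

```r
# studentized residuals of the multiple regression model
regression$stud_resid <- rstudent(multiple_regression)
head(regression$stud_resid)
```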
A good way to visually inspect the studentized residuals is to plot them in a scatterplot and roughly check if most of the observations are within the -3, 3 bounds.
Figure 6.15: Plot of the studentized residuals
To identify potentially influential observations in our data set, we can apply a filter to our data:
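For example, using the studentized residuals created above:

```r
# observations with studentized residuals larger than 3 in absolute value
subset(regression, abs(stud_resid) > 3)
```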
After a detailed inspection of the potential outliers, you might decide whether or not to delete the affected observations from the data set. If an outlier has resulted from an error in data collection, you might simply remove the observation. However, even though observations may have extreme values, they might not be influential in determining the regression line. That is, the results wouldn’t be much different whether we include or exclude them from the analysis. This means that the decision of whether to exclude an outlier is closely related to the question of whether this observation is an influential observation, as will be discussed next.
Related to the issue of outliers is that of influential observations, meaning observations that exert undue influence on the parameters. You can determine whether or not the results are driven by an influential observation by calculating how far the predicted values for your data would move if the model were fitted without this particular observation. This calculated total distance is called Cook’s distance. To identify influential observations, we can inspect the respective plots created from the model output. A rule of thumb is to classify an observation as influential if its Cook’s distance is greater than 1 (although opinions vary on this). The following plot can be used to see the Cook’s distance associated with each data point:
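The plot can be created from the fitted model object; a sketch:

```r
# plot 4 of the standard lm() diagnostics shows Cook's distance
plot(multiple_regression, 4)
```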
Figure 6.16: Cook’s distance
It is easy to see that none of the Cook’s distance values is close to the critical value of 1. Another useful plot to identify influential observations is plot number 5 from the output:
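For example:

```r
# plot 5 of the standard lm() diagnostics: residuals vs. leverage
plot(multiple_regression, 5)
```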
Figure 6.17: Residuals vs. Leverage
In this plot, we look for cases outside of the dashed lines, which represent thresholds for Cook’s distance. Lines for Cook’s distance thresholds of 0.5 and 1 are included by default. In our example, these lines are not even visible, since the Cook’s distance values are far away from the critical values. Generally, you would watch out for outlying values in the upper right or lower right corner of the plot. Those are the places where cases can be influential against a regression line. In our example, there are no influential cases.
To see how influential observations can impact your regression, have a look at this example .
An important underlying assumption for OLS is that of linearity, meaning that the relationship between the dependent and the independent variable can be reasonably approximated in linear terms. One quick way to assess whether a linear relationship can be assumed is to inspect the added variable plots that we already came across earlier:
Figure 6.18: Partial plots
In our example, it appears that linear relationships can be reasonably assumed. Please note, however, that the assumption of linearity implies two things:
These assumptions may not be justifiable in certain contexts and you might have to transform your data (e.g., using log-transformations) in these cases, as we will see below.
The following video summarizes how to identify non-constant error variance in R
Another important assumption of the linear model is that the error terms have a constant variance (i.e., homoscedasticity). The following plot from the model output shows the residuals (the vertical distance from an observed value to the predicted values) versus the fitted values (the predicted value from the regression model). If all the points fell exactly on the dashed grey line, it would mean that we have a perfect prediction. The residual variance (i.e., the spread of the values on the y-axis) should be similar across the scale of the fitted values on the x-axis.
Figure 6.19: Residuals vs. fitted values
In our case, this appears to be the case. You can identify non-constant variances in the errors (i.e., heteroscedasticity) from the presence of a funnel shape in the above plot. When the assumption of constant error variances is not met, this might be due to a misspecification of your model (e.g., the relationship might not be linear). In these cases, it often helps to transform your data (e.g., using log-transformations). The red line also helps you to identify potential misspecification of your model. It is a smoothed curve that passes through the residuals, and if it lies close to the grey dashed line (as in our case), it suggests a correct specification. If the line deviated substantially from the dashed grey line (e.g., a U-shape or inverse U-shape), it would suggest that the linear specification is not reasonable and you should try different specifications.
If OLS is performed despite heteroscedasticity, the coefficient estimates will still be correct on average. However, the estimator is inefficient, meaning that the standard errors are wrong, which will impact the significance tests (i.e., the p-values will be wrong). There are, however, robust regression methods that you can use to estimate your model despite the presence of heteroscedasticity.
Another assumption of OLS is that the error term is normally distributed. This can be a reasonable assumption for many scenarios, but we still need a way to check if it is actually the case. As we can not directly observe the actual error term, we have to work with the next best thing - the residuals.
A quick way to assess whether a given sample is approximately normally distributed is by using Q-Q plots. These plot the theoretical position of the observations (under the assumption that they are normally distributed) against the actual position. The plot below is created by the model output and shows the residuals in a Q-Q plot. As you can see, most of the points roughly follow the theoretical distribution, as given by the straight line. If most of the points are close to the line, the data is approximately normally distributed.
Figure 6.20: Q-Q plot
Another way to check for normal distribution of the data is to employ statistical tests that test the null hypothesis that the data is normally distributed, such as the Shapiro–Wilk test. We can extract the residuals from our model using the resid() function and apply the shapiro.test() function to it:
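For example:

```r
shapiro.test(resid(multiple_regression))
```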
As you can see, we cannot reject the \(H_0\) that the residuals are normally distributed, which means that we can assume the residuals to be approximately normally distributed.
When the assumption of normally distributed errors is not met, this might again be due to a misspecification of your model, in which case it might help to transform your data (e.g., using log-transformations).
The assumption of independent errors implies that for any two observations the residual terms should be uncorrelated. This is also known as a lack of autocorrelation. In theory, this could be tested with the Durbin-Watson test, which checks whether adjacent residuals are correlated. However, be aware that the test is sensitive to the order of your data. Hence, it only makes sense if there is a natural order in the data (e.g., time-series data), in which case the presence of dependent errors indicates autocorrelation. Since there is no natural order in our data, we don’t need to apply this test.
If you are confronted with data that has a natural order, you can perform the test using the command durbinWatsonTest() , which takes the object that the lm() function generates as an argument. The test statistic varies between 0 and 4, with values close to 2 being desirable. As a rule of thumb, values below 1 and above 3 are causes for concern.
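A sketch of the call (shown here for our model, even though the test is not meaningful for unordered data):

```r
library(car)
durbinWatsonTest(multiple_regression)
```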
Linear dependence of regressors, also known as multicollinearity, is when there is a strong linear relationship between the independent variables. Some correlation will always be present, but severe correlation can make proper estimation impossible. When present, it affects the model in several ways:
A quick way to find obvious multicollinearity is to examine the correlation matrix of the data. Any correlation above 0.8 to 0.9 should be cause for concern. You can, for example, create a correlation matrix using the rcorr() function from the Hmisc package.
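A sketch (rcorr() expects a numeric matrix; the selection of columns is an assumption):

```r
library(Hmisc)
rcorr(as.matrix(regression[, c("adspend", "airplay", "starpower")]))
```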
The bivariate correlations can also be shown in a plot:
Figure 6.21: Bivariate correlation plots
However, this only spots bivariate multicollinearity. Variance inflation factors (VIF) can be used to spot more subtle multicollinearity arising from multivariate relationships. The VIF is calculated by regressing \(X_i\) on all other predictors and using the resulting \(R_i^2\) to calculate
\[\begin{equation} \begin{split} \frac{1}{1 - R_i^2} \end{split} \tag{6.19} \end{equation}\]
VIF values of over 4 are certainly cause for concern and values over 2 should be further investigated. If the average VIF is over 1 the regression may be biased. The VIF for all variables can easily be calculated in R with the vif() function.
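For example:

```r
library(car)
vif(multiple_regression)
```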
As you can see the values are well below the cutoff, indicating that we do not have to worry about multicollinearity in our example.
If a variable that influences the outcome is left out of the model (“omitted”), a bias in other variables’ coefficients might be introduced. Specifically, the other coefficients will be biased if the corresponding variables are correlated with the omitted variable. Intuitively, the variables left in the model “pick up” the effect of the omitted variable to the degree that they are related. Let’s illustrate this with an example.
Consider the following data on the number of people visiting concerts of smaller bands.
The data set contains three variables: concert_visitors (the number of tickets sold for a concert), avg_rating (the average rating of the band), and followers (the number of followers the band has).
If we estimate a model to explain the number of tickets sold as a function of the average rating and the number of followers, the results would look as follows:
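A minimal sketch (the names of the data frame and the model object are assumptions; the variable names follow the text):

```r
full_model <- lm(concert_visitors ~ avg_rating + followers, data = concert_data)
summary(full_model)
```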
Now assume we don’t have data on the number of followers a band has, but we still have information on the average rating and want to explain the number of tickets sold. Fitting a linear model with just the avg_rating variable included yields the following results:
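A sketch, using the assumed names from above:

```r
reduced_model <- lm(concert_visitors ~ avg_rating, data = concert_data)
summary(reduced_model)
```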
What happens to the coefficient of avg_rating ? Because avg_rating and followers are not independent (e.g. one could argue that bands with a higher average rating probably have more followers) the coefficient will be biased. In our case we massively overestimate the effect that the average rating of a band has on ticket sales. In the original model, the effect was about 20.5. In the new, smaller model, the effect is approximately 3.1 times higher.
We can also work out intuitively what the bias will be. The marginal effect of followers on concert_visitors is captured by avg_rating to the degree that avg_rating is related to followers . There are two coefficients of interest:
The former is just the coefficient of followers in the original regression.
The latter is the coefficient of avg_rating obtained from a regression of followers on avg_rating , since this coefficient shows how avg_rating and followers relate to each other.
Now we can calculate the bias induced by omitting followers :
To calculate the biased coefficient, simply add the bias to the coefficient from the original model.
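A sketch of this computation, using the assumed object names from above:

```r
# coefficient of followers in the full model
beta_followers <- coef(full_model)["followers"]
# coefficient of avg_rating when regressing followers on avg_rating
delta <- coef(lm(followers ~ avg_rating, data = concert_data))["avg_rating"]
bias <- beta_followers * delta
# biased coefficient = coefficient from the full model + bias
coef(full_model)["avg_rating"] + bias
coef(reduced_model)["avg_rating"]  # (approximately) the same value
```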
6.4.1 Two categories
Suppose you wish to investigate the effect of the variable “country” on sales, which is a categorical variable that can only take two levels (i.e., 0 = local artist, 1 = international artist). Categorical variables with two levels are also called binary predictors. It is straightforward to include these variables in your model as “dummy” variables. Dummy variables are indicator variables that can only take the values 0 and 1. For our “country” variable, we can create a new predictor variable of the form:
\[\begin{equation} x_4 = \begin{cases} 1 & \quad \text{if } i \text{th artist is international}\\ 0 & \quad \text{if } i \text{th artist is local} \end{cases} \tag{6.20} \end{equation}\]
This new variable is then added to our regression equation from before, so that the equation becomes
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international+\epsilon \end{align}\]
where “international” represents the new dummy variable and \(\beta_4\) is the coefficient associated with this variable. Estimating the model is straightforward - you just need to include the variable as an additional predictor variable. Note that the variable needs to be specified as a factor variable before including it in your model. If you haven’t converted it to a factor variable before, you could also use the wrapper function as.factor() within the equation.
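A sketch of the call (the name of the model object is an assumption):

```r
multiple_regression_bin <- lm(sales ~ adspend + airplay + starpower + as.factor(country), data = regression)
summary(multiple_regression_bin)
```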
You can see that we now have an additional coefficient in the regression output, which tells us the effect of the binary predictor. The dummy variable can generally be interpreted as the average difference in the dependent variable between the two groups (similar to a t-test). In this case, the coefficient tells you the difference in sales between international and local artists, and whether this difference is significant. Specifically, it means that international artists on average sell 45.67 units more than local artists, and this difference is significant (i.e., p < 0.05).
Predictors with more than two categories, like our “genre” variable, can also be included in your model. However, in this case one dummy variable cannot represent all possible values, since there are three genres (i.e., 1 = Rock, 2 = Pop, 3 = Electronic). Thus, we need to create additional dummy variables. For example, for our “genre” variable, we create two dummy variables as follows:
\[\begin{equation} x_5 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Pop genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.21} \end{equation}\]
\[\begin{equation} x_6 = \begin{cases} 1 & \quad \text{if } i \text{th product is from Electronic genre}\\ 0 & \quad \text{if } i \text{th product is from Rock genre} \end{cases} \tag{6.22} \end{equation}\]
We would then add these variables as additional predictors in the regression equation and obtain the following model
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*Pop\\ &+\beta_6*Electronic+\epsilon \end{align}\]
where “Pop” and “Electronic” represent our new dummy variables, and \(\beta_5\) and \(\beta_6\) represent the associated regression coefficients.
The interpretation of the coefficients is as follows: \(\beta_5\) is the difference in average sales between the genres “Rock” and “Pop”, while \(\beta_6\) is the difference in average sales between the genres “Rock” and “Electronic”. Note that the level for which no dummy variable is created is also referred to as the baseline . In our case, “Rock” would be the baseline genre. This means that there will always be one fewer dummy variable than the number of levels.
You don’t have to create the dummy variables manually as R will do this automatically when you add the variable to your equation:
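A sketch (the name of the model object is an assumption):

```r
multiple_regression_ext <- lm(sales ~ adspend + airplay + starpower + as.factor(country) + as.factor(genre), data = regression)
summary(multiple_regression_ext)
```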
How can we interpret the coefficients? It is estimated based on our model that products from the “Pop” genre will on average sell 47.69 units more than products from the “Rock” genre, and that products from the “Electronic” genre will sell on average 27.62 units more than the products from the “Rock” genre. The p-value of both variables is smaller than 0.05, suggesting that there is statistical evidence for a real difference in sales between the genres.
The level of the baseline category is arbitrary. As you have seen, R simply selects the first level as the baseline. If you would like to use a different baseline category, you can use the relevel() function and set the reference category using the ref argument. The following would estimate the same model using the second category as the baseline:
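A sketch of this approach:

```r
# make the second genre category ("Pop") the baseline and re-estimate the model
regression$genre <- relevel(as.factor(regression$genre), ref = 2)
summary(lm(sales ~ adspend + airplay + starpower + as.factor(country) + genre, data = regression))
```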
Note that while your choice of the baseline category impacts the coefficients and the significance level, the prediction for each group will be the same regardless of this choice.
The standard linear regression model provides results that are easy to interpret and is useful for addressing many real-world problems. However, it makes rather restrictive assumptions that might be violated in many cases. Notably, it assumes that the relationship between the response and the predictor variables is additive and linear . The additive assumption states that the effect of an independent variable on the dependent variable is independent of the values of the other independent variables included in the model. The linear assumption means that the effect of a one-unit change in the independent variable on the dependent variable is the same, regardless of the value of the independent variable. This is also referred to as constant marginal returns . For example, an increase in ad-spend from 10€ to 11€ yields the same increase in sales as an increase from 100,000€ to 100,001€. This section presents alternative model specifications for cases in which these assumptions do not hold.
Regarding the additive assumption, it might be argued that the effect of some variables is not fully independent of the values of other variables. In our example, one could argue that the effect of advertising depends on the type of artist. For example, advertising might be more effective for local artists. We can investigate whether this is the case using a grouped scatterplot:
Figure 6.22: Effect of advertising by group
The scatterplot indeed suggests that there is a difference in advertising effectiveness between local and international artists. You can see this from the two different regression lines. We can incorporate this interaction effect by including an interaction term in the regression equation as follows:
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*international\\ &+\beta_5*(adspend*international)\\ &+\epsilon \end{align}\]
You can see that the effect of advertising now depends on the type of artist. Hence, the additive assumption is removed. Note that if you decide to include an interaction effect, you should always include the main effects of the variables that are part of the interaction (even if the associated p-values do not suggest significant effects). It is easy to include an interaction effect in your model by adding an additional variable that has the format `var1:var2`. In our example, this could be achieved using the following specification:
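A sketch of the call (the name of the model object is an assumption):

```r
interaction_model <- lm(sales ~ adspend + airplay + starpower + as.factor(country) + adspend:as.factor(country), data = regression)
summary(interaction_model)
```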
How can we interpret the coefficients? The adspend main effect tells you the effect of advertising for the reference group, which has the factor level zero. In our example, it is the advertising effect for local artists. This means that for local artists, spending an additional 1,000 Euros on advertising will result in approximately 89 additional unit sales. The interaction effect tells you by how much the effect differs for the other group (i.e., international artists) and whether this difference is significant. In our example, it means that the effect for international artists can be computed as: 0.0885 - 0.0347 = 0.0538. This means that for international artists, spending an additional 1,000 Euros on advertising will result in approximately 54 additional unit sales. Since the interaction effect is significant (p < 0.05), we can conclude that advertising is less effective for international artists.
The above example showed the interaction between a categorical variable (i.e., “country”) and a continuous variable (i.e., “adspend”). However, interaction effects can be defined for different combinations of variable types. For example, you might just as well specify an interaction between two continuous variables. In our example, you might suspect that there are synergy effects between advertising expenditures and radio airplay. It could be that advertising is more effective when an artist receives a large number of radio plays. In this case, we would specify our model as:
\[\begin{align} Sales =\beta_0 &+\beta_1*adspend\\ &+\beta_2*airplay\\ &+\beta_3*starpower\\ &+\beta_4*(adspend*airplay)\\ &+\epsilon \end{align}\]
In this case, we can interpret \(\beta_4\) as the increase in the effectiveness of advertising for a one unit increase in radio airplay (or vice versa). This can be translated to R using:
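A sketch of the call (the name of the model object is an assumption):

```r
interaction_model_2 <- lm(sales ~ adspend + airplay + starpower + adspend:airplay, data = regression)
summary(interaction_model_2)
```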
However, since the p-value of the interaction is larger than 0.05, there is little statistical evidence for an interaction between the two variables.
In our example above, it appeared that linear relationships could be reasonably assumed. In many practical applications, however, this might not be the case. Let’s review the implications of a linear specification again:
In many marketing contexts, these might not be reasonable assumptions. Consider the case of advertising. It is unlikely that the return on advertising is independent of the level of advertising expenditures. It is rather likely that saturation occurs at some level, meaning that the return from an additional Euro spent on advertising decreases with the level of advertising expenditures (i.e., decreasing marginal returns). In other words, at some point the advertising campaign has achieved a certain level of penetration and an additional Euro spent on advertising won’t yield the same return as in the beginning.
Let’s use an example data set, containing the advertising expenditures of a company and the sales (in thousand units).
Now we inspect if a linear specification is appropriate by looking at the scatterplot:
Figure 6.23: Non-linear relationship
It appears that a linear model might not represent the data well. It rather appears that the effect of an additional Euro spent on advertising decreases with increasing levels of advertising expenditures. Thus, we have decreasing marginal returns. We could put this to a test and estimate a linear model:
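A minimal sketch (the names of the data frame and the model object are assumptions; the variables are "advertising" and "sales" as described above):

```r
linear_reg <- lm(sales ~ advertising, data = non_linear_reg)
summary(linear_reg)
```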
Advertising appears to be positively related to sales, with an additional Euro spent on advertising resulting in 0.0005 additional sales. The \(R^2\) statistic suggests that approximately 51% of the total variation can be explained by the model.
To test if the linear specification is appropriate, let’s inspect some of the plots that are generated by R. We start by inspecting the residuals plot.
Figure 6.24: Residuals vs. Fitted
The plot suggests that the assumption of homoscedasticity is violated (i.e., the spread of values on the y-axis is different for different levels of the fitted values). In addition, the red line deviates from the dashed grey line, suggesting that the relationship might not be linear. Finally, the Q-Q plot of the residuals suggests that the residuals are not normally distributed.
Figure 6.25: Q-Q plot
To sum up, a linear specification might not be the best model for this data set.
In this case, a multiplicative model might be a better representation of the data. The multiplicative model has the following formal representation:
\[\begin{equation} Y =\beta_0 *X_1^{\beta_1}*X_2^{\beta_2}*...*X_J^{\beta_J}*\epsilon \tag{6.23} \end{equation}\]
This functional form can be linearized by taking the logarithm of both sides of the equation:
\[\begin{equation} log(Y) =log(\beta_0) + \beta_1*log(X_1) + \beta_2*log(X_2) + ...+ \beta_J*log(X_J) + log(\epsilon) \tag{6.24} \end{equation}\]
This means that taking logarithms of both sides of the equation makes linear estimation possible. Let’s see how the scatterplot looks if we use the logarithms of our variables (using the log() function) instead of the original values.
Figure 6.26: Linearized effect
It appears that now, with the log-transformed variables, a linear specification is a much better representation of the data. Hence, we can log-transform our variables and estimate the following equation:
\[\begin{equation} log(sales) = log(\beta_0) + \beta_1*log(advertising) + log(\epsilon) \tag{6.25} \end{equation}\]
This can be easily implemented in R by transforming the variables using the log() function:
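A sketch, reusing the assumed names from above:

```r
log_reg <- lm(log(sales) ~ log(advertising), data = non_linear_reg)
summary(log_reg)
```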
Note that this specification implies decreasing marginal returns (i.e., the returns of advertising decrease with the level of advertising), which appears to be more consistent with the data. The specification is also consistent with proportional changes in advertising being associated with proportional changes in sales (i.e., advertising does not become more effective with increasing levels). This has important implications for the interpretation of the coefficients. In our example, you would interpret the coefficient as follows: a 1% increase in advertising leads to a 0.3% increase in sales . Hence, the interpretation is in proportional terms and no longer in units. This means that the coefficients in a log-log model can be directly interpreted as elasticities, which also makes communication easier. We can also inspect the \(R^2\) statistic to see that the model fit has increased compared to the linear specification (i.e., \(R^2\) has increased to 0.681 from 0.509). However, please note that the variables are now measured on a different scale, which means that the model fit is, in theory, not directly comparable. Also, we can use the residuals plot to confirm that the revised specification is more appropriate:
Figure 6.27: Residuals plot
Figure 6.28: Q-Q plot
Finally, we can plot the predicted values against the observed values to see that the results from the log-log model (red) provide a better prediction than the results from the linear model (blue).
Figure 6.29: Comparison of model fit
Another way of modelling non-linearities is to include a squared term if there are decreasing or increasing effects. In fact, we can model non-constant slopes as long as the specification is a linear combination of powers (i.e. squared, cubed, …) of the explanatory variables. Usually we do not expect many turning points, so squared or third-power terms suffice. Note that a polynomial of degree n can accommodate up to n - 1 turning points.
When using squared terms we can model diminishing and eventually negative returns. Think about advertisement spending. If a brand is not well known, spending on ads will increase brand awareness and have a large effect on sales. In a regression model this translates to a steep slope for spending at the origin (i.e. for lower spending). However, as more and more people will already know the brand we expect that an additional Euro spent on advertisement will have less and less of an effect the more the company spends. We say that the returns are diminishing. Eventually, if they keep putting more and more ads out, people get annoyed and some will stop buying from the company. In that case the return might even get negative. To model such a situation we need a linear as well as a squared term in the regression.
lm(...) can take squared (or any power) terms as input by adding I(X^2) as an explanatory variable. In the example below we see a clear quadratic relationship with a maximum at around 70. If we try to model this using the level of the covariate without the quadratic term, we do not get a very good fit.
The graph above clearly shows that advertising spending between 0 and 50 increases sales. However, the marginal increase (i.e. the slope of the data curve) is decreasing. Around 70 the curve reaches its maximum; beyond that point additional ad-spending actually decreases sales (e.g. people get annoyed). Notice that the prediction line is straight, that is, the marginal increase in sales due to additional spending on advertising is the same for any amount of spending. This shows the danger of basing business decisions on wrongly specified models. But even in the area in which the sign of the prediction is correct, we are quite far off. Let’s take a look at the top 5 sales values and the corresponding predictions:
By including a quadratic term we can fit the data very well. This is still a linear model since the outcome variable is still explained by a linear combination of regressors even though one of the regressors is now just a non-linear function of the same variable (i.e. the squared value).
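A sketch of such a model (the data frame and object names are assumptions):

```r
# the quadratic term is wrapped in I() so that it is treated as advertising^2
quad_reg <- lm(sales ~ advertising + I(advertising^2), data = quad_data)
summary(quad_reg)
```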
Now the prediction of the model is very close to the actual data and we could base our production decisions on that model.
When interpreting the coefficients of the predictor in this model we have to be careful. Since we included the squared term, the slope is now different at each level of advertising (this can be seen in the graph above). That is, we no longer have a single coefficient to interpret as the slope. This can easily be shown by calculating the derivative of the model with respect to advertising.
\[ \text{Sales} = \alpha + \beta_1 \text{ Advertising} + \beta_2 \text{ Advertising}^2 + \varepsilon\\ {\delta \text{ Sales} \over \delta \text{ Advertising}} = \beta_1 + 2 \beta_2 \text{ Advertising} \equiv \text{Slope} \]
Intuitively, this means that the change in sales due to an additional Euro spent on advertising depends on the current level of advertising. \(\alpha\) , the intercept, can still be interpreted as the expected value of sales given that we do not advertise at all (set advertising to 0 in the model). The sign of the squared term ( \(\beta_2\) ) can be used to determine the curvature of the function. If the sign is positive, the function is convex (curvature is upwards); if it is negative, it is concave (curvature is downwards). We can interpret \(\beta_1\) and \(\beta_2\) separately in terms of their influence on the slope . By setting advertising to \(0\) we observe that \(\beta_1\) is the slope at the origin. By taking the derivative of the slope with respect to advertising we see that the change of the slope due to additional spending on advertising is two times \(\beta_2\) .
\[ {\delta Slope \over \delta Advertising} = 2\beta_2 \]
At the maximum predicted value the slope is close to \(0\) (theoretically it is equal to \(0\) , but this would require decimals and we can only sell whole pieces). Above we only calculated predictions for the observed data, so let’s first predict sales for all possible advertising levels between \(1\) and \(200\) to find the optimal level of advertising according to our model.
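A sketch of this prediction, using the assumed quadratic model from above:

```r
# predict sales for advertising levels 1 to 200 and find the level with the highest prediction
ad_grid <- data.frame(advertising = 1:200)
ad_grid$predicted_sales <- predict(quad_reg, newdata = ad_grid)
ad_grid[which.max(ad_grid$predicted_sales), ]
```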
For any other level of advertising, we insert the advertising expenditure into the formula to obtain the slope at that point. In the following example you can choose the level of advertising.
The following video summarizes how to visualize log-transformed regressions in R
In the last section we saw how to predict continuous outcomes (sales, height, etc.) via linear regression models. Another interesting case is that of binary outcomes, i.e. when the variable we want to model can only take two values (yes or no, group 1 or group 2, dead or alive, etc.). To this end we would like to estimate how our predictor variables change the probability of a value being 0 or 1. In this case we can technically still use a linear model (e.g. OLS). However, its predictions will most likely not be particularly useful. A more useful method is the logistic regression. In particular we are going to have a look at the logit model. In the following dataset we are trying to predict whether a song will be a top-10 hit on a popular music streaming platform. In a first step we are going to use only the danceability index as a predictor. Later we are going to add more independent variables.
Below are two attempts to model the data. The left panel assumes a linear probability model (calculated with the same methods as in the last chapter), while the right panel shows a logistic regression model. As you can see, the linear probability model produces probabilities above 1 and below 0, which are not valid probabilities, while the logistic model stays between 0 and 1. Notice that songs with a higher danceability index (towards the right of the x-axis) cluster more at \(1\) and those with a lower index more at \(0\), so we expect a positive influence of danceability on the probability of a song becoming a top-10 hit.
Figure 6.30: The same binary data explained by two models: a linear probability model (left) and a logistic regression model (right)
A key insight at this point is that the connection between \(\mathbf{X}\) and \(Y\) is non-linear in the logistic regression model. As we can see in the plot, the probability of success is most strongly affected by danceability around values of \(0.5\) , while higher and lower values have a smaller marginal effect. This obviously also has consequences for the interpretation of the coefficients later on.
As the name suggests, the logistic function is an important component of the logistic regression model. It has the following form:
\[ f(\mathbf{X}) = \frac{1}{1 + e^{-\mathbf{X}}} \] This function transforms all real numbers into the range between 0 and 1. We need this to model probabilities, as probabilities can only be between 0 and 1.
The logistic function on its own is not very useful yet, as we want to be able to determine how predictors influence the probability of a value to be equal to 1. To this end we replace the \(\mathbf{X}\) in the function above with our familiar linear specification, i.e.
\[ \mathbf{X} = \beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i}\\ f(\mathbf{X}) = P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}} \]
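Rearranging this expression shows that the model is linear in the log odds, which is the basis for the interpretation of the coefficients further below:
\[ \log\left(\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right) = \beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... + \beta_m * x_{m,i} \]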
In our case we only have \(\beta_0\) and \(\beta_1\) , the coefficient associated with danceability.
In general we now have a mathematical relationship between our predictor variables \((x_1, ..., x_m)\) and the probability of \(y_i\) being equal to one. The last step is to estimate the parameters of this model \((\beta_0, \beta_1, ..., \beta_m)\) to determine the magnitude of the effects.
We are now going to show how to perform logistic regression in R. Instead of lm() we now use glm(Y~X, family=binomial(link = 'logit')) to use the logit model. We can still use the summary() command to inspect the output of the model.
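The tutorial itself works in R; purely as an illustration, an analogous model could be estimated in Python with the statsmodels package (the data frame and column names below are hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: a binary outcome (top10) and one predictor (danceability)
chart_data = pd.DataFrame({
    "top10":        [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
    "danceability": [0.31, 0.45, 0.58, 0.62, 0.66, 0.71, 0.80, 0.55, 0.50, 0.91],
})

# logit model: analogous to glm(Y ~ X, family = binomial(link = 'logit')) in R
logit_model = smf.logit("top10 ~ danceability", data=chart_data).fit()
print(logit_model.summary())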
Noticeably this output does not include an \(R^2\) value to assess model fit. Multiple “pseudo \(R^2\)” measures, similar to the \(R^2\) used in OLS, have been developed. There are packages that return the \(R^2\) given a logit model (see rcompanion or pscl). The calculation by hand is also fairly simple. We define the function logisticPseudoR2s() that takes a logit model as an input and returns three popular pseudo \(R^2\) values.
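The three measures typically reported are the following, where \(LL_{model}\) and \(LL_{null}\) denote the log-likelihoods of the fitted model and of an intercept-only model, and \(n\) is the number of observations (Hosmer & Lemeshow, Cox & Snell, and Nagelkerke, respectively):
\[ R^2_{HL} = \frac{-2LL_{null} - (-2LL_{model})}{-2LL_{null}}, \quad R^2_{CS} = 1 - e^{-\frac{2}{n}\left(LL_{model} - LL_{null}\right)}, \quad R^2_{N} = \frac{R^2_{CS}}{1 - e^{\frac{2}{n}LL_{null}}} \]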
The coefficients of the model give the change in the log odds of the dependent variable due to a unit change in the regressor. This makes the exact interpretation of the coefficients difficult, but we can still interpret the signs and the p-values which will tell us if a variable has a significant positive or negative impact on the probability of the dependent variable being \(1\) . In order to get the odds ratios we can simply take the exponent of the coefficients.
Notice that the coefficient is extremely large. That is (partly) due to the fact that the danceability variable is constrained to values between \(0\) and \(1\) and the coefficients are for a unit change. We can make the “unit-change” interpretation more meaningful by multiplying the danceability index by \(100\) . This linear transformation does not affect the model fit or the p-values.
We observe that danceability positively affects the likelihood of becoming a top-10 hit. To get the confidence intervals for the coefficients we can use the same function as with OLS.
In order to get a rough idea about the magnitude of the effects we can calculate the partial effects at the mean of the data (that is the effect for the average observation). Alternatively, we can calculate the mean of the effects (that is the average of the individual effects). Both can be done with the logitmfx(...) function from the mfx package. If we set logitmfx(logit_model, data = my_data, atmean = FALSE) we calculate the latter. Setting atmean = TRUE will calculate the former. However, in general we are most interested in the sign and significance of the coefficient.
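Both variants are based on the same expression for the marginal (partial) effect of a predictor \(x_k\) in a logit model, evaluated either at the sample means (atmean = TRUE) or averaged over all observations (atmean = FALSE):
\[ \frac{\partial P(y_i = 1)}{\partial x_{k,i}} = \beta_k \, P(y_i = 1)\left(1 - P(y_i = 1)\right) \]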
This now gives the average partial effects in percentage points. An additional point on the danceability scale (from \(1\) to \(100\)), on average, makes it \(1.57\) percentage points more likely for a song to become a top-10 hit.
To get the effect of an additional point at a specific value, we can calculate the odds ratio by predicting the probability at that value and at the value \(+1\). For example, if we are interested in how much more likely a song with a danceability of 51 is to become a hit compared to one with 50, we can simply calculate the following:
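That is, we compute the ratio of the odds at the two danceability levels:
\[ \text{Odds ratio} = \frac{P(\text{hit} \mid \text{danceability}=51) \big/ \left(1 - P(\text{hit} \mid \text{danceability}=51)\right)}{P(\text{hit} \mid \text{danceability}=50) \big/ \left(1 - P(\text{hit} \mid \text{danceability}=50)\right)} \]
For a logit model with a single linear danceability term, this ratio equals \(e^{\hat{\beta}_{danceability}}\) for any one-unit step.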
So the odds are 20% higher at 51 than at 50.
Of course we can also use multiple predictors in logistic regression, as shown in the formula above. We might want to add Spotify followers (in millions) and weeks since the release of the song.
Again, the familiar formula interface can be used with the glm() function. All the model summaries shown above still work with multiple predictors.
The question remains whether a variable should be added to the model. We will present two methods for model selection for logistic regression. The first is based on the Akaike Information Criterion (AIC). It is reported with the summary output for logit models. The value of the AIC is relative, meaning that it has no interpretation by itself. However, it can be used to compare and select models. The model with the lowest AIC value is the one that should be chosen. Note that the AIC does not indicate how well the model fits the data, but is merely used to compare models.
For example, consider the following model, where we exclude the followers covariate. Since that variable contributed significantly to the explanatory power of the model, the AIC increases when it is excluded, indicating that the model including followers is better suited to explain the data. We always want the lowest possible AIC.
As a second measure for variable selection, you can use the pseudo \(R^2\)s as shown above. The fit is distinctly worse according to all three values presented here when excluding the Spotify followers.
We can predict the probability given an observation using the predict(my_logit, newdata = ..., type = "response") function. Replace ... with the observed values for which you would like to predict the outcome variable.
The prediction indicates that a song with a danceability of \(50\) from an artist with \(10\) million Spotify followers has a \(66\%\) chance of being in the top 10 one week after its release.
Perfect prediction occurs whenever a linear function of \(X\) can perfectly separate the \(1\)s from the \(0\)s in the dependent variable. This is problematic when estimating a logit model as it will result in biased estimates (also check the p-values in the example!). R will return the following message if this occurs:
glm.fit: fitted probabilities numerically 0 or 1 occurred
Given this error, one should not use the output of the glm(...) function for the analysis. There are various ways to deal with this problem, one of which is to use Firth’s bias-reduced penalized-likelihood logistic regression with the logistf(Y~X) function in the logistf package.
In this example data \(Y = 0\) if \(x_1 <0\) and \(Y=1\) if \(x_1>0\) and we thus have perfect prediction. As we can see the output of the regular logit model is not interpretable. The standard errors are huge compared to the coefficients and thus the p-values are \(1\) despite \(x_1\) being a predictor of \(Y\) . Thus, we turn to the penalized-likelihood version. This model correctly indicates that \(x_1\) is in fact a predictor for \(Y\) as the coefficient is significant.
Linear regression is one of the most powerful and most fundamental concepts for getting started in Marketing Analytics. If you are looking to start learning Machine Learning to support your marketing education, then linear regression is the topic to begin with.
As a marketer, you know that Machine Learning and Data Science have a significant impact on decision making in Marketing. Marketing Analytics provides conclusive reasoning for many decisions that, for years and years, have run on the golden gut of marketers.
Learning regression for Marketing Analytics gives you the ability to predict various marketing variables which may or may not have any visible pattern to them. In this discussion, I will give you an introduction to what linear regression is and how it can transform your marketing and sales analytics.
Therefore, if you are willing to get started with Machine Learning or Marketing Analytics, linear regression is the place to begin. I assure you that after this discussion, linear regression as a concept will be crystal clear to you.
You may already have the understanding of the fact that machine learning algorithms can be broadly divided into two categories: Supervised and Unsupervised Learning .
In Supervised Learning, the dataset that you work with contains past observed values of the variable that you are looking to predict.
For example, you could be required to create a model for predicting the Sales of a product based on the Advertising Expenditure and Sales Expenditure for a given quarter.
You would begin by asking the Sales Manager for the data for the previous quarters, expecting that he would share a well laid out Excel sheet with Quarter, Advertising Expenditure, Sales Expenditure, and the Sales for that quarter in different columns.
With such a dataset in your possession, the job of the predictive algorithm that you create will be to find the relationship between these variables. The relationship should be general enough so that when you enter the advertising and sales expenditures for the coming quarter, it can give the predicted sales for that quarter.
Unsupervised Learning is when this observed variable is not made available to you. In that case you will find yourself solving a different kind of marketing problem altogether and not that of prediction of a variable. Unsupervised Learning is not a part of this discussion.
Supervised Learning has two sub-categories of problem: Regression and Classification .
If you are pressed for time, you can go ahead and watch my video first in which I have explained all the concepts that I have shared below.
I shared a common marketing use case above. In that example, you had to predict the sales for the quarter using two different kinds of expenditure variables. Now, we know that the value of sales can be any number - arguably a positive one. The sales could, therefore, range from anywhere between 0 to some really high number.
Such a variable is a continuous variable which can take any value in a very wide range. And this, in fact, is the simplest way to understand what regression is.
A prediction problem in which the variable to be predicted is a continuous variable is a Regression problem.
Let’s look at an entirely different marketing use case to understand what is not a regression problem.
You have been recruited in the marketing team of a large private bank (a common placement for many business students).
You are given the data of the bank’s list of prospects of the last year with details like Age, Job, Marital Status, No. of Children, Previous Loans, Previous Defaults etc. Along with that you are also provided with the information whether the person took the loan from the bank or not (0 for did not take the loan and 1 for did take the loan).
Your job as a young analytical marketer is to predict whether a prospect who comes in in the future will take a loan from your bank or not.
Now, please note that in such a prediction problem your task is simply to classify the prospects based on your algorithm's understanding of whether the prospect will buy or not. This means that the possible values for the outcome are discrete (0 or 1) and not continuous.
A prediction problem in which the variable to be predicted is a discrete variable is a Classification problem.
There are a variety of problems across industries that are prediction problems. My objective of this discussion is to equip you with the intuition and some hands-on coding of Linear Regression so that you can appreciate the use cases irrespective of the industry.
Before I dive straight into what Linear Regression is, let me help you in forming an understanding of the vocabulary used when explaining regression. I will link it to the use cases mentioned above. So just reading through it will give you a complete understanding of what is what.
Target Variable: The variable to be predicted is called the Target Variable. When you had to predict the Sales for the quarter using the Advertising Expenditure and Sales Expenditure, the Sales is the target variable.
Naturally, the target variable can also be referred to as the Dependent Variable, as its value depends on the other variables in the system. In our marketing use case, the Sales obviously depends on how much you have spent on Advertisements and on Sales Promotions.
The target variable is commonly denoted as y.
Feature Variable: All the other variables that are used to predict the target variable are called the Feature variables.
The feature variables can also be called Independent Variables . In the examples, the Advertising and Sales Promotion expenditures are the independent variables. Also, imagine that there is a different machine learning problem altogether of image recognition. In such a problem, each of the pixels is a feature variable.
The feature variable is commonly denoted as x. Multiple feature variables get denoted as x0, x1, x2,...., xn.
Finally, keep in mind that these two kinds of variables go by several names: the target variable is also called the dependent, response, or outcome variable, and the feature variables are also called independent, predictor, or explanatory variables.
There are many Regression Models out of which the most basic regression model is the Linear Regression . For Nonlinear Regression, there are different models like Generalized Additive Models (GAMS) and tree-based models which can be used for regression.
Since you are starting off with Marketing Analytics, my objective in this discussion is to take you only through Linear Regression for Marketing Analytics and develop your understanding with that as a base.
Note: What I am going to share with you in the remaining part of the article tends to tune out a lot of people who are frightened by anything that even looks like Math. You would see some mathematical equations with unique notations, and some formulas as well.
None of it is a mathematical concept which you would not have studied in school. If you just manage to sit through it, you will realize that it is nothing but plain English written in a jazzed up manner with equations, which by the way are equally important.
For you as an Analytical Marketer, the intuition is important so please just focus on that.
A linear regression model for a use case which has just one independent/feature variable would look like:
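ŷ = β0 + β1x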
When you use more than one feature variable in your model then the linear regression model will look like:
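ŷ = β0 + β1x1 + β2x2 + … + βnxn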
Let me quickly decipher what this equation means.
The symbols that you see, β0, β1, β2, are called Model Parameters. These are constants which determine the predicted value of the target variable. They can also be referred to as Model Coefficients.
Specifically, β0 is referred to as the Bias Parameter . You will notice that this is not multiplied with any variable/feature in the model and is a standalone parameter that adjusts the model.
Notice carefully that I referred to these Model Parameters (β) as Constants and the features (x) as Variables. This difference needs to be understood and proper usage of the terms makes a lot of difference in understanding the topic.
As I had mentioned above, y is the target variable, which is the variable we are trying to predict. Now, while y represents the actual value of the variable to be predicted, ŷ represents the predicted value of the variable.
Since, there is always some error in the prediction that is why the predicted value is represented with a different notation from the actual variable.
This is a simple concept straight from your class 10th textbook. If you look at the equation again, you will see that each of the independent variables (x) appears with the power of 1 (degree 1). This means that the variables are not raised to a higher power (i.e. x², x³, ...).
Such a model will always be represented by a straight line when plotted on a graph, as I would show in the later part of this discussion.
I briefly touched upon a use case above in which you were to predict the Sales of a quarter based on the Advertisement Expenditure and the Sales Expenditure. Since there are two features in this model, the structure of this model will be like the equation below:
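Sales = β0 + (β1 × Advertisement Expenditure) + (β2 × Sales Expenditure)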
However, for simplicity, let us assume that the Sales Manager could only provide the data for the Advertisement Expenditure, and therefore there will only be a single feature in our model. In this situation, this is how our model equation is going to look:
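Sales = β0 + (β1 × Advertisement Expenditure)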
Here is the exact data that you received from the sales manager for you to work on.
This data shows that for Quarter 1, when the Advertisement Expenditure was 24,000, the sales were 724,000. I'm ignoring the units of currency for the time being. It could be Indian Rupees (INR), United States Dollars (USD), or anything else.
Now, I went ahead and plotted both of these variables on a scatter plot with the Advertisement Expenditure on the x-axis and the Sales on the y-axis.
In this scatter plot, each dot represents one quarter given in the table. For that particular quarter, we will be able to determine the Advertisement Expenditure and the resulting Sales from the x and y axis, respectively.
You would remember that the objective for this exercise is to be able to predict the sales of the future quarters based on the features that we have. And in this case, we have just one feature variable, i.e. the Ad_Exp .
In order to know where the next dot will lie on the scatter plot you need to find the equation of a straight line which passes through these points hence representing a trend.
Now, through Python I have drawn three lines which pass through these points. Each of these three lines is represented by a different equation. Just by looking at the three, you can say that the one in the middle seems to be passing through the points just 'perfectly'. How do we determine whether this line passes through the points perfectly or not?
Now obviously, you don’t need to make these three lines on your scatter plot every time you do linear regression. It is just for me to explain to you the intuition behind how we choose the best fitting linear line.
The best trendline which passes through the scatter plots is the one which minimizes the difference between the actual value and the predicted value across all the points.
If you zoom in on one of the points, you will see exactly what this difference between the actual and the predicted value is. A metric that is used to capture the error of the entire model across all the points is called the Residual Sum of Squares (RSS), which will be discussed in my next article on errors.
But to explain briefly, each of these distances (of each of the points) from the best fitting line is squared and added. What we finally get is the Residual Sum of Squares.
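In equation form: RSS = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)², where yi is the actual value and ŷi is the predicted value for the i-th point.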
As we had already understood from our intuition, out of the three lines that I had plotted, the one at the center seems to be the one with the least difference across all the points. And if we run the curve fitting in Python, it indeed turns out to be the best fitting line for the scatter plot.
For this part of the discussion, my purpose was to just give you the intuition of Linear Regression for Marketing Analytics.
And with this you should be able to understand what is the objective of a linear regression problem. From what you have seen above, you can simply say that the objective of a linear regression problem is to determine the regression model parameters (β0, β1, β..) that minimize the error of the model.
Notice again that this is a linear model, i.e. a straight line, and it is not at all necessary that your trendline should be straight.
Non-linear regression is something that I will discuss later in the series once I have helped you develop an understanding for regression.
This is the section where you will learn how to perform the regression in Python, continuing with the same data that the Sales Manager had shared with you.
Sales is the target variable that needs to be predicted. Now, based on this data, your objective is to create a predictive model (just like the equation above), an equation in which you can plug in the Ad_exp value for a future quarter and predict the Sales for that quarter.
Let us straightaway get down to some hands-on coding to get this prediction done. Please do not feel left out if you do not have experience with Python. You will not require any pre-requisite knowledge. In fact the best way to learn is to get your hands dirty by solving a problem - like the one we are doing.
The first step is to fire up your Jupyter notebook and load all the prerequisite libraries in your Jupyter notebook. Here are the important libraries that we will be needing for this linear regression.
In order to load these, just start with these few lines of codes in your first cell:
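Based on the libraries used later in this article (pandas for the DataFrame, NumPy for the curve fitting, and Matplotlib for the plots), a first cell along these lines will do:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline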
The last line of code helps in displaying all the graphs that we will be making within the Jupyter notebook.
Let me now import my data into a DataFrame. A DataFrame is a data type in Python. The simplest way to understand it would be that it stores all your data as a table. And it will be on this table where we will perform all of our Python operations.
Now, I am saving my table (which you saw above) in a variable called ‘data’. Further, after the equal to ‘=’ sign, I have used a command pd.read_csv.
This ensures that the .csv file which I have on my laptop at the file location mentioned in the path, gets loaded onto my Jupyter notebook. Please note that you will need to enter the path of the location where the .csv is stored in your laptop.
By running just the variable name ‘data’, as I have done in the second line of code, you will see the entire table loaded as a DataFrame.
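Putting those two lines together (with a placeholder file path that you should replace with your own), the cell might look like this:

data = pd.read_csv("path/to/your_file.csv")   # replace with the location of your .csv file
data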
You already know that the Ad_exp is the feature variable, or the independent variable. Based on this variable, the target variable, i.e. the Sales, needs to be predicted.
Therefore, just like a classic mathematical equation, let me store the Ad_exp values in a variable x and the Sales values in a variable y. This notation also makes sense because in a mathematical equation y is the output variable and x is the input variable. Same is the case here.
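Assuming the columns are named Ad_exp and Sales (as in the text above), this step plus the plot could look like this:

x = data['Ad_exp']
y = data['Sales']
plt.scatter(x, y)   # scatter plot of advertising expenditure against sales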
The last line of code will display a scatter-plot on your Jupyter notebook which will look like this:
Please note, this is the same plot that you saw above in the intuition section.
Let me tell you that till now you have not done any machine learning. This was just some basic level data cleaning/data preparation.
The glamorous Machine Learning part of the code starts here and also ends with this one line of code.
From the NumPy library that you imported, you will now be using the polyfit() method to find the coefficients of the straight line that best fits the data.
You already know from your school level math that the equation of a straight line is given by:
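y = mx + c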
Here, the m is the slope of the line and c is the y-intercept. This trendline that we are trying to find here is no different. It follows the same equation and with this code we will be able to find the m and c values for it.
This method needs three parameters: the previously defined input and output variables (x, y), plus an integer: 1. This last number defines the degree of the polynomial you want to fit.
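With the variables defined earlier, the call (storing the result in a variable called model, as referenced further below) would be:

model = np.polyfit(x, y, 1)   # fit a degree-1 polynomial, i.e. a straight line
print(model)                  # array with the slope first and the intercept second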
You would have understood that if you changed that number from 1 to 2, 3, 4 and so on, it would become a higher degree of regression also referred to as Polynomial Regression.
That is also something that I will be discussing with you in the coming weeks.
But, as soon as you run this code, you see an output which is an array of two numbers. These two numbers are nothing but the values of m and c from the equation of a straight line.
Therefore, we now know that the best trendline that describes our data is:
y = 633.9931736 + (4.68585196 * x)
If you realize, we are actually done with our prediction problem. With this equation given above, you can just plug in the value of x, which you should remember is Advertising Expenditure, and you will get the value of y, i.e. the Sales that you are likely to make in that quarter.
But, since we are already doing some interesting stuff in Python here, why should we have to find the value of Sales manually? Let's make this better in our last and final step.
Instead of doing the calculations manually with that equation, you can use another method from the NumPy library that we imported. The method is called poly1d().
Please follow the code given below.
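A minimal version of that code, reusing the model coefficients from above, would be:

predict = np.poly1d(model)   # turn the coefficient array into a callable polynomial
predict(51)                  # predicted sales for an advertising expenditure of 51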
We had stored the values of our equation coefficients in 'model'. I created a variable predict which carries all of this model data and can also predict values, courtesy of the NumPy method poly1d().
Now, when I entered an Advertising Expenditure of 51, the predicted sales came out as 872.971.
By executing these simple lines of code, you have successfully taken the first step towards learning Marketing Analytics. This is big!
Let me tell you that Linear Regression is a fundamental concept in Marketing Analytics and in Data Science in general. Therefore, you should definitely spend all that time that you need to understand it really well.
Things get interesting from here. I have not yet spoken about how to measure the accuracy of your system. I have also not mentioned how to perform this regression if there were more than one feature or independent variable. That is called Multiple Regression .
Gradually as we proceed in this journey, I will take you through all of these concepts and also through higher order regression i.e. the polynomial regression.
Having covered the most fundamental concept in machine learning, you are now ready to implement it on some of your datasets.
Whatever you learned in this discussion is more than sufficient for you to pick a simple dataset from your work and go ahead to create a linear regression model on it.
If you are not able to find a dataset for practice, rest assured: you can download a practice dataset for Linear Regression. This is a toy dataset that I have created for your practice so that you can gain the necessary confidence.
Further, if you want to speed up the process of learning Marketing Analytics you can consider taking up this Data Scientist with Python career track on DataCamp. In order to help you get started with the career track, I have crafted a study plan for you so that you can sail through the course with ease.
Let’s learn some more Marketing Analytics in our next discussion.
Regression analysis is perhaps one of the most widely used statistical methods for investigating or estimating the relationship between a set of independent and dependent variables. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.
Regression analysis is also used as a blanket term for various data analysis techniques utilized in quantitative research for modeling and analyzing numerous variables. In the regression method, the independent variable is a predictor or an explanatory element, and the dependent variable is the outcome or a response to a specific query.
Regression analysis is often used to model or analyze data. Most survey analysts use it to understand the relationship between the variables, which can be further utilized to predict the precise outcome.
For example, suppose a soft drink company wants to expand its manufacturing unit to a newer location. Before moving forward, the company wants to analyze its revenue generation model and the various factors that might impact it. Hence, the company conducts an online survey with a specific questionnaire.
After using regression analysis, it becomes easier for the company to analyze the survey results and understand the relationship between different variables like electricity and revenue – here, revenue is the dependent variable.
In addition, understanding the relationship between different independent variables like pricing, number of workers, and logistics with the revenue helps the company estimate the impact of varied factors on sales and profits.
Survey researchers often use this technique to examine and find a correlation between different variables of interest. It provides an opportunity to gauge the influence of different independent variables on a dependent variable.
Overall, regression analysis saves the survey researchers’ additional efforts in arranging several independent variables in tables and testing or calculating their effect on a dependent variable. Different types of analytical research methods are widely used to evaluate new business ideas and make informed decisions.
Researchers usually start by learning linear and logistic regression first. Due to the widespread knowledge of these two methods and ease of application, many analysts think there are only two types of models. Each model has its own specialty and ability to perform if specific conditions are met.
This blog explains seven commonly used types of regression analysis methods that can be used to interpret data in various formats.
Linear regression is one of the most widely known modeling techniques, as it is among the first regression analysis methods people pick up when learning predictive modeling. Here, the dependent variable is continuous, and the independent variable is usually continuous or discrete, with a linear regression line.
Please note that multiple linear regression has more than one independent variable, unlike simple linear regression, which has only one. Thus, linear regression is best used only when there is a linear relationship between the independent and the dependent variable.
A business can use linear regression to measure the effectiveness of the marketing campaigns, pricing, and promotions on sales of a product. Suppose a company selling sports equipment wants to understand if the funds they have invested in the marketing and branding of their products have given them substantial returns or not.
Linear regression is the best statistical method to interpret the results. The best thing about linear regression is it also helps in analyzing the obscure impact of each marketing and branding activity, yet controlling the constituent’s potential to regulate the sales.
If the company is running two or more advertising campaigns simultaneously, one on television and two on radio, then linear regression can easily analyze the independent and combined influence of running these advertisements together.
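As a rough illustration of such an analysis in Python (with entirely hypothetical spending and sales figures), one could fit a model like this:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical campaign data: TV spend, radio spend, and resulting sales
campaign_data = pd.DataFrame({
    "tv_spend":    [10, 15, 20, 25, 30, 35],
    "radio_spend": [5, 8, 6, 10, 12, 9],
    "sales":       [50, 62, 68, 85, 96, 100],
})

# linear regression of sales on both advertising channels
ols_model = smf.ols("sales ~ tv_spend + radio_spend", data=campaign_data).fit()
print(ols_model.summary())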
Logistic regression is commonly used to determine the probability of event success and event failure. It is used whenever the dependent variable is binary, like 0/1, True/False, or Yes/No. Thus, logistic regression is typically used to analyze close-ended survey questions with categorical (e.g. yes/no) responses.
Please note that, unlike linear regression, logistic regression does not need a linear relationship between the dependent and the independent variables. Logistic regression applies a non-linear log transformation to predict the odds ratio; therefore, it easily handles various types of relationships between a dependent and an independent variable.
Logistic regression is widely used to analyze categorical data, particularly for binary response data in business data modeling. More often, logistic regression is used when the dependent variable is categorical, like to predict whether the health claim made by a person is real(1) or fraudulent, to understand if the tumor is malignant(1) or not.
Businesses use logistic regression to predict whether the consumers in a particular demographic will purchase their product or will buy from the competitors based on age, income, gender, race, state of residence, previous purchase, etc.
Polynomial regression is commonly used to analyze curvilinear data when an independent variable's power is more than 1. In this regression analysis method, the best-fit line is never a 'straight line' but always a 'curved line' that fits the data points.
Please note that polynomial regression is better to use when two or more variables have exponents and a few do not.
Additionally, it can model non-linearly separable data offering the liberty to choose the exact exponent for each variable, and that too with full control over the modeling features available.
When combined with response surface analysis, polynomial regression is considered one of the sophisticated statistical methods commonly used in multisource feedback research. Polynomial regression is used mostly in finance and insurance-related industries where the relationship between dependent and independent variables is curvilinear.
Suppose a person wants to budget expense planning by determining how long it would take to earn a definitive sum. Polynomial regression, by taking into account his/her income and predicting expenses, can easily determine the precise time he/she needs to work to earn that specific sum amount.
Stepwise regression is a semi-automated process in which a statistical model is built by adding or removing independent variables based on the t-statistics of their estimated coefficients.
If used properly, stepwise regression will give you more insight into your data than many other methods. It works well when you are working with a large number of independent variables, and it fine-tunes the model by iteratively adding or removing candidate variables.
Stepwise regression analysis is recommended to be used when there are multiple independent variables, wherein the selection of independent variables is done automatically without human intervention.
Please note, in stepwise regression modeling, the variable is added or subtracted from the set of explanatory variables. The set of added or removed variables is chosen depending on the test statistics of the estimated coefficient.
Suppose you have a set of independent variables like age, weight, body surface area, duration of hypertension, basal pulse, and stress index based on which you want to analyze its impact on the blood pressure.
In stepwise regression, the best subset of the independent variable is automatically chosen; it either starts by choosing no variable to proceed further (as it adds one variable at a time) or starts with all variables in the model and proceeds backward (removes one variable at a time).
Thus, using regression analysis, you can calculate the impact of each or a group of variables on blood pressure.
Ridge regression is based on an ordinary least square method which is used to analyze multicollinearity data (data where independent variables are highly correlated). Collinearity can be explained as a near-linear relationship between variables.
Whenever there is multicollinearity, the least-squares estimates remain unbiased, but their variances are large, so they may be far away from the true value. Ridge regression reduces these standard errors by adding some degree of bias to the regression estimates, with the aim of providing more reliable estimates.
Please note that the assumptions of ridge regression are similar to those of least-squares regression, except that normality is not assumed. Although the coefficient values are shrunk in ridge regression, they never reach exactly zero, which means ridge regression cannot perform variable selection.
Suppose you are crazy about two guitarists performing live at an event near you, and you go to watch their performance with a motive to find out who is a better guitarist. But when the performance starts, you notice that both are playing black-and-blue notes at the same time.
Is it possible to find out the best guitarist having the biggest impact on sound among them when they are both playing loud and fast? As both of them are playing different notes, it is substantially difficult to differentiate them, making it the best case of multicollinearity, which tends to increase the standard errors of the coefficients.
Ridge regression addresses multicollinearity in cases like these and includes bias or a shrinkage estimation to derive results.
Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression; however, it penalizes the absolute values of the coefficients (an L1 penalty) instead of the squared values used in ridge regression (an L2 penalty).
It was developed in the 1990s as an alternative to the traditional least-squares estimate, with the intention of reducing the problems related to overfitting when the data has a large number of independent variables.
Lasso has the capability to perform both – selecting variables and regularizing them along with a soft threshold. Applying lasso regression makes it easier to derive a subset of predictors from minimizing prediction errors while analyzing a quantitative response.
Please note that regression coefficients that reach zero after shrinkage are excluded from the lasso model. By contrast, coefficients that remain non-zero are strongly associated with the response variable; the explanatory variables can be quantitative, categorical, or both.
Suppose an automobile company wants to perform a research analysis on average fuel consumption by cars in the US. For samples, they chose 32 models of car and 10 features of automobile design – Number of cylinders, Displacement, Gross horsepower, Rear axle ratio, Weight, ¼ mile time, v/s engine, transmission, number of gears, and number of carburetors.
As you can see, the response variable mpg (miles per gallon) is strongly correlated with some variables like weight, displacement, number of cylinders, and horsepower. The problem can be analyzed by using the glmnet package in R and lasso regression for feature selection.
It is a mixture of ridge and lasso regression models trained with L1 and L2 norms. The elastic net brings about a grouping effect wherein strongly correlated predictors tend to be in/out of the model together. Using the elastic net regression model is recommended when the number of predictors is far greater than the number of observations.
Please note that the elastic net regression model came into existence as an alternative to the lasso regression model, since lasso's variable selection was too dependent on the data, making it unstable. By using elastic net regression, statisticians became capable of bridging the penalties of ridge and lasso regression to get the best out of both models.
A clinical research team having access to a microarray data set on leukemia (LEU) was interested in constructing a diagnostic rule based on the expression level of presented gene samples for predicting the type of leukemia. The data set they had, consisted of a large number of genes and a few samples.
Apart from that, they were given a specific set of samples to be used as training samples, out of which some were infected with type 1 leukemia (acute lymphoblastic leukemia) and some with type 2 leukemia (acute myeloid leukemia).
Model fitting and tuning parameter selection by tenfold CV were carried out on the training data. Then they compared the performance of those methods by computing their prediction mean-squared error on the test data to get the necessary results.
A market research survey focuses on three major metrics: Customer Satisfaction, Customer Loyalty, and Customer Advocacy. Remember, although these metrics tell us about customer health and intentions, they fail to tell us ways of improving the position. Therefore, an in-depth survey questionnaire intended to ask consumers the reason behind their dissatisfaction is definitely a way to gain practical insights.
However, it has been found that people often struggle to put forth their motivation or demotivation or describe their satisfaction or dissatisfaction. In addition to that, people always give undue importance to some rational factors, such as price, packaging, etc. Overall, it acts as a predictive analytic and forecasting tool in market research.
When used as a forecasting tool, regression analysis can determine an organization’s sales figures by taking into account external market data. A multinational company conducts a market research survey to understand the impact of various factors such as GDP (Gross Domestic Product), CPI (Consumer Price Index), and other similar factors on its revenue generation model.
Obviously, regression analysis in consideration of forecasted marketing indicators was used to predict a tentative revenue that will be generated in future quarters and even in future years. However, the further you go into the future, the more unreliable the data becomes, leaving a wider margin of error.
A water purifier company wanted to understand the factors leading to brand favorability. The survey was the best medium for reaching out to existing and prospective customers. A large-scale consumer survey was planned, and a discreet questionnaire was prepared using the best survey tool .
A number of questions related to the brand, favorability, satisfaction, and probable dissatisfaction were effectively asked in the survey. After getting optimum responses to the survey, regression analysis was used to narrow down the top ten factors responsible for driving brand favorability.
All ten derived attributes, in one way or another, highlighted their importance in impacting the favorability of that specific water purifier brand.
It is easy to run a regression analysis using Excel or SPSS, but while doing so, the importance of four numbers in interpreting the data must be understood.
In a few cases, the simple coefficient is replaced by a standardized coefficient demonstrating the contribution from each independent variable to move or bring about a change in the dependent variable.
Do you know utilizing regression analysis to understand the outcome of a business survey is like having the power to unveil future opportunities and risks?
For example, businesses can use past data on a particular television advertisement slot to predict the sales it is likely to generate and, from that, estimate a maximum bid for that slot. The finance and insurance industry as a whole depends a lot on regression analysis of survey data to identify trends and opportunities for more accurate planning and decision-making.
Do you know businesses use regression analysis to optimize their business processes?
For example, before launching a new product line, businesses conduct consumer surveys to better understand the impact of various factors on the product’s production, packaging, distribution, and consumption.
A data-driven foresight helps eliminate the guesswork, hypothesis, and internal politics from decision-making. A deeper understanding of the areas impacting operational efficiencies and revenues leads to better business optimization.
Business surveys today generate a lot of data related to finance, revenue, operation, purchases, etc., and business owners are heavily dependent on various data analysis models to make informed business decisions.
For example, regression analysis helps enterprises to make informed strategic workforce decisions. Conducting and interpreting the outcome of employee surveys like Employee Engagement Surveys, Employee Satisfaction Surveys, Employer Improvement Surveys, Employee Exit Surveys, etc., boosts the understanding of the relationship between employees and the enterprise.
It also helps get a fair idea of certain issues impacting the organization’s working culture, working environment, and productivity. Furthermore, intelligent business-oriented interpretations reduce the huge pile of raw data into actionable information to make a more informed decision.
By knowing how to use regression analysis for interpreting survey results, one can easily provide factual support to management for making informed decisions. But did you know that it also helps in keeping faults out of judgment?
For example, a mall manager thinks that if he extends the closing time of the mall, it will result in more sales. Regression analysis may contradict this belief by showing that the predicted increase in revenue from additional sales would not cover the increased operating expenses arising from longer working hours.
Regression analysis is a useful statistical method for modeling and comprehending the relationships between variables. It provides numerous advantages to various data types and interactions. Researchers and analysts may gain useful insights into the factors influencing a dependent variable and use the results to make informed decisions.
With QuestionPro Research, you can improve the efficiency and accuracy of regression analysis by streamlining the data gathering, analysis, and reporting processes. The platform’s user-friendly interface and wide range of features make it a valuable tool for researchers and analysts conducting regression analysis as part of their research projects.
Sign up for the free trial today and let your research dreams fly!
A Concise Guide to Market Research: The Process, Data, and Methods Using IBM SPSS Statistics, by Marko Sarstedt and Erik Mooi (Springer Texts in Business and Economics).
This book offers an easily accessible and comprehensive guide to the entire market research process, from asking market research questions to collecting and analyzing data by means of quantitative methods. It is intended for all readers who wish to know more about the market research process, data management, and the most commonly used methods in market research. The book helps readers perform analyses, interpret the results, and make sound statistical decisions using IBM SPSS Statistics. Hypothesis tests, ANOVA, regression analysis, principal component analysis, factor analysis, and cluster analysis, as well as essential descriptive statistics, are covered in detail. Highly engaging and hands-on, the book includes many practical examples, tips, and suggestions that help readers apply and interpret the data analysis methods discussed.
The new edition uses IBM SPSS version 25.
Marko Sarstedt is chaired professor of Marketing at the Otto-von-Guericke-University Magdeburg (Germany). His main research is in the application and advancement of structural equation modeling methods to further the understanding of consumer behavior and to improve marketing decision-making. His research has been published in journals such as Journal of Marketing Research, Journal of the Academy of Marketing Science, Organizational Research Methods, MIS Quarterly, and International Journal of Research in Marketing. Marko has co-edited several special issues of leading journals and co-authored several widely adopted textbooks, including “A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM)” (together with Joe F. Hair, G. Tomas M. Hult, and Christian M. Ringle).
Erik Mooi is senior lecturer at the University of Melbourne (Australia). His main interest is in business-to-business marketing, and he works on topics such as outsourcing, inter-firm contracting, innovation, technology licensing, and franchising using advanced econometrics. His research has been published in journals such as the Journal of Marketing, the Journal of Marketing Research, the International Journal of Research in Marketing, and the Journal of Business Research. He is also program director at the Centre for Workplace Leadership, a fellow at the EU Centre for Shared Complex Challenges, as well as a fellow at the Centre for Business Analytics at Melbourne Business School.
Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.
The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).
Regression model, basically, specifies the relation of dependent variable (Y) to a function combination of independent variables (X) and unknown parameters (β)
Y ≈ f (X, β)
A regression equation can be used to predict the values of 'y' if the value of 'x' is given, where 'y' and 'x' are two sets of measures for a sample of size 'n'. The formulae for the regression equation are:
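y = a + bx, where
b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
a = (Σy − b Σx) / n
These are the standard least-squares estimates for a simple regression with one independent variable.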
Do not be intimidated by the visual complexity of the correlation and regression formulae above. You do not have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS, and others.
Linear regression analysis is based on the following set of assumptions:
1. Assumption of linearity . There is a linear relationship between dependent and independent variables.
2. Assumption of homoscedasticity. The variance of the residuals (errors) is constant across all levels of the independent variables.
3. Assumption of absence of collinearity or multicollinearity . There is no correlation between two or more independent variables.
4. Assumption of normal distribution. The residuals (errors) of the model are normally distributed.
My e-book, The Ultimate Guide to Writing a Dissertation in Business Studies: a step by step assistance offers practical assistance to complete a dissertation with minimum or no stress. The e-book covers all stages of writing a dissertation starting from the selection to the research area to submitting the completed version of the work within the deadline. John Dudovskiy
Regression analysis plays a vital role in contemporary market research, offering a powerful tool for making accurate forecasts and addressing intricate interdependencies within challenges and decisions. It enables us to predict user behavior and gain valuable insights for optimising business strategies. This article aims to elucidate the concept of regression analysis, delve into its working principles, and explore its applications in the field of market research.
Regression analysis serves as a statistical method and acts as a translator within the realm of market research, enabling the conversion of ambiguous or complex data into concise and understandable information.
By investigating the relationship between two or more variables, regression analysis sheds light on crucial interactions, such as the correlation between user behavior and screen time in smartphone applications.
Regression analysis serves multiple purposes.
Regression analysis traces its roots back to the late 19th century when it was pioneered by the renowned British statistician, Sir Francis Galton. Galton explored variables within human genetics and introduced the concept of regression.
By examining the relationship between parental height and the height of their offspring, Galton laid the foundation for linear regression analysis. Since then, this methodology has found extensive applications not only in market research but also in diverse fields such as psychology, sociology, medicine, and economics.
Precise market analyses with Appinio
Appinio leverages a variety of market research methods to get you the best results for your market research needs. Do you want to determine the potential of a new product or service before launching it onto the market? Then the TURF analysis can help.
Conjoint analysis, on the other hand, collects consumer feedback during the development phase to optimise an idea.
Contact Appinio now and together we will find the optimal approach to your challenge!
Regression analysis encompasses various regression models, each serving specific purposes depending on the research objectives and data availability.
Employing a combination of these techniques allows for in-depth insights into complex phenomena. Here are the key regression models:
Simple linear regression, the classic model, examines the relationship between a dependent variable and a single independent variable, revealing their association. For instance, it can explore how daily coffee consumption (independent variable) impacts daily energy levels (dependent variable).
Multiple linear regression expands upon simple linear regression by incorporating multiple independent variables, such as price, advertising, competition, or sales figures. In the context of energy levels, variables like sleep duration and exercise can be added alongside coffee consumption.
When the relationship between variables deviates from a straight line, non-linear regression comes into play. This is particularly useful for phenomena like exponential growth in app downloads or user numbers, where traditional linear models may not be suitable.
For complex correlations or patterns characterised by ups and downs, quadratic regression is utilised.
It fits data that follows non-linear trends, such as seasonal sales fluctuations. For instance, it can help determine market saturation points, where growth typically plateaus after an initial rapid expansion.
Hierarchical regression allows the researcher to control the order of variables in a model, enabling the assessment of each independent variable's contribution to predicting the dependent variable.
For example, in demographic-based analyses, variables like age, gender, or education can be entered first, with further predictors added in later steps, so that each step's contribution can be read off separately.
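In practice this usually means comparing nested models and looking at the change in R-squared from one block to the next. The sketch below uses simulated data and hypothetical variable names.

```python
# Hierarchical regression: predictors entered in blocks, incremental R-squared per block
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
age       = rng.uniform(18, 65, n)
education = rng.integers(1, 6, n)       # 1 = low ... 5 = high
adspend   = rng.uniform(0, 1000, n)
purchases = 5 + 0.05*age + 0.8*education + 0.01*adspend + rng.normal(0, 2, n)

# Block 1: demographics only
m1 = sm.OLS(purchases, sm.add_constant(np.column_stack([age, education]))).fit()
# Block 2: demographics plus advertising spend
m2 = sm.OLS(purchases, sm.add_constant(np.column_stack([age, education, adspend]))).fit()

print("R-squared, block 1:", round(m1.rsquared, 3))
print("R-squared, block 2:", round(m2.rsquared, 3))
print("incremental R-squared from adspend:", round(m2.rsquared - m1.rsquared, 3))
```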
Multinomial logistic regression examines the probabilities of outcomes that have more than two possible categories, making it valuable for complex questions.
For instance, a music app may predict users' favourite genres based on their previous preferences, listening habits, and other factors like age, gender, or listening time, enabling personalised recommendations.
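A small sketch of this idea, with a simulated data set in which a user's favourite genre depends loosely on age; the genre labels, cut-offs, and listening-time variable are all hypothetical.

```python
# Multinomial logistic regression: predicting favourite genre (0 = rock, 1 = pop, 2 = electronic)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
age = rng.uniform(15, 60, n)
listening_hours = rng.uniform(1, 30, n)

# Simulated labels: genre shifts with age (purely for illustration)
genre = np.digitize(age + rng.normal(0, 5, n), bins=[25, 40])

X = np.column_stack([age, listening_hours])
clf = LogisticRegression(max_iter=1000).fit(X, genre)   # handles three classes natively

# Predicted class probabilities for a 25-year-old who listens 20 hours a week
print(clf.predict_proba([[25, 20]]))
```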
When multiple dependent variables and their interactions with independent variables need to be explored, multivariate regression analysis is employed.
For instance, in the context of fitness data, it can assess how factors such as diet, sleep, or exercise intensity influence variables like weight and health status.
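Here is a sketch with two dependent variables (weight change and a health score) regressed on the same predictors at once; all names and numbers are hypothetical.

```python
# Multivariate regression: two outcome variables modelled from the same predictors
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 400
diet_quality  = rng.uniform(1, 10, n)    # 1 = poor ... 10 = excellent
sleep_hours   = rng.uniform(5, 9, n)
exercise_mins = rng.uniform(0, 90, n)
X = np.column_stack([diet_quality, sleep_hours, exercise_mins])

weight_change = -0.3*diet_quality - 0.05*exercise_mins + rng.normal(0, 1, n)
health_score  = 2 + 0.5*diet_quality + 0.4*sleep_hours + 0.03*exercise_mins + rng.normal(0, 1, n)
Y = np.column_stack([weight_change, health_score])

model = LinearRegression().fit(X, Y)
print(model.coef_)   # one row of coefficients per dependent variable
```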
Binary logistic regression comes into play when the dependent variable has only two possible outcomes, such as yes or no. It can be used to predict whether a specific product will be purchased by a target group, and factors like age, income, or gender can further segment the buyer groups.
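A minimal sketch of such a purchase model on simulated data; the coefficients used to generate the data, and the example customer, are made up.

```python
# Binary logistic regression: buy (1) versus not buy (0), simulated data
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
age    = rng.uniform(18, 70, n)
income = rng.uniform(20, 120, n)                     # thousands of euros
logit  = -4 + 0.02*age + 0.03*income                 # true log-odds used for simulation
buy    = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, income])
clf = LogisticRegression().fit(X, buy)

# Estimated purchase probability for a 35-year-old earning 60k
print(clf.predict_proba([[35, 60]])[:, 1])
```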
The versatility of regression analysis is reflected in its diverse applications within the field of market research. Here are selected examples of how regression analysis is utilised:
Suppose a company aims to determine the relationship between advertising spending and product sales, requiring a simple linear regression analysis. A typical analysis moves through a handful of steps: collect the data on budgets and sales, prepare and inspect it, fit the regression model, evaluate the fit (for example via R-squared), and use the estimated equation to predict sales for planned budgets, as the sketch below illustrates.
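The following Python sketch walks through those steps on simulated data; the budgets, sales figures, and effect size are invented for illustration.

```python
# Advertising spend versus sales: fit, inspect, and predict (simulated data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
adspend = rng.uniform(0, 1000, n)                    # advertising budget in euros
sales   = 50 + 0.1*adspend + rng.normal(0, 20, n)    # product sales in units

# Fit Sales = b0 + b1 * adspend by ordinary least squares
X = sm.add_constant(adspend)
fit = sm.OLS(sales, X).fit()
print(fit.summary())                                 # coefficients, R-squared, p-values

# Predict expected sales for a planned budget of 500 euros
b0, b1 = fit.params
print("predicted sales at a 500-euro budget:", round(b0 + b1 * 500, 1))
```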
Regression analysis stands as a powerful and versatile tool in the realm of market research. It offers a range of regression models, varying in complexity depending on the research question or objective at hand. Whether investigating the relationship between advertising spend and sales, analysing usage behavior, or identifying market trends, regression analysis provides data-driven insights that empower informed and sound decision-making.
Interested in running your own regression analysis?
Then register directly on our platform and get in touch with our experts.
Regression analysis is also a staple of customer satisfaction research: from overall satisfaction to satisfaction with product quality and price, it measures the strength of the relationship between different variables.
While correlation analysis provides a single numeric summary of a relation (“the correlation coefficient”), regression analysis results in a prediction equation, describing the relationship between the variables. If the relationship is strong – expressed by the Rsquare value – it can be used to predict values of one variable given the other variables have known values. For example, how will the overall satisfaction score change if satisfaction with product quality goes up from 6 to 7?
Regression analysis can be used in customer satisfaction and employee satisfaction studies to answer questions such as: “Which product dimensions contribute most to someone’s overall satisfaction or loyalty to the brand?” This is often referred to as Key Drivers Analysis.
It can also be used to simulate the outcome when actions are taken. For example: “What will happen to the satisfaction score when product availability is improved?”
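A sketch of such a key drivers analysis on simulated survey data; the drivers, rating scales, and coefficients are hypothetical.

```python
# Key drivers analysis: overall satisfaction regressed on quality, price, and availability
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 500
quality      = rng.integers(1, 11, n)    # 1-10 satisfaction ratings
price        = rng.integers(1, 11, n)
availability = rng.integers(1, 11, n)
overall = 1 + 0.5*quality + 0.2*price + 0.3*availability + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([quality, price, availability]))
fit = sm.OLS(overall, X).fit()
print(fit.params)    # the largest coefficient points to the strongest driver

# Simulation question: how much does overall satisfaction move if quality goes from 6 to 7?
b_quality = fit.params[1]
print("expected gain in overall satisfaction:", round(b_quality, 2))
```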