What Is R2 Linear Regression?

Updated February 24, 2020

By Kevin Beck

Reviewed by: Lana Bandoim, B.S.

In sports, hard work in training and practice sessions is often rewarded with high placings in competitions and games (in a proportional way). In other words, the old-school refrain of "No pain, no gain!" rings with a lot of truth, although a more optimistic framing of the same idea is, "The harder you objectively work, the greater your level of objective success."

You could test this idea by choosing 100 distance runners at random (perhaps using an online survey to recruit participants) and having them race each other over a distance of 5 kilometers (3.1 miles). You could also ask them to report how many miles per week they ran on average in the three months before the test.

If you then plotted a graph of 5K speed vs. average miles per week, you would expect to see a positive correlation between training and performance. But would this be a "perfect" correlation? In other words, can you think of reasons to expect data points that would deviate from the predicted relationship between training volume and 5K speed?

Welcome to the world of linear regression analysis, a marvelous and usually quite interesting tool to help scrutinize and quantify relationships between apparently related variables. In addition to the example above, you can imagine countless others (e.g., rainfall vs. vegetation level; income vs. access to medical care in the U.S.) of personal and civic interest.

Read on for more than you ever expected to know about matters related to the now-famous "R-squared formula" in statistics.

About Linear Equations

A linear equation is so named because it produces a straight line when graphed using x and y coordinates. It can be expressed in the form:

y = a + bx

In this scheme, a and b are constants, x is called the independent variable, and y is known as the dependent variable. Another way to state this relationship is "the variation of y with x."

What this translates to in the real world is that x is usually a variable you can control or pick in an experiment or analysis (such as the number of miles run), and y is a variable that seems to have some kind of dependency on x (such as running speed).

Example: Graph the equation y = 5x − 7.

In linear equations, a is known as the y-intercept: the value of y where the graph crosses the y-axis (that is, where x = 0). In the example above, a = −7. If a line never crosses the y-axis, it is a vertical line, and its equation assumes the form x = a constant. Such a graph does not establish anything at all about y as a function of x and cannot be put in the form y = a + bx.

The constant b is called the slope of the line, familiarly known as "rise over run" in introductory mathematics courses. It can be positive (represented by an upward-sloping line in relation to the x- and y-axes), negative (a downward-sloping line) or 0 (a horizontal line).
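As a minimal sketch of the example equation y = 5x − 7 (so a = −7 and b = 5), the Python snippet below evaluates the line at a few x values and then recovers the y-intercept and slope from the computed points. The function name `line` and the sample x values are illustrative choices, not part of the original example.

```python
def line(x, a=-7, b=5):
    """Return y = a + b*x for the example line y = 5x - 7."""
    return a + b * x

# Evaluate the line at a few arbitrary x values.
points = [(x, line(x)) for x in (-1, 0, 1, 2)]

# The y-intercept is the value of y where x = 0.
y_intercept = line(0)

# The slope is "rise over run" between any two points on the line.
(x1, y1), (x2, y2) = points[0], points[-1]
slope = (y2 - y1) / (x2 - x1)

print(points)
print("y-intercept:", y_intercept)
print("slope:", slope)
```

Any two distinct points on the line give the same slope, which is what makes a single constant b enough to describe it.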

What Is Correlation Between Variables?

Above, you were invited to consider the impact of a variable behavior (physical training) on an outcome (a 5K time) proposed to hinge to some unknown but considerable extent on that variable behavior.

By choosing a sizable number of subjects for your analysis (N = 100), you aim to determine whether a meaningful and reproducible relationship exists; if you only looked at three or four runners and one or two happened to have a cold on test day, the results would be less helpful.

If you charged $10 for an app that you developed and somehow had no start-up or maintenance costs, your profit would just be the number of units you sold times ten: y = 10x. There would thus be a "perfect," or invariant, correlation between the number of units sold and profit. If you plotted the graph, a single line would obviously join all the points.

But what about correlations that are clearly in play but are not "perfect"? In science, this is in fact the case most of the time, and linear regression analysis is the tool scientists use to measure the strength of the relationships they find between variables.

What Is Confounding in Statistics?

Imagine sampling 1,000 people from the U.S. population who report consuming more than three cups of coffee per day and comparing the collective rate of lung cancer in this group to the lung-cancer rate of 1,000 randomly chosen Americans who report drinking no coffee at all. Would you be surprised to find that the coffee-drinking group wound up experiencing significantly more lung cancers than the abstainers?

If you're already thinking that either the study design was flawed, or there is something insidious and previously unknown about coffee, you're on the right track. It would perhaps not be surprising to find that the rate of cigarette smoking is far higher among heavy coffee drinkers than in people who drink moderate amounts or none at all.

In this case, cigarette smoking is known as a confounding variable. Because it is associated with the independent variable (coffee drinking) and also has its own measurable effect on the outcome of interest (lung cancer), it can make coffee appear responsible for a difference that smoking actually produces. Statisticians and researchers have to be able to control for such confounding variables when designing studies and analyzing the data these produce.
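One way to see how a confounder distorts a comparison is to split the groups by the suspected confounder. The sketch below does this in Python with counts that are entirely hypothetical (invented for illustration, not taken from any study): smokers are assumed to develop lung cancer at the same rate whether or not they drink coffee, and likewise for non-smokers.

```python
# Hypothetical counts, invented purely for illustration (not real data).
# Key: (coffee group, smoker?) -> (people, lung-cancer cases)
counts = {
    ("coffee", True): (600, 60),
    ("coffee", False): (400, 4),
    ("no coffee", True): (200, 20),
    ("no coffee", False): (800, 8),
}

def rate(groups):
    """Cases per person, pooled over the given (people, cases) pairs."""
    people = sum(p for p, _ in groups)
    cases = sum(c for _, c in groups)
    return cases / people

# Crude comparison: ignore smoking entirely.
crude = {
    coffee: rate([v for (c, _), v in counts.items() if c == coffee])
    for coffee in ("coffee", "no coffee")
}

# Stratified comparison: compare coffee groups within each smoking stratum.
stratified = {
    (coffee, smoker): rate([counts[(coffee, smoker)]])
    for coffee in ("coffee", "no coffee")
    for smoker in (True, False)
}

print("crude rates:", crude)
print("rates within smoking strata:", stratified)
```

With these made-up numbers, the crude rate is higher among coffee drinkers, yet within each smoking stratum the two coffee groups have identical rates. The apparent coffee effect comes entirely from the larger share of smokers in the coffee group.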

About Regression Analysis

Say you carry out your training-versus-5K time analysis, and much to your delight, you see that there is in fact a relationship between work and results: Those who report more rigorous preparation tend to have faster times. But the graph is not a line by any means; instead, it is a sort of cloud that looks like a line could be run through it and capture the mathematical "essence" of the cloud of points, called a scatter plot.

In order to perform what is called a linear regression analysis, which is the process used to determine a best line of fit in a scatter plot, you must be able to make two assumptions. One is that the relationship is in fact linear rather than, say, curvilinear, as when y varies with a power of x or exponentially with x.

The other is that y is a continuous variable, that is, not a discrete variable such as the number of classes (1, 2 or 3) taken in a semester.

In a graph of 5K speed vs. training volume for your 100 subjects, there is no true line representing the graph. That means that there is also no real slope or y-intercept. There is, however, a line that best fits the plotted points: the one that minimizes the sum of the squared vertical distances between the line and the individual data points. This line produces an estimate of the y-intercept and slope, and the equation describing it is of the form noted above:

ŷ = a + bx

ŷ is called "y hat," and the graph is called a line of best fit or, for reasons soon to become clear, a least-squares line.
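To make the "least-squares" idea concrete, the sketch below scores two candidate lines by the sum of their squared vertical errors. The five data points are hypothetical (invented for illustration), as are the two candidate (a, b) pairs; the first pair happens to be the least-squares line for this small data set and the second is a nearby alternative.

```python
# Hypothetical data for five runners (invented for illustration):
# x = average miles per week, y = 5K speed in km/h.
miles = [10, 20, 30, 40, 50]
speed = [11, 13, 14, 13, 14]

def sum_squared_errors(a, b):
    """Sum of squared vertical gaps between the data and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(miles, speed))

# Two candidate lines, chosen here for illustration.
sse_first = sum_squared_errors(11.2, 0.06)
sse_second = sum_squared_errors(11.0, 0.07)

print("first line: ", round(sse_first, 2))
print("second line:", round(sse_second, 2))
```

The line with the smaller total is the better fit in the least-squares sense; the best-fit line is the one for which no other choice of a and b gives a smaller total.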

As you may have determined, you aren't expected to solve these equations by hand. Not only will your calculator perform this function for you, but you can also use any number of online tools to do the job for you (see the Resources for an example).

What Is the Correlation Coefficient r?

In the above equation, the constants a and b are estimates derived from the mean values of x and y in the sample (such as average training volume and average 5K time), written as x̅ and y̅. The derivation is too extensive for this discussion, but for completeness' sake,

a = y̅ − bx̅

b = ∑[(x − x̅)(y − y̅)] / ∑(x − x̅)²

The constant b is built from deviations from the means. The numerator adds up the products of each point's x and y deviations, which captures how x and y vary together; the denominator adds up the squared x deviations, which captures how much x varies on its own. Their ratio is the slope of the best-fit line.
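The two formulas for a and b translate almost directly into code. The following is a minimal Python sketch under stated assumptions: the five data points are hypothetical (invented for illustration), and the function name `least_squares_line` is an illustrative choice.

```python
# Hypothetical data for five runners (invented for illustration):
# x = average miles per week, y = 5K speed in km/h.
miles = [10, 20, 30, 40, 50]
speed = [11, 13, 14, 13, 14]

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator of b: sum of products of x and y deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of b: sum of squared x deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares_line(miles, speed)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Because a is computed as y̅ − bx̅, the fitted line always passes through the point (x̅, y̅).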

The expression for the constant b above can be written:

b = r(S_y/S_x),

where S_y and S_x are the standard deviations of the y and x values in the set, respectively. At last, you have arrived at a key quantity in regression analysis: the correlation coefficient r, which can vary between −1.0 and 1.0.

r is the bottom item on the output screen of the LinRegTTest on TI-83, TI-83+ and TI-84+ calculators.
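If you do not have a TI calculator at hand, the relationship b = r(S_y/S_x) can be checked numerically. The sketch below uses Python's standard `statistics` module and a hypothetical five-point data set (invented for illustration); it computes r from the deviations and then compares the slope obtained both ways.

```python
import math
import statistics

# Hypothetical data for five runners (invented for illustration):
# x = average miles per week, y = 5K speed in km/h.
miles = [10, 20, 30, 40, 50]
speed = [11, 13, 14, 13, 14]

x_bar = statistics.mean(miles)
y_bar = statistics.mean(speed)

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, speed))
sxx = sum((x - x_bar) ** 2 for x in miles)
syy = sum((y - y_bar) ** 2 for y in speed)

# Correlation coefficient r.
r = sxy / math.sqrt(sxx * syy)

# Slope from the deviation formula, and again from r and the
# sample standard deviations S_y and S_x.
b_from_deviations = sxy / sxx
b_from_r = r * statistics.stdev(speed) / statistics.stdev(miles)

print("r =", round(r, 4))
print("slopes agree:", math.isclose(b_from_deviations, b_from_r))
```

The two slopes agree because the factor of n − 1 inside each sample standard deviation cancels in the ratio S_y/S_x.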

What Is the Coefficient of Determination?

The correlation coefficient r on its own is very useful. A value close to 1.0 indicates a near-perfect positive correlation (the app-sales example, with r exactly 1.0, is the limiting case). A value close to −1.0 indicates a strong negative correlation, in which higher values of the independent variable (say, hours spent partying) go along with lower values of the other variable (say, GPA). A value near 0 indicates little or no linear relationship.

A second important quantity in linear regression analysis is the coefficient of determination. In simple linear regression (one x variable and one y variable), the coefficient of determination is always the square of the correlation coefficient r, so it is simply (r)² = r². Note that this value cannot be negative; it ranges from 0 to 1.

The coefficient of determination is not merely a numerical transformation from the correlation coefficient; it also has great explanatory value in many cases. It is usually expressed as a percentage rather than a decimal number, for this is the language statisticians prefer to use when conveying information to other scientists and especially the public.

Why Use the r² Value?

First, it is useful to know what r² actually represents. It is best defined as the percentage of variation in the dependent or predicted variable (y) that can be explained by variation in the independent or explanatory variable (x) using the best-fit line generated by the regression analysis.
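The "explained variation" reading of r² can be checked directly by comparing the total variation in y with the variation left over after fitting the line. The following is a minimal Python sketch using a hypothetical five-point data set (invented for illustration); the names `ss_total` and `ss_residual` are illustrative labels for those two sums of squares.

```python
# Hypothetical data for five runners (invented for illustration):
# x = average miles per week, y = 5K speed in km/h.
miles = [10, 20, 30, 40, 50]
speed = [11, 13, 14, 13, 14]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(speed) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, speed))
sxx = sum((x - x_bar) ** 2 for x in miles)

# Best-fit line y-hat = a + b*x.
b = sxy / sxx
a = y_bar - b * x_bar

# Total variation in y, and the variation the line leaves unexplained.
ss_total = sum((y - y_bar) ** 2 for y in speed)
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(miles, speed))

r_squared = 1 - ss_residual / ss_total

# The same value, obtained by squaring the correlation coefficient r.
r = sxy / (sxx * ss_total) ** 0.5

print(f"explained by x:     {r_squared:.0%}")
print(f"not explained by x: {1 - r_squared:.0%}")
print(f"r squared:          {r ** 2:.2f}")
```

Both routes give the same number for this data set, which is why the coefficient of determination in simple linear regression can be read either as "r squared" or as the share of variation in y accounted for by the line.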

If the value of r² in your running study turned out to be 0.64, you could state that 64 percent of the variation in 5K times was explained by differences in training volume. (Quick quiz: What values of r could result in a coefficient of determination of 0.64?)

By the same token, the value 1 – r², expressed as a percentage, represents the percent of variation in y that is not explained by variation in x. This may appear to be a trivially true result, but in some cases, you may be more explicitly interested in differences rather than similarities.

In your running analysis, if you did not divide your subjects into categories based on factors such as age, sex and general health, you could expect to have a number of confounding variables in your analysis, thus driving down the value of r² and exposing the limits of the investigative power of your analysis.

Linear Regression Calculator

In the Resources, you'll find an example of a tool that allows you to input as many x and y values as you wish from a data set and perform a linear regression, generating r and r² in the process. Playing around with increasingly large data sets and tinkering with the variation by "feel" is a great way to familiarize yourself with linear regression and its graphical implications.

References

Resources
