
Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables, then it is reasonable to use a best fit line to make predictions for y given x within the domain of x-values in the sample data, but not necessarily for x-values outside that domain. For example, you could use the line to predict the final exam score for a student who earned a grade of 73 on the third exam.
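As a concrete sketch of that kind of prediction, the snippet below evaluates a fitted line of the form ŷ = a + bx at x = 73 and declines to extrapolate outside the observed x-range. The intercept, slope, and domain limits are hypothetical placeholders, not values computed from the exam-score data.

```python
# A minimal sketch of prediction with a fitted line ŷ = a + b*x.
# The coefficients and the x-domain below are hypothetical placeholders,
# not values computed from the third exam / final exam data.
a = -170.0              # hypothetical intercept
b = 4.8                 # hypothetical slope
x_min, x_max = 65, 75   # hypothetical range of x-values observed in the sample

def predict(x):
    """Return ŷ = a + b*x, but only for x inside the observed domain."""
    if not (x_min <= x <= x_max):
        raise ValueError("x lies outside the observed x-values; the prediction may not be reliable")
    return a + b * x

print(predict(73))  # predicted final exam score for a third exam grade of 73
```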

The process of fitting the best-fit line is called linear regression. The idea behind finding the best-fit line is based on the assumption that the data are scattered about a straight line. The criterion for the best fit line is that the sum of the squared errors (SSE) is minimized, that is, made as small as possible. Any other line you might choose would have a higher SSE than the best fit line. This best fit line is called the least-squares regression line.

Using calculus, you can determine the values of a and b that make the SSE a minimum. When you have made the SSE a minimum, you have determined the line of best fit, ŷ = a + bx, where a = ȳ − bx̄ and b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)². The sample means of the x values and the y values are x̄ and ȳ, respectively. The best fit line always passes through the point (x̄, ȳ). The slope b can also be written as b = r(s_y / s_x), where s_y is the standard deviation of the y values, s_x is the standard deviation of the x values, and r is the correlation coefficient, which is discussed in the next section.
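The formulas for a and b translate directly into code. The sketch below is a minimal illustration using made-up (x, y) pairs, not the exam-score data; it then checks two facts stated above, that b equals r(s_y / s_x) and that the fitted line passes through (x̄, ȳ).

```python
# A minimal sketch of the least-squares formulas, applied to made-up data.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up x values
y = [2.0, 3.9, 6.2, 8.1, 9.8]          # made-up y values

mean_x = sum(x) / len(x)               # x̄
mean_y = sum(y) / len(y)               # ȳ

# b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))

# a = ȳ − b·x̄
a = mean_y - b * mean_x

print(f"fitted line: ŷ = {a:.2f} + {b:.2f}x")

# The line passes through (x̄, ȳ): a + b·x̄ reproduces ȳ.
print(a + b * mean_x, mean_y)

# The slope can also be written as b = r·(s_y / s_x).
s_x = statistics.stdev(x)
s_y = statistics.stdev(y)
r = statistics.correlation(x, y)       # requires Python 3.10+
print(b, r * (s_y / s_x))              # the two values agree
```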

The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line. If the observed data point lies above the line, the residual is positive, and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative, and the line overestimates the actual data value for y. In the diagram in Figure 12.10, y₀ – ŷ₀ = ε₀ is the residual for the point shown; here the point lies above the line and the residual is positive. For each data point, you can calculate the residual or error, yᵢ – ŷᵢ = εᵢ for i = 1, 2, 3, ..., 11. For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points and therefore 11 residuals. If you square each residual and add the squares, ε₁² + ε₂² + ... + ε₁₁², you get a single number that measures how far the line is from the data overall. This sum is called the Sum of Squared Errors (SSE).
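To make the residuals and the SSE concrete, the sketch below computes εᵢ = yᵢ – ŷᵢ for each point and sums the squares. It reuses the made-up data and the fitted coefficients from the previous sketch; none of these numbers come from the exam-score example.

```python
# A minimal sketch of residuals and SSE, continuing the made-up data above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.9, 6.2, 8.1, 9.8]
a, b = 0.06, 1.98                       # least-squares fit from the previous sketch

y_hat = [a + b * xi for xi in x]                    # predicted values ŷ_i
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # ε_i = y_i − ŷ_i
sse = sum(e ** 2 for e in residuals)                # Sum of Squared Errors

for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    position = "above" if e > 0 else "below"
    print(f"x = {xi}: y = {yi}, ŷ = {yh:.2f}, residual = {e:+.2f} (point lies {position} the line)")

print(f"SSE = {sse:.4f}")
```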

The term y₀ – ŷ₀ = ε₀ is called the "error" or residual. It is not an error in the sense of a mistake.
