# ASTM E3080-16

Designation: E3080 − 16

An American National Standard

Standard Practice for Regression Analysis¹

This standard is issued under the fixed designation E3080; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.

## 1. Scope

1.1 This practice covers regression analysis methodology for estimating, evaluating, and using the simple linear regression model to define the relationship between two numerical variables.

1.2 The system of units for this practice is not specified. Dimensional quantities in the practice are presented only as illustrations of calculation methods. The examples are not binding on products or test methods treated.

1.3 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety and health practices and determine the applicability of regulatory limitations prior to use.

## 2. Referenced Documents

2.1 ASTM Standards:²

- E456 Terminology Relating to Quality and Statistics
- E2282 Guide for Defining the Test Result of a Test Method
- E2586 Practice for Calculating and Using Basic Statistics

## 3. Terminology

3.1 Definitions: Unless otherwise noted, terms relating to quality and statistics are as defined in Terminology E456.

3.1.1 characteristic, n: a property of items in a sample or population which, when measured, counted, or otherwise observed, helps to distinguish among the items.
E2282

3.1.2 coefficient of determination, $r^2$, n: square of the correlation coefficient.

3.1.3 confidence interval, n: an interval estimate [L, U] with the statistics L and U as limits for the parameter θ and with confidence level 1 – α, where Pr(L ≤ θ ≤ U) ≥ 1 – α. E2586

3.1.3.1 Discussion: The confidence level, 1 – α, reflects the proportion of cases that the confidence interval [L, U] would contain or cover the true parameter value in a series of repeated random samples under identical conditions. Once L and U are given values, the resulting confidence interval either does or does not contain it. In this sense “confidence” applies not to the particular interval but only to the long run proportion of cases when repeating the procedure many times.

3.1.4 confidence level, n: the value, 1 – α, of the probability associated with a confidence interval, often expressed as a percentage. E2586

3.1.4.1 Discussion: α is generally a small number. Confidence level is often 95 % or 99 %.

3.1.5 correlation coefficient, n: for a population, ρ, a dimensionless measure of association between two variables X and Y, equal to the covariance divided by the product of $\sigma_X$ times $\sigma_Y$.

3.1.6 correlation coefficient, n: for a sample, r, the estimate of the parameter ρ from the data.

3.1.7 covariance, n: of a population, cov(X, Y), for two variables, X and Y, the expected value of $(X - \mu_X)(Y - \mu_Y)$.

3.1.8 covariance, n: of a sample, the estimate of the parameter cov(X, Y) from the data.

3.1.9 dependent variable, n: a variable to be predicted using an equation.

3.1.10 degrees of freedom, n: the number of independent data points minus the number of parameters that have to be estimated before calculating the variance. E2586

3.1.11 deviation, d, n: the difference of an observed value from its mean.

3.1.12 estimate, n: sample statistic used to approximate a population parameter.
E2586

3.1.13 independent variable, n: a variable used to predict another using an equation.

3.1.14 mean, n: of a population, µ, average or expected value of a characteristic in a population; of a sample, $\bar{X}$, sum of the observed values in the sample divided by the sample size. E2586

3.1.15 parameter, n: see population parameter. E2586

3.1.16 population, n: the totality of items or units of material under consideration. E2586

¹ This practice is under the jurisdiction of ASTM Committee E11 on Quality and Statistics and is the direct responsibility of Subcommittee E11.10 on Sampling / Statistics. Current edition approved Nov. 1, 2016. Published November 2016. DOI: 10.1520/E3080-16.

² For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard’s Document Summary page on the ASTM website.

Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States

3.1.17 population parameter, n: summary measure of the values of some characteristic of a population. E2586

3.1.18 prediction interval, n: an interval for a future value or set of values, constructed from a current set of data, in a way that has a specified probability for the inclusion of the future value. E2586

3.1.19 regression, n: the process of estimating parameter(s) of an equation using a set of data.

3.1.20 residual, n: observed value minus fitted value, when a model is used.

3.1.21 statistic, n: see sample statistic. E2586

3.1.22 quantile, n: value such that a fraction f of the sample or population is less than or equal to that value. E2586

3.1.23 sample, n: a group of observations or test results, taken from a larger collection of observations or test results, which serves to provide information that may be used as a basis for making a decision concerning the larger collection. E2586

3.1.24 sample size, n, n: number of observed values in the sample.
E2586

3.1.25 sample statistic, n: summary measure of the observed values of a sample. E2586

3.1.26 standard error: standard deviation of the population of values of a sample statistic in repeated sampling, or an estimate of it. E2586

3.1.26.1 Discussion: If the standard error of a statistic is estimated, it will itself be a statistic with some variance that depends on the sample size.

3.1.27 standard deviation: of a population, σ, the square root of the average or expected value of the squared deviation of a variable from its mean; of a sample, s, the square root of the sum of the squared deviations of the observed values in the sample from their mean divided by the sample size minus 1. E2586

3.1.28 variance, $\sigma^2$, $s^2$, n: square of the standard deviation of the population or sample. E2586

3.1.28.1 Discussion: For a finite population, $\sigma^2$ is calculated as the sum of squared deviations of values from the mean, divided by n. For a continuous population, $\sigma^2$ is calculated by integrating $(x - \mu)^2$ with respect to the density function. For a sample, $s^2$ is calculated as the sum of the squared deviations of observed values from their average divided by one less than the sample size.

## 4. Significance and Use

4.1 Regression analysis is a statistical procedure that studies the relations between two or more numerical variables and utilizes existing data to determine a model equation for prediction of one variable from another. In this standard, a simple linear regression model, that is, a straight line relationship between two variables, is considered (1, 2).³

## 5. Straight Line Regression and Correlation

5.1 Two Variables: The data set includes two variables, X and Y, measured over a collection of sampling units, experimental units or other type of observational units. Each variable occurs the same number of times and the two variables are paired one to one.
Data of this type constitute a set of n ordered pairs of the form $(x_i, y_i)$, where the index variable i runs from 1 through n.

5.1.1 Y is always to be treated as a random variable. X may be either a random variable sampled from a population with an error that is negligible compared to the error of Y, or values chosen as in the design of an experiment where the values represent levels that are fixed and without error. We refer to X as the independent variable and Y as the dependent variable.

5.1.2 The practitioner typically wants to see if a relationship exists between X and Y. In theory, many different types of relationships can occur between X and Y. The most common is a simple linear relationship of the form Y = α + βX + ε, where α and β are model coefficients and ε is a random error term representing variation in the observed value of Y at given X, and is assumed to have a mean of 0 and some unknown standard deviation σ. A statistical analysis that seeks to determine a linear relationship between a dependent variable, Y, and a single independent variable, X, is called simple linear regression. In this type of analysis it is assumed that the error structure is normally distributed with mean 0 and some unknown variance $\sigma^2$ throughout the range of X and Y. Further, the errors are uncorrelated with each other. This will be assumed throughout the remainder of this section.⁴

5.1.3 The regression problem is to determine estimates of the coefficients α and β that “best” fit the data and allow estimation of σ. An additional measure of association, the correlation coefficient, ρ, can also be estimated from this type of data; it indicates the strength of the linear relationship between X and Y. The sample correlation coefficient, r, is the estimate of ρ.
The square of the correlation coefficient, $r^2$, is called the coefficient of determination and has additional meaning for the linear relationship between X and Y.

5.1.4 When a suitable model is found, it may be used to estimate the mean response at a given value of X or to predict the range of future Y values from a given X.

5.2 Method of Least Squares: The methodology considered in this standard and used to estimate the model parameters α and β is called the method of least squares. The form of the best fitting line will be denoted as Y = a + bX, where a and b are the estimates of α and β respectively. The ith observed values of X and Y are denoted as $x_i$ and $y_i$. The estimate of Y at X = $x_i$ is written $\hat{y}_i = a + b x_i$. The “hat” notation over the $y_i$ variable denotes that this is the estimated mean or predicted value of Y for a given x.

5.2.1 The least squares best fitting line is one that minimizes the sum of the squared deviations from the line to the observed $y_i$ values. Note that these are vertical distances. Analytically, this sum of squared deviations is of the form:

$$S(a, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \tag{1}$$

³ The boldface numbers in parentheses refer to a list of references at the end of this standard.

⁴ The normal distribution of the error structure is not required to fit the linear model to the data but is required for performing standard model analysis such as residual analysis, confidence and prediction intervals and statistical inference on the model parameters.

5.2.2 The sum of squares, S, is written as a function of a and b. Minimizing this function involves taking partial derivatives of S with respect to a and b. This will result in two linear equations that are then solved simultaneously for a and b. The resulting solutions are functions of the $(x_i, y_i)$ paired data.

5.2.3 Several algebraically equivalent formulas for the least squares solutions are found in the literature. The following describes one convenient form of the solution. First define
First definesums of squares SXXand SYYand the sum of cross products SXYas follows:SXX5 ~n 2 1!sx25 Σi51n~x12 x¯!2(2)SYY5 ~n 2 1!sy25 Σi51n~y12 y¯!2(3)SXY5 Σi51n~x12 x¯!~y12 y¯! 5 Σi51n~x12 x¯!y1(4)Note that in Eq 2 and Eq 3, sxand syare the ordinary samplestandard deviations of the X and Y data respectively. The lastexpression in Eq 4 follows from the middle expression becauseΣi51n~x12 x¯!y¯50.From the least squares solution, the slope estimate iscalculated as:b 5Σi21n~xi2 x¯!yiΣi21n~xi2 x¯!25SXYSXX(5)Once b is determined, the intercept term is calculated from:a 5 y¯ 2 bx¯ (6)5.3 Example—An example for this kind of data and theassociated basic calculations is shown in Table 1. This data istaken from Duncan (3), and shows the relationship between themeasurement of shear strength, Y, and weld diameter, X, for 10random specimens. Values for the estimated slope and interceptare b = 6.898 and a = –569.468. Fig. 2 shows the scatter plotand associated least squares linear fit.In Eq 5, the slope estimate b is seen as a weighted averageof the yiwhere the weights, wi, are defined as:wi5~xi2 x¯!SXX(7)Values of xifurthest from the average will have the greatestimpact on the associated weight applied to observation yiandon the numerical determination of the slope b.5.4 Correlation Coeffıcient—The population correlationcoefficient, or Pearson Product Moment CorrelationCoefficient, ρ, is a dimensionless parameter intended to mea-sure the strength of a linear relationship between two variables.The estimated sample correlation coefficient, r, for a set ofpaired data (xi, yi) is calculated as:r 5Σi21n~xi2 x¯!~yi2 y¯!~n 2 1!sxsy5Σi21n~xi2 x¯!yi~n 2 1!sxsy(8)In Eq 8, the quantityΣi21n~x 2 x¯!~y 2 y¯!~n 2 1!is referred to as thesample co-variance. Here again, the mean of y disappears fromthe right side of Eq 8, because Σi21n~x 2 x¯!y¯50.5.4.1 An alternative formula for r uses the standard devia-tion of the paired differences (di= yi– xi). 
Note that it does not matter in what order we calculate these differences. Either $d_i = y_i - x_i$ or $d_i = x_i - y_i$ will give the same result:

$$r = \frac{s_x^2 + s_y^2 - s_d^2}{2 s_x s_y} \tag{9}$$

The values of the correlation coefficient for the data in Table 1 using Eq 8 and Eq 9 are:

$$r = \frac{36{,}345}{(10 - 1)(24.196)(191.645)} = 0.871$$

$$r = \frac{24.196^2 + 191.645^2 - 170.987^2}{2(24.196)(191.645)} = 0.871$$

TABLE 1 Weld Diameter (x) and Shear Strength (y)

| i | $x_i$ | $y_i$ | $d_i = x_i - y_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})y_i$ |
|---|---|---|---|---|---|
| 1 | 190 | 680 | –490.0 | –33.9 | –23,052.0 |
| 2 | 200 | 800 | –600.0 | –23.9 | –19,120.0 |
| 3 | 209 | 780 | –571.0 | –14.9 | –11,622.0 |
| 4 | 215 | 885 | –670.0 | –8.9 | –7,876.5 |
| 5 | 215 | 975 | –760.0 | –8.9 | –8,677.5 |
| 6 | 215 | 1025 | –810.0 | –8.9 | –9,122.5 |
| 7 | 230 | 1100 | –870.0 | 6.1 | 6,710.0 |
| 8 | 250 | 1030 | –780.0 | 26.1 | 26,883.0 |
| 9 | 265 | 1175 | –910.0 | 41.1 | 48,292.5 |
| 10 | 250 | 1300 | –1050.0 | 26.1 | 33,930.0 |
| average | 223.9 | 975.0 | | | |
| stdev (S) | 24.196 | 191.645 | 170.987 | | |
| $S^2$ | 585.433 | 36,727.778 | 29,236.544 | | |

| Parameter estimate | Value |
|---|---|
| b | 6.898 |
| a | –569.468 |
| $S_{XX}$ | 5,268.900 |
| $S_{YY}$ | 330,550.000 |
| $S_{XY}$ | 36,345.000 |

5.4.2 The value of the correlation coefficient is always between –1 and +1. If r is negative (y decreases as x increases) then a line fit to the data will have a negative slope; similarly, positive values of r (y increases as x increases) are associated with a positive slope. Values of r near 0 indicate no linear relationship so that a line fit to the data will have a slope near 0. In cases where the (x, y) data have an r = –1 or r = +1, the relationship between x and y is perfectly linear. An r value near +1 or –1 indicates that a line may provide an adequate fit to the data but does not “prove” that the relationship is linear since other models may provide a better fit (for example, a quadratic model). As values of r become closer to the extremes (–1 and +1) a line provides a stronger explanation of the relationship. Fig.
2 shows examples of what correlated data look like for several values of r.

5.4.3 An alternative formula for the estimated slope b as a function of the correlation coefficient, r, and standard deviations of the variables X and Y is:

$$b = r\,\frac{s_y}{s_x} \tag{10}$$

5.5 Residuals: For any specified $x_i$ in the data set, the residual at $x_i$ is the difference $e_i = y_i - \hat{y}_i = y_i - (a + b x_i)$, the difference between