
Linear Regression Models

Milan Meloun, Jiří Militký, in Statistical Data Analysis, 2011

Problem 6.28 Identification of influential points from elements of the projection matrix

The outlying point in sample C and the leverage point in sample D from Problem 6.8 may be used to test the identification of influential points by the elements Hii and H*ii of the projection matrix H.

Data: from Problem 6.8

Solution: The calculated diagonal elements Hii and H*ii of the projection matrix H are listed in Table 6.9.

Table 6.9. Elements of the projection matrix Hii and the extended projection matrix H*ii for samples C and D

Sample   xi    yi      Hii     H*ii
C        13    12.75   0.236   1
D        19    12.5    1       1

The diagonal elements of the extended projection matrix indicate a strongly influential point in both samples. The leverage point in sample D is indicated even by the diagonal element Hii of the original projection matrix.

Conclusion: The diagonal elements of an extended projection matrix are useful for detecting outliers and leverage points in data. The leverage point was not detected by any type of residuals (Problem 6.27).
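
To make the computation concrete, here is a minimal R sketch; since the data of Problem 6.8 are not reproduced in this excerpt, the x and y values below are hypothetical, with the last point imitating a leverage point such as the one in sample D. Only the ordinary diagonal elements Hii are computed (the extended matrix H*ii, which also involves y, is not shown).

# Diagonal elements H_ii of the projection (hat) matrix for hypothetical data
x <- c(1, 2, 3, 4, 5, 19)               # the last x-value is far from the others
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.5)
X <- cbind(1, x)                        # design matrix with an intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)   # projection matrix H
round(diag(H), 3)                       # H_ii; the extreme x gives a value close to 1
# The same diagonal elements are returned by hatvalues(lm(y ~ x))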


URL: https://www.sciencedirect.com/science/article/pii/B9780857091093500066

Volume 3

J. Ferré, in Comprehensive Chemometrics, 2009

3.02.3.3.7 Detection of multiple outliers

If a data set contains a single y-outlier or a high leverage point, the standard diagnostics in Sections 3.02.3.3.5, 3.02.3.3.6, and 3.02.3.4 work quite well. However, if the data set contains several outliers, masking and swamping may occur, and these methods can fail. The masking effect consists of failing to detect one or more outliers (they are incorrectly identified as normal samples) because other high leverage points are hiding them (i.e., they are masking their bad influence). The swamping effect consists of mistakenly declaring a ‘good’ (nonoutlying) point as an outlier because of the presence of other outliers that pull the regression equation toward them and make the ‘good’ points lie far from the fitted equation.

A single masked outlier can be detected by deletion diagnostics, in which one observation at a time is deleted, followed by the calculation of new residuals and parameter estimates. This can make other outliers emerge, which were not visible at first. Alternatively, the deletion process can be done with several observations at a time to detect multiple outliers. Some of the diagnostics designed for finding single outliers have been extended to multiple-case diagnostics to measure the joint effect of deleting more than one case. Cook and Weisberg2 (p. 135) and Rousseeuw and Leroy4 (p. 234) show how Cook’s distance (Section 3.02.3.4.3) and other diagnostics can be generalized to detect multiple outliers. An inconvenience is that the number of combinations required to consider all pairs, triples, etc., increases exponentially. Other proposals have appeared for identifying multiple outliers at a substantially reduced cost. Hadi62 proposed and later improved63 an iterative Mahalanobis distance type of method for the detection of multiple outliers in multivariate data. Hadi and Simonoff64 introduced two new test procedures for the detection of multiple outliers that attempt to separate the data into a set of ‘clean’ data points and a set of points that contain the potential outliers. These methods were reported to be superior in the detection of multiple outliers in linear models when compared with other methods, including methods based on robust fits (e.g., least median of squares (LMS) residuals, see below). A similar idea was behind the work by Walczak.65,66 Barrett and Gray67 also presented subset diagnostics for identifying multiple outliers. Penny and Jolliffe68 compared six techniques for detecting multivariate x-outliers, and found that some methods do better than others depending on whether or not the data set is multivariate normal, the dimension of the data set, the type of outlier, the proportion of outliers in a data set, and the degree of contamination, that is, ‘outlyingness’. Hence, it is best to run a battery of methods to detect outliers.
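
As a rough sketch of the single-case deletion idea (not any particular author's implementation), the following R code refits the model with one observation left out at a time and ranks the observations by how much the coefficient estimates change; the data are hypothetical, with two planted outliers.

# Single-case deletion diagnostics: refit without observation i and record the
# shift in the estimated coefficients (hypothetical data)
set.seed(3)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
y[c(5, 17)] <- y[c(5, 17)] + 8               # two planted outliers
fit <- lm(y ~ x)
delta <- sapply(seq_along(y), function(i) {
  fit_i <- lm(y[-i] ~ x[-i])                 # model with observation i deleted
  sqrt(sum((coef(fit) - coef(fit_i))^2))     # change in the parameter vector
})
head(order(delta, decreasing = TRUE))        # observations with the largest impact
# Deleting pairs, triples, etc. works the same way, but the number of subsets to
# examine grows rapidly, which is the cost noted above.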

Another alternative is to use outlier diagnostics calculated from the output of robust regression. Robust multivariate methods fit the models to the majority of the data points, and minimize or remove the influence of outlying data points in the fit. As the ‘good’ data primarily determine the result, the outliers (especially multiple outliers) can be better identified as those observations with large residuals from the robust fit. After the outliers have been removed, the statistically well-established classical methods can be applied again. There are many robust strategies,4 such as LMS69 regression, which minimizes the median of the squared residuals. In this method, up to half of the observations can disagree without masking a model that fits the rest of the data. Liang and Kvalheim,70 Moller et al.,71 and Daszykowski et al.72 have reviewed commonly used robust multivariate regression and exploratory methods used in chemometrics and commented on their outlier detection capabilities. Diverse works have shown the use of outlier detection in robust methods,59,60,73,74 especially in robust PCR. A comparison of classical and robust diagnostics for the detection of multiple outliers can be found in Walczak and Massart.75
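
A minimal sketch of the robust-residual strategy, assuming the MASS package is available for an LMS fit (the methods cited above are more elaborate): observations whose residuals are large relative to the robust scale estimate are flagged for inspection.

# Outlier flagging from a least median of squares (LMS) fit via MASS::lqs
library(MASS)
set.seed(4)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
y[1:4] <- y[1:4] + 10                        # a cluster of outliers that could mask one another
fit_lms <- lqs(y ~ x, method = "lms")        # robust fit driven by the majority of the data
r_std <- residuals(fit_lms) / fit_lms$scale[1]
which(abs(r_std) > 2.5)                      # candidates to remove before a classical refit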


URL: https://www.sciencedirect.com/science/article/pii/B9780444527011000764

Building a Global Model: The Toolkit

Barry B. Hughes, in International Futures, 2019

2.4.3.4 Looking for and Understanding Leverage Points

The stylized system representations in Figs. 2.1–2.3 also provide insights with respect to potential leverage points for policy choices. Many such leverage points affect variables that operate within complex feedback systems. Interventions at points that sit within positive feedback loops, such as capital stock in Fig. 2.2, may have very strong impacts because they accelerate those reinforcing loops. In contrast, interventions at points that sit within negative feedback loops, like energy consumption in Fig. 2.3, may demonstrate disappointing impacts. For example, reductions in energy consumption (e.g., via auto emission standards) lower energy prices, which feeds back to encourage more energy consumption (e.g., through the purchase of larger cars), a well-known phenomenon called the rebound effect. In fact, Figs. 2.1–2.3 suggest the significant number of negative feedback loops in market-centric systems, explaining why models (and the real world) tend to “fight” our attempts to change basic forecast patterns.

In the discussion of distal and proximate variables, we already touched on another leverage point insight. Fig. 2.1 included two proximate variables, namely contraception use and government spending on health. Demand for both contraception and public health spending tends to grow with GDP per capita and education levels (distal developmental variables). If government policy and other forces in a society have already led to higher actual contraception or health spending values than expected at a particular development level, it is unlikely that intended upward interventions in the real world will have as much leverage as in a country where values are lower than normal at the current development level. In the real world, an attempted push on what we might call “saturated leverage points” will likely run into other constraints that fight back against our efforts.

The IFs project often uses cross-sectional analysis of potential leverage points with development variables to benchmark whether a country is ripe for intervention in a given area. Going back to Fig. 2.4, education in the Middle East was far below what we would expect in countries at their level of GDP per capita. In this case, even the distal drivers are imbalanced. Because in some circumstances education can also be considered a proximate driver (especially when enrollment rates are well below typical levels at a country's level of GDP per capita), it suggests an important leverage point for action by governments and other actors—or possibly a grievance point for young people who feel deprived of a good education and employment prospects.


URL: https://www.sciencedirect.com/science/article/pii/B9780128042717000026

Robust Regression

Rand Wilcox, in Introduction to Robust Estimation and Hypothesis Testing (Fourth Edition), 2017

10.15.2 R Functions reglev and rmblo

The R function

reglev(x, y)

is provided for detecting regression outliers and leverage points using the method described in the previous section. If the ith vector of observations is a regression outlier, the function stores the value of i in the R variable reglev$regout. If xi is an outlier based on the method in Section 6.4.3, it is declared a leverage point and the function stores the value of i in reglev$levpoints. The plot created by this function can be suppressed by setting plotit=F. The R function

rmblo(x, y)

removes leverage points and returns the remaining data.

Example

If the reading data, described in Section 10.8.1, are stored in the R variables x and y, the command reglev(x,y) returns

$levpoints:
[1] 8

$regout:
[1] 12 44 46 48 59 80

This says that x8 is flagged as a leverage point (it is an outlier among the x values), and the points (y12, x12), (y44, x44), (y46, x46), (y48, x48), (y59, x59), and (y80, x80) are regression outliers. Note that even though x8 is an outlier, the point (y8, x8) is not a regression outlier. For this reason, x8 is called a good leverage point. (Recall that extreme x values can lower the standard error of an estimator.) If (y8, x8) had been a regression outlier, x8 would be called a bad leverage point. Regression outliers for which x is not a leverage point are called vertical outliers. In this illustration, all of the regression outliers are vertical outliers as well.
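
To make the terminology concrete, here is a small R sketch with hypothetical data; it uses classical hat values and studentized residuals as rough stand-ins for the robust checks that reglev actually performs.

# Good vs. bad leverage points and vertical outliers (hypothetical data)
set.seed(5)
x <- c(rnorm(25), 4, 4.2)                            # two points far out in x
y <- c(2 * x[1:25] + rnorm(25), 2 * 4, 2 * 4.2 + 6)  # one on the line, one well off it
fit <- lm(y ~ x)
h <- hatvalues(fit)                                  # leverage of each point
tres <- rstudent(fit)                                # studentized residuals
lev <- h > 2 * mean(h)                               # unusual in x
out <- abs(tres) > 2                                 # unusually large residual
which(lev & !out)                                    # good leverage points
which(lev & out)                                     # bad leverage points
which(out & !lev)                                    # vertical outliers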

Example

The reading data used in the last example are considered again, only the predictor is now taken to be the data in column 3 of the file read.dat, which is another measure of phonological awareness called sound blending. The plot created by reglev is shown in Figure 10.6. Points below the horizontal line that intersects the y-axis at −2.24 are declared regression outliers, as are points above the horizontal line that intersects the y-axis at 2.24. There are three points that lie to the right of the vertical line that intersects the x-axis at $\sqrt{\chi^2_{0.975,p}} = 2.24$; these points are flagged as leverage points. These three points are not flagged as regression outliers, so they are deemed to be good leverage points.


Figure 10.6. The plot created by the function reglev based on the reading data. Points to the right of the vertical line located at 2.24 on the x-axis are declared leverage points. Points outside the two horizontal lines are declared regression outliers.


URL: https://www.sciencedirect.com/science/article/pii/B978012804733000010X

Robust Regression

Rand R. Wilcox, in Introduction to Robust Estimation and Hypothesis Testing (Fifth Edition), 2022

Abstract

Chapter 10 summarizes a wide range of robust regression estimators. Their relative merits are discussed. Generally, these estimators deal effectively with regression outliers and leverage points. Some can offer a substantial advantage, in terms of efficiency, when there is heteroscedasticity. Included are robust versions of logistic regression and recently derived methods for dealing with multivariate regression, two of which take into account the association among the outcome variables, in contrast to most estimators that have been proposed. Robust versions of ridge estimators have been added, as well as robust lasso and elastic net techniques. R functions for applying these estimators are described.


URL: https://www.sciencedirect.com/science/article/pii/B9780128200988000166

Statistical Methods for Physical Science

George A.F. Seber, Christopher J. Wild, in Methods in Experimental Physics, 1994

9.6.6 Plots to Detect Influential Observations

An influential observation is one that, either individually or collectively with several other observations, has a bigger impact on the estimators than most of the other observations. There are two kinds of influential points: outliers and high leverage points. We have already mentioned outliers, and they can be brought to our attention by unusual points in a residual plot. It is usually best to carry out the least squares fit with and without an outlier. Frequently the omission of an outlier has little effect, so there may be no need to agonize over whether to include it or not. An outlier, therefore, need not be influential. On the other hand, an influential observation need not be an outlier in the sense of having a large residual, as we see in Fig. 3, based on Draper and Smith ([12] p. 169). Here we have four observations, three at x=a and one at x=b. The residual for the middle observation, point (1), at x=a is 0. However, it turns out that the residual at x=b is also 0, irrespective of the corresponding y-value; i.e., points (2) and (3) both have a 0 residual. Clearly the observation at x=b is extremely influential, and the slopes of the fitted lines can be completely different. What makes this point influential is that its x-value (x=b) is very different from the x-values (x=a) for the remaining points. It is clear, therefore, that graphical methods based on residuals alone may fail to detect these points. This leads us to the concept of leverage.


Fig. 3. The point (1) at x = a has a 0 residual. A single point at x = b is very influential but has a 0 residual irrespective of its value of y. Here points (2) and (3) represent two such values of y.

The fitted values can be written as $\hat{y} = Py$, where $P = X(X^TX)^{-1}X^T$ is the projection (hat) matrix, so that

$\hat{y}_i = \sum_{j=1}^{n} p_{ij}\, y_j,$

and $p_{ii}$, the coefficient of $y_i$, can be thought of as the amount of "leverage" that $y_i$ has on $\hat{y}_i$. Since $p_{ii} = x_i^T(X^TX)^{-1}x_i$, where $x_i$ is the ith row of X, i.e., the ith data point in "X-space," we define a high leverage point to be an $x_i$ with a large $p_{ii}$. Points far removed from the main body of points in the X-space will have high leverage. Since it can be shown that $\sum_{i=1}^{n} p_{ii} = p$, we see that $p/n$ is the average value of the $p_{ii}$. Points with values greater than $3p/n$, or even $2p/n$, are therefore deemed to have high leverage. As with outliers, a high leverage point need not be influential, but it has the potential to be. In Fig. 4 we have three situations: (a) there is one outlier but it is not influential, (b) there are no outliers but there is one high leverage point that is not influential, and (c) there is one outlier with high leverage so that it has high influence.


Fig. 4. The effect of extreme points on least squares fitting: (a) there is one outlier but it is not influential; (b) there are no outliers but there is one high leverage point; (c) there is one outlier with high leverage so that it has high influence.
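
A short R sketch of this rule, with hypothetical data: the diagonal elements p_ii are returned by hatvalues, they sum to p, and points exceeding 2p/n or 3p/n are flagged.

# Hat values p_ii and the 2p/n, 3p/n leverage cutoffs (hypothetical data)
set.seed(1)
x <- c(rnorm(20), 6)                # the last point is far from the bulk in X-space
y <- 2 + 0.5 * x + rnorm(21)
fit <- lm(y ~ x)
p <- length(coef(fit))              # number of parameters (intercept and slope)
n <- length(y)
h <- hatvalues(fit)                 # diagonal elements p_ii
sum(h)                              # equals p
which(h > 2 * p / n)                # candidate high leverage points
which(h > 3 * p / n)                # stricter cutoff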

There is a wide and somewhat confusing range of measures for detecting influential points, and a good summary of what is available is given by Chatterjee and Hadi [25] and the ensuing discussion. Some measures highlight problems with y (outliers), others highlight problems with the x-variables (high leverage), while some focus on both. As statisticians all have different experiences with data, the advice given on which measure to use is confusing. As Cook [27] emphasizes, the choice of the method depends on the answer to the question, "Influence on what?" One should be most concerned with the influence on the part of the analysis that is most closely related to the practical problem under investigation. The choice of measure should therefore be determined by the priorities in the subject matter rather than by purely statistical priorities. However, for the practitioner in the physical sciences, what often determines the choice is what is available from the statistical packages being used. We shall therefore begin by looking at some of the measures produced by the regression program PROC REG using the keyword INFLUENCE. The first is known as Cook's distance $D_i$, which measures the change that occurs in $\hat{\beta}$ when the ith data point is omitted (giving $\hat{\beta}_{(i)}$). It represents a generalized distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$, namely,

Influence on $\hat{\beta}$ as a whole:

(9.21) $D_i = \dfrac{(\hat{\beta} - \hat{\beta}_{(i)})^T X^T X (\hat{\beta} - \hat{\beta}_{(i)})}{p s^2} = \dfrac{t_i^2}{p} \cdot \dfrac{p_{ii}}{1 - p_{ii}}$

In SAS, $p_{ii}$ is denoted by $h_i$. Points of high influence are those for which $D_i$ exceeds the upper $\alpha$ quantile of the $F_{p,n-p}$ distribution. Equation (9.21) indicates that $D_i$ depends on two quantities: (i) $t_i^2$, which measures the degree to which the ith point is an outlier, and (ii) $p_{ii}/(1 - p_{ii})$, which measures leverage.

Another measure is

Influence on predicted values: $\mathrm{DFFITS}_i = \dfrac{\lvert x_i^T(\hat{\beta} - \hat{\beta}_{(i)})\rvert}{s_{(i)}\sqrt{p_{ii}}} = \lvert r_i^{*}\rvert \sqrt{\dfrac{p_{ii}}{1 - p_{ii}}}$

which is similar to $D_i$ except that $s_{(i)}$ is used instead of s. Points for which $\mathrm{DFFITS}_i > 2\sqrt{p/n}$ are investigated further. While Cook's $D_i$ measures the influence of a single observation on $\hat{\beta}$, $\mathrm{DFFITS}_i$ tends to measure its influence on both $\hat{\beta}$ and $s^2$ simultaneously. However, the latter does have some deficiencies in this double role, and the comments by Cook [27] about the relative merits of both measures are helpful.

The preceding measures assume that all the values of βr are of equal importance so that they measure the effect of the observation on all the elements of β^. However, an observation can be an outlier or influential only in one dimension or a few dimensions. The following measure

Influence on individual coefficients: $\mathrm{DFBETAS}_{ij} = \dfrac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\mathrm{se}(\hat{\beta}_j)}$

(with $s_{(i)}$ replacing s in $\mathrm{se}(\hat{\beta}_j)$) measures the partial influence of the ith observation on the jth coefficient, and it is to be compared with $2/\sqrt{n}$.
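
These quantities are not tied to PROC REG; base R, for instance, returns them directly. The hedged sketch below uses a built-in data set and the rough cutoffs discussed above (2p/n for leverage, 2√(p/n) for DFFITS, 2/√n for DFBETAS).

# Cook's distance, DFFITS, and DFBETAS in base R
fit <- lm(dist ~ speed, data = cars)          # built-in example data
p <- length(coef(fit))
n <- nrow(cars)
h  <- hatvalues(fit)                          # leverages p_ii
D  <- cooks.distance(fit)                     # Cook's distance D_i
ff <- dffits(fit)                             # DFFITS_i
db <- dfbetas(fit)                            # DFBETAS_ij, one column per coefficient
which(h > 2 * p / n)                          # high leverage
which(abs(ff) > 2 * sqrt(p / n))              # influential by DFFITS
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)  # influential for individual coefficients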

Before leaving these measures, we note that Cook [27] recommends the routine use of the so-called likelihood displacement. This method can also be applied to nonlinear models, as in Cook [28], though the theory is difficult. The development of techniques for nonlinear models is still in its infancy. However, if the intrinsic curvatures are negligible, then the residuals ri can be used in much the same way as in linear models. Otherwise, so-called projected residuals need to be used (Seber and Wild [3], p. 179). For a further discussion on regression diagnostics for linear models, the reader is referred to Belsley, Kuh, and Welsch [29], Cook and Weisberg [24, 30], Weisberg [14], and Atkinson [22]. Influence methods are discussed by Escobar and Meeker [31] with applications to censored data. A helpful book on graphical methods in general, as well as regression diagnostics, is given by Chambers et al. [32].


URL: https://www.sciencedirect.com/science/article/pii/S0076695X08602598

Residual and influence diagnostics

Xian Liu, in Methods and Applications of Longitudinal Data Analysis, 2016

6.2.2 Leverage

In linear regression models, leverage is used to assess outliers with respect to the independent variables by identifying the observations that are distant from the average predictor values. While potentially impactful on the parameter estimates and the model fit, a high leverage point does not necessarily indicate strong influence on the regression coefficient estimates, because a subject whose predictor values lie far from those of the others may still fall on the same regression line as the other observations (Fox, 1991). Therefore, checking for substantial influence must combine high leverage with discrepancy of the case from the rest of the data.

The basic measurement of leverage is the so-called hat-value, denoted by $h_i$. In general linear models, the hat-value is specified as a weight in the expression for the fitted value of the response, $\hat{y}_{\tilde{j}}$, given by

(6.21) $\hat{y}_{\tilde{j}} = \sum_{i=1}^{N} h_{i\tilde{j}}\, y_i,$

where $h_{i\tilde{j}}$ is the weight of subject i in predicting the outcome Y at data point $\tilde{j}$ ($\tilde{j} = 1, 2, \ldots, N$), and $\hat{y}_{\tilde{j}}$ is specified as a weighted average of the N observed values. Therefore, the weight $h_{i\tilde{j}}$ displays the influence of $y_i$ on $\hat{y}_{\tilde{j}}$, with a higher score indicating a greater impact on the fitted value. Let

(6.22) $h_i = h_{ii} = \sum_{\tilde{j}=1}^{N} h_{i\tilde{j}}^2.$

According to Equation (6.22), the hat-value $h_i$, with the property $0 \le h_i \le 1$, is the leverage score of $y_i$ on all fitted values.

In general linear models including a number of independent variables, leverage measures distance from the means of the independent variables and can be expressed as a matrix quantity given the covariance structure of the X matrix. Correspondingly, the hat-value $h_i$ is the ith diagonal element of the hat matrix H. The hat matrix H is given by

(6.23) $H = X(X'X)^{-1}X'.$

The diagonal of H provides a standardized measure of the distance of the ith subject from the center of the X-space, with a large value indicating that the subject is potentially influential. If all cases had equal influence, each subject would have a leverage score of M/N, where M is the number of independent variables (including the intercept) and N is the number of observations. In the literature on influence diagnostics, leverage values exceeding 2M/N for large samples, or 3M/N for samples of N ≤ 30, are roughly regarded as flagging influential cases.

Given the H matrix, the predicted values of y in general linear models can be written as

(6.24) $\hat{y} = Hy.$

Therefore, the H matrix determines the variance and covariance of the fitted values and residuals, given by

(6.25a) $\mathrm{var}(\hat{y}) = \sigma^2 H,$

(6.25b) $\mathrm{var}(r) = \sigma^2 (I - H).$
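
A small R sketch of Equations (6.23)–(6.25) with hypothetical data: the hat matrix is formed explicitly, Hy reproduces the fitted values, and the average leverage equals M/N.

# Hat matrix H = X(X'X)^{-1}X' and its basic properties (hypothetical data)
set.seed(6)
N <- 8
M <- 2                                        # intercept plus one predictor
X <- cbind(1, rnorm(N))
y <- as.vector(X %*% c(1, 2) + rnorm(N))
H <- X %*% solve(t(X) %*% X) %*% t(X)         # Eq. (6.23)
all.equal(as.vector(H %*% y), unname(fitted(lm(y ~ X[, 2]))))  # Eq. (6.24): y-hat = Hy
mean(diag(H))                                 # average leverage, equal to M/N
which(diag(H) > 2 * M / N)                    # rough flag for potentially influential cases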

In longitudinal data analysis, the specification of the H matrix becomes more complex due to the inclusion of the covariance matrix V(R, G). Let Θ be an available estimate of R and G (also specified in Chapter 4). The leverage score for subject i can be expressed as the ith diagonal of the following hat matrix:

(6.26) $H = X\big(X'V(\hat{\Theta})^{-1}X\big)^{-1} X'V(\hat{\Theta})^{-1}.$

The ith diagonal of the above matrix is the leverage score for subject i displaying the degree of the case’s difference from others in one or more independent variables.
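
As a sketch of Equation (6.26), the code below assumes a simple AR(1)-type working covariance matrix in place of V(Θ̂); in practice the R and G structures would come from the fitted longitudinal model.

# Leverage scores under a non-identity covariance matrix, Eq. (6.26)
set.seed(2)
N <- 6
X <- cbind(1, rnorm(N))                       # hypothetical design matrix
V <- 0.5 ^ abs(outer(1:N, 1:N, "-"))          # assumed AR(1)-type estimate of V
Vinv <- solve(V)
H <- X %*% solve(t(X) %*% Vinv %*% X) %*% t(X) %*% Vinv
diag(H)                                       # leverage scores for the N observations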


URL: https://www.sciencedirect.com/science/article/pii/B978012801342700006X

More Regression Methods

Rand R. Wilcox, in Introduction to Robust Estimation and Hypothesis Testing (Fifth Edition), 2022

11.5.16 R Functions smgridAB, smgridLC, smgrid, smtest, and smbinAB

The R function

smgridAB(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, PB = FALSE, est = tmean, nboot = 1000, pr = TRUE, xout = FALSE, outfun = outpro, SEED = TRUE, ...)

splits the data into groups based on quantiles specified by the arguments Qsplit1 and Qsplit2, and then compares the resulting groups based on trimmed means. By default, the splits are based on the medians of two of the independent variables. The argument IV indicates which of the two independent variables will be used. The first two columns of the argument x, a matrix or a data frame, are used by default. For each row of the first factor (the splits based on the first independent variable), all pairwise comparisons are made among the levels of the second factor (the splits based on the second independent variable). In a similar manner, for each level of the second factor, all pairwise comparisons among the levels of the first factor are performed. Setting PB=TRUE causes a percentile bootstrap method to be used, which makes it possible to use a robust measure of location other than a trimmed mean via the argument est. Measures of effect size are returned as well.

The R function

smgridLC(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, PB = FALSE, est = tmean, tr = 0.2, nboot = 1000, pr = TRUE, con = NULL, xout = FALSE, outfun = outpro, SEED = TRUE, ...)

can be used to test hypotheses about linear contrasts. Linear contrast coefficients can be specified via the argument con. See, for example, Section 7.4.4. By default, con=NULL, meaning that all relevant interactions are tested. If it is desired to split the data based on a single independent variable, this can be done with the R function

smtest( x, y, IV = 1, Qsplit = 0.5, nboot = 1000, est = tmean, tr = 0.2, PB = FALSE, xout = FALSE, outfun = outpro, SEED = TRUE, ...).

To perform all pairwise comparisons, use the R function

smgrid(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, PB = FALSE, est = tmean, nboot = 1000, pr = TRUE, xout = FALSE, outfun = outpro, SEED = TRUE, ...).

If the dependent variable is binary, use the function

smbinAB(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, method = ‘KMS’, xout = FALSE, outfun = outpro, ...),

which is like the function smgridAB, only the KMS method for comparing two binomial distributions, described in Section 5.2, is used by default. To use method SK, also described in Section 5.2, set the argument method=‘SK’. The R function

smbin.inter(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, alpha = 0.05, con = NULL, xout = FALSE, outfun = outpro, SEED = TRUE, ...)

also deals with a binary dependent variable. By default, all interactions are tested, but other linear contrasts can be tested via the argument con.
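
A hypothetical call pattern is sketched below; it assumes that Wilcox's functions (for example, from the Rallfun source file) have been sourced so that smgridAB and smgridLC are available, and the data are stand-ins.

# Hypothetical usage sketch (requires Wilcox's functions to be sourced)
set.seed(7)
x <- cbind(LSIZ = rnorm(100), CAR = rnorm(100))   # stand-in covariates
y <- rnorm(100)                                   # stand-in outcome
# Median splits on both covariates, 20% trimmed means, leverage points removed:
# res <- smgridAB(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, xout = TRUE)
# Test all interaction contrasts based on the same splits:
# resLC <- smgridLC(x, y, IV = c(1, 2), xout = TRUE)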

Example

This example is based on the Well Elderly data mentioned in Sections 11.1.6 and 11.2.2. The goal here is to understand the association between a measure of meaningful activities (MAPA) and two covariates: a measure of life satisfaction (LSIZ) and the CAR, which was described in Section 11.1.6. All analyses were done with leverage points removed. The sample size is n=246. Testing the hypothesis that the linear model is correct, the p-value is 0.45. Least squares regression, with leverage points removed, yields a significant association for LSIZ (p-value < 0.001) but not for the CAR (p-value = 0.68). The HC4 method was used for dealing with heteroscedasticity. Switching to the Theil–Sen estimator in conjunction with a percentile bootstrap method, again, LSIZ is significant at the 0.05 level (p-value < 0.001) and CAR is not (p-value = 0.51). The basic lasso estimator given by Eq. (10.21), as well as the Huber-type lasso, also indicates that CAR is not relevant. That is, both of these methods estimate the slope for CAR to be zero.

However, the smooth in Fig. 11.13 suggests that the usual linear regression model might poorly reflect the true nature of the association. (The plot was rotated by setting the argument theta=120, which provides a better view of the association.) In particular, it suggests that the nature of the association depends in part on whether cortisol increases after awakening and whether LSIZ is relatively high or low. Here is a portion of the output from smgridAB, based on default settings and leverage points removed, with LSIZ the first independent variable and CAR the second:


Figure 11.13. A smooth illustrating that in some situations, simply using grids can detect associations that are missed by a basic linear model.

$A[[1]]

 Group Group psihat ci.lower ci.upper p.value Est.1 Est.2

[1,] 1 2 2.38529 0.4991086 4.271472 0.01393357 33.02632 30.64103

$A[[2]]

 Group Group psihat ci.lower ci.upper p.value Est.1 Est.2

[1,] 1 2 -1.435165 -3.579639 0.7093088 0.1863744 34.30769 35.74286

Here, A[[1]] refers to the first level of the first factor, which corresponds to LSIZ scores less than the median. The reported results are for the two levels of the second factor, which correspond to low and high CAR values. The median CAR is −0.032. The results indicate that for low LSIZ scores, typical MAPA scores differ significantly for CAR less than the median versus CAR greater than the median. Roughly, for low LSIZ scores, MAPA scores tend to be higher when the CAR is negative (cortisol increases upon awakening). When LSIZ is high, which corresponds to A[[2]], now no significant difference is found. That is, contrary to the results based on linear models, the CAR has a significant association with MAPA when taking into account whether LSIZ is relatively high or low.

Another portion of the output looks like this:

$B[[1]]

 Group Group psihat ci.lower ci.upper p.value Est.1 Est.2

[1,] 1 2 -1.281377 -3.220148 0.6573951 0.1917055 33.02632 34.30769

$B[[2]]

 Group Group psihat ci.lower ci.upper p.value Est.1 Est.2

[1,] 1 2 -5.101832 -7.199571 -3.004092 6.971174e-06 30.64103 35.74286

This indicates that for high CAR values, B[[2]], it is reasonable to decide that typical MAPA scores are higher when LSIZ scores are relatively high, but no decision is made for low CAR values based on the results labeled B[[1]].

It is noted that smgridLC indicates that there is an interaction. The p-value is 0.009. As indicated in Section 11.7, a common strategy for modeling an interaction is to simply include a product term involving the two independent variables. But often this approach is not flexible enough to adequately detect any interaction that might exist. For the situation at hand, this approach does not yield a significant result. Using least squares regression via the R function olsci, the p-value is 0.57. Using the Theil–Sen estimator via the R function regci, the p-value is 0.91.


URL: https://www.sciencedirect.com/science/article/pii/B9780128200988000178

More Regression Methods

Rand Wilcox, in Introduction to Robust Estimation and Hypothesis Testing (Fourth Edition), 2017

11.10.5 R Function larsR

As previously noted, another method for identifying the best predictors is based on what is called the lasso (Tibshirani, 1996). And a related approach is least angle regression. Both the lasso and least angle regression can be applied with the R function

larsR(x,y,type=‘lasso’,xout=F,outfun=outpro).

By default, the lasso method is used. To use least angle regression, set the argument type=‘lar’. To eliminate leverage points via the function indicated by the argument outfun, set the argument xout=T. The function returns estimates of which independent variables are best, in descending order. Unlike the R functions regpre and regpreCV, larsR does not provide information about which subsets of variables are best. That is, it does not indicate, for example, whether predictors 1 and 2, taken together, are better in some sense than using predictor 1 only. Rather, it estimates which independent variable is best. Suppose this is independent variable 3. It then estimates which of the remaining independent variables is best when used in conjunction with independent variable 3. And it continues in this fashion using the remaining independent variables. (For results on bootstrapping the lasso estimator, see Chatterjee & Lahiri, 2013.) Kwon, Lee, and Kim (2015) note that the lasso can select too many noisy variables, and they report results on how this issue might be addressed. Another issue is that the method is based on the least squares estimator, raising concerns about robustness. A simple way of dealing with leverage points is to simply remove them. For alternative approaches, see McCann and Welsch (2007), and Khan, van Aelst, and Zamar (2007).
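
For readers without access to larsR, a rough equivalent can be put together with the lars package (an assumption here is that larsR wraps this package); the data below are hypothetical, and leverage points would be removed beforehand if desired.

# Lasso and least angle regression paths via the lars package (hypothetical data)
library(lars)
set.seed(8)
X <- matrix(rnorm(60 * 4), ncol = 4)
y <- 1 + 2 * X[, 1] - X[, 3] + rnorm(60)
fit_lasso <- lars(X, y, type = "lasso")      # lasso path
fit_lar   <- lars(X, y, type = "lar")        # least angle regression path
unlist(fit_lasso$actions)                    # order in which predictors enter (or leave) the model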


URL: https://www.sciencedirect.com/science/article/pii/B9780128047330000111

Volume 3

P. Filzmoser, ... P.J. Van Espen, in Comprehensive Chemometrics, 2009

3.24.5.4.2 Prediction of the concentration of metal oxides in archaeological glass vessels

The data set under consideration consists of 180 EPXMA (electron probe X-ray microanalysis) spectra of archaeological glass vessels. The experimental conditions under which the spectra were measured are beyond the scope of the current example, but have been described in detail in Janssens et al.40 It is noteworthy that the data contain samples that appertain to four different classes of archaeological glass: sodic glass, potassic glass, calcic glass, and potasso-calcic glass. The vast majority of samples belongs to the sodic group as only 10, 10, and 15 samples belong to the other three classes, respectively. Of the raw data set, the last 25 samples were identified to be outliers as they had been measured with a different detector efficiency. Hence, they are outlying in the space of the spectra, not the concentrations. In the statistical sense, these spectra are thus leverage points. It is important to note that all outliers are samples of sodic glass. The detection of these outliers was done by a tedious procedure: Each spectrum was evaluated separately. After removal of the outliers, classical PLS calibration was performed and was shown to yield good agreement between predicted and true concentrations for a test set.41

In the current section, we show how the use of robust statistics could have sped up this procedure, making the manual spectrum evaluation step superfluous, without causing a sizable loss in prediction capacity. To this end, the PRM regression estimator was applied, as it is both efficient and able to cope with leverage outliers in the data.

In order to perform calibration, the data were split up into two sets. A set for calibration was constructed as in Lemberge et al.,41 where it had been decided that the training set should contain 20 sodic samples and 5 samples of each of the other types of glass. In addition to these 35 spectra, 6 spectra were added, which belonged to the group of sodic samples measured at a different detector efficiency (bad leverage points). The remaining samples with correct detector efficiency were left for validation purposes.

In the original analysis, univariate PLS calibration was performed for all of the main constituents of the glass. These are sodium oxide, silicium dioxide, potassium oxide, calcium oxide, manganese oxide, and iron(iii) oxide. Here we will limit the description by showing the results for the prediction of sodium oxide and iron(iii) oxide. The reason for this is the following: the lighter the element42 for which to calibrate, the bigger the influence of the leverage points. (In EPXMA one observes the characteristic peaks elementwise; it is thus equivalent to write ‘model for sodium’ and ‘model for sodium oxide’.) Hence, showing the results for sodium and iron covers the trends that can be observed from this set of models, as sodium is the lightest and iron the heaviest element to be modeled. Why the model for sodium should be affected more by the outliers than the model for iron can be explained by physics: A decrease in the detector efficiency function is caused by a contamination layer on the detector’s surface. The number of X-ray photons that reach the detector is inversely proportional to the thickness of the contamination layer. However, highly energetic photons will not be absorbed by the contamination layer. The characteristic energies for Na Kα and Fe Kα photons are 1.02 and 6.4 keV, respectively. Hence, one may expect the peaks corresponding to iron photons to be affected far less by the lower detector efficiency than the sodium peak.

We now describe the results for this data set. A PLS and a PRM model were constructed for the calibration of the sodium and iron concentrations. In the former model eight latent variables were used, whereas in the latter seven were used, as in Lemberge et al.41 The concentrations of Na2O and Fe2O3 were estimated for the validation set. The root mean squared errors of prediction (RMSEPs) are given in Table 1. We also state the respective RMSEPs obtained by Lemberge et al.41 after removal of the leverage points from the training set. The latter are reported in Table 1 under the heading ‘cleaned’ data, in contrast to the ‘original’ data.

Table 1. Root mean squared errors of prediction for the EPXMA data set using the PLS and PRM estimator, once using the original training sample and once using a clean version of the training sample

          Na2O                 Fe2O3
          Original   Cleaned   Original   Cleaned
PLS       2.66       1.26      0.14       0.12
PRM       1.50       –         0.10       –

It is observed (see Table 1) that indeed for sodium oxide, the classical PLS model is vastly affected by the leverage points, as the RMSEP is almost double compared to the cleaned data set. Using a robust method (PRM), the effect of the leverage points can to a large extent be countered, although in comparison with the cleaned classical model there is still a nonnegligible increase in RMSEP.

For iron(iii) oxide, the results also show the expected trends: The ‘outliers’ are not as outlying as in the model for sodium oxide (due to the higher energy of the iron Kα characteristic photons), and thus PLS performs only slightly worse on the data set containing outliers than on the cleaned data set, whereas PRM surprisingly performs best of all.

From these results, we can conclude that at the very least a lot of time could have been gained by using a robust method. Depending on the requested accuracy, PRM could have been used directly for calibration or for detection of the outliers, which in both cases would have eliminated the tedious spectrum evaluation step in which the outliers were detected manually.
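
To indicate what the classical part of such a calibration looks like in code, here is a hedged R sketch using the pls package with simulated stand-in spectra; the EPXMA data and the PRM estimator are not reproduced, and a robust fit would replace the plsr call in the robust analysis.

# Classical PLS calibration and RMSEP on a held-out set (simulated stand-in data)
library(pls)
set.seed(9)
n <- 80; p <- 50
spectra <- matrix(rnorm(n * p), n, p)                   # stand-in for EPXMA spectra
conc <- as.vector(spectra[, 1:5] %*% runif(5)) + rnorm(n, sd = 0.1)
train <- 1:41                                           # mimics the 41 training spectra
d_train <- data.frame(conc = conc[train]); d_train$spectra <- spectra[train, ]
d_test  <- data.frame(conc = conc[-train]); d_test$spectra <- spectra[-train, ]
fit <- plsr(conc ~ spectra, ncomp = 8, data = d_train)  # eight latent variables, as above
pred <- drop(predict(fit, newdata = d_test, ncomp = 8))
sqrt(mean((pred - d_test$conc)^2))                      # RMSEP on the validation samples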


URL: https://www.sciencedirect.com/science/article/pii/B9780444527011001137

What are high leverage points in regression?

Simply put, high leverage points in linear regression are those with extremely unusual independent variable values in either direction from the mean (large or small). Such points are noteworthy because they have the potential to exert considerable “pull”, or leverage, on the model's best-fit line.

What is a leverage point in regression?

Def 0.1 (Leverage and influence points).
• A leverage point is an observation that has an unusual predictor value (very different from the bulk of the observations).
• An influence point is an observation whose removal from the data set would cause a large change in the estimated regression model coefficients.

What is a high leverage?

When one refers to a company, property, or investment as "highly leveraged," it means that item has more debt than equity. The concept of leverage is used by both investors and companies. Investors use leverage to significantly increase the returns that can be provided on an investment.

What is an example of a leverage point?

An example of a low leverage point would be pushing on the side of a ship to change its course. This would require a large amount of force to have the intended effect. But if the high leverage point of pushing on the rudder is used instead, it takes only a small amount of force to achieve the same effect.