-
Notifications
You must be signed in to change notification settings - Fork 16
SHRIMP: Sub WtdLinCorr
[The following is a generalised function, used and reused by SQUID 2.50 on at least a couple of different occasions during a 'typical' data-reduction. As such, the inputs and outputs of the functions have generalised names. The identities of the specific inputs and outputs are defined when the function is called.]
The function calculates EITHER the error-weighted linear regression of Dodson-interpolated nuclide ratios (or Task equations) vs time, with the regression-error calculated at analysis midtime OR the error-weighted mean of Dodson-interpolated nuclide ratios (or Task equations) and its error, independent of time.
Note that in BOTH cases, the error-weighting relates solely to the y-errors (x-errors associated with timestamps, if applicable, are assumed zero throughout), and that in BOTH cases, the calculations use index-adjacent ratio error-correlations as described by Ludwig (2009; http://dx.doi.org/10.1016/j.chemgeo.2009.07.004).
Once the value, its error, and its MSWD and probability of fit have been determined, they are tested against a threshold for statistical coherence (probability of fit > 0.1). If this test is failed, each of the constituent data-points are deleted (sequentially, one at a time) and the effect of each rejection on the MSWD derived from the remaining points is assessed. If a deletion decreases the MSWD of the remaining points by a specified factor (which depends on the number of data-points in the input, and the type of calculation specified) AND statistical coherence is attained, then the (single) rejection is permitted. If the number of data-points in the input is relatively large (>8 for weighted average, >9 for linear regression), then a search for a second rejection is permitted.
WtdLinCorr(Avg1LinRegr2, N, Y, SigRho, MSWD, Probfit, WLrej, Intercept, SigmaIntercept, Bad, Slope, SigmaSlope, CovSlopeInter, X)
Avg1LinRegr2: This is a 'integer Boolean' input specifying the type of calculation to be performed (note the integer value assigned is used in some of the following equations). As the name suggests, permitted values are 1 (for calculation of a time-independent error-weighted average and its error) or 2 (for calculation of an error-weighted linear regression vs time, with the regression-error calculated at analysis midtime).
N: Integer defining the number of Dodson-interpolated data-points submitted to the calculation, as an input.
Y: A vector array of length N, containing the y-values of the Dodson-interpolated data-points submitted to the calculation, as an input.
SigRho: A N x N array defining the sigma-rho matrix associated with the vector Y, as an input. The diagonal elements are the sigma values associated with each Y value, and the (symmetric) immediate off-diagonal elements are the correlation coefficients between time-adjacent Y values. All other elements are zero; more details below.
WLrej: Integer (= 0 at start of each 'ratio of interest' and each analysis) incremented each time a valid rejection is identified.
Intercept: The calculated 'mean' y-value calculated (EITHER error-weighted linear regression to analysis midtime (NOT the y-intercept at x = 0!) OR as error-weighted average independent of time), as an output.
SigmaIntercept: The calculated 1-sigma absolute error of the Intercept, as an output.
Bad: Boolean for error-trapping (=FALSE at start).
MSWD: The calculated Mean Square of Weighted Deviates associated with the Intercept, as an output.
ProbFit: The calculated probability of fit associated with the MSWD and (N - Avg1LinRegr2) degrees of freedom, as an output.
[These apply ONLY when Avg1LinRegr2 = 2]
X: A vector array of length N, containing the x-values (e.g. time-stamps) of the data-points, as input to the calculation.
Slope: The calculated slope of the linear regression, as an output.
SigmaSlope: The calculated 1-sigma absolute error of the Slope, as an output.
CovSlopeInter: The calculated covariance of the Slope and the y-intercept (i.e. at x = 0) of the linear regression, as an output.
Value of type Boolean
Bad, LinReg
Values of type Integer
i, j, MaxRej, MinIndex, Nw, Pass, WLrej
Values of type Double
f, MaxProb, MinProb, MinMSWD, MswdRat, MswdRatToler
Vector of length N (addressed 1 to N) comprising values of type Boolean)
Rej
Vectors of length N (addressed 1 to N) comprising values of type Double
x1, x2, y1, y2, SigmaY
Vectors of length N + 1 (addressed 0 to N) comprising values of type Double
InterW, SigmaInterW, MswdW, Prob, ProbW, SlopeW, SigmaSlopeW, CovSlopeInterW
Arrays of size N x N (addressed [1 to N, 1 to N]) comprising values of type Double
SigRho1, SigRho2
In theory, WtdLinCorr should never be invoked unless N > 1. But in addition to this, Ludwig specified that linear regression to analysis midtime calculation is not permitted if N < 4, regardless of what is specified in the user-preferences (i.e. if the user has requested linear regression to analysis midtime, but N = 2 or 3, then the calculation is performed as error-weighted average instead). These scenarios are best handled by logic preceding the call of WtdLinCorr (and that is indeed how the code continues immediately beyond the current extent of my documented Step 3), but the WtdLinCorr routine itself contains a check, as a (theoretically obsolete) backup.
If (Avg1LinRegr2 = 1 and N < 2) Or (AvgLinRegr2 = 2 and N < 4) Then
--[error-message: Not enough data-points for requested calculation]
Exit Sub
End If
The first step is to define the LinReg Boolean, and to select the threshold value of MswdRatToler (which is the factor by which MSWD must improve following the deletion of a point, for that deletion to be considered a valid improvement to the data). For LinReg:
Bad = False
If Avg1LinRegr = 2
LinReg = True
Else
LinReg = False
End If
(I know that code-blocks like this are significantly more long-winded than necessary, but I am keen to keep my documentation of the code as explicit and human-readable as possible, particularly with respect to Booleans. Ludwig made extensive use of abbreviated, implicit syntax for Booleans, and the result is that sections of the VBA code are very difficult to understand, particularly in combination with logic tests.)
The candidate values for MswdRatToler are hard-wired (see MswdRatList below), with the selected value dependent on N and (for N values at the lower end of the range) Avg1LinRegr2, as follows:
If N > 7
MswdRatToler = 0.3
Else
MswdRatList = [0, 0.1, 0.15, 0.2, 0.2, 0.25]
MswdRatToler = MswdRatList[N - Avg1LinRegr2]
End If
Next, establish the maximum number of rejections permitted from the set of Dodson-interpolated values, as a function of both the number of values and the type of calulcation desired. The VBA code verbatim is:
MaxRej = 1 + (N - Avg1LinRegr2) \ 8
In Visual Basic, it seems that this final term is defined as a division that yields an integer result, with any remainder discarded (https://msdn.microsoft.com/en-us/library/0e16fywh.aspx), i.e.
MaxRej = 1 + rounddown( (N - Avg1LinRegr2) / 8 )
Also define the minimum probability of fit (above which there is no need to test the effect of deleting each data-point in sequence), and associated parameters:
MinProb = 0.1
WLrej = 0
Pass = 0
MinIndex = -1
Fill the y1, y2, SigmaY, SigRho1, SigRho2, and (if required) x1 and x2 matrices from the input Y, SigRho, and X:
For i = 1 To N
y1[i] = Y[i]
y2[i] = Y[i]
SigmaY[i] = SigRho[i, i] --diagonal elements of SigRho
If LinReg = True
x1[i] = X[i]
x2[i] = X[i]
End If
For j = 1 To N
SigRho1[i, j] = SigRho[i, j]
SigRho2[i, j] = SigRho[i, j]
Next j
Next i
This is followed by a curious little manipulation in which all elements of SigmaY less than the median are replaced with the median, for the purpose of defining the diagonal elements of SigRho1 and SigRho2:
f = max( median(SigmaY), 1E-10 )
For i = 1 To N
SigRho1[i, i] = max( SigRho1[i, i], f)
SigRho2(i, i) = SigRho1[i, i]
Next i
This is followed by quite a long "do... loop", in which the linear regressions (or weighted means) are calculated firstly with all points included, tested against minimum criteria for statistical coherence (MinProb > 0.1), and if that test is failed, the calculation is repeated with each single data-point rejected sequentially. If the deletion of a single data-point improves the statistical coherence of the calculation significantly, one rejection is permitted (and a second is possible if N >= 8 or 9, depending on calculation type):
Do
For i = 0 To N
If i = 0
Nw = N --first pass calculation includes all points
Else
DeletePoint LinReg, N, y1, y2, SigRho1, SigRho2, i, x1, x2
--Subroutine 'DeletePoint' documented separately
Nw = N - 1
End If
If Nw = 1 And LinReg = False
--The VBA code verbatim does not include "LinReg =": I have inferred it,
--but it makes sense, and I see no other obvious Boolean candidates.
ProbW[i] = 1
MswdW[i] = 0
SigmaInterW[i] = 1
InterW[i] = 1
ElseIf LinReg = True --perform the linear regression calculation
WeightedLinearCorr x2, y2, Nw, SigRho2, SlopeW[i], InterW[i], MswdW[i],...
ProbW[i], SigmaSlopeW[i], SigmaInterW[i], CovSlopeInterW[i], Bad
--Subroutine 'WeightedLinearCorr' documented separately
Else --perform the error-weighted average calculation
WtdAvCorr y2, SigRho2, Nw, ...
InterW[i], SigmaInterW[i], MswdW[i], ProbW[i], True, Bad
--Subroutine 'WtdAvCorr' documented separately
End If
If i = 0
--If ProbW[0] > 0.1 --this line is Ludwig's original; replaced by next line
If ProbW[0] > MinProb --no-rejection calculation is statistically coherent
MinIndex = 0
MinMSWD = MswdW[0]
Exit For --no need to trial sequential rejections
End If
--MaxProb = Prob[0] --this line is Ludwig's original; replaced by next line
MaxProb = ProbW[0]
End If
Next i
If MinIndex = 0 --no-rejection calculation is statistically coherent
Exit Do
End If
MinIndex = 0 --no-rejection calculation 'best' until proven otherwise
MinMSWD = MswdW[0]
For i = 1 To N
MswdRat = MswdW[i] / max( 1E-32, MswdW[0] )
--Three criteria must be met for a rejection to be permitted:
If MswdRat < MswdRatToler And MswdW[i] < MinMSWD And ProbW[i] > MinProb
Rej(i) = True
WLrej = 1 + WLrej
MinIndex = i
MaxProb = ProbW[i]
MinMSWD = MswdW[i]
End If
Next i
Pass = 1 + Pass
If Pass > 0 And (MinIndex = 0 Or Pass = MaxRej Or MaxProb > 0.1)
--MinIndex = 0 means EITHER the no-rejection calculation is coherent OR
--no single rejection yields enough improvement to be justified.
--Pass = MaxRej means the rejection 'ceiling' has been reached.
--MaxProb > 0.1 means statistical coherence has been achieved, EITHER
--via no-rejection calculation OR within permitted number of rejections.
Exit Do
End If
--If a rejection IS permitted BUT rejection 'ceiling' not yet reached
--AND statistical coherence not yet achieved, try another rejection:
DeletePoint LinReg, N, y1, y2, SigRho1, SigRho2, MinIndex, x1, x2
N = N - 1
--Redefine as vectors of length N (addressed 1 to N): x1, y1
--Redefine as vectors of length N + 1 (addressed 0 to N): Prob, Xbar
--Redefine as an array of size N x N (addressed [1 to N, 1 to N] ): SigRho1
For i = 1 To N
y1[i] = y2[i]
If LinReg = True
x1[i] = x2[i]
End If
For j = 1 To N - 1
SigRho1[i, j] = SigRho2[i, j]
Next j
Next i
Loop
Assign values obtained from EITHER the WeightedLinearCorr OR the WtdAvCorr subroutine, as follows:
Intercept = InterW(MinIndex)
SigmaIntercept = SigmaInterW(MinIndex)
MSWD = MswdW(MinIndex)
Probfit = ProbW(MinIndex)
If LinReg = True And MinIndex > 0 --need to check WHY only when MinIndex > 0
Slope = SlopeW(MinIndex)
SigmaSlope = SigmaSlopeW(MinIndex)
CovSlopeInter = CovSlopeInterW(MinIndex)
End If
If Probfit < 0.05
SigmaIntercept = SigmaIntercept * sqrt(MSWD)
If LinReg = True
SigmaSlope = SigmaSlope * sqrt(MSWD)
End If
End If
End Sub