The data compression method
The largest peak in a Eu157 spectrum is shown on a linear scale above its background, along with a compressed set of unequally spaced points representing this region.
This is a reasonable compression from 160 points to 16. Note that one of the curves is the fit and the other the data, and that on this scale the standard deviation in the data points is barely visible. With more complexity and a log scale the result is
These pictures show that the best set of points representing the spectrum concentrates points in regions where the second derivative is large.
More rigorously, the error in a trapezoidal-rule integration of a function from a to b is

e(a,b) = ((b - a)^3 / 12) |f''|

with f'' evaluated numerically using the function values one position lower and one position higher. This is used to assign an integral error estimate to the L - 1 intervals present with L xi values. The (L+1)th xi is then placed in the middle of the interval with the largest integral error estimate, the affected integral error estimates are re-evaluated, and the process is repeated until all M xi values have been assigned. Because the residuals are found from the original data rather than the compressed data, we can be rather conservative in our estimate of f''. That is, we can replace abs(f'') by max(10^-10, abs(f'') - 3e_f''), where e_f'' is the standard deviation in the estimate of f''. This enables the same scheme to be used for background data compression, where the robust fitting makes e become very large in the vicinity of peaks.
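The point-placement scheme described above can be sketched as follows. This is a minimal reconstruction, not the ROBFIT code; the function name `compress`, the midpoint split, and taking the maximum conservative curvature over each interval are my assumptions:

```python
import numpy as np

def compress(f, e, M):
    """Greedy compression sketch: start with the two endpoints, assign
    each interval a trapezoidal-rule error estimate, and repeatedly
    split the worst interval at its midpoint until M points are kept."""
    N = len(f)
    # f'' from the function values one position lower and one higher
    fpp = np.zeros(N)
    fpp[1:-1] = f[2:] - 2.0 * f[1:-1] + f[:-2]
    # its statistical error, propagated from the point errors e
    efpp = np.zeros(N)
    efpp[1:-1] = np.sqrt(e[2:]**2 + 4.0 * e[1:-1]**2 + e[:-2]**2)
    # conservative curvature: max(1e-10, |f''| - 3*e_f'')
    curv = np.maximum(1e-10, np.abs(fpp) - 3.0 * efpp)

    def interval_error(a, b):
        # trapezoidal-rule error (b - a)^3 / 12 * |f''| for [a, b],
        # taking the largest conservative curvature in the interval
        return (b - a)**3 / 12.0 * curv[a:b + 1].max()

    xi = [0, N - 1]
    while len(xi) < M:
        errs = [interval_error(a, b) for a, b in zip(xi[:-1], xi[1:])]
        k = int(np.argmax(errs))
        mid = (xi[k] + xi[k + 1]) // 2
        if mid == xi[k]:          # interval of width 1: cannot split
            break
        xi.insert(k + 1, mid)
    return np.array(xi)
```

Because the intervals carrying the largest error estimates are always split first, the retained points pile up where the curvature term is large, exactly the behavior noted for the peak regions above.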
In this work the function f(i) with 0 < i < N+1 is the i'th data point, and the object of the data compression is to find a set of M < N xi's, fi's, and ei's representing the full function f(i). Note that ei is the standard deviation in the function estimate f(xi), e_f'' is the error estimate in f'' calculated from the ei, while e(a,b) is the error in the trapezoidal-rule estimate of the integral of f(x) from a to b which would be present even if the ei were zero. The f(xi)'s returned are weighted fits of the form

f(x) = c0 + c1(x - xi) + c2(x - xi)^2 + ...
evaluated at xi and fitted to all data between .5(x(i-1) + xi) and .5(xi + x(i+1)). The standard deviation in fi is that in c0, which is given by the (0,0) element of the inverse matrix involved in the curve fit. A bit of experimentation was done to find the best number of coefficients to include in the fit. The final numbers were 1 for 3 or fewer points, 2 for 4 and 5 points, 3 for 6 to 50 points, 4 for 51 to 100 points, and 5 for over 100 points. This appeared to be a reasonable compromise between the smaller weights caused by the internal fitting and the need for enough accuracy to follow the peak shapes.
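The local weighted fits can be sketched as standard weighted least squares, with the error in fi read off the inverse normal matrix as described above. This is an illustration of the scheme, not ROBFIT's routine; the names are hypothetical:

```python
import numpy as np

def ncoef_rule(npts):
    """Coefficient counts quoted in the text: 1 for <=3 points,
    2 for 4-5, 3 for 6-50, 4 for 51-100, 5 for over 100."""
    if npts <= 3:
        return 1
    if npts <= 5:
        return 2
    if npts <= 50:
        return 3
    if npts <= 100:
        return 4
    return 5

def local_fit(x, y, e, xc, ncoef):
    """Weighted polynomial fit about xc; returns (f(xc), sigma) where
    sigma comes from the (0,0) element of the inverse normal matrix,
    i.e. the standard deviation in c0."""
    x, y, e = map(np.asarray, (x, y, e))
    w = 1.0 / e**2                                   # weights 1/e^2
    A = np.vander(x - xc, ncoef, increasing=True)    # 1, (x-xc), ...
    AtWA = A.T @ (w[:, None] * A)                    # normal matrix
    c = np.linalg.solve(AtWA, A.T @ (w * y))
    cov = np.linalg.inv(AtWA)                        # covariance of the c's
    return c[0], np.sqrt(cov[0, 0])
```

Expanding about xc makes c0 the fitted value at the compressed point, so its variance is directly the (0,0) element of the covariance matrix.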
The first L points are the xi's at the beginning of the interval, at the end of the interval, at the location of each peak, at each peak + and - .5 times its full width at half maximum, at the largest residual, and at the largest residual + and - .5 times the full width at half maximum of each peak type.
Testing the size of M
The data set used to test the size of M needed is ZTCASE1.SP. As described in the ROBFIT book, this test case consists of 1000 randomly generated data points with a declining exponential background and a single peak at x = 850, with w = 46.9 and strength = 10000. As fitted with all the points and a cutoff of 20 this is
With the background subtracted out, the peak can be compressed to 16 points as
with each point weighted more strongly than before to represent the compression
A series of tests were carried out using various numbers of points to represent the compressed peak.
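For illustration, a data set of the ZTCASE1.SP type can be generated along the following lines. Only the peak position x = 850, w = 46.9, and strength = 10000 come from the text; the background shape and amplitude, the reading of w as a FWHM, and the Poisson noise model are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000)
w, strength = 46.9, 10000.0
sigma = w / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # FWHM -> Gaussian sigma
background = 400.0 * np.exp(-x / 400.0)          # illustrative decline
peak = (strength / (sigma * np.sqrt(2.0 * np.pi))
        * np.exp(-0.5 * ((x - 850.0) / sigma)**2))
data = rng.poisson(background + peak)            # counting statistics
```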
M          chi2    x       ex    w      ew    Strength  estrength
8          1067   849.04  0.49  45.35  1.29    9987      231
16         1058   849.58  0.46  45.35  1.27    9745      212
32         1031   850.06  0.54  47.21  0.85   10228      225
64         1032   849.70  0.30  47.12  0.52   10159      153
128        1030   849.83  0.24  47.21  0.45   10133      117
256 (all)  1029   850.05  0.23  47.01  0.44   10097      113
actual            850           46.9          10000      100
Note that the statistical lower limit to the error in the peak strength is the square root of its strength.
In the present settings of ROBFIT, M is set to 64 times the number of peaks for the preliminary peak settings; then at least one pass is made during the refit part with the maximum possible number of points, presently 1700. This is because it is always desirable to minimize the error in the quantities calculated. The following test is a fit to S1.SP, which is a NaI spectrum of Cs137 + Co60. The first standard was Gauss, then the Cs peak was fitted to the S1.WDA file generated from FSPDIS. There were 30 background constants, and the entire fit shown took about 2 hours.
Spurious peaks
With only a few points, however, it is always possible for a peak to slip between the cracks as the following illustrates
The spurious peak will be removed on refitting in the next pass, when a point will be introduced at its peak. It occurs in a region where there is a gentle hump much wider than the peak, and the width biasing helped to force it into this region. In the fitting, the peak is introduced in QMIN, where the data are the compressed boxes, but the chi-square is calculated by PFIT. The chi-square can jump from 10^7 to 10^22, which seems horrible, but is still recoverable when the extra point is introduced. A more typical data compression is shown below
where 60 points are doing the work of 1760 points.
Background error estimation
An inelegant but reliable way of estimating the error is to look at the fit to the (ie - ib + 1)/ncon points in the vicinity of the x reported, and to take the error as that given by a one-constant fit to f - fA in this region.
Note that in the case of a "perfect" fit, (fi - fA,i)^2 averages to e^2 = 1/w; thus for a 1-constant fit to one data point with error 1/e
Uneven Background data compression.
The background compression before
The chi-square is below 4096 because the error is the maximum of 1 or sqrt(data), which means that the data points above channel 2000 contribute very little to the chi-square. The trick in compressing these data is to start with half the points evenly spaced and the same robust weighting (dividing high points by A/(1 + alpha*res^2) and multiplying low points by B(1 + alpha*res^2)) described in the book and used for the previous data compression method. The first constants are determined using a total of 64 data points, the rest with 8 times the number of constants. The 160 points used to determine 20 constants are shown below
Note that because of the robust weighting the big peaks do not stand out above the background; in fact, the apparent peaks above are not the big peaks but rather the Compton peaks falling below the big peaks. The old evenly spaced data points would have been 6.4 channels apart, and a large part of the reason for such a close spacing was the difficulty of getting enough data points to fit the initial rise at the beginning of the spectrum. The blowup of the first 100 channels below shows that, in addition to cutting the required data points by a factor of 4, we have put 21 rather than 15 in this region and have placed 10 in the first 15 channels, completely solving this problem.
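The robust reweighting used for the background compression can be sketched as follows. The text's formula reads "dividing high points by A/(1 + alpha*res^2) and multiplying low points by B(1 + alpha*res^2)"; the sketch implements the suppression reading, in which the weights of points above the fit are divided by A(1 + alpha*res^2), matching the later remark that the big peaks appear as suppressed weights. The values of A, B, and alpha are illustrative:

```python
import numpy as np

def robust_weights(w, res, alpha=0.1, A=1.0, B=1.0):
    """Robust reweighting sketch (assumed form): points above the fit
    (positive residual, typically peaks) have their weight divided by
    A*(1 + alpha*res**2); points at or below the fit have it multiplied
    by B*(1 + alpha*res**2), de-emphasising peaks in a background fit."""
    res = np.asarray(res, dtype=float)
    w = np.asarray(w, dtype=float).copy()
    factor = 1.0 + alpha * res**2
    high = res > 0.0
    w[high] /= A * factor[high]        # suppress points above the fit
    w[~high] *= B * factor[~high]      # boost points below the fit
    return w
```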
A blow up in the region of the first peak at channel 930
shows its complete omission from the background data, while points are placed relatively closely on both sides of it. An interesting feature (documented bug) of the minimization comes from the fact that as the background fits improve, points move slightly closer to the peak, causing larger misses; so while the chi-square decreases as the constants are optimized for each choice of data points, it frequently increases with each new choice of data points, so that overall little progress appears to be made. Recall, however, that the goal is to fit the data, not to minimize the chi-square.
The weights including the robust part are shown below
where the lower curve is before data compression. Note that the presence of the peaks as suppressed weights is obvious in this plot.
A fit was made, requiring about 4 hours of run time using 60 constants.
At the 42-constant level, the fit dived beyond the second peak in the usual manner for this spectrum. The option was thus changed from FITB to FBKG so that, as with the old code, 46 constants started with a flat background beyond the second peak. The fit then dived beyond the third peak, so the FBKG option was continued to give 54 constants and a region beyond the third peak that started flat. The option was then changed back to FITB and we went to 60 constants, which is the highest number for which both the multiplier and the knot constants are varied. The 0 to 500 channel region
shows that, as hoped, the fit, which started at channel 45, does match the startup part of the spectrum and, as it should, is almost completely oblivious to the small peak. Note that this is BKGFIT only; no peaks have been added to the data. At this level of fitting, the bump between channels 200 and 250 would eventually be made a peak. The next set of revisions to the fit will attempt to allow BKGFIT to follow this by allowing many more constants. Note that this bump needs about 8 constants to be fit.
The fit properly totally ignores the escape peak at channel 525, needs about 4 more constants to fit the BGO influenced Compton peak at channel 725 and probably 4 more to get a perfect fit to the photopeak at 930.
A classic background, with a small peak at channel 1100 and the escape peak at channel 1380 clearly separated out. (The high energy escape peak is much smaller than the low energy one because of the increased probability of detection of the higher energy cascade associated with it.)
In summary, it appears that a "fair" background fit has been achieved with this 4-hour fit, but that on the order of 20-30 more constants are needed to make this a "good" fit. The time at present goes roughly as the 4th power of the number of constants; thus going from 60 to 90 requires 5 times as long, or about 20 hours: within reach, but not fast.
Iteratively varying the constants
The coding for passing an integer array to decide whether a variable is to be held constant already exists in NLFIT, dating back to the version developed for Alex Green's ICAS group. This allows the code to vary only a few of the variables returned by POLY, with the same coding for any set of variables.
In general, the region between a pair of knots is a cubic polynomial (4 adjustable parameters), with the function and the first and second derivatives at the left side determined by the polynomial to the left of the first knot, and the function and first and second derivatives on the right side determined by the polynomial on the right. (4 constraining conditions on the left, plus the value at the solid knot marking the largest residual, plus a zero first derivative at the solid knot so that it is extremal, plus a zero third derivative to give some semblance of symmetry, for a total of 7 conditions, equal to the constants at the six open circles, which are the old knots, plus the constant at the closed circle, which is the new knot located at the largest residual.)
In the region shown
Poly = Poly_old + SUM(i=1..7) ci (xi - x)+^3

where (xi - x)+ means max(xi - x, 0) and the xi are in turn the seven knots x1 through x7. At x1, to avoid changing Poly for x < x1, the conditions S(x1) = 0, S'(x1) = 0, S''(x1) = 0, and S'''(x1) = 0 must be satisfied. At x4 the conditions are S(x4) = res, S'(x4) = 0, and S'''(x4) = 0. This implies that 7 knots, or 14 constants, is the minimum number to be fitted at once.
The proposal is to fit those around each introduced knot, and also the entire set 14 constants at a time. This will make each fit 64 times as fast as the current 56-constant fit, while because of the local, one-dimensional nature of the fit the full accuracy should be maintained. The method is to begin with the first 4 and last 10 constants, then the last 14, then the last 14 minus the last 4, etc., requiring 14 sets to go through all 56 and reducing the apparent speed-up to a factor of 14; then maybe 2-3 iterations to converge, reducing the overall speed-up to a factor of 2. This will become greater because for most sets the minimum will require only 3-5 ROBMIN steps instead of the current 150. The overall expectation is a factor of 5-10, which makes the above 20-hour estimate drop to a doable 2-4 hours and will allow us to use up to 90 coefficients and to include all non-photopeak features in the background.
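The blockwise scheme above can be sketched as block-coordinate minimization. This is an assumed reading of the text: each set is 14 consecutive constants (cyclically, so the first set is the first 4 plus the last 10), slid by 4 to give 14 sets for 56 constants, with a Newton step standing in for the ROBMIN minimization of each set:

```python
import numpy as np

def block_cycle(n=56, block=14, step=4):
    """The 14 overlapping sets for a 56-constant fit: each set is
    `block` consecutive constants (cyclically), slid back by `step`,
    starting from the first 4 plus the last 10."""
    sets = []
    s = n - (block - step)           # 46: the last 10 wrap to the first 4
    for _ in range(n // step):       # 14 sets cover all 56 constants
        sets.append([(s + j) % n for j in range(block)])
        s -= step
    return sets

def block_minimize(grad_hess, c, sets, iters=3):
    """Block-coordinate Newton sketch: for each set in turn, take a
    Newton step in those constants only, holding the others fixed.
    grad_hess(c) must return (gradient, Hessian) of the chi-square."""
    for _ in range(iters):
        for idx in sets:
            g, H = grad_hess(c)
            sub = np.ix_(idx, idx)
            c[idx] -= np.linalg.solve(H[sub], g[idx])
    return c
```

Each block step solves only a 14x14 system instead of the full 56x56 one, which is where the per-fit speed-up in the text comes from.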
The equations for the constants in terms of knots at x1,...,x7 are

c5(x5-x4)^3 + c6(x6-x4)^3 + c7(x7-x4)^3 = Alr
c5(x5-x4)^2 + c6(x6-x4)^2 + c7(x7-x4)^2 = 0
c5 + c6 + c7 = 0
c2(x2-x1)^3 + c3(x3-x1)^3 + c4(x4-x1)^3 + c5(x5-x1)^3 + c6(x6-x1)^3 + c7(x7-x1)^3 = 0
c2(x2-x1)^2 + c3(x3-x1)^2 + c4(x4-x1)^2 + c5(x5-x1)^2 + c6(x6-x1)^2 + c7(x7-x1)^2 = 0
c2(x2-x1) + c3(x3-x1) + c4(x4-x1) + c5(x5-x1) + c6(x6-x1) + c7(x7-x1) = 0
c1 + c2 + c3 + c4 + c5 + c6 + c7 = 0

which is a rather easy set of equations to solve for the c's using GaussJ.
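Using numpy's solver in place of GaussJ, the system can be set up and solved as follows (a sketch: the knot array, the function names, and the truncated-cubic basis (xi - x)+^3 follow the equations above, with x[3] the new knot at the largest residual):

```python
import numpy as np

def bump_coefficients(x, Alr):
    """Solve the 7x7 system for c1..c7 (a numpy stand-in for GaussJ).
    x holds the 7 knots with x[3] the new knot at the largest
    residual; Alr is the residual amplitude to be matched there."""
    A = np.zeros((7, 7))
    b = np.zeros(7)
    A[0, 4:] = (x[4:] - x[3])**3; b[0] = Alr   # S(x4) = Alr
    A[1, 4:] = (x[4:] - x[3])**2               # S'(x4) = 0 (extremum)
    A[2, 4:] = 1.0                             # S'''(x4) = 0 (symmetry)
    A[3, 1:] = (x[1:] - x[0])**3               # S(x1) = 0
    A[4, 1:] = (x[1:] - x[0])**2               # S'(x1) = 0
    A[5, 1:] = x[1:] - x[0]                    # S''(x1) = 0
    A[6, :] = 1.0                              # S'''(x1) = 0
    return np.linalg.solve(A, b)

def bump(x, c, t):
    """Evaluate S(t) = sum_i ci*(xi - t)^3 over knots xi > t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    d = np.maximum(x[None, :] - t[:, None], 0.0)
    return (d**3 * c).sum(axis=1)
```

Note that the basis term (xi - t)+^3 vanishes for t beyond x7, so the bump leaves the background untouched to the right of the last knot, just as the x1 conditions keep it untouched to the left.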
The largest residual must be looked at carefully. If the background were a simple spline, the above would simply require adding the polynomial found above to the original one. The background, however, is frequently B(x) = exp(spline(x)); then at the largest residual we have

exp(S(x) + P(x)) - exp(S(x)) = res
exp(P(x)) - 1 = res/exp(S(x)), and for small P(x)
P(x) ≈ res/exp(S(x)) = res/data(x),

while for all x, exp(P(x)) = 1 + res/exp(S(x)), i.e. P(x) = log(1 + res/exp(S(x))).
The result with the knot locations 1901.7, 1904.53, 1914.9, 1915 (new), 1916.26, 1958.49, 1961.84 is shown above. A series of tests was made using the exponential of this as the starting guess for the background in this location. It was found that this caused the constants to move sufficiently far from the minimum chi-square constants that unwanted changes occurred, making it very time consuming to undo these changes in the set of constants. It was also found that treating the constants 14 at a time is sufficient to bring this kind of form in by simple minimization of the chi-square, without introducing the complications caused by directly introducing the residuals. The coding to put the bump in is saved as BKGBUMP.FOR.
The result of fitting with a cutoff of 3 and 90 background constants is shown below.
1
-34.6747568573
 .850897469804E-01
-.292212452983E-04
 .316505842923E-08
-.159050508918       64.3178669192
 .157487088598       64.3386768227
 .104040637145E-04   152.021768383
-.130887458740E-04   181.222502149
 .310003678208E-04   237.638085538
-.254937758461E-04   247.539680436
 .658534851199E-06   331.432447181
-.146960142872E-06   528.919109122
 .346412503538E-06   601.684672803
 .942382257901E-05   680.466510498
 .361480337194E-03   710.639227104
-.752824822864E-03   710.542056090
 .135616560370E-02   719.692292898
-.101863256345E-02   724.014383374
 .445294331757E-04   747.424086309
-.123671470402E-03   897.709272486
 .203495216746E-03   905.558535094
-.248671902428E-04   921.953457452
-.694717441184E-04   923.490295466
 .141116645838E-04   956.685885175
 .496052801281E-06   1546.04001754
-.134522629247E-04   1652.91433572
-.843301519471E-04   1655.30294659
 .934062672898E-04   1666.21505546
 .171025883990E-03   1667.79946595
-.681498243200E-05   1667.99981208
 .455935259702E-04   1667.63756721
-.870857877052E-05   1672.69091688
-.433490860243E-04   1674.24101883
-.500296706837E-04   1674.40315130
-.754873019281E-04   1674.52365316
-.453913085719E-04   1674.71189088
 .121362938418E-04   1684.36734832
 .503262110664E-05   1721.79584931
 .102928680927E-04   1826.28218765
-.302106157150E-04   1844.94687672
 .280595495424E-04   1882.22457411
 .110512310263E-02   1921.43483868
-.174644916499E-02   1925.39239882
 .252069978321E-03   1930.17190842
 .380972319251E-03   1933.20343047
 .492311925420E-08   2686.02335487
-.843725461779E-13   84117.1221349
FIT TO \ROBFIT\TCASES\ZTCASE7.SP 7.5 hrs
IP= 2
18 PEAKS 4052 CHAN, CHIS= 2503. NITB= 90 90
CUTOFF= 3.00
1.0732 1.0000 1.4375 1.0000
55.023 .165 2.938 .367 478.451 54.48 1
71.326 .060 1.735 .156 764.420 56.33 1
73.789 .036 1.664 .073 1224.24 54.12 1
84.294 .111 1.983 .283 339.463 38.45 1
87.393 1.356 .845 4.386 103.967 551.4 1
90.128 .190 1.500 .352 117.753 26.99 1
205.064 .472 6.809 1.065 388.520 54.63 1
242.464 .433 2.737 .800 155.028 44.49 2
526.349 .094 3.022 .229 712.548 46.10 1
530.370 .378 2.000 .958 77.1283 31.28 1
913.617 .230 1.438 .951 59.9639 19.39 1
929.419 .001 1.049 .002 96469.3 350.0 2
1100.739 .267 1.951 .632 41.7042 11.73 1
1373.290 .125 2.986 .270 242.183 20.60 1
1514.755 .274 2.230 .638 58.1154 14.58 1
1667.731 .415 4.398 .749 312.492 53.23 2
1905.358 .005 2.640 .009 57437.0 240.0 1
2841.038 .082 2.994 .163 211.134 13.23 1
\ZTCASE7 2503.07633616
2
\ROBFIT\TCASES\TCASE7.ST
\ROBFIT\TCASES\TCASE7LE.ST
\
0
\ROBFIT\TCASES\ZTCASE7.SP
NONE
\ROBFIT\TCASES\TCASE7.CN
\ROBFIT\TCASES\TCASE7.PK
45 4096
6.17405063291 .151898734177E-02
6.17405063291 .151898734177E-02