The data compression method:

     The largest peak in a Eu157 spectrum is shown on a linear scale above its background, along with a compressed set of unequally spaced points representing this region

which is a reasonable compression from 160 points to 16.  Note that one of the lines is the fit and the other the data, and that on this scale the standard deviation in the data points is barely visible.  With more complexity, and on a log scale, the result is

 

Looking at these pictures it can be seen that the best set of points representing the spectrum concentrates points in regions where the second derivative is large. 

     More rigorously, the error in a trapezoidal-rule integration of a function from a to b is approximately

     e(a,b) = -(b-a)^3 f''/12

with f'' evaluated numerically from the function values one position lower and one position higher.  This error estimate is used to assign an integral error estimate to each of the L-1 intervals present when there are L xi values.  The (L+1)th xi is then placed in the middle of the interval with the largest integral error estimate, the affected integral error estimates are re-evaluated, and the process is repeated until all M xi values have been assigned.  Since the residuals are found from the original data rather than the compressed data, we can be rather conservative in our estimate of f''.  That is, we can replace abs(f'') by max(10^-10, abs(f'') - 3*ef''), where ef'' is the standard deviation in the estimate of f''.  This enables us to use this same scheme for background data compression, where the robust fitting makes e become very large in the vicinity of peaks.
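A minimal sketch of this placement loop is given below.  It is only an illustration of the scheme just described, not the ROBFIT routine: the function and variable names are invented, the data f and their errors ef are assumed to lie on unit-spaced channels, and initial_x is the starting set of L channel indices.

    import numpy as np

    def compress_points(f, ef, initial_x, M):
        """Greedy placement: repeatedly split the interval whose trapezoidal-rule
        error estimate, built from a conservative |f''|, is largest, until M
        points have been assigned."""
        x = sorted(initial_x)

        def interval_error(a, b):
            m = (a + b) // 2                               # channel used to estimate f''
            if m <= 0 or m >= len(f) - 1:
                return 0.0
            fpp = f[m - 1] - 2.0 * f[m] + f[m + 1]         # numerical second derivative
            efpp = np.sqrt(ef[m - 1]**2 + 4.0 * ef[m]**2 + ef[m + 1]**2)
            fpp_eff = max(1e-10, abs(fpp) - 3.0 * efpp)    # conservative |f''| as in the text
            return (b - a)**3 * fpp_eff / 12.0             # trapezoidal-rule error estimate

        while len(x) < M and len(x) >= 2:
            errs = [interval_error(x[i], x[i + 1]) if x[i + 1] - x[i] >= 2 else 0.0
                    for i in range(len(x) - 1)]
            if max(errs) <= 0.0:
                break                                      # no interval left worth splitting
            k = int(np.argmax(errs))
            x.insert(k + 1, (x[k] + x[k + 1]) // 2)        # new xi in the worst interval
        return x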

       In this work the function f(i) with 0 < i < N+1 is the i'th data point, and the object of the data compression is to find a set of M < N xi's, fi's and ei's representing the full function f(i).  Note that ei is the standard deviation in the function estimate f(xi), ef'' is the error estimate in f'' calculated from the ei, while e(a,b) is the error in the trapezoidal-rule estimate of the integral of f(x) from a to b which would be present even if the ei were zero.  The f(xi)'s returned are weighted fits of the form

     f(x) = c0 + c1(x - xi) + c2(x - xi)^2 + ...

evaluated at xi and fitted to all the data between the midpoint of [xi-1, xi] and the midpoint of [xi, xi+1].  The standard deviation in fi is that in c0, which is given by the 0,0 element of the inverse matrix involved in the curve fit.  A bit of experimentation was done to find the best number of coefficients to include in the fit.  The final numbers were 1 for 3 or fewer points, 2 for 4 or 5 points, 3 for 6 to 50 points, 4 for 51 to 100 points, and 5 for over 100 points.  This appeared to be a reasonable compromise between the smaller weights caused by the internal fitting and the need for enough accuracy to follow the peak shapes.
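A sketch of this local fit is shown below.  It is illustrative only (the names are not those of the ROBFIT code), but it follows the description above: weights 1/e^2, a polynomial in powers of (x - xi), and the error of c0 taken from the 0,0 element of the inverse normal matrix.

    import numpy as np

    def ncoef_for(npts):
        """Number of fit coefficients quoted in the text."""
        if npts <= 3:   return 1
        if npts <= 5:   return 2
        if npts <= 50:  return 3
        if npts <= 100: return 4
        return 5

    def local_fit(x, f, e, xi):
        """Weighted polynomial fit in powers of (x - xi) to the data assigned
        to xi; returns c0 and its standard deviation, i.e. fi and ei."""
        x, f, e = (np.asarray(a, dtype=float) for a in (x, f, e))
        n = ncoef_for(len(x))
        A = np.vander(x - xi, n, increasing=True)    # columns 1, (x-xi), (x-xi)^2, ...
        w = 1.0 / e**2                               # statistical weights
        N = A.T @ (w[:, None] * A)                   # normal matrix
        Ninv = np.linalg.inv(N)
        c = Ninv @ (A.T @ (w * f))                   # fitted coefficients
        return c[0], np.sqrt(Ninv[0, 0])             # f(xi) and its standard deviation

Each xi would be handed the data between the midpoints of its neighbouring intervals, as described above.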

      The first L points are the xi's at the beginning of the interval, at the end of the interval, at the location of each peak, at each peak plus and minus .5 times its full width at half maximum, at the largest residual, and at the largest residual plus and minus .5 times the full width at half maximum of each peak type.

Testing the size of M

    The data set used to test the size of M needed is ZTCASE1.SP.  As described in the ROBFIT book, this test case consists of 1000 randomly generated data points with a declining exponential background and a single peak at x=850 with w=46.9 and strength=10000.  As fitted with all the points and a cutoff of 20 this is

  With the background subtracted out, the peak can be compressed to 16 points as

with each point weighted more strongly than before to represent the compression

A series of tests was carried out using various numbers of points to represent the compressed peak.

 

M            chi2      x          ex       w        ew       strength   estrength

8            1067      849.04     0.49     45.35    1.29      9987      231
16           1058      849.58     0.46     45.35    1.27      9745      212
32           1031      850.06     0.54     47.21    0.85     10228      225
64           1032      849.70     0.30     47.12    0.52     10159      153
128          1030      849.83     0.24     47.21    0.45     10133      117
256 (all)    1029      850.05     0.23     47.01    0.44     10097      113

actual                  850                 46.9              10000      100

Note that the statistical lower limit to the error in the peak strength is the square root of its strength.

    In the present settings of ROBFIT, M is set to 64 times the number of peaks for the preliminary peak settings; then at least one pass is made during the refit part with the maximum possible number of points, presently 1700.  This is because it is always desirable to minimize the error in the quantities calculated.  The following test is a fit to S1.SP, which is a NaI spectrum of Cs137 + Co60.  The first standard was Gauss; then the Cs peak was fitted to the S1.WDA file generated by FSPDIS.  There were 30 background constants and the entire fit shown took about 2 hours.

Spurious peaks

With only a few points, however, it is always possible for a peak to slip between the cracks, as the following illustrates

The spurious peak will be removed on refitting in the next pass, when a point will be introduced at its location.  It occurs in a region where there is a gentle hump much wider than the peak, and the width biasing helped to force it into this region.  In the fitting, the peak is introduced in QMIN, where the data are the compressed boxes, but the chi-square is calculated by PFIT.  The chi-square can jump from 10^7 to 10^22, which seems horrible, but is still recoverable when the extra point is introduced.  A more typical data compression is shown below

where 60 points are doing the work of 1760 points.

Background error estimation

An inelegant but reliable way of estimating the error is to look at the fit to the (ie-ib+1)/ncon points in the vicinity of the x reported, and to take the error as that given by a one-constant fit to f-fA in this region.

Note that in the case of a "perfect" fit, (fi - fA,i)^2 averages to e^2 = 1/w, thus

for a 1 constant fit to one data point with error 1/e
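A sketch of this estimate is given below (illustrative only; f is the data, fA the fitted background, w the statistical weights, and the error of the single fitted constant is taken from the local scatter of the residuals).

    import numpy as np

    def local_background_error(f, fA, w, j, npts):
        """One-constant weighted fit to f - fA over the npts points nearest
        channel j; returns the error of the fitted constant as inferred from
        the local scatter."""
        lo = max(0, j - npts // 2)
        hi = min(len(f), lo + npts)
        r = np.asarray(f[lo:hi], dtype=float) - np.asarray(fA[lo:hi], dtype=float)
        ww = np.asarray(w[lo:hi], dtype=float)
        c = np.sum(ww * r) / np.sum(ww)                       # the one fitted constant
        dof = max(len(r) - 1, 1)
        var = np.sum(ww * (r - c)**2) / (dof * np.sum(ww))    # scatter-based variance of c
        return np.sqrt(var)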

Uneven background data compression

    The background compression before 7-12-94 was a weighted fit across N uniform-sized regions, where N is the minimum of 1024 and 32 times the number of constants.  In general the test case for background fitting is ZTCASE7, a Y88 spectrum in which the Compton features are easily seen.  The fit shown was made before data compression

The chi2 is below 4096 because the error is taken as the maximum of 1 and sqrt(data), which means that the data points above channel 2000 contribute very little to the chi2.  The trick in compressing this data is to start with half the points evenly spaced and the same robust weighting (dividing high points by A/(1+alpha*res^2) and multiplying low points by B(1+alpha*res^2)) described in the book and used for the previous data compression method.  The first constants are determined using a total of 64 data points, the rest with 8 times the number of constants.  The 160 points used to determine 20 constants are shown below

Note that because of the robust weighting the big peaks do not stand out above the background; in fact the apparent peaks above are not the big peaks but rather the Compton peaks falling below the big peaks.  The old evenly spaced data points would have been 6.4 channels apart, and a large part of the reason for such a close spacing was the difficulty of getting enough data points to fit the initial rise at the beginning of the spectrum.  The blowup of the first 100 channels below shows that, in addition to cutting the required data points by a factor of 4, we have put 21 rather than 15 points in this region and have placed 10 in the first 15 channels, completely solving this problem.

A blowup of the region of the first peak at channel 930

shows its complete omission from the background data, while points are placed relatively closely on both sides of it.  An interesting feature (documented bug) of the minimization comes from the fact that as the background fit improves, points move slightly closer to the peak, causing larger misses; so while the chi2 decreases as the constants are optimized for each choice of data points, it frequently increases with each new choice of data points, so that overall little progress appears to be made.  Recall, however, that the goal is to fit the data, not to minimize chi2.

   The weights including the robust part are shown below

where the lower curve is before data compression.  Note that the presence of the peaks as  suppressed weights is obvious in this plot.
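One reading of the robust reweighting described above is sketched below: the weights of points lying above the current fit (the peaks) are suppressed and those of points lying below are enhanced, by factors growing with res^2, which is what produces the suppressed weights visible at the peaks.  The exact grouping of A, B and alpha is that of the ROBFIT book rather than this sketch, and the values used here are placeholders only.

    import numpy as np

    # Placeholder values; the actual constants are those defined in the ROBFIT book.
    A, B, ALPHA = 2.0, 2.0, 0.1

    def robust_weights(counts, fit, w):
        """Suppress the weights of points above the current background fit and
        enhance those below, by factors that grow with the squared residual."""
        counts, fit, w = (np.asarray(a, dtype=float) for a in (counts, fit, w))
        res = counts - fit
        factor = 1.0 + ALPHA * res**2
        return np.where(res > 0.0, w / (A * factor), w * B * factor)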

    A fit using 60 constants was made, requiring about 4 hours of run time.

At the 42-constant level, the fit dived beyond the second peak in the usual manner for this spectrum.  The option was thus changed from FITB to FBKG so that, as with the old code, the 46-constant fit started with a flat background beyond the second peak.  The fit then dived beyond the third peak, so the FBKG option was continued to give 54 constants and a region beyond the third peak that started flat.  The option was then changed back to FITB and we went to 60 constants, which is the highest number for which both the multiplier and the knot constants are varied.  The 0 to 500 channel region

shows that, as hoped, the fit, which started at channel 45, does match the startup part of the spectrum and, as it should, is almost completely oblivious to the small peak.  Note that this is BKGFIT only; no peaks have been added to the data.  At this level of fitting the bump between channels 200 and 250 would eventually be made a peak.  The next set of revisions to the fit will attempt to allow BKGFIT to follow this by allowing many more constants.  Note that this bump needs about 8 constants to be fit.

 

The fit, quite properly, totally ignores the escape peak at channel 525, needs about 4 more constants to fit the BGO-influenced Compton peak at channel 725, and probably 4 more to get a perfect fit to the photopeak at 930.

A classic background, with a small peak at channel 1100 and the escape peak at channel 1380 clearly separated out.  (The high-energy escape peak is much smaller than the low-energy one because of the increased probability of detection of the higher-energy cascade associated with it.)

In this region the Compton peak could use 8 more coefficients.

A reasonable fit to the Compton part of the peak at 2830, and maybe a bit of an attempt to represent the 1-count peak at 3050 along with the small Compton contribution given by the occasional counts elsewhere.

    In summary, it appears that a "fair" background fit has been achieved with this 4-hour fit, but that on the order of 20-30 more constants are needed to make this a "good" fit.  The time at present goes roughly as the 4th power of the number of constants, so going from 60 to 90 constants requires (90/60)^4, or about 5 times as long, roughly 20 hours: within reach, but not fast.

Iteratively varying the constants.

   The coding for passing an integer array to decide whether a variable is to be held constant already exists in NLFIT, dating back to the version developed for Alex Green's ICAS group.  This allows the code to vary only a few of the variables returned by POLY, with the same coding for any set of variables.

   In general the region between a pair of knots is a cubic polynomial (4 adjustable parameters), with the function and the first and second derivatives at the left side determined by the polynomial to the left of the first knot, and the function and first and second derivatives on the right side determined by the polynomial on the right.  (That is 4 constraining conditions on the left, plus the value at the solid knot marking the largest residual, plus a zero first derivative at the solid knot so that it is extremal, plus a zero third derivative to give some semblance of symmetry, for a total of 7 conditions, equal in number to the constants at the six open circles, which are the old knots, plus the constant at the closed circle, which is the new knot located at the largest residual.)

In the region shown

   Poly = Poly_old + S(x),    S(x) = sum_{i=1..7} ci (xi - x)^3    (each term included only where x < xi)

where xi runs in turn over the knot positions x1 through x7.  At x1, to avoid changing Poly for x < x1, the conditions S(x1) = S'(x1) = S''(x1) = S'''(x1) = 0 must be satisfied.  At x4 the conditions are S(x4) = res, S'(x4) = 0, and S'''(x4) = 0.  This implies that 7 knots, or 14 constants, is the minimum number to be fitted at once.

   The proposal is to fit the constants around each introduced knot and also the entire set, 14 constants at a time.  This will make each fit 64 times as fast as the current 56-constant fit, while because of the local, one-dimensional nature of the fit the full accuracy should be maintained.  The method is to begin with the first 4 and last 10 constants, then the last 14, then the last 14 minus the last 4, etc., requiring 14 sets to go through all 56, reducing the apparent speed-up to a factor of 14, then maybe 2-3 iterations to converge, reducing the overall speed-up to a factor of 2.  This will become greater because for most sets the minimum will require only 3-5 ROBMIN steps instead of the current 150.  The overall expectation is a factor of 5-10, which makes the above 20-hour estimate drop to a doable 2-4 hours and will allow us to use up to 90 coefficients and to include all non-photopeak features in the background.
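A sketch of the proposed block-wise minimization is shown below.  It is illustrative only: chi2_min_subset stands for a call into the existing NLFIT machinery that varies just the unmasked constants, the block and step sizes follow the 14-at-a-time proposal, and the particular ordering described above (first 4 plus last 10, then the last 14, and so on) is replaced by a simple sliding window.

    import numpy as np

    def fit_in_blocks(constants, chi2_min_subset, block=14, step=4, sweeps=3):
        """Cycle over overlapping blocks of `block` constants, holding the rest
        fixed.  chi2_min_subset(constants, free) is assumed to minimize the
        chi-square over the constants flagged True in `free` and to return the
        updated full set."""
        constants = np.asarray(constants, dtype=float)
        n = len(constants)
        for _ in range(sweeps):
            for start in range(0, n, step):
                free = np.zeros(n, dtype=bool)
                free[start:start + block] = True            # vary this block only
                constants = np.asarray(chi2_min_subset(constants, free), dtype=float)
        return constants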

   The equations for the constants in terms of the knots at x1,...,x7 (with Alr the value of the largest residual to be matched at the new knot x4) are

0            + 0            + 0            + 0            + c5(x5-x4)^3 + c6(x6-x4)^3 + c7(x7-x4)^3 = Alr
0            + 0            + 0            + 0            + c5(x5-x4)^2 + c6(x6-x4)^2 + c7(x7-x4)^2 = 0
0            + 0            + 0            + 0            + c5          + c6          + c7          = 0
0            + c2(x2-x1)^3  + c3(x3-x1)^3  + c4(x4-x1)^3  + c5(x5-x1)^3 + c6(x6-x1)^3 + c7(x7-x1)^3 = 0
0            + c2(x2-x1)^2  + c3(x3-x1)^2  + c4(x4-x1)^2  + c5(x5-x1)^2 + c6(x6-x1)^2 + c7(x7-x1)^2 = 0
0            + c2(x2-x1)    + c3(x3-x1)    + c4(x4-x1)    + c5(x5-x1)   + c6(x6-x1)   + c7(x7-x1)   = 0
c1           + c2           + c3           + c4           + c5          + c6          + c7          = 0

which is a rather easy set of equations to solve for the c’s using GaussJ.
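For illustration, the same system can be set up and solved with any dense linear solver in place of GaussJ.  The sketch below uses the knot positions quoted further on (with x4 = 1915 the new knot) and an arbitrary stand-in value for Alr.

    import numpy as np

    def bump_coefficients(x, Alr):
        """Solve for c1..c7 in S(u) = sum_i ci (xi - u)^3 (terms with xi > u only),
        subject to the seven conditions listed above."""
        x = np.asarray(x, dtype=float)
        A = np.zeros((7, 7))
        b = np.zeros(7)
        A[0, 4:7] = (x[4:7] - x[3])**3;  b[0] = Alr   # S(x4)   = Alr
        A[1, 4:7] = (x[4:7] - x[3])**2                # S'(x4)  = 0
        A[2, 4:7] = 1.0                               # S'''(x4) = 0
        A[3, 1:7] = (x[1:7] - x[0])**3                # S(x1)   = 0
        A[4, 1:7] = (x[1:7] - x[0])**2                # S'(x1)  = 0
        A[5, 1:7] = (x[1:7] - x[0])                   # S''(x1) = 0
        A[6, :]   = 1.0                               # S'''(x1) = 0
        return np.linalg.solve(A, b)

    # Knot positions quoted below; Alr = 1.0 is an arbitrary stand-in for the residual.
    c = bump_coefficients([1901.7, 1904.53, 1914.9, 1915.0,
                           1916.26, 1958.49, 1961.84], Alr=1.0)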

The largest residual must be looked at carefully.  If the background were a simple spline, the above would simply require adding the polynomial found above to the original one.  The background, however, is frequently B(x) = exp(spline(x)); then at the largest residual we have

exp(S(x)+P(x))-exp(S(x))=res

exp(P(x))-1=res/exp(S(x)) and for small P(x)

P(x)=res/exp(S(x))=res/data(x)

while for all x, exp(P(x))=1+res/exp(S(x)), P(x)=log(1+res/exp(S(x)))
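In code the exponent correction is a one-liner (a sketch; S is the spline exponent and res the residual at the channels of interest).

    import numpy as np

    def exponent_bump(res, S):
        """Correction P satisfying exp(S + P) - exp(S) = res, together with
        its small-P approximation res/exp(S)."""
        P_exact = np.log1p(res / np.exp(S))    # P = log(1 + res/exp(S))
        P_small = res / np.exp(S)              # valid when P is small
        return P_exact, P_small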

The result with the knot locations 1901.7, 1904.53, 1914.9, 1915 (the new knot), 1916.26, 1958.49, 1961.84 is shown above.  A series of tests was made using the exponential of this as the starting guess for the background in this location.  It was found that this caused the constants to move sufficiently far from the minimum chi-square constants that unwanted changes occurred, making it very time consuming to undo these changes in the set of constants.  It was also found that treating the constants 14 at a time is sufficient to bring this kind of form in by simple minimization of the chi-square, without introducing the complications caused by directly introducing the residuals.  The coding to put the bump in is saved as BKGBUMP.FOR.

    The result of fitting with a cutoff of 3 and 90 background constants is shown below.

           1
  -34.6747568573
    .850897469804E-01
   -.292212452983E-04
    .316505842923E-08
   -.159050508918        64.3178669192
    .157487088598        64.3386768227
    .104040637145E-04    152.021768383
   -.130887458740E-04    181.222502149
    .310003678208E-04    237.638085538
   -.254937758461E-04    247.539680436
    .658534851199E-06    331.432447181
   -.146960142872E-06    528.919109122
    .346412503538E-06    601.684672803
    .942382257901E-05    680.466510498
    .361480337194E-03    710.639227104
   -.752824822864E-03    710.542056090
    .135616560370E-02    719.692292898
   -.101863256345E-02    724.014383374
    .445294331757E-04    747.424086309
   -.123671470402E-03    897.709272486
    .203495216746E-03    905.558535094
   -.248671902428E-04    921.953457452
   -.694717441184E-04    923.490295466
    .141116645838E-04    956.685885175
    .496052801281E-06    1546.04001754
   -.134522629247E-04    1652.91433572
   -.843301519471E-04    1655.30294659
    .934062672898E-04    1666.21505546
    .171025883990E-03    1667.79946595
   -.681498243200E-05    1667.99981208
    .455935259702E-04    1667.63756721
   -.870857877052E-05    1672.69091688
   -.433490860243E-04    1674.24101883
   -.500296706837E-04    1674.40315130
   -.754873019281E-04    1674.52365316
   -.453913085719E-04    1674.71189088
    .121362938418E-04    1684.36734832
    .503262110664E-05    1721.79584931
    .102928680927E-04    1826.28218765
   -.302106157150E-04    1844.94687672
    .280595495424E-04    1882.22457411
    .110512310263E-02    1921.43483868
   -.174644916499E-02    1925.39239882
    .252069978321E-03    1930.17190842
    .380972319251E-03    1933.20343047
    .492311925420E-08    2686.02335487
   -.843725461779E-13    84117.1221349

 FIT TO \ROBFIT\TCASES\ZTCASE7.SP     7.5 hrs
 IP=    2
   18 PEAKS 4052 CHAN, CHIS=   2503.     NITB=   90   90
 CUTOFF=   3.00
  1.0732  1.0000  1.4375  1.0000

(columns as in the earlier table: x, ex, w, ew, strength, estrength, followed by the peak type)

    55.023    .165      2.938    .367   478.451       54.48      1
    71.326    .060      1.735    .156   764.420       56.33      1
    73.789    .036      1.664    .073   1224.24       54.12      1
    84.294    .111      1.983    .283   339.463       38.45      1
    87.393   1.356       .845   4.386   103.967       551.4      1
    90.128    .190      1.500    .352   117.753       26.99      1
   205.064    .472      6.809   1.065   388.520       54.63      1
   242.464    .433      2.737    .800   155.028       44.49      2
   526.349    .094      3.022    .229   712.548       46.10      1
   530.370    .378      2.000    .958   77.1283       31.28      1
   913.617    .230      1.438    .951   59.9639       19.39      1
   929.419    .001      1.049    .002   96469.3       350.0      2
  1100.739    .267      1.951    .632   41.7042       11.73      1
  1373.290    .125      2.986    .270   242.183       20.60      1
  1514.755    .274      2.230    .638   58.1154       14.58      1
  1667.731    .415      4.398    .749   312.492       53.23      2
  1905.358    .005      2.640    .009   57437.0       240.0      1
  2841.038    .082      2.994    .163   211.134       13.23      1

\ZTCASE7   2503.07633616
    2
\ROBFIT\TCASES\TCASE7.ST
\ROBFIT\TCASES\TCASE7LE.ST
\
   0
\ROBFIT\TCASES\ZTCASE7.SP
NONE
\ROBFIT\TCASES\TCASE7.CN
\ROBFIT\TCASES\TCASE7.PK
   45 4096
   6.17405063291       .151898734177E-02
   6.17405063291       .151898734177E-02