Notes on Likelihood
Likelihood is a probability, so it ranges between zero and unity. It is easiest to think about it in terms of a simple coin tossing experiment. Suppose we have a coin for which the probability of getting a head in a single toss is p_h. In N tosses of the coin we may actually get N_h heads. The likelihood of this outcome is given by the probability for this outcome to result from N tosses:
L = N!/(N_h!(N-N_h)!) * (p_h)^(N_h) * (1-p_h)^(N-N_h)
(Note: Even for twenty tosses of an unbiased coin, the likelihood of equal heads and tails is less than 20%, although this is the most probable result.)
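This is easy to check numerically. A minimal sketch (plain Python; the numbers simply reproduce the twenty-toss example above):

```python
from math import comb

# Likelihood of N_h heads in N tosses of a coin with head probability p_h,
# i.e. the binomial probability N!/(N_h!(N-N_h)!) * p_h^N_h * (1-p_h)^(N-N_h).
def binomial_likelihood(N, N_h, p_h):
    return comb(N, N_h) * p_h**N_h * (1 - p_h)**(N - N_h)

# Twenty tosses of an unbiased coin: equal heads and tails is the most
# probable single outcome, yet its likelihood is still below 20%.
L_equal = binomial_likelihood(20, 10, 0.5)
print(L_equal)  # about 0.176
```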
With an unbiased coin, we would not normally expect to get 1000 heads and 1 tail from 1001 tosses, although it is possible. We might be tempted to ask whether the coin really was biased. Is there a biased value of p_h which "best" fits the experimental data, for which the given outcome would be considered much more likely? We can maximize L with respect to p_h to find a "best fit" value; setting dL/dp_h = 0 gives p_h = N_h/N.
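A quick numerical check of that best-fit value for the 1001-toss example (a sketch; the grid resolution is an arbitrary choice):

```python
from math import comb

def binomial_likelihood(N, N_h, p_h):
    return comb(N, N_h) * p_h**N_h * (1 - p_h)**(N - N_h)

# 1000 heads from 1001 tosses: scan p_h on a grid and confirm the maximum
# of L sits at the analytic best-fit value p_h = N_h/N.
N, N_h = 1001, 1000
grid = [k / 10000 for k in range(1, 10000)]
p_best = max(grid, key=lambda p: binomial_likelihood(N, N_h, p))
print(p_best, N_h / N)  # both close to 0.999
```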
It is important to realize that even for this value of p_h, the outcome obtained is still not, in general, the most probable. This involves asking: given a fixed value of p_h, for what value of N_h is the outcome most probable? Using the Stirling approximation, so that we can differentiate the factorials, the answer is N_h=N*p_h+(0.5-p_h), which may not be the nearest integer to N*p_h. But because we are near the maximum, such small discrepancies should have only a second-order effect. Below we will ignore them.
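The difference can be seen by brute force; a sketch with illustrative values of N and p_h (not taken from the text):

```python
from math import comb

def binomial_likelihood(N, N_h, p_h):
    return comb(N, N_h) * p_h**N_h * (1 - p_h)**(N - N_h)

# For fixed p_h, find the most probable N_h directly and compare it with
# the continuous estimate N*p_h + (0.5 - p_h) quoted above.
N, p_h = 20, 0.3
mode = max(range(N + 1), key=lambda k: binomial_likelihood(N, k, p_h))
estimate = N * p_h + (0.5 - p_h)
print(mode, estimate)  # the two agree to within one count
```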
ALL of our applications are of the form in which we vary the parameters of the population (the intrinsic properties of a noiseless coin, as characterized by the entire population of all possible outcomes) based on the given sample (experimental data) we have to hand. But instead of the binomial distribution we use the generalized multinomial distribution:
L = N!/(\Pi_i N_i!) * \Pi_i (p_i)^(N_i), subject to \Sigma_i N_i = N.
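A direct transcription of the multinomial likelihood (a sketch; the counts and probabilities below are purely illustrative):

```python
from math import factorial, prod

# Multinomial likelihood of counts N_i given bin probabilities p_i,
# with the constraint Sigma_i N_i = N built in via N = sum(counts).
def multinomial_likelihood(counts, probs):
    N = sum(counts)
    coeff = factorial(N)
    for n in counts:
        coeff //= factorial(n)          # N!/(Pi_i N_i!)
    return coeff * prod(p**n for p, n in zip(probs, counts))

print(multinomial_likelihood([3, 4, 3], [0.3, 0.4, 0.3]))
```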
Because we are not looking directly at the discretized data, we tend to want to use continuous distributions, broken into a finite number of finite-width bins.
The idea of an intrinsic likelihood for the data doesn't make much sense. We know a particular experimental outcome with certainty, so its probability is unity. We are also fairly certain that the results will be different next time. What we would really like to know is something about the population from which all such experimental outcomes might be drawn. By varying the parameters of a fit we are varying the population against which we are comparing a particular sample. For finite N and fixed p_i, there is only a finite number of possible values for L, and one (or several) of them is the maximum. For a given fit, the likelihood ratio we would be interested in compares our value with that maximum.
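For the binomial case this finite set of likelihood values, and the ratio of each to the maximum, can be tabulated directly (a sketch with illustrative N and p_h):

```python
from math import comb

def binomial_likelihood(N, N_h, p_h):
    return comb(N, N_h) * p_h**N_h * (1 - p_h)**(N - N_h)

# For fixed N and p_h there are only N+1 possible outcomes, hence a finite
# set of likelihood values; compare each outcome against the maximum.
N, p_h = 20, 0.5
L = [binomial_likelihood(N, k, p_h) for k in range(N + 1)]
L_max = max(L)
ratios = [Lk / L_max for Lk in L]
print(ratios)  # unity at the most probable outcome, smaller everywhere else
```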
I believe there are frequencies at which our data is well represented by a Gaussian. At some frequency, let N_i be the number of samples in a bin centered at x_i. Then a simple (statistically biased) approximation to the distribution would give \mu and \sigma by:
\mu=(\Sigma_i N_i*x_i)/N,
\sigma^2=(\Sigma_i N_i*(x_i-\mu)^2)/N.
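In code, with illustrative bin centres and counts (not real data), these biased estimators are:

```python
# Biased (maximum-likelihood style) estimates of mu and sigma^2 from binned
# data: counts N_i in bins centred at x_i. All numbers are illustrative.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # bin centres x_i
counts = [2, 24, 45, 25, 4]       # counts N_i
N = sum(counts)
mu = sum(n * xi for n, xi in zip(counts, x)) / N
sigma2 = sum(n * (xi - mu)**2 for n, xi in zip(counts, x)) / N
print(mu, sigma2)
```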
Now we can ask, for (these) fixed \mu and \sigma, what set of n_i's leads to the most probable outcome of an experiment of this kind? We would expect something like n_i=N*p_i, since that is what our fit actually gives, where now the p_i's are calculated from these fixed \mu and \sigma. It is clear that if our experimental data N_i actually corresponded to these n_i, then we would end up with the same \mu and \sigma. It is also clear that the ratio of the likelihood for such an experimental outcome to that for our presumed maximum (at n_i=N*p_i) must be unity. If our presumption about the likelihood maximum is correct, every other possible experimental outcome must give a likelihood ratio less than unity, while ratios close to unity should correspond to fits which can be classified as "good". Our data must show this.
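One way to probe this numerically is to compare the multinomial likelihood of the observed counts against that of the presumed maximum n_i = N*p_i. Since the n_i are generally not integers, the sketch below extends the factorials with the gamma function (via lgamma); the bin edges, counts, and Gaussian parameters are illustrative assumptions, not data from the text:

```python
from math import erf, exp, lgamma, log, sqrt

# Log of the multinomial likelihood, written with lgamma so that the
# presumed maximum n_i = N*p_i (non-integer counts) can be evaluated too.
def log_multinomial(counts, probs):
    N = sum(counts)
    out = lgamma(N + 1)
    for n, p in zip(counts, probs):
        out += n * log(p) - lgamma(n + 1)
    return out

# Probability mass p_i of a Gaussian(mu, sigma) in each bin, via erf,
# renormalized over the finite range covered by the bins.
def gaussian_bin_probs(edges, mu, sigma):
    cdf = [0.5 * (1 + erf((e - mu) / (sigma * sqrt(2)))) for e in edges]
    p = [b - a for a, b in zip(cdf, cdf[1:])]
    total = sum(p)
    return [pi / total for pi in p]

edges = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # illustrative bin edges
counts = [2, 24, 45, 25, 4]                # observed counts N_i
N = sum(counts)
p = gaussian_bin_probs(edges, 0.05, sqrt(0.7275))  # mu, sigma from the fit
n_max = [N * pi for pi in p]               # presumed most probable counts
ratio = exp(log_multinomial(counts, p) - log_multinomial(n_max, p))
print(ratio)  # the likelihood ratio L(N_i)/L(n_i)
```

If n_i = N*p_i really is the maximum, this ratio should not exceed unity for any admissible set of counts.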
With reference to your query, note that since N_i is not n_i in general, the most likely distribution, with fit parameters estimated from the data in a given sample, is not the one which is actually observed in that sample. Does your program fail even for "perfect" data?