[37] | 1 | function [density,test] = hhist(data, resolution) |
---|
| 2 | % HHIST Locally-adaptive unbiased density estimate. |
---|
| 3 | % |
---|
| 4 | % hhist(data) plots a density estimate of p(data). |
---|
| 5 | % hhist(data,resolution) increases the resolution of the plotted curve. |
---|
| 6 | % density = hhist(data,resolution) returns the density estimate instead of |
---|
| 7 | % plotting it. length(density) == resolution. |
---|
| 8 | % [density,test] = hhist(data,resolution) also returns the locations at which |
---|
| 9 | % the density was evaluated. |
---|
| 10 | % |
---|
| 11 | % The algorithm is linear interpolation of the empirical cumulative |
---|
| 12 | % distribution. Reference: |
---|
| 13 | % Bruce M. Hill, "Posterior Distribution of Percentiles: Bayes' |
---|
| 14 | % Theorem for Sampling from a Population", Journal of the American |
---|
| 15 | % Statistical Association, Vol. 63, No. 322. (Jun., 1968), |
---|
| 16 | % pp. 677-691. |
---|
| 17 | % http://links.jstor.org/sici?sici=0162-1459%28196806%2963%3A322%3C677%3APDOPBT%3E2.0.CO%3B2-O |
---|
| 18 | % |
---|
| 19 | % See test_hhist.m for a demonstration. |
---|
| 20 | |
---|
| 21 | % Written by Tom Minka |
---|
| 22 | |
---|
| 23 | n = length(data); |
---|
| 24 | data = data(:); |
---|
| 25 | data = sort(data); |
---|
| 26 | % find the number of occurrences of each point. |
---|
| 27 | % keep only the first two occurrences, and remember the total count. |
---|
| 28 | d = [0; data(1:end-1)==data(2:end)]; |
---|
| 29 | dd = diff([d; 0]); |
---|
| 30 | dd2 = [0; dd(1:end-1)]; |
---|
| 31 | firstdup = (dd > 0); |
---|
| 32 | seconddup = (dd2 > 0); |
---|
| 33 | lastdup = (dd < 0); |
---|
| 34 | nodup = (dd == 0 & d == 0); |
---|
| 35 | % set the second occurrence to be the next larger double-precision number. |
---|
| 36 | data(seconddup) = data(seconddup) + eps(data(seconddup)); |
---|
| 37 | dstart = find(firstdup | seconddup | nodup); |
---|
| 38 | dend = find(firstdup | lastdup | nodup); |
---|
| 39 | data = data(dstart); |
---|
| 40 | % count(i) is the number of occurrences of data(i) |
---|
| 41 | count = dend-dstart+1; |
---|
| 42 | |
---|
| 43 | if nargin < 2 |
---|
| 44 | resolution = 1000; |
---|
| 45 | end |
---|
| 46 | delta = 0.05*range(data); |
---|
| 47 | if delta == 0 |
---|
| 48 | delta = 1; |
---|
| 49 | end |
---|
| 50 | mind = min(data)-delta; |
---|
| 51 | maxd = max(data)+delta; |
---|
| 52 | if length(resolution) == 1 |
---|
| 53 | test = linspace(mind,maxd,resolution); |
---|
| 54 | else |
---|
| 55 | test = resolution; |
---|
| 56 | mind = min(mind, min(test)); |
---|
| 57 | maxd = max(maxd, max(test)); |
---|
| 58 | end |
---|
| 59 | |
---|
| 60 | % A tricky aspect of this algorithm is that it is not enough to simply |
---|
| 61 | % evaluate the density at the test locations. If you do that, you may miss |
---|
| 62 | % sharp spikes. Instead, you must compute the average density around each |
---|
| 63 | % test location. |
---|
| 64 | |
---|
| 65 | if 0 |
---|
| 66 | % Algorithm to evaluate the density at test locations only: |
---|
| 67 | % 1. Compute the slope of the cdf between consecutive input points. |
---|
| 68 | % For the last point, the slope is zero. |
---|
| 69 | % 2. For each test location, find the nearest input point which is smaller. |
---|
| 70 | % This is cleverly done via the interp1 function. |
---|
| 71 | % 3. Use the previously computed slope as the density at that test location. |
---|
| 72 | |
---|
| 73 | % g(i) = slope of the cdf between data(i) and data(i+1) |
---|
| 74 | g = [0; 1./diff(data); 0; 0]; |
---|
| 75 | % extend data so that interp1 doesn't fail |
---|
| 76 | % note mind < min(data), maxd > max(data) |
---|
| 77 | datax = [mind; data; maxd]; |
---|
| 78 | index = [0; cumsum(count); 0]; |
---|
| 79 | test_index = floor(interp1(datax,index,test)); |
---|
| 80 | density = g(test_index)/(length(data)+1); |
---|
| 81 | end |
---|
| 82 | |
---|
| 83 | % Algorithm to compute the average density around each test location: |
---|
| 84 | % 1. Find the borders of the nearest-neighbor cell for each test location. |
---|
| 85 | % 2. Evaluate the cdf at the border locations. |
---|
| 86 | % 3. Use the cdf slope between borders as the density for the test location. |
---|
| 87 | test = test(:); |
---|
| 88 | border = [mind; (test(1:end-1) + test(2:end))/2; maxd]; |
---|
| 89 | % extend data so that interp1 doesn't fail |
---|
| 90 | % note mind < min(data), maxd > max(data) |
---|
| 91 | datax = [mind; data; maxd]; |
---|
| 92 | index = [0; cumsum(count); 0]; |
---|
| 93 | border_cdf = interp1(datax,index,border); |
---|
| 94 | % this prevents the estimate from extending beyond the data limits. |
---|
| 95 | border_cdf(border<min(data)) = 1; |
---|
| 96 | border_cdf(border>max(data)) = n; |
---|
| 97 | plot(border,border_cdf) |
---|
| 98 | density = diff(border_cdf)./diff(border)/(n-1); |
---|
| 99 | |
---|
| 100 | if nargout == 0 |
---|
| 101 | plot(test, density) |
---|
| 102 | axis_pct |
---|
| 103 | clear density |
---|
| 104 | end |
---|