1 | function [density,test] = hhist(data, resolution) |
---|
2 | % HHIST Locally-adaptive unbiased density estimate. |
---|
3 | % |
---|
4 | % hhist(data) plots a density estimate of p(data). |
---|
5 | % hhist(data,resolution) increases the resolution of the plotted curve. |
---|
6 | % density = hhist(data,resolution) returns the density estimate instead of |
---|
7 | % plotting it. length(density) == resolution. |
---|
8 | % [density,test] = hhist(data,resolution) also returns the locations at which |
---|
9 | % the density was evaluated. |
---|
10 | % |
---|
11 | % The algorithm is linear interpolation of the empirical cumulative |
---|
12 | % distribution. Reference: |
---|
13 | % Bruce M. Hill, "Posterior Distribution of Percentiles: Bayes' |
---|
14 | % Theorem for Sampling from a Population", Journal of the American |
---|
15 | % Statistical Association, Vol. 63, No. 322. (Jun., 1968), |
---|
16 | % pp. 677-691. |
---|
17 | % http://links.jstor.org/sici?sici=0162-1459%28196806%2963%3A322%3C677%3APDOPBT%3E2.0.CO%3B2-O |
---|
18 | % |
---|
19 | % See test_hhist.m for a demonstration. |
---|
20 | |
---|
21 | % Written by Tom Minka |
---|
22 | |
---|
23 | n = length(data); |
---|
24 | data = data(:); |
---|
25 | data = sort(data); |
---|
26 | % find the number of occurrences of each point. |
---|
27 | % keep only the first two occurrences, and remember the total count. |
---|
28 | d = [0; data(1:end-1)==data(2:end)]; |
---|
29 | dd = diff([d; 0]); |
---|
30 | dd2 = [0; dd(1:end-1)]; |
---|
31 | firstdup = (dd > 0); |
---|
32 | seconddup = (dd2 > 0); |
---|
33 | lastdup = (dd < 0); |
---|
34 | nodup = (dd == 0 & d == 0); |
---|
35 | % set the second occurrence to be the next larger double-precision number. |
---|
36 | data(seconddup) = data(seconddup) + eps(data(seconddup)); |
---|
37 | dstart = find(firstdup | seconddup | nodup); |
---|
38 | dend = find(firstdup | lastdup | nodup); |
---|
39 | data = data(dstart); |
---|
40 | % count(i) is the number of occurrences of data(i) |
---|
41 | count = dend-dstart+1; |
---|
42 | |
---|
43 | if nargin < 2 |
---|
44 | resolution = 1000; |
---|
45 | end |
---|
46 | delta = 0.05*range(data); |
---|
47 | if delta == 0 |
---|
48 | delta = 1; |
---|
49 | end |
---|
50 | mind = min(data)-delta; |
---|
51 | maxd = max(data)+delta; |
---|
52 | if length(resolution) == 1 |
---|
53 | test = linspace(mind,maxd,resolution); |
---|
54 | else |
---|
55 | test = resolution; |
---|
56 | mind = min(mind, min(test)); |
---|
57 | maxd = max(maxd, max(test)); |
---|
58 | end |
---|
59 | |
---|
60 | % A tricky aspect of this algorithm is that it is not enough to simply |
---|
61 | % evaluate the density at the test locations. If you do that, you may miss |
---|
62 | % sharp spikes. Instead, you must compute the average density around each |
---|
63 | % test location. |
---|
64 | |
---|
65 | if 0 |
---|
66 | % Algorithm to evaluate the density at test locations only: |
---|
67 | % 1. Compute the slope of the cdf between consecutive input points. |
---|
68 | % For the last point, the slope is zero. |
---|
69 | % 2. For each test location, find the nearest input point which is smaller. |
---|
70 | % This is cleverly done via the interp1 function. |
---|
71 | % 3. Use the previously computed slope as the density at that test location. |
---|
72 | |
---|
73 | % g(i) = slope of the cdf between data(i) and data(i+1) |
---|
74 | g = [0; 1./diff(data); 0; 0]; |
---|
75 | % extend data so that interp1 doesn't fail |
---|
76 | % note mind < min(data), maxd > max(data) |
---|
77 | datax = [mind; data; maxd]; |
---|
78 | index = [0; cumsum(count); 0]; |
---|
79 | test_index = floor(interp1(datax,index,test)); |
---|
80 | density = g(test_index)/(length(data)+1); |
---|
81 | end |
---|
82 | |
---|
83 | % Algorithm to compute the average density around each test location: |
---|
84 | % 1. Find the borders of the nearest-neighbor cell for each test location. |
---|
85 | % 2. Evaluate the cdf at the border locations. |
---|
86 | % 3. Use the cdf slope between borders as the density for the test location. |
---|
87 | test = test(:); |
---|
88 | border = [mind; (test(1:end-1) + test(2:end))/2; maxd]; |
---|
89 | % extend data so that interp1 doesn't fail |
---|
90 | % note mind < min(data), maxd > max(data) |
---|
91 | datax = [mind; data; maxd]; |
---|
92 | index = [0; cumsum(count); 0]; |
---|
93 | border_cdf = interp1(datax,index,border); |
---|
94 | % this prevents the estimate from extending beyond the data limits. |
---|
95 | border_cdf(border<min(data)) = 1; |
---|
96 | border_cdf(border>max(data)) = n; |
---|
97 | plot(border,border_cdf) |
---|
98 | density = diff(border_cdf)./diff(border)/(n-1); |
---|
99 | |
---|
100 | if nargout == 0 |
---|
101 | plot(test, density) |
---|
102 | axis_pct |
---|
103 | clear density |
---|
104 | end |
---|