You're on the right path, and your intuition about the peaks between 5–10 and 35–40 is spot on. I ran your dataset through a KDE using scipy.stats.gaussian_kde, and it works well with a tighter bandwidth than the default.
Here's the idea:
Use gaussian_kde to estimate the density.
Then use scipy.signal.find_peaks to detect local maxima in that smooth curve.
Sort the detected peaks by height and keep the tallest ones.
I'm using the following code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks
# Data
a = np.array([68, 63, 20, 55, 1, 21, 55, 58, 14, 4, 40, 54, 33, 71, 36, 38, 9, 51, 89, 40, 13, 98, 46, 12, 21, 26, 40, 59, 17, 0, 5, 25, 19, 49, 91, 55, 39, 82, 57, 28, 54, 58, 65, 2, 39, 42, 65, 1, 93, 8, 26, 69, 88, 32, 15, 10, 95, 11, 2, 44, 66, 98, 18, 21, 25, 17, 41, 74, 12, 4, 33, 93, 65, 33, 25, 76, 84, 1, 63, 74, 3, 39, 9, 40, 7, 81, 55, 78, 7, 5, 99, 37, 7, 82, 54, 16, 22, 24, 23, 3])
# Fit KDE using scipy
kde = gaussian_kde(a, bw_method=0.2)
x = np.linspace(0, 100, 1000)
y = kde(x)
# Find all peaks
peaks, properties = find_peaks(y, prominence=0.0005) # Adjust as needed
# Sort peaks by height (y value)
top_two_indices = peaks[np.argsort(y[peaks])[-2:]]
top_two_indices = top_two_indices[np.argsort(x[top_two_indices])] # left to right
# Plot
plt.figure(figsize=(14, 7))
plt.plot(x, y, label='KDE', color='steelblue')
plt.fill_between(x, y, alpha=0.3)
# Annotate top 2 peaks
for i, peak in enumerate(top_two_indices, 1):
    plt.plot(x[peak], y[peak], 'ro')
    plt.text(x[peak], y[peak] + 0.0005,
             f'Peak {i}\n({x[peak]:.1f}, {y[peak]:.3f})',
             ha='center', color='red')
plt.title("Top 2 Peaks in KDE")
plt.xlabel("a")
plt.ylabel("Density")
plt.xticks(np.arange(0, 101, 5))
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
This displays the KDE curve with the two tallest peaks marked in red.
Prominence matters: I used prominence=0.0005 in find_peaks(). This helps ignore tiny local bumps and focus on meaningful peaks. You can tweak it if your data changes.
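To get a feel for the prominence threshold, here is a minimal sketch that counts the detected peaks at several thresholds. The data is a synthetic two-cluster stand-in (an assumption, not the dataset from the question), and the threshold values and bw_method=0.1 are arbitrary choices for illustration. It also shows ranking peaks by the prominences that find_peaks reports, as an alternative to ranking by height.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

# Synthetic stand-in data (assumption: two clusters near 7 and 38)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(7, 3, 100), rng.normal(38, 3, 100)])

# Deliberately narrow bandwidth factor so that small bumps can appear
kde = gaussian_kde(data, bw_method=0.1)
x = np.linspace(data.min() - 5, data.max() + 5, 1000)
y = kde(x)

# Count how many peaks survive each prominence threshold
thresholds = [0.0, 0.0005, 0.005, 0.02]
counts = []
for p in thresholds:
    peaks, props = find_peaks(y, prominence=p)
    counts.append(len(peaks))
    print(f"prominence >= {p}: {len(peaks)} peak(s)")

# Alternative ranking: keep the two most prominent peaks instead of the two tallest
peaks, props = find_peaks(y, prominence=0.0)
top_two = peaks[np.argsort(props["prominences"])[-2:]]
top_two = np.sort(top_two)  # left to right
print("Most prominent peaks at x =", x[top_two])
```

Raising the threshold can only remove peaks, so the counts never increase from one threshold to the next. Ranking by prominence rather than height can matter when a tall peak sits on the shoulder of an even taller one.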
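Bandwidth is everything: The choice of bandwidth (bw_method=0.2 in this case) controls the smoothness of the KDE. In scipy, a scalar bw_method is a factor that multiplies the sample standard deviation, not a width in data units. If it's too high, peaks will be smoothed out; too low, and you'll get noisy fluctuations.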
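As a rough illustration of the bandwidth effect, the sketch below fits the KDE with several bw_method values and counts the local maxima for each. It again uses synthetic two-cluster data (an assumption, not the question's dataset), and the list of factors is arbitrary. It also prints the effective kernel width in data units, which for a 1-D gaussian_kde is the factor times the sample standard deviation.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

# Synthetic stand-in data (assumption: two clusters near 7 and 38)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(7, 3, 100), rng.normal(38, 3, 100)])
x = np.linspace(data.min() - 10, data.max() + 10, 2000)

# The default bandwidth factor is Scott's rule: n ** (-1/5) for 1-D data
default_factor = gaussian_kde(data).factor
print(f"default (Scott) factor: {default_factor:.3f}")

peak_counts = {}
for bw in [0.05, 0.2, 0.4, 1.0]:
    kde = gaussian_kde(data, bw_method=bw)
    # Effective kernel standard deviation in data units
    width = kde.factor * data.std(ddof=1)
    peaks, _ = find_peaks(kde(x))
    peak_counts[bw] = len(peaks)
    print(f"bw_method={bw}: kernel width ~{width:.2f}, {len(peaks)} peak(s)")
```

Small factors tend to produce many spurious maxima, while large ones can merge genuine modes, so it is worth checking a few values before settling on one.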
Automatic bandwidth selection: If you don't want to hard-code bw_method, you can select the bandwidth automatically using cross-validation. sklearn.model_selection.GridSearchCV with KernelDensity from sklearn.neighbors lets you fit multiple models with different bandwidths and choose the one with the highest held-out log-likelihood.
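A minimal sketch of that cross-validation approach follows, assuming scikit-learn is installed. The synthetic data, the candidate grid of bandwidths, and the 5-fold shuffled split are all illustrative assumptions. Note that KernelDensity takes its bandwidth in data units, so the sketch divides by the sample standard deviation to get the equivalent scalar bw_method for gaussian_kde.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KernelDensity

# Synthetic stand-in data (assumption: two clusters near 7 and 38)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(7, 3, 100), rng.normal(38, 3, 100)])
X = data.reshape(-1, 1)  # scikit-learn expects a 2-D array

# Candidate bandwidths in data units (arbitrary grid for illustration)
param_grid = {"bandwidth": np.linspace(0.5, 10.0, 20)}

# KernelDensity.score returns the total log-likelihood of held-out samples,
# which GridSearchCV maximises by default
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X)
best_bw = search.best_params_["bandwidth"]
print(f"selected bandwidth: {best_bw:.2f}")

# Evaluate the selected model; score_samples returns log-density
x = np.linspace(-10, 60, 1000)
y_sklearn = np.exp(search.best_estimator_.score_samples(x.reshape(-1, 1)))

# Equivalent scipy KDE: convert the absolute bandwidth to a factor
scipy_factor = best_bw / data.std(ddof=1)
y_scipy = gaussian_kde(data, bw_method=scipy_factor)(x)
```

The resulting curve can be passed to find_peaks exactly as before. The likelihood-based choice is a reasonable default, but it optimises overall fit rather than mode detection, so it is still worth looking at the plot.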
For this particular dataset, though, manually setting bw_method=0.2 works well and reveals the two main peaks you're after (one around 7, the other near 38). For production-level or general-purpose analysis, automatic bandwidth selection via cross-validation can make your approach more adaptive and robust.