Created
July 15, 2012 13:25
-
-
Save larsmans/3116927 to your computer and use it in GitHub Desktop.
Hellinger distance for discrete probability distributions in Python
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
Three ways of computing the Hellinger distance between two discrete | |
probability distributions using NumPy and SciPy. | |
""" | |
import numpy as np | |
from scipy.linalg import norm | |
from scipy.spatial.distance import euclidean | |
_SQRT2 = np.sqrt(2) # sqrt(2) with default precision np.float64 | |
def hellinger1(p, q): | |
return norm(np.sqrt(p) - np.sqrt(q)) / _SQRT2 | |
def hellinger2(p, q): | |
return euclidean(np.sqrt(p), np.sqrt(q)) / _SQRT2 | |
def hellinger3(p, q): | |
return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / _SQRT2 |
In case anyone is interested, I've implemented Hellinger Distance in Cython as a split criterion for sklearn DecisionTreeClassifier and RandomForestClassifier.
It performs great in my use cases of imbalanced data classification, beats RandomForestClassifier with gini and XGBClassifier.
You are welcome to check it out on https://github.com/EvgeniDubov/hellinger-distance-criterion
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In case anyone is wondering, I believe
hellinger2
andhellinger3
are faster thanhellinger1
. (I had been usinghellinger1
in one of my projects until some profiling determined it was a rate-limiting step.) Here is some timing code:Should get something like:
The difference shrinks for shorter arrays
p
andq
, but even ifrepeat=1
so thatp
andq
are of length 7,hellinger3
is still faster.Hope this is right.