New in version 1.0.
For an array $y$ of $n$ values, a map classifier places each value $y_i$ into one of $k$ mutually exclusive and exhaustive classes. Each classifier defines the classes based on different criteria, but in all cases the following holds for the classifiers in PySAL:

$$C_j^l < y_i \le C_j^u \quad \forall i \in C_j$$

where $C_j$ denotes class $j$, which has lower bound $C_j^l$ and upper bound $C_j^u$.
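The half-open interval rule (lower bound excluded, upper bound included) can be illustrated with a short standard-library sketch. The helper name `assign_classes` and the bounds are hypothetical, and the sketch assumes sorted upper bounds with no value above the last one; it is not the PySAL implementation.

```python
from bisect import bisect_left

def assign_classes(values, upper_bounds):
    """Return the class index for each value, where class j covers
    (upper_bounds[j-1], upper_bounds[j]] and class 0 has no lower bound."""
    # bisect_left gives the first index j with upper_bounds[j] >= v,
    # i.e. the class whose upper bound closes the interval containing v.
    return [bisect_left(upper_bounds, v) for v in values]

bounds = [10.0, 20.0, 30.0]  # hypothetical upper bounds, sorted ascending
ids = assign_classes([5.0, 10.0, 10.5, 30.0], bounds)
print(ids)  # [0, 0, 1, 2]
```

A value equal to an upper bound (10.0 here) stays in the lower class, which is what makes the classes mutually exclusive.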
In addition to the classifiers, there are several utility functions that can be used to evaluate the properties of a specific classifier for different parameter values, or for automatic selection of a classifier and number of classes.
Slocum, T.A., R.B. McMaster, F.C. Kessler and H.H. Howard (2009) Thematic Cartography and Geovisualization. Pearson Prentice Hall, Upper Saddle River.
A module of classification schemes for choropleth mapping.
Calculates the quantiles for an array
| Parameters: | y : array (n,1); k : int |
|---|---|
| Returns: | implicit : array (n,1) |
Examples
>>> x=np.arange(1000)
>>> quantile(x)
array([ 249.75, 499.5 , 749.25, 999. ])
>>> quantile(x,k=3)
array([ 333., 666., 999.])
>>>
Note that if there are enough ties that the quantile values repeat, we collapse to pseudo quantiles, in which case the number of classes will be less than k.
>>> x=[1.0]*100
>>> x.extend([3.0]*40)
>>> len(x)
140
>>> y=np.array(x)
>>> quantile(y)
array([ 1., 3.])
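The collapse to pseudo quantiles can be sketched without NumPy. This is an illustration only: the helper name `quantile_breaks` is hypothetical, and the linear-interpolation rule is an assumption chosen because it reproduces the values shown in the examples above, not a description of the library's internals.

```python
def quantile_breaks(values, k=4):
    """Upper bounds of k quantile classes, with repeated break values collapsed."""
    ys = sorted(values)
    n = len(ys)
    breaks = []
    for j in range(1, k + 1):
        pos = (j / k) * (n - 1)      # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        value = ys[lo] + (pos - lo) * (ys[hi] - ys[lo])  # linear interpolation
        if not breaks or value != breaks[-1]:
            breaks.append(value)     # keep only distinct break values
    return breaks

tied = [1.0] * 100 + [3.0] * 40
print(quantile_breaks(tied))  # [1.0, 3.0]
```

With k=4 requested, only two distinct breaks survive, so the data end up in two classes.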
Abstract class for all map classifications
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
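The two fit measures can be sketched as follows. The sketch takes the descriptions above literally and measures deviations from each class mean; whether the library uses the mean or another centre internally is not confirmed here, and the helper name `class_fit_measures` is hypothetical.

```python
from collections import defaultdict

def class_fit_measures(values, class_ids):
    """Return (adcm, tss): summed absolute and squared deviations of each
    observation from the mean of its class (assumed centre)."""
    groups = defaultdict(list)
    for v, c in zip(values, class_ids):
        groups[c].append(v)
    adcm = tss = 0.0
    for members in groups.values():
        centre = sum(members) / len(members)
        adcm += sum(abs(v - centre) for v in members)
        tss += sum((v - centre) ** 2 for v in members)
    return adcm, tss

adcm, tss = class_fit_measures([1.0, 3.0, 10.0, 14.0], [0, 0, 1, 1])
print(adcm, tss)  # 6.0 10.0
```

Smaller values indicate classes whose members sit closer to their class centre.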
Box_Plot Map Classification
| Parameters: | y : array; hinge : float |
|---|---|
Notes
The bins are set as follows:
bins[0] = q[0]-hinge*IQR
bins[1] = q[0]
bins[2] = q[1]
bins[3] = q[2]
bins[4] = q[2]+hinge*IQR
bins[5] = inf (see Notes)
where q is an array of the first three quartiles of y and IQR=q[2]-q[0]
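The bin rule above can be sketched with the standard library. This is a minimal illustration, not the Box_Plot implementation: the helper name `box_plot_bins` and the default `hinge=1.5` are assumptions, and `statistics.quantiles` with the inclusive method is only one of several quartile conventions, so its values need not match library output.

```python
import statistics

def box_plot_bins(values, hinge=1.5):
    """Upper bounds following the listed rule; the last bound is inf."""
    # Three quartile cut points: q[0], q[1], q[2]
    q = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q[2] - q[0]
    return [q[0] - hinge * iqr, q[0], q[1], q[2], q[2] + hinge * iqr, float("inf")]

bp_bins = box_plot_bins([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(bp_bins)  # [-3.0, 3.0, 5.0, 7.0, 13.0, inf]
```

Values at or below the first bound are low outliers, and values above the fifth bound are high outliers.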
Examples
>>> cal=load_example()
>>> bp=Box_Plot(cal)
>>> bp.bins
array([ -7.24325000e+01, 2.56750000e+00, 9.36500000e+00,
3.95300000e+01, 1.14530000e+02, 4.11145000e+03])
>>> bp.counts
array([ 0, 15, 14, 14, 7, 8])
>>> bp.high_outlier_ids
array([ 0, 6, 18, 29, 33, 37, 40, 42])
>>> cal[bp.high_outlier_ids]
array([ 329.92, 181.27, 370.5 , 722.85, 192.05, 4111.45,
317.11, 264.93])
>>> bx=Box_Plot(np.arange(100))
>>> bx.bins
array([ -50.25, 24.75, 49.5 , 74.25, 149.25])
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class (monotonic) |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
| low_outlier_ids | array | indices of observations that are low outliers |
| high_outlier_ids | array | indices of observations that are high outliers |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Equal Interval Classification
| Parameters: | y : array (n,1); k : int |
|---|---|
Notes
Intervals are defined to have equal width:

$$bins_j = \min(y) + w\,(j+1), \quad j = 0, \dots, k-1$$

with $w = \frac{\max(y) - \min(y)}{k}$.
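Equal-width upper bounds can be sketched in a few lines. The sketch assumes the upper bound of class j is min(y) + w(j+1) with w = (max(y) - min(y))/k, which is consistent with the example output in this section; the helper name `equal_interval_bins` is hypothetical.

```python
def equal_interval_bins(values, k=5):
    """Upper bounds of k equal-width classes spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k                  # common class width
    bins = [lo + w * (j + 1) for j in range(k)]
    bins[-1] = hi                      # guard against floating-point drift
    return bins

ei_bins = equal_interval_bins([0.0, 4.0, 10.0], k=5)
print(ei_bins)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Because the width depends only on the extremes, a single large outlier can leave most classes empty, as the counts in the example for this classifier show.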
Examples
>>> cal=load_example()
>>> ei=Equal_Interval(cal,k=5)
>>> ei.k
5
>>> ei.counts
array([57, 0, 0, 0, 1])
>>> ei.bins
array([ 822.394, 1644.658, 2466.922, 3289.186, 4111.45 ])
>>>
Attributes
| yb | array (n,1) | bin ids for observations; each value is the id of the class the observation belongs to: yb[i] = j for j>=1 if bins[j-1] < y[i] <= bins[j], and yb[i] = 0 otherwise |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Fisher Jenks optimal classifier
| Parameters: | y : array (n,1); k : int |
|---|---|
Examples
>>> cal=load_example()
>>> fj=Fisher_Jenks(cal)
>>> fj.adcm
832.8900000000001
>>> fj.bins
[110.73999999999999, 192.05000000000001, 370.5, 722.85000000000002, 4111.4499999999998]
>>> fj.counts
array([50, 2, 4, 1, 1])
>>>
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Jenks Caspall Map Classification
| Parameters: | y : array (n,1); k : int |
|---|---|
Examples
>>> cal=load_example()
>>> jc=Jenks_Caspall(cal,k=5)
>>> jc.bins
array([[ 1.81000000e+00],
[ 7.60000000e+00],
[ 2.98200000e+01],
[ 1.81270000e+02],
[ 4.11145000e+03]])
>>> jc.counts
array([14, 13, 14, 10, 7])
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Jenks Caspall Map Classification with forced movements
| Parameters: | y : array (n,1); k : int |
|---|---|
Examples
>>> cal=load_example()
>>> jcf=Jenks_Caspall_Forced(cal,k=5)
>>> jcf.k
5
>>> jcf.bins
array([[ 1.34000000e+00],
[ 5.90000000e+00],
[ 1.67000000e+01],
[ 5.06500000e+01],
[ 4.11145000e+03]])
>>> jcf.counts
array([12, 12, 13, 9, 12])
>>> jcf4=Jenks_Caspall_Forced(cal,k=4)
>>> jcf4.k
4
>>> jcf4.bins
array([[ 2.51000000e+00],
[ 8.70000000e+00],
[ 3.66800000e+01],
[ 4.11145000e+03]])
>>> jcf4.counts
array([15, 14, 14, 15])
>>>
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Jenks Caspall Map Classification using a random sample
| Parameters: | y : array (n,1); k : int; pct : float |
|---|---|
Notes
This is intended for large n problems. The logic is to apply Jenks_Caspall to a random subset of the y space and then bin the complete vector y on the bins obtained from the subset. This would trade off some “accuracy” for a gain in speed.
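The sample-then-bin logic can be sketched with the standard library. To keep the example short, a simple equal-count break finder stands in for Jenks_Caspall, so this illustrates only the sampling strategy. The helper names and the defaults `pct=0.10` and `seed=0` are assumptions.

```python
import random
from bisect import bisect_left

def quantile_breaks(sorted_values, k):
    """Stand-in break finder: upper bounds of k roughly equal-count classes."""
    n = len(sorted_values)
    return [sorted_values[max(0, (j * n) // k - 1)] for j in range(1, k + 1)]

def sampled_bins(values, k=5, pct=0.10, seed=0):
    """Find breaks on a random subset, then stretch the last bin to max(values)."""
    rng = random.Random(seed)
    m = min(len(values), max(k, int(len(values) * pct)))  # sample size
    sample = sorted(rng.sample(values, m))
    bins = quantile_breaks(sample, k)
    bins[-1] = max(values)  # the sample may miss the largest observation
    return bins

def class_counts(values, bins):
    """Count observations per class, using lower < y <= upper intervals."""
    counts = [0] * len(bins)
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts

rng = random.Random(42)
data = [rng.random() for _ in range(10000)]
bins = sampled_bins(data)
counts = class_counts(data, bins)
print(len(bins), sum(counts))  # 5 10000
```

Only the break search runs on the subset; every observation in the full vector is still assigned to a class.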
Examples
>>> cal=load_example()
>>> x=np.random.random(100000)
>>> jc=Jenks_Caspall(x)
>>> jcs=Jenks_Caspall_Sampled(x)
>>> jc.bins
array([[ 0.19770952],
[ 0.39695769],
[ 0.59588617],
[ 0.79716865],
[ 0.99999425]])
>>> jcs.bins
array([[ 0.18877882],
[ 0.39341638],
[ 0.6028286 ],
[ 0.80070925],
[ 0.99999425]])
>>> jc.counts
array([19804, 20005, 19925, 20178, 20088])
>>> jcs.counts
array([18922, 20521, 20980, 19826, 19751])
>>>
# not for testing since we get different times on different hardware
# just included for documentation of likely speed gains
#>>> t1=time.time();jc=Jenks_Caspall(x);t2=time.time()
#>>> t1s=time.time();jcs=Jenks_Caspall_Sampled(x);t2s=time.time()
#>>> t2-t1;t2s-t1s
#1.8292930126190186
#0.061631917953491211
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Max_P Map Classification
Based on Max_p regionalization algorithm
| Parameters: | y : array (n,1); k : int; initial : int |
|---|---|
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Maximum Breaks Map Classification
| Parameters: | y : array (n,1); k : int |
|---|---|
Examples
>>> cal=load_example()
>>> mb=Maximum_Breaks(cal,k=5)
>>> mb.k
5
>>> mb.bins
array([ 146.005, 228.49 , 546.675, 2417.15 , 4111.45 ])
>>> mb.counts
array([50, 2, 4, 1, 1])
>>>
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Natural Breaks Map Classification
| Parameters: | y : array (n,1); k : int; initial : int (default=100) |
|---|---|
Notes
There is a tradeoff here between speed and consistency of the classification. If you want more speed, set initial to a smaller value (0 would result in the best speed). If you want more consistent classes in multiple runs of Natural_Breaks on the same data, set initial to a higher value.
Examples
>>> cal=load_example()
>>> nb=Natural_Breaks(cal,k=5)
>>> nb.k
5
>>> nb.counts
array([14, 13, 14, 10, 7])
>>> nb.bins
[1.8100000000000001, 7.5999999999999996, 29.82, 181.27000000000001, 4111.4499999999998]
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Quantile Map Classification
| Parameters: | y : array (n,1); k : int |
|---|---|
Examples
>>> cal=load_example()
>>> q=Quantiles(cal,k=5)
>>> q.bins
array([ 1.46400000e+00, 5.79800000e+00, 1.32780000e+01,
5.46160000e+01, 4.11145000e+03])
>>> q.counts
array([12, 11, 12, 11, 12])
>>>
Attributes
| yb | array (n,1) | bin ids for observations; each value is the id of the class the observation belongs to: yb[i] = j for j>=1 if bins[j-1] < y[i] <= bins[j], and yb[i] = 0 otherwise |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Percentiles Map Classification
| Parameters: | y : array; pct : array |
|---|---|
Examples
>>> cal=load_example()
>>> p=Percentiles(cal)
>>> p.bins
array([ 1.35700000e-01, 5.53000000e-01, 9.36500000e+00,
2.13914000e+02, 2.17994800e+03, 4.11145000e+03])
>>> p.counts
array([ 1, 5, 23, 23, 5, 1])
>>> p2=Percentiles(cal,pct=[50,100])
>>> p2.bins
array([ 9.365, 4111.45 ])
>>> p2.counts
array([29, 29])
>>> p2.k
2
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Standard Deviation and Mean Map Classification
| Parameters: | y : array (n,1); multiples : array |
|---|---|
Examples
>>> cal=load_example()
>>> st=Std_Mean(cal)
>>> st.k
5
>>> st.bins
array([ -967.36235382, -420.71712519, 672.57333208, 1219.21856072,
4111.45 ])
>>> st.counts
array([ 0, 0, 56, 1, 1])
>>>
>>> st3=Std_Mean(cal,multiples=[-3,-1.5,1.5,3])
>>> st3.bins
array([-1514.00758246, -694.03973951, 945.8959464 , 1765.86378936,
4111.45 ])
>>> st3.counts
array([ 0, 0, 57, 0, 1])
>>>
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
User Specified Binning
| Parameters: | y : array (n,1); bins : array (k,1) |
|---|---|
Notes
If the upper bound of the user bins is less than max(y), an additional bin with upper bound max(y) is appended.
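The appended-bin behaviour can be sketched as follows. This is an illustration built on the standard library, with a hypothetical helper name `user_defined_counts`; it is not the User_Defined implementation.

```python
from bisect import bisect_left

def user_defined_counts(values, bins):
    """Return (bins, counts), adding a final bin when the data exceed the last bound."""
    bins = sorted(bins)
    top = max(values)
    if bins[-1] < top:
        bins.append(top)  # extra class so every value has an upper bound
    counts = [0] * len(bins)
    for v in values:
        counts[bisect_left(bins, v)] += 1  # lower < v <= upper
    return bins, counts

ud_bins, ud_counts = user_defined_counts([5, 12, 20, 25, 30, 90], [20, 30])
print(ud_bins, ud_counts)  # [20, 30, 90] [3, 2, 1]
```

When the last user bin already equals max(y), nothing is appended and the number of classes matches the number of bins supplied.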
Examples
>>> cal=load_example()
>>> bins=[20,max(cal)]
>>> bins
[20, 4111.4499999999998]
>>> ud=User_Defined(cal,bins)
>>> ud.bins
array([ 20. , 4111.45])
>>> ud.counts
array([37, 21])
>>> bins=[20,30]
>>> ud=User_Defined(cal,bins)
>>> ud.bins
array([ 20. , 30. , 4111.45])
>>> ud.counts
array([37, 4, 17])
>>>
Attributes
| yb | array (n,1) | bin ids for observations |
| bins | array (k,1) | the upper bounds of each class |
| k | int | the number of classes |
| counts | array (k,1) | the number of observations falling in each class |
Methods
| get_adcm | |
| get_gadf | |
| get_tss | |
Absolute deviation around class means (ADCM).
Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.
Returns sum of ADCM over all classes
Total sum of squares around class means
Returns sum of squares over all class means
Evaluate the Goodness of Absolute Deviation Fit (GADF) of a classifier.
Finds the minimum value of k for which gadf>pct.
| Parameters: | y : array (n,1); method : string; maxk : int; pct : float |
|---|---|
| Returns: | implicit : tuple |
Notes
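The GADF is defined as:

$$GADF = 1 - \frac{\sum_c \sum_{i \in c} |y_i - y_{c,med}|}{\sum_i |y_i - y_{med}|}$$

where $y_{med}$ is the global median and $y_{c,med}$ is the median for class $c$.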
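A GADF of this form (one minus the ratio of within-class to global absolute deviations about the median) can be computed directly from values and their class ids. The sketch uses only the standard library; the helper name `gadf_value` is hypothetical and is distinct from the gadf function documented here, which also searches over k.

```python
import statistics
from collections import defaultdict

def gadf_value(values, class_ids):
    """1 - (within-class absolute deviations) / (global absolute deviations),
    with deviations taken about medians."""
    groups = defaultdict(list)
    for v, c in zip(values, class_ids):
        groups[c].append(v)
    within = 0.0
    for members in groups.values():
        class_median = statistics.median(members)
        within += sum(abs(v - class_median) for v in members)
    global_median = statistics.median(values)
    total = sum(abs(v - global_median) for v in values)
    return 1.0 - within / total  # assumes the values are not all identical

score = gadf_value([0.0, 2.0, 8.0, 10.0], [0, 0, 1, 1])
print(score)  # 0.75
```

A score close to 1 means the classes absorb most of the absolute deviation in the data.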
Examples
>>> cal=load_example()
>>> qgadf=gadf(cal)
>>> qgadf[0]
15
>>> qgadf[-1]
0.37402575909092828
Quantiles fail to exceed 0.80 before 15 classes. If we lower the bar to 0.2, we see quintiles as a result.
>>> qgadf2=gadf(cal,pct=0.2)
>>> qgadf2[0]
5
>>> qgadf2[-1]
0.21710231966462412
>>>
Evaluate all k-classifiers and pick the optimal one based on k and GADF.
| Parameters: | y : array (n,1); pct : float |
|---|---|
Notes
This can be used to suggest a classification scheme.
Examples
>>> cal=load_example()
>>> ks=K_classifiers(cal)
>>> ks.best.name
'Fisher_Jenks'
>>> ks.best.k
4
>>> ks.best.gadf
0.84810327199081048
>>>
Attributes
| best | instance of Map_Classifier | the optimal classifier |
| results | dictionary | keys are classifier names, values are the Map_Classifier instances with the best pct for each classifier |