esda.mapclassify — Choropleth map classification

New in version 1.0.

For an array y of n values, a map classifier places each value y_i into one of k mutually exclusive and exhaustive classes. Each classifier defines the classes based on different criteria, but in all cases the following holds for the classifiers in PySAL:

C_j^l < y_i \le C_j^u \  \forall  i \in C_j

where C_j denotes class j which has lower bound C_j^l and upper bound C_j^u.
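
The membership rule can be illustrated with a short sketch. It assumes the classes are described by a sorted list of upper bounds, so that the lower bound of class j is the upper bound of class j-1. The function name `classify` is hypothetical and not part of PySAL.

```python
from bisect import bisect_left

def classify(values, upper_bounds):
    """Return the class index j of each value, where
    upper_bounds[j-1] < value <= upper_bounds[j].

    Hypothetical helper for illustration; upper_bounds must be sorted.
    """
    # bisect_left returns the first index whose bound is >= value,
    # i.e. the class whose upper bound is the smallest one not below value.
    return [bisect_left(upper_bounds, v) for v in values]

labels = classify([5, 10, 10.5, 20, 30], [10, 20, 30])
print(labels)  # [0, 0, 1, 1, 2]
```

A value equal to an upper bound falls in that class, not the next one, which matches the strict lower and inclusive upper inequality above.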

Utilities

In addition to the classifiers, there are several utility functions that can be used to evaluate the properties of a specific classifier for different parameter values, or for automatic selection of a classifier and number of classes.

References

Slocum, T.A., R.B. McMaster, F.C. Kessler and H.H. Howard (2009) Thematic Cartography and Geovisualization. Pearson Prentice Hall, Upper Saddle River.

API

A module of classification schemes for choropleth mapping.

pysal.esda.mapclassify.quantile(y, k=4)

Calculates the quantiles for an array

Parameters:

y : array (n,1)

values to classify

k : int

number of quantiles

Returns:

implicit : array (n,1)

quantile values

Examples

>>> x=np.arange(1000)
>>> quantile(x)
array([ 249.75,  499.5 ,  749.25,  999.  ])
>>> quantile(x,k=3)
array([ 333.,  666.,  999.])
>>> 

Note that if there are enough ties that the quantile values repeat, we collapse to pseudo quantiles, in which case the number of classes will be less than k:

>>> x=[1.0]*100
>>> x.extend([3.0]*40)
>>> len(x)
140
>>> y=np.array(x)
>>> quantile(y)
array([ 1.,  3.])
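
The collapsing behaviour can be sketched in pure Python. This is a minimal illustration, assuming linear interpolation between order statistics; `quantile_breaks` is a hypothetical name, and the library may compute its quantiles differently.

```python
def quantile_breaks(y, k=4):
    """Upper bounds of k quantile classes, with repeated values collapsed.

    Hypothetical sketch; assumes linear interpolation between
    neighbouring sorted values.
    """
    s = sorted(float(v) for v in y)
    n = len(s)
    breaks = []
    for j in range(1, k + 1):
        pos = j * (n - 1) / k            # fractional index of the j-th quantile
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        value = s[lo] + (pos - lo) * (s[hi] - s[lo])
        # Ties can make consecutive quantiles equal; keep each value once.
        if not breaks or value > breaks[-1]:
            breaks.append(value)
    return breaks

print(quantile_breaks(range(1000)))               # [249.75, 499.5, 749.25, 999.0]
print(quantile_breaks([1.0] * 100 + [3.0] * 40))  # [1.0, 3.0]
```

With 100 ones and 40 threes, the first two quartiles are both 1.0 and the last two are both 3.0, so only two distinct breaks remain.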
class pysal.esda.mapclassify.Map_Classifier(y)

Abstract class for all map classifications

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means
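
The two sums described above can be sketched as follows. The sketch follows the wording of the method descriptions literally (deviations about class means); the function names and the `labels` argument are hypothetical, and the library's own implementation may differ in detail.

```python
from collections import defaultdict

def _group(values, labels):
    """Collect the values belonging to each class label."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    return groups

def adcm(values, labels):
    """Sum over classes of absolute deviations about the class mean."""
    total = 0.0
    for members in _group(values, labels).values():
        mean = sum(members) / len(members)
        total += sum(abs(v - mean) for v in members)
    return total

def tss(values, labels):
    """Sum over classes of squared deviations about the class mean."""
    total = 0.0
    for members in _group(values, labels).values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total

values = [1, 2, 6, 10, 14]
labels = [0, 0, 0, 1, 1]      # class id of each observation
print(adcm(values, labels))   # 10.0
print(tss(values, labels))    # 22.0
```

Both measures shrink as classes become more homogeneous, so smaller values indicate a tighter classification for a fixed k.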

class pysal.esda.mapclassify.Box_Plot(y, hinge=1.5)

Box_Plot Map Classification

Parameters:

y : array

attribute to classify

hinge : float

multiplier for IQR

Notes

The bins are set as follows:

bins[0] = q[0]-hinge*IQR
bins[1] = q[0]
bins[2] = q[1]
bins[3] = q[2]
bins[4] = q[2]+hinge*IQR
bins[5] = inf  (see Notes)

where q is an array of the first three quartiles of y and IQR=q[2]-q[0]

If q[2]+hinge*IQR > max(y) there will only be 5 classes and no high outliers,
otherwise, there will be 6 classes and at least one high outlier.
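
The bin rule in these notes can be written out as a small sketch. It follows the formulas as stated, assumes quartiles obtained by linear interpolation, and assumes that the sixth bound is max(y) when high outliers exist. The names `quartiles` and `box_plot_bins` are hypothetical, and the values it produces are not claimed to match the library's output.

```python
def quartiles(y):
    """First three quartiles of y using linear interpolation (an assumption)."""
    s = sorted(float(v) for v in y)
    n = len(s)
    out = []
    for j in (1, 2, 3):
        pos = j * (n - 1) / 4
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return out

def box_plot_bins(y, hinge=1.5):
    """Upper bounds following the rule in the notes above."""
    q = quartiles(y)
    iqr = q[2] - q[0]
    bins = [q[0] - hinge * iqr, q[0], q[1], q[2], q[2] + hinge * iqr]
    # A sixth class is needed only if some value lies above the upper fence.
    if max(y) > bins[4]:
        bins.append(float(max(y)))
    return bins

print(box_plot_bins(range(1, 10)))               # [-3.0, 3.0, 5.0, 7.0, 13.0]
print(box_plot_bins(list(range(1, 9)) + [100]))  # [-3.0, 3.0, 5.0, 7.0, 13.0, 100.0]
```

The second call contains one value above the upper fence of 13.0, so a sixth class is added for it.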

Examples

>>> cal=load_example()
>>> bp=Box_Plot(cal)
>>> bp.bins
array([ -7.24325000e+01,   2.56750000e+00,   9.36500000e+00,
         3.95300000e+01,   1.14530000e+02,   4.11145000e+03])
>>> bp.counts
array([ 0, 15, 14, 14,  7,  8])
>>> bp.high_outlier_ids
array([ 0,  6, 18, 29, 33, 37, 40, 42])
>>> cal[bp.high_outlier_ids]
array([  329.92,   181.27,   370.5 ,   722.85,   192.05,  4111.45,
         317.11,   264.93])
>>> bx=Box_Plot(np.arange(100))
>>> bx.bins
array([ -50.25,   24.75,   49.5 ,   74.25,  149.25])

Attributes

yb array (n,1) bin ids for observations
bins array (k,1) the upper bounds of each class (monotonic)
k int the number of classes
counts array (k,1) the number of observations falling in each class
low_outlier_ids array indices of observations that are low outliers
high_outlier_ids array indices of observations that are high outliers

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Equal_Interval(y, k=5)

Equal Interval Classification

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

Notes

Intervals defined to have equal width:

bins_j = min(y)+w*(j+1)

with w=\frac{max(y)-min(y)}{k}
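
As a quick illustration of the formula, the following sketch computes equal-width bounds and class counts, taking the width as (max(y) - min(y)) / k. The function name is hypothetical, and pinning the last bound to max(y) is an added guard against rounding, not something stated in the notes.

```python
from bisect import bisect_left

def equal_interval_bins(y, k=5):
    """Upper bounds bins_j = min(y) + w*(j+1) for j = 0..k-1."""
    lo, hi = min(y), max(y)
    w = (hi - lo) / k
    bins = [lo + w * (j + 1) for j in range(k)]
    bins[-1] = float(hi)  # guard against floating-point rounding
    return bins

y = [0, 2, 4, 10]
bins = equal_interval_bins(y, k=5)
print(bins)    # [2.0, 4.0, 6.0, 8.0, 10.0]

# Count observations per class: value v belongs to the first bin >= v.
counts = [0] * len(bins)
for v in y:
    counts[bisect_left(bins, v)] += 1
print(counts)  # [2, 1, 0, 0, 1]
```

Empty classes, as in the middle of this example, are typical for equal intervals on skewed data.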

Examples

>>> cal=load_example()
>>> ei=Equal_Interval(cal,k=5)
>>> ei.k
5
>>> ei.counts
array([57,  0,  0,  0,  1])
>>> ei.bins
array([  822.394,  1644.658,  2466.922,  3289.186,  4111.45 ])
>>> 

Attributes

yb array (n,1) bin ids for observations, each value is the id of the class the observation belongs to yb[i] = j for j>=1 if bins[j-1] < y[i] <= bins[j], yb[i] = 0 otherwise
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Fisher_Jenks(y, k=5)

Fisher Jenks optimal classifier

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

Examples

>>> cal=load_example()
>>> fj=Fisher_Jenks(cal)
>>> fj.adcm
832.8900000000001
>>> fj.bins
[110.73999999999999, 192.05000000000001, 370.5, 722.85000000000002, 4111.4499999999998]
>>> fj.counts
array([50,  2,  4,  1,  1])
>>> 

Attributes

yb array (n,1) bin ids for observations
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Jenks_Caspall(y, k=5)

Jenks Caspall Map Classification

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

Examples

>>> cal=load_example()
>>> jc=Jenks_Caspall(cal,k=5)
>>> jc.bins
array([[  1.81000000e+00],
       [  7.60000000e+00],
       [  2.98200000e+01],
       [  1.81270000e+02],
       [  4.11145000e+03]])
>>> jc.counts
array([14, 13, 14, 10,  7])

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Jenks_Caspall_Forced(y, k=5)

Jenks Caspall Map Classification with forced movements

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

Examples

>>> cal=load_example()
>>> jcf=Jenks_Caspall_Forced(cal,k=5)
>>> jcf.k
5
>>> jcf.bins
array([[  1.34000000e+00],
       [  5.90000000e+00],
       [  1.67000000e+01],
       [  5.06500000e+01],
       [  4.11145000e+03]])
>>> jcf.counts
array([12, 12, 13,  9, 12])
>>> jcf4=Jenks_Caspall_Forced(cal,k=4)
>>> jcf4.k
4
>>> jcf4.bins
array([[  2.51000000e+00],
       [  8.70000000e+00],
       [  3.66800000e+01],
       [  4.11145000e+03]])
>>> jcf4.counts
array([15, 14, 14, 15])
>>> 

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Jenks_Caspall_Sampled(y, k=5, pct=0.1)

Jenks Caspall Map Classification using a random sample

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

pct : float

The percentage of n that should form the sample. If pct is specified such that n*pct > 1000, then pct = 1000./n.

Notes

This is intended for large n problems. The logic is to apply Jenks_Caspall to a random subset of the y space and then bin the complete vector y on the bins obtained from the subset. This would trade off some “accuracy” for a gain in speed.
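
The sampling logic can be sketched without PySAL. In this illustration `classify` is a stand-in for the real classifier (a simple equal-width rule is used here, not Jenks-Caspall), all function names are hypothetical, and setting the last bound to max(y) is an assumption made so that the bins cover the complete vector.

```python
import random
from bisect import bisect_left

def sample_fraction(n, pct=0.10):
    """Cap the sample at 1000 observations, as described for pct."""
    return 1000.0 / n if n * pct > 1000 else pct

def sampled_bins(y, k, classify, pct=0.10, seed=None):
    """Derive bins from a random subset of y, then bin all of y.

    classify(sample, k) is any function returning k sorted upper bounds.
    """
    n = len(y)
    size = min(n, max(k, round(n * sample_fraction(n, pct))))
    sample = random.Random(seed).sample(list(y), size)
    bins = list(classify(sample, k))
    bins[-1] = max(y)  # assumption: last bound must cover the full vector
    labels = [bisect_left(bins, v) for v in y]
    return bins, labels

def equal_width(sample, k):
    """Stand-in classifier: k equal-width upper bounds over the sample."""
    lo, hi = min(sample), max(sample)
    return [lo + (hi - lo) * (j + 1) / k for j in range(k)]

rng = random.Random(1)
y = [rng.random() for _ in range(5000)]
bins, labels = sampled_bins(y, 5, equal_width, seed=0)
print(len(bins), len(labels))   # 5 5000
print(sample_fraction(100000))  # 0.01
```

Only the sample is passed to the classifier, which is where the speed gain comes from; the full vector is touched once, for the final binning.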

Examples

>>> cal=load_example()
>>> x=np.random.random(100000)
>>> jc=Jenks_Caspall(x)
>>> jcs=Jenks_Caspall_Sampled(x)
>>> jc.bins
array([[ 0.19770952],
       [ 0.39695769],
       [ 0.59588617],
       [ 0.79716865],
       [ 0.99999425]])
>>> jcs.bins
array([[ 0.18877882],
       [ 0.39341638],
       [ 0.6028286 ],
       [ 0.80070925],
       [ 0.99999425]])
>>> jc.counts
array([19804, 20005, 19925, 20178, 20088])
>>> jcs.counts
array([18922, 20521, 20980, 19826, 19751])
>>> 

# Not for testing, since timings differ across hardware;
# included only to document the likely speed gains.
#>>> t1=time.time();jc=Jenks_Caspall(x);t2=time.time()
#>>> t1s=time.time();jcs=Jenks_Caspall_Sampled(x);t2s=time.time()
#>>> t2-t1;t2s-t1s
#1.8292930126190186
#0.061631917953491211

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Max_P_Classifier(y, k=5, initial=1000)

Max_P Map Classification

Based on Max_p regionalization algorithm

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

initial : int

number of initial solutions to use prior to swapping

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Maximum_Breaks(y, k=5, mindiff=0)

Maximum Breaks Map Classification

Parameters:

y : array (n x 1)

values to classify

k : int

number of classes required

Examples

>>> cal=load_example()
>>> mb=Maximum_Breaks(cal,k=5)
>>> mb.k
5
>>> mb.bins
array([  146.005,   228.49 ,   546.675,  2417.15 ,  4111.45 ])
>>> mb.counts
array([50,  2,  4,  1,  1])
>>> 

Attributes

yb array (nx1) bin ids for observations
bins array (kx1) the upper bounds of each class
k int the number of classes
counts array (kx1) the number of observations falling in each class (numpy array k x 1)

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Natural_Breaks(y, k=5, initial=100)

Natural Breaks Map Classification

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

initial : int (default=100)

number of initial solutions to generate

Notes

There is a tradeoff here between speed and consistency of the classification. If you want more speed, set initial to a smaller value (0 would result in the best speed). If you want more consistent classes in multiple runs of Natural_Breaks on the same data, set initial to a higher value.

Examples

>>> cal=load_example()
>>> nb=Natural_Breaks(cal,k=5)
>>> nb.k
5
>>> nb.counts
array([14, 13, 14, 10,  7])
>>> nb.bins
[1.8100000000000001, 7.5999999999999996, 29.82, 181.27000000000001, 4111.4499999999998]

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Quantiles(y, k=5)

Quantile Map Classification

Parameters:

y : array (n,1)

values to classify

k : int

number of classes required

Examples

>>> cal=load_example()
>>> q=Quantiles(cal,k=5)
>>> q.bins
array([  1.46400000e+00,   5.79800000e+00,   1.32780000e+01,
         5.46160000e+01,   4.11145000e+03])
>>> q.counts
array([12, 11, 12, 11, 12])
>>> 

Attributes

yb array (n,1) bin ids for observations, each value is the id of the class the observation belongs to yb[i] = j for j>=1 if bins[j-1] < y[i] <= bins[j], yb[i] = 0 otherwise
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Percentiles(y, pct=[1, 10, 50, 90, 99, 100])

Percentiles Map Classification

Parameters:

y : array

attribute to classify

pct : array

percentiles default=[1,10,50,90,99,100]
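
How a list of percentiles turns into class upper bounds can be sketched as below. This is an illustration only: `percentile_bins` is a hypothetical name, and linear interpolation between sorted values is an assumption about how the percentiles are computed.

```python
def percentile_bins(y, pct=(1, 10, 50, 90, 99, 100)):
    """Upper bound of each class = the requested percentile of y."""
    s = sorted(float(v) for v in y)
    n = len(s)
    bins = []
    for p in pct:
        pos = p * (n - 1) / 100          # fractional index of the p-th percentile
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        bins.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return bins

print(percentile_bins(range(101)))                   # [1.0, 10.0, 50.0, 90.0, 99.0, 100.0]
print(percentile_bins([1, 2, 3, 4], pct=[50, 100]))  # [2.5, 4.0]
```

The number of classes equals the number of percentiles supplied, and including 100 makes the last bound the maximum of y.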

Examples

>>> cal=load_example()
>>> p=Percentiles(cal)
>>> p.bins
array([  1.35700000e-01,   5.53000000e-01,   9.36500000e+00,
         2.13914000e+02,   2.17994800e+03,   4.11145000e+03])
>>> p.counts
array([ 1,  5, 23, 23,  5,  1])
>>> p2=Percentiles(cal,pct=[50,100])
>>> p2.bins
array([    9.365,  4111.45 ])
>>> p2.counts
array([29, 29])
>>> p2.k
2

Attributes

yb array bin ids for observations (numpy array n x 1)
bins array the upper bounds of each class (numpy array k x 1)
k int the number of classes
counts array the number of observations falling in each class (numpy array k x 1)

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.Std_Mean(y, multiples=[-2, -1, 1, 2])

Standard Deviation and Mean Map Classification

Parameters:

y : array (n,1)

values to classify

multiples : array

the multiples of the standard deviation to add/subtract from the sample mean to define the bins, default=[-2,-1,1,2]
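
The way the multiples define the bins can be sketched as follows. Two points are assumptions rather than facts from this page: the sketch uses the sample standard deviation (n - 1 in the denominator), and it appends max(y) as a final bound only when max(y) exceeds the last computed bound. The function name is hypothetical.

```python
import statistics

def std_mean_bins(y, multiples=(-2, -1, 1, 2)):
    """Upper bounds at mean + m * sd for each multiple m."""
    mean = statistics.mean(y)
    sd = statistics.stdev(y)  # assumption: sample standard deviation
    bins = [mean + m * sd for m in multiples]
    # Assumption: add a closing class so the largest value is covered.
    if max(y) > bins[-1]:
        bins.append(float(max(y)))
    return bins

y = [2, 4, 4, 4, 5, 5, 7, 9, 40]
bins = std_mean_bins(y)
print(len(bins))   # 5
print(bins[-1])    # 40.0
```

Because the bounds are symmetric about the mean, a strongly skewed variable tends to leave the lowest classes empty.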

Examples

>>> cal=load_example()
>>> st=Std_Mean(cal)
>>> st.k
5
>>> st.bins
array([ -967.36235382,  -420.71712519,   672.57333208,  1219.21856072,
        4111.45      ])
>>> st.counts
array([ 0,  0, 56,  1,  1])
>>> 
>>> st3=Std_Mean(cal,multiples=[-3,-1.5,1.5,3])
>>> st3.bins
array([-1514.00758246,  -694.03973951,   945.8959464 ,  1765.86378936,
        4111.45      ])
>>> st3.counts
array([ 0,  0, 57,  0,  1])
>>> 

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

class pysal.esda.mapclassify.User_Defined(y, bins)

User Specified Binning

Parameters:

y : array (n,1)

values to classify

bins : array (k,1)

upper bounds of classes (must be monotonically increasing)

Notes

If upper bound of user bins does not exceed max(y) we append an additional bin.
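
A minimal sketch of this rule is shown below. The function name is hypothetical, and using max(y) as the appended bound is an assumption (it agrees with the second example that follows).

```python
from bisect import bisect_left

def user_defined_bins(y, bins):
    """Return the user bins, extended so that every value of y is covered."""
    bins = [float(b) for b in bins]
    # Append a closing class only when the largest bound falls short of max(y).
    if bins[-1] < max(y):
        bins.append(float(max(y)))
    return bins

y = [1, 5, 25, 50]
print(user_defined_bins(y, [20, 50]))  # [20.0, 50.0]
print(user_defined_bins(y, [20, 30]))  # [20.0, 30.0, 50.0]

# Class counts for the extended bins.
bins = user_defined_bins(y, [20, 30])
counts = [0] * len(bins)
for v in y:
    counts[bisect_left(bins, v)] += 1
print(counts)  # [2, 1, 1]
```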

Examples

>>> cal=load_example()
>>> bins=[20,max(cal)]
>>> bins
[20, 4111.4499999999998]
>>> ud=User_Defined(cal,bins)
>>> ud.bins
array([   20.  ,  4111.45])
>>> ud.counts
array([37, 21])
>>> bins=[20,30]
>>> ud=User_Defined(cal,bins)
>>> ud.bins
array([   20.  ,    30.  ,  4111.45])
>>> ud.counts
array([37,  4, 17])
>>> 

Attributes

yb array (n,1) bin ids for observations,
bins array (k,1) the upper bounds of each class
k int the number of classes
counts array (k,1) the number of observations falling in each class

Methods

get_adcm
get_gadf
get_tss
get_adcm()

Absolute deviation around class means (ADCM).

Calculates the absolute deviations of each observation about its class mean as a measure of fit for the classification method.

Returns sum of ADCM over all classes

get_gadf()
Goodness of absolute deviation of fit
get_tss()

Total sum of squares around class means

Returns sum of squares over all class means

pysal.esda.mapclassify.gadf(y, method='Quantiles', maxk=15, pct=0.8)

Evaluate the Goodness of Absolute Deviation Fit (GADF) of a classifier. Finds the minimum value of k for which gadf > pct.

Parameters:

y : array (nx1)

values to be classified

method : string

Name of classifier ["Quantiles", "Fisher_Jenks", "Maximum_Breaks", "Natural_Breaks"]

maxk : int

maximum value of k to evaluate

pct : float

The percentage of GADF to exceed

Returns:

implicit : tuple

first value is k, second value is instance of classifier at k, third is the pct obtained

See also

K_classifiers

Notes

The GADF is defined as:

GADF = 1 - \frac{\sum_c \sum_{i \in c} |y_i - y_{c,med}|}{\sum_i |y_i - y_{med}|}

where y_{med} is the global median and y_{c,med} is the median for class c.
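
The definition can be checked by hand with a small sketch. The function below evaluates the formula directly from values and class labels; `gadf_score` and its `labels` argument are hypothetical names chosen to avoid confusion with the library's `gadf` function, which takes different arguments.

```python
from collections import defaultdict
from statistics import median

def gadf_score(values, labels):
    """GADF = 1 - (within-class abs. deviations) / (global abs. deviations),
    with deviations measured about medians."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    # Numerator: deviations of each value about its own class median.
    within = sum(
        abs(v - median(members))
        for members in groups.values()
        for v in members
    )
    # Denominator: deviations of each value about the global median.
    global_med = median(values)
    total = sum(abs(v - global_med) for v in values)
    return 1 - within / total

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(round(gadf_score(values, labels), 3))  # 0.852
```

Here the within-class deviations sum to 4 and the global deviations to 27, so GADF = 1 - 4/27. A value near 1 means the classes absorb most of the variation about the median.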

Examples

>>> cal=load_example()
>>> qgadf=gadf(cal)
>>> qgadf[0]
15
>>> qgadf[-1]
0.37402575909092828

Quantiles fail to exceed 0.80 before 15 classes. If we lower the bar to 0.2, we see quintiles as a result:

>>> qgadf2=gadf(cal,pct=0.2)
>>> qgadf2[0]
5
>>> qgadf2[-1]
0.21710231966462412
>>> 

class pysal.esda.mapclassify.K_classifiers(y, pct=0.8)

Evaluate all k-classifiers and pick the optimal one based on k and GADF

Parameters:

y : array (nx1)

values to be classified

pct : float

The percentage of GADF to exceed

See also

gadf

Notes

This can be used to suggest a classification scheme.

Examples

>>> cal=load_example()
>>> ks=K_classifiers(cal)
>>> ks.best.name
'Fisher_Jenks'
>>> ks.best.k
4
>>> ks.best.gadf
0.84810327199081048
>>> 

Attributes

best instance of Map_Classifier the optimal classifier
results dictionary keys are classifier names, values are the Map_Classifier instances with the best pct for each classifier