pysal

Table Of Contents

Previous topic

Spatial Smoothing

Next topic

Spatial Dynamics

This Page

Regionalization

Introduction

PySAL offers a number of tools for the construction of regions. For the purposes of this section, a “region” is a group of “areas,” and there are generally multiple regions in a particular dataset. At this time, PySAL offers the max-p regionalization algorithm and tools for constructing random regions.

max-p

Most regionalization algorithms require the user to define a priori the number of regions to be built (e.g. k-means clustering). The max-p algorithm [1] determines the number of regions (p) endogenously based on a set of areas, a matrix of attributes on each area and a floor constraint. The floor constraint defines the minimum bound that a variable must reach for each region; for example, a constraint might be the minimum population each region must have. max-p further enforces a contiguity constraint on the areas within regions.

To illustrate this we will use data on per capita income from the lower 48 US states over the period 1929-2010. The goal is to form contiguous regions of states displaying similar levels of income throughout this period:

>>> f=pysal.open("../examples/usjoin.csv")
>>> pci=np.array([f.by_col[str(y)] for y in range(1929,2010)])
>>> pci=pci.transpose()
>>> pci.shape
(48, 81)

We also require set of binary contiguity weights for the Maxp class:

>>> w=pysal.open("../examples/states48.gal").read()

Once we have the attribute data and our weights object we can create an instance of Maxp:

>>> r=pysal.Maxp(w,pci,floor=5,floor_variable=np.ones((48,1)),initial=99)

Here we are forming regions with a minimum of 5 states in each region, so we set the floor_variable to a simple unit vector to ensure this floor constraint is satisfied. We also specify the initial number of feasible solutions to 99 - which are then searched over to pick the optimal feasible solution to then commence with the more expensive swapping component of the algorithm. [2]

The Maxp instance s has a number of attributes regarding the solution. First is the definition of the regions:

>>> r.regions
[['38', '31', '24', '13', '4', '23', '22', '12', '1'], ['25', '9', '3', '44', '34', '47'], ['45', '43', '35', '30', '32'], ['33', '28', '40', '15', '41', '21', '2', '0'], ['39', '14', '8', '7', '37'], ['26', '42', '16', '18', '36'], ['6', '17', '27', '29', '5'], ['11', '10', '19', '46', '20']]

which is a list of eight lists of region ids. For example, the first nested list indicates there are nine states in the first region, while the last region has five states. To determine which states these are we can read in the names from the original csv file:

>>> f.header
['Name', 'STATE_FIPS', '1929', '1930', '1931', '1932', '1933', '1934', '1935', '1936', '1937', '1938', '1939', '1940', '1941', '1942', '1943', '1944', '1945', '1946', '1947', '1948', '1949', '1950', '1951', '1952', '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
>>> names=f.by_col('Name')
>>> names=np.array(names)
>>> names
array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='|S14')

and then loop over the region definitions to identify the specific states comprising each of the regions:

>>> for region in r.regions:
...     ids=map(int,region)
...     names[ids]
...
array(['South Dakota', 'North Dakota', 'Nebraska', 'Kansas', 'Colorado',
       'Montana', 'Missouri', 'Iowa', 'Arizona'],
      dtype='|S14')
array(['Nevada', 'Idaho', 'California', 'Washington', 'Oregon', 'Wyoming'],
      dtype='|S14')
array(['West Virginia', 'Virginia', 'Pennsylvania', 'North Carolina',
       'Ohio'],
      dtype='|S14')
array(['Oklahoma', 'New Mexico', 'Texas', 'Louisiana', 'Utah',
       'Mississippi', 'Arkansas', 'Alabama'],
      dtype='|S14')
array(['Tennessee', 'Kentucky', 'Georgia', 'Florida', 'South Carolina'],
      dtype='|S14')
array(['New Hampshire', 'Vermont', 'Maine', 'Massachusetts', 'Rhode Island'],
      dtype='|S14')
array(['Delaware', 'Maryland', 'New Jersey', 'New York', 'Connecticut'],
      dtype='|S14')
array(['Indiana', 'Illinois', 'Michigan', 'Wisconsin', 'Minnesota'],
      dtype='|S14')

We can evaluate our solution by developing a pseudo pvalue for the regionalization. This is done by comparing the within region sum of squares for the solution against simulated solutions where areas are randomly assigned to regions that maintain the cardinality of the original solution. This method must be explicitly called once the Maxp instance has been created:

>>> r.inference()
>>> r.pvalue
0.01

so we see we have a regionalization that is significantly different than a chance partitioning.

Random Regions

PySAL offers functionality to generate random regions based user-defined constraints. There are three optional parameters to constrain the regionalization: number of regions, cardinality and contiguity. The default case simply takes a list of area IDs and randomly selects the number of regions and then allocates areas to each region. The user can also pass a vector of integers to the cardinality parameter to designate the number of areas to randomly assign to each region. The contiguity parameter takes a spatial weights object and uses that to ensure that each region is made up of spatially contiguous areas. When the contiguity constraint is enforced, it is possible to arrive at infeasible solutions; the maxiter parameter can be set to make multiple attempts to find a feasible solution. The following examples show some of the possible combinations of constraints.

>>> import random
>>> import numpy as np
>>> import pysal
>>> from pysal.region import Random_Region
>>> nregs = 13
>>> cards = range(2,14) + [10]
>>> w = pysal.lat2W(10,10,rook=False)
>>> ids = w.id_order
>>>
>>> # unconstrained
>>> random.seed(10)
>>> np.random.seed(10)
>>> t0 = Random_Region(ids)
>>> t0.regions[0]
[19, 14, 43, 37, 66, 3, 79, 41, 38, 68, 2, 1, 60]
>>> # cardinality and contiguity constrained (num_regions implied)
>>> random.seed(60)
>>> np.random.seed(60)
>>> t1 = pysal.region.Random_Region(ids, num_regions=nregs, cardinality=cards, contiguity=w)
>>> t1.regions[0]
[7, 18, 29, 17, 38, 48, 19, 28, 39, 37, 47, 36, 57]
>>> # cardinality constrained (num_regions implied)
>>> random.seed(100)
>>> np.random.seed(100)
>>> t2 = Random_Region(ids, num_regions=nregs, cardinality=cards)
>>> t2.regions[0]
[37, 62]
>>> # number of regions and contiguity constrained
>>> random.seed(100)
>>> np.random.seed(100)
>>> t3 = Random_Region(ids, num_regions=nregs, contiguity=w)
>>> t3.regions[1]
[79, 68, 57, 89, 47, 98, 36, 56, 69, 58, 78, 26, 48]
>>> # cardinality and contiguity constrained
>>> random.seed(60)
>>> np.random.seed(60)
>>> t4 = Random_Region(ids, cardinality=cards, contiguity=w)
>>> t4.regions[0]
[7, 18, 29, 17, 38, 48, 19, 28, 39, 37, 47, 36, 57]
>>> # number of regions constrained
>>> random.seed(100)
>>> np.random.seed(100)
>>> t5 = Random_Region(ids, num_regions=nregs)
>>> t5.regions[0]
[37, 62, 26, 41, 35, 25, 36]
>>> # cardinality constrained
>>> random.seed(100)
>>> np.random.seed(100)
>>> t6 = Random_Region(ids, cardinality=cards)
>>> t6.regions[0]
[37, 62]
>>> # contiguity constrained
>>> random.seed(100)
>>> np.random.seed(100)
>>> t7 = Random_Region(ids, contiguity=w)
>>> t7.regions[0]
[37, 27, 38, 29]
>>>

Footnotes

[1]Duque, J. C., L. Anselin and S. J. Rey. 2010. “The max-p region problem” Working Paper. GeoDa Center for Geospatial Analysis and Computation.
[2]Because this is a randomized algorithm, results may vary when replicating this example. To reproduce a regionalization solution, you should first set the random seed generator. See http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html for more information.