Geopandas and Cartopy

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Download the .ipynb file using the download button on the top right corner and import it in Google Colab.

Then install the following packages by executing the following command in a Jupyter cell at the top of the notebook.

!pip install pandas geopandas matplotlib cartopy mapclassify

Why not `pandas`?¶

Let’s reload again our example dataset of conventional power plants in Europe as a pd.DataFrame.

import pandas as pd
import matplotlib.pyplot as plt

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
exclude_carriers = ["Wind", "Solar", "Solid Biomass", "Biogas"]
ppl = pd.read_csv(fn, index_col=0).query(
    "Fueltype not in @exclude_carriers and Capacity > 20"
)

This dataset includes coordinates (latitude and longitude), which allows us to plot the location and capacity of all power plants in a scatter plot:

ppl.plot.scatter("lon", "lat", s=ppl.Capacity / 1e3)

<Axes: xlabel='lon', ylabel='lat'>

However, this graphs misses some geographic reference point, we’d normally expect for a map like shorelines, country borders and so on.

Why `geopandas`?¶

Geopandas extends pandas by adding support for geospatial data.

The core data structure in GeoPandas is the geopandas.GeoDataFrame, a subclass of pandas.DataFrame, that can store geometry columns and perform spatial operations.

Typical geometries are points, lines, and polygons. They come from another library called shapely, which helps you create, analyze, and manipulate two-dimensional shapes and their properties, such as points, lines, and polygons.

First, we need to import the geopandas package. The conventional alias is gpd:

import geopandas as gpd

We can convert the latitude and longitude values given in the dataset to formal geometries (to be exact, a shapely.Point object) using the gpd.points_from_xy() function, and use this to gpd.GeoDataFrame. We should also specify a so-called coordinate reference system (CRS). The code 4326 means latitude and longitude values, the so-called WGS84 system.

geometry = gpd.points_from_xy(ppl["lon"], ppl["lat"])
geometry[:5]

<GeometryArray>
[ <POINT (9.88 47.085)>, <POINT (10.06 46.935)>,  <POINT (6.51 45.497)>,
 <POINT (6.001 45.386)>, <POINT (8.732 46.069)>]
Length: 5, dtype: geometry

gdf = gpd.GeoDataFrame(ppl, geometry=geometry, crs=4326)

Now, the resulting gdf looks like this:

gdf.head(3)

gdf.geometry.head()

id
0     POINT (9.88012 47.08503)
1    POINT (10.05995 46.93529)
2       POINT (6.5103 45.4968)
3       POINT (6.0013 45.3865)
4       POINT (8.7324 46.0694)
Name: geometry, dtype: geometry

With the additional geometry columns, it is now even easier to plot the geographic data:

gdf.plot(
    column="Fueltype",
    markersize=gdf.Capacity / 1e2,
)

<Axes: >

We can also start up an interactive map to explore the geodata in more detail:

gdf.explore(column="Fueltype")

Projections with `cartopy`¶

Cartopy is a Python package designed for geospatial data processing and has an interface for handling projections in matplotlib maps.

Why is it needed?

The Earth is a globe, but we present maps usually on two-dimensional surfaces. Hence, we typically need to project data points onto flat surfaces (e.g. screens, paper). However, we will always loose some information in doing so.

A map projection is:

a systematic transformation of the latitudes and longitudes of locations from the surface of a sphere or an ellipsoid into locations on a plane. Wikipedia: Map projection.

Different projections preserve different metric properties. As a result, converting geodata from one projection to another is a common exercise in geographic data science.

conformal projections preserve angles/directions (e.g. Mercator projection)
equal-area projections preserve area measure (e.g. Mollweide)
equidistant projections preserve distances between points (e.g. Plate carrée)
compromise projections seek to strike a balance between distortions (e.g. Robinson)

If you like the “Orange-as-Earth” analogy for projections, checkout this numberphile video by Hannah Fry.

First, we need to import the relevant parts of the cartopy package:

import cartopy
import cartopy.crs as ccrs

Let’s draw a first map with cartopy outlining the global coastlines in the so-called plate carrée projection (equirectangular projection):

ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

<cartopy.mpl.feature_artist.FeatureArtist at 0x7f749bd163c0>

We can combine the functionality of cartopy with geopandas plots and even add geographical features:

fig = plt.figure(figsize=(7, 7))

ax = plt.axes(projection=ccrs.PlateCarree())

gdf.plot(
    ax=ax,
    column="Fueltype",
    markersize=gdf.Capacity / 1e2,
)

ax.coastlines()

ax.add_feature(cartopy.feature.BORDERS, color="red", linewidth=0.5)

ax.add_feature(cartopy.feature.OCEAN, color="blue")

ax.add_feature(cartopy.feature.LAND, color="yellow");

Geopandas will automatically calculate sensible bounds for the plot given the geographic data. But we can also manually zoom in or out by setting the spatial extent with the .set_extent() method:

ax.set_extent([5, 16, 47, 55])
fig

Reprojecting a `GeoDataFrame`¶

In geopandas, we can use the function .to_crs() to convert a GeoDataFrame with a given projection to another desired target coordinate reference system, using a cartopy CRS object:

crs = ccrs.Orthographic()
print(crs)

+proj=ortho +a=6378137.0 +lon_0=0.0 +lat_0=0.0 +alpha=0.0 +no_defs +type=crs

gdf.to_crs(crs).geometry.head()

id
0     POINT (745194.941 4671124.57)
1    POINT (760750.157 4659758.366)
2    POINT (506901.725 4548959.393)
3    POINT (468336.152 4540344.338)
4    POINT (671812.483 4593411.063)
Name: geometry, dtype: geometry

fig = plt.figure(figsize=(7, 7))

ax = plt.axes(projection=crs)

gdf.to_crs(crs).plot(
    ax=ax,
    column="Fueltype",
    markersize=gdf.Capacity / 1e2,
)

ax.coastlines()

<cartopy.mpl.feature_artist.FeatureArtist at 0x7f749bd67610>

Reading and Writing Files¶

In the following example, we’ll load a dataset containing the NUTS regions:

Nomenclature of Territorial Units for Statistics or NUTS (French: Nomenclature des unités territoriales statistiques) is a geocode standard for referencing the subdivisions of countries for statistical purposes.

Our ultimate goal for this part of the tutorial is to map the power plant capacities to the NUTS-1 region they belong to.

Common filetypes for vector-based geospatial datasets are GeoPackage (.gpkg), GeoJSON (.geojson), File Geodatabase (.gdb), or Shapefiles (.shp).

In geopandas we can use the gpd.read_file() function to read such files. So let’s start:

url = "https://tubcloud.tu-berlin.de/s/RHZJrN8Dnfn26nr/download/NUTS_RG_10M_2021_4326.geojson"

nuts = gpd.read_file(url)

nuts.head(3)

It is good practice to set an index. You can use .set_index() for that:

nuts = nuts.set_index("id")

We can also check out the geometries in the dataset with .geometry:

nuts.geometry

id
BG423    POLYGON ((24.42101 42.55306, 24.41032 42.4695,...
BG424    POLYGON ((25.07422 41.79348, 25.05851 41.75177...
BG425    POLYGON ((25.94863 41.32034, 25.90644 41.30757...
CH011    MULTIPOLYGON (((6.86623 46.90929, 6.89621 46.9...
CH012    POLYGON ((8.47767 46.5276, 8.39953 46.48872, 8...
                               ...                        
LV       POLYGON ((27.35158 57.51824, 27.54521 57.53444...
ME       POLYGON ((20.06394 43.00682, 20.32958 42.91149...
MK       POLYGON ((22.36021 42.31116, 22.51041 42.15516...
SK0      POLYGON ((19.88393 49.20418, 19.96275 49.23031...
IT       MULTIPOLYGON (((12.47792 46.67984, 12.69064 46...
Name: geometry, Length: 2010, dtype: geometry

With .crs we can check in which coordinate reference system the data is given:

nuts.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

With .total_bounds we can check the bounding box of the data in the original coordinate reference system:

nuts.total_bounds

array([-63.08825, -21.38917, 55.83616, 80.76427])

Let’s now filter the GeoDataFrame by NUTS-1 level...

nuts1 = nuts.query("LEVL_CODE == 1")

... and explore what kind of geometries we have in the dataset ...

nuts1.explore()

To write a GeoDataFrame back to file use GeoDataFrame.to_file(). The file format is inferred from the file ending.

nuts1.to_file("tmp.geojson")

Buffers and Areas¶

Many geospatial analyses require the calculation of areas or buffers around geometries. However, these calculations depend on the coordinate reference system used.

The first thing we need to do in order to calculate area or buffers is to reproject the GeoDataFrame to an equal-area projection (here: EPSG:3035 which is valid only within Europe; a globally valid alternative is the Mollweide projection EPSG:54009)

nuts1 = nuts1.to_crs(3035)

The area can be accessed via .area and is given in m² (after projection). Let’s convert to km²:

area = nuts1.area / 1e6
area.head()

id
AT1    23545.286205
AT2    25894.953057
EL4    17388.679384
EE0    45315.713593
EL3     3799.676547
dtype: float64

nuts1.explore(column=area, vmax=1e5)

We can also build a buffer of 1km around each geometry using .buffer():

nuts1.buffer(1000).explore()

It is also possible to calculate the centroid of each geometry, which is the geometric center of a shape. It can be useful for labelling or constructing a point-based network representation of polygonal data:

ax = nuts1.geometry.plot()
nuts1.representative_point().plot(ax=ax, color="red", markersize=2)

<Axes: >

To calculate the distance between two points (or geometries more generally), we can use the .distance() method. For instance, using the nuts1.representative_point() method, we can find a central point that represents each geometry but other than the centroid is guaranteed to be within the geometry. However, note that the distance is only meaningful if the geometries are in an appropriate projected coordinate reference system (e.g. an equidistant projection):

nuts1_de = nuts1.query("CNTR_CODE == 'DE'")
points = nuts1_de.representative_point()

ax = nuts1_de.plot()
points.plot(ax=ax, color="red", markersize=5)

<Axes: >

distances = (
    pd.concat({k: points.distance(p) for k, p in points.items()}, axis=1)
    .div(1e3)
    .astype(int)
)  # km
distances

The .distance() function uses the haversine formula to calculate the great-circle distance between two points on the Earth’s surface, given their latitude and longitude coordinates.

Dissolve¶

The .dissolve() function is used to aggregate geometries within a GeoDataFrame based on a specified attribute or set of attributes. Practically, it is very similar to the pandas.DataFrame.groupby() function, but instead of aggregating numerical values, it merges geometries.

For instance, we can dissolve the NUTS regions by country code to get country-level geometries:

nuts1.dissolve("CNTR_CODE")  # .plot()

Spatial Joins¶

Multiple GeoDataFrames can be combined via spatial joins.

Observations from two datasets are combined with the .sjoin() function based on their spatial relationship to one another (e.g. whether they are intersecting, overlapping or contained within one another). You can read more about the specific options here.

Source: https://pythongis.org

To perform a spatial join, you need to provide the following information:

The two geopandas.GeoDataFrame objects you want to join (e.g., left_df and right_df).
The type of spatial relationship to test (e.g., intersects, contains, within). This is specified using the predicate parameter.
The type of join to perform (e.g., inner, left, right). This is specified using the how parameter. When using left, the resulting GeoDataFrame will retain all records from the left_df, while matching records from the right_df will be added where the spatial relationship holds true. With inner, only records with matches in both GeoDataFrames will be retained. With right, all records from the right_df will be retained, with matching records from the left_df added where applicable.

The .sjoin() function will then go through the geometries in both GeoDataFrames, evaluate the specified spatial relationship, and join the matching records (behind the scenes, it uses spatial indexing to speed up the process).

For example, if the predicate parameter is set to 'intersects', the function will check if the geometries of each record in left_df intersect with the geometries of any records in right_df. If a match is found, the attributes of the corresponding records from both GeoDataFrames will be combined into a new record in the output GeoDataFrame.

Source: https://pythongis.org

The result of the spatial join is a new GeoDataFrame that retains the geometries from the left_df and includes attributes from both GeoDataFrames for the matching records.

When no match is found, the resulting attributes from the right_df will be set to NaN (e.g. when some power plants are located outside of NUTS-1 regions).

Let’s do an example, starting with first reprojecting the gdf object of power plant to the same CRS as nuts1 polygons and reducing it to coal plants:

gdf = gdf.to_crs(3035).query("Fueltype == 'Nuclear'")

Then, let’s have a look at both datasets at once. We want to find out which points (i.e. the power plants) lie within which shape (i.e. the NUTS1 regions).

fig = plt.figure(figsize=(7, 7))

ax = plt.axes(projection=ccrs.epsg(3035))

nuts1.plot(ax=ax, edgecolor="black", facecolor="sandybrown")

gdf.to_crs(3035).plot(
    ax=ax, column="Fueltype", markersize=gdf.Capacity / 40, legend=True
)

ax.set_extent([-7, 24, 40, 62])

We can now apply the .sjoin() function. By default, .sjoin() looks for intersections and keeps the geometries of the left GeoDataFrame.

joined = gdf.sjoin(nuts1)

If we look at this new GeoDataFrame, we now have additional columns from the NUTS1 data:

joined.head(3)

We can now use these new columns for .groupby() operations on the capacities (as well as sorting, converting values to a suitable unit and renaming the index from NUTS identifiers to NUTS region names):

cap = joined.groupby("NUTS_ID").Capacity.sum().sort_values(ascending=False) / 1000  # GW
cap.rename(nuts1.NUTS_NAME).head(3)

NUTS_ID
Auvergne-Rhône-Alpes     15.927
Centre — Val de Loire    14.022
Grand Est                13.454
Name: Capacity, dtype: float64

With our intersection and left join, some NUTS1 regions disappear in cap because they do not have any power plants. We can fix this by reindexing cap to the full set of NUTS1 regions:

nuts1.index.difference(cap.index)

Index(['AL0', 'AT1', 'AT2', 'AT3', 'BE1', 'BG4', 'CY0', 'DE3', 'DE5', 'DE6',
       'DE8', 'DEC', 'DED', 'DEE', 'DEG', 'DK0', 'EE0', 'EL3', 'EL4', 'EL5',
       'EL6', 'ES1', 'ES2', 'ES3', 'ES6', 'ES7', 'FI1', 'FI2', 'FR1', 'FRC',
       'FRG', 'FRL', 'FRM', 'FRY', 'HR0', 'HU1', 'HU3', 'IE0', 'IS0', 'ITG',
       'LI0', 'LU0', 'LV0', 'ME0', 'MK0', 'MT0', 'NL1', 'NL3', 'NL4', 'NO0',
       'PL2', 'PL4', 'PL5', 'PL6', 'PL7', 'PL8', 'PL9', 'PT1', 'PT2', 'PT3',
       'RO1', 'RO3', 'RO4', 'RS1', 'RS2', 'SE2', 'SE3', 'TR1', 'TR2', 'TR3',
       'TR4', 'TR5', 'TR6', 'TR7', 'TR8', 'TR9', 'TRA', 'TRB', 'TRC', 'UKE',
       'UKF', 'UKG', 'UKI', 'UKJ', 'UKN'],
      dtype='object')

cap = cap.reindex(nuts1.index)
cap.head(3)

id
AT1   NaN
AT2   NaN
EL4   NaN
Name: Capacity, dtype: float64

Additionally, some power plants might be located outside of NUTS1 regions (e.g. the ones in Ukraine). Hence, the total capacity in cap is lower than the total capacity in gdf.

cap.sum() < gdf.Capacity.div(1e3).sum()

np.True_

Finally, we can plot the total hard coal generation capacity per NUTS1 region on a map. For regions without any coal power plants we can use the missing_kwds parameter to set a specific appearance (here red hatching).

fig = plt.figure(figsize=(7, 7))

ax = plt.axes(projection=ccrs.epsg(3035))

nuts1.plot(
    ax=ax,
    column=cap,
    legend=True,
    missing_kwds={
        "color": "lightgrey",
        "edgecolor": "red",
        "hatch": "///",
    },
    legend_kwds={"label": "Hard Coal Capacity [GW]", "shrink": 0.7},
)

ax.set_extent([-7, 24, 40, 62])

This concludes the geopandas and cartopy tutorial!

Exercises¶

Take the following code snippet as a starting point. It loads the NUTS regions of Europe, the power plant dataset, and a shapefile for the Danish Natura2000 natural protection areas.

import cartopy
import pandas as pd
import geopandas as gpd
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

url = "https://tubcloud.tu-berlin.de/s/RHZJrN8Dnfn26nr/download/NUTS_RG_10M_2021_4326.geojson"
nuts = gpd.read_file(url)

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
geometry = gpd.points_from_xy(df["lon"], df["lat"])
ppl = gpd.GeoDataFrame(df, geometry=geometry, crs=4326)

url = "https://tubcloud.tu-berlin.de/s/mEpdmgBtmMbyjAr/download/Natura2000_end2021-DK.gpkg"
natura = gpd.read_file(url)

Task 1: Identify the coordinate reference system of the natura GeoDataFrame.

Task 2: Plot the natura GeoDataFrame on a map without transforming its CRS. Use cartopy for setting the projection of the figure and add coastlines and borders.

Task 3: Identify the name of the largest protected area in the natura GeoDataFrame.

Task 4: What is the total protection area in square kilometers.

Task 5: The natura GeoDataFrame has a column SITETYPE that indicates the type of protected area. Calculate the total area for each type of protected area (again in square kilometers).

Task 6: By how much (in percent) would the total area of protected areas increase if a buffer of 1 km around each protected area were also protected?

Task 7: List the power plants that are located within protected areas. How many power plants are located within protected areas? Use the .sjoin() function. Check the result by plotting these power plants on top of the protected areas.

Task 8 (advanced): What fraction of the natural protection area is located offshore? Use set operations with the .overlay() function and the NUTS regions GeoDataFrame.

Why not pandas?¶

Why geopandas?¶

Projections with cartopy¶

Reprojecting a GeoDataFrame¶

Reading and Writing Files¶

Buffers and Areas¶

Dissolve¶

Spatial Joins¶

Exercises¶

Why not `pandas`?¶

Why `geopandas`?¶

Projections with `cartopy`¶

Reprojecting a `GeoDataFrame`¶