Basics pandas#

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work download the .ipynb file and import it in Google Colab. Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.

!pip install -q pandas numpy

Pandas is a an open source library providing tabular data structures and data analysis tools. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Note

Documentation for this package is available at https://pandas.pydata.org/docs/.

Package Imports#

This will be our first experience with importing a package.

Usually we import pandas with the alias pd.

We might also need numpy, Python’s main library for numerical computations.

import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.offline as py
pd.options.plotting.backend = "plotly"

Series#

A Series represents a one-dimensional array of data. It is similar to a dictionary consisting of an index and values, but has more functions.

Note

Example data on Germany’s final six nuclear power plants is from Wikipedia.

dictionary = {
    "Neckarwestheim": 1269,
    "Isar 2": 1365,
    "Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64

Arithmetic operations can be applied to the whole pd.Series.

s**0.5
Neckarwestheim    35.623026
Isar 2            36.945906
Emsland           35.916570
dtype: float64

We can access the underlying index object if we need to:

s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')

We can get values back out using the index via the .loc attribute

s.loc["Isar 2"]
np.int64(1365)

We can pass a list or array to loc to get multiple rows back:

s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim    1269
Emsland           1290
dtype: int64

DataFrame#

Series are limited to a single column. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index.

data = {
    "capacity": [1269, 1365, 1290],  # MW
    "type": ["PWR", "PWR", "PWR"],
    "start_year": [1989, 1988, 1988],
    "end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

A wide range of statistical functions are available on both Series and DataFrames.

df.min()
capacity      1269
type           PWR
start_year    1988
end_year       NaN
dtype: object

We can get a single column as a Series using python’s getitem syntax on the DataFrame object.

df["capacity"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

Indexing works very similar to series

df.loc["Emsland"]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object

But we can also specify the column(s) and row(s) we want to access

df.at["Emsland", "start_year"]
np.int64(1988)

We can also add new columns to the DataFrame:

df["reduced_capacity"] = df.capacity * 0.8
df
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Isar 2 1365 PWR 1988 NaN 1092.0
Emsland 1290 PWR 1988 NaN 1032.0

Sorting Data#

We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values

df.sort_index()
capacity type start_year end_year reduced_capacity
Emsland 1290 PWR 1988 NaN 1032.0
Isar 2 1365 PWR 1988 NaN 1092.0
Neckarwestheim 1269 PWR 1989 NaN 1015.2
df.sort_values(by="capacity", ascending=True)
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Emsland 1290 PWR 1988 NaN 1032.0
Isar 2 1365 PWR 1988 NaN 1092.0

Filtering Data#

We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.

df.capacity > 1300
Neckarwestheim    False
Isar 2             True
Emsland           False
Name: capacity, dtype: bool
df[df.capacity > 1300]
capacity type start_year end_year reduced_capacity
Isar 2 1365 PWR 1988 NaN 1092.0

Modifying Values#

In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc or .iloc

df.loc["Isar 2", "capacity"] = 1366
df
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Isar 2 1366 PWR 1988 NaN 1092.0
Emsland 1290 PWR 1988 NaN 1032.0

Time Series#

Time indexes are great when handling time-dependent data.

Let’s first read some time series data, using the pd.read_csv() function, which takes a local file path or a link to an online resource.

The example data hourly time series for Germany in 2015 for:

  1. electricity demand from OPSD in GW

  2. onshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  3. offshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  4. solar PV capacity factors from renewables.ninja in per-unit of installed capacity

  5. electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform

url = (
    "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
)
ts = pd.read_csv(url, index_col=0, parse_dates=True)
ts.index[:5]
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
               '2015-01-01 02:00:00', '2015-01-01 03:00:00',
               '2015-01-01 04:00:00'],
              dtype='datetime64[ns]', freq=None)

We can use Python’s slicing notation inside .loc to select a date range, and then use the built-in plotting feature of Pandas:

ts.loc["2015-01-01":"2015-03-01", "onwind"].plot()

A common operation is to change the resolution of a dataset by resampling in time, which Pandas exposes through the resample function.

Note

The resample periods are specified using pandas offset index syntax.

ts["onwind"].resample("D").mean().plot()

Groupby Functionality#

DataFrame objects have a groupby method. The simplest way to think about it is that you pass another series, whose values are used to split the original object into different groups.

Here’s an example which retrieves the total generation capacity per country:

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut
id
0 Kernkraftwerk Emsland Nuclear Steam Turbine PP Germany 1336.0 0.33 1988.0 1988.0 2023.0
1 Brokdorf Nuclear Steam Turbine PP Germany 1410.0 0.33 1986.0 1986.0 2021.0
2 Borssele Hard Coal Steam Turbine PP Netherlands 485.0 NaN 1973.0 NaN 2034.0
3 Gemeinschaftskernkraftwerk Neckarwestheim Nuclear Steam Turbine PP Germany 1310.0 0.33 1976.0 1989.0 2023.0
4 Isar Nuclear Steam Turbine PP Germany 1410.0 0.33 1979.0 1988.0 2023.0
grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania                    2370.400000
Austria                   24643.200368
Belgium                   21443.151009
Bosnia and Herzegovina     4827.195964
Bulgaria                  15699.186363
Name: Capacity, dtype: float64

Let’s break apart this operation a bit. The workflow with groupby can be divided into three general steps:

  1. Split: Partition the data into different groups based on some criterion.

  2. Apply: Do some caclulation within each group, e.g. minimum, maximum, sums.

  3. Combine: Put the results back together into a single object.

Grouping is not only possible on a single columns, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of functions to the groupby functions.

capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum().unstack()
capacities
Fueltype Biogas Geothermal Hard Coal Hydro Lignite Natural Gas Nuclear Oil Other Solar Solid Biomass Waste Wind
Country
Albania NaN NaN NaN 1743.900000 NaN NaN NaN NaN 98.00000 294.500000 NaN NaN 234.0000
Austria NaN NaN 1331.400000 14205.500368 522.000000 4952.000000 NaN NaN NaN 388.300000 NaN NaN 3244.0000
Belgium NaN NaN 3316.000000 1379.751009 NaN 5251.300000 5931.0 131.60000 NaN 484.100000 498.000 259.700 4191.7000
Bosnia and Herzegovina NaN NaN NaN 2213.500000 1911.595964 NaN NaN NaN NaN 267.100000 NaN NaN 435.0000
Bulgaria NaN NaN 1909.191813 2899.540000 4807.354550 990.000000 2080.0 NaN NaN 2405.100000 NaN NaN 608.0000
Croatia NaN NaN 308.848880 2129.772189 NaN 1440.800000 NaN NaN NaN 96.300000 NaN NaN 1140.0000
Czechia NaN NaN 1225.227500 1888.219836 7297.666121 1659.000000 4164.0 NaN NaN 1293.600000 30.000 NaN 140.0000
Denmark NaN NaN 5658.885609 NaN NaN 1813.200000 NaN 664.00000 70.00000 2256.700000 2542.000 NaN 5670.8000
Estonia NaN NaN NaN NaN NaN 389.000000 NaN 1841.00000 NaN 446.200000 NaN NaN 713.4000
Finland 1518.740000 NaN 3930.809044 3201.320000 965.000000 3134.800000 4362.0 1005.00000 143.10000 162.200000 2362.000 NaN 9681.1000
France NaN NaN 6918.344374 21054.763968 977.000000 9292.200000 64682.0 5130.00000 302.00000 11192.900000 348.000 NaN 26603.9000
Germany 2265.320979 NaN 30631.473898 12716.214837 23056.815081 39489.076894 23093.0 6850.84807 3156.36805 36716.919043 1071.699 2584.851 84571.1625
Greece NaN NaN NaN 3368.100000 4786.794109 9562.000000 NaN NaN NaN 4166.300000 NaN 24.000 5155.0000
Hungary NaN NaN 745.484239 47.700000 815.171651 2859.400000 1898.9 170.00000 NaN 2582.700000 173.600 NaN 319.0000
Ireland NaN NaN 843.572315 609.000000 253.000000 5618.000000 NaN 1018.00000 NaN 1226.300000 68.000 NaN 4780.2000
Italy NaN 806.0 10313.926461 20027.240000 NaN 51339.520000 1472.0 1016.00000 173.00000 5521.100000 770.000 NaN 10049.6000
Kosovo NaN NaN NaN 65.200000 1187.716161 NaN NaN NaN NaN 161.100000 NaN NaN 135.0000
Latvia NaN NaN NaN 1536.100000 NaN 1087.000000 NaN NaN NaN 44.000000 NaN NaN 100.0000
Lithuania NaN NaN NaN 1001.000000 NaN 1508.000000 2600.0 NaN NaN 299.600000 70.000 NaN 1195.0000
Luxembourg NaN NaN NaN 1294.000000 NaN NaN NaN NaN NaN 54.100000 NaN NaN 148.1000
Moldova NaN NaN 2520.000000 48.000000 NaN 968.000000 NaN NaN NaN 24.800000 NaN NaN NaN
Montenegro NaN NaN NaN 657.321000 207.481472 NaN NaN NaN NaN 3.200000 NaN NaN 118.0000
Netherlands 36.000000 NaN 9063.994952 NaN NaN 16090.600000 60.0 NaN 548.00000 4839.600000 1307.000 NaN 11069.1000
North Macedonia NaN NaN NaN 562.600000 737.711901 418.000000 NaN NaN NaN 296.900000 NaN NaN 73.0000
Norway NaN NaN NaN 32539.390000 NaN 968.100000 NaN NaN NaN 1.400000 NaN NaN 6300.8000
Poland NaN NaN 22585.482921 2877.906738 8786.812855 6516.500000 NaN 55.00000 NaN 2712.100000 523.000 NaN 10271.9000
Portugal NaN NaN 1858.761004 8270.200000 NaN 4759.500000 NaN NaN NaN 3509.900000 210.000 NaN 5362.6000
Romania NaN NaN 1184.688989 7971.390000 5197.253348 4831.500000 1300.0 NaN NaN 2026.700000 30.000 NaN 5505.7000
Serbia NaN NaN NaN 3112.500000 5357.415458 554.000000 NaN NaN NaN 39.600000 30.000 NaN 772.1000
Slovakia NaN NaN 592.673970 2418.480000 317.863362 1004.870000 1940.0 NaN NaN 449.200000 NaN NaN NaN
Slovenia NaN NaN 114.320183 1274.025000 1130.422184 600.000000 696.0 NaN NaN 30.000000 NaN NaN NaN
Spain NaN NaN 11904.878478 26069.861248 1831.400000 28394.244000 7733.2 1854.37100 NaN 36998.200000 563.000 388.054 34371.6500
Sweden NaN NaN 291.000000 14273.686625 NaN 2358.000000 9859.0 1685.00000 NaN 281.800000 2432.600 NaN 16958.8000
Switzerland NaN NaN NaN 20115.040800 NaN 55.000000 3355.0 NaN NaN 96.800000 NaN NaN 55.0000
Ukraine NaN NaN 24474.000000 6590.000000 NaN 4687.900000 17635.0 NaN NaN 5628.700000 NaN NaN 461.4000
United Kingdom 31.000000 NaN 33823.617061 4576.175000 NaN 36366.400000 19181.0 100.00000 55.00000 11668.600000 4919.000 288.900 38670.3000

Exercises#

Task 1: Provide a list of unique fuel types included in the power plants dataset.

Hide code cell content

df.Fueltype.unique()
array(['Nuclear', 'Hard Coal', 'Hydro', 'Lignite', 'Natural Gas', 'Oil',
       'Solid Biomass', 'Wind', 'Solar', 'Other', 'Biogas', 'Waste',
       'Geothermal'], dtype=object)

Task 2: Filter the dataset by power plants with the fuel type “Hard Coal”. How many hard coal power plants are there?

Hide code cell content

coal = df.loc[df.Fueltype == "Hard Coal"]
coal
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
2 Borssele Hard Coal Steam Turbine PP Netherlands 485.000000 NaN 1973.0 NaN 2034.0 51.433200 3.716000 NaN 0.0 0.0 0.0 {'49W000000000054X'} {'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49...
98 Didcot Hard Coal CCGT PP United Kingdom 1490.000000 0.550000 1970.0 1998.0 2013.0 51.622300 -1.260800 NaN 0.0 0.0 0.0 {'48WSTN0000DIDCBC'} {'BEYONDCOAL': {'BEYOND-UK-22'}, 'ENTSOE': {'4...
129 Mellach Hard Coal Steam Turbine CHP Austria 200.000000 NaN 1986.0 1986.0 2020.0 46.911700 15.488300 NaN 0.0 0.0 0.0 {'14W-WML-KW-----0'} {'BEYONDCOAL': {'BEYOND-AT-11'}, 'ENTSOE': {'1...
150 Emile Huchet Hard Coal CCGT PP France 596.493211 NaN 1958.0 2010.0 2022.0 49.152500 6.698100 NaN 0.0 0.0 0.0 {'17W100P100P0344D', '17W100P100P0345B'} {'BEYONDCOAL': {'BEYOND-FR-67'}, 'ENTSOE': {'1...
151 Amercoeur Hard Coal CCGT PP Belgium 451.000000 0.187765 1968.0 NaN 2009.0 50.431000 4.395500 NaN 0.0 0.0 0.0 {'22WAMERCO000010Y'} {'BEYONDCOAL': {'BEYOND-BE-27'}, 'ENTSOE': {'2...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29779 St Hard Coal NaN CHP Germany 21.645000 NaN 1982.0 NaN NaN 49.976593 9.068953 NaN 0.0 0.0 0.0 {nan} {'MASTR': {'MASTR-SEE971943692655'}}
29804 Uer Hard Coal NaN CHP Germany 15.200000 NaN 1964.0 NaN NaN 51.368132 6.662350 NaN 0.0 0.0 0.0 {nan} {'MASTR': {'MASTR-SEE988421065542'}}
29813 Walheim Hard Coal NaN PP Germany 244.000000 NaN 1964.0 NaN NaN 49.017585 9.157690 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE937157344278', 'MASTR-SEE...
29830 Wd Ffw Hard Coal NaN CHP Germany 123.000000 NaN 1990.0 NaN NaN 50.099000 8.653000 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE915289541482', 'MASTR-SEE...
29835 West Hard Coal NaN CHP Germany 277.000000 NaN 1985.0 NaN NaN 52.442456 10.762681 NaN 0.0 0.0 0.0 {nan, nan} {'MASTR': {'MASTR-SEE917432813484', 'MASTR-SEE...

332 rows × 18 columns

Task 3: Identify the three largest coal power plants. In which countries are they located? When were they built?

Hide code cell content

coal.loc[coal.Capacity.nlargest(3).index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
194 Kozienice Hard Coal Steam Turbine PP Poland 3682.216205 NaN 1972.0 NaN 2042.0 51.6647 21.4667 NaN 0.0 0.0 0.0 {'19W000000000104I', '19W000000000095U'} {'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1...
3652 Vuglegirska Hard Coal CCGT PP Ukraine 3600.000000 NaN 1972.0 NaN NaN 48.4652 38.2027 NaN 0.0 0.0 0.0 {nan} {'GPD': {'WRI1005107'}, 'GEO': {'GEO-43001'}}
767 Opole Hard Coal Steam Turbine PP Poland 3071.893939 NaN 1993.0 NaN 2020.0 50.7518 17.8820 NaN 0.0 0.0 0.0 {'19W0000000001292'} {'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1...

Task 4: What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?

Hide code cell content

2024 - df.groupby("Fueltype").DateIn.mean().sort_values()
Fueltype
Hard Coal        52.092593
Hydro            51.232471
Nuclear          48.214953
Lignite          47.184211
Other            31.543103
Waste            26.845588
Geothermal       23.857143
Oil              23.410138
Solid Biomass    22.516588
Natural Gas      22.081505
Wind             14.483789
Biogas           11.415730
Solar             8.584491
Name: DateIn, dtype: float64

Task 5: In the time series provided, calculate the annual average capacity factors of wind and solar.

Hide code cell content

ts.mean()
load       54.736992
onwind      0.205556
offwind     0.362993
solar       0.122621
prices     31.835717
dtype: float64

Task 6: In the time series provided, calculate and plot the monthly average electricity price.

Hide code cell content

ts["prices"].resample("ME").mean().plot()