Basics pandas
#
Note
If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work download the .ipynb
file and import it in Google Colab.
Then install pandas
and numpy
by executing the following command in a Jupyter cell at the top of the notebook.
!pip install -q pandas numpy
Pandas is a an open source library providing tabular data structures and data analysis tools. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Note
Documentation for this package is available at https://pandas.pydata.org/docs/.
Package Imports#
This will be our first experience with importing a package.
Usually we import pandas
with the alias pd
.
We might also need numpy
, Python’s main library for numerical computations.
import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.offline as py
pd.options.plotting.backend = "plotly"
Series#
A Series represents a one-dimensional array of data. It is similar to a dictionary consisting of an index and values, but has more functions.
Note
Example data on Germany’s final six nuclear power plants is from Wikipedia.
dictionary = {
"Neckarwestheim": 1269,
"Isar 2": 1365,
"Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
dtype: int64
Arithmetic operations can be applied to the whole pd.Series
.
s**0.5
Neckarwestheim 35.623026
Isar 2 36.945906
Emsland 35.916570
dtype: float64
We can access the underlying index object if we need to:
s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')
We can get values back out using the index via the .loc
attribute
s.loc["Isar 2"]
np.int64(1365)
We can pass a list or array to loc to get multiple rows back:
s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim 1269
Emsland 1290
dtype: int64
DataFrame#
Series are limited to a single column. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index.
data = {
"capacity": [1269, 1365, 1290], # MW
"type": ["PWR", "PWR", "PWR"],
"start_year": [1989, 1988, 1988],
"end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
capacity | type | start_year | end_year | |
---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN |
Isar 2 | 1365 | PWR | 1988 | NaN |
Emsland | 1290 | PWR | 1988 | NaN |
A wide range of statistical functions are available on both Series and DataFrames.
df.min()
capacity 1269
type PWR
start_year 1988
end_year NaN
dtype: object
We can get a single column as a Series using python’s getitem syntax on the DataFrame object.
df["capacity"]
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
Name: capacity, dtype: int64
Indexing works very similar to series
df.loc["Emsland"]
capacity 1290
type PWR
start_year 1988
end_year NaN
Name: Emsland, dtype: object
But we can also specify the column(s) and row(s) we want to access
df.at["Emsland", "start_year"]
np.int64(1988)
We can also add new columns to the DataFrame:
df["reduced_capacity"] = df.capacity * 0.8
df
capacity | type | start_year | end_year | reduced_capacity | |
---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | 1015.2 |
Isar 2 | 1365 | PWR | 1988 | NaN | 1092.0 |
Emsland | 1290 | PWR | 1988 | NaN | 1032.0 |
Sorting Data#
We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values
df.sort_index()
capacity | type | start_year | end_year | reduced_capacity | |
---|---|---|---|---|---|
Emsland | 1290 | PWR | 1988 | NaN | 1032.0 |
Isar 2 | 1365 | PWR | 1988 | NaN | 1092.0 |
Neckarwestheim | 1269 | PWR | 1989 | NaN | 1015.2 |
df.sort_values(by="capacity", ascending=True)
capacity | type | start_year | end_year | reduced_capacity | |
---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | 1015.2 |
Emsland | 1290 | PWR | 1988 | NaN | 1032.0 |
Isar 2 | 1365 | PWR | 1988 | NaN | 1092.0 |
Filtering Data#
We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.
df.capacity > 1300
Neckarwestheim False
Isar 2 True
Emsland False
Name: capacity, dtype: bool
df[df.capacity > 1300]
capacity | type | start_year | end_year | reduced_capacity | |
---|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN | 1092.0 |
Modifying Values#
In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc
or .iloc
df.loc["Isar 2", "capacity"] = 1366
df
capacity | type | start_year | end_year | reduced_capacity | |
---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | 1015.2 |
Isar 2 | 1366 | PWR | 1988 | NaN | 1092.0 |
Emsland | 1290 | PWR | 1988 | NaN | 1032.0 |
Time Series#
Time indexes are great when handling time-dependent data.
Let’s first read some time series data, using the pd.read_csv()
function, which takes a local file path or a link to an online resource.
The example data hourly time series for Germany in 2015 for:
electricity demand from OPSD in GW
onshore wind capacity factors from renewables.ninja in per-unit of installed capacity
offshore wind capacity factors from renewables.ninja in per-unit of installed capacity
solar PV capacity factors from renewables.ninja in per-unit of installed capacity
electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform
url = (
"https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
)
ts = pd.read_csv(url, index_col=0, parse_dates=True)
ts.index[:5]
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
'2015-01-01 02:00:00', '2015-01-01 03:00:00',
'2015-01-01 04:00:00'],
dtype='datetime64[ns]', freq=None)
We can use Python’s slicing notation inside .loc
to select a date range, and then use the built-in plotting feature of Pandas:
ts.loc["2015-01-01":"2015-03-01", "onwind"].plot()
A common operation is to change the resolution of a dataset by resampling in time, which Pandas exposes through the resample function.
Note
The resample periods are specified using pandas offset index syntax.
ts["onwind"].resample("D").mean().plot()
Groupby Functionality#
DataFrame
objects have a groupby
method. The simplest way to think about it is that you pass another series, whose values are used to split the original object into different groups.
Here’s an example which retrieves the total generation capacity per country:
fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
Name | Fueltype | Technology | Set | Country | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
0 | Kernkraftwerk Emsland | Nuclear | Steam Turbine | PP | Germany | 1336.0 | 0.33 | 1988.0 | 1988.0 | 2023.0 |
1 | Brokdorf | Nuclear | Steam Turbine | PP | Germany | 1410.0 | 0.33 | 1986.0 | 1986.0 | 2021.0 |
2 | Borssele | Hard Coal | Steam Turbine | PP | Netherlands | 485.0 | NaN | 1973.0 | NaN | 2034.0 |
3 | Gemeinschaftskernkraftwerk Neckarwestheim | Nuclear | Steam Turbine | PP | Germany | 1310.0 | 0.33 | 1976.0 | 1989.0 | 2023.0 |
4 | Isar | Nuclear | Steam Turbine | PP | Germany | 1410.0 | 0.33 | 1979.0 | 1988.0 | 2023.0 |
grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania 2370.400000
Austria 24643.200368
Belgium 21443.151009
Bosnia and Herzegovina 4827.195964
Bulgaria 15699.186363
Name: Capacity, dtype: float64
Let’s break apart this operation a bit. The workflow with groupby
can be divided into three general steps:
Split: Partition the data into different groups based on some criterion.
Apply: Do some caclulation within each group, e.g. minimum, maximum, sums.
Combine: Put the results back together into a single object.

Grouping is not only possible on a single columns, but also on multiple columns. For instance,
we might want to group the capacities by country and fuel type. To achieve this, we pass a list of functions to the groupby
functions.
capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum().unstack()
capacities
Fueltype | Biogas | Geothermal | Hard Coal | Hydro | Lignite | Natural Gas | Nuclear | Oil | Other | Solar | Solid Biomass | Waste | Wind |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||
Albania | NaN | NaN | NaN | 1743.900000 | NaN | NaN | NaN | NaN | 98.00000 | 294.500000 | NaN | NaN | 234.0000 |
Austria | NaN | NaN | 1331.400000 | 14205.500368 | 522.000000 | 4952.000000 | NaN | NaN | NaN | 388.300000 | NaN | NaN | 3244.0000 |
Belgium | NaN | NaN | 3316.000000 | 1379.751009 | NaN | 5251.300000 | 5931.0 | 131.60000 | NaN | 484.100000 | 498.000 | 259.700 | 4191.7000 |
Bosnia and Herzegovina | NaN | NaN | NaN | 2213.500000 | 1911.595964 | NaN | NaN | NaN | NaN | 267.100000 | NaN | NaN | 435.0000 |
Bulgaria | NaN | NaN | 1909.191813 | 2899.540000 | 4807.354550 | 990.000000 | 2080.0 | NaN | NaN | 2405.100000 | NaN | NaN | 608.0000 |
Croatia | NaN | NaN | 308.848880 | 2129.772189 | NaN | 1440.800000 | NaN | NaN | NaN | 96.300000 | NaN | NaN | 1140.0000 |
Czechia | NaN | NaN | 1225.227500 | 1888.219836 | 7297.666121 | 1659.000000 | 4164.0 | NaN | NaN | 1293.600000 | 30.000 | NaN | 140.0000 |
Denmark | NaN | NaN | 5658.885609 | NaN | NaN | 1813.200000 | NaN | 664.00000 | 70.00000 | 2256.700000 | 2542.000 | NaN | 5670.8000 |
Estonia | NaN | NaN | NaN | NaN | NaN | 389.000000 | NaN | 1841.00000 | NaN | 446.200000 | NaN | NaN | 713.4000 |
Finland | 1518.740000 | NaN | 3930.809044 | 3201.320000 | 965.000000 | 3134.800000 | 4362.0 | 1005.00000 | 143.10000 | 162.200000 | 2362.000 | NaN | 9681.1000 |
France | NaN | NaN | 6918.344374 | 21054.763968 | 977.000000 | 9292.200000 | 64682.0 | 5130.00000 | 302.00000 | 11192.900000 | 348.000 | NaN | 26603.9000 |
Germany | 2265.320979 | NaN | 30631.473898 | 12716.214837 | 23056.815081 | 39489.076894 | 23093.0 | 6850.84807 | 3156.36805 | 36716.919043 | 1071.699 | 2584.851 | 84571.1625 |
Greece | NaN | NaN | NaN | 3368.100000 | 4786.794109 | 9562.000000 | NaN | NaN | NaN | 4166.300000 | NaN | 24.000 | 5155.0000 |
Hungary | NaN | NaN | 745.484239 | 47.700000 | 815.171651 | 2859.400000 | 1898.9 | 170.00000 | NaN | 2582.700000 | 173.600 | NaN | 319.0000 |
Ireland | NaN | NaN | 843.572315 | 609.000000 | 253.000000 | 5618.000000 | NaN | 1018.00000 | NaN | 1226.300000 | 68.000 | NaN | 4780.2000 |
Italy | NaN | 806.0 | 10313.926461 | 20027.240000 | NaN | 51339.520000 | 1472.0 | 1016.00000 | 173.00000 | 5521.100000 | 770.000 | NaN | 10049.6000 |
Kosovo | NaN | NaN | NaN | 65.200000 | 1187.716161 | NaN | NaN | NaN | NaN | 161.100000 | NaN | NaN | 135.0000 |
Latvia | NaN | NaN | NaN | 1536.100000 | NaN | 1087.000000 | NaN | NaN | NaN | 44.000000 | NaN | NaN | 100.0000 |
Lithuania | NaN | NaN | NaN | 1001.000000 | NaN | 1508.000000 | 2600.0 | NaN | NaN | 299.600000 | 70.000 | NaN | 1195.0000 |
Luxembourg | NaN | NaN | NaN | 1294.000000 | NaN | NaN | NaN | NaN | NaN | 54.100000 | NaN | NaN | 148.1000 |
Moldova | NaN | NaN | 2520.000000 | 48.000000 | NaN | 968.000000 | NaN | NaN | NaN | 24.800000 | NaN | NaN | NaN |
Montenegro | NaN | NaN | NaN | 657.321000 | 207.481472 | NaN | NaN | NaN | NaN | 3.200000 | NaN | NaN | 118.0000 |
Netherlands | 36.000000 | NaN | 9063.994952 | NaN | NaN | 16090.600000 | 60.0 | NaN | 548.00000 | 4839.600000 | 1307.000 | NaN | 11069.1000 |
North Macedonia | NaN | NaN | NaN | 562.600000 | 737.711901 | 418.000000 | NaN | NaN | NaN | 296.900000 | NaN | NaN | 73.0000 |
Norway | NaN | NaN | NaN | 32539.390000 | NaN | 968.100000 | NaN | NaN | NaN | 1.400000 | NaN | NaN | 6300.8000 |
Poland | NaN | NaN | 22585.482921 | 2877.906738 | 8786.812855 | 6516.500000 | NaN | 55.00000 | NaN | 2712.100000 | 523.000 | NaN | 10271.9000 |
Portugal | NaN | NaN | 1858.761004 | 8270.200000 | NaN | 4759.500000 | NaN | NaN | NaN | 3509.900000 | 210.000 | NaN | 5362.6000 |
Romania | NaN | NaN | 1184.688989 | 7971.390000 | 5197.253348 | 4831.500000 | 1300.0 | NaN | NaN | 2026.700000 | 30.000 | NaN | 5505.7000 |
Serbia | NaN | NaN | NaN | 3112.500000 | 5357.415458 | 554.000000 | NaN | NaN | NaN | 39.600000 | 30.000 | NaN | 772.1000 |
Slovakia | NaN | NaN | 592.673970 | 2418.480000 | 317.863362 | 1004.870000 | 1940.0 | NaN | NaN | 449.200000 | NaN | NaN | NaN |
Slovenia | NaN | NaN | 114.320183 | 1274.025000 | 1130.422184 | 600.000000 | 696.0 | NaN | NaN | 30.000000 | NaN | NaN | NaN |
Spain | NaN | NaN | 11904.878478 | 26069.861248 | 1831.400000 | 28394.244000 | 7733.2 | 1854.37100 | NaN | 36998.200000 | 563.000 | 388.054 | 34371.6500 |
Sweden | NaN | NaN | 291.000000 | 14273.686625 | NaN | 2358.000000 | 9859.0 | 1685.00000 | NaN | 281.800000 | 2432.600 | NaN | 16958.8000 |
Switzerland | NaN | NaN | NaN | 20115.040800 | NaN | 55.000000 | 3355.0 | NaN | NaN | 96.800000 | NaN | NaN | 55.0000 |
Ukraine | NaN | NaN | 24474.000000 | 6590.000000 | NaN | 4687.900000 | 17635.0 | NaN | NaN | 5628.700000 | NaN | NaN | 461.4000 |
United Kingdom | 31.000000 | NaN | 33823.617061 | 4576.175000 | NaN | 36366.400000 | 19181.0 | 100.00000 | 55.00000 | 11668.600000 | 4919.000 | 288.900 | 38670.3000 |
Exercises#
Task 1: Provide a list of unique fuel types included in the power plants dataset.
Task 2: Filter the dataset by power plants with the fuel type “Hard Coal”. How many hard coal power plants are there?
Task 3: Identify the three largest coal power plants. In which countries are they located? When were they built?
Task 4: What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?
Task 5: In the time series provided, calculate the annual average capacity factors of wind and solar.
Task 6: In the time series provided, calculate and plot the monthly average electricity price.