Introduction to pandas#

Note

This material is mostly adapted from the following resources:

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job. Among its main features are:

  • A fast and efficient DataFrame object for data manipulation with indexing;

  • Tools for reading and writing data: CSV and text files, Excel, SQL;

  • Intelligent data alignment and integrated handling of missing data;

  • Flexible reshaping and pivoting of data sets;

  • Intelligent label-based slicing, indexing, and subsetting of large data sets;

  • High performance aggregating, merging, joining or transforming data;

  • Hierarchical indexing provides an intuitive way of working with high-dimensional data;

  • Time series functionality: date-based indexing, frequency conversion, moving windows, date shifting and lagging.

Note

Documentation for this package is available at https://pandas.pydata.org/docs/.

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work, download the .ipynb file and import it into Google Colab.

Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.

!pip install pandas numpy
import pandas as pd
import numpy as np

Pandas Data Structures: Series#

A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an index. The index contains the labels that we use to access the data.

There are many ways to create a Series. We will just show a few. The core constructor is pd.Series().

(Data are from Wikipedia’s List of power stations in Germany.)

names = ["Neckarwestheim", "Isar 2", "Emsland"]
values = [1269, 1365, 1290]
s = pd.Series(values, index=names)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64
dictionary = {
    "Neckarwestheim": 1269,
    "Isar 2": 1365,
    "Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64

Arithmetic operations and most numpy functions can be applied to pd.Series. An important point is that a Series keeps its index during such operations.

np.log(s) / s**0.5
Neckarwestheim    0.200600
Isar 2            0.195391
Emsland           0.199418
dtype: float64

We can access the underlying index object if we need to:

s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')

We can get values back out using the index via the .loc attribute

s.loc["Isar 2"]
1365

Or by raw position using .iloc

s.iloc[2]
1290

We can pass a list or array to loc to get multiple rows back:

s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim    1269
Emsland           1290
dtype: int64

And we can even use slice notation. Note that, in contrast to positional slicing with .iloc, label-based slices with .loc include the last element:

s.loc["Neckarwestheim":"Emsland"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64
s.iloc[:2]
Neckarwestheim    1269
Isar 2            1365
dtype: int64

If we need to, we can always get the raw data back out as well

s.values  # a numpy array
array([1269, 1365, 1290])
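
In recent pandas versions, the recommended way to extract the underlying array is the .to_numpy() method, which behaves like .values here:

```python
import pandas as pd

s = pd.Series([1269, 1365, 1290], index=["Neckarwestheim", "Isar 2", "Emsland"])

# .to_numpy() is preferred over .values in recent pandas versions
arr = s.to_numpy()
print(arr)
```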

Pandas Data Structures: DataFrame#

There is a lot more to Series, but they are limited to a single column. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It’s a lot like a table in a spreadsheet.

The core constructor is pd.DataFrame()

Below we create a DataFrame.

# first we create a dictionary
data = {
    "capacity": [1269, 1365, 1290],  # MW
    "type": ["PWR", "PWR", "PWR"],
    "start_year": [1989, 1988, 1988],
    "end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

We can also switch columns and rows very easily.

df.T
Neckarwestheim Isar 2 Emsland
capacity 1269 1365 1290
type PWR PWR PWR
start_year 1989 1988 1988
end_year NaN NaN NaN

A wide range of statistical functions are available on both Series and DataFrames.

df.min()
capacity      1269
type           PWR
start_year    1988
end_year       NaN
dtype: object
df.mean(numeric_only=True)
capacity      1308.000000
start_year    1988.333333
end_year              NaN
dtype: float64
df.std(numeric_only=True)
capacity      50.467812
start_year     0.577350
end_year            NaN
dtype: float64
df.describe()
capacity start_year end_year
count 3.000000 3.000000 0.0
mean 1308.000000 1988.333333 NaN
std 50.467812 0.577350 NaN
min 1269.000000 1988.000000 NaN
25% 1279.500000 1988.000000 NaN
50% 1290.000000 1988.000000 NaN
75% 1327.500000 1988.500000 NaN
max 1365.000000 1989.000000 NaN

We can get a single column as a Series using python’s getitem syntax on the DataFrame object.

df["capacity"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

…or using attribute syntax.

df.capacity
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

Indexing works very similarly to Series:

df.loc["Emsland"]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object
df.iloc[2]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object

But we can also specify the column(s) and row(s) we want to access

df.loc["Emsland", "start_year"]
1988
df.loc[["Emsland", "Neckarwestheim"], ["start_year", "end_year"]]
start_year end_year
Emsland 1988 NaN
Neckarwestheim 1989 NaN
If we make a calculation using columns from the DataFrame, it will keep the same index:

df.capacity * 0.8
Neckarwestheim    1015.2
Isar 2            1092.0
Emsland           1032.0
Name: capacity, dtype: float64

Which we can easily add as another column to the DataFrame:

df["reduced_capacity"] = df.capacity * 0.8
df
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Isar 2 1365 PWR 1988 NaN 1092.0
Emsland 1290 PWR 1988 NaN 1032.0

We can also remove columns or rows from a DataFrame:

df.drop("reduced_capacity", axis="columns")
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

We can update the variable df either by overwriting df or by passing inplace=True:

df.drop("reduced_capacity", axis="columns", inplace=True)

We can also drop columns with only NaN values

df.dropna(axis=1, how="all")
capacity type start_year
Neckarwestheim 1269 PWR 1989
Isar 2 1365 PWR 1988
Emsland 1290 PWR 1988

Or fill it up with default “fallback” data:

df.fillna(2023)
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 2023.0
Isar 2 1365 PWR 1988 2023.0
Emsland 1290 PWR 1988 2023.0

Say, we already have one value for end_year and want to fill up the missing data:

df.loc["Emsland", "end_year"] = 2023
# backward (upwards) fill from non-NaN values
df.bfill()
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 2023.0
Isar 2 1365 PWR 1988 2023.0
Emsland 1290 PWR 1988 2023.0

Sorting Data#

We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values

df.sort_index()
capacity type start_year end_year
Emsland 1290 PWR 1988 2023.0
Isar 2 1365 PWR 1988 NaN
Neckarwestheim 1269 PWR 1989 NaN
df.sort_values(by="capacity", ascending=False)
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 2023.0
Neckarwestheim 1269 PWR 1989 NaN


Merging Data#

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the documentation. Here we just give a few examples.

data = {
    "capacity": [1288, 1360, 1326],  # MW
    "type": ["BWR", "PWR", "PWR"],
    "start_year": [1985, 1985, 1986],
    "end_year": [2021, 2021, 2021],
    "x": [10.40, 9.41, 9.35],
    "y": [48.51, 52.03, 53.85],
}
df2 = pd.DataFrame(data, index=["Gundremmingen", "Grohnde", "Brokdorf"])
df2
capacity type start_year end_year x y
Gundremmingen 1288 BWR 1985 2021 10.40 48.51
Grohnde 1360 PWR 1985 2021 9.41 52.03
Brokdorf 1326 PWR 1986 2021 9.35 53.85

We can now add this additional data to the df object

df = pd.concat([df, df2])
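
While pd.concat stacks DataFrames on top of each other, the merge method joins two DataFrames on common columns, similar to an SQL join. A small sketch with a hypothetical lookup table mapping reactor type codes to long names:

```python
import pandas as pd

plants = pd.DataFrame(
    {"capacity": [1365, 1288], "type": ["PWR", "BWR"]},
    index=["Isar 2", "Gundremmingen"],
)
# hypothetical lookup table of reactor type names
reactor_names = pd.DataFrame(
    {
        "type": ["PWR", "BWR"],
        "long_name": ["Pressurized water reactor", "Boiling water reactor"],
    }
)

# merge joins the two tables on their common column "type"
merged = plants.merge(reactor_names, on="type")
print(merged)
```

Note that merging on columns discards the original index and returns a fresh integer index.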

Filtering Data#

We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.

df.capacity > 1300
Neckarwestheim    False
Isar 2             True
Emsland           False
Gundremmingen     False
Grohnde            True
Brokdorf           True
Name: capacity, dtype: bool
df[df.capacity > 1300]
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN
Grohnde 1360 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

We can also combine multiple conditions, but we need to wrap each condition in parentheses:

df[(df.capacity > 1300) & (df.start_year >= 1988)]
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN

Or we can make SQL-like queries:

df.query("start_year == 1988")
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN
Emsland 1290 PWR 1988 2023.0 NaN NaN
threshold = 1300
df.query("start_year == 1988 and capacity > @threshold")
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN

Modifying Values#

In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc or .iloc

df.loc["Isar 2", "x"] = 12.29
df.loc["Grohnde", "capacity"] += 1
df
capacity type start_year end_year x y
Neckarwestheim 1269 PWR 1989 NaN NaN NaN
Isar 2 1365 PWR 1988 NaN 12.29 NaN
Emsland 1290 PWR 1988 2023.0 NaN NaN
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85
operational = ["Neckarwestheim", "Isar 2", "Emsland"]
df.loc[operational, "y"] = [49.04, 48.61, 52.47]
df
capacity type start_year end_year x y
Neckarwestheim 1269 PWR 1989 NaN NaN 49.04
Isar 2 1365 PWR 1988 NaN 12.29 48.61
Emsland 1290 PWR 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

Applying Functions#

Sometimes it can be useful to apply a function to all values of a column/row. For instance, we might be interested in normalised capacities relative to the largest nuclear power plant:

df.capacity.apply(lambda x: x / df.capacity.max())
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64
df.capacity.map(lambda x: x / df.capacity.max())
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64

For simple functions, there’s often an easier alternative:

df.capacity / df.capacity.max()
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64

But .apply() and .map() often give you more flexibility.
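
For example, on a whole DataFrame, .apply() with axis=1 calls a function once per row, which lets us combine several columns at once. A sketch with a reduced, hypothetical version of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {"capacity": [1269, 1365], "start_year": [1989, 1988]},
    index=["Neckarwestheim", "Isar 2"],
)

# axis=1 passes each row as a Series; row.name is the index label
labels = df.apply(
    lambda row: f"{row.name}: {row.capacity} MW ({row.start_year})", axis=1
)
print(labels)
```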

Renaming Indices and Columns#

Sometimes it can be useful to rename columns:

df.rename(columns=dict(x="lon", y="lat"))
capacity type start_year end_year lon lat
Neckarwestheim 1269 PWR 1989 NaN NaN 49.04
Isar 2 1365 PWR 1988 NaN 12.29 48.61
Emsland 1290 PWR 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

Replacing Values#

Sometimes it can be useful to replace values:

df.replace({"PWR": "Pressurized water reactor"})
capacity type start_year end_year x y
Neckarwestheim 1269 Pressurized water reactor 1989 NaN NaN 49.04
Isar 2 1365 Pressurized water reactor 1988 NaN 12.29 48.61
Emsland 1290 Pressurized water reactor 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 Pressurized water reactor 1985 2021.0 9.41 52.03
Brokdorf 1326 Pressurized water reactor 1986 2021.0 9.35 53.85

Plotting#

DataFrames have all kinds of useful plotting built in. Note that we do not even have to import matplotlib explicitly for this (it does need to be installed, though).

df.plot(kind="scatter", x="start_year", y="capacity")
<Axes: xlabel='start_year', ylabel='capacity'>
df.capacity.plot.barh(color="orange")
<Axes: >
_images/477e3883ae7b91b31769776511e1ac37df2d97bba884ec13a1f7906f92606775.png

Time Indexes#

Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great when handling time-dependent data.

time = pd.date_range(start="2021-01-01", end="2023-01-01", freq="D")
values = np.sin(2 * np.pi * time.dayofyear / 365)
ts = pd.Series(values, index=time)
ts.plot()
<Axes: >

We can use Python’s slicing notation inside .loc to select a date range.

ts.loc["2021-01-01":"2021-07-01"].plot()
<Axes: >
ts.loc["2021-05"].plot()
<Axes: >

The pd.DatetimeIndex object has lots of useful attributes

ts.index.month
Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       ...
       12, 12, 12, 12, 12, 12, 12, 12, 12,  1],
      dtype='int32', length=731)
ts.index.day
Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       ...
       23, 24, 25, 26, 27, 28, 29, 30, 31,  1],
      dtype='int32', length=731)
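
These attributes can be combined with the boolean filtering shown earlier, e.g. to pick out all values falling in a particular month across both years:

```python
import pandas as pd
import numpy as np

time = pd.date_range(start="2021-01-01", end="2023-01-01", freq="D")
values = np.sin(2 * np.pi * time.dayofyear / 365)
ts = pd.Series(values, index=time)

# select all May values via a boolean mask on the index attribute
may = ts[ts.index.month == 5]
print(len(may))  # two years of data -> 62 days of May
```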

Another common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the .resample() method. The resample periods are specified using pandas offset aliases (e.g. "D" for day, "ME" for month-end).

Below we resample the dataset by taking the mean over each month.

ts.resample("ME").mean().head()
2021-01-31    0.268746
2021-02-28    0.698782
2021-03-31    0.949778
2021-04-30    0.959332
2021-05-31    0.709200
Freq: ME, dtype: float64
ts.resample("ME").mean().plot()
<Axes: >

Reading and Writing Files#

To read data into pandas, we can use for instance the pd.read_csv() function. This function is incredibly powerful and complex with a multitude of settings. You can use it to extract data from almost any text file.

The pd.read_csv() function can take a path to a local file as an input, or even a link to an online text file.
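
A sketch of a few commonly used parameters, using a small in-memory CSV (via io.StringIO) in place of a real file; the data and column names here are made up:

```python
import io
import pandas as pd

# a small in-memory CSV standing in for a file on disk or a URL
csv_text = """name;capacity;commissioned
Emsland;1290;1988-06-20
Isar 2;1365;1988-04-09
"""

# frequently used settings: a non-default separator, choosing the
# index column, and parsing date columns into proper datetime values
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    index_col="name",
    parse_dates=["commissioned"],
)
print(df)
```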

Let’s import a slightly larger dataset about the power plant fleet in Europe.

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut
id
0 Borssele Hard Coal Steam Turbine PP Netherlands 485.0 NaN 1973.0 NaN 2034.0
1 Flamanville Nuclear Steam Turbine PP France 2660.0 NaN 1985.0 NaN 2051.0
2 Emsland Nuclear Steam Turbine PP Germany 1336.0 0.33 1988.0 1988.0 2023.0
3 Kernkraftwerk Gosgen Nuclear Steam Turbine PP Switzerland 1020.0 NaN 1979.0 1979.0 2051.0
4 Oskarshamn Nuclear Steam Turbine PP Sweden 1400.0 NaN 1972.0 2006.0 2051.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 22818 entries, 0 to 23116
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Name                 22818 non-null  object 
 1   Fueltype             22818 non-null  object 
 2   Technology           22356 non-null  object 
 3   Set                  22818 non-null  object 
 4   Country              22818 non-null  object 
 5   Capacity             22818 non-null  float64
 6   Efficiency           542 non-null    float64
 7   DateIn               16980 non-null  float64
 8   DateRetrofit         2438 non-null   float64
 9   DateOut              683 non-null    float64
 10  lat                  22818 non-null  float64
 11  lon                  22818 non-null  float64
 12  Duration             615 non-null    float64
 13  Volume_Mm3           22818 non-null  float64
 14  DamHeight_m          22818 non-null  float64
 15  StorageCapacity_MWh  22818 non-null  float64
 16  EIC                  22818 non-null  object 
 17  projectID            22818 non-null  object 
dtypes: float64(11), object(7)
memory usage: 3.3+ MB
df.describe()
Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh
count 22818.000000 542.000000 16980.000000 2438.000000 683.000000 22818.000000 22818.000000 615.000000 22818.000000 22818.000000 22818.000000
mean 57.799302 0.480055 2004.922438 1988.515176 2019.980966 49.042717 8.798689 1295.789678 3.771375 8.372420 464.285028
std 225.901017 0.179453 48.109169 25.480213 12.323877 6.159855 9.112971 1542.021891 89.711399 52.640524 7561.917809
min 0.000000 0.140228 0.000000 1899.000000 1969.000000 32.647300 -27.069900 0.007907 0.000000 0.000000 0.000000
25% 3.000000 0.356400 2005.000000 1971.000000 2015.000000 44.888750 1.887070 90.676522 0.000000 0.000000 0.000000
50% 11.000000 0.385300 2012.000000 1997.000000 2020.000000 49.365235 9.152700 795.600000 0.000000 0.000000 0.000000
75% 32.000000 0.580829 2018.000000 2009.000000 2026.000000 52.318355 13.693179 2023.709052 0.000000 0.000000 0.000000
max 6000.000000 0.917460 2030.000000 2020.000000 2051.000000 71.012300 39.655350 16840.000000 9500.000000 1800.000000 421000.000000

Sometimes, we also want to store a DataFrame for later use. There are many different file formats tabular data can be stored in, including HTML, JSON, Excel, Parquet, Feather, etc. Here, let’s say we want to store the DataFrame as CSV (comma-separated values) file under the name “tmp.csv”.

df.to_csv("tmp.csv")
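
A minimal sketch of the full round trip, writing to a temporary directory and restoring the index with index_col=0 when reading the file back:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame(
    {"capacity": [1269, 1365]},
    index=["Neckarwestheim", "Isar 2"],
)

# write the DataFrame to CSV and read it back;
# index_col=0 restores the first column as the index
with tempfile.TemporaryDirectory() as tmpdir:
    fn = os.path.join(tmpdir, "tmp.csv")
    df.to_csv(fn)
    restored = pd.read_csv(fn, index_col=0)

print(restored)
```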

Groupby Functionality#

Both Series and DataFrame objects have a groupby method. It accepts a variety of arguments, but the simplest way to think about it is that you pass another series, whose unique values are used to split the original object into different groups. groupby is an amazingly powerful but also complex function.

Here’s an example which retrieves the total generation capacity per country.

grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania                    2361.708275
Austria                   25458.200368
Belgium                   21426.651009
Bosnia and Herzegovina     4827.195964
Bulgaria                  15699.186363
Name: Capacity, dtype: float64

“Chaining” operations together like this is very common with pandas.

Let’s break apart this operation a bit. The workflow with groupby can be divided into three general steps:

  1. Split: Partition the data into different groups based on some criterion.

  2. Apply: Do some calculation within each group. Different types of steps might be:

    • Aggregation: Get the mean or max within the group.

    • Transformation: Normalize all the values within a group.

    • Filtration: Eliminate some groups based on a criterion.

  3. Combine: Put the results back together into a single object.
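
As a sketch with a small hypothetical DataFrame, the three kinds of apply steps look like this:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["Germany", "Germany", "France", "France"],
        "Capacity": [1269.0, 1365.0, 2660.0, 900.0],
    }
)

gb = df.groupby("Country")

# aggregation: one value per group
totals = gb.Capacity.sum()

# transformation: result has the same shape as the input;
# here, each plant's share of its country's total capacity
share = gb.Capacity.transform(lambda x: x / x.sum())

# filtration: keep only rows belonging to groups meeting a criterion
large = gb.filter(lambda g: g.Capacity.sum() > 3000)

print(totals)
print(share)
print(large)
```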

gb = df.groupby("Country")
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f31d7f797d0>

The length tells us how many groups were found:

len(gb)
36

All of the groups are available as a dictionary via the .groups attribute:

groups = gb.groups
len(groups)
36
list(groups.keys())[:5]
['Albania', 'Austria', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria']

Now that we know how to create a GroupBy object, let’s learn how to do aggregation on it.

gb.Capacity.sum().nlargest(5)
Country
Germany           212675.829096
Spain             150108.858726
United Kingdom    149598.567501
France            148014.692084
Italy             101488.386461
Name: Capacity, dtype: float64
gb["DateIn"].mean().head()
Country
Albania                   1994.666667
Austria                   1987.307692
Belgium                   2001.708029
Bosnia and Herzegovina    1992.200000
Bulgaria                  1998.116279
Name: DateIn, dtype: float64

Grouping is not only possible on a single column, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of column names to the groupby function.

capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum()
capacities
Country         Fueltype     
Albania         Hydro             1743.353732
                Oil                 89.854543
                Solar              294.500000
                Wind               234.000000
Austria         Hard Coal         1331.400000
                                     ...     
United Kingdom  Other               55.000000
                Solar            11668.600000
                Solid Biomass     4919.000000
                Waste              288.900000
                Wind             38670.300000
Name: Capacity, Length: 243, dtype: float64

By grouping by multiple attributes, our index becomes a pd.MultiIndex (a hierarchical index with multiple levels).

capacities.index[:5]
MultiIndex([('Albania',     'Hydro'),
            ('Albania',       'Oil'),
            ('Albania',     'Solar'),
            ('Albania',      'Wind'),
            ('Austria', 'Hard Coal')],
           names=['Country', 'Fueltype'])
type(capacities.index)
pandas.core.indexes.multi.MultiIndex
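
As a small sketch (with made-up capacity values), such a MultiIndex can be sliced with .loc on the first level, or with .xs on any named level:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Albania", "Hydro"), ("Albania", "Solar"), ("Austria", "Hydro")],
    names=["Country", "Fueltype"],
)
capacities = pd.Series([1743.4, 294.5, 13000.0], index=idx)

# selecting on the first level returns all fuel types for that country
print(capacities.loc["Albania"])

# .xs can select on any level by name
print(capacities.xs("Hydro", level="Fueltype"))
```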

We can use the .unstack function to reshape the multi-indexed pd.Series into a pd.DataFrame which has the second index level as columns.

capacities.unstack().tail().T
Country Spain Sweden Switzerland Ukraine United Kingdom
Fueltype
Biogas NaN NaN NaN NaN 31.000000
Geothermal NaN NaN NaN NaN NaN
Hard Coal 11904.878478 291.000000 NaN 23782.495570 33823.617061
Hydro 26069.861248 14273.686625 20123.3008 6599.452229 4576.175000
Lignite 1831.400000 NaN NaN NaN NaN
Natural Gas 28394.244000 2358.000000 55.0000 4687.900000 36284.975440
Nuclear 7733.200000 9859.000000 3355.0000 17635.000000 19181.000000
Oil 1854.371000 1685.000000 NaN NaN 100.000000
Other NaN NaN NaN NaN 55.000000
Solar 36998.200000 281.800000 96.8000 5628.700000 11668.600000
Solid Biomass 563.000000 2432.600000 NaN NaN 4919.000000
Waste 388.054000 NaN NaN NaN 288.900000
Wind 34371.650000 16958.800000 55.0000 461.400000 38670.300000
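
The inverse operation is .stack(), which moves the columns back into an index level. A minimal sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame(
    {"Hydro": [1743.4, 13000.0], "Solar": [294.5, 96.8]},
    index=pd.Index(["Albania", "Austria"], name="Country"),
)
df.columns.name = "Fueltype"

# .stack() moves the columns into a new innermost index level,
# turning the DataFrame back into a multi-indexed Series
stacked = df.stack()
print(stacked)
```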

Exercises#

Power Plants Data#

Run the function .describe() on the DataFrame that includes the power plant database:

df.describe()
Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh
count 22818.000000 542.000000 16980.000000 2438.000000 683.000000 22818.000000 22818.000000 615.000000 22818.000000 22818.000000 22818.000000
mean 57.799302 0.480055 2004.922438 1988.515176 2019.980966 49.042717 8.798689 1295.789678 3.771375 8.372420 464.285028
std 225.901017 0.179453 48.109169 25.480213 12.323877 6.159855 9.112971 1542.021891 89.711399 52.640524 7561.917809
min 0.000000 0.140228 0.000000 1899.000000 1969.000000 32.647300 -27.069900 0.007907 0.000000 0.000000 0.000000
25% 3.000000 0.356400 2005.000000 1971.000000 2015.000000 44.888750 1.887070 90.676522 0.000000 0.000000 0.000000
50% 11.000000 0.385300 2012.000000 1997.000000 2020.000000 49.365235 9.152700 795.600000 0.000000 0.000000 0.000000
75% 32.000000 0.580829 2018.000000 2009.000000 2026.000000 52.318355 13.693179 2023.709052 0.000000 0.000000 0.000000
max 6000.000000 0.917460 2030.000000 2020.000000 2051.000000 71.012300 39.655350 16840.000000 9500.000000 1800.000000 421000.000000

Provide a list of unique fuel types included in the dataset

df.Fueltype.unique()
array(['Hard Coal', 'Nuclear', 'Hydro', 'Lignite', 'Natural Gas',
       'Solid Biomass', 'Oil', 'Other', 'Solar', 'Wind', 'Geothermal',
       'Waste', 'Biogas'], dtype=object)

Provide a list of unique technologies included in the dataset

df.Technology.unique()
array(['Steam Turbine', 'Reservoir', 'Run-Of-River', 'Pumped Storage',
       'CCGT', nan, 'Unknown', 'Pv', 'Onshore', 'Marine', 'Offshore',
       'Not Found', 'Combustion Engine', 'PV', 'CSP', 'unknown',
       'not found'], dtype=object)

Filter the dataset by power plants with the fuel type “Hard Coal”

coal = df.loc[df.Fueltype == "Hard Coal"]
coal
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
0 Borssele Hard Coal Steam Turbine PP Netherlands 485.000000 NaN 1973.0 NaN 2034.0 51.433200 3.716000 NaN 0.0 0.0 0.0 {'49W000000000054X'} {'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49...
26 Mellach Hard Coal Steam Turbine CHP Austria 200.000000 NaN 1986.0 1986.0 2020.0 46.911700 15.488300 NaN 0.0 0.0 0.0 {'14W-WML-KW-----0'} {'BEYONDCOAL': {'BEYOND-AT-11'}, 'ENTSOE': {'1...
136 Bastardo Hard Coal Steam Turbine PP Italy 138.290543 NaN 1989.0 NaN 2025.0 42.893300 12.539700 NaN 0.0 0.0 0.0 {'26WIMPI-S10BSTRJ'} {'BEYONDCOAL': {'BEYOND-IT-10'}, 'ENTSOE': {'2...
137 Brindisi Sud Hard Coal Steam Turbine PP Italy 1825.435174 NaN 1991.0 NaN 2025.0 40.563900 18.032200 NaN 0.0 0.0 0.0 {'26WIMPI-S16BSCRY'} {'BEYONDCOAL': {'BEYOND-IT-2'}, 'ENTSOE': {'26...
138 Fusina Hard Coal Steam Turbine PP Italy 899.810470 NaN 1964.0 NaN 2025.0 45.431400 12.245800 NaN 0.0 0.0 0.0 {'26WIMPI-S05FTSNK'} {'BEYONDCOAL': {'BEYOND-IT-24'}, 'ENTSOE': {'2...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22671 Westerholt Hard Coal Steam Turbine PP Germany 300.000000 NaN 1959.0 NaN 2005.0 51.601111 7.063889 NaN 0.0 0.0 0.0 {nan, nan} {'GEM': {'G100000114638', 'G100000114637'}}
22731 Wilhelmshaven Engie Hard Coal Steam Turbine PP Germany 830.000000 NaN 2015.0 NaN NaN 53.574800 8.137000 NaN 0.0 0.0 0.0 {nan} {'GEM': {'G100000110939'}}
22749 Wilton International Hard Coal Steam Turbine PP United Kingdom 66.600000 NaN 1964.0 NaN 2007.0 54.589400 -1.118500 NaN 0.0 0.0 0.0 {nan, nan} {'GEM': {'G100000110950', 'G100000110949'}}
22805 Wolfsburg West Hard Coal Steam Turbine PP Germany 306.000000 NaN 1985.0 NaN 2024.0 52.441544 10.765079 NaN 0.0 0.0 0.0 {nan, nan} {'GEM': {'G100000110964', 'G100000110963'}}
23021 Zeltweg Hard Coal Steam Turbine PP Austria 137.000000 NaN 1962.0 NaN 2001.0 47.250000 15.166667 NaN 0.0 0.0 0.0 {nan} {'GEM': {'G100000111546'}}

339 rows × 18 columns

Identify the 5 largest coal power plants. In which countries are they located? When were they built?

coal.loc[coal.Capacity.nlargest(5).index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
224 Kozienice Hard Coal Steam Turbine PP Poland 3682.216205 NaN 1972.0 NaN 2042.0 51.6647 21.46670 NaN 0.0 0.0 0.0 {'19W000000000104I', '19W000000000095U'} {'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1...
3465 Zaporiska Hard Coal CCGT PP Ukraine 3318.973044 NaN 1972.0 NaN NaN 47.5089 34.62530 NaN 0.0 0.0 0.0 {nan, nan, nan, nan, nan, nan, nan} {'GEO': {'GEO-42988'}, 'GPD': {'WRI1005101'}}
3466 Vuhlehirska Uglegorskaya Hard Coal CCGT PP Ukraine 3318.973044 NaN 1972.0 NaN NaN 48.4633 38.20328 NaN 0.0 0.0 0.0 {nan, nan, nan, nan, nan, nan, nan} {'GEO': {'GEO-43001'}, 'GPD': {'WRI1005107'}}
761 Opole Hard Coal Steam Turbine PP Poland 3071.893939 NaN 1993.0 NaN 2020.0 50.7518 17.88200 NaN 0.0 0.0 0.0 {'19W0000000001292'} {'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1...
769 Longannet Hard Coal Steam Turbine PP United Kingdom 2400.000000 NaN 1970.0 NaN 2016.0 56.0500 -3.68120 NaN 0.0 0.0 0.0 {'48WSTN00000LOAND'} {'BEYONDCOAL': {'BEYOND-UK-87'}, 'ENTSOE': {'4...

Identify the power plant with the longest “Name”.

i = df.Name.map(lambda x: len(x)).argmax()
df.iloc[i]
Name                   Kesznyeten Hernadviz Hungary Kesznyeten Hernad...
Fueltype                                                           Hydro
Technology                                                  Run-Of-River
Set                                                                   PP
Country                                                          Hungary
Capacity                                                             4.4
Efficiency                                                           NaN
DateIn                                                            1945.0
DateRetrofit                                                      1992.0
DateOut                                                              NaN
lat                                                            47.995972
lon                                                            21.033037
Duration                                                             NaN
Volume_Mm3                                                           0.0
DamHeight_m                                                         14.0
StorageCapacity_MWh                                                  0.0
EIC                                                                {nan}
projectID                   {'JRC': {'JRC-H2138'}, 'GEO': {'GEO-42677'}}
Name: 3017, dtype: object

Identify the 10 northernmost powerplants. What type of power plants are they?

index = df.lat.nlargest(10).index
df.loc[index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
12704 Havoygavlen Wind Farm Wind Onshore PP Norway 78.0 NaN 2003.0 NaN 2021.0 71.012300 24.594200 NaN 0.0 0.0 0.0 {nan, nan} {'GEM': {'G100000917626', 'G100000917445'}}
13840 Kjollefjord Wind Farm Wind Onshore PP Norway 39.0 NaN 2006.0 NaN NaN 70.918500 27.289900 NaN 0.0 0.0 0.0 {nan} {'GEM': {'G100000917523'}}
2446 Repvag Hydro Reservoir Store Norway 4.4 NaN 1953.0 1953.0 NaN 70.773547 25.616486 2810.704545 28.3 172.0 12367.1 {nan} {'JRC': {'JRC-N339'}, 'OPSD': {'OEU-4174'}}
18714 Raggovidda Wind Farm Wind Onshore PP Norway 97.0 NaN 2014.0 NaN NaN 70.765700 29.083300 NaN 0.0 0.0 0.0 {nan, nan} {'GEM': {'G100000918029', 'G100000918729'}}
2709 Maroyfjord Hydro Reservoir Store Norway 4.4 NaN 1956.0 1956.0 NaN 70.751421 27.355047 1533.681818 13.8 225.3 6748.2 {nan} {'JRC': {'JRC-N289'}, 'OPSD': {'OEU-4036'}}
2621 Melkoya Natural Gas CCGT PP Norway 230.0 NaN 2007.0 2007.0 NaN 70.689366 23.600448 NaN 0.0 0.0 0.0 {'50WP00000000456T'} {'ENTSOE': {'50WP00000000456T'}, 'OPSD': {'OEU...
12573 Hammerfest Snohvit Terminal Natural Gas CCGT PP Norway 229.0 NaN NaN NaN NaN 70.685400 23.590000 NaN 0.0 0.0 0.0 {nan} {'GEM': {'L100000407847'}}
12575 Hamnefjell Wind Farm Wind Onshore PP Norway 52.0 NaN 2017.0 NaN NaN 70.667900 29.719000 NaN 0.0 0.0 0.0 {nan} {'GEM': {'G100000918725'}}
3320 Hammerfest Hydro Reservoir Store Norway 1.1 NaN 1947.0 1947.0 NaN 70.657936 23.714418 1638.181818 10.6 88.0 1802.0 {nan} {'JRC': {'JRC-N127'}, 'OPSD': {'OEU-3550'}}
2797 Kongsfjord Hydro Reservoir Store Norway 4.4 NaN 1946.0 1946.0 NaN 70.598729 29.057905 3343.795455 88.1 70.5 14712.7 {nan} {'JRC': {'JRC-N210'}, 'OPSD': {'OEU-3801'}}

What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?

df.groupby("Fueltype").DateIn.mean().sort_values()
Fueltype
Hard Coal        1972.081571
Hydro            1973.738442
Lignite          1975.611511
Nuclear          1975.785047
Oil              1986.655172
Other            1995.931034
Natural Gas      1996.416279
Waste            1996.921053
Geothermal       2000.142857
Solid Biomass    2000.337349
Biogas           2001.878049
Wind             2009.216257
Solar            2015.740711
Name: DateIn, dtype: float64

Plot a histogram of power plant capacities with bins of length 100 MW between 0 and 4000 MW. What do you observe?

df.Capacity.plot.hist(bins=np.arange(0, 4001, 100))
<Axes: ylabel='Frequency'>
(figure: histogram of power plant capacities, 100 MW bins)

How many power plants of each fuel type are there in each country? Display the results in a DataFrame with countries as index and fuel type as columns. Fill missing values with the value zero. Convert all values to integers.

Browse Google or the pandas documentation to find the right aggregation function to count values.

df.groupby(["Country", "Fueltype"]).size().unstack().fillna(0.0).astype(int)
Fueltype Biogas Geothermal Hard Coal Hydro Lignite Natural Gas Nuclear Oil Other Solar Solid Biomass Waste Wind
Country
Albania 0 0 0 10 0 0 0 1 0 11 0 0 1
Austria 0 0 6 174 2 19 0 0 0 61 0 0 93
Belgium 0 0 10 13 0 27 3 8 0 76 6 8 89
Bosnia and Herzegovina 0 0 0 17 10 0 0 0 0 19 0 0 7
Bulgaria 0 0 4 30 9 6 1 0 0 215 0 0 17
Croatia 0 0 1 21 0 7 0 0 0 26 0 0 27
Czechia 0 0 11 14 27 8 2 0 0 331 1 0 6
Denmark 0 0 12 0 0 14 0 1 1 73 9 0 112
Estonia 0 0 0 0 0 2 0 2 0 71 0 0 15
Finland 25 0 17 105 6 32 2 14 3 13 21 0 155
France 0 0 16 155 1 42 23 6 2 1268 5 0 1054
Germany 38 0 94 729 33 257 26 26 23 3989 22 59 1498
Greece 0 0 0 20 10 18 0 0 0 315 0 1 192
Hungary 0 0 5 4 1 17 1 1 0 319 4 0 10
Ireland 0 0 1 6 2 17 0 4 0 33 1 0 123
Italy 0 25 18 289 0 122 4 2 2 249 12 0 280
Kosovo 0 0 0 2 2 0 0 0 0 4 0 0 2
Latvia 0 0 0 3 0 3 0 0 0 12 0 0 3
Lithuania 0 0 0 2 0 4 1 0 0 32 1 0 23
Luxembourg 0 0 0 2 0 0 0 0 0 16 0 0 9
Moldova 0 0 1 1 0 2 0 0 0 8 0 0 0
Montenegro 0 0 0 4 1 0 0 0 0 1 0 0 2
Netherlands 1 0 9 0 0 53 1 1 2 221 12 0 148
North Macedonia 0 0 0 10 2 2 0 0 0 34 0 0 2
Norway 0 0 0 930 0 4 0 0 0 1 0 0 60
Poland 0 0 54 19 8 22 0 1 0 291 6 0 223
Portugal 0 0 3 44 0 18 0 0 0 92 4 0 125
Romania 0 0 3 119 15 15 1 0 0 231 1 0 48
Serbia 0 0 0 13 10 3 0 0 0 9 1 0 10
Slovakia 0 0 4 29 3 6 2 0 0 105 0 0 0
Slovenia 0 0 1 28 4 3 1 0 0 7 0 0 0
Spain 0 0 22 359 7 88 7 12 0 1288 11 15 843
Sweden 0 0 2 147 0 12 5 4 0 37 35 0 254
Switzerland 0 0 0 471 0 1 5 0 0 48 0 0 3
Ukraine 0 0 19 9 0 11 5 0 0 427 0 0 11
United Kingdom 2 0 26 101 0 93 17 1 1 1137 36 10 406

Time Series Analysis#

Read in the time series from the second lecture into a DataFrame.

The file is available at https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv and includes hourly time series for Germany in 2015 for:

  1. electricity demand from OPSD in GW

  2. onshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  3. offshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  4. solar PV capacity factors from renewables.ninja in per-unit of installed capacity

  5. electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform

Use the function pd.read_csv with the keyword arguments index_col= and parse_dates= to ensure the time stamps are treated as pd.DatetimeIndex.

# your code here
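A minimal sketch of how `index_col=` and `parse_dates=` work, using a small inline CSV as a stand-in for the download URL (the column names below mirror the expected file, but the data here is synthetic):

```python
import io

import pandas as pd

# Stand-in for:
# pd.read_csv("https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv",
#             index_col=0, parse_dates=True)
csv = io.StringIO(
    "time,load,onwind\n"
    "2015-01-01 00:00:00,41.151,0.1566\n"
    "2015-01-01 01:00:00,40.135,0.1659\n"
)

# index_col=0 makes the first column the index;
# parse_dates=True converts it to a pd.DatetimeIndex
df = pd.read_csv(csv, index_col=0, parse_dates=True)
```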

The start of the DataFrame should look like this:

                       load  onwind  offwind  solar  prices
2015-01-01 00:00:00  41.151  0.1566   0.7030      0     NaN
2015-01-01 01:00:00  40.135  0.1659   0.6875      0     NaN
2015-01-01 02:00:00  39.106  0.1746   0.6535      0     NaN
2015-01-01 03:00:00  38.765  0.1745   0.6803      0     NaN
2015-01-01 04:00:00  38.941  0.1826   0.7272      0     NaN

And it should pass the following test:

assert type(df.index) == pd.DatetimeIndex

For each column:

  • What are the average, minimum and maximum values?

  • Find the time stamps where data on prices is missing.

  • Fill up the missing data with the prices observed one week ahead.

  • Plot the time series for the full year.

  • Plot the time series for the month of May.

  • Resample the time series to daily, weekly, and monthly frequencies and plot the resulting time series in one graph.

  • Sort the values in descending order and plot the duration curve. Hint: Run .reset_index(drop=True) to drop the index after sorting.

  • Plot a histogram of the time series values.

  • Perform a Fourier transformation of the time series. What are the dominant frequencies? Hint: Below you can find an example of how a Fourier transformation can be done with numpy.

  • Calculate the Pearson correlation coefficients between all time series. Hint: There is a function for that. Google for “pandas dataframe correlation”.
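A few of the steps above (finding missing time stamps, filling them with the values observed one week ahead, and resampling) can be sketched with synthetic data. The 168-hour shift and the series construction are assumptions for illustration; in the exercise you would apply the same calls to the price column of the DataFrame read above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly price series over two weeks, with a small gap
idx = pd.date_range("2015-01-01", periods=24 * 14, freq="h")
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx)
prices.iloc[5:8] = np.nan  # pretend these time stamps are missing

# Time stamps where data on prices is missing
missing = prices.index[prices.isna()]

# Fill gaps with the value observed one week (168 hours) ahead:
# shift(-168) aligns each time stamp with the value 168 hours later
filled = prices.fillna(prices.shift(-168))

# Resample to daily means; weekly/monthly work the same way
# with a different frequency string
daily = prices.resample("D").mean()
```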

abs(pd.Series(np.fft.rfft(df.solar - df.solar.mean()), index=np.fft.rfftfreq(len(df.solar), d=1./8760))**2)

# your code here