Introduction to pandas#

Note

This material is mostly adapted from the following resources:

Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job. Some of the things pandas does well:

  • A fast and efficient DataFrame object for data manipulation with indexing;

  • Tools for reading and writing data: CSV and text files, Excel, SQL;

  • Intelligent data alignment and integrated handling of missing data;

  • Flexible reshaping and pivoting of data sets;

  • Intelligent label-based slicing, indexing, and subsetting of large data sets;

  • High performance aggregating, merging, joining or transforming data;

  • Hierarchical indexing provides an intuitive way of working with high-dimensional data;

  • Time series functionality: date-based indexing, frequency conversion, moving windows, date shifting and lagging.

Note

Documentation for this package is available at https://pandas.pydata.org/docs/.

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work, download the .ipynb file and import it into Google Colab.

Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.

!pip install pandas numpy
import pandas as pd
import numpy as np

Pandas Data Structures: Series#

A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an index. The index contains the labels that we use to access the data.

There are many ways to create a Series. We will just show a few. The core constructor is pd.Series().

(Data are from Wikipedia’s List of power stations in Germany.)

names = ["Neckarwestheim", "Isar 2", "Emsland"]
values = [1269, 1365, 1290]
s = pd.Series(values, index=names)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64
dictionary = {
    "Neckarwestheim": 1269,
    "Isar 2": 1365,
    "Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64

Arithmetic operations and most numpy functions can be applied to a pd.Series. An important point is that the Series keeps its index during such operations.

np.log(s) / s**0.5
Neckarwestheim    0.200600
Isar 2            0.195391
Emsland           0.199418
dtype: float64

We can access the underlying index object if we need to:

s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')

We can get values back out using the index via the .loc attribute

s.loc["Isar 2"]
1365

Or by raw position using .iloc

s.iloc[2]
1290

We can pass a list or array to loc to get multiple rows back:

s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim    1269
Emsland           1290
dtype: int64

And we can even use slice notation

s.loc["Neckarwestheim":"Emsland"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64
s.iloc[:2]
Neckarwestheim    1269
Isar 2            1365
dtype: int64

If we need to, we can always get the raw data back out as well

s.values  # a numpy array
array([1269, 1365, 1290])
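As an aside, recent versions of pandas recommend .to_numpy() over the .values attribute for extracting the underlying array; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1269, 1365, 1290], index=["Neckarwestheim", "Isar 2", "Emsland"])
arr = s.to_numpy()  # returns the same numpy array as .values
print(arr)
```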

Pandas Data Structures: DataFrame#

There is a lot more to Series, but they are limited to a single column. A more useful pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It’s a lot like a table in a spreadsheet.

The core constructor is pd.DataFrame()

Below we create a DataFrame.

# first we create a dictionary
data = {
    "capacity": [1269, 1365, 1290],  # MW
    "type": ["PWR", "PWR", "PWR"],
    "start_year": [1989, 1988, 1988],
    "end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

We can also switch columns and rows very easily.

df.T
Neckarwestheim Isar 2 Emsland
capacity 1269 1365 1290
type PWR PWR PWR
start_year 1989 1988 1988
end_year NaN NaN NaN

A wide range of statistical functions are available on both Series and DataFrames.

df.min()
capacity      1269
type           PWR
start_year    1988
end_year       NaN
dtype: object
df.mean(numeric_only=True)
capacity      1308.000000
start_year    1988.333333
end_year              NaN
dtype: float64
df.std(numeric_only=True)
capacity      50.467812
start_year     0.577350
end_year            NaN
dtype: float64
df.describe()
capacity start_year end_year
count 3.000000 3.000000 0.0
mean 1308.000000 1988.333333 NaN
std 50.467812 0.577350 NaN
min 1269.000000 1988.000000 NaN
25% 1279.500000 1988.000000 NaN
50% 1290.000000 1988.000000 NaN
75% 1327.500000 1988.500000 NaN
max 1365.000000 1989.000000 NaN

We can get a single column as a Series using Python’s getitem syntax on the DataFrame object.

df["capacity"]
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

…or using attribute syntax.

df.capacity
Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

Indexing works very similarly to Series:

df.loc["Emsland"]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object
df.iloc[2]
capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object

But we can also specify the column(s) and row(s) we want to access

df.loc["Emsland", "start_year"]
1988
df.loc[["Emsland", "Neckarwestheim"], ["start_year", "end_year"]]
start_year end_year
Emsland 1988 NaN
Neckarwestheim 1989 NaN

If we make a calculation using columns from the DataFrame, the result keeps the same index:

df.capacity * 0.8
Neckarwestheim    1015.2
Isar 2            1092.0
Emsland           1032.0
Name: capacity, dtype: float64

Which we can easily add as another column to the DataFrame:

df["reduced_capacity"] = df.capacity * 0.8
df
capacity type start_year end_year reduced_capacity
Neckarwestheim 1269 PWR 1989 NaN 1015.2
Isar 2 1365 PWR 1988 NaN 1092.0
Emsland 1290 PWR 1988 NaN 1032.0

We can also remove columns or rows from a DataFrame:

df.drop("reduced_capacity", axis="columns")
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 NaN
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 NaN

We can update the variable df by either overwriting df or passing an inplace keyword:

df.drop("reduced_capacity", axis="columns", inplace=True)
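Both variants give the same result; an illustrative sketch with a toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_overwrite = df_toy.drop("b", axis="columns")  # returns a new DataFrame
df_toy.drop("b", axis="columns", inplace=True)   # modifies df_toy directly
print(df_overwrite.equals(df_toy))
```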

We can also drop columns with only NaN values

df.dropna(axis=1)
capacity type start_year
Neckarwestheim 1269 PWR 1989
Isar 2 1365 PWR 1988
Emsland 1290 PWR 1988

Or fill it up with default “fallback” data:

df.fillna(2023)
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 2023.0
Isar 2 1365 PWR 1988 2023.0
Emsland 1290 PWR 1988 2023.0

Say, we already have one value for end_year and want to fill up the missing data:

df.loc["Emsland", "end_year"] = 2023
# backward (upward) fill from the next valid values
df.bfill()
capacity type start_year end_year
Neckarwestheim 1269 PWR 1989 2023.0
Isar 2 1365 PWR 1988 2023.0
Emsland 1290 PWR 1988 2023.0

Sorting Data#

We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values

df.sort_index()
capacity type start_year end_year
Emsland 1290 PWR 1988 2023.0
Isar 2 1365 PWR 1988 NaN
Neckarwestheim 1269 PWR 1989 NaN
df.sort_values(by="capacity", ascending=False)
capacity type start_year end_year
Isar 2 1365 PWR 1988 NaN
Emsland 1290 PWR 1988 2023.0
Neckarwestheim 1269 PWR 1989 NaN


Merging Data#

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the documentation. Here we just give a few examples.

data = {
    "capacity": [1288, 1360, 1326],  # MW
    "type": ["BWR", "PWR", "PWR"],
    "start_year": [1985, 1985, 1986],
    "end_year": [2021, 2021, 2021],
    "x": [10.40, 9.41, 9.35],
    "y": [48.51, 52.03, 53.85],
}
df2 = pd.DataFrame(data, index=["Gundremmingen", "Grohnde", "Brokdorf"])
df2
capacity type start_year end_year x y
Gundremmingen 1288 BWR 1985 2021 10.40 48.51
Grohnde 1360 PWR 1985 2021 9.41 52.03
Brokdorf 1326 PWR 1986 2021 9.35 53.85

We can now add this additional data to the df object

df = pd.concat([df, df2])
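Besides pd.concat, pandas also offers database-style joins via pd.merge, described in the documentation linked above. A minimal sketch with made-up owner data (the owners here are placeholders, not real values):

```python
import pandas as pd

plants = pd.DataFrame({"plant": ["Emsland", "Brokdorf"], "capacity": [1290, 1326]})
owners = pd.DataFrame({"plant": ["Emsland", "Brokdorf"], "owner": ["A", "B"]})
# join the two tables on their shared "plant" column
merged = pd.merge(plants, owners, on="plant", how="left")
print(merged)
```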

Filtering Data#

We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.

df.capacity > 1300
Neckarwestheim    False
Isar 2             True
Emsland           False
Gundremmingen     False
Grohnde            True
Brokdorf           True
Name: capacity, dtype: bool
df[df.capacity > 1300]
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN
Grohnde 1360 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

We can also combine multiple conditions, but we need to wrap each condition in parentheses!

df[(df.capacity > 1300) & (df.start_year >= 1988)]
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN
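
Conditions can also be combined with | (or) and negated with ~; a quick sketch reusing the reactor data from above:

```python
import pandas as pd

df_r = pd.DataFrame(
    {"capacity": [1269, 1365, 1290], "start_year": [1989, 1988, 1988]},
    index=["Neckarwestheim", "Isar 2", "Emsland"],
)
# either condition may hold; each one still needs its own parentheses
subset = df_r[(df_r.capacity > 1300) | (df_r.start_year >= 1989)]
print(subset.index.tolist())
```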

Or we can make SQL-like queries:

df.query("start_year == 1988")
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN
Emsland 1290 PWR 1988 2023.0 NaN NaN
threshold = 1300
df.query("start_year == 1988 and capacity > @threshold")
capacity type start_year end_year x y
Isar 2 1365 PWR 1988 NaN NaN NaN

Modifying Values#

In many cases, we want to modify values in a DataFrame based on some rule. To modify values, we need to use .loc or .iloc:

df.loc["Isar 2", "x"] = 12.29
df.loc["Grohnde", "capacity"] += 1
df
capacity type start_year end_year x y
Neckarwestheim 1269 PWR 1989 NaN NaN NaN
Isar 2 1365 PWR 1988 NaN 12.29 NaN
Emsland 1290 PWR 1988 2023.0 NaN NaN
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85
operational = ["Neckarwestheim", "Isar 2", "Emsland"]
df.loc[operational, "y"] = [49.04, 48.61, 52.47]
df
capacity type start_year end_year x y
Neckarwestheim 1269 PWR 1989 NaN NaN 49.04
Isar 2 1365 PWR 1988 NaN 12.29 48.61
Emsland 1290 PWR 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

Applying Functions#

Sometimes it can be useful to apply a function to all values of a column/row. For instance, we might be interested in normalised capacities relative to the largest nuclear power plant:

df.capacity.apply(lambda x: x / df.capacity.max())
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64
df.capacity.map(lambda x: x / df.capacity.max())
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64

For simple functions, there’s often an easier alternative:

df.capacity / df.capacity.max()
Neckarwestheim    0.929670
Isar 2            1.000000
Emsland           0.945055
Gundremmingen     0.943590
Grohnde           0.997070
Brokdorf          0.971429
Name: capacity, dtype: float64

But .apply() and .map() often give you more flexibility.
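
For example, .apply with axis=1 passes whole rows to the function, which lets a calculation combine several columns; a sketch with a toy frame:

```python
import pandas as pd

df_t = pd.DataFrame(
    {"capacity": [1269, 1365], "start_year": [1989, 1988]},
    index=["Neckarwestheim", "Isar 2"],
)
# row.name is the index label of each row
labels = df_t.apply(lambda row: f"{row.name}: {row.capacity} MW", axis=1)
print(labels.iloc[0])
```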

Renaming Indices and Columns#

Sometimes it can be useful to rename columns:

df.rename(columns=dict(x="lon", y="lat"))
capacity type start_year end_year lon lat
Neckarwestheim 1269 PWR 1989 NaN NaN 49.04
Isar 2 1365 PWR 1988 NaN 12.29 48.61
Emsland 1290 PWR 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 PWR 1985 2021.0 9.41 52.03
Brokdorf 1326 PWR 1986 2021.0 9.35 53.85

Replacing Values#

Sometimes it can be useful to replace values:

df.replace({"PWR": "Pressurized water reactor"})
capacity type start_year end_year x y
Neckarwestheim 1269 Pressurized water reactor 1989 NaN NaN 49.04
Isar 2 1365 Pressurized water reactor 1988 NaN 12.29 48.61
Emsland 1290 Pressurized water reactor 1988 2023.0 NaN 52.47
Gundremmingen 1288 BWR 1985 2021.0 10.40 48.51
Grohnde 1361 Pressurized water reactor 1985 2021.0 9.41 52.03
Brokdorf 1326 Pressurized water reactor 1986 2021.0 9.35 53.85

Plotting#

DataFrames have all kinds of useful plotting built in. Note that we do not even have to import matplotlib for this.

df.plot(kind="scatter", x="start_year", y="capacity")
<Axes: xlabel='start_year', ylabel='capacity'>
df.capacity.plot.barh(color="orange")
<Axes: >

Time Indexes#

Indexes are very powerful. They are a big part of why pandas is so useful. There are different indexes for different types of data, and time indexes are especially useful when handling time-dependent data.

time = pd.date_range(start="2021-01-01", end="2023-01-01", freq="D")
values = np.sin(2 * np.pi * time.dayofyear / 365)
ts = pd.Series(values, index=time)
ts.plot()
<Axes: >

We can use Python’s slicing notation inside .loc to select a date range.

ts.loc["2021-01-01":"2021-07-01"].plot()
<Axes: >
ts.loc["2021-05"].plot()
<Axes: >

The pd.DatetimeIndex object has lots of useful attributes

ts.index.month
Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       ...
       12, 12, 12, 12, 12, 12, 12, 12, 12,  1],
      dtype='int32', length=731)
ts.index.day
Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       ...
       23, 24, 25, 26, 27, 28, 29, 30, 31,  1],
      dtype='int32', length=731)

Another common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset aliases.

Below we resample the dataset by taking the mean over each month.

ts.resample("ME").mean().head()
2021-01-31    0.268746
2021-02-28    0.698782
2021-03-31    0.949778
2021-04-30    0.959332
2021-05-31    0.709200
Freq: ME, dtype: float64
ts.resample("ME").mean().plot()
<Axes: >

Reading and Writing Files#

To read data into pandas, we can use for instance the pd.read_csv() function. This function is incredibly powerful and complex with a multitude of settings. You can use it to extract data from almost any text file.

The pd.read_csv() function can take a path to a local file as an input, or even a link to an online text file.

Let’s import a slightly larger dataset about the power plant fleet in Europe.

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut
id
0 Borssele Hard Coal Steam Turbine PP Netherlands 485.0 NaN 1973.0 NaN 2034.0
1 Sainte Croix Hydro Reservoir Store France 132.3 NaN 1974.0 NaN NaN
2 Pied De Borne Hydro Reservoir Store France 109.4 NaN 1965.0 NaN NaN
3 Pouget Hydro Reservoir Store France 446.9 NaN 1951.0 NaN NaN
4 Pragneres Hydro Reservoir Store France 189.2 NaN 1953.0 NaN NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6374 entries, 0 to 6616
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Name                 6374 non-null   object 
 1   Fueltype             6374 non-null   object 
 2   Technology           5702 non-null   object 
 3   Set                  6374 non-null   object 
 4   Country              6374 non-null   object 
 5   Capacity             6374 non-null   float64
 6   Efficiency           542 non-null    float64
 7   DateIn               4450 non-null   float64
 8   DateRetrofit         1871 non-null   float64
 9   DateOut              349 non-null    float64
 10  lat                  6374 non-null   float64
 11  lon                  6374 non-null   float64
 12  Duration             452 non-null    float64
 13  Volume_Mm3           6374 non-null   float64
 14  DamHeight_m          6374 non-null   float64
 15  StorageCapacity_MWh  6374 non-null   float64
 16  EIC                  6374 non-null   object 
 17  projectID            6374 non-null   object 
dtypes: float64(11), object(7)
memory usage: 946.1+ KB
df.describe()
Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh
count 6374.000000 542.000000 4450.000000 1871.000000 349.000000 6374.000000 6374.000000 452.000000 6374.000000 6374.000000 6374.000000
mean 141.550081 0.480459 1982.047416 1984.920363 2020.418338 48.915773 7.879248 1201.563456 7.128439 4.872220 410.191716
std 391.501198 0.179431 88.218795 24.935544 9.537422 6.850656 9.083968 1603.428940 138.334742 47.251297 9267.942691
min 0.000000 0.140228 0.000000 1899.000000 1996.000000 32.742000 -17.055700 0.007907 0.000000 0.000000 0.000000
25% 9.800000 0.356400 1966.000000 1967.000000 2015.000000 43.472075 0.740450 34.009337 0.000000 0.000000 0.000000
50% 26.000000 0.385710 1994.000000 1991.000000 2019.000000 48.133707 8.545323 641.151139 0.000000 0.000000 0.000000
75% 86.000000 0.580829 2008.000000 2005.000000 2023.000000 52.420000 13.098021 1839.818750 0.000000 0.000000 0.000000
max 6000.000000 0.917460 2023.000000 2020.000000 2051.000000 70.689366 39.261917 16840.000000 9500.000000 1800.000000 421000.000000

Sometimes, we also want to store a DataFrame for later use. There are many different file formats tabular data can be stored in, including HTML, JSON, Excel, Parquet, Feather, etc. Here, let’s say we want to store the DataFrame as CSV (comma-separated values) file under the name “tmp.csv”.

df.to_csv("tmp.csv")
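
Reading the file back in closes the loop. Note that the index is written as the first column, so we restore it with index_col=0; a sketch with a small frame (the filename "roundtrip.csv" is illustrative, so we don’t overwrite the file above):

```python
import pandas as pd

small = pd.DataFrame({"capacity": [1269, 1290]}, index=["Neckarwestheim", "Emsland"])
small.to_csv("roundtrip.csv")                         # index goes into the first column
restored = pd.read_csv("roundtrip.csv", index_col=0)  # restore it when reading back
print(small.equals(restored))
```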

Groupby Functionality#

Both Series and DataFrame objects have a groupby method. It accepts a variety of arguments, but the simplest way to think about it is that you pass another series, whose unique values are used to split the original object into different groups. groupby is an amazingly powerful but also complex function.

Here’s an example which retrieves the total generation capacity per country.

grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania                    1833.208275
Austria                   19341.820000
Belgium                   16446.458999
Bosnia and Herzegovina     4038.500000
Bulgaria                  11687.086363
Name: Capacity, dtype: float64

Chaining operations together like this is very common with pandas.

Let’s break this operation apart a bit. The workflow with groupby can be divided into three general steps:

  1. Split: Partition the data into different groups based on some criterion.

  2. Apply: Do some calculation within each group. Different types of steps might be

    • Aggregation: Get the mean or max within the group.

    • Transformation: Normalize all the values within a group.

    • Filtration: Eliminate some groups based on a criterion.

  3. Combine: Put the results back together into a single object.
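
Besides aggregation, the apply step can also be a transformation that returns one value per row; a sketch normalising capacities within each country group (toy data):

```python
import pandas as pd

df_g = pd.DataFrame({"Country": ["DE", "DE", "FR"], "Capacity": [100.0, 300.0, 50.0]})
# .transform broadcasts the per-group sums back to the original shape
share = df_g.Capacity / df_g.groupby("Country").Capacity.transform("sum")
print(share.tolist())
```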

gb = df.groupby("Country")
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd2b7c8fb50>

The length tells us how many groups were found:

len(gb)
34

All of the groups are available as a dictionary via the .groups attribute:

groups = gb.groups
len(groups)
34
list(groups.keys())[:5]
['Albania', 'Austria', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria']

Now that we know how to create a GroupBy object, let’s learn how to do aggregation on it.

gb.Capacity.sum().nlargest(5)
Country
Germany           124029.715396
United Kingdom    115460.839493
France            106315.562084
Spain              88423.920012
Italy              82028.366461
Name: Capacity, dtype: float64
gb["DateIn"].mean().head()
Country
Albania                   1981.000000
Austria                   1971.917808
Belgium                   1989.830508
Bosnia and Herzegovina    1979.047619
Bulgaria                  1977.156250
Name: DateIn, dtype: float64

Grouping is not only possible on a single column, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of column names to the groupby function.

capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum()
capacities
Country         Fueltype 
Albania         Hydro         1743.353732
                Oil             89.854543
Austria         Hard Coal     1128.400000
                Hydro        13462.420000
                Lignite        522.000000
                                 ...     
United Kingdom  Oil            100.000000
                Other           55.000000
                Solar         2199.000000
                Waste          288.900000
                Wind         21720.000000
Name: Capacity, Length: 186, dtype: float64

By grouping by multiple attributes, our index becomes a pd.MultiIndex (a hierarchical index with multiple levels).

capacities.index[:5]
MultiIndex([('Albania',     'Hydro'),
            ('Albania',       'Oil'),
            ('Austria', 'Hard Coal'),
            ('Austria',     'Hydro'),
            ('Austria',   'Lignite')],
           names=['Country', 'Fueltype'])
type(capacities.index)
pandas.core.indexes.multi.MultiIndex

We can use the .unstack function to reshape the multi-indexed pd.Series into a pd.DataFrame, which has the second index level as columns.

capacities.unstack().tail().T
Country Spain Sweden Switzerland Ukraine United Kingdom
Fueltype
Bioenergy 20.000000 220.000000 NaN NaN 766.000000
Geothermal NaN NaN NaN NaN NaN
Hard Coal 10031.678478 291.000000 NaN 26663.0 34968.017061
Hydro 26080.361248 13979.686625 19346.76 6346.0 9735.700000
Lignite 1755.400000 NaN NaN NaN NaN
Natural Gas 23432.532000 1091.000000 NaN 1290.0 32470.222432
Nuclear 7704.200000 8617.000000 3348.00 13835.0 13158.000000
Oil 1901.271000 1685.000000 NaN NaN 100.000000
Other 91.143286 NaN NaN NaN 55.000000
Solar 5318.200000 NaN NaN 539.0 2199.000000
Waste 388.054000 NaN NaN NaN 288.900000
Wind 11701.080000 1546.000000 NaN NaN 21720.000000

Exercises#

Power Plants Data#

Run the function .describe() on the DataFrame that includes the power plant database:

df.describe()
Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh
count 6374.000000 542.000000 4450.000000 1871.000000 349.000000 6374.000000 6374.000000 452.000000 6374.000000 6374.000000 6374.000000
mean 141.550081 0.480459 1982.047416 1984.920363 2020.418338 48.915773 7.879248 1201.563456 7.128439 4.872220 410.191716
std 391.501198 0.179431 88.218795 24.935544 9.537422 6.850656 9.083968 1603.428940 138.334742 47.251297 9267.942691
min 0.000000 0.140228 0.000000 1899.000000 1996.000000 32.742000 -17.055700 0.007907 0.000000 0.000000 0.000000
25% 9.800000 0.356400 1966.000000 1967.000000 2015.000000 43.472075 0.740450 34.009337 0.000000 0.000000 0.000000
50% 26.000000 0.385710 1994.000000 1991.000000 2019.000000 48.133707 8.545323 641.151139 0.000000 0.000000 0.000000
75% 86.000000 0.580829 2008.000000 2005.000000 2023.000000 52.420000 13.098021 1839.818750 0.000000 0.000000 0.000000
max 6000.000000 0.917460 2023.000000 2020.000000 2051.000000 70.689366 39.261917 16840.000000 9500.000000 1800.000000 421000.000000

Provide a list of unique fuel types included in the dataset

df.Fueltype.unique()
array(['Hard Coal', 'Hydro', 'Nuclear', 'Natural Gas', 'Lignite',
       'Bioenergy', 'Oil', 'Wind', 'Other', 'Solar', 'Geothermal',
       'Waste'], dtype=object)

Provide a list of unique technologies included in the dataset

df.Technology.unique()
array(['Steam Turbine', 'Reservoir', 'Pumped Storage', 'Run-Of-River',
       'CCGT', 'Onshore', 'Offshore', 'Unknown', 'Marine', 'Not Found',
       nan, 'Csp', 'Pv', 'Offshore Mount Unknown', 'Offshore Hard Mount',
       'Assumed Pv', 'CSP', 'Combustion Engine'], dtype=object)

Filter the dataset by power plants with the fuel type “Hard Coal”

coal = df.loc[df.Fueltype == "Hard Coal"]
coal
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
0 Borssele Hard Coal Steam Turbine PP Netherlands 485.000000 NaN 1973.0 NaN 2034.0 51.4332 3.7160 NaN 0.0 0.0 0.0 {'49W000000000054X'} {'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49...
44 Fusina Hard Coal Steam Turbine PP Italy 899.810470 NaN 1964.0 NaN 2025.0 45.4314 12.2458 NaN 0.0 0.0 0.0 {'26WIMPI-S05FTSNK'} {'BEYONDCOAL': {'BEYOND-IT-24'}, 'ENTSOE': {'2...
51 Bastardo Hard Coal Steam Turbine PP Italy 138.290543 NaN 1989.0 NaN 2025.0 42.8933 12.5397 NaN 0.0 0.0 0.0 {'26WIMPI-S10BSTRJ'} {'BEYONDCOAL': {'BEYOND-IT-10'}, 'ENTSOE': {'2...
52 Brindisi Sud Hard Coal Steam Turbine PP Italy 1825.435174 NaN 1991.0 NaN 2025.0 40.5639 18.0322 NaN 0.0 0.0 0.0 {'26WIMPI-S16BSCRY'} {'BEYONDCOAL': {'BEYOND-IT-2'}, 'ENTSOE': {'26...
53 Monfalcone Hard Coal Steam Turbine PP Italy 309.770817 NaN 1965.0 NaN 2025.0 45.7967 13.5456 NaN 0.0 0.0 0.0 {'26WIMPI-S06MTNFA'} {'BEYONDCOAL': {'BEYOND-IT-13'}, 'ENTSOE': {'2...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6607 Ulm Magirusstrasse Hard Coal NaN PP Germany 124.461489 NaN 1978.0 NaN NaN 48.3967 9.9654 NaN 0.0 0.0 0.0 {nan} {'BEYONDCOAL': {'BEYOND-DE-216'}}
6608 Vitkovice Hard Coal NaN PP Czech Republic 0.000000 NaN 1970.0 NaN 2020.0 49.8160 18.2733 NaN 0.0 0.0 0.0 {nan} {'BEYONDCOAL': {'BEYOND-CZ-31'}}
6609 Voerde Hard Coal NaN PP Germany 0.000000 NaN 1982.0 NaN 2017.0 51.5762 6.6841 NaN 0.0 0.0 0.0 {nan} {'BEYONDCOAL': {'BEYOND-DE-222'}}
6611 Wilhelmshaven New Hard Coal NaN PP Germany 732.939880 NaN 2015.0 NaN NaN 53.5788 8.1327 NaN 0.0 0.0 0.0 {nan} {'BEYONDCOAL': {'BEYOND-DE-247'}}
6612 Zabrze Hard Coal NaN PP Poland 68.223335 NaN 1976.0 NaN NaN 50.2992 18.8123 NaN 0.0 0.0 0.0 {nan} {'BEYONDCOAL': {'BEYOND-PL-237'}}

282 rows × 18 columns

Identify the 5 largest coal power plants. In which countries are they located? When were they built?

coal.loc[coal.Capacity.nlargest(5).index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
1492 Longannet Hard Coal Steam Turbine PP United Kingdom 4856.000000 NaN 1970.0 NaN 2016.0 56.050000 -3.681200 NaN 0.0 0.0 0.0 {nan, nan, nan, nan, nan} {'BEYONDCOAL': {'BEYOND-UK-87'}, 'GEM': {'G106...
1493 Kingsnorth Hard Coal Steam Turbine PP United Kingdom 3772.000000 NaN 1970.0 NaN 2012.0 51.418500 0.603800 NaN 0.0 0.0 0.0 {nan, nan, nan, nan, nan, nan} {'BEYONDCOAL': {'BEYOND-UK-60'}, 'GEM': {'G105...
390 Kozienice Hard Coal Steam Turbine PP Poland 3682.216205 NaN 1972.0 NaN NaN 51.664700 21.466700 NaN 0.0 0.0 0.0 {'19W000000000104I', '19W000000000095U'} {'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1...
1441 Burshtyn Hard Coal Steam Turbine PP Ukraine 3166.000000 NaN 1965.0 1984.0 NaN 49.208764 24.666075 NaN 0.0 0.0 0.0 {nan, nan, nan, nan, nan, nan, nan, nan, nan, ... {'GEM': {'G101131', 'G101126', 'G101123', 'G10...
508 Opole Hard Coal Steam Turbine PP Poland 3071.893939 NaN 1993.0 NaN NaN 50.751800 17.882000 NaN 0.0 0.0 0.0 {'19W0000000001292'} {'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1...

Identify the power plant with the longest “Name”.

i = df.Name.str.len().argmax()
df.iloc[i]
Name                   Kesznyeten Hernadviz Hungary Kesznyeten Hernad...
Fueltype                                                           Hydro
Technology                                                  Run-Of-River
Set                                                                   PP
Country                                                          Hungary
Capacity                                                             4.4
Efficiency                                                           NaN
DateIn                                                            1945.0
DateRetrofit                                                      1992.0
DateOut                                                              NaN
lat                                                            47.995972
lon                                                            21.033037
Duration                                                             NaN
Volume_Mm3                                                           0.0
DamHeight_m                                                         14.0
StorageCapacity_MWh                                                  0.0
EIC                                                                {nan}
projectID                   {'JRC': {'JRC-H2138'}, 'GEO': {'GEO-42677'}}
Name: 3863, dtype: object

Identify the 10 northernmost powerplants. What type of power plants are they?

index = df.lat.nlargest(10).index
df.loc[index]
Name Fueltype Technology Set Country Capacity Efficiency DateIn DateRetrofit DateOut lat lon Duration Volume_Mm3 DamHeight_m StorageCapacity_MWh EIC projectID
id
4469 Melkoya Natural Gas CCGT PP Norway 230.0 NaN 2007.0 2007.0 NaN 70.689366 23.600448 NaN 0.0 0.0 0.0 {'50WP00000000456T'} {'ENTSOE': {'50WP00000000456T'}, 'OPSD': {'OEU...
1278 Adamselv Hydro Reservoir Store Norway 50.4 NaN 1973.0 2007.0 NaN 70.409578 26.701878 3348.800000 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-3105', 'OEU-3106'}, 'JRC': {'JR...
3647 Nedre Porsa Hydro Reservoir Store Norway 12.8 NaN 1959.0 1959.0 NaN 70.401409 23.627292 1876.875000 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-4052'}, 'JRC': {'JRC-N297'}}
3724 Kvaenangsbotn Hydro Reservoir Store Norway 55.0 NaN 1965.0 1965.0 NaN 69.720002 22.057216 323.658182 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-3856'}, 'JRC': {'JRC-N226'}}
236 Alta Krv Hydro Reservoir Store Norway 165.0 NaN 1987.0 1987.0 NaN 69.704900 23.818500 389.700000 0.0 0.0 0.0 {'50WP00000000006N'} {'ENTSOE': {'50WP00000000006N'}, 'GEM': {'G602...
3632 Smavatna Hydro Reservoir Store Norway 18.8 NaN 1970.0 1970.0 NaN 69.684261 22.154025 1577.872340 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-4353'}, 'JRC': {'JRC-N392'}}
1072 Guolas Hydro Reservoir Store Norway 82.0 NaN 1971.0 1971.0 NaN 69.460500 20.918400 2811.457500 0.0 0.0 0.0 {nan} {'GEM': {'G602944'}, 'OPSD': {'OEU-3532'}, 'JR...
3700 Melkefoss Hydro Reservoir Store Norway 22.0 NaN 1978.0 1978.0 NaN 69.399382 29.787300 NaN 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-3972'}, 'JRC': {'JRC-N267'}}
3601 Skogfoss Hydro Reservoir Store Norway 48.0 NaN 1964.0 1964.0 NaN 69.372953 29.694029 2795.833333 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-4320'}, 'JRC': {'JRC-N383'}}
3576 Skibotn Hydro Reservoir Store Norway 70.0 NaN 1979.0 1979.0 NaN 69.312883 20.342060 2186.080000 0.0 0.0 0.0 {nan} {'OPSD': {'OEU-4301'}, 'JRC': {'JRC-N375'}}

What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?

df.groupby("Fueltype").DateIn.mean().sort_values()
Fueltype
Hydro          1966.744212
Hard Coal      1971.894928
Lignite        1974.901639
Nuclear        1979.196970
Oil            1985.650794
Other          1994.406250
Wind           1994.562609
Natural Gas    1995.177419
Waste          1997.038961
Geothermal     1999.833333
Bioenergy      2000.851351
Solar          2013.311321
Name: DateIn, dtype: float64

Plot a histogram of power plant capacities with bins of length 100 MW between 0 and 4000 MW. What do you observe?

df.Capacity.plot.hist(bins=np.arange(0, 4001, 100))
<Axes: ylabel='Frequency'>

How many power plants of each fuel type are there in each country? Display the results in a DataFrame with countries as index and fuel type as columns. Fill missing values with the value zero. Convert all values to integers.

Browse Google or the pandas documentation to find the right aggregation function to count values.

df.groupby(["Country", "Fueltype"]).size().unstack().fillna(0.0).astype(int)
Fueltype Bioenergy Geothermal Hard Coal Hydro Lignite Natural Gas Nuclear Oil Other Solar Waste Wind
Country
Albania 0 0 0 10 0 0 0 1 0 0 0 0
Austria 0 0 4 172 2 11 0 0 0 0 0 4
Belgium 1 0 6 13 0 24 2 9 0 1 8 18
Bosnia and Herzegovina 0 0 0 16 5 0 0 0 0 0 0 0
Bulgaria 0 0 5 30 8 1 1 0 0 3 0 0
Croatia 0 0 1 21 0 4 0 0 0 0 0 0
Czech Republic 0 0 10 0 23 0 0 0 0 0 0 0
Denmark 3 0 12 0 0 4 0 1 1 3 0 12
Estonia 0 0 0 0 0 1 0 2 0 0 0 9
Finland 37 0 14 105 6 21 2 16 3 0 0 13
France 0 0 13 149 0 13 20 7 2 5 0 164
Germany 46 0 71 731 32 202 9 31 25 55 60 28
Greece 0 0 0 20 9 10 0 3 0 0 1 6
Hungary 1 0 5 4 1 12 1 1 0 3 0 2
Ireland 0 0 1 6 2 9 0 4 0 0 0 21
Italy 6 24 15 288 0 93 0 2 2 23 0 138
Latvia 0 0 0 3 0 2 0 0 0 0 0 0
Lithuania 0 0 0 2 0 2 1 0 0 0 0 3
Luxembourg 0 0 0 2 0 0 0 0 0 0 0 0
Moldova 0 0 1 1 0 1 0 0 0 0 0 0
Montenegro 0 0 0 4 1 0 0 0 0 0 0 0
Netherlands 0 0 7 0 0 32 0 1 3 4 0 17
Norway 0 0 0 380 0 3 0 0 0 0 0 2
Poland 0 0 50 18 7 9 0 1 0 0 0 49
Portugal 1 0 2 41 0 5 0 0 0 4 0 103
Romania 0 0 3 119 13 7 1 0 0 4 0 6
Serbia 0 0 0 13 10 2 0 0 0 0 0 0
Slovakia 0 0 2 29 3 4 2 0 0 1 0 0
Slovenia 0 0 1 28 3 2 1 0 0 0 0 0
Spain 1 0 18 361 6 52 7 13 1 203 15 297
Sweden 2 0 2 144 0 4 3 4 0 0 0 14
Switzerland 0 0 0 470 0 0 4 0 0 0 0 0
Ukraine 0 0 19 8 0 2 4 0 0 10 0 0
United Kingdom 18 0 20 101 0 67 8 1 1 64 10 349
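As a side note, the same Country-by-Fueltype count table can also be built with pd.crosstab, which counts occurrences directly and fills absent combinations with zero. Here is a minimal sketch on a tiny made-up stand-in for the power plant table (the rows below are invented for illustration only):

```python
import pandas as pd

# Tiny made-up stand-in for the power plant DataFrame
df = pd.DataFrame(
    {
        "Country": ["Norway", "Norway", "Germany", "Germany", "Germany"],
        "Fueltype": ["Hydro", "Hydro", "Wind", "Hydro", "Wind"],
    }
)

# pd.crosstab yields the same Country x Fueltype count table as
# df.groupby(["Country", "Fueltype"]).size().unstack().fillna(0).astype(int)
counts = pd.crosstab(df["Country"], df["Fueltype"])
print(counts)
```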

Time Series Analysis#

Read in the time series from the second lecture into a DataFrame.

The file is available at https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv and includes hourly time series for Germany in 2015 for:

  1. electricity demand from OPSD in GW

  2. onshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  3. offshore wind capacity factors from renewables.ninja in per-unit of installed capacity

  4. solar PV capacity factors from renewables.ninja in per-unit of installed capacity

  5. electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform

Use the function pd.read_csv with the keyword arguments index_col= and parse_dates= to ensure the time stamps are treated as pd.DatetimeIndex.

# your code here
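If you are unsure how the keyword arguments fit together, here is a minimal sketch using a small inline CSV in place of the actual download (the inline data is a made-up stand-in; verify that the real file has the same layout, with an unnamed datetime column first):

```python
import io

import pandas as pd

# Inline sample standing in for the downloaded CSV file (made-up values)
csv = io.StringIO(
    ",load,onwind,offwind,solar,prices\n"
    "2015-01-01 00:00:00,41.151,0.1566,0.703,0.0,\n"
    "2015-01-01 01:00:00,40.135,0.1659,0.6875,0.0,\n"
)

# index_col=0 uses the first column as the index;
# parse_dates=True converts it into a pd.DatetimeIndex
df = pd.read_csv(csv, index_col=0, parse_dates=True)

print(type(df.index))
```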

The start of the DataFrame should look like this:

                       load  onwind  offwind  solar  prices
2015-01-01 00:00:00  41.151  0.1566   0.7030    0.0     NaN
2015-01-01 01:00:00  40.135  0.1659   0.6875    0.0     NaN
2015-01-01 02:00:00  39.106  0.1746   0.6535    0.0     NaN
2015-01-01 03:00:00  38.765  0.1745   0.6803    0.0     NaN
2015-01-01 04:00:00  38.941  0.1826   0.7272    0.0     NaN

And it should pass the following test:

assert isinstance(df.index, pd.DatetimeIndex)

For each column:

  • What are the average, minimum and maximum values?

  • Find the time stamps where data on prices is missing.

  • Fill up the missing data with the prices observed one week ahead.

  • Plot the time series for the full year.

  • Plot the time series for the month May.

  • Resample the time series to daily, weekly, and monthly frequencies and plot the resulting time series in one graph.

  • Sort the values in descending order and plot the duration curve. Hint: Run .reset_index(drop=True) to drop the index after sorting.

  • Plot a histogram of the time series values.

  • Perform a Fourier transformation of the time series. What are the dominant frequencies? Hint: Below you can find an example of how a Fourier transformation can be done with numpy.

  • Calculate the Pearson correlation coefficients between all time series. Hint: There is a function for that. Google for “pandas dataframe correlation”.

# power spectrum of the solar time series; with d=1/8760, frequencies are in cycles per year
abs(pd.Series(np.fft.rfft(df.solar - df.solar.mean()), index=np.fft.rfftfreq(len(df.solar), d=1.0 / 8760)) ** 2)
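A few of the steps above (finding missing prices, filling them with the values one week ahead, resampling, and building the duration curve) can be sketched on a synthetic hourly series; the series below is invented for illustration and stands in for the prices column of your DataFrame:

```python
import numpy as np
import pandas as pd

# Synthetic hourly price series for one year with an artificial gap
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
prices = pd.Series(np.sin(np.arange(8760) * 2 * np.pi / 24) * 10 + 30, index=idx)
prices.iloc[100:110] = np.nan  # simulate missing data

# Time stamps where data on prices is missing
missing = prices.index[prices.isna()]

# Fill the gaps with the prices observed one week (168 hours) ahead
filled = prices.fillna(prices.shift(-168))

# Resample to daily means
daily = filled.resample("D").mean()

# Duration curve: sort descending, then drop the datetime index
duration = filled.sort_values(ascending=False).reset_index(drop=True)
```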

# your code here