Introduction to pandas#
Note
This material is mostly adapted from the following resources:
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job. Among other things, pandas offers:
A fast and efficient DataFrame object for data manipulation with indexing;
Tools for reading and writing data: CSV and text files, Excel, SQL;
Intelligent data alignment and integrated handling of missing data;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, indexing, and subsetting of large data sets;
High performance aggregating, merging, joining or transforming data;
Hierarchical indexing provides an intuitive way of working with high-dimensional data;
Time series functionality: date-based indexing, frequency conversion, moving windows, date shifting and lagging.
Note
Documentation for this package is available at https://pandas.pydata.org/docs/.
Note
If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work, download the .ipynb file and import it in Google Colab.
Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.
!pip install pandas numpy
import pandas as pd
import numpy as np
Pandas Data Structures: Series#
A Series represents a one-dimensional array of data. The main difference between a Series and a numpy array is that a Series has an index. The index contains the labels that we use to access the data.
There are many ways to create a Series. We will just show a few. The core constructor is pd.Series().
(Data are from Wikipedia’s List of power stations in Germany.)
names = ["Neckarwestheim", "Isar 2", "Emsland"]
values = [1269, 1365, 1290]
s = pd.Series(values, index=names)
s
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
dtype: int64
dictionary = {
"Neckarwestheim": 1269,
"Isar 2": 1365,
"Emsland": 1290,
}
s = pd.Series(dictionary)
s
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
dtype: int64
Arithmetic operations and most numpy functions can be applied to pd.Series. An important point is that Series keep their index during such operations.
np.log(s) / s**0.5
Neckarwestheim 0.200600
Isar 2 0.195391
Emsland 0.199418
dtype: float64
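When two Series are combined, pandas aligns values by index label rather than by position. A small sketch with made-up numbers, adding a second Series whose labels appear in a different order:
s2 = pd.Series({"Emsland": 10, "Neckarwestheim": 30, "Isar 2": 20})
s + s2  # values are matched by label, not by position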
We can access the underlying index object if we need to:
s.index
Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')
We can get values back out using the index via the .loc attribute:
s.loc["Isar 2"]
1365
Or by raw position using .iloc
s.iloc[2]
1290
We can pass a list or array to .loc to get multiple rows back:
s.loc[["Neckarwestheim", "Emsland"]]
Neckarwestheim 1269
Emsland 1290
dtype: int64
And we can even use slice notation. Note that, unlike positional slicing, label-based slices include both endpoints:
s.loc["Neckarwestheim":"Emsland"]
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
dtype: int64
s.iloc[:2]
Neckarwestheim 1269
Isar 2 1365
dtype: int64
If we need to, we can always get the raw data back out as well
s.values # a numpy array
array([1269, 1365, 1290])
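Recent pandas versions recommend the .to_numpy() method over the .values attribute:
s.to_numpy()  # equivalent to s.values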
Pandas Data Structures: DataFrame#
There is a lot more to Series, but they are limited to a single column. A more useful pandas data structure is the DataFrame. A DataFrame is essentially a collection of Series that share the same index. It’s a lot like a table in a spreadsheet.
The core constructor is pd.DataFrame(). Below we create a DataFrame from a dictionary of columns.
# first we create a dictionary
data = {
"capacity": [1269, 1365, 1290], # MW
"type": ["PWR", "PWR", "PWR"],
"start_year": [1989, 1988, 1988],
"end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN |
Isar 2 | 1365 | PWR | 1988 | NaN |
Emsland | 1290 | PWR | 1988 | NaN |
We can also switch columns and rows very easily.
df.T
| | Neckarwestheim | Isar 2 | Emsland |
|---|---|---|---|
capacity | 1269 | 1365 | 1290 |
type | PWR | PWR | PWR |
start_year | 1989 | 1988 | 1988 |
end_year | NaN | NaN | NaN |
A wide range of statistical functions are available on both Series and DataFrames.
df.min()
capacity 1269
type PWR
start_year 1988
end_year NaN
dtype: object
df.mean(numeric_only=True)
capacity 1308.000000
start_year 1988.333333
end_year NaN
dtype: float64
df.std(numeric_only=True)
capacity 50.467812
start_year 0.577350
end_year NaN
dtype: float64
df.describe()
| | capacity | start_year | end_year |
|---|---|---|---|
count | 3.000000 | 3.000000 | 0.0 |
mean | 1308.000000 | 1988.333333 | NaN |
std | 50.467812 | 0.577350 | NaN |
min | 1269.000000 | 1988.000000 | NaN |
25% | 1279.500000 | 1988.000000 | NaN |
50% | 1290.000000 | 1988.000000 | NaN |
75% | 1327.500000 | 1988.500000 | NaN |
max | 1365.000000 | 1989.000000 | NaN |
We can get a single column as a Series using Python’s getitem syntax on the DataFrame object.
df["capacity"]
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
Name: capacity, dtype: int64
…or using attribute syntax.
df.capacity
Neckarwestheim 1269
Isar 2 1365
Emsland 1290
Name: capacity, dtype: int64
Indexing works very similarly to Series:
df.loc["Emsland"]
capacity 1290
type PWR
start_year 1988
end_year NaN
Name: Emsland, dtype: object
df.iloc[2]
capacity 1290
type PWR
start_year 1988
end_year NaN
Name: Emsland, dtype: object
But we can also specify the column(s) and row(s) we want to access
df.loc["Emsland", "start_year"]
1988
df.loc[["Emsland", "Neckarwestheim"], ["start_year", "end_year"]]
| | start_year | end_year |
|---|---|---|
Emsland | 1988 | NaN |
Neckarwestheim | 1989 | NaN |
If we make a calculation using columns from the DataFrame, it will keep the same index:
df.capacity * 0.8
Neckarwestheim 1015.2
Isar 2 1092.0
Emsland 1032.0
Name: capacity, dtype: float64
Which we can easily add as another column to the DataFrame:
df["reduced_capacity"] = df.capacity * 0.8
df
| | capacity | type | start_year | end_year | reduced_capacity |
|---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | 1015.2 |
Isar 2 | 1365 | PWR | 1988 | NaN | 1092.0 |
Emsland | 1290 | PWR | 1988 | NaN | 1032.0 |
We can also remove columns or rows from a DataFrame:
df.drop("reduced_capacity", axis="columns")
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN |
Isar 2 | 1365 | PWR | 1988 | NaN |
Emsland | 1290 | PWR | 1988 | NaN |
We can update the variable df by either overwriting df or passing an inplace keyword:
df.drop("reduced_capacity", axis="columns", inplace=True)
We can also drop columns that contain only NaN values:
df.dropna(axis=1, how="all")
| | capacity | type | start_year |
|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 |
Isar 2 | 1365 | PWR | 1988 |
Emsland | 1290 | PWR | 1988 |
Or fill it up with default “fallback” data:
df.fillna(2023)
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | 2023.0 |
Isar 2 | 1365 | PWR | 1988 | 2023.0 |
Emsland | 1290 | PWR | 1988 | 2023.0 |
Say we already have one value for end_year and want to fill in the missing data:
df.loc["Emsland", "end_year"] = 2023
# backward (upwards) fill from non-NaN values
df.bfill()
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | 2023.0 |
Isar 2 | 1365 | PWR | 1988 | 2023.0 |
Emsland | 1290 | PWR | 1988 | 2023.0 |
Sorting Data#
We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values
df.sort_index()
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Emsland | 1290 | PWR | 1988 | 2023.0 |
Isar 2 | 1365 | PWR | 1988 | NaN |
Neckarwestheim | 1269 | PWR | 1989 | NaN |
df.sort_values(by="capacity", ascending=False)
| | capacity | type | start_year | end_year |
|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN |
Emsland | 1290 | PWR | 1988 | 2023.0 |
Neckarwestheim | 1269 | PWR | 1989 | NaN |
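sort_values also accepts a list of column names, sorting by each in turn to break ties. A short sketch:
df.sort_values(by=["start_year", "capacity"])  # sort by year, then by capacity within each year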
Merging Data#
Pandas supports a wide range of methods for merging different datasets. These are described extensively in the documentation. Here we just give a few examples.
data = {
"capacity": [1288, 1360, 1326], # MW
"type": ["BWR", "PWR", "PWR"],
"start_year": [1985, 1985, 1986],
"end_year": [2021, 2021, 2021],
"x": [10.40, 9.41, 9.35],
"y": [48.51, 52.03, 53.85],
}
df2 = pd.DataFrame(data, index=["Gundremmingen", "Grohnde", "Brokdorf"])
df2
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Gundremmingen | 1288 | BWR | 1985 | 2021 | 10.40 | 48.51 |
Grohnde | 1360 | PWR | 1985 | 2021 | 9.41 | 52.03 |
Brokdorf | 1326 | PWR | 1986 | 2021 | 9.35 | 53.85 |
We can now add this additional data to the df object:
df = pd.concat([df, df2])
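pd.concat stacks objects along an axis. For database-style joins on keys or indices, pandas also provides pd.merge and df.join. A minimal sketch with hypothetical operator data:
operators = pd.DataFrame({"operator": ["RWE", "PreussenElektra"]}, index=["Emsland", "Brokdorf"])
df.join(operators)  # left join on the index; plants without a match get NaN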
Filtering Data#
We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.
df.capacity > 1300
Neckarwestheim False
Isar 2 True
Emsland False
Gundremmingen False
Grohnde True
Brokdorf True
Name: capacity, dtype: bool
df[df.capacity > 1300]
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN | NaN | NaN |
Grohnde | 1360 | PWR | 1985 | 2021.0 | 9.41 | 52.03 |
Brokdorf | 1326 | PWR | 1986 | 2021.0 | 9.35 | 53.85 |
We can also combine multiple conditions, but we need to wrap each condition in parentheses:
df[(df.capacity > 1300) & (df.start_year >= 1988)]
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN | NaN | NaN |
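Conditions can also be combined with | (logical or) and negated with ~, again with each condition in parentheses. For instance:
df[(df.type == "BWR") | (df.start_year < 1986)]  # boiling water reactors or plants started before 1986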
Or we can write SQL-like queries:
df.query("start_year == 1988")
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN | NaN | NaN |
Emsland | 1290 | PWR | 1988 | 2023.0 | NaN | NaN |
Variables from the surrounding Python scope can be referenced in a query with the @ prefix:
threshold = 1300
df.query("start_year == 1988 and capacity > @threshold")
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Isar 2 | 1365 | PWR | 1988 | NaN | NaN | NaN |
Modifying Values#
In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc
or .iloc
df.loc["Isar 2", "x"] = 12.29
df.loc["Grohnde", "capacity"] += 1
df
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | NaN | NaN |
Isar 2 | 1365 | PWR | 1988 | NaN | 12.29 | NaN |
Emsland | 1290 | PWR | 1988 | 2023.0 | NaN | NaN |
Gundremmingen | 1288 | BWR | 1985 | 2021.0 | 10.40 | 48.51 |
Grohnde | 1361 | PWR | 1985 | 2021.0 | 9.41 | 52.03 |
Brokdorf | 1326 | PWR | 1986 | 2021.0 | 9.35 | 53.85 |
operational = ["Neckarwestheim", "Isar 2", "Emsland"]
df.loc[operational, "y"] = [49.04, 48.61, 52.47]
df
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | NaN | 49.04 |
Isar 2 | 1365 | PWR | 1988 | NaN | 12.29 | 48.61 |
Emsland | 1290 | PWR | 1988 | 2023.0 | NaN | 52.47 |
Gundremmingen | 1288 | BWR | 1985 | 2021.0 | 10.40 | 48.51 |
Grohnde | 1361 | PWR | 1985 | 2021.0 | 9.41 | 52.03 |
Brokdorf | 1326 | PWR | 1986 | 2021.0 | 9.35 | 53.85 |
Applying Functions#
Sometimes it can be useful to apply a function to all values of a column/row. For instance, we might be interested in normalised capacities relative to the largest nuclear power plant:
df.capacity.apply(lambda x: x / df.capacity.max())
Neckarwestheim 0.929670
Isar 2 1.000000
Emsland 0.945055
Gundremmingen 0.943590
Grohnde 0.997070
Brokdorf 0.971429
Name: capacity, dtype: float64
The .map() method works similarly:
df.capacity.map(lambda x: x / df.capacity.max())
Neckarwestheim 0.929670
Isar 2 1.000000
Emsland 0.945055
Gundremmingen 0.943590
Grohnde 0.997070
Brokdorf 0.971429
Name: capacity, dtype: float64
For simple functions, there’s often an easier alternative:
df.capacity / df.capacity.max()
Neckarwestheim 0.929670
Isar 2 1.000000
Emsland 0.945055
Gundremmingen 0.943590
Grohnde 0.997070
Brokdorf 0.971429
Name: capacity, dtype: float64
But .apply() and .map() often give you more flexibility.
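For instance, .apply() can also work row-wise on a DataFrame when passed axis="columns". A sketch that classifies each plant by whether an end_year is recorded:
df.apply(lambda row: "closed" if pd.notna(row["end_year"]) else "operational", axis="columns")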
Renaming Indices and Columns#
Sometimes it can be useful to rename columns:
df.rename(columns=dict(x="lon", y="lat"))
| | capacity | type | start_year | end_year | lon | lat |
|---|---|---|---|---|---|---|
Neckarwestheim | 1269 | PWR | 1989 | NaN | NaN | 49.04 |
Isar 2 | 1365 | PWR | 1988 | NaN | 12.29 | 48.61 |
Emsland | 1290 | PWR | 1988 | 2023.0 | NaN | 52.47 |
Gundremmingen | 1288 | BWR | 1985 | 2021.0 | 10.40 | 48.51 |
Grohnde | 1361 | PWR | 1985 | 2021.0 | 9.41 | 52.03 |
Brokdorf | 1326 | PWR | 1986 | 2021.0 | 9.35 | 53.85 |
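rename also accepts functions, which are applied to every label. For example:
df.rename(columns=str.upper)  # upper-case all column names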
Replacing Values#
Sometimes it can be useful to replace values:
df.replace({"PWR": "Pressurized water reactor"})
| | capacity | type | start_year | end_year | x | y |
|---|---|---|---|---|---|---|
Neckarwestheim | 1269 | Pressurized water reactor | 1989 | NaN | NaN | 49.04 |
Isar 2 | 1365 | Pressurized water reactor | 1988 | NaN | 12.29 | 48.61 |
Emsland | 1290 | Pressurized water reactor | 1988 | 2023.0 | NaN | 52.47 |
Gundremmingen | 1288 | BWR | 1985 | 2021.0 | 10.40 | 48.51 |
Grohnde | 1361 | Pressurized water reactor | 1985 | 2021.0 | 9.41 | 52.03 |
Brokdorf | 1326 | Pressurized water reactor | 1986 | 2021.0 | 9.35 | 53.85 |
Plotting#
DataFrames have all kinds of useful plotting built in. Note that we do not even have to import matplotlib
for this.
df.plot(kind="scatter", x="start_year", y="capacity")
<Axes: xlabel='start_year', ylabel='capacity'>
df.capacity.plot.barh(color="orange")
<Axes: >
Time Indexes#
Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great when handling time-dependent data.
time = pd.date_range(start="2021-01-01", end="2023-01-01", freq="D")
values = np.sin(2 * np.pi * time.dayofyear / 365)
ts = pd.Series(values, index=time)
ts.plot()
<Axes: >
We can use Python’s slicing notation inside .loc to select a date range:
ts.loc["2021-01-01":"2021-07-01"].plot()
<Axes: >
ts.loc["2021-05"].plot()
<Axes: >
The pd.DatetimeIndex object has lots of useful attributes:
ts.index.month
Index([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
12, 12, 12, 12, 12, 12, 12, 12, 12, 1],
dtype='int32', length=731)
ts.index.day
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
23, 24, 25, 26, 27, 28, 29, 30, 31, 1],
dtype='int32', length=731)
Another common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset aliases (e.g. "D" for daily, "ME" for month end).
Below we resample the dataset by taking the mean over each month.
ts.resample("ME").mean().head()
2021-01-31 0.268746
2021-02-28 0.698782
2021-03-31 0.949778
2021-04-30 0.959332
2021-05-31 0.709200
Freq: ME, dtype: float64
ts.resample("ME").mean().plot()
<Axes: >
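Moving-window (“rolling”) aggregations, mentioned in the feature list at the top, work in a similar spirit. A sketch computing a 30-day moving average:
ts.rolling(window=30).mean().plot()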
Reading and Writing Files#
To read data into pandas, we can use for instance the pd.read_csv()
function. This function is incredibly powerful and complex with a multitude of settings. You can use it to extract data from almost any text file.
The pd.read_csv()
function can take a path to a local file as an input, or even a link to an online text file.
Let’s import a slightly larger dataset about the power plant fleet in Europe.
fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"
df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]
| id | Name | Fueltype | Technology | Set | Country | Capacity | Efficiency | DateIn | DateRetrofit | DateOut |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Borssele | Hard Coal | Steam Turbine | PP | Netherlands | 485.0 | NaN | 1973.0 | NaN | 2034.0 |
1 | Flamanville | Nuclear | Steam Turbine | PP | France | 2660.0 | NaN | 1985.0 | NaN | 2051.0 |
2 | Emsland | Nuclear | Steam Turbine | PP | Germany | 1336.0 | 0.33 | 1988.0 | 1988.0 | 2023.0 |
3 | Kernkraftwerk Gosgen | Nuclear | Steam Turbine | PP | Switzerland | 1020.0 | NaN | 1979.0 | 1979.0 | 2051.0 |
4 | Oskarshamn | Nuclear | Steam Turbine | PP | Sweden | 1400.0 | NaN | 1972.0 | 2006.0 | 2051.0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 22818 entries, 0 to 23116
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 22818 non-null object
1 Fueltype 22818 non-null object
2 Technology 22356 non-null object
3 Set 22818 non-null object
4 Country 22818 non-null object
5 Capacity 22818 non-null float64
6 Efficiency 542 non-null float64
7 DateIn 16980 non-null float64
8 DateRetrofit 2438 non-null float64
9 DateOut 683 non-null float64
10 lat 22818 non-null float64
11 lon 22818 non-null float64
12 Duration 615 non-null float64
13 Volume_Mm3 22818 non-null float64
14 DamHeight_m 22818 non-null float64
15 StorageCapacity_MWh 22818 non-null float64
16 EIC 22818 non-null object
17 projectID 22818 non-null object
dtypes: float64(11), object(7)
memory usage: 3.3+ MB
df.describe()
| | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | lat | lon | Duration | Volume_Mm3 | DamHeight_m | StorageCapacity_MWh |
|---|---|---|---|---|---|---|---|---|---|---|---|
count | 22818.000000 | 542.000000 | 16980.000000 | 2438.000000 | 683.000000 | 22818.000000 | 22818.000000 | 615.000000 | 22818.000000 | 22818.000000 | 22818.000000 |
mean | 57.799302 | 0.480055 | 2004.922438 | 1988.515176 | 2019.980966 | 49.042717 | 8.798689 | 1295.789678 | 3.771375 | 8.372420 | 464.285028 |
std | 225.901017 | 0.179453 | 48.109169 | 25.480213 | 12.323877 | 6.159855 | 9.112971 | 1542.021891 | 89.711399 | 52.640524 | 7561.917809 |
min | 0.000000 | 0.140228 | 0.000000 | 1899.000000 | 1969.000000 | 32.647300 | -27.069900 | 0.007907 | 0.000000 | 0.000000 | 0.000000 |
25% | 3.000000 | 0.356400 | 2005.000000 | 1971.000000 | 2015.000000 | 44.888750 | 1.887070 | 90.676522 | 0.000000 | 0.000000 | 0.000000 |
50% | 11.000000 | 0.385300 | 2012.000000 | 1997.000000 | 2020.000000 | 49.365235 | 9.152700 | 795.600000 | 0.000000 | 0.000000 | 0.000000 |
75% | 32.000000 | 0.580829 | 2018.000000 | 2009.000000 | 2026.000000 | 52.318355 | 13.693179 | 2023.709052 | 0.000000 | 0.000000 | 0.000000 |
max | 6000.000000 | 0.917460 | 2030.000000 | 2020.000000 | 2051.000000 | 71.012300 | 39.655350 | 16840.000000 | 9500.000000 | 1800.000000 | 421000.000000 |
Sometimes, we also want to store a DataFrame for later use. There are many different file formats tabular data can be stored in, including HTML, JSON, Excel, Parquet, Feather, etc. Here, let’s say we want to store the DataFrame as a CSV (comma-separated values) file under the name “tmp.csv”.
df.to_csv("tmp.csv")
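The corresponding reader functions follow the pattern pd.read_*. For instance, we can read the file straight back in:
df = pd.read_csv("tmp.csv", index_col=0)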
Groupby Functionality#
Both Series
and DataFrame
objects have a groupby
method. It accepts a variety of arguments, but the simplest way to think about it is that you pass another series, whose unique values are used to split the original object into different groups. groupby
is an amazingly powerful but also complex function.
Here’s an example which retrieves the total generation capacity per country.
grouped = df.groupby("Country").Capacity.sum()
grouped.head()
Country
Albania 2361.708275
Austria 25458.200368
Belgium 21426.651009
Bosnia and Herzegovina 4827.195964
Bulgaria 15699.186363
Name: Capacity, dtype: float64
Chaining operations together like this is very common with pandas.
Let’s break apart this operation a bit. The workflow with groupby can be divided into three general steps:
Split: Partition the data into different groups based on some criterion.
Apply: Do some calculation within each group. Different types of apply steps might be:
Aggregation: Get the mean or max within the group.
Transformation: Normalize all the values within a group.
Filtration: Eliminate some groups based on a criterion.
Combine: Put the results back together into a single object.
gb = df.groupby("Country")
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f31d7f797d0>
The length tells us how many groups were found:
len(gb)
36
All of the groups are available as a dictionary via the .groups attribute:
groups = gb.groups
len(groups)
36
list(groups.keys())[:5]
['Albania', 'Austria', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria']
Now that we know how to create a GroupBy object, let’s learn how to do aggregation on it.
gb.Capacity.sum().nlargest(5)
Country
Germany 212675.829096
Spain 150108.858726
United Kingdom 149598.567501
France 148014.692084
Italy 101488.386461
Name: Capacity, dtype: float64
gb["DateIn"].mean().head()
Country
Albania 1994.666667
Austria 1987.307692
Belgium 2001.708029
Bosnia and Herzegovina 1992.200000
Bulgaria 1998.116279
Name: DateIn, dtype: float64
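Aggregation is only one of the apply steps listed above. As a sketch, a transformation and a filtration on the same grouping might look like this:
# transformation: each plant's capacity as a share of its country's total
df.groupby("Country").Capacity.transform(lambda x: x / x.sum())
# filtration: keep only plants in countries with more than 2000 recorded plants
df.groupby("Country").filter(lambda g: len(g) > 2000)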
Grouping is not only possible on a single column, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of column names to the groupby function.
capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum()
capacities
Country Fueltype
Albania Hydro 1743.353732
Oil 89.854543
Solar 294.500000
Wind 234.000000
Austria Hard Coal 1331.400000
...
United Kingdom Other 55.000000
Solar 11668.600000
Solid Biomass 4919.000000
Waste 288.900000
Wind 38670.300000
Name: Capacity, Length: 243, dtype: float64
By grouping by multiple attributes, our index becomes a pd.MultiIndex (a hierarchical index with multiple levels).
capacities.index[:5]
MultiIndex([('Albania', 'Hydro'),
('Albania', 'Oil'),
('Albania', 'Solar'),
('Albania', 'Wind'),
('Austria', 'Hard Coal')],
names=['Country', 'Fueltype'])
type(capacities.index)
pandas.core.indexes.multi.MultiIndex
We can use the .unstack function to reshape the multi-indexed pd.Series into a pd.DataFrame which has the second index level as columns.
capacities.unstack().tail().T
| Fueltype | Spain | Sweden | Switzerland | Ukraine | United Kingdom |
|---|---|---|---|---|---|
Biogas | NaN | NaN | NaN | NaN | 31.000000 |
Geothermal | NaN | NaN | NaN | NaN | NaN |
Hard Coal | 11904.878478 | 291.000000 | NaN | 23782.495570 | 33823.617061 |
Hydro | 26069.861248 | 14273.686625 | 20123.3008 | 6599.452229 | 4576.175000 |
Lignite | 1831.400000 | NaN | NaN | NaN | NaN |
Natural Gas | 28394.244000 | 2358.000000 | 55.0000 | 4687.900000 | 36284.975440 |
Nuclear | 7733.200000 | 9859.000000 | 3355.0000 | 17635.000000 | 19181.000000 |
Oil | 1854.371000 | 1685.000000 | NaN | NaN | 100.000000 |
Other | NaN | NaN | NaN | NaN | 55.000000 |
Solar | 36998.200000 | 281.800000 | 96.8000 | 5628.700000 | 11668.600000 |
Solid Biomass | 563.000000 | 2432.600000 | NaN | NaN | 4919.000000 |
Waste | 388.054000 | NaN | NaN | NaN | 288.900000 |
Wind | 34371.650000 | 16958.800000 | 55.0000 | 461.400000 | 38670.300000 |
Exercises#
Power Plants Data#
Run the function .describe() on the DataFrame that includes the power plant database:
df.describe()
| | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | lat | lon | Duration | Volume_Mm3 | DamHeight_m | StorageCapacity_MWh |
|---|---|---|---|---|---|---|---|---|---|---|---|
count | 22818.000000 | 542.000000 | 16980.000000 | 2438.000000 | 683.000000 | 22818.000000 | 22818.000000 | 615.000000 | 22818.000000 | 22818.000000 | 22818.000000 |
mean | 57.799302 | 0.480055 | 2004.922438 | 1988.515176 | 2019.980966 | 49.042717 | 8.798689 | 1295.789678 | 3.771375 | 8.372420 | 464.285028 |
std | 225.901017 | 0.179453 | 48.109169 | 25.480213 | 12.323877 | 6.159855 | 9.112971 | 1542.021891 | 89.711399 | 52.640524 | 7561.917809 |
min | 0.000000 | 0.140228 | 0.000000 | 1899.000000 | 1969.000000 | 32.647300 | -27.069900 | 0.007907 | 0.000000 | 0.000000 | 0.000000 |
25% | 3.000000 | 0.356400 | 2005.000000 | 1971.000000 | 2015.000000 | 44.888750 | 1.887070 | 90.676522 | 0.000000 | 0.000000 | 0.000000 |
50% | 11.000000 | 0.385300 | 2012.000000 | 1997.000000 | 2020.000000 | 49.365235 | 9.152700 | 795.600000 | 0.000000 | 0.000000 | 0.000000 |
75% | 32.000000 | 0.580829 | 2018.000000 | 2009.000000 | 2026.000000 | 52.318355 | 13.693179 | 2023.709052 | 0.000000 | 0.000000 | 0.000000 |
max | 6000.000000 | 0.917460 | 2030.000000 | 2020.000000 | 2051.000000 | 71.012300 | 39.655350 | 16840.000000 | 9500.000000 | 1800.000000 | 421000.000000 |
Provide a list of unique fuel types included in the dataset
df.Fueltype.unique()
array(['Hard Coal', 'Nuclear', 'Hydro', 'Lignite', 'Natural Gas',
'Solid Biomass', 'Oil', 'Other', 'Solar', 'Wind', 'Geothermal',
'Waste', 'Biogas'], dtype=object)
Provide a list of unique technologies included in the dataset
df.Technology.unique()
array(['Steam Turbine', 'Reservoir', 'Run-Of-River', 'Pumped Storage',
'CCGT', nan, 'Unknown', 'Pv', 'Onshore', 'Marine', 'Offshore',
'Not Found', 'Combustion Engine', 'PV', 'CSP', 'unknown',
'not found'], dtype=object)
Filter the dataset by power plants with the fuel type “Hard Coal”
coal = df.loc[df.Fueltype == "Hard Coal"]
coal
| id | Name | Fueltype | Technology | Set | Country | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | lat | lon | Duration | Volume_Mm3 | DamHeight_m | StorageCapacity_MWh | EIC | projectID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Borssele | Hard Coal | Steam Turbine | PP | Netherlands | 485.000000 | NaN | 1973.0 | NaN | 2034.0 | 51.433200 | 3.716000 | NaN | 0.0 | 0.0 | 0.0 | {'49W000000000054X'} | {'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49... |
26 | Mellach | Hard Coal | Steam Turbine | CHP | Austria | 200.000000 | NaN | 1986.0 | 1986.0 | 2020.0 | 46.911700 | 15.488300 | NaN | 0.0 | 0.0 | 0.0 | {'14W-WML-KW-----0'} | {'BEYONDCOAL': {'BEYOND-AT-11'}, 'ENTSOE': {'1... |
136 | Bastardo | Hard Coal | Steam Turbine | PP | Italy | 138.290543 | NaN | 1989.0 | NaN | 2025.0 | 42.893300 | 12.539700 | NaN | 0.0 | 0.0 | 0.0 | {'26WIMPI-S10BSTRJ'} | {'BEYONDCOAL': {'BEYOND-IT-10'}, 'ENTSOE': {'2... |
137 | Brindisi Sud | Hard Coal | Steam Turbine | PP | Italy | 1825.435174 | NaN | 1991.0 | NaN | 2025.0 | 40.563900 | 18.032200 | NaN | 0.0 | 0.0 | 0.0 | {'26WIMPI-S16BSCRY'} | {'BEYONDCOAL': {'BEYOND-IT-2'}, 'ENTSOE': {'26... |
138 | Fusina | Hard Coal | Steam Turbine | PP | Italy | 899.810470 | NaN | 1964.0 | NaN | 2025.0 | 45.431400 | 12.245800 | NaN | 0.0 | 0.0 | 0.0 | {'26WIMPI-S05FTSNK'} | {'BEYONDCOAL': {'BEYOND-IT-24'}, 'ENTSOE': {'2... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22671 | Westerholt | Hard Coal | Steam Turbine | PP | Germany | 300.000000 | NaN | 1959.0 | NaN | 2005.0 | 51.601111 | 7.063889 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan} | {'GEM': {'G100000114638', 'G100000114637'}} |
22731 | Wilhelmshaven Engie | Hard Coal | Steam Turbine | PP | Germany | 830.000000 | NaN | 2015.0 | NaN | NaN | 53.574800 | 8.137000 | NaN | 0.0 | 0.0 | 0.0 | {nan} | {'GEM': {'G100000110939'}} |
22749 | Wilton International | Hard Coal | Steam Turbine | PP | United Kingdom | 66.600000 | NaN | 1964.0 | NaN | 2007.0 | 54.589400 | -1.118500 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan} | {'GEM': {'G100000110950', 'G100000110949'}} |
22805 | Wolfsburg West | Hard Coal | Steam Turbine | PP | Germany | 306.000000 | NaN | 1985.0 | NaN | 2024.0 | 52.441544 | 10.765079 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan} | {'GEM': {'G100000110964', 'G100000110963'}} |
23021 | Zeltweg | Hard Coal | Steam Turbine | PP | Austria | 137.000000 | NaN | 1962.0 | NaN | 2001.0 | 47.250000 | 15.166667 | NaN | 0.0 | 0.0 | 0.0 | {nan} | {'GEM': {'G100000111546'}} |
339 rows × 18 columns
Identify the 5 largest coal power plants. In which countries are they located? When were they built?
coal.loc[coal.Capacity.nlargest(5).index]
| id | Name | Fueltype | Technology | Set | Country | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | lat | lon | Duration | Volume_Mm3 | DamHeight_m | StorageCapacity_MWh | EIC | projectID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
224 | Kozienice | Hard Coal | Steam Turbine | PP | Poland | 3682.216205 | NaN | 1972.0 | NaN | 2042.0 | 51.6647 | 21.46670 | NaN | 0.0 | 0.0 | 0.0 | {'19W000000000104I', '19W000000000095U'} | {'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1... |
3465 | Zaporiska | Hard Coal | CCGT | PP | Ukraine | 3318.973044 | NaN | 1972.0 | NaN | NaN | 47.5089 | 34.62530 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan, nan, nan, nan, nan, nan} | {'GEO': {'GEO-42988'}, 'GPD': {'WRI1005101'}} |
3466 | Vuhlehirska Uglegorskaya | Hard Coal | CCGT | PP | Ukraine | 3318.973044 | NaN | 1972.0 | NaN | NaN | 48.4633 | 38.20328 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan, nan, nan, nan, nan, nan} | {'GEO': {'GEO-43001'}, 'GPD': {'WRI1005107'}} |
761 | Opole | Hard Coal | Steam Turbine | PP | Poland | 3071.893939 | NaN | 1993.0 | NaN | 2020.0 | 50.7518 | 17.88200 | NaN | 0.0 | 0.0 | 0.0 | {'19W0000000001292'} | {'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1... |
769 | Longannet | Hard Coal | Steam Turbine | PP | United Kingdom | 2400.000000 | NaN | 1970.0 | NaN | 2016.0 | 56.0500 | -3.68120 | NaN | 0.0 | 0.0 | 0.0 | {'48WSTN00000LOAND'} | {'BEYONDCOAL': {'BEYOND-UK-87'}, 'ENTSOE': {'4... |
Identify the power plant with the longest “Name”.
i = df.Name.map(lambda x: len(x)).argmax()
df.iloc[i]
Name Kesznyeten Hernadviz Hungary Kesznyeten Hernad...
Fueltype Hydro
Technology Run-Of-River
Set PP
Country Hungary
Capacity 4.4
Efficiency NaN
DateIn 1945.0
DateRetrofit 1992.0
DateOut NaN
lat 47.995972
lon 21.033037
Duration NaN
Volume_Mm3 0.0
DamHeight_m 14.0
StorageCapacity_MWh 0.0
EIC {nan}
projectID {'JRC': {'JRC-H2138'}, 'GEO': {'GEO-42677'}}
Name: 3017, dtype: object
Identify the 10 northernmost power plants. What type of power plants are they?
index = df.lat.nlargest(10).index
df.loc[index]
| id | Name | Fueltype | Technology | Set | Country | Capacity | Efficiency | DateIn | DateRetrofit | DateOut | lat | lon | Duration | Volume_Mm3 | DamHeight_m | StorageCapacity_MWh | EIC | projectID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12704 | Havoygavlen Wind Farm | Wind | Onshore | PP | Norway | 78.0 | NaN | 2003.0 | NaN | 2021.0 | 71.012300 | 24.594200 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan} | {'GEM': {'G100000917626', 'G100000917445'}} |
13840 | Kjollefjord Wind Farm | Wind | Onshore | PP | Norway | 39.0 | NaN | 2006.0 | NaN | NaN | 70.918500 | 27.289900 | NaN | 0.0 | 0.0 | 0.0 | {nan} | {'GEM': {'G100000917523'}} |
2446 | Repvag | Hydro | Reservoir | Store | Norway | 4.4 | NaN | 1953.0 | 1953.0 | NaN | 70.773547 | 25.616486 | 2810.704545 | 28.3 | 172.0 | 12367.1 | {nan} | {'JRC': {'JRC-N339'}, 'OPSD': {'OEU-4174'}} |
18714 | Raggovidda Wind Farm | Wind | Onshore | PP | Norway | 97.0 | NaN | 2014.0 | NaN | NaN | 70.765700 | 29.083300 | NaN | 0.0 | 0.0 | 0.0 | {nan, nan} | {'GEM': {'G100000918029', 'G100000918729'}} |
2709 | Maroyfjord | Hydro | Reservoir | Store | Norway | 4.4 | NaN | 1956.0 | 1956.0 | NaN | 70.751421 | 27.355047 | 1533.681818 | 13.8 | 225.3 | 6748.2 | {nan} | {'JRC': {'JRC-N289'}, 'OPSD': {'OEU-4036'}} |
2621 | Melkoya | Natural Gas | CCGT | PP | Norway | 230.0 | NaN | 2007.0 | 2007.0 | NaN | 70.689366 | 23.600448 | NaN | 0.0 | 0.0 | 0.0 | {'50WP00000000456T'} | {'ENTSOE': {'50WP00000000456T'}, 'OPSD': {'OEU... |
12573 | Hammerfest Snohvit Terminal | Natural Gas | CCGT | PP | Norway | 229.0 | NaN | NaN | NaN | NaN | 70.685400 | 23.590000 | NaN | 0.0 | 0.0 | 0.0 | {nan} | {'GEM': {'L100000407847'}} |
12575 | Hamnefjell Wind Farm | Wind | Onshore | PP | Norway | 52.0 | NaN | 2017.0 | NaN | NaN | 70.667900 | 29.719000 | NaN | 0.0 | 0.0 | 0.0 | {nan} | {'GEM': {'G100000918725'}} |
3320 | Hammerfest | Hydro | Reservoir | Store | Norway | 1.1 | NaN | 1947.0 | 1947.0 | NaN | 70.657936 | 23.714418 | 1638.181818 | 10.6 | 88.0 | 1802.0 | {nan} | {'JRC': {'JRC-N127'}, 'OPSD': {'OEU-3550'}} |
2797 | Kongsfjord | Hydro | Reservoir | Store | Norway | 4.4 | NaN | 1946.0 | 1946.0 | NaN | 70.598729 | 29.057905 | 3343.795455 | 88.1 | 70.5 | 14712.7 | {nan} | {'JRC': {'JRC-N210'}, 'OPSD': {'OEU-3801'}} |
What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?
df.groupby("Fueltype").DateIn.mean().sort_values()
Fueltype
Hard Coal 1972.081571
Hydro 1973.738442
Lignite 1975.611511
Nuclear 1975.785047
Oil 1986.655172
Other 1995.931034
Natural Gas 1996.416279
Waste 1996.921053
Geothermal 2000.142857
Solid Biomass 2000.337349
Biogas 2001.878049
Wind 2009.216257
Solar 2015.740711
Name: DateIn, dtype: float64
Plot a histogram of power plant capacities with bins of length 100 MW between 0 and 4000 MW. What do you observe?
df.Capacity.plot.hist(bins=np.arange(0, 4001, 100))
<Axes: ylabel='Frequency'>
How many power plants of each fuel type are there in each country? Display the results in a DataFrame with countries as index and fuel type as columns. Fill missing values with the value zero. Convert all values to integers.
Browse Google or the pandas documentation to find the right aggregation function to count values.
df.groupby(["Country", "Fueltype"]).size().unstack().fillna(0.0).astype(int)
| Country | Biogas | Geothermal | Hard Coal | Hydro | Lignite | Natural Gas | Nuclear | Oil | Other | Solar | Solid Biomass | Waste | Wind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Albania | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 1 | 0 | 11 | 0 | 0 | 1 |
Austria | 0 | 0 | 6 | 174 | 2 | 19 | 0 | 0 | 0 | 61 | 0 | 0 | 93 |
Belgium | 0 | 0 | 10 | 13 | 0 | 27 | 3 | 8 | 0 | 76 | 6 | 8 | 89 |
Bosnia and Herzegovina | 0 | 0 | 0 | 17 | 10 | 0 | 0 | 0 | 0 | 19 | 0 | 0 | 7 |
Bulgaria | 0 | 0 | 4 | 30 | 9 | 6 | 1 | 0 | 0 | 215 | 0 | 0 | 17 |
Croatia | 0 | 0 | 1 | 21 | 0 | 7 | 0 | 0 | 0 | 26 | 0 | 0 | 27 |
Czechia | 0 | 0 | 11 | 14 | 27 | 8 | 2 | 0 | 0 | 331 | 1 | 0 | 6 |
Denmark | 0 | 0 | 12 | 0 | 0 | 14 | 0 | 1 | 1 | 73 | 9 | 0 | 112 |
Estonia | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 71 | 0 | 0 | 15 |
Finland | 25 | 0 | 17 | 105 | 6 | 32 | 2 | 14 | 3 | 13 | 21 | 0 | 155 |
France | 0 | 0 | 16 | 155 | 1 | 42 | 23 | 6 | 2 | 1268 | 5 | 0 | 1054 |
Germany | 38 | 0 | 94 | 729 | 33 | 257 | 26 | 26 | 23 | 3989 | 22 | 59 | 1498 |
Greece | 0 | 0 | 0 | 20 | 10 | 18 | 0 | 0 | 0 | 315 | 0 | 1 | 192 |
Hungary | 0 | 0 | 5 | 4 | 1 | 17 | 1 | 1 | 0 | 319 | 4 | 0 | 10 |
Ireland | 0 | 0 | 1 | 6 | 2 | 17 | 0 | 4 | 0 | 33 | 1 | 0 | 123 |
Italy | 0 | 25 | 18 | 289 | 0 | 122 | 4 | 2 | 2 | 249 | 12 | 0 | 280 |
Kosovo | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 2 |
Latvia | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 12 | 0 | 0 | 3 |
Lithuania | 0 | 0 | 0 | 2 | 0 | 4 | 1 | 0 | 0 | 32 | 1 | 0 | 23 |
Luxembourg | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 9 |
Moldova | 0 | 0 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 8 | 0 | 0 | 0 |
Montenegro | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 |
Netherlands | 1 | 0 | 9 | 0 | 0 | 53 | 1 | 1 | 2 | 221 | 12 | 0 | 148 |
North Macedonia | 0 | 0 | 0 | 10 | 2 | 2 | 0 | 0 | 0 | 34 | 0 | 0 | 2 |
Norway | 0 | 0 | 0 | 930 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 60 |
Poland | 0 | 0 | 54 | 19 | 8 | 22 | 0 | 1 | 0 | 291 | 6 | 0 | 223 |
Portugal | 0 | 0 | 3 | 44 | 0 | 18 | 0 | 0 | 0 | 92 | 4 | 0 | 125 |
Romania | 0 | 0 | 3 | 119 | 15 | 15 | 1 | 0 | 0 | 231 | 1 | 0 | 48 |
Serbia | 0 | 0 | 0 | 13 | 10 | 3 | 0 | 0 | 0 | 9 | 1 | 0 | 10 |
Slovakia | 0 | 0 | 4 | 29 | 3 | 6 | 2 | 0 | 0 | 105 | 0 | 0 | 0 |
Slovenia | 0 | 0 | 1 | 28 | 4 | 3 | 1 | 0 | 0 | 7 | 0 | 0 | 0 |
Spain | 0 | 0 | 22 | 359 | 7 | 88 | 7 | 12 | 0 | 1288 | 11 | 15 | 843 |
Sweden | 0 | 0 | 2 | 147 | 0 | 12 | 5 | 4 | 0 | 37 | 35 | 0 | 254 |
Switzerland | 0 | 0 | 0 | 471 | 0 | 1 | 5 | 0 | 0 | 48 | 0 | 0 | 3 |
Ukraine | 0 | 0 | 19 | 9 | 0 | 11 | 5 | 0 | 0 | 427 | 0 | 0 | 11 |
United Kingdom | 2 | 0 | 26 | 101 | 0 | 93 | 17 | 1 | 1 | 1137 | 36 | 10 | 406 |
Time Series Analysis#
Read in the time series from the second lecture into a DataFrame.
The file is available at https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv and includes hourly time series for Germany in 2015 for:
electricity demand from OPSD in GW
onshore wind capacity factors from renewables.ninja in per-unit of installed capacity
offshore wind capacity factors from renewables.ninja in per-unit of installed capacity
solar PV capacity factors from renewables.ninja in per-unit of installed capacity
electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform
Use the function pd.read_csv with the keyword arguments index_col= and parse_dates= to ensure the time stamps are treated as pd.DatetimeIndex.
# your code here
The start of the DataFrame should look like this:
| | load | onwind | offwind | solar | prices |
|---|---|---|---|---|---|
| 2015-01-01 00:00:00 | 41.151 | 0.1566 | 0.703 | 0 | nan |
| 2015-01-01 01:00:00 | 40.135 | 0.1659 | 0.6875 | 0 | nan |
| 2015-01-01 02:00:00 | 39.106 | 0.1746 | 0.6535 | 0 | nan |
| 2015-01-01 03:00:00 | 38.765 | 0.1745 | 0.6803 | 0 | nan |
| 2015-01-01 04:00:00 | 38.941 | 0.1826 | 0.7272 | 0 | nan |
And it should pass the following test:
assert type(df.index) == pd.DatetimeIndex
For each column:
What are the average, minimum and maximum values?
Find the time stamps where data on prices is missing.
Fill up the missing data with the prices observed one week ahead.
Plot the time series for the full year.
Plot the time series for the month May.
Resample the time series to daily, weekly, and monthly frequencies and plot the resulting time series in one graph.
Sort the values in descending order and plot the duration curve. Hint: Run .reset_index(drop=True) to drop the index after sorting.
Plot a histogram of the time series values.
Perform a Fourier transformation of the time series. What are the dominant frequencies? Hint: Below you can find an example of how a Fourier transformation can be done with numpy.
Calculate the Pearson correlation coefficients between all time series. Hint: There is a function for that. Google for “pandas dataframe correlation”.
abs(pd.Series(np.fft.rfft(df.solar - df.solar.mean()), index=np.fft.rfftfreq(len(df.solar), d=1./8760))**2)
# your code here