Basics pandas

Basics `pandas`#

Note

If you have not yet set up Python on your computer, you can execute this tutorial in your browser via Google Colab. Click on the rocket in the top right corner and launch “Colab”. If that doesn’t work download the .ipynb file and import it in Google Colab. Then install pandas and numpy by executing the following command in a Jupyter cell at the top of the notebook.

!pip install -q pandas numpy

Pandas is a an open source library providing tabular data structures and data analysis tools. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Note

Documentation for this package is available at https://pandas.pydata.org/docs/.

Package Imports#

This will be our first experience with importing a package.

Usually we import pandas with the alias pd.

We might also need numpy, Python’s main library for numerical computations.

import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.offline as py
pd.options.plotting.backend = "plotly"

Series#

A Series represents a one-dimensional array of data. It is similar to a dictionary consisting of an index and values, but has more functions.

Note

Example data on Germany’s final six nuclear power plants is from Wikipedia.

dictionary = {
    "Neckarwestheim": 1269,
    "Isar 2": 1365,
    "Emsland": 1290,
}
s = pd.Series(dictionary)
s

Neckarwestheim    1269
Isar 2            1365
Emsland           1290
dtype: int64

Arithmetic operations can be applied to the whole pd.Series.

s**0.5

Neckarwestheim    35.623026
Isar 2            36.945906
Emsland           35.916570
dtype: float64

We can access the underlying index object if we need to:

s.index

Index(['Neckarwestheim', 'Isar 2', 'Emsland'], dtype='object')

We can get values back out using the index via the .loc attribute

s.loc["Isar 2"]

np.int64(1365)

We can pass a list or array to loc to get multiple rows back:

s.loc[["Neckarwestheim", "Emsland"]]

Neckarwestheim    1269
Emsland           1290
dtype: int64

DataFrame#

Series are limited to a single column. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index.

data = {
    "capacity": [1269, 1365, 1290],  # MW
    "type": ["PWR", "PWR", "PWR"],
    "start_year": [1989, 1988, 1988],
    "end_year": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data, index=["Neckarwestheim", "Isar 2", "Emsland"])
df

	capacity	type	start_year	end_year
Neckarwestheim	1269	PWR	1989	NaN
Isar 2	1365	PWR	1988	NaN
Emsland	1290	PWR	1988	NaN

A wide range of statistical functions are available on both Series and DataFrames.

df.min()

capacity      1269
type           PWR
start_year    1988
end_year       NaN
dtype: object

We can get a single column as a Series using python’s getitem syntax on the DataFrame object.

df["capacity"]

Neckarwestheim    1269
Isar 2            1365
Emsland           1290
Name: capacity, dtype: int64

Indexing works very similar to series

df.loc["Emsland"]

capacity      1290
type           PWR
start_year    1988
end_year       NaN
Name: Emsland, dtype: object

But we can also specify the column(s) and row(s) we want to access

df.at["Emsland", "start_year"]

np.int64(1988)

We can also add new columns to the DataFrame:

df["reduced_capacity"] = df.capacity * 0.8
df

	capacity	type	start_year	end_year	reduced_capacity
Neckarwestheim	1269	PWR	1989	NaN	1015.2
Isar 2	1365	PWR	1988	NaN	1092.0
Emsland	1290	PWR	1988	NaN	1032.0

Sorting Data#

We can also sort the entries in dataframes, e.g. alphabetically by index or numerically by column values

df.sort_index()

	capacity	type	start_year	end_year	reduced_capacity
Emsland	1290	PWR	1988	NaN	1032.0
Isar 2	1365	PWR	1988	NaN	1092.0
Neckarwestheim	1269	PWR	1989	NaN	1015.2

df.sort_values(by="capacity", ascending=True)

	capacity	type	start_year	end_year	reduced_capacity
Neckarwestheim	1269	PWR	1989	NaN	1015.2
Emsland	1290	PWR	1988	NaN	1032.0
Isar 2	1365	PWR	1988	NaN	1092.0

Filtering Data#

We can also filter a DataFrame using a boolean series obtained from a condition. This is very useful to build subsets of the DataFrame.

df.capacity > 1300

Neckarwestheim    False
Isar 2             True
Emsland           False
Name: capacity, dtype: bool

df[df.capacity > 1300]

	capacity	type	start_year	end_year	reduced_capacity
Isar 2	1365	PWR	1988	NaN	1092.0

Modifying Values#

In many cases, we want to modify values in a dataframe based on some rule. To modify values, we need to use .loc or .iloc

df.loc["Isar 2", "capacity"] = 1366
df

	capacity	type	start_year	end_year	reduced_capacity
Neckarwestheim	1269	PWR	1989	NaN	1015.2
Isar 2	1366	PWR	1988	NaN	1092.0
Emsland	1290	PWR	1988	NaN	1032.0

Time Series#

Time indexes are great when handling time-dependent data.

Let’s first read some time series data, using the pd.read_csv() function, which takes a local file path or a link to an online resource.

The example data hourly time series for Germany in 2015 for:

electricity demand from OPSD in GW
onshore wind capacity factors from renewables.ninja in per-unit of installed capacity
offshore wind capacity factors from renewables.ninja in per-unit of installed capacity
solar PV capacity factors from renewables.ninja in per-unit of installed capacity
electricity day-ahead spot market prices in €/MWh from EPEX Spot zone DE/AT/LU retrieved via SMARD platform

url = (
    "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
)
ts = pd.read_csv(url, index_col=0, parse_dates=True)
ts.index[:5]

DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
               '2015-01-01 02:00:00', '2015-01-01 03:00:00',
               '2015-01-01 04:00:00'],
              dtype='datetime64[ns]', freq=None)

We can use Python’s slicing notation inside .loc to select a date range, and then use the built-in plotting feature of Pandas:

ts.loc["2015-01-01":"2015-03-01", "onwind"].plot()

A common operation is to change the resolution of a dataset by resampling in time, which Pandas exposes through the resample function.

Note

The resample periods are specified using pandas offset index syntax.

ts["onwind"].resample("D").mean().plot()

Groupby Functionality#

DataFrame objects have a groupby method. The simplest way to think about it is that you pass another series, whose values are used to split the original object into different groups.

Here’s an example which retrieves the total generation capacity per country:

fn = "https://raw.githubusercontent.com/PyPSA/powerplantmatching/master/powerplants.csv"

df = pd.read_csv(fn, index_col=0)
df.iloc[:5, :10]

	Name	Fueltype	Technology	Set	Country	Capacity	Efficiency	DateIn	DateRetrofit	DateOut
id
0	Kernkraftwerk Emsland	Nuclear	Steam Turbine	PP	Germany	1336.0	0.33	1988.0	1988.0	2023.0
1	Brokdorf	Nuclear	Steam Turbine	PP	Germany	1410.0	0.33	1986.0	1986.0	2021.0
2	Borssele	Hard Coal	Steam Turbine	PP	Netherlands	485.0	NaN	1973.0	NaN	2034.0
3	Gemeinschaftskernkraftwerk Neckarwestheim	Nuclear	Steam Turbine	PP	Germany	1310.0	0.33	1976.0	1989.0	2023.0
4	Isar	Nuclear	Steam Turbine	PP	Germany	1410.0	0.33	1979.0	1988.0	2023.0

grouped = df.groupby("Country").Capacity.sum()
grouped.head()

Country
Albania                    2370.400000
Austria                   24643.200368
Belgium                   21443.151009
Bosnia and Herzegovina     4827.195964
Bulgaria                  15699.186363
Name: Capacity, dtype: float64

Let’s break apart this operation a bit. The workflow with groupby can be divided into three general steps:

Split: Partition the data into different groups based on some criterion.
Apply: Do some caclulation within each group, e.g. minimum, maximum, sums.
Combine: Put the results back together into a single object.

Grouping is not only possible on a single columns, but also on multiple columns. For instance, we might want to group the capacities by country and fuel type. To achieve this, we pass a list of functions to the groupby functions.

capacities = df.groupby(["Country", "Fueltype"]).Capacity.sum().unstack()
capacities

Fueltype	Biogas	Geothermal	Hard Coal	Hydro	Lignite	Natural Gas	Nuclear	Oil	Other	Solar	Solid Biomass	Waste	Wind
Country
Albania	NaN	NaN	NaN	1743.900000	NaN	NaN	NaN	NaN	98.00000	294.500000	NaN	NaN	234.0000
Austria	NaN	NaN	1331.400000	14205.500368	522.000000	4952.000000	NaN	NaN	NaN	388.300000	NaN	NaN	3244.0000
Belgium	NaN	NaN	3316.000000	1379.751009	NaN	5251.300000	5931.0	131.60000	NaN	484.100000	498.000	259.700	4191.7000
Bosnia and Herzegovina	NaN	NaN	NaN	2213.500000	1911.595964	NaN	NaN	NaN	NaN	267.100000	NaN	NaN	435.0000
Bulgaria	NaN	NaN	1909.191813	2899.540000	4807.354550	990.000000	2080.0	NaN	NaN	2405.100000	NaN	NaN	608.0000
Croatia	NaN	NaN	308.848880	2129.772189	NaN	1440.800000	NaN	NaN	NaN	96.300000	NaN	NaN	1140.0000
Czechia	NaN	NaN	1225.227500	1888.219836	7297.666121	1659.000000	4164.0	NaN	NaN	1293.600000	30.000	NaN	140.0000
Denmark	NaN	NaN	5658.885609	NaN	NaN	1813.200000	NaN	664.00000	70.00000	2256.700000	2542.000	NaN	5670.8000
Estonia	NaN	NaN	NaN	NaN	NaN	389.000000	NaN	1841.00000	NaN	446.200000	NaN	NaN	713.4000
Finland	1518.740000	NaN	3930.809044	3201.320000	965.000000	3134.800000	4362.0	1005.00000	143.10000	162.200000	2362.000	NaN	9681.1000
France	NaN	NaN	6918.344374	21054.763968	977.000000	9292.200000	64682.0	5130.00000	302.00000	11192.900000	348.000	NaN	26603.9000
Germany	2265.320979	NaN	30631.473898	12716.214837	23056.815081	39489.076894	23093.0	6850.84807	3156.36805	36716.919043	1071.699	2584.851	84571.1625
Greece	NaN	NaN	NaN	3368.100000	4786.794109	9562.000000	NaN	NaN	NaN	4166.300000	NaN	24.000	5155.0000
Hungary	NaN	NaN	745.484239	47.700000	815.171651	2859.400000	1898.9	170.00000	NaN	2582.700000	173.600	NaN	319.0000
Ireland	NaN	NaN	843.572315	609.000000	253.000000	5618.000000	NaN	1018.00000	NaN	1226.300000	68.000	NaN	4780.2000
Italy	NaN	806.0	10313.926461	20027.240000	NaN	51339.520000	1472.0	1016.00000	173.00000	5521.100000	770.000	NaN	10049.6000
Kosovo	NaN	NaN	NaN	65.200000	1187.716161	NaN	NaN	NaN	NaN	161.100000	NaN	NaN	135.0000
Latvia	NaN	NaN	NaN	1536.100000	NaN	1087.000000	NaN	NaN	NaN	44.000000	NaN	NaN	100.0000
Lithuania	NaN	NaN	NaN	1001.000000	NaN	1508.000000	2600.0	NaN	NaN	299.600000	70.000	NaN	1195.0000
Luxembourg	NaN	NaN	NaN	1294.000000	NaN	NaN	NaN	NaN	NaN	54.100000	NaN	NaN	148.1000
Moldova	NaN	NaN	2520.000000	48.000000	NaN	968.000000	NaN	NaN	NaN	24.800000	NaN	NaN	NaN
Montenegro	NaN	NaN	NaN	657.321000	207.481472	NaN	NaN	NaN	NaN	3.200000	NaN	NaN	118.0000
Netherlands	36.000000	NaN	9063.994952	NaN	NaN	16090.600000	60.0	NaN	548.00000	4839.600000	1307.000	NaN	11069.1000
North Macedonia	NaN	NaN	NaN	562.600000	737.711901	418.000000	NaN	NaN	NaN	296.900000	NaN	NaN	73.0000
Norway	NaN	NaN	NaN	32539.390000	NaN	968.100000	NaN	NaN	NaN	1.400000	NaN	NaN	6300.8000
Poland	NaN	NaN	22585.482921	2877.906738	8786.812855	6516.500000	NaN	55.00000	NaN	2712.100000	523.000	NaN	10271.9000
Portugal	NaN	NaN	1858.761004	8270.200000	NaN	4759.500000	NaN	NaN	NaN	3509.900000	210.000	NaN	5362.6000
Romania	NaN	NaN	1184.688989	7971.390000	5197.253348	4831.500000	1300.0	NaN	NaN	2026.700000	30.000	NaN	5505.7000
Serbia	NaN	NaN	NaN	3112.500000	5357.415458	554.000000	NaN	NaN	NaN	39.600000	30.000	NaN	772.1000
Slovakia	NaN	NaN	592.673970	2418.480000	317.863362	1004.870000	1940.0	NaN	NaN	449.200000	NaN	NaN	NaN
Slovenia	NaN	NaN	114.320183	1274.025000	1130.422184	600.000000	696.0	NaN	NaN	30.000000	NaN	NaN	NaN
Spain	NaN	NaN	11904.878478	26069.861248	1831.400000	28394.244000	7733.2	1854.37100	NaN	36998.200000	563.000	388.054	34371.6500
Sweden	NaN	NaN	291.000000	14273.686625	NaN	2358.000000	9859.0	1685.00000	NaN	281.800000	2432.600	NaN	16958.8000
Switzerland	NaN	NaN	NaN	20115.040800	NaN	55.000000	3355.0	NaN	NaN	96.800000	NaN	NaN	55.0000
Ukraine	NaN	NaN	24474.000000	6590.000000	NaN	4687.900000	17635.0	NaN	NaN	5628.700000	NaN	NaN	461.4000
United Kingdom	31.000000	NaN	33823.617061	4576.175000	NaN	36366.400000	19181.0	100.00000	55.00000	11668.600000	4919.000	288.900	38670.3000

Exercises#

Task 1: Provide a list of unique fuel types included in the power plants dataset.

Task 2: Filter the dataset by power plants with the fuel type “Hard Coal”. How many hard coal power plants are there?

Show code cell content

Hide code cell content

coal = df.loc[df.Fueltype == "Hard Coal"]
coal

	Name	Fueltype	Technology	Set	Country	Capacity	Efficiency	DateIn	DateRetrofit	DateOut	lat	lon	Duration	Volume_Mm3	DamHeight_m	StorageCapacity_MWh	EIC	projectID
id
2	Borssele	Hard Coal	Steam Turbine	PP	Netherlands	485.000000	NaN	1973.0	NaN	2034.0	51.433200	3.716000	NaN	0.0	0.0	0.0	{'49W000000000054X'}	{'BEYONDCOAL': {'BEYOND-NL-2'}, 'ENTSOE': {'49...
98	Didcot	Hard Coal	CCGT	PP	United Kingdom	1490.000000	0.550000	1970.0	1998.0	2013.0	51.622300	-1.260800	NaN	0.0	0.0	0.0	{'48WSTN0000DIDCBC'}	{'BEYONDCOAL': {'BEYOND-UK-22'}, 'ENTSOE': {'4...
129	Mellach	Hard Coal	Steam Turbine	CHP	Austria	200.000000	NaN	1986.0	1986.0	2020.0	46.911700	15.488300	NaN	0.0	0.0	0.0	{'14W-WML-KW-----0'}	{'BEYONDCOAL': {'BEYOND-AT-11'}, 'ENTSOE': {'1...
150	Emile Huchet	Hard Coal	CCGT	PP	France	596.493211	NaN	1958.0	2010.0	2022.0	49.152500	6.698100	NaN	0.0	0.0	0.0	{'17W100P100P0344D', '17W100P100P0345B'}	{'BEYONDCOAL': {'BEYOND-FR-67'}, 'ENTSOE': {'1...
151	Amercoeur	Hard Coal	CCGT	PP	Belgium	451.000000	0.187765	1968.0	NaN	2009.0	50.431000	4.395500	NaN	0.0	0.0	0.0	{'22WAMERCO000010Y'}	{'BEYONDCOAL': {'BEYOND-BE-27'}, 'ENTSOE': {'2...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
29779	St	Hard Coal	NaN	CHP	Germany	21.645000	NaN	1982.0	NaN	NaN	49.976593	9.068953	NaN	0.0	0.0	0.0	{nan}	{'MASTR': {'MASTR-SEE971943692655'}}
29804	Uer	Hard Coal	NaN	CHP	Germany	15.200000	NaN	1964.0	NaN	NaN	51.368132	6.662350	NaN	0.0	0.0	0.0	{nan}	{'MASTR': {'MASTR-SEE988421065542'}}
29813	Walheim	Hard Coal	NaN	PP	Germany	244.000000	NaN	1964.0	NaN	NaN	49.017585	9.157690	NaN	0.0	0.0	0.0	{nan, nan}	{'MASTR': {'MASTR-SEE937157344278', 'MASTR-SEE...
29830	Wd Ffw	Hard Coal	NaN	CHP	Germany	123.000000	NaN	1990.0	NaN	NaN	50.099000	8.653000	NaN	0.0	0.0	0.0	{nan, nan}	{'MASTR': {'MASTR-SEE915289541482', 'MASTR-SEE...
29835	West	Hard Coal	NaN	CHP	Germany	277.000000	NaN	1985.0	NaN	NaN	52.442456	10.762681	NaN	0.0	0.0	0.0	{nan, nan}	{'MASTR': {'MASTR-SEE917432813484', 'MASTR-SEE...

332 rows × 18 columns

Task 3: Identify the three largest coal power plants. In which countries are they located? When were they built?

Show code cell content

Hide code cell content

coal.loc[coal.Capacity.nlargest(3).index]

	Name	Fueltype	Technology	Set	Country	Capacity	Efficiency	DateIn	DateRetrofit	DateOut	lat	lon	Duration	Volume_Mm3	DamHeight_m	StorageCapacity_MWh	EIC	projectID
id
194	Kozienice	Hard Coal	Steam Turbine	PP	Poland	3682.216205	NaN	1972.0	NaN	2042.0	51.6647	21.4667	NaN	0.0	0.0	0.0	{'19W000000000104I', '19W000000000095U'}	{'BEYONDCOAL': {'BEYOND-PL-96'}, 'ENTSOE': {'1...
3652	Vuglegirska	Hard Coal	CCGT	PP	Ukraine	3600.000000	NaN	1972.0	NaN	NaN	48.4652	38.2027	NaN	0.0	0.0	0.0	{nan}	{'GPD': {'WRI1005107'}, 'GEO': {'GEO-43001'}}
767	Opole	Hard Coal	Steam Turbine	PP	Poland	3071.893939	NaN	1993.0	NaN	2020.0	50.7518	17.8820	NaN	0.0	0.0	0.0	{'19W0000000001292'}	{'BEYONDCOAL': {'BEYOND-PL-16'}, 'ENTSOE': {'1...

Task 4: What is the average “DateIn” of each “Fueltype”? Which type of power plants is the oldest on average?

Task 5: In the time series provided, calculate the annual average capacity factors of wind and solar.

Task 6: In the time series provided, calculate and plot the monthly average electricity price.

Basics pandas

Contents

Basics pandas#

Package Imports#

Series#

DataFrame#

Sorting Data#

Filtering Data#

Modifying Values#

Time Series#

Groupby Functionality#

Exercises#

Basics `pandas`#