Hello, the goal of this article is to offer a clear description of the dataset that I uploaded to Kaggle in November 2017, followed by some insights drawn from it.

Description of the dataset

To better follow the energy consumption, the government wants energy suppliers to install smart meters in every home in England, Wales and Scotland. There are more than 26 million homes for the energy suppliers to get to, with the goal of every home having a smart meter by 2020.

This rollout of meters is led by the European Union, which asked all member governments to look at smart meters as part of measures to upgrade our energy supply and tackle climate change. After an initial study, the British government decided to adopt smart meters as part of its plan to update our ageing energy system.

In this dataset, you will find a refactored version of the data from the London Data Store, which contains the energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. The smart meter data appears to cover only electricity consumption.

To make the dataset easier to manipulate, several transformations have been applied:

All the data from a given household is collected in the same file (this was not the case in the original dataset)
Households from the same ACORN group are stored in the same file

The original, cleaned dataset can be found in the halfhourly_dataset zip file, and one file looks like this snapshot.

Img illustration

Preview of the half hourly data

As you can see, the dataset is quite easy to manipulate, with three columns:

LCLid that corresponds to the household ID
tstp the timestamp of the measurement
energy(kWh/hh) the energy consumed in the past 30 minutes in kWh
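
As a minimal sketch, one half-hourly file could be loaded with pandas as follows. The column names come from the list above; the household IDs and readings are made up for illustration, and the in-memory CSV stands in for a real extracted file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one half-hourly file (made-up IDs and values).
sample_csv = io.StringIO(
    "LCLid,tstp,energy(kWh/hh)\n"
    "MAC000002,2013-01-01 00:00:00,0.219\n"
    "MAC000002,2013-01-01 00:30:00,0.241\n"
    "MAC000003,2013-01-01 00:00:00,0.105\n"
)

# With the real data, replace sample_csv with the path to one extracted file.
readings = pd.read_csv(sample_csv, parse_dates=["tstp"])

# Coerce the energy column to floats; entries that cannot be parsed become NaN.
readings["energy(kWh/hh)"] = pd.to_numeric(
    readings["energy(kWh/hh)"], errors="coerce"
)

# Total consumption per household over the sample.
total_per_household = readings.groupby("LCLid")["energy(kWh/hh)"].sum()
print(total_per_household)
```

The `errors="coerce"` step is a precaution in case some readings are stored as non-numeric text; it can be dropped if the column already parses as floats.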

But to make life easier for the user of my dataset, I created two other zip files that contain some pre-processed data:

the daily_dataset that contains daily information on the consumption of the households
the hhblock_dataset that contains the transposed data of a day for one household (as an array); for example, the hh_0 column is the consumption between 00:00 and 00:30
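
To give an idea of how these two pre-processed views relate to the half-hourly readings, here is a hedged sketch with pandas. The data is synthetic, the aggregate names (energy_sum, energy_mean, energy_count) are illustrative rather than the actual columns of daily_dataset, and the slot index assumes that tstp marks the start of each half-hour interval.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly readings for one household over two days (made-up values).
tstp = pd.date_range("2013-01-01", periods=96, freq="30min")
readings = pd.DataFrame({
    "LCLid": "MAC000002",
    "tstp": tstp,
    "energy(kWh/hh)": np.round(np.linspace(0.1, 1.05, 96), 3),
})

readings["day"] = readings["tstp"].dt.normalize()
# Slot index within the day (0 to 47), assuming tstp marks the start of the slot.
readings["slot"] = readings["tstp"].dt.hour * 2 + readings["tstp"].dt.minute // 30

# Daily view: one row per household and day (illustrative aggregates).
daily = (
    readings.groupby(["LCLid", "day"])["energy(kWh/hh)"]
    .agg(energy_sum="sum", energy_mean="mean", energy_count="count")
    .reset_index()
)

# Half-hour block view: one row per household and day, one column per slot.
hhblock = readings.pivot_table(
    index=["LCLid", "day"], columns="slot", values="energy(kWh/hh)"
)
hhblock.columns = [f"hh_{slot}" for slot in hhblock.columns]
hhblock = hhblock.reset_index()
```

If the timestamp instead marks the end of the interval, shifting tstp back by 30 minutes before computing the slot would restore the hh_0 convention described above.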

This is an overview of all the data from the smart meters. To facilitate the exploration, there is also a table that lists all the households and their associated files (informations_households.csv).

Img 4 illustration

This table contains:

LCLid that corresponds to the household ID
stdorToU the kind of tariff applied (ToU for the dynamic tariff that varies with the day, or Std for the classic fixed tariff)
Acorn the ACORN group that categorizes the household
Acorn_grouped a broader classification that merges several ACORN groups
file the name of the file, in the different zip files, where the data of the household can be found
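
As an illustration of how this table can drive the exploration, the sketch below selects households by tariff and ACORN classification, then groups their IDs by file so that each file is opened only once. The rows are made up; only the column names come from the description above.

```python
import pandas as pd

# Made-up rows that mimic the columns described above.
households = pd.DataFrame({
    "LCLid": ["MAC000002", "MAC000003", "MAC000004"],
    "stdorToU": ["Std", "ToU", "Std"],
    "Acorn": ["ACORN-A", "ACORN-P", "ACORN-E"],
    "Acorn_grouped": ["Affluent", "Adversity", "Affluent"],
    "file": ["block_0", "block_1", "block_0"],
})

# Households on the fixed tariff in a given aggregated ACORN group.
selected = households[
    (households["stdorToU"] == "Std")
    & (households["Acorn_grouped"] == "Affluent")
]

# Map each file to the households it contains, so each file is read only once.
ids_by_file = selected.groupby("file")["LCLid"].apply(list).to_dict()
print(ids_by_file)
```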

All this information comes from the original dataset. To complete it and make other studies possible, some new datasets have been added:

acorn_details.csv : contains, for multiple parameters, the index of each ACORN group compared with the national average (which has an index of 100)
Preview of the details on the ACORN groups
uk_bank_holidays.csv : the bank holidays for the period of the study
Preview of the details on the bank holidays
weather_daily_darksky.csv : the daily weather information from Dark Sky for London during the study
Preview of the daily weather information
weather_hourly_darksky.csv : the hourly weather information from Dark Sky for London during the study
Preview of the hourly weather information

This first part offers a general overview of the content of the dataset. It is now time to get a clearer picture of the data from the smart meters.

Exploration of the dataset

Selection of the households

The first step in this study is to find the best period for the comparison. In my previous article on electrical consumption in France there was a seasonal effect, so a good period to study should cover at least one year of data. The next figure shows, for each day of the study, the number of households with complete data (all 48 half-hourly timestamps of the day).

Img 9 illustration
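
A count like the one plotted here could be obtained along the following lines. This is a sketch on synthetic data, not the code used for the figure: household A has a full day of readings and household B is missing one slot.

```python
import pandas as pd

# Synthetic readings: household A has a full day, household B misses one slot.
full_day = pd.date_range("2013-01-01", periods=48, freq="30min")
readings = pd.concat([
    pd.DataFrame({"LCLid": "A", "tstp": full_day, "energy(kWh/hh)": 0.2}),
    pd.DataFrame({"LCLid": "B", "tstp": full_day[:-1], "energy(kWh/hh)": 0.3}),
], ignore_index=True)

readings["day"] = readings["tstp"].dt.normalize()

# Number of non-missing readings per household and day.
per_day = readings.groupby(["LCLid", "day"])["energy(kWh/hh)"].count()

# A household-day is complete when all 48 half-hour readings are present.
complete = per_day[per_day == 48].reset_index()

# Number of households with a complete day, per calendar day.
households_per_day = complete.groupby("day")["LCLid"].nunique()
print(households_per_day)
```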

Notes: The number of available households clearly increases from the start of the study in late 2011 and reaches its peak in 2013. The year 2013 is therefore a good period for our study, and it is the one I chose. It is now important to know how the number of available days in this period is distributed across the households of the experiment. The following figure represents this distribution as a boxplot.

Img 10 illustration

I decided to keep the households that have at least 357 days of data. Compared with the original dataset, this represents a loss of 632 households out of the 5566 available, which is totally acceptable.
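
The selection rule can be sketched as below. The 357-day threshold comes from the text; the table of complete household-days is synthetic, and the names complete, kept and dropped are illustrative.

```python
import pandas as pd

MIN_DAYS = 357  # threshold quoted in the text

# Synthetic complete household-days: "A" covers all of 2013, "B" only 300 days.
days_2013 = pd.date_range("2013-01-01", "2013-12-31", freq="D")
complete = pd.concat([
    pd.DataFrame({"LCLid": "A", "day": days_2013}),
    pd.DataFrame({"LCLid": "B", "day": days_2013[:300]}),
], ignore_index=True)

# Restrict to 2013, then count the distinct days available per household.
in_2013 = complete[complete["day"].dt.year == 2013]
days_per_household = in_2013.groupby("LCLid")["day"].nunique()

kept = days_per_household[days_per_household >= MIN_DAYS].index.tolist()
dropped = days_per_household[days_per_household < MIN_DAYS].index.tolist()
print(f"kept {len(kept)} households, dropped {len(dropped)}")
```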

Overview of the panel

One of the first things to do is to display the average consumption per day of these households during the year 2013. In the following figure there is the average global consumption of these households during the period.

Img 11 illustration

Notes: There is a clear link between the electrical consumption and the day of the year (the same result as in my previous article). The seasonal effect is very pronounced, which suggests that many households in this panel use electricity for heating. Crossing the average daily outdoor temperature with the total daily consumption of the panel gives the relation displayed in the following figure:

Img 12 illustration
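
Crossing the two series amounts to a join on the day. The sketch below uses synthetic data, and the column names temperature_mean and energy_sum are hypothetical; the actual columns of the weather and consumption files may differ.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2013-01-01", "2013-12-31", freq="D")
doy = days.dayofyear.to_numpy()
seasonal_wave = np.cos(2 * np.pi * (doy - 15) / 365)

# Synthetic weather: cold in January, warm in July (hypothetical column name).
weather = pd.DataFrame({"day": days, "temperature_mean": 11 - 8 * seasonal_wave})

# Synthetic panel consumption: higher when it is colder.
panel = pd.DataFrame({"day": days, "energy_sum": 12 + 3 * seasonal_wave})

# One row per day with both the consumption and the temperature.
crossed = panel.merge(weather, on="day", how="inner")

# Linear correlation between consumption and temperature.
corr = crossed["energy_sum"].corr(crossed["temperature_mean"])
print(round(corr, 3))
```

On real data the correlation would be weaker than in this idealised example, and a scatter plot of the two columns is usually more informative than the single coefficient.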

This general observation suggests that the PTG (the red plot) from the previous article can be calculated for each household. The following figure shows, for individual households, the daily consumption, the associated PTG and its r² score.

Img 13 illustration
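
The exact form of the PTG model is given in the previous article and not repeated here. As a rough stand-in, the sketch below fits a hinge function with NumPy: consumption changes linearly with temperature below a switch temperature x0 and stays flat above it. This parameterisation, the function name fit_hinge and the grid of candidate x0 values are all assumptions, and the data is synthetic.

```python
import numpy as np

def fit_hinge(temperature, consumption, candidates):
    """Fit y = b + a * min(T - x0, 0) by scanning x0 and solving for (a, b)."""
    best = None
    for x0 in candidates:
        # Below x0 the consumption varies linearly with T; above it stays at b.
        hinge = np.minimum(temperature - x0, 0.0)
        design = np.column_stack([hinge, np.ones_like(hinge)])
        coef, *_ = np.linalg.lstsq(design, consumption, rcond=None)
        sse = float(np.sum((design @ coef - consumption) ** 2))
        if best is None or sse < best["sse"]:
            best = {"a": coef[0], "b": coef[1], "x0": float(x0), "sse": sse}
    total = float(np.sum((consumption - consumption.mean()) ** 2))
    best["r2"] = 1.0 - best["sse"] / total
    return best

# Synthetic household: 0.5 kWh more per day for each degree below 15 °C.
temperature = np.linspace(-5.0, 25.0, 61)
consumption = 8.0 - 0.5 * np.minimum(temperature - 15.0, 0.0)

model = fit_hinge(temperature, consumption, candidates=np.arange(5.0, 21.0, 0.5))
print(model)
```

A grid scan over x0 keeps each inner fit a plain linear least-squares problem, which is simple and robust; a non-linear optimiser would be an alternative if a finer estimate of the switch temperature is needed.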

Notes: This illustrates that the model fits some households very well, with a high r² score (these households probably have an electric heating system), while for others it does not fit at all. The general model derived from the average daily consumption (the yellow curve) shows that the average daily consumption does not represent the behavior of individual households. The following figure is a scatter plot of the pivot temperature as a function of the r² score.

Img 14 illustration

Another way to identify the households that have an electric heating system could be to compare the average consumption during the winter and the summer and compute a simple ratio between the two. The data have been crossed with the information on the households, and here is an extract of the new dataset.

Img 15 illustration

This dataframe contains:

model_a the slope of the ptg model (in the winter part)
model_b the intersection of the ptg models
model_x0 the temperature of regime switch
r2score the r² score of the ptg model on the household
season_0 the average consumption in winter
season_1 the average consumption in spring
season_2 the average consumption in summer
season_3 the average consumption in autumn
ratio_winter_summer the ratio of the consumption winter/summer
stdorToU the type of tariff
Acorn the ACORN group
Acorn_grouped the aggregated ACORN groups
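
The seasonal averages and the winter/summer ratio listed above could be computed along these lines. This is a sketch on synthetic data: the mapping of months to seasons (0 for December to February, then three-month blocks) is an assumption, as is the energy_sum column name.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2013-01-01", "2013-12-31", freq="D")
daily = pd.DataFrame({
    "LCLid": "A",
    "day": days,
    # Synthetic daily totals: 12 kWh in Dec-Feb, 6 kWh the rest of the year.
    "energy_sum": np.where(days.month.isin([12, 1, 2]), 12.0, 6.0),
})

# Assumed mapping: 0 = Dec-Feb, 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov.
daily["season"] = (daily["day"].dt.month % 12) // 3

# One row per household, one column per season with the average daily consumption.
seasonal = (
    daily.groupby(["LCLid", "season"])["energy_sum"].mean()
    .unstack("season")
    .add_prefix("season_")
)
seasonal["ratio_winter_summer"] = seasonal["season_0"] / seasonal["season_2"]
print(seasonal)
```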

That is a lot of data to cross, so the following figure is a pairplot that crosses all these variables, colored by Acorn_grouped.

Img 16 illustration

Notes: There is no obvious relation between the indicators that describe the households, except between season_0 and model_b, which is expected because both are winter-related. There is also no link between these indicators and Acorn_grouped, and the result is similar with Acorn, which is a little bit disappointing.

Next steps

As you can see, this first exploration of the dataset has highlighted some characteristics of the electrical consumption in London, such as the influence of the weather, but there are many more things to do with this dataset. Some ideas for future analyses:

Cross the ACORN data and the smart meter data
Try to forecast the consumption of the different households
Add new datasets like:
EPC data from London
extra data on London like some underground or train strike during the period
Cluster the household data and the energy profiles; as you can see in the following heatmap, there is a “pattern” in the total consumption of these households.

You can find all the code used for this article in this GitHub repo.

References

Smart Meters in London (Kaggle) — Kaggle
London Data Store smart meter dataset — data.london.gov.uk
ACORN group — acorn.caci.co.uk
Dark Sky API — darksky.net
london_smartmeter repository — GitHub
EPC open data API — epc.opendatacommunities.org
