Analysis of the energy performance certificates with Dataiku DSS

Since I published the article on the London smart meter data and its possible analyses, I have regularly received messages from people interested in connecting the smart meter data with the energy efficiency of the monitored households. I wrote in that article that there is no direct way to connect the smart meter data, the ACORN groups and the energy efficiency of the households, but there are still interesting things to do with other datasets on energy and household energy efficiency.

For a long time I have wanted to write an article on a tool that I tested in my previous job: DSS from Dataiku, which is very useful for people who work with data. In this article I am going to present this tool and these other datasets.

What is Dataiku DSS?

Dataiku DSS is a product developed by the French company Dataiku that is defined on their website as a “collaborative data science software platform for teams of data scientists, data analysts, and engineers to explore, prototype, build, and deliver their own data products more efficiently”.

To make it simple, it is a tool that simplifies the processing and sharing of data and models in a company. I invite you to take a look at their website, which describes many business cases and the functionalities of the platform. What is important to know is that there are two editions of the platform:

  • The free edition: the one I am using for this project, hosted on my machine (there is also a version that can be hosted for free on Dataiku’s servers)
  • The enterprise edition: it offers more data connectors (Hive, etc.) and fewer limitations. No price is announced on the site, I think because it depends on the client’s needs, but from the rumors I heard it is not cheap. There is a 2-week free trial to test the service.

The installation of the free edition is very simple, and it can be installed on most operating systems. I am going to dive into the functionalities of the software after presenting the data.

Description of the data

For this project, I am going to use the following data sources:

  • EPC: a collection of energy performance certificates in the UK (around 15 million)
  • Nomis data: a website that gathers information on the UK collected during the different censuses (the most recent one is from 2011)

Let’s look at the datasets in more detail.

EPC

So first, what is an EPC? It is quite simple: it is an energy performance rating of a household. There is an example in the following figure.

Img illustration

It looks like the label found on appliances, for example, and it needs to be refreshed for each new tenant or landlord. The data source, opened by the Ministry of Housing, Communities & Local Government, is very complete (more than 15 million certificates), and it contains more than a simple rating (see the data dictionary): there is information on the glazing, the energy consumption, the floor area, etc.

Nomis data

The Nomis website “is a service provided by the Office for National Statistics, ONS, to give free access to the most detailed and up-to-date UK labour market statistics from official sources”, and on this website there is a lot of information on UK citizens collected during the different censuses. The level of detail is good, and the ACORN groups described in my article on the smart meter are mostly built from census data.

Img 2 illustration

From this portal I extracted information on UK citizens at the district level on their:

  • Occupation
  • Qualification
  • NS-SEC (National Statistics Socio-economic Classification)
  • Population

Let’s now start Dataiku DSS.

Data processing in DSS

All the data extracted for this project are CSV files. The following animation illustrates the process of creating a dataset in DSS.

Giphy GIF

Files can be easily dragged and dropped into DSS, and you get an overview of the data, their quality, etc. In the free version a SQL database can also be connected, and it is very easy.

For this project the idea is to connect the EPC data with the Nomis data, so a large part of the processing consists of aggregating the certificates at the district level so that they can be connected to the Nomis data.

Here is an overview of the complete process.

Img 3 illustration

The following figure shows the process that prepares the EPC dataset so that it can be connected with the Nomis data.

Img 4 illustration

There is a preparation phase (the brush icon) that selects the right columns and processes the postcode to get the district code.

Img 5 illustration
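
For readers who prefer code to the visual recipe, the postcode step can be approximated in pandas. This is only a minimal sketch: it assumes that "district code" means the postcode district (the outward code, that is, everything before the last three characters of a UK postcode), and the `POSTCODE` column name and the sample values are illustrative.

```python
import pandas as pd

# Toy sample; POSTCODE is an assumed column name and the values are illustrative.
epc = pd.DataFrame({"POSTCODE": ["SW1A 1AA", "e1 6an", "M11AE ", "EC1A 1BB"]})


def postcode_to_district(postcode: str) -> str:
    """Return the outward code (postcode district) of a UK postcode."""
    # Remove every whitespace character and normalise the case.
    compact = "".join(postcode.split()).upper()
    # The inward code is always the last three characters, so drop them.
    return compact[:-3]


epc["DISTRICT"] = epc["POSTCODE"].map(postcode_to_district)
print(epc)
```

Dropping the last three characters rather than splitting on the space keeps the function working when the space is missing or misplaced in the raw data.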

A second step, which could have been merged with the previous one, prepares the dataset to find the most recent EPC of each household (by putting the inspection date in the correct format).

To get the last inspection date of each household in the dataset, I used a group by function (the square, triangle and circle icon on the process image). The following animation illustrates the process.

Giphy GIF
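
The date formatting and the group by can be sketched in pandas as follows. The column names (`BUILDING_REFERENCE_NUMBER` as the household identifier, `INSPECTION_DATE`, `CURRENT_ENERGY_RATING`) and the date format are assumptions made for the illustration, and the rows are toy values.

```python
import pandas as pd

# Toy certificates: two households have been inspected twice.
epc = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [101, 101, 202, 303, 303],
    "INSPECTION_DATE": ["2012-05-01", "2017-09-15", "2015-03-20",
                        "2010-01-10", "2016-11-02"],
    "CURRENT_ENERGY_RATING": ["E", "C", "D", "F", "D"],
})

# Parse the text dates so that "most recent" is a real date comparison.
epc["INSPECTION_DATE"] = pd.to_datetime(epc["INSPECTION_DATE"], format="%Y-%m-%d")

# One row per household, holding its latest inspection date.
last_inspection = (
    epc.groupby("BUILDING_REFERENCE_NUMBER", as_index=False)["INSPECTION_DATE"].max()
)
print(last_inspection)
```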

Now that we have the cleaned EPC data and, in another table, the last inspection date of each household, the two datasets are joined with the join function (the join logo in the process). The figure below shows the join menu, where you can select the join key and the columns to keep.

Img 6 illustration
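
A pandas equivalent of this join could look like the sketch below. It rebuilds a toy version of both tables so that it runs on its own; the column names are again assumptions, not the exact ones used in the DSS flow.

```python
import pandas as pd

# Toy cleaned certificates (assumed column names).
epc = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [101, 101, 202, 303, 303],
    "INSPECTION_DATE": pd.to_datetime(["2012-05-01", "2017-09-15", "2015-03-20",
                                       "2010-01-10", "2016-11-02"]),
    "CURRENT_ENERGY_RATING": ["E", "C", "D", "F", "D"],
})

# Table of the last inspection date per household.
last_inspection = (
    epc.groupby("BUILDING_REFERENCE_NUMBER", as_index=False)["INSPECTION_DATE"].max()
)

# Inner join on (household, date): only the most recent certificate survives.
latest_epc = epc.merge(
    last_inspection,
    on=["BUILDING_REFERENCE_NUMBER", "INSPECTION_DATE"],
    how="inner",
)
print(latest_epc)
```

One caveat of this approach: if a household has two certificates with the same inspection date, both rows are kept, so a deduplication step may still be needed.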

Next, a group by function aggregates the certificates per district, type of household, type of heating system and EPC rating.

Giphy GIF

The final step is a Python script that aggregates the information at the district level with the pivot function of pandas (count of EPC per rating and per type of household), although I could have used the pivot function of DSS.

Giphy GIF
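
The original script is only visible in the animation, so here is a hedged sketch of what such a pivot can look like with `pivot_table`. The input mimics the output of the previous group by; the column names (`DISTRICT`, `PROPERTY_TYPE`, `EPC_COUNT`) and the counts are made up for the example.

```python
import pandas as pd

# Toy output of the group by step (assumed column names, made-up counts).
grouped = pd.DataFrame({
    "DISTRICT": ["E1", "E1", "E1", "SW1A", "SW1A"],
    "CURRENT_ENERGY_RATING": ["C", "D", "D", "B", "C"],
    "PROPERTY_TYPE": ["Flat", "Flat", "House", "Flat", "House"],
    "EPC_COUNT": [120, 80, 40, 30, 60],
})

# One column per rating, one row per district.
by_rating = grouped.pivot_table(
    index="DISTRICT", columns="CURRENT_ENERGY_RATING",
    values="EPC_COUNT", aggfunc="sum", fill_value=0,
).add_prefix("RATING_")

# One column per type of household, one row per district.
by_type = grouped.pivot_table(
    index="DISTRICT", columns="PROPERTY_TYPE",
    values="EPC_COUNT", aggfunc="sum", fill_value=0,
).add_prefix("TYPE_")

# Both pivots share the district index, so they can be joined side by side.
district_profile = by_rating.join(by_type).reset_index()
print(district_profile)
```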

Et voilà, we have the EPC data aggregated at the district level, which tells us about the ratings and the kinds of households at this level.

Img 7 illustration

These data can then easily be joined with the Nomis data with the join function.
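
In pandas, this last join could be sketched as a left merge on the district code. Everything here is hypothetical (column names, districts and numbers); the `indicator=True` option is used to spot districts that have certificates but no census row.

```python
import pandas as pd

# Hypothetical district-level EPC aggregates and Nomis extract (toy values).
district_profile = pd.DataFrame({
    "DISTRICT": ["E1", "SW1A", "M1"],
    "RATING_C": [120, 60, 45],
})
nomis = pd.DataFrame({
    "DISTRICT": ["E1", "SW1A"],
    "POPULATION": [1000, 2000],
})

# Left join keeps every EPC district; _merge tells where each row came from.
merged = district_profile.merge(nomis, on="DISTRICT", how="left", indicator=True)

# Districts without census information, worth checking before any analysis.
unmatched = merged.loc[merged["_merge"] == "left_only", "DISTRICT"].tolist()
print(merged)
print("Districts without Nomis data:", unmatched)
```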

I used some in-house functions of DSS to do the join and the group by, but I could have used:

  • A Python or R script to build the dataset
  • A SQL script if the data were in SQL tables
  • Hive or Impala in the case of a “big data” configuration

Now that the data are available, let’s make an analysis and a dashboard to wrap up some results.

Data analysis in DSS

The analyses are going to be very high level; the goal is only to show the features of DSS.

This is an animation of a dashboard that I built with DSS.

Giphy GIF

Building this dashboard is much easier than building one in Tableau (in my opinion), but both share the drag-and-drop approach to building each graph, which is very useful.

It is also possible to run analyses directly from the dataset with the lab tool, where in-house functions can be used (to determine correlations, for example) or scripts can be written to analyze the data. In this case I chose Python to make some plots with seaborn.

Giphy GIF
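
As an illustration of the kind of script that can run in such a notebook, here is a small correlation computation with pandas. The column names and the values are invented for the example and carry no real meaning; they only stand in for district-level shares built from the EPC and Nomis data.

```python
import pandas as pd

# Made-up district-level shares (hypothetical columns, no real meaning).
districts = pd.DataFrame({
    "SHARE_RATING_A_TO_C": [0.55, 0.40, 0.35, 0.60, 0.25],
    "SHARE_DEGREE_LEVEL": [0.50, 0.35, 0.30, 0.58, 0.20],
    "SHARE_FLATS": [0.70, 0.30, 0.45, 0.65, 0.20],
})

# Pairwise Pearson correlation between the numeric columns.
corr = districts.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix is the usual input for a seaborn heatmap (`seaborn.heatmap(corr, annot=True)`), which is one way to get a plot similar in spirit to the ones shown in the animation.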

The analysis part is really cool and I think it could fit a lot of needs, but the other really impressive part is the ML part, which builds models based on the processed data.

Model serving in DSS

So there are multiple ways to build a model, but first let’s define a goal for this part:

“Build an energy rating estimator based on the location, the total floor area and the type of heating of the household”

DSS gives you three “levels” to build a model, but let’s be honest, it is the same interface; your journey through the model configuration simply starts higher in the menu hierarchy.

Img 8 illustration

In the menu to build the model there is the possibility to:

  • Prepare the training and testing sets
  • Choose the evaluation metric
  • Pick the features
  • Choose the models and the parameters for the grid search
  • Compare the models after the testing part

A simple animation gives an overview of the features of the model builder.

Giphy GIF

The interesting part is that you can use pre-built functions (I presume scikit-learn functions) or write your own Python code. The tool to test the model is really impressive in terms of visualization of the process and the results.
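
To give an idea of what the visual steps correspond to in code, here is a minimal scikit-learn sketch of the rating estimator: categorical features are one-hot encoded, a decision tree is tuned with a small grid search, and cross-validated accuracy is the evaluation metric. It is not what DSS runs internally; the column names, the toy data and the parameter grid are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy households (hypothetical columns and values), repeated to allow 3-fold CV.
base = pd.DataFrame({
    "DISTRICT": ["E1", "E1", "SW1A", "SW1A", "M1", "M1"],
    "TOTAL_FLOOR_AREA": [45.0, 60.0, 120.0, 95.0, 70.0, 150.0],
    "MAIN_HEATING": ["gas boiler", "electric", "gas boiler",
                     "gas boiler", "electric", "gas boiler"],
    "CURRENT_ENERGY_RATING": ["C", "E", "D", "C", "E", "D"],
})
data = pd.concat([base] * 5, ignore_index=True)

features = ["DISTRICT", "TOTAL_FLOOR_AREA", "MAIN_HEATING"]
X, y = data[features], data["CURRENT_ENERGY_RATING"]

# One-hot encode the categorical columns, pass the floor area through untouched.
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"),
      ["DISTRICT", "MAIN_HEATING"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Grid search over the tree depth, scored by cross-validated accuracy.
search = GridSearchCV(
    pipeline,
    param_grid={"tree__max_depth": [2, 4, 6]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Estimate the rating of a new, hypothetical household.
new_home = pd.DataFrame([{
    "DISTRICT": "E1", "TOTAL_FLOOR_AREA": 50.0, "MAIN_HEATING": "gas boiler",
}])
print(search.predict(new_home))
```

Keeping the encoder inside the pipeline means that the same preprocessing is applied during the grid search and at prediction time, which is the kind of bookkeeping the DSS interface handles for you.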

A good point is the visualization of the results: for example, a decision tree is broken down by this tool in a way that is really easy to understand.

Img 9 illustration

After you find the right model there is an API builder to embed the model. In this space you can define some test requests to see the model in action.

Giphy GIF

The feature to expose the model is not activated in the free version (or I missed how to use it), but it seems pretty intuitive.

Now it’s time to conclude.

Feedback on the project

I highly recommend that people working in or interested in the energy sector dive into the EPC data, because it is a very good source of knowledge on the housing market in the UK. In general, and it is very painful for me to say this as the most French guy abroad, the UK government is doing a great job collecting and sharing data, and there are very interesting datasets on the government platform that could be used by data scientists (in France we are very late on this topic, but things are changing slowly).

As for Dataiku DSS, it is a great tool for data scientists, experienced or not. I can feel that this tool has been designed by data scientists for data scientists, and there are many features that I didn’t use, such as the collaboration features, deep learning, etc. There are multiple in-house functions that make the data processing easier, which is really cool, but it could become a burden if, for example, Dataiku decides to drop these features (or make them premium). If the data people don’t know how to do a join or a group by themselves, migrating the data pipeline could then be difficult, but I really like the fact that Dataiku doesn’t lock the user into its in-house functions and allows other ways to manipulate data (with SQL, for example).

In my day-to-day job, can DSS meet MY needs? NO, because I currently have multiple tools at my disposal to do my job, and I need flexibility on the data side and on the development side to experiment and deploy things. But this tool is definitely worth trying, because it can meet the needs of data teams who don’t have my needs (and they are numerous around the world).