Build a data pipeline on AWS

I started this project as an echo of the Kaggle competition related to PUBG, where the goal was to predict a player's rank in a match based on some end-of-match statistics (like the distance run during the match). I wanted to understand the data source (to see if some stats or information were intentionally missing) and try some Amazon services like AWS Glue and Athena to build a data pipeline on AWS.

In this article, I will explain:

  • What PUBG is and the principle of the game
  • The approach of the project
  • The collection and the cleaning of the data

Description of PUBG

PUBG is one of the flagships (the first and one of the most popular) of a video game genre called battle royale. The principle is that around 100 players are dropped on an island without equipment, and they have to survive on this island by collecting materials, weapons, equipment and vehicles.

To push people to fight, the available area shrinks as the game progresses, so this format forces players to confront each other because in the end there can be only one survivor. Two very popular games carry the movement, PUBG and Fortnite, and they are available on all possible platforms (PUBG on console is not good). Now all the other popular games (Call of Duty, Battlefield) want to add their own battle royale mode, but let's be honest, Fortnite is crushing everybody (check the Google Trends comparison of PUBG vs Fortnite).

My opinion on this kind of game is that the principle is cool, you can get some amazing, tense situations during a game, but I think you cannot sell (at full price) a game that is just battle royale. Epic has been smart with Fortnite: they turned a project that took a long time to produce and was not selling well into something very lucrative by offering a free standalone application that uses the same assets as the original game but focuses only on the battle royale mode (I hope the person who had the idea got a raise).

Let's see the data collection project.

Approach of the project

The idea behind this project is to:

  • Build a system to collect the data from the API provided by PUBG Corporation
  • Clean the data
  • Make all the data available without downloading everything on my machine

To build the pipeline I decided to use AWS because I am more comfortable with their services than with Microsoft's or Google's, but I am sure the same system can be built on those platforms too.

To complete the task, I decided to use the following services:

  • An EC2 instance to have a machine running to do the collection
  • S3 to store the collected and processed data
  • Glue to clean the data and make it available
  • Athena to be the interface to the data stored in S3

Here is a high-level view of the process deployed on AWS with the different steps.

Let’s see the process in detail.

Collect the data from the PUBG API

PUBG Corporation has built a very good open API with multiple endpoints that open different data sources. To keep it simple, you can access data related to:

  • The PUBG players, identified by their accountid
  • The matches, identified by their matchid

There is data from multiple platforms (PC, Xbox and PS4) and from different regions (America, Europe and Asia). For this article I am going to focus on the PC platform and the North America region. There are multiple endpoints on this API, but let's focus on:

  • The sample endpoint, which gives access to some matchid values updated every day
  • The matches endpoint, which gives the details of a match, like the end results and, more importantly, the link to download a compressed packet of events (a minimal call sketch follows this list)
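As an illustration, here is a minimal sketch of calling those two endpoints with the requests library, assuming the public PUBG API base URL, the pc-na shard, an API key stored in an environment variable and the documented JSON structure of the sample response (all of these are assumptions to check against the official documentation):

```python
import os
import requests

# Assumed base URL, shard and header format, taken from the public PUBG API docs
BASE_URL = "https://api.pubg.com/shards/pc-na"
HEADERS = {
    "Authorization": f"Bearer {os.environ['PUBG_API_KEY']}",
    "Accept": "application/vnd.api+json",
}

def get_sample_match_ids():
    """Fetch the daily sample and return the list of matchid values."""
    resp = requests.get(f"{BASE_URL}/samples", headers=HEADERS)
    resp.raise_for_status()
    matches = resp.json()["data"]["relationships"]["matches"]["data"]
    return [m["id"] for m in matches]

def get_match_details(match_id):
    """Fetch the details of one match (end stats + link to the telemetry packet)."""
    resp = requests.get(f"{BASE_URL}/matches/{match_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()
```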

You can find a repository with the functions that I built to collect, clean and send the data. This notebook gives good insight into the collected data. There are three types of data:

  • The end-of-match stats, stored in a “folder” in an S3 bucket
  • The complete events, stored in another folder
  • The decomposed events, stored in separate folders

Regarding the packets of events, there are multiple event types available on the API, and a schema contains the details of these events.

I scheduled a script to run every day on my EC2 instance that triggers the acquisition of new data from the sample endpoint and stores it in the right folder and in a day folder, as sketched below.
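A minimal sketch of that daily storage step, assuming a bucket name and folder layout chosen purely for illustration (the real names live in the repository):

```python
import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "pubg-data-pipeline"  # assumed bucket name, for illustration only

def store_daily_file(local_path, data_type):
    """Upload a local .gz file under <data_type>/<day>/ in the bucket."""
    day = datetime.date.today().isoformat()  # e.g. "2019-01-27"
    key = f"{data_type}/{day}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, BUCKET, key)
    return key

# The script itself is triggered by a daily cron entry on the EC2 instance, e.g.
# store_daily_file("/tmp/endmatchstats_batch1.csv.gz", "endmatchstats")
```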

To keep the general information, the details of each match (like the number of players, the map name etc.) are collected into a DynamoDB table.

Some advice I can give when you are storing the data (a sketch follows the list):

  • Store it as proper .gz files
  • Be explicit about the delimiter, escape char and quote char
  • Drop the index
  • For DynamoDB, convert your float data to Decimal
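Here is a small sketch of those points with pandas and boto3 (the table name, delimiter choice and file path are assumptions for the example):

```python
import json
from decimal import Decimal

import boto3
import pandas as pd

def save_events_csv(df: pd.DataFrame, path: str):
    """Write a gzip-compressed CSV with explicit delimiter/quote/escape chars and no index."""
    df.to_csv(
        path,                # e.g. "/tmp/events.csv.gz"
        sep=";",
        quotechar='"',
        escapechar="\\",
        index=False,         # drop the index
        compression="gzip",
    )

def put_match_details(item: dict, table_name: str = "pubg_match_details"):
    """Store match details in DynamoDB; floats must be converted to Decimal first."""
    table = boto3.resource("dynamodb").Table(table_name)
    safe_item = json.loads(json.dumps(item), parse_float=Decimal)
    table.put_item(Item=safe_item)
```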

So that’s a simple overview of how I collected the data; now let’s have a look at the cloud part.

Crawl the data in S3 and DynamoDB with Glue (so much name dropping)

The AWS Glue dashboard is organized in two sections, the data catalog part and the ETL part (I will focus on the upper part of the dashboard first).

Let’s focus on the data catalog first.

Data catalog and Crawler part

In this part you can find a databases tab that contains all the databases and associated tables that you created with Glue; for this project I created my first database. The interesting part is the crawlers tab, where you can set up the crawlers that will navigate S3, DynamoDB and some other AWS services (like RDS). The previous figure shows a list of all the crawlers that I created for this project. Let’s see what’s inside a crawler.

So when you set up your crawler, you have to:

  • Find a name for your crawler
  • Define the data sources; in this case I am going to focus on three events and my DynamoDB table
  • Define an AWS role for your crawler so it has access to all the data sources to be crawled
  • Define the execution frequency of the crawler; I decided to make it run once per week
  • Define the output of the crawler

I highly recommend ticking the box Update all new and existing partitions with metadata from the table to avoid partition issues when reading the data if, like me, you do not control what you are receiving. A sketch of the same setup through the API follows.
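For reference, here is roughly what that crawler setup can look like through boto3 (the crawler name, role ARN, paths and weekly schedule are assumptions; the Configuration JSON is, as far as I know, the API equivalent of the console tickbox mentioned above):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="pubg-events-crawler",                              # assumed name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # assumed role ARN
    DatabaseName="pubg",                                     # the Glue database created earlier
    Targets={
        "S3Targets": [{"Path": "s3://pubg-data-pipeline/endmatchstats/"}],
        "DynamoDBTargets": [{"Path": "pubg_match_details"}],
    },
    Schedule="cron(0 3 ? * MON *)",                          # once per week
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
    # Equivalent of "Update all new and existing partitions with metadata from the table"
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)
```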

And voila, you just have to run the crawler from the main page of AWS Glue and you now have access to the data extracted by the crawler in Athena (a SQL way to access the data).

But the data is still quite raw, so I decided to add a new layer after the crawler to do some data cleaning and add some information.

ETL part

The ETL part is the data processing part of the pipeline. I could have analyzed the data as it arrived in the S3 bucket, but I noticed that some of the records were not good (let's say corrupted), so I wanted to apply a transformation stage to this data to:

  • Drop the corrupted records (with a wrong matchid format)
  • Make a join with the match details collected in the DynamoDB table (and crawled previously)
  • Calculate the delta time between the event and the beginning of the match (in seconds and minutes)
  • Do some simple string manipulation depending on the event (like cleaning the weapon name)

Here is an illustration of the code, and you can find the full piece of code in the repository; a simplified sketch of the transformation follows.
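As a rough sketch of what such a Glue job can look like (the database, table, column names and output path are assumptions; the real job is in the repository):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled tables from the data catalog (assumed database/table names)
events = glue_context.create_dynamic_frame.from_catalog(
    database="pubg", table_name="playerkill"
).toDF()
matches = glue_context.create_dynamic_frame.from_catalog(
    database="pubg", table_name="pubg_match_details"
).toDF()

# Drop corrupted records: keep only well-formed matchid values (assumed UUID-like format)
events = events.filter(F.col("matchid").rlike(r"^[0-9a-f-]{36}$"))

# Join the events with the match details collected in DynamoDB
enriched = events.join(matches, on="matchid", how="inner")

# Delta time between the event and the start of the match, in seconds and minutes
# (event_timestamp and match_start are assumed column names holding ISO timestamps)
enriched = enriched.withColumn(
    "delta_seconds",
    F.col("event_timestamp").cast("timestamp").cast("long")
    - F.col("match_start").cast("timestamp").cast("long"),
).withColumn("delta_minutes", F.col("delta_seconds") / 60)

# Simple string cleaning, e.g. stripping the internal prefix/suffix of the weapon name
enriched = enriched.withColumn(
    "weapon_clean", F.regexp_replace("weapon", r"^Item_Weapon_|_C$", "")
)

# Write the processed data back to S3 (assumed output path and format)
enriched.write.mode("overwrite").parquet("s3://pubg-data-pipeline/processed/playerkill/")
```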

I tried the code approach, but there is an option in the ETL configuration where you can take a more diagram-based approach to the data manipulation. As I wanted to use the RDD map and the Spark DataFrame, I decided not to use this feature, but I ran some simple tests and it works fine. So now we just need to build a crawler to update the data catalog with the new data generated by the ETL part.

The following schema shows the scheduling of the operations in AWS Glue.

And now let’s make a quick test on AWS to query the processed data from the web interface.

This data is accessible from Athena, but you can also connect AWS QuickSight to get a more Tableau-like experience (which I don't like, as I am not a big fan of all these BI tools that are too much of a black box for me) and, more importantly, you can access this data from a notebook (there is a copy of a Python environment in the repository), as sketched below.
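A minimal sketch of querying the processed tables from a notebook, assuming the pyathena package, an S3 staging location for Athena results and the database/table/column names used in the earlier sketches (all of them assumptions):

```python
import pandas as pd
from pyathena import connect

# Athena needs an S3 location to stage query results (assumed bucket/prefix and region)
conn = connect(
    s3_staging_dir="s3://pubg-data-pipeline/athena-results/",
    region_name="us-east-1",
)

# Example query on one of the processed event tables (assumed names)
query = """
SELECT weapon_clean, COUNT(*) AS kills
FROM pubg.processed_playerkill
GROUP BY weapon_clean
ORDER BY kills DESC
LIMIT 10
"""
df = pd.read_sql(query, conn)
print(df.head())
```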

That was a description of my data pipeline to collect data from the PUBG API; the system is collecting around 1000 matches per day, which is not a huge amount, but I wanted to start small.

And since I am a nice guy, you can find an extract of the tables that I built for this project (an extract of the data from 27th January 2019) in this Kaggle dataset.

I hope you enjoyed the reading, and if you have any feedback, don’t hesitate to comment.