Project description

Introduction

This next section introduces 'The Titanic survival project', another great classification problem, first presented as a Kaggle competition: a machine learning contest that, we think, no longer needs any introduction.

Through this step-by-step guide, we are going to see how easy it is to solve this Machine Learning problem with the help of SmartPredict.

The Titanic sailing majestically on her maiden voyage. (Credit: F.G.O. Stuart, 1843–1923)

Historical facts

In the early morning hours of 15 April 1912, the RMS Titanic ("Royal Mail Ship", also read as "Royal Mail Steamer"), one of the largest ships ever built, sank in the North Atlantic Ocean after colliding with an enormous iceberg.

In fact, of the roughly 2,224 passengers and crew aboard the infamous ship, only about 710 survived, mainly because of a shortage of lifeboats.

During the sinking, a set of criteria was used to decide who would be allowed to board the limited number of lifeboats, and that decision sealed the fate of each of these people.

Even if luck played a fair part for the survivors of the disaster, these well-established criteria undeniably contributed to their survival.

In fact, one of the main points is that passengers had been registered for the journey according to their economic and social status.

Passengers were distributed among the decks of the gigantic ship in three classes: first class for the upper class, second for the middle class, and third for the lower class. Their priority during the rescue was ranked accordingly. Furthermore, women were rescued before men. Age also played a crucial role, as children were rescued before adults.

All in all, women, children, and upper-class passengers were the groups most likely to survive.

🚢 The Titanic Machine Learning problem

In this Machine Learning tutorial, we are going to shed light on the outcome of that selection and estimate how likely a given passenger was to survive, based on a combination of features.

Our Titanic project dataset comprises three separate but related files: a training set, a testing set, and a gender submission file.

The training dataset

The training dataset gives the ground truth, i.e. the factual outcome of whether a passenger survived or not. This is expressed as the binary pair Survived/Not survived, represented by 1/0 respectively.

The file contains 12 columns, namely: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.

This provides us with enough parameters to play with, which makes the Titanic survival project a great Machine Learning problem to solve.
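To get a feel for that layout, here is a minimal sketch using a tiny stand-in frame with the same columns (two rows modeled on the public train.csv; with the real file you would simply call `pd.read_csv("train.csv")`):

```python
import pandas as pd

# Tiny stand-in with the same schema as Kaggle's train.csv.
# Replace with: train = pd.read_csv("train.csv") when using the real file.
train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived":    [0, 1],          # the binary target: 1 = survived, 0 = did not
    "Pclass":      [3, 1],
    "Name":        ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex":         ["male", "female"],
    "Age":         [22.0, 38.0],
    "SibSp":       [1, 1],
    "Parch":       [0, 0],
    "Ticket":      ["A/5 21171", "PC 17599"],
    "Fare":        [7.25, 71.2833],
    "Cabin":       [None, "C85"],   # often missing in the real data
    "Embarked":    ["S", "C"],
})

print(list(train.columns))
print(train["Survived"].value_counts().to_dict())
```

On the real file, `train.info()` and `train.head()` are the usual first calls to inspect the columns and spot missing values.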

The testing dataset

The testing dataset contains exactly the same columns as the training set, except for the outcome ("Survived/Not survived"), which is deliberately missing. That outcome is exactly what the model we are going to build will try to predict.

The Gender Submission file

The gender submission file is a template for formatting the results. It is only a provided example, which assumes that all survivors were female passengers; this is not an accurate historical fact. (We shall not actually use this file in our project.)
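For reference, a submission in that format is just two columns, PassengerId and a 0/1 prediction. A minimal sketch, assuming a hypothetical dict of model outputs keyed by PassengerId (892 is the first id in the real test set):

```python
import pandas as pd

# Hypothetical model predictions, keyed by PassengerId.
predictions = {892: 0, 893: 1, 894: 0}

# The expected submission layout: one row per test-set passenger.
submission = pd.DataFrame({
    "PassengerId": list(predictions.keys()),
    "Survived":    list(predictions.values()),
})
submission.to_csv("submission.csv", index=False)  # index=False keeps only the two columns
print(submission.to_string(index=False))
```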

🔎 A touch of data exploration: choosing the relevant features

Before we start creating our model, we need to organize our data a little in order to identify the relevant features. To do so, let us first study how much each piece of data influenced the passenger selection. This step is necessary for wisely choosing which features to include in our model.

As already stated, the passengers were categorized according to their socio-economic status, so we can directly conjecture that the 'Pclass' feature is relevant.

Just by looking at the rescue operation's outcome, we can already notice that first-class passengers were given priority, which means they had greater chances of survival.

We can also deduce that the higher the fare, the higher the social status of the person who paid it. Consequently, 'Fare' is also a feature we are going to consider.
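These conjectures are easy to check numerically: since the target is 0/1, the mean of 'Survived' within a group is that group's survival rate. The sketch below uses a tiny made-up sample, not the real file, purely to illustrate the technique:

```python
import pandas as pd

# Small illustrative sample (made up, not the real train.csv).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3, 1, 2],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "male", "female"],
    "Fare":     [7.25, 71.28, 53.10, 8.05, 26.00, 7.90, 51.86, 13.00],
})

# Mean of the 0/1 target per group = survival rate of that group.
print(df.groupby("Pclass")["Survived"].mean())
print(df.groupby("Sex")["Survived"].mean())

# Fare tracks class: higher classes paid higher average fares.
print(df.groupby("Pclass")["Fare"].mean())
```

Run on the real training set, the same three lines show the first-class and female survival rates well above the others, confirming 'Pclass', 'Sex', and 'Fare' as relevant.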

Then, as a matter of fact, women and children were granted priority. Consequently, 'Age' and 'Sex' are two other decisive features.

In the rush, passengers who were accompanied were more likely to escape, since they often helped one another. Those who traveled alone were therefore at a disadvantage compared with those who traveled with family (parents and children) or with a spouse. At the same time, those who traveled in a group tended to share the same fate, since the members of a group generally belonged to the same class as their peers.

We should then include 'Parch' and 'SibSp' as features.

Finally, the remaining features do not seem to influence the outcome much, even though exceptions cannot be ruled out entirely. We may therefore choose to "ignore" them, so that they do not clutter our model unnecessarily. Those features are:

  • 'PassengerId'

  • 'Name'

  • 'Cabin'

  • 'Ticket'

  • and 'Embarked'.


To recapitulate, the parameters that impacted the passengers' survival are:

  • the passengers' class, also expressed through their fare amount,

  • being a child or a woman, represented respectively by age and sex (female),

  • and being accompanied or not, i.e. traveling with one's parents, siblings, or a spouse.

This means that, among all the features, the following are the relevant ones, i.e. those on which our predictor model will be based and trained:

  1. Pclass

  2. Fare

  3. Sex

  4. Age

  5. Parch

  6. SibSp
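Carving these six features (plus the target) out of the training frame can be sketched as follows; the two-row `train` stand-in is only there so the snippet runs on its own, and the 0/1 mapping for 'Sex' is one simple way to make that column numeric:

```python
import pandas as pd

FEATURES = ["Pclass", "Fare", "Sex", "Age", "Parch", "SibSp"]

# Minimal stand-in for the real training frame (use pd.read_csv("train.csv")).
train = pd.DataFrame({
    "PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1],
    "Name": ["A", "B"], "Sex": ["male", "female"], "Age": [22.0, 38.0],
    "SibSp": [1, 1], "Parch": [0, 0], "Ticket": ["x", "y"],
    "Fare": [7.25, 71.28], "Cabin": [None, "C85"], "Embarked": ["S", "C"],
})

# Keep only the six relevant features; the rest are ignored.
X = train[FEATURES].copy()
y = train["Survived"]

# Most models need numeric inputs, so encode 'Sex' as 0/1.
X["Sex"] = X["Sex"].map({"male": 0, "female": 1})
print(X.columns.tolist())
```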