HR Analytics : Hackathon Challenge

I participated in the WNS Analytics Wizard hackathon, “To predict whether an employee will be promoted or not”, and this blog-post presents the solution I submitted, which ranked 138th (top 11%) in the challenge. The leaderboard ranking was decided by the F1-score, which is the harmonic mean of precision and recall.
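
For reference, the F1-score is 2 * precision * recall / (precision + recall). A tiny illustration with sklearn (the labels below are made up, not from the hackathon data):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]   # toy ground truth
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]   # toy predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
print(2 * p * r / (p + r))            # harmonic mean computed by hand
print(f1_score(y_true, y_pred))       # same value from sklearn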

About Data
The data-set consists of 54808 rows, where each row has 14 attributes including the target variable (i.e “is_promoted”). There are 4668 cases (8.5%) where employees have been promoted. The data-set is provided in the GitHub link here.

Let’s get started with building the data analytics pipeline end to end.

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split,cross_val_score

import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

# Set all options
%matplotlib inline
plt.style.use('seaborn-notebook')
plt.rcParams["figure.figsize"] = (20, 3)
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(context="paper", font="monospace")

User Defined Functions

def convert_categorical_to_dummies(d_convert):

    """
    Author: Abhijeet Kumar
    Description: returns Dataframe with all categorical variables converted into dummies
    Arguments: Dataframe (having categorical variables)
    """

    df = d_convert.copy()
    for col in d_convert.columns:
        if df[col].dtype == 'object':
            # Append dummy columns for this categorical variable, then drop the original column
            df = pd.concat([df,pd.get_dummies(df[col],prefix=col,prefix_sep='_', drop_first=False)], axis=1)
            df = df.drop(col,axis=1)
    return df
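
As a quick sanity check, this is how the helper behaves on a tiny made-up frame (the column names and values here are illustrative only):

toy = pd.DataFrame({'dept': ['HR', 'Tech', 'HR'], 'age': [30, 41, 25]})
print(convert_categorical_to_dummies(toy).columns.tolist())
# ['age', 'dept_HR', 'dept_Tech'] -- the 'dept' column is replaced by its dummy columns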

def quality_report(df):

    """
    Author: Abhijeet Kumar
    Description: Displays quality of data in terms of missing values, unique numbers, datatypes etc.
    Arguments: Dataframe
    """
    dtypes = df.dtypes
    nuniq = df.nunique()
    total = df.isnull().sum().sort_values(ascending = False)
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
    quality_df = pd.concat([total, percent, nuniq, dtypes], axis=1, keys=['Total', 'Percent','Nunique', 'Dtype'])
    display(quality_df)

def score_on_test_set(model, file_name, out_name):

    """
    Author: Abhijeet Kumar
    Description : Runs the same preprocessing steps as in training, scores
    the test data provided in the hackathon and generates the submission file.
    Arguments : model, test data file, submission file name
    """

    test_data = pd.read_csv(file_name)

    # Treating the missing values of education as a separate category
    test_data['education'] = test_data['education'].replace(np.NaN, 'NA')

    # Treating the missing values of previous year rating as 0
    test_data['previous_year_rating'] = test_data['previous_year_rating'].fillna(0)

    # Creating dummy variables for all the categorical columns, dropping the original columns
    master_test_data = convert_categorical_to_dummies(test_data)

    # Removing the id attributes
    df_test_data = master_test_data.drop(['employee_id'],axis=1)
    if out_name == "submission_lightgbm.csv":
        y_pred = model.predict_proba(df_test_data.values, num_iteration=model.best_iteration_)
    else:
        y_pred = model.predict_proba(df_test_data.values)
    submission_df = pd.DataFrame({'employee_id':master_test_data['employee_id'],'is_promoted':y_pred[:,1]})
    submission_df.to_csv(out_name, index=False)

    score = model.predict_proba(df_test_data.values)
    return test_data,score

Reading Data

data = pd.read_csv("train.csv")
print("Shape of Data = ",data.shape)
data.sample(5)
Shape of Data =  (54808, 14)

Checking the event rate

plt.figure(figsize=(6,3))
sns.countplot(x='is_promoted',data=data)
plt.show()

# Checking the event rate : event is when an employee is promoted
data['is_promoted'].value_counts()

[Figure: count plot of is_promoted]

0    50140
1     4668
Name: is_promoted, dtype: int64

Displaying the attributes

# Checking the attribute names
pd.DataFrame(data.columns)

 

0 employee_id
1 department
2 region
3 education
4 gender
5 recruitment_channel
6 no_of_trainings
7 age
8 previous_year_rating
9 length_of_service
10 KPIs_met >80%
11 awards_won?
12 avg_training_score
13 is_promoted

Checking Data Quality

# checking missing data
quality_report(data)

 

Attributes Total Percent Nunique Dtype
KPIs_met >80% 0 0.0000 2 int64
age 0 0.0000 41 int64
avg_training_score 0 0.0000 61 int64
awards_won? 0 0.0000 2 int64
department 0 0.0000 9 object
education 2409 4.3953 3 object
employee_id 0 0.0000 54808 int64
gender 0 0.0000 2 object
is_promoted 0 0.0000 2 int64
length_of_service 0 0.0000 35 int64
no_of_trainings 0 0.0000 10 int64
previous_year_rating 4124 7.5244 5 float64
recruitment_channel 0 0.0000 3 object
region 0 0.0000 34 object

Missing Value Treatment

# Treating the missing values of education as a separate category
data['education'] = data['education'].replace(np.NaN, 'NA')

# Treating the missing values of previous year rating as 0
data['previous_year_rating'] = data['previous_year_rating'].fillna(0)

Looking at attributes (EDA)

Can we make some inferences from EDA ?

  • Promotions are lowest in the Legal department (5.1%) and highest in the Technology department (10.7%).
  • Region_9 is the worst (1.9%) and region_4 the best (14.4%) in terms of promotions.
  • Although Master’s & above has a higher promotion percentage, the difference is not large.
  • Employees with a previous year rating of 5 have a much better chance of promotion (16.4%) than others.
  • Employees meeting more than 80% of their KPIs have a good chance of promotion (16.9%).
  • Employees who have won awards are promoted far more often (44%).

for col in data.drop('is_promoted',axis=1).columns:
    if data[col].dtype == 'object' or data[col].nunique() < 10:
        xx = data.groupby(col)['is_promoted'].value_counts().unstack(1)
        per_not_promoted = xx.iloc[:, 0] *100/xx.apply(lambda x: x.sum(), axis=1)
        per_promoted = xx.iloc[:, 1]*100/xx.apply(lambda x: x.sum(), axis=1)
        xx['%_0'] = per_not_promoted
        xx['%_1'] = per_promoted
        display(xx)

 

is_promoted 0 1 %_0 %_1
department
Analytics 4840 512 90.4335 9.5665
Finance 2330 206 91.8770 8.1230
HR 2282 136 94.3755 5.6245
Legal 986 53 94.8989 5.1011
Operations 10325 1023 90.9852 9.0148
Procurement 6450 688 90.3614 9.6386
R&D 930 69 93.0931 6.9069
Sales & Marketing 15627 1213 92.7969 7.2031
Technology 6370 768 89.2407 10.7593

 

is_promoted 0 1 %_0 %_1
region
region_1 552 58 90.4918 9.5082
region_10 597 51 92.1296 7.8704
region_11 1241 74 94.3726 5.6274
region_12 467 33 93.4000 6.6000
region_13 2418 230 91.3142 8.6858
region_14 765 62 92.5030 7.4970
region_15 2586 222 92.0940 7.9060
region_16 1363 102 93.0375 6.9625
region_17 687 109 86.3065 13.6935
region_18 30 1 96.7742 3.2258
region_19 821 53 93.9359 6.0641
region_2 11354 989 91.9874 8.0126
region_20 801 49 94.2353 5.7647
region_21 393 18 95.6204 4.3796
region_22 5694 734 88.5812 11.4188
region_23 1038 137 88.3404 11.6596
region_24 490 18 96.4567 3.5433
region_25 716 103 87.4237 12.5763
region_26 2117 143 93.6726 6.3274
region_27 1528 131 92.1037 7.8963
region_28 1164 154 88.3156 11.6844
region_29 951 43 95.6740 4.3260
region_3 309 37 89.3064 10.6936
region_30 598 59 91.0198 8.9802
region_31 1825 110 94.3152 5.6848
region_32 905 40 95.7672 4.2328
region_33 259 10 96.2825 3.7175
region_34 284 8 97.2603 2.7397
region_4 1457 246 85.5549 14.4451
region_5 731 35 95.4308 4.5692
region_6 658 32 95.3623 4.6377
region_7 4327 516 89.3454 10.6546
region_8 602 53 91.9084 8.0916
region_9 412 8 98.0952 1.9048

 

is_promoted 0 1 %_0 %_1
education
Bachelor’s 33661 3008 91.7969 8.2031
Below Secondary 738 67 91.6770 8.3230
Master’s & above 13454 1471 90.1441 9.8559
NA 2287 122 94.9357 5.0643

 

is_promoted 0 1 %_0 %_1
gender
f 14845 1467 91.0066 8.9934
m 35295 3201 91.6849 8.3151

 

is_promoted 0 1 %_0 %_1
recruitment_channel
other 27890 2556 91.6048 8.3952
referred 1004 138 87.9159 12.0841
sourcing 21246 1974 91.4987 8.5013

 

is_promoted 0 1 %_0 %_1
previous_year_rating
0.0000 3785 339 91.7798 8.2202
1.0000 6135 88 98.5859 1.4141
2.0000 4044 181 95.7160 4.2840
3.0000 17263 1355 92.7221 7.2779
4.0000 9093 784 92.0624 7.9376
5.0000 9820 1921 83.6385 16.3615

 

is_promoted 0 1 %_0 %_1
KPIs_met >80%
0 34111 1406 96.0413 3.9587
1 16029 3262 83.0906 16.9094

 

is_promoted 0 1 %_0 %_1
awards_won?
0 49429 4109 92.3251 7.6749
1 711 559 55.9843 44.0157

 

Preparing Data for Modeling

# Creating dummy variables for all the categorical columns, dropping the original columns
master_data = convert_categorical_to_dummies(data)
print("Total shape of Data :",master_data.shape)

# Extracting the target variable
labels = np.array(master_data['is_promoted'].tolist())

# Removing the id attributes
df_data = master_data.drop(['is_promoted','employee_id'],axis=1)
print("Shape of Data:",df_data.shape)
df = df_data.values
Total shape of Data : (54808, 61)
Shape of Data: (54808, 59)

Model 1 – XGB Classifier

xgb_model = xgb.XGBClassifier()
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1')
print("F1-score = ",f1_scores," Mean F1 score = ",np.mean(f1_scores))

# Training the models
xgb_model.fit(df,labels)

# Scoring on test set
test_data,score_xgb = score_on_test_set(xgb_model,"test.csv","submission_xgb.csv")
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1-score = [ 0.4526749   0.41547519  0.43579122  0.43012552  0.43427621]  
Mean F1 score =  0.433668606717

XGB Classifier : Parameter Tuning

Our goal is to set the model’s hyper-parameters to values that let it learn the task as well as possible, so the next step is to tune the XGBoost classifier towards the parameter values that make the algorithm perform best.
I patiently ran a lot of iterations to fine-tune the parameters n_estimators, max_depth and the L1 regularization term (reg_alpha). A common practice is to take baby steps while learning (a small learning rate) and tune the remaining parameters. Here, I found that the F1-score kept improving with a large number of trees (n_estimators).

# Create parameters to search
params = {
     'learning_rate': [0.01],
     'n_estimators': [900,1000,1100],
     'max_depth':[7,8,9],
     'reg_alpha':[0.3,0.4,0.5]
    }

# Initializing the XGBoost Classifier
xgb_model = xgb.XGBClassifier()

# Grid search initialization
gsearch = GridSearchCV(xgb_model, params,
                    verbose=True,
                    cv=5,
                    n_jobs=2)

gsearch.fit(df, labels)

#Printing the best chosen params
print("Best Parameters :",gsearch.best_params_)

params = {'objective':'binary:logistic', 'booster':'gbtree'}

# Updating the parameter as per grid search
params.update(gsearch.best_params_)

# Initializing the XGBoost Classifier with the tuned parameters
xgb_model = xgb.XGBClassifier(**params)
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1',n_jobs=2)
print("F1_scores per fold : ",f1_scores," \nMean F1_score= ",np.mean(f1_scores))

# Fitting model on tuned parameters
xgb_model.fit(df, labels)

# Scoring on test set
test_data,score_xgb_tuned = score_on_test_set(xgb_model,"test.csv","submission_xgb_tuned.csv")
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed: 13.0min finished
Best Parameters :{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 1000, 'reg_alpha': 0.4}

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.4, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1_scores per fold : [ 0.51014041  0.48657188  0.49528302  0.53054911  0.51130164]  
Mean F1_score= 0.506769210361

XGB Classifier : Setting threshold

How does the XGBoost classifier predict the class (‘promoted’ or ‘not promoted’)? It predicts a probability between 0 and 1 for each unseen case and then assigns label 1 whenever that probability exceeds the default threshold of 0.5. On an imbalanced data-set like this one, that default can be a biased setting, as the rare event is hard to capture with a 0.5 cut-off.

  • We can change the default threshold of 0.5 by finding the optimal threshold that maximizes the F1-score.
  • We need to find the threshold at which the F1-score is highest.
  • I tried submissions at a few cut-offs around the optimum to get the best possible F1-score.

The following python code splits the data 90:10 and trains the XGBoost classifier with the tuned parameters on the larger split. It calculates precision and recall at different thresholds on the hold-out set, plots the precision-recall curve, and then computes the F1-score at each threshold from the precision and recall values.

# Splitting the dataset to get a hold-out set for threshold selection
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.10, stratify=labels)

xgb_model = xgb.XGBClassifier(**params)

# Training the models
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])

thresholds = np.append(thresholds, 1)
f1_scores = 2*(precision*recall)/(precision+recall)
plt.step(recall, precision, color='b', alpha=0.4, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.show()

[Figure: 2-class Precision-Recall curve]

Getting optimal threshold

We plot the F1-score against the threshold on the x-axis to locate the peak. The python code below picks the threshold value at which the F1-score is highest.

scrs = pd.DataFrame({'precision' : precision, 'recall' : recall, 'thresholds' : thresholds, 'f1_score':f1_scores})
print("Threshold cutoff: ",scrs.loc[scrs['f1_score'] == scrs.f1_score.max(),'thresholds'].iloc[0])
print("Max F1-score at cut-off : ",scrs.f1_score.max())
scrs.plot(x='thresholds', y='f1_score')
Threshold cutoff:  0.340377241373
Max F1-score at cut-off :  0.53791130186

[Figure: F1-score vs. threshold cut-off]

Once you get the optimal threshold, use it as the cut-off on the test-set probability predictions to assign the final class labels 0 and 1 for submission, as in the sketch below.
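
A minimal sketch of that final step, assuming the probabilities returned by score_on_test_set for the tuned model (the cut-off is the one found above; the output file name is just an example):

optimal_threshold = 0.3404   # taken from the F1-score vs. threshold analysis above
final_labels = (score_xgb_tuned[:, 1] > optimal_threshold).astype(int)
final_df = pd.DataFrame({'employee_id': test_data['employee_id'],
                         'is_promoted': final_labels})
final_df.to_csv("submission_xgb_threshold.csv", index=False)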

What did not work ?

I tried the following other techniques which did not work, and hence my final submission was based on the single “XGBoost classifier” model described in this post.

  • I tried logistic regression and SVM; the F1-score was low (less than 0.4).
  • I tried Random Forest; the F1-score was comparatively low.
  • I tried a LightGBM model. With default settings it gave an F1-score of 0.50, but somehow it did not improve with parameter tuning. There was a little improvement when early_stopping_rounds was used, taking the best iteration for predictions on the test set.
  • I created some interaction variables, e.g. 1 if previous_year_rating == 5 and KPIs_met >80% == 1 else 0, and 1 if awards_won? == 1 and KPIs_met >80% == 1 else 0 (see the sketch after this list). It did not help.
  • Finally, I took the best tuned parameters of all three models (RF, XGBoost and LightGBM) and stacked them with logistic regression as the meta-classifier. It did not give a better F1-score than the individual XGB Classifier model.
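
For concreteness, the interaction features mentioned above can be built like this (the new column names are illustrative only; as noted, they did not improve my score):

# Hypothetical interaction features (built after the missing-value treatment above)
data['rating5_and_kpi'] = ((data['previous_year_rating'] == 5) &
                           (data['KPIs_met >80%'] == 1)).astype(int)
data['award_and_kpi'] = ((data['awards_won?'] == 1) &
                         (data['KPIs_met >80%'] == 1)).astype(int)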

At the End

Readers are encouraged to download the data-set and check whether they can reproduce the results. I would also love to hear in the comments if you can surpass the F1-score achieved in this blog-post. Here are a few other things one can try.

  • Generally, stacking improves scores when there are a lot of models. One could train, say, hundreds of XGBoost and LightGBM models (with slightly different parameters) and then apply logistic regression on top of them (I tried with only 3 models and it did not help).
  • Also, one can try an interaction variable for the total score achieved in training (no_of_trainings * avg_training_score).
  • One can try setting “early_stopping_rounds” while training the XGBoost classifier, which I did not try; it prevents over-fitting and can improve results (a rough sketch follows this list).
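
A rough sketch of how early stopping could look with the scikit-learn wrapper (the validation split, metric and number of rounds are arbitrary choices I did not test in the hackathon):

X_tr, X_val, y_tr, y_val = train_test_split(df, labels, test_size=0.1, stratify=labels)

es_model = xgb.XGBClassifier(**params)
es_model.fit(X_tr, y_tr,
             eval_set=[(X_val, y_val)],
             eval_metric='logloss',
             early_stopping_rounds=50,
             verbose=False)

# Boosting round with the best validation score (attribute name may vary by xgboost version)
print(es_model.best_iteration)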

The full implementation of the approach followed, along with a LightGBM model example (jupyter notebook), can be downloaded from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy data analytics 🙂