Table of Contents
- 1 Importing Libraries
- 2 User Defined Functions
- 3 Reading Data
- 4 Displaying the attributes
- 5 Checking Data Quality
- 6 Missing Value Treatment
- 7 Looking at attributes (EDA)
- 8 Preparing Data for Modeling
- 9 Model 1 – XGB Classifier
HR Analytics : Hackathon Challenge
I participated in the WNS Analytics Wizard hackathon, “To predict whether an employee will be promoted or not”, and this blog post walks through the solution I submitted, which ranked 138th (top 11%) in the challenge. The leaderboard ranking was decided by the F1-score, which is the harmonic mean of precision and recall.
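For reference, the F1-score can be computed directly from precision and recall. The tiny snippet below uses made-up numbers purely as an illustration; it mirrors the formula used later in the post to compute F1 from the precision-recall curve.

# Illustration only: F1 is the harmonic mean of precision and recall
# (the numbers below are made up, not results from the hackathon)
precision, recall = 0.60, 0.45
f1 = 2 * precision * recall / (precision + recall)
print("F1-score =", round(f1, 4))   # 0.5143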
About Data
The data-set consists of 54808 rows, each with 14 attributes including the target variable (“is_promoted”). There are 4668 cases where employees have been promoted (8.5%). The data-set is provided in the GitHub link here.
Let’s get started in building the data analytics pipeline end to end.
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")

# Set all options
%matplotlib inline
plt.style.use('seaborn-notebook')
plt.rcParams["figure.figsize"] = (20, 3)
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(context="paper", font="monospace")
User Defined Functions
def convert_categorical_to_dummies(d_convert):
    """
    Author: Abhijeet Kumar
    Description: returns Dataframe with all categorical variables converted into dummies
    Arguments: Dataframe (having categorical variables)
    """
    df = d_convert.copy()
    list_to_drop = []
    for col in df.columns:
        if df[col].dtype == 'object':
            list_to_drop.append(col)
            df = pd.concat([df, pd.get_dummies(df[col], prefix=col, prefix_sep='_', drop_first=False)], axis=1)
    df = df.drop(list_to_drop, axis=1)
    return df

def quality_report(df):
    """
    Author: Abhijeet Kumar
    Description: Displays quality of data in terms of missing values, unique numbers, datatypes etc.
    Arguments: Dataframe
    """
    dtypes = df.dtypes
    nuniq = df.T.apply(lambda x: x.nunique(), axis=1)
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
    quality_df = pd.concat([total, percent, nuniq, dtypes], axis=1, keys=['Total', 'Percent', 'Nunique', 'Dtype'])
    display(quality_df)

def score_on_test_set(model, file_name, out_name):
    """
    Author: Abhijeet Kumar
    Description: It runs the same preprocessing steps as in training, scores on the test data
                 provided in the hackathon and generates the submission file.
    Argument: model, test data file, submission file
    """
    test_data = pd.read_csv(file_name)
    # Treating the missing values of education as a separate category
    test_data['education'] = test_data['education'].replace(np.NaN, 'NA')
    # Treating the missing values of previous year rating as 0
    test_data['previous_year_rating'] = test_data['previous_year_rating'].fillna(0)
    # Creating dummy variables for all the categorical columns, dropping that column
    master_test_data = convert_categorical_to_dummies(test_data)
    # Removing the id attribute
    df_test_data = master_test_data.drop(['employee_id'], axis=1)
    if out_name == "submission_lightgbm.csv":
        y_pred = model.predict_proba(df_test_data.values, num_iteration=model.best_iteration_)
    else:
        y_pred = model.predict_proba(df_test_data.values)
    submission_df = pd.DataFrame({'employee_id': master_test_data['employee_id'], 'is_promoted': y_pred[:, 1]})
    submission_df.to_csv(out_name, index=False)
    score = model.predict_proba(df_test_data.values)
    return test_data, score
Reading Data
data = pd.read_csv("train.csv")
print("Shape of Data = ", data.shape)
data.sample(5)
Shape of Data = (54808, 14)
Checking the event rate
plt.figure(figsize=(6,3))
sns.countplot(x='is_promoted', data=data)
plt.show()

# Checking the event rate : event is when an employee is promoted
data['is_promoted'].value_counts()
0    50140
1     4668
Name: is_promoted, dtype: int64
Displaying the attributes
# Checking the attribute names
pd.DataFrame(data.columns)
# | Attribute |
---|---|
0 | employee_id |
1 | department |
2 | region |
3 | education |
4 | gender |
5 | recruitment_channel |
6 | no_of_trainings |
7 | age |
8 | previous_year_rating |
9 | length_of_service |
10 | KPIs_met >80% |
11 | awards_won? |
12 | avg_training_score |
13 | is_promoted |
Checking Data Quality
# checking missing data
quality_report(data)
Attributes | Total | Percent | Nunique | Dtype |
---|---|---|---|---|
KPIs_met >80% | 0 | 0.0000 | 2 | int64 |
age | 0 | 0.0000 | 41 | int64 |
avg_training_score | 0 | 0.0000 | 61 | int64 |
awards_won? | 0 | 0.0000 | 2 | int64 |
department | 0 | 0.0000 | 9 | object |
education | 2409 | 4.3953 | 3 | object |
employee_id | 0 | 0.0000 | 54808 | int64 |
gender | 0 | 0.0000 | 2 | object |
is_promoted | 0 | 0.0000 | 2 | int64 |
length_of_service | 0 | 0.0000 | 35 | int64 |
no_of_trainings | 0 | 0.0000 | 10 | int64 |
previous_year_rating | 4124 | 7.5244 | 5 | float64 |
recruitment_channel | 0 | 0.0000 | 3 | object |
region | 0 | 0.0000 | 34 | object |
Missing Value Treatment
# Treating the missing values of education as a separate category
data['education'] = data['education'].replace(np.NaN, 'NA')

# Treating the missing values of previous year rating as 0
data['previous_year_rating'] = data['previous_year_rating'].fillna(0)
Looking at attributes (EDA)
Can we make some inferences from EDA?
- Promotion rates are lowest in the Legal department (5.1%) and highest in the Technology department (10.7%).
- Region 9 is worst (1.9%) and region 4 is best (14.4%) in terms of promotion rate.
- Although Master’s & above has a slightly higher promotion percentage, the difference is not large.
- Employees with a previous year rating of 5 have much better chances of promotion than others (16.4%).
- Employees meeting KPIs > 80% have good chances of promotion (16.9%).
- Employees winning awards are promoted far more often (44%).
for col in data.drop('is_promoted', axis=1).columns:
    # Only look at categorical columns or low-cardinality numeric columns
    if data[col].dtype == 'object' or data[col].nunique() < 10:
        xx = data.groupby(col)['is_promoted'].value_counts().unstack(1)
        per_not_promoted = xx.iloc[:, 0] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        per_promoted = xx.iloc[:, 1] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        xx['%_0'] = per_not_promoted
        xx['%_1'] = per_promoted
        display(xx)
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
department | ||||
Analytics | 4840 | 512 | 90.4335 | 9.5665 |
Finance | 2330 | 206 | 91.8770 | 8.1230 |
HR | 2282 | 136 | 94.3755 | 5.6245 |
Legal | 986 | 53 | 94.8989 | 5.1011 |
Operations | 10325 | 1023 | 90.9852 | 9.0148 |
Procurement | 6450 | 688 | 90.3614 | 9.6386 |
R&D | 930 | 69 | 93.0931 | 6.9069 |
Sales & Marketing | 15627 | 1213 | 92.7969 | 7.2031 |
Technology | 6370 | 768 | 89.2407 | 10.7593 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
region | ||||
region_1 | 552 | 58 | 90.4918 | 9.5082 |
region_10 | 597 | 51 | 92.1296 | 7.8704 |
region_11 | 1241 | 74 | 94.3726 | 5.6274 |
region_12 | 467 | 33 | 93.4000 | 6.6000 |
region_13 | 2418 | 230 | 91.3142 | 8.6858 |
region_14 | 765 | 62 | 92.5030 | 7.4970 |
region_15 | 2586 | 222 | 92.0940 | 7.9060 |
region_16 | 1363 | 102 | 93.0375 | 6.9625 |
region_17 | 687 | 109 | 86.3065 | 13.6935 |
region_18 | 30 | 1 | 96.7742 | 3.2258 |
region_19 | 821 | 53 | 93.9359 | 6.0641 |
region_2 | 11354 | 989 | 91.9874 | 8.0126 |
region_20 | 801 | 49 | 94.2353 | 5.7647 |
region_21 | 393 | 18 | 95.6204 | 4.3796 |
region_22 | 5694 | 734 | 88.5812 | 11.4188 |
region_23 | 1038 | 137 | 88.3404 | 11.6596 |
region_24 | 490 | 18 | 96.4567 | 3.5433 |
region_25 | 716 | 103 | 87.4237 | 12.5763 |
region_26 | 2117 | 143 | 93.6726 | 6.3274 |
region_27 | 1528 | 131 | 92.1037 | 7.8963 |
region_28 | 1164 | 154 | 88.3156 | 11.6844 |
region_29 | 951 | 43 | 95.6740 | 4.3260 |
region_3 | 309 | 37 | 89.3064 | 10.6936 |
region_30 | 598 | 59 | 91.0198 | 8.9802 |
region_31 | 1825 | 110 | 94.3152 | 5.6848 |
region_32 | 905 | 40 | 95.7672 | 4.2328 |
region_33 | 259 | 10 | 96.2825 | 3.7175 |
region_34 | 284 | 8 | 97.2603 | 2.7397 |
region_4 | 1457 | 246 | 85.5549 | 14.4451 |
region_5 | 731 | 35 | 95.4308 | 4.5692 |
region_6 | 658 | 32 | 95.3623 | 4.6377 |
region_7 | 4327 | 516 | 89.3454 | 10.6546 |
region_8 | 602 | 53 | 91.9084 | 8.0916 |
region_9 | 412 | 8 | 98.0952 | 1.9048 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
education | ||||
Bachelor’s | 33661 | 3008 | 91.7969 | 8.2031 |
Below Secondary | 738 | 67 | 91.6770 | 8.3230 |
Master’s & above | 13454 | 1471 | 90.1441 | 9.8559 |
NA | 2287 | 122 | 94.9357 | 5.0643 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
gender | ||||
f | 14845 | 1467 | 91.0066 | 8.9934 |
m | 35295 | 3201 | 91.6849 | 8.3151 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
recruitment_channel | ||||
other | 27890 | 2556 | 91.6048 | 8.3952 |
referred | 1004 | 138 | 87.9159 | 12.0841 |
sourcing | 21246 | 1974 | 91.4987 | 8.5013 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
previous_year_rating | ||||
0.0000 | 3785 | 339 | 91.7798 | 8.2202 |
1.0000 | 6135 | 88 | 98.5859 | 1.4141 |
2.0000 | 4044 | 181 | 95.7160 | 4.2840 |
3.0000 | 17263 | 1355 | 92.7221 | 7.2779 |
4.0000 | 9093 | 784 | 92.0624 | 7.9376 |
5.0000 | 9820 | 1921 | 83.6385 | 16.3615 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
KPIs_met >80% | ||||
0 | 34111 | 1406 | 96.0413 | 3.9587 |
1 | 16029 | 3262 | 83.0906 | 16.9094 |
is_promoted | 0 | 1 | %_0 | %_1 |
---|---|---|---|---|
awards_won? | ||||
0 | 49429 | 4109 | 92.3251 | 7.6749 |
1 | 711 | 559 | 55.9843 | 44.0157 |
Preparing Data for Modeling
# Creating dummy variables for all the categorical columns, dropping that column
master_data = convert_categorical_to_dummies(data)
print("Total shape of Data :", master_data.shape)

# Extracting the target from the dataset
labels = np.array(master_data['is_promoted'].tolist())

# Removing the target and id attributes
df_data = master_data.drop(['is_promoted', 'employee_id'], axis=1)
print("Shape of Data:", df_data.shape)
df = df_data.values
Total shape of Data : (54808, 61)
Shape of Data: (54808, 59)
Model 1 – XGB Classifier
xgb_model = xgb.XGBClassifier()
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1')
print("F1-score = ", f1_scores, " Mean F1 score = ", np.mean(f1_scores))

# Training the model
xgb_model.fit(df, labels)

# Scoring on test set
test_data, score_xgb = score_on_test_set(xgb_model, "test.csv", "submission_xgb.csv")
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1)
F1-score =  [ 0.4526749   0.41547519  0.43579122  0.43012552  0.43427621]
Mean F1 score =  0.433668606717
XGB Classifier : Parameter Tuning
Our goal is to set the model's hyper-parameters to values that let it learn the task as well as possible. Tuning the XGBoost classifier therefore means searching for the parameter values that have the biggest impact on model performance.
I patiently ran many iterations to fine-tune the parameters n_estimators, max_depth and the L1 regularization term (reg_alpha). A common practice is to take baby steps while learning (a small learning rate) and tune the other parameters around it. Here, I found that the F1-scores kept improving with a large number of trees (n_estimators).
# Create parameters to search
params = {
    'learning_rate': [0.01],
    'n_estimators': [900, 1000, 1100],
    'max_depth': [7, 8, 9],
    'reg_alpha': [0.3, 0.4, 0.5]
}

# Initializing the XGBoost Classifier
xgb_model = xgb.XGBClassifier()

# Gridsearch initialization
gsearch = GridSearchCV(xgb_model, params, verbose=True, cv=5, n_jobs=2)
gsearch.fit(df, labels)

# Printing the best chosen params
print("Best Parameters :", gsearch.best_params_)

params = {'objective': 'binary:logistic', 'booster': 'gbtree'}
# Updating the parameters as per grid search
params.update(gsearch.best_params_)

# Initializing the XGBoost Classifier with tuned parameters
xgb_model = xgb.XGBClassifier(**params)
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1', n_jobs=2)
print("F1_scores per fold : ", f1_scores, " \nMean F1_score= ", np.mean(f1_scores))

# Fitting model on tuned parameters
xgb_model.fit(df, labels)

# Scoring on test set
test_data, score_xgb_tuned = score_on_test_set(xgb_model, "test.csv", "submission_xgb_tuned.csv")
XGB Classifier : Setting threshold
How does the XGBoost classifier predict the class (‘promoted’ or ‘not promoted’)? It outputs a probability between 0 and 1 for each unseen case and then assigns the labels 0 and 1 using a default threshold of 0.5 (1 if probability > 0.5). On an imbalanced data-set like this one, that is a biased setting: a rare event is hard to capture with a 0.5 threshold.
- We can replace the default threshold of 0.5 with an optimal threshold in order to increase the F1-score.
- We need to find the threshold at which the F1-score is highest.
- I tried submissions at a few candidate cut-offs to get the maximum possible improvement in F1-score.
The following Python code splits the data 90:10 and trains the XGBoost classifier with the tuned parameters on the larger split. It calculates precision and recall at different thresholds on the held-out 10%, plots the precision-recall curve, and then computes the F1-score at each threshold from the precision and recall values.
# Splitting the dataset to evaluate thresholds on a held-out set
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.10, stratify=labels)

xgb_model = xgb.XGBClassifier(**params)

# Training the model
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:, 1])
thresholds = np.append(thresholds, 1)
f1_scores = 2 * (precision * recall) / (precision + recall)

plt.step(recall, precision, color='b', alpha=0.4, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.show()
Getting optimal threshold
We plot the F1-scores against the threshold on the x-axis to locate the F1-score peak. The Python code below gets the threshold value at which the F1-score is highest.
scrs = pd.DataFrame({'precision': precision, 'recall': recall, 'thresholds': thresholds, 'f1_score': f1_scores})
print("Threshold cutoff: ", scrs.loc[scrs['f1_score'] == scrs.f1_score.max(), 'thresholds'].iloc[0])
print("Max F1-score at cut-off : ", scrs.f1_score.max())
scrs.plot(x='thresholds', y='f1_score')
Threshold cutoff:  0.340377241373
Max F1-score at cut-off :  0.53791130186
Once you get the optimal threshold, use it as a cutoff on the test-set probability predictions to assign the class labels 0 and 1 for the final submission, as sketched below.
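Here is a minimal sketch of that last step. It assumes the tuned model has already been scored on the hackathon test file via score_on_test_set (so score_xgb_tuned holds the predicted probabilities and test_data the raw test rows); the output file name is a hypothetical example, and the submission format should match the hackathon's template.

# Minimal sketch (not part of the original notebook): apply the tuned threshold
# found on the validation split to the test-set probabilities.
optimal_threshold = 0.3404   # cut-off obtained from the precision-recall analysis above

# score_xgb_tuned[:, 1] holds P(is_promoted = 1) returned by score_on_test_set
final_labels = (score_xgb_tuned[:, 1] > optimal_threshold).astype(int)

final_submission = pd.DataFrame({'employee_id': test_data['employee_id'],
                                 'is_promoted': final_labels})
final_submission.to_csv("submission_xgb_tuned_threshold.csv", index=False)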
What did not work ?
I tried the following other techniques, which did not work, and hence my final submissions were based on the single “XGBoost classifier” model described in this post.
- I tried logistic regression and SVM; the F1-score was low (less than 0.4).
- I tried Random Forest. The F1-score was comparatively low.
- I tried the LightGBM model. With default settings it gave a 0.50 F1-score, but somehow it was not improving with parameter tuning. There was a little improvement when early_stopping_rounds was used, with the best iteration then taken for predictions on the test set (see the sketch after this list).
- I created some interaction variables, such as a flag that is 1 when previous_year_rating == 5 and KPIs_met >80% == 1 (0 otherwise), and another that is 1 when awards_won? == 1 and KPIs_met >80% == 1. They did not help.
- Finally, I took the best tuned parameters of all three models (RF, XGBoost and LightGBM) and stacked them with logistic regression as the meta-classifier. It did not give a better F1 than the individual XGB Classifier model.
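For readers who want to reproduce the LightGBM attempt, below is a minimal sketch of how early stopping could be wired in. It reuses df, labels and score_on_test_set from above; the 90:10 split, the AUC metric and the 50-round patience are illustrative assumptions, not the exact settings used in the hackathon. (With recent LightGBM versions, early stopping is passed via callbacks=[lgb.early_stopping(50)] instead of the fit arguments shown here.)

# Minimal LightGBM sketch with early stopping (illustrative settings)
X_tr, X_val, y_tr, y_val = train_test_split(df, labels, test_size=0.10, stratify=labels)

lgb_model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.01)
lgb_model.fit(X_tr, y_tr,
              eval_set=[(X_val, y_val)],
              eval_metric='auc',
              early_stopping_rounds=50,
              verbose=False)

# score_on_test_set uses model.best_iteration_ when the output file is
# "submission_lightgbm.csv", so the best iteration is used for the test predictions
test_data, score_lgb = score_on_test_set(lgb_model, "test.csv", "submission_lightgbm.csv")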
At the End
Readers are encouraged to download the data-set and check whether they can reproduce the results. I would also love to hear in the comments if you can surpass the F1-score achieved in this blog post. Here are a few other things one can try.
- Generally, stacking improves scores when there are a lot of models. One can train, say, hundreds of XGBoost and LightGBM models (with slightly different parameters) and then apply logistic regression on top of them (I tried with only 3 models and failed).
- One can also try an interaction variable for the total score achieved in training (no_of_trainings * avg_training_score).
- One can try setting “early_stopping_rounds” in XGBoost classifier training, which I did not try. It prevents over-fitting and can improve results; a rough sketch is shown after this list.
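As a hedged sketch of that last suggestion (not something used in the submitted solution), early stopping in the XGBoost sklearn wrapper could look roughly like this; the validation split size, the AUC metric and the 50-round patience are assumptions. Newer XGBoost versions expect early_stopping_rounds and eval_metric in the constructor rather than in fit.

# Illustrative sketch: early stopping for the XGBoost classifier
X_tr, X_val, y_tr, y_val = train_test_split(df, labels, test_size=0.10, stratify=labels)

xgb_es_model = xgb.XGBClassifier(**params)
xgb_es_model.fit(X_tr, y_tr,
                 eval_set=[(X_val, y_val)],
                 eval_metric='auc',
                 early_stopping_rounds=50,
                 verbose=False)

print("Best iteration :", xgb_es_model.best_iteration)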
The full implementation of the approach followed here, along with a LightGBM model example (Jupyter notebook), can be downloaded from the GitHub link here.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy data analytics 🙂
Hi, just a small question: when you’re using “convert_categorical_to_dummies” in two different places, you can’t be sure that the one-hot encoded columns will be in the same order in the train DF and in the test DF, right? Might this be the reason for the low results?
Nope, that won’t be the case here. The data-sets were consistent column-wise, and pandas does not create the dummies in a random order.
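For readers worried about this in general, a defensive pattern (not needed for this data-set, and not part of the original solution) is to align the test dummies to the training feature columns explicitly, for example with reindex. The snippet below assumes data, test_data and df_data are defined as earlier in the post.

# Hypothetical safeguard (not in the original solution): force the test dummy
# columns to match the training feature columns in both presence and order.
master_test_data = convert_categorical_to_dummies(test_data)

# Any category missing in the test set gets a zero-filled column; extra columns are dropped
df_test_aligned = master_test_data.reindex(columns=df_data.columns, fill_value=0)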
Regarding the max F1-score at cut-off (0.53791130186): the leaderboard topper had a score of 0.55 or 0.57, if I remember correctly.