Salifort Motors

Salifort Motors - Capstone Project

Providing data-driven suggestions for HR

ISSUE / PROBLEM:

Salifort is currently facing a significant employee turnover rate, which has raised concerns among the senior leadership team. The company's commitment to fostering a corporate culture that nurtures employee growth and success makes addressing this issue crucial. Additionally, the high turnover is proving costly for Salifort as it involves substantial investments in recruitment, training, and upskilling of employees.

Description:

This capstone project presents an opportunity to analyze a dataset and develop predictive models that offer insights to the Human Resources (HR) department of a large consulting firm. In this project, I used my expertise in regression models and machine learning models to predict employee attrition.

Pace: Plan Stage

Stakeholders: Salifort’s senior leadership team, Human Resources.

Purpose and goal: To increase employee retention, and to learn more about what might be driving the turnover.

Initial observations:

The dataset doesn’t contain missing values.
The dataset has duplicated values.
The average_monthly_hours column has a very large standard deviation and a wide range.
The last_evaluation column has a minimum score of 0.36; no employee has an evaluation score below that.
The time_spend_company column has no data for employees who worked less than 2 years.
There are outliers in the time_spend_company variable.
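
The missing-value and duplicate checks can be sketched with the standard library alone. This is a minimal illustration, not the code used in the project: the helper name summarize_quality is hypothetical, and the sample rows are made up rather than taken from the Salifort dataset.

```python
from collections import Counter


def summarize_quality(rows):
    """Count rows with a missing field and rows that repeat an earlier row."""
    missing = sum(
        1 for row in rows if any(value in (None, "") for value in row.values())
    )
    # Two rows are identical when every column/value pair matches.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


# Made-up sample records for illustration only.
sample_rows = [
    {"satisfaction_level": 0.38, "time_spend_company": 3, "left": 1},
    {"satisfaction_level": 0.80, "time_spend_company": 6, "left": 0},
    {"satisfaction_level": 0.38, "time_spend_company": 3, "left": 1},  # repeat
]

print(summarize_quality(sample_rows))
```

With a dataframe library the same checks are usually one-liners; the explicit version only shows what is being counted.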

Ethical considerations: The dataset has more data representing the employees who stayed than the employees who left.

NOTE:

The boxplot shows that there are outliers in the tenure variable.

Lower limit: 1.5

Upper limit: 5.5

Number of rows in the data containing outliers in ‘tenure’: 824

Certain types of models are more sensitive to outliers than others. Because we are going to use a tree-based model, which is less sensitive to outliers, we are not going to remove them.
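
The limits above match the usual 1.5 * IQR rule applied to quartiles of 3 and 4 years. The sketch below shows that rule under stated assumptions: the tenure values are made up so that the quartiles come out at 3 and 4, the helper name iqr_limits is hypothetical, and the "inclusive" quantile method is one of several possible conventions.

```python
from statistics import quantiles


def iqr_limits(values):
    """Return (lower, upper) limits at 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up tenure values (years), chosen so the quartiles are 3 and 4.
tenure = [2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6, 10]

lower, upper = iqr_limits(tenure)
outliers = [t for t in tenure if t < lower or t > upper]
print(lower, upper, outliers)
```

Any row whose tenure falls outside the two limits counts as an outlier; in this toy sample those are the values 6 and 10.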

pAce: Analyze Stage

It might be natural that people who work on more projects would also work longer hours. This appears to be the case here, with the mean hours of each group (stayed and left) increasing with number of projects worked. However, a few things stand out from this plot.

There are two groups of employees who left the company: (A) those who worked considerably less than their peers with the same number of projects, and (B) those who worked much more. Of those in group A, it's possible that they were fired. It's also possible that this group includes employees who had already given their notice and were assigned fewer hours because they were already on their way out the door. For those in group B, it's reasonable to infer that they probably quit. The folks in group B likely contributed a lot to the projects they worked on; they might have been the largest contributors to their projects.
Everyone with seven projects left the company, and the interquartile ranges of this group and of those who left with six projects were ~255–295 hours/month, much more than any other group.
The optimal number of projects for employees to work on seems to be 3–4. The ratio of left/stayed is very small for these cohorts.
If you assume a 40-hour work week and two weeks of vacation per year, then an employee working Monday–Friday averages 50 weeks * 40 hours per week / 12 months = 166.67 hours per month. This means that, aside from the employees who worked on two projects, every group, including those who didn't leave the company, worked considerably more hours than this. It seems that employees here are overworked.
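
The working-hours arithmetic can be checked with a short sketch. The function names are hypothetical, and the 52-week year with two weeks of vacation is the same assumption made in the text.

```python
def nominal_monthly_hours(hours_per_week=40, vacation_weeks=2):
    """Average monthly hours for a Monday-Friday schedule."""
    working_weeks = 52 - vacation_weeks
    return working_weeks * hours_per_week / 12


def weekly_equivalent(monthly_hours, vacation_weeks=2):
    """Convert a monthly average back to hours per working week."""
    return monthly_hours * 12 / (52 - vacation_weeks)


print(round(nominal_monthly_hours(), 2))   # nominal baseline per month
print(round(weekly_equivalent(295), 1))    # upper end of the 255-295 range
```

Under these assumptions, 295 hours per month corresponds to roughly 71 hours per working week.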

The scatterplot above shows that there was a sizeable group of employees who worked ~240–315 hours per month. 315 hours per month is over 75 hours per week for a whole year. It's likely this is related to their satisfaction levels being close to zero.

The plot also shows another group of people who left, those who had more normal working hours. Even so, their satisfaction was only around 0.4. It's difficult to speculate about why they might have left. It's possible they felt pressured to work more, considering so many of their peers worked more. And that pressure could have lowered their satisfaction levels.

Finally, there is a group who worked ~210–280 hours per month, and they had satisfaction levels ranging ~0.7–0.9.

Note the strange shape of the distributions here. This is indicative of data manipulation or synthetic data.

There are many observations you could make from this plot.

Employees who left fall into two general categories: dissatisfied employees with shorter tenures and very satisfied employees with medium-length tenures.
Four-year employees who left seem to have an unusually low satisfaction level. It's worth investigating changes to company policy that might have affected people specifically at the four-year mark, if possible.
The longest-tenured employees didn't leave. Their satisfaction levels aligned with those of newer employees who stayed.
The histogram shows that there are relatively few longer-tenured employees. It's possible that they're the higher-ranking, higher-paid employees.

Calculate the mean and the median satisfaction score:

As expected, the mean and median satisfaction scores of employees who left are lower than those of employees who stayed. Interestingly, among employees who stayed, the mean satisfaction score appears to be slightly below the median score. This indicates that satisfaction levels among those who stayed might be skewed to the left.
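
The link between a mean below the median and left skew can be illustrated with a small sketch. The scores are made up for illustration and are not the Salifort data.

```python
from statistics import mean, median

# Made-up satisfaction scores for illustration, not the Salifort data.
stayed = [0.20, 0.55, 0.70, 0.75, 0.80, 0.85, 0.90]

avg, mid = mean(stayed), median(stayed)
# A mean below the median suggests a longer tail of low scores (left skew).
skew_hint = "left" if avg < mid else "right" if avg > mid else "symmetric"
print(round(avg, 3), mid, skew_hint)
```

A single very low score pulls the mean down while leaving the median unchanged, which is the pattern described for employees who stayed.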

The plots above show that long-tenured employees were not disproportionately comprised of higher-paid employees.

The following observations can be made from the scatterplot above:

The scatterplot indicates two groups of employees who left: overworked employees who performed very well and employees who worked slightly under the nominal monthly average of 166.67 hours with lower evaluation scores.
There seems to be a correlation between hours worked and evaluation score.
There isn't a high percentage of employees in the upper left quadrant of this plot, but working long hours doesn't guarantee a good evaluation score.
Most of the employees in this company work well over 167 hours per month.

The plot above shows the following:

Very few employees who were promoted in the last five years left.
Very few employees who worked the most hours were promoted.
All of the employees who left were working the longest hours.

There doesn't seem to be any department that differs significantly in its proportion of employees who left to those who stayed.
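
A per-department comparison like this comes down to computing the share of leavers in each group. The sketch below is a minimal illustration: the helper name leave_rate_by_group and the "department" key are assumptions, and the records are made up.

```python
from collections import defaultdict


def leave_rate_by_group(records, key="department"):
    """Return the share of employees with left == 1 for each group."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        leavers[record[key]] += record["left"]
    return {group: leavers[group] / totals[group] for group in totals}


# Made-up records for illustration only.
records = [
    {"department": "sales", "left": 1},
    {"department": "sales", "left": 0},
    {"department": "sales", "left": 0},
    {"department": "sales", "left": 0},
    {"department": "IT", "left": 1},
    {"department": "IT", "left": 0},
    {"department": "IT", "left": 0},
    {"department": "IT", "left": 0},
]

print(leave_rate_by_group(records))
```

Passing a different key, such as the number of projects, would give the left/stayed comparison for other cohorts in the same way.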

The correlation heatmap confirms that the number of projects, monthly hours, and evaluation scores all have some positive correlation with each other, and whether an employee leaves is negatively correlated with their satisfaction level.
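
Each cell of such a heatmap is typically a Pearson correlation coefficient. The sketch below computes one by hand; the values are made up, and the helper name pearson is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only.
satisfaction = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
left = [0, 0, 0, 1, 1, 1]

print(round(pearson(satisfaction, left), 2))
```

In this toy sample the coefficient is negative because the low satisfaction scores coincide with left == 1, the same direction as the relationship reported above.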

Insights:

Evidence suggests that employees are departing the company due to inadequate management. This departure seems to be associated with extended working hours, a high project load, and overall lower satisfaction levels. The experience of working long hours without receiving promotions or favorable evaluation scores can be disheartening. A substantial number of employees at the company are likely experiencing burnout. Additionally, it appears that employees who have been with the company for more than six years tend to stay, indicating a potential retention pattern based on tenure.

paCe: Construct Stage

Identify the types of prediction tasks:

In this project, our goal is to build a model that predicts whether or not an employee will leave the company. Because the outcome has only two possible values (left or stayed), this is a binary classification task, so we can choose either logistic regression or a tree-based model.

Identify the types of models most appropriate for this task:

We will build a Random Forest model and an XGBoost model for the following reasons:

This dataset contains outliers, and tree-based models are less sensitive to outliers.
Tree-based models don't require feature scaling.
This dataset is imbalanced.

The model predicts more false positives than false negatives, which means that some employees may be identified as at risk of quitting or getting fired, when that's actually not the case. But this is still a strong model.
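
False positives and false negatives are the off-diagonal cells of the confusion matrix. The sketch below counts them for a binary outcome; the labels and predictions are made up, and the helper name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for a binary outcome where 1 means 'left'."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
```

Here a false positive is an employee flagged as likely to leave who actually stayed, and a false negative is a leaver the model missed.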

The barplot above shows that in this decision tree model, last_evaluation, number_project, tenure, and overworked have the highest importance, in that order. These variables are most helpful in predicting the outcome variable, left.

The plot above shows that in this random forest model, last_evaluation, number_project, tenure, and overworked have the highest importance, in that order. These variables are most helpful in predicting the outcome variable, left, and they are the same as the ones used by the decision tree model.

pacE: Execute Stage

Summary of model results

Logistic Regression Model

The logistic regression model achieved the following weighted average scores on the test set:

Precision: 79%
Recall: 82%
F1-score: 80%
Accuracy: 82%

Tree-based Machine Learning Model

After conducting feature engineering, the decision tree model achieved the following results on the test set:

AUC: 94.2%
Precision: 88.3%
Recall: 90.7%
F1-score: 89.5%
Accuracy: 96.5%
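
As a sanity check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two reported figures. The function name f1_score below is an illustrative helper, not a library call.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported decision tree precision and recall on the test set.
print(round(f1_score(0.883, 0.907), 3))
```

The result agrees with the reported 89.5% after rounding. The same check does not apply to the logistic regression figures, because those are weighted averages across classes.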

Random Forest Machine Learning Model

AUC: 93.3%
Precision: 90.5%
Recall: 88.4%
F1-score: 89.5%
Accuracy: 96.5%

Conclusion, Recommendations, Next Steps

The models and the feature importances extracted from the models confirm that employees at the company are overworked.

To retain employees, the following recommendations could be presented to the stakeholders:

Cap the number of projects that employees can work on.
Consider promoting employees who have been with the company for at least four years, or conduct further investigation about why four-year tenured employees are so dissatisfied.
Either reward employees for working longer hours, or don't require them to do so.
If employees aren't familiar with the company's overtime pay policies, inform them about this. If the expectations around workload and time off aren't explicit, make them clear.
Hold company-wide and within-team discussions to understand and address the company work culture, across the board and in specific contexts.
High evaluation scores should not be reserved for employees who work 200+ hours per month. Consider a proportionate scale for rewarding employees who contribute more/put in more effort.

Next Steps

It may be justified to still have some concern about data leakage. It could be prudent to consider how predictions change when last_evaluation is removed from the data. It's possible that evaluations aren't performed very frequently, in which case it would be useful to be able to predict employee retention without this feature. It's also possible that the evaluation score determines whether an employee leaves or stays, in which case it could be useful to pivot and try to predict performance score. The same could be said for satisfaction score.
