BCG Data Science Job Simulation

Task 3

Computed descriptive statistics for the initial pricing and client datasets.
- Data types.
- Distributions of numerical variables.
- Checks for null values and counts of unique values for each variable.
Created visualisations for variables in the datasets.
- Bar plots for categorical variables in the client dataset.
- Histograms and box plots for numerical variables in the client and pricing datasets; used to identify data ranges that could potentially indicate outliers.
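
The checks above can be sketched with pandas as follows. This is a minimal illustration, not the code used in the task: the data frame is a hypothetical toy stand-in for the client dataset with invented values, and the interquartile-range rule is an assumed way of flagging the ranges that a box plot would show as potential outliers.

```python
import pandas as pd

# Hypothetical toy stand-in for the client dataset; values are invented.
client_df = pd.DataFrame({
    "channel": ["a", "b", "a", None],
    "cons_12m": [100.0, 250.0, 90.0, 4000.0],
    "has_gas": ["t", "f", "f", "t"],
})

dtypes = client_df.dtypes                # data type of each column
numeric_summary = client_df.describe()   # count, mean, std, quartiles of numeric columns
null_counts = client_df.isnull().sum()   # number of missing values per column
unique_counts = client_df.nunique()      # number of distinct non-null values per column

# Assumed outlier rule: values more than 1.5 * IQR outside the quartiles,
# which is the conventional whisker length of a box plot.
q1, q3 = client_df["cons_12m"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (client_df["cons_12m"] < q1 - 1.5 * iqr) | (client_df["cons_12m"] > q3 + 1.5 * iqr)
outliers = client_df[is_outlier]
```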

Preprocessing was applied to handle the various data types in the dataset and so allow for more effective model training.
Feature types:
- Binary: has_gas
- Categorical: channel, origin_up
- Target: churn
- Numeric: All other variables provided
Used StandardScaler() to standardise all numeric variables.
Forcibly converted all non-numeric variables to categorical using a provided function.
For ordinal variables, use OrdinalEncoder(categories=<ORDER>), where <ORDER> is a list of lists: one ordered list of categories per encoded column.
For categorical variables, use OneHotEncoder() for automatic encoding in pipeline.
- For binary variables, use OneHotEncoder(..., drop = "if_binary") to ensure that there are no redundant columns.

Two models were built to predict client churn: one with the initial parameters, another with hyperparameter optimisation.
Given the parameter grid param_grid_rf, used RandomizedSearchCV() to optimise the RandomForestClassifier() for f1-score.
The best parameters found were then applied to an optimised model.
f1-score was used because churn is imbalanced; unlike accuracy, it is not inflated by correct predictions of the majority class.
Compared with the initial model, the optimised model has lower test accuracy and precision, but greater test recall and f1-score.
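
The search described above might look like the following sketch. The synthetic imbalanced data, the contents of param_grid_rf, and the n_iter and cv values are all assumptions made for illustration; the actual grid used in the task is not given in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the churn dataset (about 10% positives).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Hypothetical parameter grid; the real param_grid_rf is not shown in the notes.
param_grid_rf = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid_rf,
    n_iter=4,          # number of sampled parameter combinations
    scoring="f1",      # optimise for f1-score rather than accuracy
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is refitted on the full training set with the best parameters.
best_model = search.best_estimator_
test_f1 = f1_score(y_test, best_model.predict(X_test))
```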

Mean accuracy of approximately 0.774.
Cross-validation scores are relatively low for every metric except accuracy.
The model generally has a very poor balance between precision and recall.
Model performance is not satisfactory because the f1-score is low in all instances.
Alternative models that could be considered for this classification problem include LogisticRegression(), DecisionTreeClassifier() and HistGradientBoostingClassifier().
Survival analysis based on the number of days clients remained in the service could also have been considered as an alternative to classification.

Initial .feature_importances_
- margin_net_pow_ele and margin_gross_pow_ele have extremely high feature importance, each at least 0.1.
- Feature importance for months_activ, cons_12m and cons_last_month is relatively high at approximately 0.0375.
- Two origin values origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws and origin_up_lxidpiddsbxsbosboudacockeimpuepw have particularly high feature importance.
permutation_importance
- Used to account for feature interactions and bias.
- months_to_end, months_renewal, var_year_price_off_peak, margin_net_pow_ele, var_year_price_off_peak_fix, net_margin, and off_peak_peak_var_mean_diff have particularly strong positive permutation feature importance.
- margin_gross_pow_ele and months_activ have particularly strong negative permutation feature importance, along with the two origin values origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws and origin_up_lxidpiddsbxsbosboudacockeimpuepw, which also appeared in the .feature_importances_ graph.
margin_net_pow_ele, cons_12m and cons_last_month are the only features found in both that may significantly improve the model.

Initial .feature_importances_
- net_margin, cons_12m, forecast_meter_rent_12m, forecast_cons_12m, margin_net_pow_ele and margin_gross_pow_ele have extremely high feature importance, each at least 0.04.
permutation_importance
- Used to account for feature interactions and bias.
- margin_net_pow_ele, margin_gross_pow_ele, origin_up_lxidpiddsbxsbosboudacockeimpuepw, months_activ and forecast_meter_rent_12m all have relatively high permutation importance, above 0.015.
margin_net_pow_ele, margin_gross_pow_ele, months_activ and forecast_meter_rent_12m appear as particularly important features under both importance measures.
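
The two importance measures compared above can be computed as in the sketch below. The synthetic data, the f1 scorer and the n_repeats value are assumptions for illustration. Note that .feature_importances_ is derived from the training process (impurity reduction), while permutation_importance measures how much a score drops when one feature is shuffled on the data passed in; a negative value means the score improved after shuffling, which suggests the feature is not helping on that data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed churn features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importances: non-negative and normalised to sum to 1.
impurity_importance = model.feature_importances_

# Permutation importances on held-out data: mean drop in f1 when a feature is shuffled.
perm = permutation_importance(
    model, X_test, y_test, scoring="f1", n_repeats=5, random_state=0
)
perm_mean = perm.importances_mean

# Rank features under each measure to see which ones both agree on.
impurity_rank = impurity_importance.argsort()[::-1]
perm_rank = perm_mean.argsort()[::-1]
```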

Situation
- Prediction models were built for whether or not a client will churn.
- Important features: gross and net margins on power subscriptions, past electricity consumption, the timespan the client has been with the service, and the forecasted meter rental bill.
Complication
- Cannot determine how the likelihood of a client churning changes with the important features identified.
- Customers that first subscribed to the electricity campaigns "kamkkxfxxuwbdslkwifmmcsiusiuosws" and "lxidpiddsbxsbosboudacockeimpuepw" are of particular note in terms of client churning.
Question
- A client's likelihood of churning is increased by higher margins on power subscriptions, shorter timespans with the service and higher forecasted bills.
- Electricity consumption likely influences churn because clients consider the current pricing options unfair relative to their usage.
Answer
- Investigate competitors' prices as a frame of reference for bills, margins on power subscriptions, and pricing options based on electricity consumption; adjust accordingly to ensure potential profits.
- Give benefits to clients who have recently joined so that they are inclined to stay longer.