This repository contains the code for the first stage of the AlfaHack AutoML hackathon.
Team: DDDrkBBB.
Members: Peter Belonovskiy, Kristina Galuzina, Timofey Lashukov.
- Track: churn of legal entities from cash and settlement services (Отток юридических лиц из расчетно-кассового обслуживания).
- Leaderboard score: 81.7015322279558
- Leaderboard position: 3
Tune + fit 7 models on the selected features. Each model is a mean blend over 5 stratified folds. Blend the best model with the model stacked on the selected features and OOF predictions.
```
.
├── README.md
├── assets
├── automl                 <-- clone of an open-source AutoML package.
├── best_res               <-- notebooks to reproduce the best result.
│   ├── blend.ipynb
│   ├── create_stack_df.ipynb
│   ├── data_processing.ipynb
│   ├── fit_cb.ipynb
│   ├── fit_lama.ipynb
│   ├── fit_lama_autoint.ipynb
│   ├── fit_lama_fttransformer.ipynb
│   ├── fit_lama_stack.ipynb
│   ├── fit_lama_utilized.ipynb
│   ├── fit_lgb.ipynb
│   ├── fit_xgb.ipynb
│   └── select_features.ipynb
├── configs
│   └── config.yaml        <-- config file.
├── data
│   ├── models             <-- folder with model artifacts.
│   ├── train              <-- raw train data.
│   ├── test               <-- raw test data.
│   └── ...                <-- processed data files.
├── notebooks              <-- many-many-many-many various experiments.
├── requirements_gpu.txt   <-- Python requirements for the GPU server.
└── requirements.txt       <-- Python requirements for the CPU server.
```
To reproduce the results, run the notebooks in `best_res` in the following order:

1. `best_res/data_processing.ipynb` - Basic feature processing. Saves the processed dataset files in the `data/` folder.
2. `best_res/select_features.ipynb` - Selects features. Saves the selected features to `configs/config.yaml`.
3. `best_res/fit_lgb.ipynb` - Fits + tunes LightGBM. Saves the model file, OOF predictions, model params and test predictions in `data/model/lgb_8122_full_dataset/`.
4. `best_res/fit_xgb.ipynb` - Fits + tunes XGBoost. Saves the model file, OOF predictions, model params and test predictions in `data/model/xgb_81325_full_dataset/`.
5. `best_res/fit_cb.ipynb` - Fits + tunes CatBoost. Saves the model file, OOF predictions, model params and test predictions in `data/model/cb_8114_full_dataset/`.
6. `best_res/fit_lama.ipynb` - Fits + tunes LightAutoML. Saves the model file, OOF predictions, model params and test predictions in `data/model/lama_81298_full_dataset/`.
7. `best_res/fit_lama_utilized.ipynb` - Fits + tunes LightAutoMLUtilized. Saves the model file, OOF predictions, model params and test predictions in `data/model/lamau_81425_full_dataset/`.
8. `best_res/fit_lama_autoint.ipynb` - Fits LightAutoML AutoInt. Saves the model file, OOF predictions, model params and test predictions in `data/model/lamann_autoint_8053_full_dataset/`. IMPORTANT: GPU is required.
9. `best_res/fit_lama_fttransformer.ipynb` - Fits LightAutoML FTTransformer. Saves the model file, OOF predictions, model params and test predictions in `data/model/lamann_fttransformer_8050_full_dataset/`. IMPORTANT: GPU is required.
10. `best_res/create_stack_df.ipynb` - Adds out-of-fold predictions as features to the dataset. Saves the stacked datasets in the `data/` folder.
11. `best_res/fit_lama_stack.ipynb` - Fits + tunes the stacking LightAutoML on a time-series cross-validation. Saves the model file, model params and test predictions in `data/model/lama_stack_time_series/`.
12. `best_res/blend.ipynb` - Blends the predictions of the `lamau_81425_full_dataset` and `lama_stack_time_series` models and produces the final submission.
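For a headless run, the same sequence can be scripted. This is only a sketch, assuming `papermill` is installed (it is not part of the listed requirements):

```python
# Execute the reproduction notebooks in order, overwriting them in place.
import papermill as pm

NOTEBOOKS = [
    "data_processing", "select_features", "fit_lgb", "fit_xgb", "fit_cb",
    "fit_lama", "fit_lama_utilized", "fit_lama_autoint",
    "fit_lama_fttransformer", "create_stack_df", "fit_lama_stack", "blend",
]

for nb in NOTEBOOKS:
    pm.execute_notebook(f"best_res/{nb}.ipynb", f"best_res/{nb}.ipynb")
```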
First, we explored the dataset. The data turned out to be quite clean and well prepared for modeling, even in raw form. Because the features are depersonalized, no meaningful feature engineering is possible. The only thing we did was identify the categorical columns in the data (those with fewer than 150 unique values). Finally, basic feature processing (standard scaling + ordinal encoding) was applied.
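A minimal sketch of this preprocessing step (the file path, column names, and the exact scikit-learn setup are illustrative assumptions, not the notebook's code):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.read_csv("data/train/train.csv")  # hypothetical file name

feature_cols = [c for c in df.columns if c not in ("id", "target")]
# Columns with fewer than 150 unique values are treated as categorical.
cat_cols = [c for c in feature_cols if df[c].nunique() < 150]
num_cols = [c for c in feature_cols if c not in cat_cols]

# Ordinal encoding for categoricals; unseen test categories map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
df[cat_cols] = encoder.fit_transform(df[cat_cols].astype(str))

# Standard scaling for the numeric features.
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```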
Next, we realized that the data contains too many features, some of which are useless. We therefore decided to perform feature selection: it noticeably speeds up training and improves the scores. As the selection algorithm we chose CatBoost feature selection based on Shapley values, which in our view is the least biased way to find the truly important features.
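A sketch of this kind of selection using CatBoost's SHAP-based importances (the `TOP_K` cutoff, the config key, and the training setup are assumptions, not the notebook's exact choices):

```python
import numpy as np
import yaml
from catboost import CatBoostClassifier, Pool

# X, y, cat_cols: the processed training data from the previous step.
TOP_K = 100  # illustrative cutoff, not the number used in the notebooks

pool = Pool(X, y, cat_features=cat_cols)
model = CatBoostClassifier(iterations=500, verbose=False).fit(pool)

# ShapValues has shape (n_samples, n_features + 1); the last column is
# the expected value, so drop it before ranking features.
shap = model.get_feature_importance(pool, type="ShapValues")[:, :-1]
importance = np.abs(shap).mean(axis=0)

order = np.argsort(importance)[::-1]
selected = [X.columns[i] for i in order[:TOP_K]]

with open("configs/config.yaml", "w") as f:
    yaml.safe_dump({"selected_features": selected}, f)  # hypothetical key
```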
Then we proceeded to model training. From the very beginning we decided to apply stacking, given its dominance in tabular tasks. For stacking, we need to train several base models with different structures, save their out-of-fold predictions, and then train the final model on the default features + out-of-fold predictions.
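A minimal sketch of how such a stacked dataset can be assembled (the artifact folder names follow the repo layout, but the file names inside them are assumptions):

```python
import pandas as pd

BASE_MODELS = [
    "lgb_8122_full_dataset",
    "xgb_81325_full_dataset",
    "cb_8114_full_dataset",
    "lama_81298_full_dataset",
    "lamau_81425_full_dataset",
    "lamann_autoint_8053_full_dataset",
    "lamann_fttransformer_8050_full_dataset",
]

train = pd.read_parquet("data/train_processed.parquet")  # hypothetical name

# Append each base model's out-of-fold prediction as a new feature column.
for name in BASE_MODELS:
    oof = pd.read_csv(f"data/models/{name}/oof_predictions.csv")  # hypothetical
    train[f"oof_{name}"] = oof["prediction"].values

train.to_parquet("data/train_stacked.parquet")
```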
As base models we chose (out-of-fold scores are shown in bold):
- CatBoost **0.8114**
- LightGBM **0.8122**
- XGBoost **0.8132**
- LightAutoML **0.81298**
- LightAutoMLUtilized **0.81425**
- LightAutoML AutoInt (tabular neural network) **0.8053**
- LightAutoML FTTransformer (tabular neural network) **0.805**
For the stacking model:
- LightAutoML
The wrapped implementations of all these models, which significantly ease the workflow, were taken from the open-source automl package. The package automatically tunes the parameters of each model and then fits the model with the best parameters on cross-validated folds. This fitting strategy reduces the variance of the predictions and yields out-of-fold predictions. All models except LightAutoML AutoInt and LightAutoML FTTransformer were trained on the CPU server; the tabular neural networks were trained on the GPU server. We tried two cross-validation strategies: stratified and time-series. For the base models, stratified cross-validation showed much better results, while for the stacking model, time-series cross-validation worked best.
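As an illustration of this fitting strategy, a 5-fold stratified fit with out-of-fold predictions and a mean test blend might look like the sketch below (LightGBM stands in for any of the wrapped models; tuned parameters are omitted, and `X`, `y` are assumed to be pandas objects):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5  # matches the 5 stratified folds described above

def fit_cv(X, y, X_test):
    """Fit one model per fold; return OOF and mean-blended test predictions."""
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=42)
    for tr_idx, va_idx in skf.split(X, y):
        model = LGBMClassifier()  # the best tuned params would be passed here
        model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
        oof[va_idx] = model.predict_proba(X.iloc[va_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / N_FOLDS  # mean blend
    return oof, test_pred
```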
Each model was tuned with a timeout of 1 hour (2 hours for LightAutoMLUtilized). The full training dataset was used, because we observed that decreasing the train size significantly worsens the results.
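For illustration, a tuning loop with such a timeout could look as follows. Optuna is an assumption on our part; the automl package may use a different tuner internally, and the search space is invented:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# X, y: the training data with the selected features (assumed in scope).
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
    }
    model = LGBMClassifier(**params)
    # Maximize mean ROC AUC over stratified folds.
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, timeout=3600)  # 1-hour tuning budget per model
```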
The best model by out-of-fold score (LightAutoMLUtilized) was also the best on the leaderboard.
Then, stacking was performed by fitting LightAutoML on the enriched training dataset. As mentioned earlier, time-series cross-validation showed better results here.
Finally, we blended the predictions of the best base model (LightAutoMLUtilized) and the stacking model with weights (see `best_res/blend.ipynb` for the exact values).
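A minimal sketch of such a weighted blend (the 0.5/0.5 weights and the file names are placeholders, not the actual values from `blend.ipynb`):

```python
import pandas as pd

W_BASE, W_STACK = 0.5, 0.5  # placeholder weights; see best_res/blend.ipynb

# Hypothetical prediction file names inside the artifact folders.
base = pd.read_csv("data/models/lamau_81425_full_dataset/test_predictions.csv")
stack = pd.read_csv("data/models/lama_stack_time_series/test_predictions.csv")

submission = base.copy()
submission["prediction"] = (
    W_BASE * base["prediction"] + W_STACK * stack["prediction"]
)
submission.to_csv("submission.csv", index=False)
```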
Final leaderboard score: 81.7015322279558 (3rd place).
- CPU server:
  - CPU: Intel Xeon (Cascadelake) (16) @ 2.992GHz
  - Memory: 16 GB
  - Python: 3.10.12
- GPU server:
  - CPU: Intel Xeon Gold 6240R (6) @ 2.400GHz
  - GPU: NVIDIA GeForce RTX 2080 Ti Rev. A
  - Memory: 16 GB
  - Python: 3.10.0