Gender classifier for Reddit comments (Kaggle competition, 🏆#1)
- python (3.9.6): check with
python3 --version
- jupyter notebook (6.4.0): check with
jupyter notebook --version
- jupyter nbconvert (6.0.7): check with
jupyter nbconvert --version
Earlier versions may work, but this is not guaranteed.
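The three checks above can also be done from Python in one go. This is a minimal sketch, assuming the Jupyter components are installed as the distributions `notebook` and `nbconvert`; the helper names `installed_version` and `report` are made up for illustration.

```python
import sys
from importlib import metadata

# Versions this project was tested with (taken from the list above).
TESTED = {"notebook": "6.4.0", "nbconvert": "6.0.7"}


def installed_version(distribution):
    """Return the installed version string, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report():
    """Collect one status line per component."""
    lines = ["python: found %d.%d.%d, tested with 3.9.6" % sys.version_info[:3]]
    for name, tested in TESTED.items():
        found = installed_version(name) or "nothing"
        lines.append(f"{name}: found {found}, tested with {tested}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The script only reports what it finds; it does not decide whether a different version is acceptable.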
Clone this repository:
git clone
Move inside the cloned repository
cd reddit
Assuming you have the Kaggle API installed, download the dataset:
kaggle competitions download -c datamining2021
Unzip inside input/datamining2021:
mkdir input && unzip -d input/datamining2021 && rm
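If `unzip` is not available, the extraction step can be approximated with the standard library. This is a rough sketch, not the author's method: the function name `extract_dataset` is hypothetical, and the archive path must be supplied by the caller because the downloaded file name is not given here.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive, destination="input/datamining2021"):
    """Unzip `archive` into `destination` and return the extracted names."""
    target = Path(destination)
    # Creates input/ and input/datamining2021/ if they do not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
        return sorted(bundle.namelist())
```

Unlike the shell command, this sketch leaves the archive in place; delete it by hand once the files are extracted.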
Create a Python virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
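To confirm that the environment is really active before installing anything, the interpreter itself can be asked. A minimal sketch, assuming an environment created with the standard `venv` module; `in_virtualenv` is a name chosen for this example.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

When the environment is active, the printed interpreter path should sit inside the `venv` folder.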
Dependencies are defined in requirements.txt. Install them with:
python -m pip install -r requirements.txt
Install the venv as a Jupyter kernel:
python -m ipykernel install
(Optional) download the trained model from and then unzip the folder into
Move inside the repository
cd reddit
Start jupyter-notebook with --config flag and check that the right kernel is active.
jupyter notebook --no-browser --notebook-dir working
Download the notebook from Kaggle and replace the local version:
kaggle kernels pull -p working -m s1m0n38/simone-bertolotto-857533
(At the moment there is an open issue about kernel-metadata.json generation. UPDATE: the bug is now fixed and the issue is closed.)
Upload the local version of the notebook to Kaggle and start the execution:
kaggle kernels push -p working
During the execution of the Kaggle notebook, some output files are generated: trained sklearn estimators, learning curves, CSV solutions, and so on. You can download the latest version of the output files with:
kaggle kernels output s1m0n38/simone-bertolotto-857533 -p working
(Bug in the Kaggle API: the output command does not download all the files.)
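Since the download may be incomplete, it can help to look at what actually arrived. The following is only a sketch, assuming the outputs land somewhere under `working`; the helper `summarize_outputs` is hypothetical and simply counts files per extension.

```python
from collections import Counter
from pathlib import Path


def summarize_outputs(folder="working"):
    """Count the files under `folder`, grouped by file extension."""
    root = Path(folder)
    return Counter(
        path.suffix or "(no extension)"
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    for suffix, count in sorted(summarize_outputs().items()):
        print(f"{suffix}: {count}")
```

Comparing these counts with the Output tab of the notebook on Kaggle shows which files still need to be fetched by hand.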
Working with Jupyter notebooks is great, but the code can become messy: some cells are moved up or down, some are deleted, and the notebook stops working when run from top to bottom. The following command converts the notebook into a Python script and then runs it, which checks that the notebook still works end to end.
cd working && \
jupyter nbconvert \
--to python --stdout \
--TagRemovePreprocessor.remove_cell_tags 'cmd' \
simone-bertolotto-857533.ipynb | \
PYTHONWARNINGS='ignore' MPLBACKEND='agg' python || cd ..
The script runs fast because it does not need to draw plots on screen (the matplotlib backend is agg).
It can also be used to run parts of the notebook whose output is sometimes misrendered by the browser version of Jupyter Notebook (e.g. tqdm).
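To make the conversion step less opaque, here is a simplified illustration of what it does: read the notebook JSON, drop the cells tagged `cmd`, and join the remaining code cells into one script. This is a sketch using only the standard library, not a replacement for `jupyter nbconvert`, which does more (for example, it handles IPython magics). The function name `notebook_to_script` is made up for this example.

```python
import json


def notebook_to_script(path, skip_tags=("cmd",)):
    """Join the code cells of a notebook, skipping cells with a skipped tag."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no code to run
        tags = cell.get("metadata", {}).get("tags", [])
        if any(tag in skip_tags for tag in tags):
            continue  # mimics removing cells by tag, as in the command above
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks) + "\n"
```

Calling `notebook_to_script("simone-bertolotto-857533.ipynb")` from inside `working` returns the script text, which can be inspected before running it.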