S1M0N38/reddit

Reddit

Gender classifier for Reddit comments (Kaggle competition, 🏆#1)

Requirements

  • python (3.9.6): check with python3 --version
  • jupyter notebook (6.4.0): check with jupyter notebook --version
  • jupyter nbconvert (6.0.7): check with jupyter nbconvert --version

Earlier versions may work, but this is not guaranteed.
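The three checks above can also be run in one go. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `check` are hypothetical, and it assumes each command prints a version of the form X.Y.Z.

```python
import re
import subprocess

# Versions listed under Requirements in this README.
REQUIRED = {
    "python3 --version": (3, 9, 6),
    "jupyter notebook --version": (6, 4, 0),
    "jupyter nbconvert --version": (6, 0, 7),
}


def parse_version(text):
    """Return the first X.Y.Z found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def check(command, required):
    """Run a version command and describe how it compares to the listed version."""
    try:
        result = subprocess.run(command.split(), capture_output=True, text=True)
    except FileNotFoundError:
        return f"{command}: not found"
    # Some tools print their version on stderr, so look at both streams.
    found = parse_version(result.stdout + result.stderr)
    if found is None:
        return f"{command}: could not read a version"
    status = "ok" if found >= required else "older than listed"
    return f"{command}: {'.'.join(map(str, found))} ({status})"


if __name__ == "__main__":
    for command, required in REQUIRED.items():
        print(check(command, required))
```

A result of "older than listed" does not necessarily mean the project will fail; it only means the setup differs from the one that was tested.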

Installation

  1. Clone this repository:

    git clone https://github.com/S1M0N38/reddit.git
  2. Move inside the cloned repository:

    cd reddit
  3. Assuming you have the Kaggle API installed and configured, download the dataset:

    kaggle competitions download -c datamining2021
  4. Unzip datamining2021.zip inside input/datamining2021:

    mkdir input && unzip datamining2021.zip -d input/datamining2021 && rm datamining2021.zip
  5. Create a Python virtual environment and activate it:

    python3 -m venv venv
    source venv/bin/activate
  6. Dependencies are defined in requirements.txt. Install them with:

    python -m pip install -r requirements.txt
  7. Install the venv as a Jupyter kernel:

    python -m ipykernel install
  8. (Optional) download the trained model from https://ufile.io/5m99mlwy and then unzip the folder into reddit/working.
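After the installation steps, a quick check can confirm that the expected files and folders are in place. This is a minimal sketch rather than a script shipped with the repository: the function name `missing_paths` is hypothetical, and the paths are the ones named in the steps above.

```python
from pathlib import Path

# Paths created or used by the installation steps in this README.
EXPECTED = ("requirements.txt", "input/datamining2021", "venv")


def missing_paths(root="."):
    """Return the expected paths that are absent (or empty directories) under root."""
    root = Path(root)
    missing = []
    for relative in EXPECTED:
        path = root / relative
        # A directory counts as present only if it contains at least one entry.
        if not path.exists() or (path.is_dir() and not any(path.iterdir())):
            missing.append(relative)
    return missing


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("Setup looks complete.")
```

Run it from the root of the cloned repository. It only checks that the paths exist, not that the dataset or the environment is valid.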

Working setup

  1. Move inside the repository

    cd reddit
  2. Start Jupyter Notebook with the --no-browser and --notebook-dir flags and check that the right kernel is active:

    jupyter notebook --no-browser --notebook-dir working

Interact with Kaggle kernel

Download the notebook from Kaggle and replace the local version:

kaggle kernels pull -p working -m s1m0n38/simone-bertolotto-857533

(There was an open issue about kernel-metadata.json generation. The bug has since been fixed and the issue is closed.)

Upload the local version of the notebook to Kaggle and start the execution:

kaggle kernels push -p working

During the execution of the Kaggle notebook, some output files are generated: trained sklearn estimators, learning curves, CSV solutions, and so on. You can download the latest version of the output files with:

kaggle kernels output s1m0n38/simone-bertolotto-857533 -p working

(Due to a bug in the Kaggle API, the output command does not download all the files.)

Testing notebook

Working with Jupyter notebooks is great, but the code can become messy: cells get moved up or down or deleted, and the notebook stops working when run from top to bottom. The following command converts the notebook into a Python script and runs it, which checks that the notebook still works when executed in order.

(cd working && \
   jupyter nbconvert \
   --to python --stdout \
   --TagRemovePreprocessor.remove_cell_tags 'cmd' \
   simone-bertolotto-857533.ipynb | \
   PYTHONWARNINGS='ignore' MPLBACKEND='agg' python)

The script runs fast because it does not need to draw plots on screen (the matplotlib backend is agg). It can also be used to run parts of the notebook whose output is sometimes misrendered by the browser version of Jupyter Notebook (e.g. tqdm).
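To make the conversion step less opaque, here is a simplified sketch of what the command does conceptually: it keeps the code cells in order and drops every cell tagged 'cmd'. It is an illustration using only the standard library, not nbconvert's actual implementation (nbconvert also rewrites magics and turns markdown cells into comments). The function names and the in-memory example notebook are hypothetical.

```python
import json


def load_notebook(path):
    """Read an .ipynb file, which is plain JSON, into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def notebook_to_script(notebook, skip_tags=("cmd",)):
    """Join the source of code cells in order, skipping cells with a tag in skip_tags."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.get("metadata", {}).get("tags", [])
        if any(tag in skip_tags for tag in tags):
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks) + "\n"


# Tiny in-memory notebook with made-up content, used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {"tags": ["cmd"]}, "source": ["!pip list"]},
        {"cell_type": "code", "metadata": {}, "source": ["x = 1\n", "y = x + 1"]},
        {"cell_type": "code", "metadata": {}, "source": "print(y)"},
    ]
}

script = notebook_to_script(example)
print(script)
```

Tagging shell-style cells (such as `!pip list`) with 'cmd' is what allows the remaining cells to run as ordinary Python.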
