A Transformer-based model for generating molecules from gene expression.
TransGEM is a phenotype-based de novo drug design model that generates new bioactive molecules without relying on disease target information.
- Create a conda environment:
conda env create -f environment.yaml
- Activate the environment:
conda activate TransGEM
The data related to this study can be downloaded here.
- train on the subLINCS dataset:
python train.py --data_path ./data/ --dataset subLINCS --gene_encoder tenfold_binary --gpu cuda:0 --epochs 200
- train and fine-tune on the HCC515 dataset:
python train.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0 --epochs 200
python ft_train.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0
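`ft_train.py` continues training on HCC515 from previously trained weights. For orientation only, the usual fine-tuning pattern looks roughly like the sketch below; the checkpoint path, model construction, and learning rate are placeholders and not the repository's actual code.

```python
import torch
import torch.nn as nn

# Minimal sketch of a typical fine-tuning setup (placeholder names and values,
# not the repository's actual code): load pre-trained weights, then keep
# training with a smaller learning rate on the new cell line.
decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, dim_feedforward=512, dropout=0.1)
model = nn.TransformerDecoder(decoder_layer, num_layers=6)

state = torch.load("./out/pretrained_tenfold_binary.pt", map_location="cpu")  # hypothetical checkpoint
model.load_state_dict(state)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lower lr than pre-training
model.train()
```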
- test on the subLINCS dataset:
python test.py --data_path ./data/ --dataset subLINCS --gene_encoder tenfold_binary --gpu cuda:0
- test the fine-tuned model on the HCC515 dataset:
python ft_test.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0
- generate molecules for prostate cancer:
python app.py --data_path ./data/ --dataset PC --cell_line PC3 --gene_encoder tenfold_binary --gpu cuda:0 --seq_num 1000
- generate molecules for non-small cell lung cancer:
python app.py --data_path ./data/ --dataset nsclc --cell_line A549 --gene_encoder tenfold_binary --gpu cuda:0 --seq_num 1000
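`app.py` retains `--seq_num` candidate molecules per run, presumably as SMILES strings. As an optional sanity check (not part of this repository), RDKit can filter out strings that do not parse into valid molecules; the output file name below is a placeholder, not the actual output path of `app.py`.

```python
from rdkit import Chem

# Optional post-processing sketch (not part of the repository):
# keep only the generated SMILES that RDKit can parse into a molecule.
# "generated_smiles.txt" is a placeholder file name.
with open("generated_smiles.txt") as f:
    smiles = [line.strip() for line in f if line.strip()]

valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated strings are valid molecules")
```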
- get attention weights:
python get_attention.py --data_path ./data/ --gene_encoder tenfold_binary --gpu cuda:0
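`get_attention.py` presumably extracts the attention weights linking generated tokens to the input genes. For orientation only, the generic PyTorch pattern for pulling such weights out of a multi-head attention layer looks roughly like this (dummy shapes; this is not the repository's code):

```python
import torch
import torch.nn as nn

# Generic illustration (not the repository's code): nn.MultiheadAttention can return
# its attention weights directly, which is the kind of signal get_attention.py reports.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
tokens = torch.randn(1, 10, 64)   # dummy molecule-token embeddings
genes = torch.randn(1, 978, 64)   # dummy gene embeddings (978 landmark genes in LINCS)

_, weights = attn(tokens, genes, genes, need_weights=True)
print(weights.shape)  # (1, 10, 978): attention of each generated token over the genes
```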
- usage:
python train.py --help
- optional arguments:
-h, --help show this help message and exit
--data_path  directory containing the input data
--out_path  directory where the training results are written
--dataset  dataset used by the model (subLINCS/HCC515/PC/nsclc)
--gene_encoder  encoding form of gene expression (value/one_hot/binary/tenfold_binary; see the sketch after the parameter table below)
--gpu  CUDA device id (e.g. cuda:0)
--hidden_dim  hidden size of the transformer decoder
--ff_dim  dimension of the feed-forward layer
--PE_dropout  dropout of the positional encoding
--TF_dropout  dropout of the transformer layers
--TF_N  number of transformer decoder layers
--TF_H  number of transformer decoder heads
--TF_act  activation function of the transformer layers
--batch_size  batch size
--epochs  number of epochs
--lr  learning rate of the Adam optimizer
--cell_line  cell line name of the disease (e.g. PC3, A549)
--pad_idx  index of the pad token
--start_idx  index of the start token
--end_idx  index of the end token
--max_len  maximum length of a generated molecule
--vocab_size  vocabulary size
--k  number of molecules generated in a single beam search
--alpha  weight balancing the length and score of molecules generated by beam search (see the sketch after this list)
--seq_num number of molecules ultimately retained
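--k, --alpha, and --seq_num control the beam search that generates molecules. A common way such a length/score trade-off is scored (TransGEM's exact formula may differ) is to normalize a candidate's accumulated log-probability by its length raised to alpha, as in this sketch:

```python
def beam_score(token_log_probs, alpha=0.75):
    """Length-normalized beam-search score: sum of token log-probabilities divided by
    length**alpha. Larger alpha gives longer candidates a better chance. This is a
    common convention, not necessarily the exact formula used in TransGEM."""
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

# Toy comparison of a short and a longer candidate (dummy log-probabilities).
short = [-0.2, -0.3, -0.4]
longer = [-0.2, -0.3, -0.4, -0.3, -0.2]
print(beam_score(short), beam_score(longer))
```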
- Model parameters for the 4 encoding forms
| Encoding form | hidden_dim | ff_dim | TF_N | TF_H |
| --- | --- | --- | --- | --- |
| value | 64 | 2048 | 6 | 8 |
| one_hot | 64 | 512 | 6 | 8 |
| binary | 64 | 512 | 6 | 8 |
| tenfold_binary | 64 | 512 | 6 | 8 |
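For intuition, here is one plausible reading of what the four --gene_encoder options above could do to a single differential-expression value. The bin edges, bit width, and sign handling below are assumptions made for illustration; the repository's actual encoders may differ.

```python
import numpy as np

def encode_value(x):
    # "value": use the expression value itself as a 1-dimensional feature.
    return np.array([x], dtype=np.float32)

def encode_one_hot(x, bins=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    # "one_hot": one-hot vector over coarse expression bins (bin edges are assumptions).
    out = np.zeros(len(bins) + 1, dtype=np.float32)
    out[int(np.digitize(x, bins))] = 1.0
    return out

def encode_binary(x, n_bits=8):
    # "binary": sign bit followed by the binary representation of the rounded
    # absolute value (bit width and rounding are assumptions).
    v = int(round(abs(x)))
    bits = [(v >> i) & 1 for i in reversed(range(n_bits))]
    return np.array([1 if x < 0 else 0] + bits, dtype=np.float32)

def encode_tenfold_binary(x, n_bits=8):
    # "tenfold_binary": as "binary", but the value is first multiplied by 10 so that
    # one decimal place survives the rounding.
    return encode_binary(x * 10, n_bits=n_bits)

print(encode_tenfold_binary(1.37))  # e.g. a z-scored differential expression value
```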