This repository showcases the application of Logistic Regression and Random Forest for gene expression analysis. It automates data processing, model training, and evaluation.
- Preprocessing: Formats gene expression data
- Machine Learning: Uses Logistic Regression & Random Forest
- Evaluation: Generates accuracy reports and confusion matrices
- Automation: Includes a shell script & GitHub Actions
📂 **ml_gene_expression_project**
┣ 📂 **data/** → Stores gene expression data
┣ 📂 **results/** → Model reports and visualizations
┣ 📜 **preprocess.py** → Data processing script
┣ 📜 **train.py** → Model training script
┣ 📜 **evaluate.py** → Model evaluation with visualization
┣ 📜 **run_pipeline.sh** → Shell script to automate pipeline execution
┣ 📜 **run_pipeline.yml** → GitHub Actions workflow
┣ 📜 **app.py** → Streamlit app for interactive visualization
┣ 📜 **README.md** → Project documentation
Run the entire pipeline with a single command:
bash run_pipeline.sh
The pipeline produces:
- Model accuracy reports in results/
- Confusion matrices saved as images
The GitHub Actions workflow (run_pipeline.yml) runs the pipeline automatically when new data is pushed.
- Clone the repository:
git clone https://github.com/sivkri/GeneExpression-MachineLearning.git
- Navigate to the directory:
cd GeneExpression-MachineLearning
- Run the pipeline:
bash run_pipeline.sh
- Launch the Streamlit app:
streamlit run app.py
- Identified top genes differentiating wild-type (WT) vs knockout (KO) conditions
- Evaluated the impact of Eltrombopag (E20) treatment
- Achieved 75% accuracy with the Random Forest classifier
This project applies machine learning techniques to analyze gene expression data under different experimental conditions. Using logistic regression and random forest classifiers, we identify genes differentially expressed due to HuR knockout (ELAVL1 deletion) and Eltrombopag (E20) drug treatment.
Additionally, a Streamlit web application is integrated to provide an interactive visualization of the results.
If you use this dataset or findings, please cite the following study:
📖 DOI: 10.1186/s12915-025-02131-z
This study investigates how HuR knockout (KO) and Eltrombopag (E20) treatment influence gene expression compared to wild-type (WT) and mock treatment (DMSO).
The dataset consists of the following experimental conditions:
Sample Group | Description |
---|---|
WT-DMSO | Wild-type (WT) cells treated with mock (DMSO) |
WT-E20 | Wild-type (WT) cells treated with Eltrombopag (E20) |
KO-DMSO | HuR knockout (KO) cells treated with mock (DMSO) |
KO-E20 | HuR knockout (KO) cells treated with Eltrombopag (E20) |
Each sample contains gene expression data across thousands of genes. HuR (ELAVL1) is a key RNA-binding protein, and its knockout may significantly alter gene expression. Eltrombopag is a thrombopoietin receptor agonist that may influence transcriptional programs.
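Each group name encodes two factors, genotype and treatment, so it can help to split the name before building labels. The snippet below is a minimal sketch of that idea; `SampleGroup` and `parse_sample_group` are hypothetical helpers for illustration and are not taken from the repository's scripts.

```python
from typing import NamedTuple


class SampleGroup(NamedTuple):
    genotype: str   # "WT" or "KO"
    treatment: str  # "DMSO" or "E20"


def parse_sample_group(name: str) -> SampleGroup:
    """Split a group label such as 'KO-E20' into genotype and treatment."""
    genotype, treatment = name.strip().split("-", 1)
    if genotype not in {"WT", "KO"}:
        raise ValueError(f"unknown genotype in {name!r}")
    if treatment not in {"DMSO", "E20"}:
        raise ValueError(f"unknown treatment in {name!r}")
    return SampleGroup(genotype, treatment)


# The four groups listed in the table above.
groups = ["WT-DMSO", "WT-E20", "KO-DMSO", "KO-E20"]
parsed = {g: parse_sample_group(g) for g in groups}
```

Keeping the two factors separate makes it straightforward to select, for example, all DMSO samples regardless of genotype.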
We performed three comparisons, each framed as a binary supervised classification of gene expression profiles.
- Comparison: WT-DMSO vs. KO-DMSO
  - Objective: Identify genes affected by HuR deletion.
  - Machine Learning Approach:
    - Features: Gene expression levels
    - Labels: WT-DMSO (class 0) vs. KO-DMSO (class 1)
- Comparison: WT-DMSO vs. WT-E20
  - Objective: Determine gene expression changes due to Eltrombopag in normal cells.
  - Machine Learning Approach:
    - Features: Gene expression levels
    - Labels: WT-DMSO (class 0) vs. WT-E20 (class 1)
- Comparison: KO-DMSO vs. KO-E20
  - Objective: Understand the HuR-dependent response to Eltrombopag.
  - Machine Learning Approach:
    - Features: Gene expression levels
    - Labels: KO-DMSO (class 0) vs. KO-E20 (class 1)
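Each comparison above reduces to the same recipe: keep the samples from two groups, encode the first group as class 0 and the second as class 1, and fit both classifiers. The sketch below illustrates this with scikit-learn on synthetic random data. The helper `make_comparison`, the array shapes, and the hyperparameters are assumptions for illustration and may differ from what train.py does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def make_comparison(expr, groups, class0, class1):
    """Keep only samples from the two groups and encode them as 0/1 labels."""
    groups = np.asarray(groups)
    mask = (groups == class0) | (groups == class1)
    X = np.asarray(expr)[mask]
    y = (groups[mask] == class1).astype(int)
    return X, y


# Synthetic stand-in data: 8 samples x 50 genes (NOT the real dataset).
rng = np.random.default_rng(0)
groups = ["WT-DMSO", "WT-E20", "KO-DMSO", "KO-E20"] * 2
expr = rng.normal(size=(len(groups), 50))

# Comparison 1: WT-DMSO (class 0) vs. KO-DMSO (class 1).
X, y = make_comparison(expr, groups, "WT-DMSO", "KO-DMSO")

# Hyperparameters here are illustrative defaults, not the project's settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
```

The other two comparisons only change the last two arguments, for example `make_comparison(expr, groups, "KO-DMSO", "KO-E20")`.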
- Preprocessing:
  - Normalize expression data
  - Convert into a machine-learning-ready format
- Model Training & Feature Selection:
  - Train logistic regression and random forest classifiers
  - Perform Principal Component Analysis (PCA)
- Evaluation:
  - Compute accuracy, confusion matrices, classification reports
  - Identify top differentially expressed genes
- Visualization & Reporting:
  - Generate PCA scatter plots
  - Save model performance metrics
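The workflow above can be sketched end to end with scikit-learn. This is an illustrative outline on synthetic data, not the repository's implementation. It makes several assumptions: normalization is taken to be a log2 transform followed by per-gene standardization, and genes are ranked by random forest feature importance, which is one possible way to shortlist candidate genes and may not be the ranking used in evaluate.py.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 40 samples x 30 genes; gene_0 is raised in class 1.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
counts = rng.poisson(lam=50, size=(40, 30)).astype(float)
counts[y == 1, 0] += 200
gene_names = np.array([f"gene_{i}" for i in range(30)])

# Preprocessing (assumed): log2 transform, then per-gene standardization
# fitted on the training split only to avoid leaking test information.
X = np.log2(counts + 1.0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# PCA: two components give the coordinates for a scatter plot.
pcs = PCA(n_components=2).fit_transform(scaler.transform(X))

# Training and evaluation on the held-out split.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Rank genes by importance (a proxy, not a differential expression test).
order = np.argsort(clf.feature_importances_)[::-1]
top_genes = gene_names[order[:5]]
print(f"accuracy: {acc:.2f}")
print(cm)
print("top genes:", ", ".join(top_genes))
```

Feature importance only indicates which genes the classifier relied on; confirming differential expression would still require a dedicated statistical test.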
For queries, feel free to reach out! 🚀