Create RGDR modules #38

Closed
semvijverberg opened this issue Jun 30, 2022 · 3 comments · Fixed by #68

Comments

@semvijverberg
Member

We want to have a minimal workflow to test our pipeline, and we would like to implement the RGDR method to reduce dimensionality. The approach broadly consists of the following steps (a minimal sketch follows the list); a basic minimum workflow is now implemented in prototype_RGDR.ipynb in the branch prototype_RGDR_usecase. For more info on the method, see this paper.

  1. gridpoint-wise correlation map analysis versus a 1-d target
  2. cluster these regions via DBSCAN
  3. calculate a 1-d timeseries for each label (i.e. precursor region)
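
The sketch below, assuming a (time, lat, lon) precursor field and a 1-d target series, is only meant to illustrate the three steps; all names and parameters are placeholders rather than the package API, and a real implementation would likely cluster on geographical coordinates instead of grid indices.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import DBSCAN

def rgdr_sketch(field, target, p_threshold=0.05, eps=2.0, min_samples=3):
    """field: array (time, lat, lon); target: array (time,)."""
    n_time, n_lat, n_lon = field.shape

    # 1. gridpoint-wise correlation of the field with the 1-d target
    corr = np.zeros((n_lat, n_lon))
    pval = np.ones((n_lat, n_lon))
    for i in range(n_lat):
        for j in range(n_lon):
            corr[i, j], pval[i, j] = pearsonr(field[:, i, j], target)

    # 2. cluster the significantly correlated gridpoints with DBSCAN
    #    (positive and negative correlations would normally be clustered separately)
    coords = np.argwhere(pval < p_threshold)          # (n_points, 2) lat/lon indices
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)

    # 3. one 1-d timeseries per labelled precursor region (spatial mean)
    series = {}
    for lab in set(labels) - {-1}:                    # -1 is DBSCAN noise
        pts = coords[labels == lab]
        series[lab] = field[:, pts[:, 0], pts[:, 1]].mean(axis=1)
    return corr, labels, series
```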

My vision is to create two separate Python files:

  1. one purely for the (partial) correlation map analysis, with support for regression analysis as well; now called map_analysis.py
  2. a second one for the clustering analysis (people might want to use this functionality stand-alone); now called cluster_regions.py

I already created some NotImplemented functions in both Python files. To a large extent, cluster_regions.py is a refactor of the original find_precursors.py in proto.
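
A rough idea of what those stubs could look like; the signatures here are hypothetical and the actual functions in the branch may differ:

```python
# map_analysis.py -- gridpoint-wise (partial) correlation / regression maps
def correlation_map(field, target):
    """Correlate every gridpoint of `field` with the 1-d `target` series."""
    raise NotImplementedError

def partial_correlation_map(field, target, covariates):
    """Correlation maps while controlling for one or more covariate series."""
    raise NotImplementedError

def regression_map(field, target):
    """Gridpoint-wise regression coefficients of `field` on `target`."""
    raise NotImplementedError


# cluster_regions.py -- stand-alone clustering of (correlation) maps
def cluster_dbscan(corr_map, pval_map, alpha=0.05, eps=None):
    """Group significant gridpoints into labelled precursor regions."""
    raise NotImplementedError
```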

@geek-yang
Member

Thanks for the explanation and for completing the minimum example @semvijverberg. #36 already seems to be a very nice starting point. Now, I think we need to discuss the design of the new module. Going back to the whiteboard discussion we had before, my vision of the modules in the package looks like this:

[image: overview of the modules in the package]

Basically, what we are trying to address here is the module for dimensionality reduction. In this module, we want to have several methods, such as:

  • PCA
  • MCA
  • RGDR (Response guided dimensionality reduction)
  • ...

For most of these methods, the implementations are not difficult and there are packages we can easily make use of (e.g. statsmodels, scipy, scikit-learn); only for RGDR do we need to write the code ourselves. But this method is quite valuable (and essential for s2s), and we want to include it as a minimum example.
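
For example, PCA on a flattened (time, space) precursor matrix is essentially a one-liner with scikit-learn (illustrative only; the variable names below are not from our codebase):

```python
import numpy as np
from sklearn.decomposition import PCA

field = np.random.rand(120, 40 * 60)             # 120 timesteps, 40x60 gridpoints flattened
pcs = PCA(n_components=5).fit_transform(field)   # (120, 5) principal-component timeseries
```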

This is my understanding of what we've done so far. Before diving into the implementation of the RGDR method, we need to think about the general API design of this module: how we want it to work with the modules we already have (e.g. time.py, traintest.py), and with the modules we skip for now but would like to have later (e.g. a preprocess module). Once we decide on the design of this module, we will know which Python files we need and how to organize them. For now, RGDR is just one method (although it carries the largest workload for us), so let's think about the overall design first and then get into the details of the implementation.

My vision is to create two separate Python files:
one purely for the (partial) correlation map analysis, with support for regression analysis as well; now called map_analysis.py
a second one for the clustering analysis (people might want to use this functionality stand-alone); now called cluster_regions.py

To me, it is more logical that we create a new module to do dimensionality reduction (dimensionality.py?). We support multiple methods in it, and RGDR is one of those methods. Based on the code you have, we know which parameters we need. The implementation of RGDR could be placed in an independent Python file. Regarding the correlation map analysis, I think we need to think this through, as I would assume that the user may also want to check correlations between timeseries, as well as the autocorrelation of the timeseries at each data point. These are relevant topics and we would like to group them together.
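
One possible reading of this proposal, sketched below: dimensionality.py stays a thin dispatcher and RGDR lives in its own file (all names here are hypothetical, not an agreed design):

```python
# dimensionality.py (hypothetical) -- thin dispatcher over supported methods
from sklearn.decomposition import PCA


def reduce_dimension(data, target=None, method="pca", **kwargs):
    """Dispatch a (time, space) array to one of the dimensionality-reduction methods."""
    if method == "pca":
        return PCA(**kwargs).fit_transform(data)
    if method == "rgdr":
        # RGDR additionally needs the 1-d target series; delegate to its own module
        raise NotImplementedError("delegate to the stand-alone RGDR implementation")
    raise ValueError(f"unknown method: {method}")
```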

@semvijverberg
Member Author

Dear Yang,

Thanks for the nice workflow image!

Yes, creating one Python script (dimensionality.py) is an option, but I prefer not to group all methods together, to keep things clearer for the users (and ourselves). I do think one overarching package such as dimensionality could/should exist, but it seems inconsistent to 'hide' RGDR in dimensionality.py.

All these packages (see the overview below) are already stand-alone packages, and (in my vision) the task of dimensionality.py is to be a wrapper that enables integration with our pipeline (built upon time.py and traintest.py).

[image: overview of the stand-alone packages]

It seems logical to me that RGDR should become a stand-alone thing too. Maybe this requires an in-person discussion :).

@geek-yang
Member

Thanks for the illustration. The pipeline idea sounds good to me. When we discussed the plan for the next step on the whiteboard, we came up with the thought that maybe we could use sklearn.pipeline to accommodate our workflow, and see how it fits the recipe of the methods we would like to implement. To me, it makes sense to keep RGDR as an individual Python module, since it includes many operations and features. But since it is also a method for dimensionality reduction, we had better fit it into the pipeline (or wrapper) as well, just to provide a consistent interface to the user, unless we have a specific reason to keep it isolated. Let's discuss it the next time we meet. 😄
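
To make that concrete, a rough sketch of how RGDR could expose the scikit-learn estimator interface so it slots into sklearn.pipeline; the class name, parameters, and placeholder bodies are hypothetical, not the agreed API:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class RGDR(BaseEstimator, TransformerMixin):
    def __init__(self, eps=600, alpha=0.05):
        self.eps = eps          # DBSCAN neighbourhood size (hypothetical unit: km)
        self.alpha = alpha      # significance threshold for the correlation map

    def fit(self, X, y):
        # correlation map + DBSCAN clustering would be fitted here, on training data only
        return self

    def transform(self, X):
        # would reduce the gridded field X to one timeseries per precursor region
        return X                # placeholder


# pipe.fit(X_train, y_train) would then fit RGDR on the training fold before the model
pipe = Pipeline([("rgdr", RGDR()), ("model", Ridge())])
```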
