-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for multiple lags in RGDR #85
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1)
. I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?
s2spy/rgdr/rgdr.py
Outdated
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size | ||
|
||
if np.any(n_clusters == 1): # A single cluster is the '0' (leftovers) cluster. | ||
empty_lags = data["i_interval"].values[n_clusters == 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just check within the loop?
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size | |
if np.any(n_clusters == 1): # A single cluster is the '0' (leftovers) cluster. | |
empty_lags = data["i_interval"].values[n_clusters == 0] | |
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size | |
if n_clusters == 1: # A single cluster is the '0' (leftovers) cluster. | |
empty_lags = data["i_interval"].values[n_clusters == 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked after the loop to avoid users needing to iteratively remove lags from their experiment.
Let's say the user picks lags 10 through 20, and from lag 15 onward there are no significant clusters. With the current code they would immediately be notified: No significant clusters found in lag(s): i_interval=[15, 16, 17, 18, 19, 20]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. On the other hand, if the clustering takes long, and it fails at the first lag, then they won't have to wait for it to complete to be notified of the error. Which scenario is more likely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't know!
When playing with the example data I set freq='10d'
, and it turns out that there were two "windows of predictability", namely close to the target date (i_interval
= 1:8), and ~7 months earlier (i_interval
= 21:25).
Having a summary of when significant clusters can be found could be very useful to users, but I am not sure 🤷♂️
As you said, not all lags have the same number of clusters, and additionally, the clusters sharing a label does not mean they represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind-of imply that. If we want to support this kind of selection we could create a utility function, but I think that the current way of flattening is required to be able to continue with fitting a model, or to be able to put RGDR in a pipeline. |
that's a convincing point.
I agree. Let's see if there's demand for that. |
Kudos, SonarCloud Quality Gate passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's take the remaining discussion into a new issue and stick with the current implementation for now.
This PR adds support for providing multiple lags to RGDR.
Example:
Note: when plotting data, the user needs to provide the lag they want to see (unless there is only a single lag).
Additionally, I refactored the DBSCAN implementation into more manageable chunks.