Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for multiple lags in RGDR #85

Merged
merged 2 commits into from
Aug 23, 2022
Merged

Support for multiple lags in RGDR #85

merged 2 commits into from
Aug 23, 2022

Conversation

BSchilperoort
Copy link
Contributor

@BSchilperoort BSchilperoort commented Aug 18, 2022

This PR adds support for providing multiple lags to RGDR.

Example:

>>> precursor_field = field_resampled.sst.isel(i_interval=slice(1,5)) # Multiple lags: 1 through 4
>>> rgdr = RGDR(min_area_km2=3000**2)
>>> clustered_data = rgdr.fit_transform(precursor_field)
>>> clustered_data.cluster_labels
<xarray.DataArray 'cluster_labels' (cluster_labels: 6)>
'lag:1_cluster:-2' 'lag:1_cluster:1' ... 'lag:3_cluster:-1' 'lag:4_cluster:-2'
Coordinates:
  * cluster_labels  (cluster_labels) <U20 'lag:1_cluster:-2' ... 'lag:4_clust...
    latitude        (cluster_labels) float64 36.05 29.44 37.33 29.58 38.14 39.78
    longitude       (cluster_labels) float64 223.9 185.4 221.8 190.2 217.8 219.3

Note: when plotting data, the user needs to provide the lag they want to see (unless there is only a single lag).

Additionally, I refactored the DBSCAN implementation into more manageable chunks.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@BSchilperoort BSchilperoort marked this pull request as ready for review August 18, 2022 10:54
@BSchilperoort BSchilperoort changed the title Added support for multiple lags in RGDR Support for multiple lags in RGDR Aug 18, 2022
Copy link
Contributor

@Peter9192 Peter9192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1). I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?

Comment on lines 111 to 114
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size

if np.any(n_clusters == 1): # A single cluster is the '0' (leftovers) cluster.
empty_lags = data["i_interval"].values[n_clusters == 0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just check within the loop?

Suggested change
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size
if np.any(n_clusters == 1): # A single cluster is the '0' (leftovers) cluster.
empty_lags = data["i_interval"].values[n_clusters == 0]
n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size
if n_clusters == 1: # A single cluster is the '0' (leftovers) cluster.
empty_lags = data["i_interval"].values[n_clusters == 0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked after the loop to avoid users needing to iteratively remove lags from their experiment.

Let's say the user picks lags 10 through 20, and from lag 15 onward there are no significant clusters. With the current code they would immediately be notified: No significant clusters found in lag(s): i_interval=[15, 16, 17, 18, 19, 20].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. On the other hand, if the clustering takes long, and it fails at the first lag, then they won't have to wait for it to complete to be notified of the error. Which scenario is more likely?

Copy link
Contributor Author

@BSchilperoort BSchilperoort Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't know!

When playing with the example data I set freq='10d', and it turns out that there were two "windows of predictability", namely close to the target date (i_interval = 1:8), and ~7 months earlier (i_interval = 21:25).

Having a summary of when significant clusters can be found could be very useful to users, but I am not sure 🤷‍♂️

@BSchilperoort
Copy link
Contributor Author

Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1). I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?

As you said, not all lags have the same number of clusters, and additionally, the clusters sharing a label does not mean they represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind-of imply that.

If we want to support this kind of selection we could create a utility function, but I think that the current way of flattening is required to be able to continue with fitting a model, or to be able to put RGDR in a pipeline.

@Peter9192
Copy link
Contributor

clusters sharing a label does not mean they represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind-of imply that

that's a convincing point.

If we want to support this kind of selection we could create a utility function

I agree. Let's see if there's demand for that.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

94.3% 94.3% Coverage
0.0% 0.0% Duplication

Copy link
Contributor

@Peter9192 Peter9192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take the remaining discussion into a new issue and stick with the current implementation for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants