Support for multiple lags in RGDR #85

BSchilperoort · 2022-08-18T09:59:06Z

This PR adds support for providing multiple lags to RGDR.

Example:

>>> precursor_field = field_resampled.sst.isel(i_interval=slice(1,5)) # Multiple lags: 1 through 4
>>> rgdr = RGDR(min_area_km2=3000**2)
>>> clustered_data = rgdr.fit_transform(precursor_field)
>>> clustered_data.cluster_labels
<xarray.DataArray 'cluster_labels' (cluster_labels: 6)>
'lag:1_cluster:-2' 'lag:1_cluster:1' ... 'lag:3_cluster:-1' 'lag:4_cluster:-2'
Coordinates:
  * cluster_labels  (cluster_labels) <U20 'lag:1_cluster:-2' ... 'lag:4_clust...
    latitude        (cluster_labels) float64 36.05 29.44 37.33 29.58 38.14 39.78
    longitude       (cluster_labels) float64 223.9 185.4 221.8 190.2 217.8 219.3

Note: when plotting data, the user needs to provide the lag they want to see (unless there is only a single lag).

Additionally, I refactored the DBSCAN implementation into more manageable chunks.

review-notebook-app · 2022-08-18T09:59:10Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Peter9192

Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1). I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?

Peter9192 · 2022-08-19T07:25:51Z

s2spy/rgdr/rgdr.py

+            n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size
+
+        if np.any(n_clusters == 1):  # A single cluster is the '0' (leftovers) cluster.
+            empty_lags = data["i_interval"].values[n_clusters == 0]


Why not just check within the loop?

Suggested change

-            n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size
-
-        if np.any(n_clusters == 1):  # A single cluster is the '0' (leftovers) cluster.
-            empty_lags = data["i_interval"].values[n_clusters == 0]
+            n_clusters[i] = np.unique(data.isel(i_interval=i).cluster_labels).size
+
+            if n_clusters == 1:  # A single cluster is the '0' (leftovers) cluster.
+                empty_lags = data["i_interval"].values[n_clusters == 0]

BSchilperoort · 2022-08-19

I checked after the loop so that users do not have to remove empty lags from their experiment one at a time.

Let's say the user picks lags 10 through 20, and from lag 15 onward there are no significant clusters. With the current code they would immediately be notified: No significant clusters found in lag(s): i_interval=[15, 16, 17, 18, 19, 20].
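
The "check after the loop" behaviour described above can be sketched in plain Python. This is only an illustration, not the code in `s2spy/rgdr/rgdr.py`: the helper names `find_empty_lags` and `check_lags`, and the dictionary of labels per lag, are hypothetical, and a lag is assumed to be empty when its only label is `0`. The same condition is used both to detect and to select the empty lags.

```python
def find_empty_lags(labels_per_lag):
    """Return the lags whose only cluster label is 0 (the 'leftovers' cluster)."""
    empty = []
    for lag, labels in labels_per_lag.items():
        if set(labels) == {0}:
            empty.append(lag)
    return empty


def check_lags(labels_per_lag):
    """Raise once, after every lag has been inspected, naming all empty lags."""
    empty = find_empty_lags(labels_per_lag)
    if empty:
        raise ValueError(
            f"No significant clusters found in lag(s): i_interval={empty}"
        )


# Hypothetical labels: lags 10-14 contain clusters, lags 15-20 do not.
labels_per_lag = {lag: [0, 1, -1] for lag in range(10, 15)}
labels_per_lag.update({lag: [0, 0, 0] for lag in range(15, 21)})

try:
    check_lags(labels_per_lag)
except ValueError as err:
    message = str(err)
```

Checking inside the loop instead would raise at the first empty lag, which gives faster feedback but reports only that one lag.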

Peter9192 · 2022-08-19

Good point. On the other hand, if the clustering takes a long time and it fails at the first lag, they won't have to wait for it to complete before being notified of the error. Which scenario is more likely?

BSchilperoort · 2022-08-22

I really don't know!

When playing with the example data I set freq='10d', and it turns out that there were two "windows of predictability", namely close to the target date (i_interval = 1:8), and ~7 months earlier (i_interval = 21:25).

Having a summary of when significant clusters can be found could be very useful to users, but I am not sure 🤷‍♂️
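
One possible shape for such a summary, as a rough sketch: group the lags that have significant clusters into runs of consecutive `i_interval` values. The function name `predictability_windows` is hypothetical, and treating the two ranges from the example as inclusive is an assumption made only for the demonstration data.

```python
def predictability_windows(lags_with_clusters):
    """Group lags into runs of consecutive integers, returned as (first, last)."""
    windows = []
    for lag in sorted(set(lags_with_clusters)):
        if windows and lag == windows[-1][1] + 1:
            # Extend the current window by one lag.
            windows[-1] = (windows[-1][0], lag)
        else:
            # Start a new window at this lag.
            windows.append((lag, lag))
    return windows


# Hypothetical input loosely following the freq='10d' example above.
significant = list(range(1, 9)) + list(range(21, 26))
windows = predictability_windows(significant)
```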

BSchilperoort · 2022-08-19T08:22:15Z

> Awesome! I'm just wondering if there's an easy way to extract clusters for a certain lag after applying RGDR, e.g. such that you could do: clustered_data.sel(lag=1). I realize that not all lags have the same number of clusters, so it's not as easy as stacking them along "lag" dimension though. Unless we just fill them with NaNs... What do you think?

As you said, not all lags have the same number of clusters, and additionally, the clusters sharing a label does not mean they represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind-of imply that.

If we want to support this kind of selection we could create a utility function, but I think that the current way of flattening is required to be able to continue with fitting a model, or to be able to put RGDR in a pipeline.
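
A utility of that kind could work on the flattened labels themselves. The sketch below is hypothetical (neither `parse_label` nor `labels_for_lag` exists in s2spy as far as this thread shows) and assumes labels always follow the `lag:<int>_cluster:<int>` pattern seen in the example output.

```python
import re

# Assumed label format, taken from the example output: 'lag:1_cluster:-2'.
LABEL_PATTERN = re.compile(r"^lag:(-?\d+)_cluster:(-?\d+)$")


def parse_label(label):
    """Split a flattened label into its (lag, cluster) integers."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised cluster label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def labels_for_lag(labels, lag):
    """Return only the labels that belong to the requested lag."""
    return [label for label in labels if parse_label(label)[0] == lag]


labels = ["lag:1_cluster:-2", "lag:1_cluster:1", "lag:3_cluster:-1", "lag:4_cluster:-2"]
selected = labels_for_lag(labels, 1)
```

Since `cluster_labels` is an indexed coordinate in the example output, the resulting list could presumably be passed on as `clustered_data.sel(cluster_labels=selected)`, which leaves the flattened layout untouched for model fitting.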

Peter9192 · 2022-08-19T13:42:50Z

> clusters sharing a label does not mean they represent the same physical regions. I feel like making the cluster labels a dimension along with lag would kind-of imply that

that's a convincing point.

> If we want to support this kind of selection we could create a utility function

I agree. Let's see if there's demand for that.

sonarqubecloud · 2022-08-22T14:37:04Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

94.3% Coverage
0.0% Duplication

Peter9192

Let's take the remaining discussion into a new issue and stick with the current implementation for now.

Added support for multiple lags in RGDR

a3b52bb

BSchilperoort requested a review from Peter9192 August 18, 2022 10:02

BSchilperoort marked this pull request as ready for review August 18, 2022 10:54

BSchilperoort changed the title ~~Added support for multiple lags in RGDR~~ Support for multiple lags in RGDR Aug 18, 2022

BSchilperoort mentioned this pull request Aug 18, 2022

Use case for lorenz workshop #84

Closed

Peter9192 reviewed Aug 19, 2022

Renamed internal dbscan routines, added docstrings

d470004

Peter9192 approved these changes Aug 23, 2022

BSchilperoort mentioned this pull request Aug 23, 2022

When/how should RGDR fail if a lag has no significant clusters #86

Closed

BSchilperoort merged commit d228782 into main Aug 23, 2022

BSchilperoort deleted the rgdr_modifications branch August 23, 2022 07:59
