Sampling model causes memory to max then crash the os on Linux #1207

Closed
mike-lawrence opened this issue Sep 29, 2023 · 10 comments

Comments

@mike-lawrence

I'm on Ubuntu 22.04 running CmdStan 2.33.1 with an empty makefile. Attached is R code that samples a Stan model's priors to generate synthetic data, which is then fed back in as data for sampling. The prior sampling and generated data all look OK, but when the real sampling starts my RAM balloons rapidly to 100% and then Ubuntu crashes.

Further fiddling identifies the following user-defined function as the culprit:

	vector time_varying_frequency_sine(
		vector hz
		, vector amp
		, vector time
	){
		int n_time = num_elements(time) ;
		vector[n_time] out ;
		vector[n_time] time_copy = time ;
		int next_sweep_start_index = 1 ;
		int this_sweep_start_index ;
		int this_sweep_end_index ;
		vector[n_time] time_minus_next_sweep_start_time ;
		int sweep_num = 0 ;
		real tmp ;
		while(next_sweep_start_index<n_time){
			sweep_num += 1 ;
			this_sweep_start_index = next_sweep_start_index ;
			time_minus_next_sweep_start_time = time_copy - (1/hz[this_sweep_start_index]) ;
			while(
				(next_sweep_start_index<n_time)
				&&
				(time_minus_next_sweep_start_time[next_sweep_start_index]<0)
			){
				next_sweep_start_index += 1 ;
			}
			this_sweep_end_index = next_sweep_start_index - 1 ;
			out[this_sweep_start_index:this_sweep_end_index] = (
				sine(
					hz[this_sweep_start_index]
					, time_copy[this_sweep_start_index:this_sweep_end_index]
				)
				* amp[this_sweep_start_index]
			) ;
			tmp = (
				sine(
					hz[this_sweep_start_index]
					, time_copy[next_sweep_start_index]
				)
				* amp[this_sweep_start_index]
			) ;
			time_copy = time_copy - (1/hz[this_sweep_start_index]);
		}
		if(this_sweep_end_index<n_time){
			out[n_time] = tmp ;
		}
		return(out) ;
	}

This function runs fine in the generated quantities (GQ) block, but when used in the transformed parameters (TP) block the crash noted above occurs.

@WardBrian
Member

@mike-lawrence can you provide the data in a JSON format rather than the R script?

@mike-lawrence
Author

mike-lawrence commented Sep 29, 2023

Sure, here you go

@mike-lawrence
Author

Sampling with a single chain avoids the OS crash, but RAM usage climbs to about 26 GB (my system has 32 GB) and the chain "failed unexpectedly" after two warmup iterations.

@WardBrian
Member

The inner while loop is not being executed in the second iteration of HMC. Not sure why yet, but that's the problem (next_sweep_start_index never changes).
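
To make that observation concrete, here is a minimal Python sketch that mirrors only the index bookkeeping of time_varying_frequency_sine (the sine calls and amp are dropped). The name count_outer_iterations and the max_outer cap are hypothetical additions for illustration; the Stan code has no such cap. The sketch assumes, without confirmation from this thread, that a non-positive hz is one way the inner loop can fail to advance.

```python
def count_outer_iterations(hz, time, max_outer=10_000):
    """Count outer-loop passes of the sweep logic.

    Returns None if max_outer is exceeded (hypothetical safety cap;
    the Stan original would keep looping instead).
    """
    n_time = len(time)
    time_copy = list(time)
    next_start = 0  # 0-based here; the Stan code is 1-based
    outer = 0
    while next_start < n_time - 1:
        outer += 1
        if outer > max_outer:
            return None
        period = 1.0 / hz[next_start]
        # Same shift the Stan code applies to time_copy on every pass.
        shifted = [t - period for t in time_copy]
        # Advance only while the shifted time is still negative.
        while next_start < n_time - 1 and shifted[next_start] < 0:
            next_start += 1
        time_copy = shifted
    return outer


if __name__ == "__main__":
    grid = [i / 10 for i in range(10)]
    print(count_outer_iterations([2.0] * 10, grid))   # positive hz
    print(count_outer_iterations([-2.0] * 10, grid))  # negative hz
```

On an evenly spaced time grid from 0 to 0.9, a constant hz of 2 finishes in two outer passes. A constant hz of -2 makes every shifted time positive, so next_start never moves and only the cap stops the loop.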

@mike-lawrence
Author

After adding a bunch of print statements inside the user-defined function and sampling with adapt_engaged=1 iter_warmup=0 iter_sampling=1 refresh=1, it seems to complete one iteration (it prints all the added print statements, plus the Chain 1 Iteration: 1 / 1 [100%] message) but hangs for some reason thereafter (RAM hits 98%; the OS doesn't crash, but RStudio eventually does).

@mike-lawrence
Author

Not sure what I did differently, but a second attempt at the above configuration eventually un-hung and returned seemingly successfully. However, when the $draws() method is then called on the returned CmdStanMCMC object, it errors with Error in UseMethod("subset_draws") : no applicable method for 'subset_draws' applied to an object of class "NULL", and inspection of the CSV indeed shows only the header, no draws.

@mike-lawrence
Author

mike-lawrence commented Sep 30, 2023

Oh, if the contents of the TP block are instead put into the model block, the model samples a single chain just fine (notably, with only 3 GB of RAM usage). If two or more parallel chains are attempted, RAM jumps and, at least in the run I just did, one of the chains "failed unexpectedly". Could this somehow be a bug related to the writer service receiving quantities that are outputs of a user-defined function (and this specific UDF, as I've written outputs from other UDFs just fine in the past)? Tagging writer expert @mitzimorris in case they have insight here.

@mike-lawrence
Author

Hm. When varying the seed, for some seeds a single chain finishes OK with reasonable RAM usage, and for other seeds RAM immediately explodes and the result is "chain finished unexpectedly". I'll re-add the prints to see if I can discern whether there's a particular area of the parameter space that's pathological. I thought I had written the function so that it's guaranteed to exit the while loops regardless of the parameters, but I possibly missed a corner case.

@WardBrian
Member

When running under gdb I definitely ended up stuck in the outer loop. This will keep adding variables to the autodiff stack until you run out of RAM.
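
As a rough illustration of why a stuck loop eats memory, the Python sketch below uses a toy tape that records one node per output element of every vector subtraction and never frees them. The Tape class and nodes_after function are hypothetical, and this is not how Stan Math actually lays out its autodiff stack; it only shows that the recorded node count grows in proportion to the number of outer passes.

```python
class Tape:
    """Toy stand-in for a reverse-mode autodiff stack (illustrative only)."""

    def __init__(self):
        self.nodes = []

    def sub(self, values, shift):
        out = [v - shift for v in values]
        # Each result element is kept until a gradient pass that,
        # in a non-terminating loop, is never reached.
        self.nodes.extend(("sub", i) for i in range(len(out)))
        return out


def nodes_after(n_outer, n_time):
    """Count tape nodes after n_outer passes over a length-n_time vector."""
    tape = Tape()
    time_copy = [0.0] * n_time
    for _ in range(n_outer):
        # The Stan function subtracts 1/hz from time_copy twice per outer
        # pass: once into a temporary and once back into time_copy.
        _unused = tape.sub(time_copy, 0.5)
        time_copy = tape.sub(time_copy, 0.5)
    return len(tape.nodes)


if __name__ == "__main__":
    for passes in (10, 100, 1000):
        print(passes, nodes_after(passes, n_time=100))
```

In this toy model the count is 2 * n_outer * n_time, so a loop that never exits has no upper bound on memory.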

@mike-lawrence
Author

Closing, as I did indeed find the corner case that caused a non-terminating while loop. Sorry for the false alarm!


2 participants