Recommendations taking a long time #192
The long time with 1000 data points could be caused by running into the memory limit. I imagine that's possible on a 16 GB machine, as it might write stuff temporarily to disk (super slow). But it's weird if it also happens for only 10 added data points. Did you monitor memory after you start requesting recommendations? There's also the possibility that you've constructed a gigantic search space (not in feature dimension but in number of combinations). Please provide the way you construct it. I'd also be interested in the dimensionality of your search space obtained via. To investigate whether the surrogate model choice is causing this, you could use a more scaling-friendly random forest model (here done for an ngboost model: https://emdgroup.github.io/baybe/examples/Custom_Surrogates/surrogate_params.html, but you can easily change it to
Hi @Scienfitz, thanks for the quick reply! I suspect memory isn't the issue because memory stays far below my capacity (usually ~2 GB) throughout the duration of the recommendation. I construct the search space like this: comp_rep produces a table with 43,740 rows x 194 columns. Also, when I try to use a RandomForestSurrogate, I get the following error:
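As a quick sanity check on the memory question, the reported comp_rep shape alone implies a fairly small table. This is just back-of-the-envelope arithmetic using the numbers from the comment above; the float64 assumption is mine, not confirmed in the thread:

```python
# Rough size of the comp_rep table reported above (43,740 rows x 194
# columns), assuming every entry is a float64 (8 bytes) -- an assumption.
rows, cols, bytes_per_value = 43_740, 194, 8
size_mb = rows * cols * bytes_per_value / 1024**2
print(round(size_mb, 1))  # roughly 65 MB
```

So the representation itself is tiny compared to 16 GB, consistent with the observation that memory stays around 2 GB.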
OK, I overlooked that you have a hybrid search space; in that case a random forest can't be used. I can't see anything obvious, although 2 GB seems almost suspiciously low memory usage. Let's wait if the other devs have more ideas. In the meantime you could also try to model all parameters as discrete. For that, the. Are you trying to model a mixture by any chance? We have a detailed example for how to do it with all discrete parameters. If you want to do that with molecular representations, it can get complicated rather quickly due to the constraints. This is also one of the reasons why I suggest all-discrete parameters: some of the needed constraints are not yet supported between sets of mixed (i.e. both continuous and discrete) parameters.
Hi @brandon-holt, thanks for the report 🥇 I (am 99% confident that I) know exactly what causes the problem. I will compile a detailed explanation and share a few suggestions in the next hour, but wanted to briefly speak up so that you and @Scienfitz can stop searching for the cause. It has nothing to do with your memory or the fact that you are using GPs, but with the fact that you have a hybrid search space with a discrete part of non-trivial size, which makes the used optimization routine explode 💥 (details to come ...)
I agree with @AdrianSosic that this is probably the reason. Won't go into much detail here until he has posted the more detailed explanation, but just wanted to confirm that this is probably the issue :) |
Now, finally, the explanation:

### Problem

The "problem" with your setting is, as mentioned above, that you operate in a hybrid space and the applied optimizer simply does not work well in situations where the discrete subspace is large. Our main workhorse, the

### Solution

Unfortunately, optimizing hybrid spaces is notoriously hard and there is no easy solution – it's an active field of research. We already started investigating more scalable approaches some time ago, but back then our code base was not yet ready for such advanced techniques. Today, the situation is different and we are already planning to continue our work on that end. In the meantime, I think there are only the following things you can do:
The last one can be done very easily and we have a special convenience constructor for that. You'll find an example below. In this approach, you are only limited by your computer's memory size.

### Example

The following should roughly reproduce your setting. The discretized version takes about one minute on my machine. However, the discretization is rather crude and finer resolutions will quickly crash your memory. Also, the involved dataframe operations take quite some time, but at least for this part there is already a fix on the horizon (we are planning to transition to polars soon ...)

#### Setup Code

```python
import numpy as np
from baybe.campaign import Campaign
from baybe.constraints.continuous import ContinuousLinearInequalityConstraint
from baybe.objective import Objective
from baybe.parameters.numerical import (
    NumericalContinuousParameter,
    NumericalDiscreteParameter,
)
from baybe.parameters.substance import SubstanceParameter
from baybe.searchspace.core import SearchSpace
from baybe.searchspace.discrete import SubspaceDiscrete
from baybe.targets.numerical import NumericalTarget
substances = [
"C", "CC", "CN", "CO", "CCC", "CCN", "CCO", "CNC", "COC", "CCNC", "CCOC", "CNCN",
"COCN", "COCO", "CCCOC", "CCNCC", "CCOCO", "CNCCO", "CNCNC", "CNCOC", "COCCN",
"CCCCOC", "CCNCNC", "CCNCOC", "CCOCOC", "CNCCOC", "CNCNCN", "CNCNCO", "CNCOCO",
"COCCOC", "CCCNCOC", "CCNCCNC", "CCOCCOC", "CCOCNCO", "CNCNCNC", "CNCNCOC",
"CNCOCCN", "CNCOCOC", "COCNCCN", "CCCCCCCO", "CCCOCOCO", "CCNCCNCO", "CCNCCOCN",
"CCNCOCCO", "CCNCOCOC", "CCOCNCOC", "CNCCNCOC", "CNCNCCOC", "COCCCNCN",
"COCCNCCO", "CCCCNCNCN", "CCCCOCOCN", "CCNCCNCOC", "CCOCCOCNC", "CNCNCNCOC",
"CNCNCOCNC", "CNCNCOCOC", "CNCOCNCOC", "COCCNCNCN", "COCCOCNCO", "COCOCOCOC",
"CCNCNCOCNC", "CCOCCCOCOC", "CNCCOCOCOC", "CNCNCNCCOC", "CNCNCOCCCN", "COCCOCNCOC",
"CCCOCCNCCOC", "CCNCOCCCCNC", "CNCCCCNCOCN", "CNCNCOCOCCO", "CNCOCOCOCOC",
"COCCCNCNCCN", "COCCNCCNCOC", "COCNCOCOCOC", "CCCCOCNCOCOC", "CCCOCNCOCOCC",
"CCNCCCNCNCOC", "CCNCNCNCNCOC", "CNCCOCNCNCNC", "CNCCOCOCOCNC", "CNCNCNCNCOCO",
"CNCOCOCNCNCN", "COCOCNCOCOCO", "CCNCNCCOCCCCO", "CNCCNCCOCNCNC", "CNCCNCNCNCOCO",
"CNCCOCCOCOCOC", "COCOCCCOCOCCO", "COCOCCNCNCNCN", "CCNCOCNCOCOCNC",
"CCOCCNCCNCNCOC", "CCOCCOCNCNCNCO", "CCOCNCCOCOCOCN", "CNCNCNCCCCOCOC",
"COCNCCCNCNCOCN", "COCOCCCCNCCOCO", "CCCCCCNCCCCCNCC", "CCNCOCCOCCNCCNC",
"CCOCCOCNCCOCCOC", "COCCNCNCNCOCCOC", "COCCNCOCCOCOCOC", "CCNCCCCNCCOCNCNC",
"CCNCCNCNCCNCOCNC", "CCNCOCCNCOCOCCNC", "CNCCCCCCCNCNCCOC", "CNCCCCOCCOCCNCNC",
"CNCNCNCOCCOCNCNC", "COCCCNCNCOCNCCOC", "COCNCCCOCNCNCCCN", "COCNCOCNCNCCCNCO",
"CCCOCCCNCOCOCCCNC", "CCCOCNCNCNCOCOCOC", "CCNCCCNCNCNCCNCNC", "CCNCOCCNCOCCNCCNC",
"CNCNCCNCOCNCCNCOC", "CNCNCCOCCCNCNCOCO", "CNCOCCNCCNCNCOCNC", "COCCCNCNCNCCOCNCN",
"COCOCCCNCCOCCOCOC", "CCCCCNCOCOCNCCOCCC", "CCNCOCNCNCCNCCNCOC",
"CCOCNCCNCNCNCCOCNC", "CNCCNCCCNCNCCCNCNC", "CNCCNCOCNCOCOCCNCN",
"CNCOCNCNCNCOCOCCOC", "CNCOCOCOCCOCOCNCCO", "COCNCCCCOCNCNCOCOC",
"COCOCOCNCNCOCNCCCN", "CCNCCNCNCCCCNCOCCCO", "CCNCNCCOCOCCOCCOCNC",
"CNCNCNCOCOCOCOCCCOC", "CNCOCNCCCOCNCOCNCCN", "CNCOCOCNCCNCOCCCCOC",
"COCNCCCOCOCOCCCNCCO", "CCNCCNCCCCOCOCNCCNCC", "CCOCNCOCCOCCCOCOCOCC",
"CNCNCNCOCNCNCNCCNCOC", "CNCOCCCCCOCCOCCCOCOC", "COCCNCCCOCNCCOCNCCOC",
] # fmt: skip
chunks = [substances[:10], substances[10:20], substances[20:24], substances[24:]]
substance_parameters = [
    SubstanceParameter(
        name=f"s_{i}",
        data={f"substance_{j}": substance for j, substance in enumerate(chunk)},
    )
    for i, chunk in enumerate(chunks)
]
targets = [NumericalTarget(name="target", mode="MAX")]
objective = Objective(mode="SINGLE", targets=targets)
```

#### Search Space: Hybrid Version (your approach)

```python
continuous_parameters = [
    NumericalContinuousParameter(name=f"c_{i}", bounds=(0, 1)) for i in range(4)
]
parameters = substance_parameters + continuous_parameters
constraints = [
    ContinuousLinearInequalityConstraint(
        parameters=[p.name for p in continuous_parameters],
        coefficients=[1.0 for _ in continuous_parameters],
        rhs=1.0,
    )
]
searchspace = SearchSpace.from_product(parameters, constraints)
```

#### Search Space: Discretized Version

```python
discrete_parameters = [
    NumericalDiscreteParameter(name=f"d_{i}", values=np.linspace(0, 1, 5))
    for i in range(4)
]
searchspace = SearchSpace(
    discrete=SubspaceDiscrete.from_simplex(
        max_sum=1.0,
        simplex_parameters=discrete_parameters,
        product_parameters=substance_parameters,
        boundary_only=True,
    )
)
```

#### Getting Recommendations

```python
campaign = Campaign(searchspace, objective)
recommendations = campaign.recommend(10)
recommendations["target"] = np.random.random(len(recommendations))
campaign.add_measurements(recommendations)
campaign.recommend(3)
```
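To make the "explosion" in the hybrid case concrete, here is a rough back-of-the-envelope sketch (not BayBE internals; all figures are assumed for illustration) of how the cost grows when each discrete configuration requires its own continuous optimization subproblem:

```python
import math

# Hypothetical cardinalities of the four substance parameters, roughly
# matching the chunk sizes in the setup above -- assumed, not measured.
discrete_sizes = [10, 10, 4, 100]
n_configs = math.prod(discrete_sizes)  # number of discrete configurations

# If the optimizer solves one continuous subproblem per discrete
# configuration, total cost scales linearly with n_configs.
seconds_per_subproblem = 0.1  # made-up figure, purely for scale
est_hours = n_configs * seconds_per_subproblem / 3600
print(n_configs, round(est_hours, 1))  # 40000 configurations, ~1.1 hours
```

Even at a fraction of a second per subproblem, tens of thousands of discrete configurations quickly add up to hours.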
Ah, forgot one more thing. Of course, you can also try to fix the problem from other angles. In fact, we have two other potential solutions in our current code base, but be aware that they are not yet properly tested against realistic examples and they will probably give you very crude approximations:
But as I said, both approaches are rather experimental and I wouldn't consider them actual solutions to your problem ...
Thanks for this detailed explanation @AdrianSosic :) Just an additional note regarding the
Also, regarding the use of

```python
seq_greedy_recommender = TwoPhaseMetaRecommender(
    recommender=SequentialGreedyRecommender(
        hybrid_sampler="Random", sampling_percentage=0.05
    ),
)
```

You can choose between two different
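A toy illustration of what `sampling_percentage` does (this is a stand-in sketch with made-up numbers, not BayBE's actual implementation): only a fraction of the discrete candidates is carried into the hybrid optimization step, which is why it cuts the runtime so drastically.

```python
import random

# Toy stand-in for the discrete subspace: 1,000 candidate configurations.
candidates = list(range(1000))

# With sampling_percentage=0.05, only ~5% of the discrete candidates are
# considered; hybrid_sampler="Random" picks them uniformly at random.
sampling_percentage = 0.05
random.seed(0)  # fixed seed for reproducibility of this sketch
subset = random.sample(candidates, k=int(sampling_percentage * len(candidates)))
print(len(subset))  # 50
```

The trade-off is that the optimum may lie in the discarded 95%, which is why the result is only an approximation.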
@Scienfitz @AdrianSosic @AVHopp Thank you all so much for this amazing responsiveness!! ❤️ These comments are all super helpful, and did help with the long run times. A couple of follow-up questions:
@Scienfitz Thank you, this is very helpful! Much appreciated |
@brandon-holt, a few additional comments to question 2:

### Discretization

Pros:
Cons:
### Relaxation

Pros:
Cons:
@Scienfitz @AdrianSosic Hi, following up on this: this worked very well for me with my original dataset. However, I am trying to include an expanded set of 11 features that is pushing the memory over the brink of what I have available again, and I'm wondering if there's a better way to include these features. I'm attaching a spreadsheet of the dataset of the 11 additional features to give you an idea of the complexity of the dataset. Is there something you see in here that would lend itself to a different way of constructing the search space? Is 11 new features really that much, given the context?
Hi @brandon-holt, my apologies – I saw your message on the weekend and then forgot on Monday, so it got lost 🙈 Can you quickly bring me up to speed again on how exactly you currently try to create the search space based on this table? That is, if you have loaded the csv into a dataframe
@AdrianSosic Hey, no worries, it happens! So this would be revising based on the "Search Space: Discretized Version" in the tagged comment:

```python
# imports added for completeness; the remaining names come from the earlier example
from copy import deepcopy

import pandas as pd

new_features = pd.read_csv('new_features.csv')
updated_substance_parameters = deepcopy(substance_parameters)
for nf in new_features.columns:
    values = sorted(new_features[nf].unique())
    if len(values) == 1:
        continue
    param = NumericalDiscreteParameter(name=nf, values=values)
    updated_substance_parameters.append(param)

searchspace = SearchSpace(
    discrete=SubspaceDiscrete.from_simplex(
        max_sum=200,
        simplex_parameters=discrete_parameters,
        product_parameters=updated_substance_parameters,
        boundary_only=False,
    )
)
```

Basically, I'm just adding these new features as numerical discrete parameters alongside the substance parameters, which get slotted in as the product parameters when constructing the discrete search space from the simplex (excuse the inaccurate name, I'm aware). Once we add these numerical discrete features, the
Hi @brandon-holt, finally had some time to look into it (these days the workload is a bit heavy 😬). To answer your question whether "adding the features is really that much": have a look at our new helper method
Here you can witness what the term "exponential explosion" really means 🙃. So building that product space is just not possible, we need to find a different approach. Is a discrete product search space really what you want/need? |
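A sketch of the arithmetic behind that "exponential explosion" (the figures here are assumptions for illustration only, not taken from the actual dataset): every new discrete parameter with k unique values multiplies the size of the product space by k.

```python
import math

# Hypothetical starting point: a discrete subspace of ~40,000 configurations.
base_size = 40_000

# 11 new features with ~5 unique values each -- an assumed cardinality.
new_cardinalities = [5] * 11

total = base_size * math.prod(new_cardinalities)
print(f"{total:.1e}")  # ~2.0e+12 candidate configurations
```

Even with modest per-feature cardinalities, eleven extra product dimensions push the space into the trillions, far beyond what can be enumerated in memory.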
I'm wondering if the recommendation times I'm encountering are expected given my setup:

Machine:
- MacBook Air, 15 inch, M2, 2023
- Memory: 16 GB
- OS: Sonoma 14.4.1
- Python: 3.11.8

Model:
- Single NumericalTarget
- Parameters: 4 SubstanceParameters (~140 total SMILES molecules), 4 NumericalContinuousParameters
- Constraints: 4 numerical parameters must sum to 1.0
- Recommender:

```python
TwoPhaseMetaRecommender(
    initial_recommender=RandomRecommender(),
    recommender=SequentialGreedyRecommender(),
)
```
So when I add 1000 datapoints via campaign.add_measurements(), it takes ~4 days to make a recommendation with a batch size of 3. I started a test with only 10 datapoints and it is still running from overnight.
Does this sound expected given my machine, model, and data? If so, what are the recommended ways to improve the speed? For the molecules, I've tried with and without mordred & decorrelation; it doesn't seem to make a big difference.
If this doesn't sound expected, how would you recommend I troubleshoot what could be causing the issue?
Thanks in advance!