Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving search performance while using --full flag #8

Closed
manuyavuz opened this issue Sep 21, 2015 · 29 comments
Closed

Improving search performance while using --full flag #8

manuyavuz opened this issue Sep 21, 2015 · 29 comments

Comments

@manuyavuz
Copy link
Member

Hi,

I like this pod search --full command very much, however i think it has performance drawbacks which makes me hesitate while using it.

First of all, it makes an update index operation whenever it's called, even if an index was generated before. This makes the command very slow to operate.

Secondly, it generates a hash index such that keys are the pod names and values are strings which are created by concatenating podspec information such as summary, description and authors names. While searching for a given query, it traverses whole these value strings and perform a gsub() operation on them.

I have the following proposal to improve search speed for full text search scenario:

  • Search index generation should be performed only when master repo or another private repo is updated (i.e. after performing git pull --ff-only). Operation should perform as a subprocess to not to block user for pod install and pod update commands.
  • Search index should have a different format which is more suitable for search scenarios. Specifically, index should be a hash index where keys are words used in whole spec repository, and values are names of the pods which contains given word inside it's spec information. This eliminates need to run gsub() on long strings repeatedly. We need to check query with keys only.

I've actually implemented proposed scenario, and made some benchmarks. Results are as follows (_default_ means current implementation, _fast_ means proposed implementation):

Generating Index From Scratch

    user            system      total     real
    defaultIndex:   12.540000   1.130000  13.670000 ( 13.705714)
    fastIndex:      12.800000   0.990000  13.790000 ( 13.797296)

There is no important difference on indexing performance.

Searching 'afnetworking' Query

    user        system     total      real
    default:    7.390000   0.370000   7.760000 (  7.760688)
    fast:       0.470000   0.020000   0.490000 (  0.500742)

Here the proposed method offers a huge performance gain. This gain results from two improvements:

  1. Performing a proactive indexing policy eliminates the need to update index during command execution. We are sure that stored index is up-to-date.
  2. Evaluating query only for repo vocabulary (thanks to the new index format).

To measure the effects individually, I run one more benchmark, which disables default implementation's update index policy. In this case, it just gets index from the file and performs search on it. Here are the results:

    user                        system      total       real
    defaultWithoutUpdateIndex:  2.140000    0.040000    2.180000 (  2.189329)
    fast:                       0.510000    0.030000    0.540000 (  0.548586)

I think the gain of the new index format is pretty obvious.

Conclusion

Proposed method can remarkably improve search performance for full text search scenario. I will generate a PR containing the implementation, but I have a concern for this. Change contains modifications on three classes each belonging to different repositories (CocoaPods/CocoaPods, CocoaPods/Core, CocoaPods/cocoapods-search). How do you perform PR reviews for these cases? Should I create PR's for each repo and give links to each other? Or are there any other suitable methods?

Thanks so much for giving your time to this long issue description!

@orta
Copy link
Member

orta commented Sep 22, 2015

Awesome, so we generally create multiple PRs that all point to one issue as the source of the "why", which you've provided. Leaving the individual PRs free to talk about the "how."

One of the downsides is that our Specs repo change's pretty often.

screen shot 2015-09-22 at 07 38 35

But if you're using search to find things at the same time then this makes sense.

It's a shame that we can't offer the cocoapods.org search for your own specs repo. Otherwise I was going to say you could get similar performance gains + way more flexibility by switching to using our API.

@manuyavuz
Copy link
Member Author

Hi @orta,

I've also thought about the frequent refresh rate of the specs repo. Even though it won't bother user's pod install/update speed, making index update on every such command would be unnecessary for users that do not use command line search much.

What is the index update policy for the cocoapods.org search? I've seen an issue on the repo about updating index periodically using a time interval, however it was not specifying current policy.

We should think more about the index update policy. However, we can change index format now since it seems like it does not have a similar drawback. So it would be better to split changes to different PRs.

@orta
Copy link
Member

orta commented Sep 22, 2015

The search can get incremental updates from trunk via web hooks for every pod, so it has a very different problem ( and a completely unique index - I expect, I don't know the details. )

I wonder if ideal answer is that we tie the index to the commit SHA, don't do a repo update but at the end of the results we check for if the HEAD SHA on the repo remote is different and if so, say " you might want to run pod repo update to ensure your results are up-to-date." Allowing for fast search, and letting the user choose whether to update?

@manuyavuz
Copy link
Member Author

Relating search index with the commit SHA would be very reasonable. Your suggestion works for most people, so we can go for it.

What I wonder is if a more ideal scenario is possible. Would it be better to update search index incrementally using sth like git diff --name-only OLD_SHA..NEW_SHA and use output to get changed specs and update index for these only. This would be much more fast, even negligible on pod install/update commands. This gives us an always up-to-date index in a much more efficient way.

Here is a sample output for git diff on master specs repo:

[~/.cocoapods/repos/master on master ✓ ] % git diff --name-only b28531c3..13e8fc6a

Specs/CoreStore/1.3.1/CoreStore.podspec.json
Specs/DRYUtilities/1.4.2/DRYUtilities.podspec.json
Specs/GCDKit/1.1.2/GCDKit.podspec.json
Specs/Medable/1.1.1/Medable.podspec.json
Specs/NKNetworkKit/0.2.4/NKNetworkKit.podspec.json
Specs/VeesoPlayer/2.0.22/VeesoPlayer.podspec.json

However, if it feels too hackie, your suggestion also fits great.

@orta
Copy link
Member

orta commented Sep 22, 2015

That feels like the perfect answer @manuyavuz

@manuyavuz
Copy link
Member Author

Great then. I need to mature solution to index update problem, so I will submit two different PRs, one for the index format(which will hopefully be done today), other for updating the index (soon).

Thanks @orta!

@alloy
Copy link
Member

alloy commented Sep 23, 2015

I’m loving this, superb work @manuyavuz! 👏 ✨

@manuyavuz
Copy link
Member Author

Hey @alloy, thanks!

And there is more to come soon 😋

@manuyavuz
Copy link
Member Author

Hey, I have a question.

What's the best way to get the name of the pod from the following string?

Specs/LazyPageScrollView/1.0/LazyPageScrollView.podspec.json

I don't want to do hard-coded splitting. I'm sure there is a convenient method for this inside somewhere in the projects :)

cc @segiddins

@segiddins
Copy link
Member

Where's the string coming from? I'm pretty sure Source has all of the necessary APIs, but I can't promise anything.

@manuyavuz
Copy link
Member Author

I have this inside sources_manager.rb file. I also have the source object.

@segiddins
Copy link
Member

But where are you getting that path from? Ideally, the source could yield the version and name when iterating, but if you're reading the spec anyways, you could read the name off of it.

@manuyavuz
Copy link
Member Author

Oh sorry I misunderstood.

I get it from the output of git diff --name-only PREV_SHA..HEAD command. So I just get it as a string.

@segiddins
Copy link
Member

I'd add a method on source that can do the splitting, then.

@manuyavuz
Copy link
Member Author

Umm, ok for me. Is there any sample that currently performs a similar kind of splitting inside the project? For instance which parts can I assume that it won't change?

I see that 'Specs' string is hardcoded inside specs_dir method of source.rb file, so I can also use it as hardcoded, right?

@alloy
Copy link
Member

alloy commented Oct 6, 2015

The filesystem layout is stable.

Will search still work without a spec repo being inside a git repo? (Iirc, there is for instance a svn spec repo plugin.)

@manuyavuz
Copy link
Member Author

Good question.

I was planning to stick to the behavior inside update method of sources_manager.rb. I see that it currently performs update only for git sources. How pod install/update commands work for non-git spec repos? Are they updated in some other place?

@alloy
Copy link
Member

alloy commented Oct 6, 2015

It’s true that the built-in code for updating spec-repos only supports git, but here’s that plugin I was referring to that makes it work for svn as well https://github.com/clarkda/cocoapods-repo-svn.

The goal should generally always be to have it work as scm-agnostic as possible.

So will it still work, but not support partial updates? If so, that’s fine, someone like @clarkda can then add partial update support if they need it.

@manuyavuz
Copy link
Member Author

I got the point. Then, we may still perform index updates for non-git sources while performing search operation. This way, we still gain from performance on git specs (which is the default), and pod search command would also work for users with non-git sources, but it will be slow.

@alloy
Copy link
Member

alloy commented Oct 6, 2015

I think that’s a fine trade-off, as other spec repos will probably much smaller than the master spec repo anyways 👍

@manuyavuz
Copy link
Member Author

Actually they will still gain from speed up resulted from the new index format. However, there will be a re-index for non-git sources whenever search command is run.

Or maybe we could apply @orta's suggestion and perform a search on current search index and print a message directing to pod repo update. However, I should find a way to detect if a non-git repo is outdated in order to do that.

@alloy
Copy link
Member

alloy commented Oct 6, 2015

Nah, it doesn’t sound like the added complexity is worth it, simpler is better 👍

@manuyavuz
Copy link
Member Author

Hi again,

I'm nearly done but need help for some decisions to be taken.

If a user runs pod install/update command, the code detects changed pods and updates the search index if there is one. What should we do if there is no search index created before? Should we also create it during install/update commands? It will take approximately 10-15 secs but do not block user because I run it as a fork subprocess.

If a user runs pod search --full QUERY command, code returns results using the existing search index. Should we try to update spec repos if they are outdated? It may take some time because currently running git pull on master repo is somewhat slow due to repo being huge.

According to suggestions, I will complete implementation and update the PRs.

@segiddins
Copy link
Member

I don't think pod search should update the repos

@alloy
Copy link
Member

alloy commented Oct 7, 2015

What should we do if there is no search index created before? Should we also create it during install/update commands? It will take approximately 10-15 secs but do not block user because I run it as a fork subprocess.

What about doing that on the first ‘search’ and print out what it’s doing?

This has the added benefit of never spending time on updating the index if the user never actually uses search.

Should we try to update spec repos if they are outdated?

I agree with @segiddins.

@manuyavuz
Copy link
Member Author

It sounds good.

  • The search index will be created only during pod search command.
  • pod search will run with existing index if there is one.
  • Search index will be updated only during pod install/update commands.

@alloy
Copy link
Member

alloy commented Oct 7, 2015

Sounds good 👍

@manuyavuz
Copy link
Member Author

I've completed general implementation and updated PRs CocoaPods/CocoaPods#4249 and CocoaPods/Core#265.

Need some thinking for specs.

Feedbacks are welcome!

@manuyavuz
Copy link
Member Author

I think we can close the issue because all changes have been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants