While on a project where we were using ElasticSearch extensively, we had a major problem with intermittent test failures. You know, the best kind of test failure. The kind of failure that makes people stop running tests, disable certain tests or blame quality software like Elasticsearch. Oh the pain of non-deterministic failures.
But what are you going to do? Disable the ElasticSearch tests? Maintain complicated stubs and lie to yourself about the app behaving as you'd expect? Not me. Not us. So we dug deep to resolve the issue. It took an unfortunate amount of fiddling to fix, but we ultimately killed the bug. Failures mean something once again. Oh how good it feels to fail!
So what was actually wrong?
It's important to have a pristine environment when each test starts. In order to ensure this, we were wiping the index and recreating it before every test. Our spec helper had a block like this to ensure that was the case.
... config.before(:each, elasticsearch: true) do array-of-indexed-models.each do |klass| klass.tire.index.delete klass.create_elasticsearch_index end end ...
This is unfortunately heavy handed, but you don't have the luxury of transactional testing or simple truncation of to help you. This is what we have. It's fine! Trust me. However, every now and then we'd get a 503 error back from ElasticSearch Failed to execute phase [query], all shards failed
.
Bummer.
Through careful inspection (via pounding the refresh button on ElasticSearch head) we were able to see that it occasionally took a few extra 10s of milliseconds to prepare the index which was enough to cause the next test to fail. Oh brother, a timing issue. I blame Java GC. Damn you Java and all the free software you've provided to me. This was confirmed by adding a sleep of ~400 milliseconds after the index is created. Voilà! No more intermittent failures. We're done.
What? A sleep in the tests? No effing way padre.
No sleep for the weary
How we ultimately addressed this was by catching and retrying the query when it failed. This required a change to the code where we were executing the query. Normally I'm not happy about altering code just to help tests pass, but intermittent failures and slow tests are more lame so we decided this was the way to go.
We knew we wanted the shape of the retry construct to look something like this.
execute_with_retries 3 do # Your ultra awesome ElasticSearch query end
The above block will execute the block of code three times until it's successful. It's important the search is the last value of the block you pass to execute_with_retries
so it's the return value of the block. And now, in all of its glory, the method itself.
def execute_with_retries(retries = 3, &block) begin retries -= 1 response = block.call rescue Tire::Search::SearchRequestFailed => e if retries > 0 && e.message.match(/SearchPhaseExecutionException\[Failed to execute phase \[query\], all shards failed\]/) retry else raise e end end end
What you'll notice here is that we're very specific about what error we're catching and retrying on. We don't want other errors triggering a retry, so we only catch the specific error and match on the exception's message. If you're interested catching and retrying another exception it's easy enough to modify. In our case, this is a private method in the model because it's the only place where calls to search are made.
You didn't think we talk all this nonsense about tests and not test the retry method? This would not make Michael happy. Oh and some tests for this retry method.
describe "#execute_with_retries" do let(:model_search) { ModelSearch.new } it "retries 3 times by default" do flags = [] error_block = Proc.new do flags << 1 raise Tire::Search::SearchRequestFailed, "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed]" end expect do model_search.send(:execute_with_retries, &error_block) end.to raise_error(Tire::Search::SearchRequestFailed) expect(flags.size).to eq(3) end it "only rescues tire search request failed, 503s, when all shards fail" do flags = [] error_block = Proc.new do flags << 1 raise Tire::Search::SearchRequestFailed, "does not match error" end expect do model_search.send(:execute_with_retries, &error_block) end.to raise_error(Tire::Search::SearchRequestFailed) expect(flags.size).to eq(1) end it "outputs a message when retrying" do model_search.should_receive(:puts).twice flags = [] error_block = Proc.new do flags << 1 raise Tire::Search::SearchRequestFailed, "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed]" end expect do model_search.send(:execute_with_retries, &error_block) end.to raise_error(Tire::Search::SearchRequestFailed) end it "allows non-tire errors to bubble up" do flags = [] error_block = Proc.new do flags << 1 raise NoMethodError end expect do model_search.send(:execute_with_retries, &error_block) end.to raise_error(NoMethodError) expect(flags.size).to eq(1) end end
Conclusion
So there you have it. There are other blogs out there on how to retry blocks of code in Ruby but I wanted to talk about the specifics of this ElasticSearch failure and how we decided to retry only if that specific error was causing us pain.
The only other alternatives were to rip out ElasticSearch tests, not run them as frequently, or use stubs. That's nonsense. We can sleep easier knowing we have real tests around our ElasticSearch. I'll pour a little liquor on the ground for the overhead of an additional method call, but nobody's going to miss those CPU cycles anyways.
DevMynd - software development companies in Chicago with practice areas in digital strategy, human-centered design, UI/UX, and web application and custom mobile development.
