Intermittent delays when performing search queries #1475

mikezopa · 2015-07-01T08:43:17Z

We have a strange problem whereby most queries take a very short amount of time to execute (<50ms), but intermittently they will take almost exactly 21 seconds.

While trying to find a solution to the problem, I stumbled across an unanswered stack overflow question describing almost identical behaviour (and use case).

Our use case

We have a small micro service API (using NancyFx) which performs relatively simple queries and filters. We instantiate our ElasticClient at start up, and inject it where required.

We enabled the elasticsearch slow query log on the index, and this shows that queries never take more than ~130ms to execute (and that is seldom). It also doesn't align with these 21s "blips" that we see in the logs.

We have 2 nodes in our production cluster, and the ElasticClient uses a SniffingConnectionPool to utilise both nodes.

At the moment I wonder if it is caused by the client connections to the production cluster being dropped, however, the really odd thing is that this delay is 21 seconds. I would expect the time taken to reconnect to be more random. We have a bizarre spike if we graph the request times. 95% < 1s, 97.5% < 2s, and then nothing until this spike of the last 2.5% of requests at 21s.

Any ideas on things to try would be appreciated.

The text was updated successfully, but these errors were encountered:

Mpdreamz · 2015-07-01T10:23:14Z

Hi mikezopa,

Sorry to hear this, I'm inclined to believe this is something that happens in the bowels of WebRequest itself. googling webrequest and 21 seconds yields numerous results such as this one

WebRequest notoriously measures timeouts from time to send first byte to last byte read. Anything that needs to happen prior (DNS setup, enqueueing connections) uses the default timeout of 20 seconds. So even if you set the Timeout to 1second on the request, a slow DNS lookup can cause the overal request to fail only after 20 seconds.

Are you able to tell us more about your connectionsettings setup, the exception that you see when this intermittent spikes happen?

AFAIK we do properly dispose of all the request streams we start and have extensive unit tests of all the code paths to make sure we close the responsestreams.

We also do extensive Wireshark sessions every now and then to make sure we are reusing TCP connections. To hear you still face these sporadic spikes is saddening but I would love to find out the root cause of this.

mikezopa · 2015-07-06T16:28:48Z

Hi @Mpdreamz,

Sorry for my delayed response, I hadn't noticed that the issue had been updated! Thank you for the fast initial response.

With regards to the connectionsettings setup, we initialise the client like this:

var settings =
                new ConnectionSettings(new SniffingConnectionPool(nodeUris), this.indexName)
                    .ThrowOnElasticsearchServerExceptions();

            this.client = new ElasticClient(settings);

Thanks for the information regarding the WebRequest. We are making a larger number of requests which we have not been doing before. We used to be making a request once every few seconds, but during times of loading, one ElasticClient instance might be making a couple of requests within a second. Could this be upsetting any reuse of TCP connections? Is it wise for us to be using one ElasticClient instance for the entire micro-service?

I don't believe there are any exceptions thrown when this error occurs, the request completes successfully, just very delayed. Unfortunately, the caller of this micro-service gives up after 5 seconds (as we can't wait longer than that for various reasons). Sometimes we see multiples of this delay on our micro-service, so we see spikes in lower counts at 42, 63 seconds etc.

We might have a go at implementing some of these suggestions and let you know what we find!

Mpdreamz · 2015-07-06T19:13:20Z

You can new ElasticClient as many times as you like but if you are using connection pooling make sure each client gets passed the same instance of the SniffingConnectionPool. This should not have bearing on socket reeuse and open TCP connections this is handled by the servicepoint(s) for the urls in the cluster in the .NET framework already. If you do not share IConnectionPool then you might be doing more bookkeeping (sniffiing/ping) calls then necessary though.

We try to follow most of the suggestions in that thread already :) request.Proxy = null; can be a silver bullet for some (and can be controlled with .DisableAutomaticProxyDetection() on ConnectionSettings but usually this is making the very first call on a system very slow not intermittent ones. We also had reports from users where request.Proxy = null actually introduces a slow down on some systems so there really is no sane default there.

If you are running on Azure we typically recommend setting SetTcpKeepAlive() on connectionsettings. This will cause actual keep alive packets to be send periodically on the reused tcp connections. (http://blogs.msdn.com/b/cie/archive/2014/02/14/connection-timeout-for-windows-azure-cloud-service-roles-web-worker.aspx)

A little involved but the get a bit more clarity you could run your microservice with trace enabled on system.net: http://mikehadlow.blogspot.nl/2012/07/tracing-systemnet-to-debug-http-clients.html

This might give some insights into peeks.

mikezopa · 2015-07-08T08:02:24Z

With regards to some more investigation work, I have been unable to reproduce the problem by creating various integration tests which hammer requests to elasticsearch, even when using multiple threads (run via the NUnit test runner). This is leading me to suspect it might be something to do with hosting within IIS and how connections are managed there.

I was hoping to reproduce on my local machine allowing me to use Wireshark and various tracing tools a bit more easily then when located on a server. I will see if I can reliably reproduce using load testing against the micro-service when its running under IIS.

None of the suggestions in the thread helped us unfortunately! :)

Mpdreamz · 2015-07-16T15:41:39Z

Hi @mikezopa closing this due to inactivity but please feel free to report back! This issue for us is never really closed (due to its heisenbug nature) and something we revisit often.

nhusby · 2019-11-21T22:28:22Z

I know I'm kicking a dead horse here, but I was having a similar issue with the first query in an automated test taking ~20 seconds. Turns out, in my case, it was probably a DNS issue. Switching from localhost to 127.0.0.1 solved the problem.

If anyone else happens upon this, maybe that info will be helpful for them.

Mpdreamz closed this as completed Jul 16, 2015

bittusarkar mentioned this issue Jun 28, 2018

Attempting to resurrect a dead node takes 21 seconds even after a ping failure #3308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent delays when performing search queries #1475

Intermittent delays when performing search queries #1475

mikezopa commented Jul 1, 2015

Mpdreamz commented Jul 1, 2015

mikezopa commented Jul 6, 2015

Mpdreamz commented Jul 6, 2015

mikezopa commented Jul 8, 2015

Mpdreamz commented Jul 16, 2015

nhusby commented Nov 21, 2019

Intermittent delays when performing search queries #1475

Intermittent delays when performing search queries #1475

Comments

mikezopa commented Jul 1, 2015

Mpdreamz commented Jul 1, 2015

mikezopa commented Jul 6, 2015

Mpdreamz commented Jul 6, 2015

mikezopa commented Jul 8, 2015

Mpdreamz commented Jul 16, 2015

nhusby commented Nov 21, 2019