Description
Issue description
While using SparkSQL with the ES connector I noticed that all the search requests that hit ES use "track_total_hits": 2147483647
. This seems to have a big impact on query performance, especially for queries that scroll through large datasets.
Steps to reproduce
The following reproduction uses the NOAA dataset from https://github.com/elastic/rally-tracks/tree/master/sql to demonstrate the impact on performance:
Setup
esrally install --distribution-version=8.1.0 --node-name="rally-node-0" --network-host="127.0.0.1" --http-port=39200 --master-nodes="rally-node-0" --seed-hosts="127.0.0.1:39300"
esrally race --pipeline=benchmark-only --target-host=127.0.0.1:39200 --track=sql --race-id="randomId" --include-tasks=tag:setup
Spark Shell
val noaa = spark.sqlContext.read.format("es").options(Map("es.port" -> "39200")).load("weather-data-2016")
noaa.registerTempTable("noaa")
spark.sql("SELECT count(*) FROM noaa WHERE station.elevation > 2000").show()
The ES slowlog shows increasing elasticsearch.slowlog.took
times for the search queries:
{"@timestamp":"2022-03-23T12:49:21.436Z", "log.level": "INFO", "elasticsearch.slowlog.id":"[spark] [lukas] [Spark shell] [local-1648039696262], stage 0, task attempt 0","elasticsearch.slowlog.message":"[weather-data-2016][0]","elasticsearch.slowlog.search_type":"QUERY_THEN_FETCH","elasticsearch.slowlog.source":"{\\\"size\\\":1000,\\\"query\\\":{\\\"bool\\\":{\\\"must\\\":[{\\\"match_all\\\":{\\\"boost\\\":1.0}}],\\\"filter\\\":[{\\\"exists\\\":{\\\"field\\\":\\\"station\\\",\\\"boost\\\":1.0}}],\\\"boost\\\":1.0}},\\\"_source\\\":{\\\"includes\\\":[\\\"station\\\"],\\\"excludes\\\":[]},\\\"sort\\\":[{\\\"_doc\\\":{\\\"order\\\":\\\"asc\\\"}}],\\\"track_total_hits\\\":2147483647}","elasticsearch.slowlog.stats":"[]","elasticsearch.slowlog.took":"99.9ms","elasticsearch.slowlog.took_millis":99,"elasticsearch.slowlog.total_hits":"33659481 hits","elasticsearch.slowlog.total_shards":1 , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.index_search_slowlog","process.thread.name":"elasticsearch[rally-node-0][search][T#11]","log.logger":"index.search.slowlog.query","elasticsearch.cluster.uuid":"6QGG4lquRWGoeb0n2lhEoQ","elasticsearch.node.id":"ia2IuvRdQUmu1AxhGG3TJQ","elasticsearch.node.name":"rally-node-0","elasticsearch.cluster.name":"rally-benchmark"}
...
{"@timestamp":"2022-03-23T12:50:07.599Z", "log.level": "INFO", "elasticsearch.slowlog.id":"[spark] [lukas] [Spark shell] [local-1648039696262], stage 0, task attempt 0","elasticsearch.slowlog.message":"[weather-data-2016][0]","elasticsearch.slowlog.search_type":"QUERY_THEN_FETCH","elasticsearch.slowlog.source":"{\\\"size\\\":1000,\\\"query\\\":{\\\"bool\\\":{\\\"must\\\":[{\\\"match_all\\\":{\\\"boost\\\":1.0}}],\\\"filter\\\":[{\\\"exists\\\":{\\\"field\\\":\\\"station\\\",\\\"boost\\\":1.0}}],\\\"boost\\\":1.0}},\\\"_source\\\":{\\\"includes\\\":[\\\"station\\\"],\\\"excludes\\\":[]},\\\"sort\\\":[{\\\"_doc\\\":{\\\"order\\\":\\\"asc\\\"}}],\\\"track_total_hits\\\":2147483647}","elasticsearch.slowlog.stats":"[]","elasticsearch.slowlog.took":"123.6ms","elasticsearch.slowlog.took_millis":123,"elasticsearch.slowlog.total_hits":"33659481 hits","elasticsearch.slowlog.total_shards":1 , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.index_search_slowlog","process.thread.name":"elasticsearch[rally-node-0][search][T#18]","log.logger":"index.search.slowlog.query","elasticsearch.cluster.uuid":"6QGG4lquRWGoeb0n2lhEoQ","elasticsearch.node.id":"ia2IuvRdQUmu1AxhGG3TJQ","elasticsearch.node.name":"rally-node-0","elasticsearch.cluster.name":"rally-benchmark"}
Altogether, the query fetching 33659481 docs did not finish within 3 hours eventually reaching took
times of > 2s.
Version Info
OS: : OSX
JVM : 11.0.14-tem
Hadoop/Spark: Spark 3.2.1
ES-Hadoop : elasticsearch-spark-30_2.12-8.1.0.jar
ES : 8.1.0