Hi kmike, I'm using scrapy-splash and have run into an issue. The first time I run 'scrapy crawl toutiao' it works correctly, but the second time I run it, it fails.
I traced the issue to the headers I add: without the headers every run works, but with the headers the second run fails.
The Lua script and the spider follow. I need your help, thanks.
code:
import scrapy
import json
from scrapy_splash import SplashRequest
from scrapy.http.headers import Headers
script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))
  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    headers = last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response.status,
  }
end
"""
HEADERS = Headers({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'compress',
    'Accept-Language': 'en-US',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Host': 'm.toutiao.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36',
})
class MySpider(scrapy.Spider):
    name = "toutiao"

    def __init__(self):
        self.start_url = "https://m.toutiao.com"

    def start_requests(self):
        yield SplashRequest(url=self.start_url,
                            callback=self.parse_result,
                            endpoint='execute',
                            cache_args=['lua_source'],
                            args={'lua_source': script, 'http_method': 'GET'},
                            headers=HEADERS)

    def parse_result(self, response):
        print("ok")
        print(response.headers)
The first run is correct:
ok
{b'Vary': [b'Accept-Encoding, Accept-Encoding, Accept-Encoding'], b'Timing-Allow-Origin': [b'*'], b'Set-Cookie': [b'tt_webid=653006869922952004; Max-Age=7776000'], b'Transfer-Encoding': [b'chunked'], b'Content-Type': [b'text/html; charset=utf-8'], b'Connection': [b'keep-alive'], b'X-Tt-Timestamp': [b'152040098.652'], b'X-Ss-Set-Cookie': [b'tt_webid=653006899221952004; Max-Age=7776000'], b'Server': [b'Tengine'], b'Via': [b'cache1.cn406[13,0]'], b'Content-Encoding': [b'gzip'], b'Eagleid': [b'dcb54e411524000986256455e'], b'Date': [b'Wed, 07 Mar 2018 05:21:38 GMT']}
The second run fails with an error:
2018-03-07 13:18:54 [scrapy.core.engine] INFO: Spider opened
2018-03-07 13:18:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-07 13:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-07 13:18:55 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'info': {'message': 'Lua error: [string "..."]:14: attempt to index field \'?\' (a nil value)', 'type': 'LUA_ERROR', 'source': '[string "..."]', 'error': "attempt to index field '?' (a nil value)", 'line_number': 14}, 'description': 'Error happened while executing Lua script', 'error': 400, 'type': 'ScriptError'}
2018-03-07 13:18:55 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://m.toutiao.com via http://172.17.0.2:8050/execute> (referer: None)
2018-03-07 13:18:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://m.toutiao.com>: HTTP status code is not handled or not allowed
2018-03-07 13:18:55 [scrapy.core.engine] INFO: Closing spider (finished)
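The Lua message "attempt to index field '?' (a nil value)" is what Lua reports when an expression such as entries[#entries] evaluates to nil, which would happen if splash:history() returned an empty table on the second run. That reading is an assumption, not a confirmed diagnosis. A minimal sketch of a guarded variant of the script follows; the name GUARDED_SCRIPT and the extra history_length field are hypothetical and only there to help check whether the history really is empty:

```python
# Hypothetical variant of the script above. The guard assumes the failure
# comes from an empty splash:history(); it does not explain why the history
# would be empty on the second run.
GUARDED_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local entries = splash:history()
  -- entries[#entries] is nil when no history entry was recorded
  local last_entry = entries[#entries]
  local last_response = last_entry and last_entry.response or nil

  return {
    -- keys whose value is nil are simply omitted from the result
    headers = last_response and last_response.headers or nil,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response and last_response.status or nil,
    -- debugging aid (hypothetical field): 0 means the history was empty
    history_length = #entries,
  }
end
"""
```

To try it, pass args={'lua_source': GUARDED_SCRIPT, 'http_method': 'GET'} in the SplashRequest and inspect response.data['history_length'] in the callback. If it is 0 on the second run, the script no longer crashes, but the response headers and status are still missing, so the underlying cause would still need to be found.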