This service uses puppeteer(to be deprecated), playwright and axios to crawl pages.
It allows to scroll, take screenshots and get images encoded as Base64.
Out of scope: it allows to get apps from playstore, in the future it will be migrated to a new service.
Dev environment
yarn start
or
make run
A Dockerfile
is provided which install chrome inside the container.
WEB_PORT = 3000
WEB_TIMEOUT = 150 # segs
JWT_SECRET = my-secret-hash
JWT_ALG = HS256
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium-browser
HEADLESS = true
REDIS = redis://127.0.0.1:6379 # with localhost it tries ipv6, redis is disabled
If JWT_ALG
is "ES512", then JWT_SECRET
must contain the absolute or relative path to the public key:
JWT_ALG = "ES512"
JWT_SECRET = ".secrets/public.key"
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH
is set by default and is related to where Alpine install the chromiun browser.
This variable doesn't belong to playwright, and it is used becose playwright uses their own browser binaries, which are not compatible for alpine.
⚠️ In the future a Dockerfile.debian version could be provided. It requires a change in how the dockerfile is structured.
The project follows a calendar version release as YY.MM.DD_PATCH[-MODIFIER]
, where:
- YY short year: 23, 24..
- MM short month: 01, 02...
- DD Short day: 01, 02, 03
- PATCH: bug fix or changes 1, 2, 3
- MODIFIER: lts, rc0, rc2...
As example, the version 23.08.10_2
means from the year 2023 of month August from day 10th, patch 0.
MODIFIER field is optional.
const browserConfType =
{
headless: Joi.boolean().default(headless),
emulation: Joi.object().keys(emulationType).default(emulationDefault),
proxy: Joi.object().keys(proxyType).allow(null).default(null),
}
EmulationType
const emulationType =
{
locale: Joi.string().default("en-US"),
timezoneId: Joi.string().default("America/New_York"),
isMobile: Joi.boolean().default(false),
viewport: Joi.object().keys(viewPortType),
geoEnabled: Joi.boolean().default(false),
geolocation: Joi.object().keys(geoType).optional().allow(null),
}
emulationDefault:
const emulationDefault = {
locale: "en-US",
timezoneId: "America/New_York",
isMobile: false,
viewport: { width: 1280, height: 720},
geoEnabled: false,
geolocation: {
longitude: 40.6976312,
latitude: -74.1444858
}
}
ProxyType
const proxyType =
{
server: Joi.string().required(),
username: Joi.string(),
password: Joi.string()
}
const crawlPageType = Joi.object(
{
// Valid formed url to open
url: Joi.string().required(),
// Timeout in secs
ts: Joi.number().default(defaultTs),
// Visible text of an element to wait
waitElement: Joi.string().optional().allow(null),
// Take a screenshot of the fullpage
screenshot: Joi.bool().default(false),
// Save cookies for the domain of the url
useCookies: Joi.bool().default(false),
// If ture, the browser will have a fresh start
cleanCookies: Joi.bool().default(false),
// Derecated
cookieId: Joi.string().allow(null).default(null),
// Headers used only for axios
headers: Joi.any().allow(null),
// Browser configuration see BrowserConfType
browser: Joi.object().keys(browserConfType).optional().allow(null).default(defaultBrowserConf),
}
)
-
GET /
- 200 Root status
-
GET /metrics
- 200 prometheus stats
-
GET /v6/image
- Uses axios to get an image encoded as base64
- Query Params: url
-
POST /v6/axios
- 200 if everything ok, 500 if something went wrong
- body:
CrawlPageType
- notes: ProxyType not used in axios
- POST /v6/chrome
- 200 if everything ok, 500 if something went wrong
- body:
CrawlPageType
- response 200:
fullurl
[string]: Raw html of the responsecontent
[string]: Raw html of the responseheaders
[object]: not usedstatus
[number]: status code, 200 or 500fullLoaded
[bool]: if the page was loaded completlyscreenshot
[string]: Base64 encoded imageerror
[string]: any message errorcookieId
[string]: generated
- response 500:
error
[string]: message error
CrawlDuckGoType
const crawlDuckGoType = Joi.object(
{
// a query to search in duckduckgo.com
text: Joi.string().required(),
// timeout in secs
ts: Joi.number().default(defaultTs),
// How many clicks on "More Results"
moreResults: Joi.number().default(1),
// "ar-es" by default.
region: Joi.string().default("ar-es"),
// "Any Time", "Past day", "Past week", "Past month", "Past year". Null by default
timeFilter: Joi.string().default(null).allow(null),
// Take a screenshot of full rendered page
screenshot: Joi.bool().default(false),
// It will store and load cookies
useCookies: Joi.bool().default(true),
// Deprecated
cookieId: Joi.string().allow(null).default(null),
browser: Joi.object().keys(browserConfType).optional().allow(null).default(defaultBrowserConf),
}
)
For regios codes, check see regions codes
- POST /v6/duckduckgo
- 200 if everything ok, 500 if something went wrong
- body
crawlDuckGoType
- response 200:
query
[string]: Parsed queryfullurl
[string]: Fullurlcontent
[string]: Raw html of the responseheaders
[object]: Emptystatus
[number]: status code, 200 or 500links
[List[{href:text}]]: uri of the next pagefullLoaded
[bool]: if the page was loaded completlyscreenshot
[string]: Base64 encoded imageerror
[string]: any message errorcookieId
[string]: generated
- response 500:
error
[string]: message error
crawlGoogleType
const crawlGoogleType = Joi.object(
{
// a query to search in google.com
text: Joi.string().required(),
// timeout in secs
ts: Joi.number().default(defaultTs),
// It will performs a "PgDown" actions for `moreResults` times.
moreResults: Joi.number().default(1),
// region: Joi.string().default("countryAR"),
region: Joi.string().default("Argentina"),
// country
cr: Joi.string().default("US"),
// interfaz lang
hl: Joi.string().default("en"),
// "Any Time", "Past hour", "Past 24 hours", "Past week", "Past month", "Past year". Null by default
timeFilter: Joi.string().default(null).allow(null),
// Take and screenshot
screenshot: Joi.bool().default(false),
// use cookies
useCookies: Joi.bool().default(true),
// deprecated
cookieId: Joi.string().allow(null).default(null),
browser: Joi.object().keys(browserConfType).optional().allow(null).default(defaultBrowserConf),
}
)
- POST /v6/google
- 200 if everything ok, 500 if something went wrong
- body
crawlGoogleType
- response 200:
query
[string]: Parsed queryfullurl
[string]: Fullurlcontent
[string]: Raw html of the responseheaders
[object]: Emptystatus
[number]: status code, 200 or 500links
[List[{href:text}]]: uri of the next pagefullLoaded
[bool]: if the page was loaded completlyscreenshot
[string]: Base64 encoded imageerror
[string]: any message errorcookieId
[string]: generated
- response 500:
error
[string]: message error
⚠️ Playstore API could be very inestable, for more information refer to https://github.com/facundoolano/google-play-scraper
-
GET /v1/playstore/:appid
- Get app detail based on the appid
-
POST /v1/playstore/list
-
POST /v1/playstore/search
- Perform a search in google playstore
- body
term
: the term to search by.num
(optional, defaults to 20, max is 250): the amount of apps to retrieve.lang
(optional, defaults to'en'
): the two letter language code used to retrieve the applications.country
(optional, defaults to'us'
): the two letter country code used to retrieve the applications.fullDetail
(optional, defaults tofalse
): iftrue
, an extra request will be made for every resulting app to fetch its full detail.price
(optional, defaults toall
): allows to control if the results apps are free, paid or both.all
: Free and paidfree
: Free apps onlypaid
: Paid apps only
-
POST /v1/playstore/similar
- Returns a list of similar apps to the one specified
- body:
appId
: the Google Play id of the application to get similar apps for.lang
(optional, defaults to'en'
): the two letter language code used to retrieve the applications.country
(optional, defaults to'us'
): the two letter country code used to retrieve the applications.fullDetail
(optional, defaults tofalse
): iftrue
, an extra request will be made for every resulting app to fetch its full detail.
Example:
curl http://localhost:3000/v1/chrome?url=https://www.google.com/doodles/
url must have the protocol screen is a optional param, any value is taked as true
duckduckgo
Check https://duckduckgo.com/settings
Copy the value of the option:
- For "All regions" the value is
wt-wt
- For "Argentina" the value is
ar-es
google: Check https://www.google.com/preferences Copy as the text shown in Region settings part:
- For "Brazil", the value is
Brazil
- For "Agentina", the value is
Argentina
2023-08-10
-
Restarting changelog from now.
-
Main endpoints (chrome, axios, image, google and duckduck) will follow simple version schema: Breaking changes in the response or the payload for req/resp will imply a new version in the enpoint.
-
calendar version release adopted for the project. More detail in versioning, but the new format is:
YY.MM.DD_PATCH[-MODIFIER]
. As example, this version will be:23.08.10_0
-
Cookies are now per instance, stored locally. In version 5, cookies were shared cross instances using redis. Because instances could be hosted in different servers with different IPs and having requests with the same cookies from different IPS is not recommended. Instead, each instance store locally their cookies, which is aligned with how actually a browser works.
-
Duckduckgo parsing of links disabled by now.y
-
Google try first to get the search button using "Search" if it fails, it will try "Buscar"
-
A
HEADLESS
env added. Iffalse
it will open the browser. Useful for debugging.
Before
⚠️ From V4 endpoint, it uses Playwright instead of puppeteer