How to use Selenium to crawl websites
OpenSearchServer 1.5 introduces a powerful scripting feature. It is now possible to drive a web browser (Firefox, PhantomJS) using a written script.
The new scripting feature can:
- Open a web browser,
- Execute Javascript,
- Extract data using an XPATH query, a CSS selector, or the ID of the web element,
- Insert documents in an index.
A set of REST APIs was created to manage the scripts. Each script can be stored, updated, executed, or deleted by referencing its name.
This new feature is particularly useful when crawling websites protected by a login/password form, websites that rely heavily on JavaScript to display content, or pages where the values to be indexed can only be identified by precise CSS or XPATH selectors.
JSON script
Each command is a JSON structure with a command value and an array of parameters.
{
"command": "WEBDRIVER_OPEN",
"parameters": [ "FIREFOX"]
}
A script is an array of JSON command structures. Here is an example:
[
{ "command": "WEBDRIVER_OPEN", "parameters": [ "PHANTOMJS"] },
{ "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
{ "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] },
{ "command": "WEBDRIVER_GET", "parameters": [ "{url}" ] },
{ "command": "SLEEP", "parameters": [ 1 ] },
{ "command": "INDEX_DOCUMENT_NEW", "parameters": [ "english" ] },
{ "command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": [ "url", "{url}" ] },
{ "command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": [ "title", "Test" ] },
{ "command": "CSS_SELECTOR_INDEX_FIELD", "parameters": [ "content", "div ul li a" ] },
{ "command": "INDEX_DOCUMENT_UPDATE" }
]
This script executes the following actions:
- open a PhantomJS window,
- set a timeout duration of one minute,
- set the width and the height of the window to 1024×768,
- load a web page, based on the {url} parameter,
- wait for one second,
- create an index document,
- set the url and title fields,
- locate text elements within the web page using a CSS selector, and store them in the content field,
- insert the document into the index.
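When scripts are generated by a program rather than written by hand, the same structure can be assembled from plain data. The following is a minimal Python sketch that rebuilds the example script and serializes it to JSON. The helper name `command` is hypothetical and is not part of OpenSearchServer.

```python
import json


def command(name, *parameters):
    """Build one command structure; 'parameters' is omitted when there are none."""
    cmd = {"command": name}
    if parameters:
        cmd["parameters"] = list(parameters)
    return cmd


# Same sequence of commands as the example script shown earlier.
script = [
    command("WEBDRIVER_OPEN", "PHANTOMJS"),
    command("WEBDRIVER_SET_TIMEOUTS", 60, 60),
    command("WEBDRIVER_RESIZE", 1024, 768),
    command("WEBDRIVER_GET", "{url}"),
    command("SLEEP", 1),
    command("INDEX_DOCUMENT_NEW", "english"),
    command("INDEX_DOCUMENT_ADD_VALUE", "url", "{url}"),
    command("INDEX_DOCUMENT_ADD_VALUE", "title", "Test"),
    command("CSS_SELECTOR_INDEX_FIELD", "content", "div ul li a"),
    command("INDEX_DOCUMENT_UPDATE"),
]

# The JSON text is what would be sent as the script payload.
script_json = json.dumps(script, indent=2)
print(script_json)
```

Building the script as a list of dictionaries keeps the command order explicit and leaves quoting and escaping to the JSON serializer.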
RESTful API
A set of RESTful APIs is available.
Saving the script
To store the script, use the following call:
- URL: http://localhost:9090/services/rest/index/script/script/{script_name}
- HTTP Method: PUT
- HTTP Header: Content-Type: application/json
- Payload: The JSON script
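As an illustration, the call above could be prepared from Python with the standard library as sketched below. The function name `build_save_request`, the script name `my_script`, and the shortened script are assumptions made for this example; only the URL, method, and header come from the description above.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the call description above.
BASE_URL = "http://localhost:9090/services/rest/index/script/script"


def build_save_request(script_name, script):
    """Prepare the PUT request that stores a script under the given name."""
    body = json.dumps(script).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + "/" + urllib.parse.quote(script_name),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Shortened script used only to keep the example small.
script = [
    {"command": "WEBDRIVER_OPEN", "parameters": ["PHANTOMJS"]},
    {"command": "WEBDRIVER_GET", "parameters": ["{url}"]},
    {"command": "WEBDRIVER_CLOSE"},
]

# "my_script" is a hypothetical script name.
request = build_save_request("my_script", script)
print(request.get_method(), request.full_url)

# To send it to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The request is only built here, not sent, so the sketch can be tried without a running server.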
Running the script
To run the script, use the following call:
- URL: http://localhost:9090/services/rest/index/script/script/{script_name}/run
- HTTP Method: POST
- HTTP Header: Content-Type: application/json
- Payload: a JSON structure holding the variables used by the script, for example:
{"url": "http://www.open-search-server.com", "name": "John Doe" }
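The run call can be prepared in the same way. This is a minimal Python sketch, assuming a script was previously stored under the hypothetical name `my_script`; the function name `build_run_request` is also an assumption. The variables are the ones from the example payload.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the call description above.
BASE_URL = "http://localhost:9090/services/rest/index/script/script"


def build_run_request(script_name, variables):
    """Prepare the POST request that runs a stored script with variables."""
    body = json.dumps(variables).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + "/" + urllib.parse.quote(script_name) + "/run",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Each key becomes a variable usable in the script as {url}, {name}, ...
variables = {"url": "http://www.open-search-server.com", "name": "John Doe"}

# "my_script" is a hypothetical script name.
request = build_run_request("my_script", variables)
print(request.get_method(), request.full_url)

# To send it to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```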
Subscript
A script can call a subscript for each web element found by a selector.
[
{ "command": "WEBDRIVER_OPEN", "parameters": [ "PHANTOMJS" ] },
{ "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
{ "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] },
{ "command": "WEBDRIVER_GET", "parameters": [ "http://www.dmoz.org/" ] },
{ "command": "SLEEP", "parameters": [ 3 ] },
{ "command": "CSS_SELECTOR_SUBSCRIPT", "parameters": [ "dmoz_sub", "div#catalogs span a" ] },
{ "command": "WEBDRIVER_CLOSE" }
]
This script will extract all the root categories of the homepage of Dmoz.org. Then, for each category found, the subscript dmoz_sub will be called.
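The content of dmoz_sub is not shown here. Purely as an illustration, a subscript of this kind might look like the Python sketch below, which prints a candidate JSON script. Everything in it is an assumption: the choice of commands, opening a separate window for each category, the selector, and the field names. It only relies on the documented fact that the href of each selected element is passed to the subscript as the {url} variable.

```python
import json

# Hypothetical content for the "dmoz_sub" subscript (not taken from the
# original documentation). The {url} variable is set by the calling script
# to the href of each element matched by the selector.
dmoz_sub = [
    # Assumption: work in a separate window so the calling page stays loaded.
    {"command": "WEBDRIVER_NEW_WINDOW"},
    {"command": "WEBDRIVER_GET", "parameters": ["{url}"]},
    {"command": "SLEEP", "parameters": [1]},
    {"command": "INDEX_DOCUMENT_NEW", "parameters": ["english"]},
    {"command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": ["url", "{url}"]},
    # Placeholder selector reused from the first example script.
    {"command": "CSS_SELECTOR_INDEX_FIELD", "parameters": ["content", "div ul li a"]},
    {"command": "INDEX_DOCUMENT_UPDATE"},
    {"command": "WEBDRIVER_CLOSE_WINDOW"},
]

print(json.dumps(dmoz_sub, indent=2))
```

The subscript would have to be stored under the name dmoz_sub before the calling script is run.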
Full list of commands
- WEBDRIVER_OPEN: open a web browser
  - parameter 1: name of driver. Possible values: PHANTOMJS, FIREFOX.
- WEBDRIVER_CLOSE: close the browser
- WEBDRIVER_NEW_WINDOW: open a new window and keep the session running
- WEBDRIVER_CLOSE_WINDOW: close the current window
- WEBDRIVER_SET_TIMEOUTS: define a timeout delay after which script execution will be stopped.
  - parameter 1: delay, in seconds
- WEBDRIVER_RESIZE: resize window
  - parameter 1: width
  - parameter 2: height
- WEBDRIVER_GET: access a URL
  - parameter 1: URL. Variables can be used within this parameter.
- SLEEP: pause script execution
  - parameter 1: time, in seconds
- CSS_SELECTOR_SUBSCRIPT, XPATH_SELECTOR_SUBSCRIPT: select an element and run a subscript. The selected element must be an <a> or an <img>. The href or src attribute will be passed as the {url} variable of the subscript.
  - parameter 1: name of the script
  - parameter 2: selector, CSS or XPATH depending on the command.
- WEBDRIVER_JAVASCRIPT: execute some JavaScript.
  - parameter 1: JavaScript commands to execute. Variables can be used within this parameter.
- SCRIPT: call another script
  - parameter 1: name of the script
- CSS_SELECTOR_CLICK_AND_SCRIPT, XPATH_SELECTOR_CLICK_AND_SCRIPT: select an element, click on it, wait some time and run a script
  - parameter 1: selector, CSS or XPATH depending on the command.
  - parameter 2: name of the script
  - parameter 3: time to wait between the click and the script execution
- INDEX_DOCUMENT_NEW: create a new document to be indexed
  - parameter 1: name of the new document
- INDEX_DOCUMENT_ADD_VALUE: add a value in the current new document
  - parameter 1: field
  - parameter 2: value. Variables can be used in this parameter.
- INDEX_DOCUMENT_UPDATE: commit the current new document to the index
- VAR_NEW_REGEX: create a new variable by using a regexp on a variable. The new variable can then be used in the script and subscripts.
  - parameter 1: variable on which to apply the regexp
  - parameter 2: regexp. The capture group will give the new variable its value.
  - parameter 3: new variable name
- CSS_SELECTOR_DOWNLOAD, XPATH_SELECTOR_DOWNLOAD: select an element and download the URL within the href attribute. The selected element must be an <a>.
  - parameter 1: directory into which the file will be downloaded. It will be fully (recursively) created if it does not exist. Variables can be used within this parameter.
  - parameter 2: selector, CSS or XPATH depending on the command.
- WEBDRIVER_DOWNLOAD: download the current file indicated by the {url} variable.
  - parameter 1: directory into which the file will be downloaded. It will be fully (recursively) created if it does not exist. Variables can be used within this parameter.
- PARSER_MERGE: merge every PDF document from a directory into a new PDF document.
  - parameter 1: name of the PDF parser. Value can only be "PDF parser" at this stage.
  - parameter 2: directory that contains the PDF files to merge.
  - parameter 3: full path (including filename) to the PDF to be created
- SEARCH_TEMPLATE_JSON: execute a query on the index and run an action depending on the result.
  - parameter 1: name of the query template to use for the search
  - parameter 2: keywords to use for the search query
  - parameter 3: JSON path to match a specific part of the result
  - parameter 4: action to take. Values can be:
    - EXIT_IF_NOT_FOUND: exit current script if the JSON path does not match
    - IF_FOUND: if the JSON path matches:
      - parameter 5: must be NEXT_COMMAND
      - parameter 6: next command to run. Will often be WEBDRIVER_CLOSE_WINDOW
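To make the behavior of VAR_NEW_REGEX and of {variable} placeholders more concrete, here is a small Python sketch that mimics them outside OpenSearchServer. It illustrates the semantics described in the list above and is not the server's implementation. The function names, the sample URL, the regexp, and the target directory are all assumptions, as is the choice of searching for the pattern anywhere in the value.

```python
import re


def var_new_regex(variables, source_name, pattern, new_name):
    """Mimic VAR_NEW_REGEX: the first capture group becomes the new variable."""
    match = re.search(pattern, variables[source_name])
    if match:
        variables[new_name] = match.group(1)
    return variables


def substitute(parameter, variables):
    """Replace each {name} placeholder with the value of the matching variable."""
    for name, value in variables.items():
        parameter = parameter.replace("{" + name + "}", value)
    return parameter


# Hypothetical starting point: the script was run with this {url} variable.
variables = {"url": "http://www.open-search-server.com/docs/guide.pdf"}

# Capture the file name without its extension into a new variable.
var_new_regex(variables, "url", r"/([^/]+)\.pdf$", "filename")

# The new variable can then be used in later parameters, for instance
# in the directory parameter of a download command.
target_directory = substitute("/tmp/downloads/{filename}", variables)
print(target_directory)
```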