OpenSearchServer Documentation

How to use Selenium to crawl websites

OpenSearchServer 1.5 introduces a powerful scripting feature. It is now possible to drive a web browser (Firefox, PhantomJS) using a written script.

The new scripting feature can:

Open a web browser,
Execute JavaScript,
Extract data using an XPATH query, a CSS selector, or the ID of a web element,
Insert documents into an index.

A set of REST APIs was created to manage the scripts. Each script can be stored, updated, executed, or deleted by referring to its name.

This new feature is particularly useful for crawling websites protected by a login/password form, websites that rely heavily on JavaScript to display content, or pages where the values to be indexed can only be located by precise CSS or XPATH selectors.

JSON script

Each command is a JSON structure with a command value and an array of parameters.

{
"command": "WEBDRIVER_OPEN",
"parameters": [ "FIREFOX"]
}

A script is an array of JSON command structures. Here is an example:

[
{ "command": "WEBDRIVER_OPEN", "parameters": [ "PHANTOMJS"] },
{ "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
{ "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] },
{ "command": "WEBDRIVER_GET", "parameters": [ "{url}" ] },
{ "command": "SLEEP", "parameters": [ 1 ] },
{ "command": "INDEX_DOCUMENT_NEW", "parameters": [ "english" ] },
{ "command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": [ "url", "{url}" ] },
{ "command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": [ "title", "Test" ] },
{ "command": "CSS_SELECTOR_INDEX_FIELD", "parameters": [ "content", "div ul li a" ] },
{ "command": "INDEX_DOCUMENT_UPDATE" }
]

This script executes the following actions:

opens a PHANTOMJS browser window,
sets the timeouts to 60 seconds (one minute),
resizes the window to 1024×768,
loads a web page from the URL given by the {url} variable,
waits for one second,
creates a new index document,
sets the url and title fields,
uses a CSS selector to locate elements within the web page and stores them in the content field,
inserts the document into the index.
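
The script above can also be generated programmatically instead of being written by hand. The following is a minimal sketch in Python, assuming only the command names and parameters shown in the example; the helper names command and build_crawl_script are hypothetical and are not part of OpenSearchServer.

```python
import json


def command(name, *parameters):
    """Build one command structure; "parameters" is omitted when empty."""
    entry = {"command": name}
    if parameters:
        entry["parameters"] = list(parameters)
    return entry


def build_crawl_script():
    """Return the example crawl script as a list of command structures."""
    return [
        command("WEBDRIVER_OPEN", "PHANTOMJS"),
        command("WEBDRIVER_SET_TIMEOUTS", 60, 60),
        command("WEBDRIVER_RESIZE", 1024, 768),
        command("WEBDRIVER_GET", "{url}"),
        command("SLEEP", 1),
        command("INDEX_DOCUMENT_NEW", "english"),
        command("INDEX_DOCUMENT_ADD_VALUE", "url", "{url}"),
        command("INDEX_DOCUMENT_ADD_VALUE", "title", "Test"),
        command("CSS_SELECTOR_INDEX_FIELD", "content", "div ul li a"),
        command("INDEX_DOCUMENT_UPDATE"),
    ]


if __name__ == "__main__":
    # Print the JSON document that would be stored as the script.
    print(json.dumps(build_crawl_script(), indent=2))
```

Building the list in code keeps every entry in the same shape and makes it easy to reuse a sequence of commands across several scripts.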

RESTful API

A set of RESTful APIs is available.

Saving the script

To store the script, use the following call:

URL: http://localhost:9090/services/rest/index/script/script/{script_name}
HTTP Method: PUT
HTTP Header: Content-Type: application/json
Payload: The JSON script
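
As an illustration, the call above could be prepared from Python with the standard library. This is a sketch under assumptions: the base URL is copied verbatim from the call described above, the function names build_save_request and send are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL copied from the documentation above; adjust host and port as needed.
BASE_URL = "http://localhost:9090/services/rest/index/script/script"


def build_save_request(script_name, script):
    """Prepare the PUT request that stores `script` under `script_name`."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{quote(script_name, safe='')}",
        data=json.dumps(script).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling send(build_save_request("my_script", script)) against a running server would perform the actual upload; "my_script" is a placeholder name.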

Running the script

To run the script, use the following call:

URL: http://localhost:9090/services/rest/index/script/script/{script_name}/run
HTTP Method: POST
HTTP Header: Content-Type: application/json
Payload: A JSON structure containing the variables

Example of a JSON structure for the variables:

{"url": "http://www.open-search-server.com", "name": "John Doe" }
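
The run call can be prepared in the same way. The sketch below assumes the URL pattern given above and passes the variables as the JSON payload; build_run_request is a hypothetical helper name, and the request is built without being sent.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL copied from the documentation above; adjust host and port as needed.
BASE_URL = "http://localhost:9090/services/rest/index/script/script"


def build_run_request(script_name, variables):
    """Prepare the POST request that runs `script_name` with `variables`."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{quote(script_name, safe='')}/run",
        data=json.dumps(variables).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_run_request(
        "my_script",  # placeholder script name
        {"url": "http://www.open-search-server.com", "name": "John Doe"},
    )
    # Show what would be sent; urllib.request.urlopen(request) would send it.
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```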

Subscript

A script can call a subscript for each web element found by a selector.

[
{ "command": "WEBDRIVER_OPEN", "parameters": [ "PHANTOMJS" ] },
{ "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
{ "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] },
{ "command": "WEBDRIVER_GET", "parameters": [ "http://www.dmoz.org/" ] },
{ "command": "SLEEP", "parameters": [ 3 ] },
{ "command": "CSS_SELECTOR_SUBSCRIPT", "parameters": [ "dmoz_sub", "div#catalogs span a" ] },
{ "command": "WEBDRIVER_CLOSE" }
]

This script will extract all the root categories of the homepage of Dmoz.org. Then, for each category found, the subscript dmoz_sub will be called.
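
To illustrate how each matched link could feed the {url} variable of the subscript, here is a small Python simulation. It is not OpenSearchServer's implementation: the substitution rules, the function names, and the content of the sample subscript are all assumptions made for the example.

```python
import re

# Matches placeholders such as {url}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

# Hypothetical subscript body, built only from commands listed in this page.
HYPOTHETICAL_SUBSCRIPT = [
    {"command": "INDEX_DOCUMENT_NEW", "parameters": ["english"]},
    {"command": "INDEX_DOCUMENT_ADD_VALUE", "parameters": ["url", "{url}"]},
    {"command": "INDEX_DOCUMENT_UPDATE"},
]


def substitute(value, variables):
    """Replace {name} placeholders in strings; leave everything else as is."""
    if not isinstance(value, str):
        return value
    return PLACEHOLDER.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))), value
    )


def bind_script(script, variables):
    """Return a copy of `script` with variables substituted in parameters."""
    bound = []
    for entry in script:
        new_entry = {"command": entry["command"]}
        if "parameters" in entry:
            new_entry["parameters"] = [
                substitute(p, variables) for p in entry["parameters"]
            ]
        bound.append(new_entry)
    return bound


def subscript_calls(hrefs, subscript):
    """One bound copy of the subscript per link, with {url} set to its href."""
    return [bind_script(subscript, {"url": href}) for href in hrefs]
```

For instance, subscript_calls(["http://example.org/a", "http://example.org/b"], HYPOTHETICAL_SUBSCRIPT) yields two copies of the subscript, each with its own url value.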

Full list of commands

WEBDRIVER_OPEN: open a web browser
- parameter 1: name of driver. Possible values: PHANTOMJS, FIREFOX.
WEBDRIVER_CLOSE: close the browser
WEBDRIVER_NEW_WINDOW: open a new window and keep the session running
WEBDRIVER_CLOSE_WINDOW: close the current window
WEBDRIVER_SET_TIMEOUTS: define a timeout delay after which script execution will be stopped.
- parameter 1: delay, in seconds
WEBDRIVER_RESIZE: resize window
- parameter 1: width
- parameter 2: height
WEBDRIVER_GET: access a URL
- parameter 1: URL. Variables can be used within this parameter.
SLEEP: pause script execution
- parameter 1: time, in seconds
CSS_SELECTOR_SUBSCRIPT, XPATH_SELECTOR_SUBSCRIPT: select elements and run a subscript for each one. Each selected element must be an <a> or an <img> element. Its href or src attribute is passed as the {url} variable of the subscript.
- parameter 1: name of the script
- parameter 2: selector, CSS or XPATH depending on the command.
WEBDRIVER_JAVASCRIPT: execute some Javascript.
- parameter 1: Javascript commands to execute. Variables can be used within this parameter.
SCRIPT: call another script
- parameter 1: name of the script
CSS_SELECTOR_CLICK_AND_SCRIPT, XPATH_SELECTOR_CLICK_AND_SCRIPT: select an element, click on it, wait some time and run a script
- parameter 1: selector, CSS or XPATH depending on the command.
- parameter 2: name of the script
- parameter 3: time to wait between the click and the script execution
INDEX_DOCUMENT_NEW: create a new document to be indexed
- parameter 1: name of the new document
INDEX_DOCUMENT_ADD_VALUE: add a value in the current new document
- parameter 1: field
- parameter 2: value. Variables can be used in this parameter.
INDEX_DOCUMENT_UPDATE: commit the current new document to the index
VAR_NEW_REGEX: create a new variable by applying a regexp to an existing variable. The new variable can then be used in the script and its subscripts.
- parameter 1: variable on which to apply the regexp
- parameter 2: regexp. The capture group will give the new variable its value.
- parameter 3: new variable name
CSS_SELECTOR_DOWNLOAD, XPATH_SELECTOR_DOWNLOAD: select an element and download the URL given in its href attribute. The selected element must be an <a> element.
- parameter 1: directory into which the file will be downloaded. It will be fully (recursively) created if it does not exist. Variables can be used within this parameter.
- parameter 2: selector, CSS or XPATH depending on the command.
WEBDRIVER_DOWNLOAD: download the current file indicated by the {url} variable.
- parameter 1: directory into which the file will be downloaded. It will be fully (recursively) created if it does not exist. Variables can be used within this parameter.
PARSER_MERGE: merge every PDF document from a directory into a new PDF document.
- parameter 1: name of the PDF parser. Value can only be "PDF parser" at this stage.
- parameter 2: directory that contains the PDF files to merge.
- parameter 3: full path (including filename) to the PDF to be created
SEARCH_TEMPLATE_JSON: execute a query on the index and run an action depending on the result.
- parameter 1: name of the query template to use for the search
- parameter 2: keywords to use for the search query
- parameter 3: JSON path to match a specific part of the result
- parameter 4: action to take. Values can be:
  - EXIT_IF_NOT_FOUND: exit current script if the JSON path does not match
  - IF_FOUND: if the JSON path matches:
    - parameter 5: must be NEXT_COMMAND
    - parameter 6: next command to run. Will often be WEBDRIVER_CLOSE_WINDOW
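
The behaviour described for VAR_NEW_REGEX above can be approximated in Python. This is only a sketch of the idea: it uses Python's re module, whereas the regexp dialect and the exact matching rules of OpenSearchServer may differ, and the function name var_new_regex and the sample URL are hypothetical.

```python
import re


def var_new_regex(variables, source_name, pattern, new_name):
    """Return a copy of `variables` with `new_name` set from a capture group.

    The first capture group of `pattern`, searched in the value of
    `source_name`, becomes the new variable. If nothing matches, the copy
    is returned unchanged. `pattern` must contain at least one group.
    """
    result = dict(variables)
    match = re.search(pattern, variables[source_name])
    if match:
        result[new_name] = match.group(1)
    return result


if __name__ == "__main__":
    variables = {"url": "http://example.org/docs/report-2014.pdf"}
    # Capture the file name without its extension into a new variable.
    updated = var_new_regex(variables, "url", r"/([^/]+)\.pdf$", "filename")
    print(updated["filename"])
```

The equivalent command would take the three parameters in the same order: the source variable, the regexp, and the name of the new variable.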
