Crawling one URL

Use this API to crawl a page by passing its URL.

The URL must match the pattern list.

Requirement: OpenSearchServer v1.5

Call parameters

URL: /services/rest/index/{index_name}/crawler/web/crawl?url={url}&returnData=true

Method: GET

Header (optional returned type):

  • Accept: application/json
  • Accept: application/xml

URL parameters:

  • index_name (required): The name of the index
  • url (required): The URL to crawl
  • returnData (optionnal): If set to true will return a JSON array with the extracted data

Success response

The page has been crawled.

HTTP code:
200

Content (application/json):

{
"successful": true,
"info": "Result: Fetched - Parsed - Indexed",
"details":{
"ContentBaseType":"text/html",
"ContentLength":"-1",
"ContentTypeCharset":"UTF-8",
"FetchStatus":"Fetched",
"HttpResponseCode":"200",
"IndexStatus":"Indexed",
"ParserStatus":"Parsed",
"RobotsTxtStatus":"Allow",
"URL":"http://www.loremipsum.dolor/"
},
"items":[
[
{
"fieldName":"title",
"values":[
"Lorem ipsum dolor sit amet"
]
},
{
"fieldName":"content",
"values":[
"Vivamus consectetur lorem at metus lobortis, a ullamcorper sapien ornare. Donec et ornare mauris, at",
"interdum libero. Fusce tempor purus laoreet, eleifend mi in, elementum velit. Nunc aliquet vulputate urna"
}
}
]
}

Error response

The index has not been found.

HTTP code:
404

Content (text/plain):

The index my_index has not been found

Sample call

Using CURL:

curl -XGET http://localhost:8080/services/rest/index/my_index/crawler/web/crawl?url=http://www.example.org/&returnData=true

Using jQuery:

$.ajax({
type: "GET",
dataType: "json",
url: "http://localhost:8080/services/rest/index/my_index/crawler/web/crawl?url=http://www.example.org/&returnData=true"
}).done(function (data) {
console.log(data);
});

View/edit on GitHub