Crawling one URL

Use this API to crawl a page by passing its URL.

The URL must match the web crawler's pattern list for the index.

Requirement: OpenSearchServer v1.5

Call parameters

URL: /services/rest/index/{index_name}/crawler/web/crawl?url={url}&returnData=true

Method: GET

Header (optional, sets the returned type):

  • Accept: application/json
  • Accept: application/xml

URL parameters:

  • index_name (required): The name of the index
  • url (required): The URL to crawl
  • returnData (optional): If set to true, the response includes a JSON array with the extracted data
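
The parameters above can be assembled into a request URL as sketched below. This is a minimal sketch, not part of OpenSearchServer: the function name `buildCrawlUrl` and its `baseUrl` argument are hypothetical, and percent-encoding the `url` value is an assumption made so that a target URL carrying its own query string does not get mixed up with the API's parameters.

```javascript
// Hypothetical helper (not an OpenSearchServer function):
// builds the crawl endpoint URL from its parameters.
function buildCrawlUrl(baseUrl, indexName, pageUrl, returnData) {
  // The index name is a path segment, so encode it as one.
  const path = "/services/rest/index/" + encodeURIComponent(indexName) + "/crawler/web/crawl";
  // Assumption: encode the target URL so its own "?" and "&" stay inside the url parameter.
  let query = "?url=" + encodeURIComponent(pageUrl);
  // returnData is optional; only send it when the extracted data is wanted.
  if (returnData) {
    query += "&returnData=true";
  }
  // Drop any trailing slash on the base URL before appending the path.
  return baseUrl.replace(/\/+$/, "") + path + query;
}

console.log(buildCrawlUrl("http://localhost:8080", "my_index", "http://www.example.org/?a=1&b=2", true));
```

Building the URL in one place keeps the required parameters (`index_name`, `url`) explicit and makes the optional `returnData` flag easy to toggle.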

Success response

The page has been crawled.

HTTP code:
200

Content (application/json):

{
    "successful": true,
    "info": "Result: Fetched - Parsed - Indexed",
    "details":{  
       "ContentBaseType":"text/html",
       "ContentLength":"-1",
       "ContentTypeCharset":"UTF-8",
       "FetchStatus":"Fetched",
       "HttpResponseCode":"200",
       "IndexStatus":"Indexed",
       "ParserStatus":"Parsed",
       "RobotsTxtStatus":"Allow",
       "URL":"http://www.loremipsum.dolor/"
    },
    "items":[
       [
          {
             "fieldName":"title",
             "values":[
                "Lorem ipsum dolor sit amet"
             ]
          },
          {
             "fieldName":"content",
             "values":[
                "Vivamus consectetur lorem at metus lobortis, a ullamcorper sapien ornare. Donec et ornare mauris, at",
                "interdum libero. Fusce tempor purus laoreet, eleifend mi in, elementum velit. Nunc aliquet vulputate urna"
             ]
          }
       ]
    ]
}
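
When `returnData=true` is used, the extracted fields arrive as a list of `fieldName`/`values` pairs, which is awkward to read directly. The sketch below regroups them into plain objects. It assumes that `items` is an array of arrays, as in the sample response; the helper name `itemsToDocuments` and the shortened sample values are illustrative only.

```javascript
// Hypothetical helper: turns the "items" array of a crawl response into
// a list of plain objects mapping each field name to its array of values.
function itemsToDocuments(response) {
  // "items" is only present when returnData=true was requested.
  const items = Array.isArray(response.items) ? response.items : [];
  return items.map(function (fields) {
    const doc = {};
    fields.forEach(function (field) {
      doc[field.fieldName] = field.values;
    });
    return doc;
  });
}

// Trimmed-down response in the same shape as the sample (values shortened).
const sample = {
  successful: true,
  info: "Result: Fetched - Parsed - Indexed",
  items: [[
    { fieldName: "title", values: ["Lorem ipsum dolor sit amet"] },
    { fieldName: "content", values: ["Vivamus consectetur", "interdum libero"] }
  ]]
};

const docs = itemsToDocuments(sample);
console.log(docs[0].title[0]); // prints "Lorem ipsum dolor sit amet"
```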

Error response

The index has not been found.

HTTP code:
404

Content (text/plain):

The index my_index has not been found
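
Because the error body is plain text rather than JSON, a client should check the HTTP code before parsing. A minimal sketch, assuming the success body was requested as `application/json` and that other error codes also carry a plain-text message (the page only documents the 404 case); `interpretCrawlResponse` is a hypothetical name:

```javascript
// Hypothetical helper: interprets the HTTP status and raw body returned by the crawl API.
function interpretCrawlResponse(status, body) {
  if (status === 200) {
    // Success bodies are JSON when "Accept: application/json" was sent.
    return { ok: true, data: JSON.parse(body) };
  }
  // Error bodies are plain text, so do not try to parse them as JSON.
  return { ok: false, status: status, message: body.trim() };
}

const failure = interpretCrawlResponse(404, "The index my_index has not been found\n");
console.log(failure.ok, failure.status, failure.message);
```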

Sample call

Using CURL:

curl -XGET "http://localhost:8080/services/rest/index/my_index/crawler/web/crawl?url=http://www.example.org/&returnData=true"

Using jQuery:

$.ajax({ 
   type: "GET",
   dataType: "json",
   url: "http://localhost:8080/services/rest/index/my_index/crawler/web/crawl?url=http://www.example.org/&returnData=true"
}).done(function (data) {
   console.log(data);
});
