html2api is a REST API that returns JSON converted from an HTML page. Although a public API is useful for developers, most web sites do not provide one. html2api offers a virtual API for retrieving data from those web sites.

html2api takes two parameters: "url" and "template". "url" is the location of the web page from which you want to retrieve data. The HTML of that page is converted into JSON according to the "template" passed to the API. The template is written in a simple syntax.

API URL

http://html2api.appspot.com/api/json

Both GET and POST are accepted. Both methods receive the same parameters.
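
As a minimal sketch, assuming the endpoint above and the "url" and "template" parameters, a request could be prepared in Python as follows. The helper names build_get_url and build_post_request are hypothetical and not part of html2api, the target page http://example.com/ is only a placeholder, and the form-encoded POST body is an assumption. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "http://html2api.appspot.com/api/json"


def build_get_url(params):
    """Return the API URL with params percent-encoded in the query string."""
    return API_URL + "?" + urlencode(params)


def build_post_request(params):
    """Return a POST request carrying the same params in the body."""
    # Assumption: the API reads POST parameters from a form-encoded body.
    data = urlencode(params).encode("utf-8")
    return Request(API_URL, data=data)


# Placeholder target page and a one-member template.
params = {"url": "http://example.com/", "template": "{'title': h1}"}
get_url = build_get_url(params)
post_request = build_post_request(params)
```

Either object can then be passed to urllib.request.urlopen. Encoding the template with urlencode matters because templates contain characters such as braces, quotes and spaces.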

Template Syntax Overview

In each example below, the HTML of url comes first, followed by the template and then the JSON output.

The template syntax is similar to JSON and supports the object (dictionary) form. Unlike JSON, the value of an object member is a CSS3 selector that identifies the HTML element containing the text you want to retrieve.

<div id="main">
  <p>hello world.</p>
</div>
{'message': #main p}
{"message": "hello world."}

The value of an HTML attribute can also be retrieved by appending $[attr_name] after the CSS selector.

<div id="main">
  <a href="http://html2api.appspot.com/">
    html2api
  </a>
</div>
{'link': #main a $[href]}
{"link":
  "http://html2api.appspot.com/"
}

Besides objects, html2api can also handle arrays. An array is written as [sel @ val]. sel selects the set of HTML elements that correspond to the entries of the JSON array. val describes the value of each entry and is evaluated with the element selected by sel as its root. If val matches nothing inside an element, the entry is null, as the third entry in the example shows.

<ul id="planets">
  <li><a href="/curiosity">Mars</a></li>
  <li><a href="/cassini">Saturn</a></li>
  <li>Neptune</li>
</ul>
[#planets li @ a]
["Mars", "Saturn", null]
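
To illustrate how val is rooted at each element chosen by sel, here is a minimal sketch that emulates the template [#planets li @ a] for this one snippet with Python's standard library. It is not how html2api or xml2data is implemented. The function name select_array is hypothetical, and the lookups are hard-coded stand-ins for the two selectors.

```python
import json
import xml.etree.ElementTree as ET

MARKUP = """<ul id="planets">
  <li><a href="/curiosity">Mars</a></li>
  <li><a href="/cassini">Saturn</a></li>
  <li>Neptune</li>
</ul>"""


def select_array(markup):
    """Emulate [#planets li @ a] for this snippet only."""
    root = ET.fromstring(markup)        # the <ul id="planets"> element
    values = []
    for li in root.findall("li"):       # sel: one array entry per <li>
        link = li.find("a")             # val: searched inside that <li> only
        values.append(link.text if link is not None else None)
    return values


planets_json = json.dumps(select_array(MARKUP))
print(planets_json)  # prints ["Mars", "Saturn", null]
```

The third item has no a element, so its entry becomes None, which serializes to null.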

Of course, these structures can be combined.

<ul>
  <li class="lang">
    <div class="name">C</div>
    <div class="extensions">
      <span>.c</span>
      <span>.h</span>
    </div>
  </li>
  <li class="lang">
    <div class="name">Python</div>
    <div class="extensions">
      <span>.py</span>
    </div>
  </li>
</ul>
[li.lang @ {
  'name': .name,
  'extensions': [.extensions span @ ]
}]
[{
    "name": "C",
    "extensions": [".c", ".h"]
  }, {
    "name": "Python",
    "extensions": [".py"]
}]
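
The nesting can be read as two levels of rooting: each li.lang becomes the root for the object, and each .extensions span is looked up inside that li only. The following sketch emulates this for the snippet above using the standard library. It is an illustration under that reading, not html2api's implementation, and select_languages is a hypothetical name.

```python
import xml.etree.ElementTree as ET

MARKUP = """<ul>
  <li class="lang">
    <div class="name">C</div>
    <div class="extensions">
      <span>.c</span>
      <span>.h</span>
    </div>
  </li>
  <li class="lang">
    <div class="name">Python</div>
    <div class="extensions">
      <span>.py</span>
    </div>
  </li>
</ul>"""


def select_languages(markup):
    """Emulate the combined array/object template for this snippet only."""
    root = ET.fromstring(markup)
    result = []
    for li in root.findall(".//li[@class='lang']"):    # outer sel: li.lang
        name = li.find(".//*[@class='name']")          # .name, rooted at li
        extensions = [
            span.text                                  # empty val: the span's own text
            for box in li.findall(".//*[@class='extensions']")
            for span in box.iter("span")               # .extensions span
        ]
        result.append({
            "name": name.text if name is not None else None,
            "extensions": extensions,
        })
    return result


languages = select_languages(MARKUP)
```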

Parameters

Most parameters fall into one of two groups: html-group and template-group. The API is called with two parameters, one picked from each group.

Parameters of the html-group specify the HTML, directly or indirectly.

url URL of the target web page.
url_ref URL of a plain web page that returns the location (URL) of the target web page.
html HTML itself.

Parameters of the template-group specify the template, directly or indirectly.

template Template itself.
template_ref URL of a plain web page that returns the template.

The API accesses the target web site with a GET request. If POST is preferable, set the optional post_body parameter. When it is set, the API uses POST instead of GET.

post_body HTTP message body sent with the request to the target web page.
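
The grouping rule can be checked on the client before a request is made. The sketch below assumes the rule means exactly one parameter from each group, with post_body as an optional extra. The helper build_params is hypothetical and not part of html2api.

```python
HTML_GROUP = ("url", "url_ref", "html")
TEMPLATE_GROUP = ("template", "template_ref")


def build_params(post_body=None, **params):
    """Return a parameter dict after checking the one-per-group rule."""
    # Assumption: exactly one parameter must come from each group.
    for group in (HTML_GROUP, TEMPLATE_GROUP):
        chosen = [name for name in group if name in params]
        if len(chosen) != 1:
            raise ValueError(
                "expected exactly one of %s, got %s" % (group, chosen)
            )
    if post_body is not None:
        # Per the text above, setting post_body makes the API use POST
        # when it accesses the target web page.
        params["post_body"] = post_body
    return params


# Inline HTML combined with a template fetched by reference (placeholder URL).
example = build_params(
    html="<p>hello</p>",
    template_ref="http://example.com/template.txt",
)
```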

Returns

If an error occurs while processing a request, the API returns "null" and adds an "X-ERROR-TYPE" header to the response.
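
A client might therefore check the header before trusting the body. The following is a minimal sketch: Html2ApiError and decode_response are hypothetical names, and the value "SomeError" in the usage lines is a placeholder, since the possible error types are not listed here.

```python
import json


class Html2ApiError(Exception):
    """Hypothetical exception carrying the X-ERROR-TYPE header value."""


def decode_response(body, headers):
    """Return the decoded JSON body, or raise if the error header is set."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    error_type = normalized.get("x-error-type")
    if error_type is not None:
        raise Html2ApiError(error_type)
    return json.loads(body)


ok = decode_response('{"message": "hello world."}', {})
try:
    decode_response("null", {"X-ERROR-TYPE": "SomeError"})  # placeholder value
    failed_with = None
except Html2ApiError as exc:
    failed_with = str(exc)
```

Checking the header rather than the body avoids relying on a bare null, which on its own may be ambiguous.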

Related resources

The core functionality of html2api is based on the Python library xml2data, whose source code is freely available.

Disclaimer

This service does not guarantee the high availability needed for business purposes. If you plan to use its functionality in business products, it is recommended that you build your own server using xml2data instead of calling html2api.