Tabletojson
An npm module for node.js to convert HTML tables to JSON objects
Tabletojson: Converting tables to JSON objects made easy
Convert local or remote HTML embedded tables into JSON objects.
Table of Contents
- Introduction
- Incompatible changes
- Installation
- Quickstart
- Use case examples
- Options
- Known issues and limitations
- Usage
- Contributing
Introduction
Tabletojson attempts to convert local or remote HTML tables into JSON with a very low footprint. It accepts the markup for a single table as a string, a fragment of HTML, an entire page, or just a URL (with an optional callback function; promises are also supported).
The response is always an array. Every entry in the array represents one table found on the page, in the same order the tables appear in the HTML.
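The result shape described above can be written down as TypeScript types. This is a sketch inferred from the examples in this README, not the library's published typings; `TableRow`, `TableResult`, `ConversionResult`, and the hand-written `sample` data are illustrative names and values only.

```typescript
// Assumed shapes, inferred from the README examples (not the library's own typings).
type TableRow = Record<string, string>; // heading -> cell text
type TableResult = TableRow[];          // one object per table row
type ConversionResult = TableResult[];  // one entry per table, in document order

// Hand-written sample standing in for the result of converting a page with two tables.
const sample: ConversionResult = [
    [{Parent: 'Marry', Child: 'Sue', Age: '15'}],
    [{PLACE: 'abc', VALUE: '1'}, {PLACE: 'def', VALUE: '2'}],
];

// The outer array is indexed by table, the inner array by row.
const secondTable: TableResult = sample[1];
const rowCounts: number[] = sample.map((table) => table.length);

console.log(secondTable[0].PLACE);      // abc
console.log(JSON.stringify(rowCounts)); // [1,2]
```

Because the outer array is always present, a page with a single table still needs an index such as `[0]` to reach its rows.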
As of version 2.0, tabletojson is written entirely in TypeScript.
Incompatible changes
- From version 2 on, request.js is no longer used.
- From version 2.1.0 on, got is no longer used; it was replaced by Node's built-in fetch. More information here...
- Switched from CommonJS to the ES module system. Bumped version to 3.0.0.
- Provides a "hybrid" library to meet the needs of both ESM and CommonJS. Bumped version to 4.0.1.
- Added support for complex headings as keys in the output JSON object. Bumped version to 4.1.0.
Conversion from version 1.+ to 2.x
- Require must be changed from
const tabletojson = require('../lib/tabletojson');
to either
const tabletojson = require('../lib/tabletojson').Tabletojson;
or
const {Tabletojson: tabletojson} = require('../lib/tabletojson');
- Replace request options by fetch options. More information here...
Conversion from version 2.0.1 to 3.x
- Tabletojson now uses esm. Use
import {Tabletojson as tabletojson} from 'tabletojson';
or
import {tabletojson} from 'tabletojson';
- Added lowercase import
import {tabletojson} from 'tabletojson';
- If you are using Node 18 execute examples by calling:
npm run build:examples
cd dist/examples
node --experimental-vm-modules --experimental-specifier-resolution=node example-1.js --prefix=dist/examples
Installation
npm install tabletojson
Quickstart
esm
import {tabletojson} from 'tabletojson';
tabletojson.convertUrl('https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes', function (tablesAsJson) {
    console.log(tablesAsJson[1]);
});
commonjs
const {tabletojson} = require('tabletojson');
tabletojson.convertUrl('https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes', function (tablesAsJson) {
    console.log(tablesAsJson[1]);
});
Remote (convertUrl)
// example-1.ts
import {tabletojson} from 'tabletojson';
tabletojson.convertUrl('https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes', function (tablesAsJson) {
    console.log(tablesAsJson[1]);
});
Local (convert)
More examples can be found in the examples folder.
// example-6.ts
import {tabletojson} from 'tabletojson';
import * as fs from 'fs';
import * as path from 'path';
const html = fs.readFileSync(path.resolve(process.cwd(), '../../test/tables.html'), {
    encoding: 'utf-8',
});
const converted = tabletojson.convert(html);
console.log(converted);
Use case examples
Duplicate column headings
In tables with duplicate column headings, each repeated heading is suffixed with a count:
| PLACE | VALUE | PLACE | VALUE |
|-------|-------|-------|-------|
| abc   | 1     | def   | 2     |
[
    {
        "PLACE": "abc",
        "VALUE": "1",
        "PLACE_2": "def",
        "VALUE_2": "2"
    }
]
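The suffixing rule can be sketched in a few lines. This is an illustration of the behavior shown in the example, not the library's actual implementation; `uniqueHeadings` is a hypothetical helper name, and it does not guard against a table that already contains a literal heading such as `PLACE_2`.

```typescript
// Hypothetical helper: makes heading names unique by suffixing repeats with a count.
function uniqueHeadings(headings: string[]): string[] {
    const seen = new Map<string, number>();
    return headings.map((heading) => {
        const count = (seen.get(heading) ?? 0) + 1;
        seen.set(heading, count);
        // The first occurrence keeps its name; later ones become NAME_2, NAME_3, ...
        return count === 1 ? heading : `${heading}_${count}`;
    });
}

const keys = uniqueHeadings(['PLACE', 'VALUE', 'PLACE', 'VALUE']);
console.log(JSON.stringify(keys)); // ["PLACE","VALUE","PLACE_2","VALUE_2"]
```

Unique keys matter because each row becomes a plain object, and a repeated key would silently overwrite the earlier column's value.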
Tables with rowspan
In tables with rowspan, the content of the spanned cell must be available in each respective row object.
<table id="table11" class="table" style="border: solid">
    <thead>
        <tr><th>Parent</th><th>Child</th><th>Age</th></tr>
    </thead>
    <tbody>
        <tr><td rowspan="3">Marry</td><td>Sue</td><td>15</td></tr>
        <tr><td>Steve</td><td>12</td></tr>
        <tr><td>Tom</td><td>3</td></tr>
    </tbody>
</table>
[
    {"Parent": "Marry", "Child": "Sue", "Age": "15"},
    {"Parent": "Marry", "Child": "Steve", "Age": "12"},
    {"Parent": "Marry", "Child": "Tom", "Age": "3"}
]
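One way to think about rowspan handling is to carry a spanning cell's text down into the following rows of the same column. The sketch below works on an already-parsed grid rather than on HTML, and it is only an illustration of the idea, not the library's code; `Cell` and `expandRowspans` are hypothetical names.

```typescript
// Hypothetical parsed cell: its text plus an optional rowspan attribute.
interface Cell {
    text: string;
    rowspan?: number;
}

function expandRowspans(headings: string[], rows: Cell[][]): Record<string, string>[] {
    // Per column: text carried down from a spanning cell and how many more rows it covers.
    const carried = new Map<number, {text: string; remaining: number}>();

    return rows.map((cells) => {
        const queue = [...cells]; // cells physically present in this <tr>
        const record: Record<string, string> = {};

        headings.forEach((heading, column) => {
            const span = carried.get(column);
            if (span && span.remaining > 0) {
                // This column is still covered by a cell from an earlier row.
                record[heading] = span.text;
                span.remaining -= 1;
                return;
            }
            const cell = queue.shift();
            if (!cell) return; // row has fewer cells than headings
            record[heading] = cell.text;
            if (cell.rowspan && cell.rowspan > 1) {
                carried.set(column, {text: cell.text, remaining: cell.rowspan - 1});
            }
        });
        return record;
    });
}

const expanded = expandRowspans(
    ['Parent', 'Child', 'Age'],
    [
        [{text: 'Marry', rowspan: 3}, {text: 'Sue'}, {text: '15'}],
        [{text: 'Steve'}, {text: '12'}],
        [{text: 'Tom'}, {text: '3'}],
    ],
);
console.log(expanded.map((row) => row.Parent).join(',')); // Marry,Marry,Marry
```

Every resulting object carries a `Parent` value even though only the first `<tr>` contains that cell.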
Tables with complex rowspan
In tables with complex rowspans, the content of the spanned cell must be available in each respective row object.
<table id="table12" class="table" border="1">
    <thead>
        <tr><th>Parent</th><th>Child</th><th>Age</th></tr>
    </thead>
    <tbody>
        <tr><td rowspan="3">Marry</td><td>Sue</td><td>15</td></tr>
        <tr><td>Steve</td><td>12</td></tr>
        <tr><td rowspan="2">Tom</td><td rowspan="2">3</td></tr>
        <tr><td rowspan="2">Taylor</td></tr>
        <tr><td>Peter</td><td>17</td></tr>
    </tbody>
</table>
[
    {"Parent": "Marry", "Child": "Sue", "Age": "15"},
    {"Parent": "Marry", "Child": "Steve", "Age": "12"},
    {"Parent": "Marry", "Child": "Tom", "Age": "3"},
    {"Parent": "Taylor", "Child": "Tom", "Age": "3"},
    {"Parent": "Taylor", "Child": "Peter", "Age": "17"}
]
Tables with even more complex rowspans
In tables with even more complex rowspans, the content of the spanned cell must be available in each respective row object.
<table id="table12-a" class="table" border="1">
    <thead>
        <tr><th>Department</th><th>Major</th><th>Class</th><th>Instructor</th><th>Credit</th></tr>
    </thead>
    <tbody>
        <tr><td rowspan="4">Engineering</td><td rowspan="3">Computer Science</td><td>CS101</td><td>Kim</td><td rowspan="2">3</td></tr>
        <tr><td>CS201</td><td rowspan="2">Garcia</td></tr>
        <tr><td>CS303</td><td>2</td></tr>
        <tr><td>Electrical Engineering</td><td>EE101</td><td>Müller</td><td>3</td></tr>
        <tr><td rowspan="2">Social Science</td><td rowspan="2">Economics</td><td>EC101</td><td>Nguyen</td><td rowspan="2">3</td></tr>
        <tr><td>EC401</td><td>Smith</td></tr>
    </tbody>
</table>
[
{
"Department": "Engineering",
"Major": "Computer Science",
"Class": "CS101",
"Instructor": "Kim",
"Credit": "3"
},
{
"Department": "Engineering",
"Major": "Computer Science",
"Credit": "3",
"Class": "CS201",
"Instructor": "Garcia"
},
{
"Department": "Engineering",
"Major": "Computer Scien
