Pickaxe
SQL Based DSL Web Scraper/Screen Scraper
Install / Use
/learn @bitsummation/PickaxeREADME
Pickaxe
Pickaxe uses SQL statements combined with CSS selectors to pick out text from a web page. If you know SQL and a little about CSS selectors, this is the tool for you.
Downloads
Pickaxe runs on linux, MacOS, and windows. Quickest way to get started is to run command below if docker is installed. To install locally, see installation instructions at bottom of page.
docker run -it bitsummation/pickaxe /bin/bash
Pickaxe is command line tool. Once installed or docker images is running you type pickaxe. If no arguments are given, it runs in interactive mode where code can be typed into the prompt and when a semicolon is typed the code is run. A file location to the source code can be passed or a url to source on the web. Run below to see if things are working.
pickaxe https://raw.githubusercontent.com/bitsummation/pickaxe/master/Examples/georgetown-airport.s
you should see a table like:
+---+-----+-------+------+-----------+----------+
| | day | time | wind | visibilty | weather |
+---+-----+-------+------+-----------+----------+
| 1 | 21 | 11:56 | N 5 | 10.00 | Fog/Mist |
| 2 | 21 | 10:56 | NE 3 | 10.00 | Fog/Mist |
| 3 | 21 | 09:56 | Calm | 6.00 | Fog/Mist |
| 4 | 21 | 08:56 | E 3 | 0.25 | Fog/Mist |
| 5 | 21 | 07:56 | E 5 | 0.25 | Fog/Mist |
| 6 | 21 | 06:53 | Calm | 0.75 | Fog/Mist |
| 7 | 21 | 05:56 | Calm | 1.00 | Fog/Mist |
+---+-----+-------+------+-----------+----------+
More examples are found here: https://github.com/bitsummation/pickaxe/tree/master/Examples
How To Write Queries
Pickaxe uses SQL like statements to select text from web pages. Instead of the SQL statements running against a database they run against live web pages.
Download Page
Download page returns a table with columns url, nodes, date, size. The statement below downloads the entire web page but doesn't select any nodes in the where clause so not very useful.
select *
from download page 'https://www.faa.gov/air_traffic/weather/asos/?state=TX'
Where
Select the nodes we are interested in. To accomplish, set the nodes equal to a css expression. The css selector below gets all tr nodes that are under the table with class asos. In this case, multiple tr nodes are selected but nothing of interested is returned until we select nodes under each tr using the pick expression in the select statement.
select *
from download page 'https://www.faa.gov/air_traffic/weather/asos/?state=TX'
where nodes = 'table.asos tbody tr'
Pick
The pick expression picks out nodes under each node specified in the where clause. Pick takes a css selector. In this case, we are getting data in the td elements under each tr element. After the pick css selector, a part of the element can be specified.
- take attribute 'attribute' - takes the attribute value of the node.
- take text - takes the text of the node (default value and doesn't have to be specified)
- take html - takes the html of the node
select
pick 'td:nth-child(1) a' take attribute 'href' as details,
pick 'td:nth-child(1) a' as station,
pick 'td:nth-child(2)' as city,
pick 'td:nth-child(4)' as state
from download page 'https://www.faa.gov/air_traffic/weather/asos/?state=TX'
where nodes = 'table.asos tbody tr'
Nested download selects
We create a memory table to store state strings then we insert states into it. The nested download select statement allows the download page statement to download multiple pages at once.
create buffer states(state string)
insert into states
select 'TX'
insert into states
select 'OR'
select
pick 'td:nth-child(1) a' take attribute 'href', --link to details
pick 'td:nth-child(1) a', --station
pick 'td:nth-child(2)', --city
pick 'td:nth-child(4)' --state
from download page (select
'https://www.faa.gov/air_traffic/weather/asos/?state=' + state
from states)
where nodes = 'table.asos tbody tr'
Download Threads (make it faster)
A download page statement can use with (thread(2)) hint. The download page statement will then use the number of threads specified to download the pages resulting in much better performance.
create buffer states(state string)
insert into states
select 'TX'
insert into states
select 'OR'
select
pick 'td:nth-child(1) a' take attribute 'href', --link to details
pick 'td:nth-child(1) a', --station
pick 'td:nth-child(2)', --city
pick 'td:nth-child(4)' --state
from download page (select
'https://www.faa.gov/air_traffic/weather/asos/?state=' + state
from states) with (thread(2))
where nodes = 'table.asos tbody tr'
Proxies
Must be first statement in program. If the expression in the test block returns any rows, the proxy is considered good and all http requests will be routed through it. If more than one passes they are used in Round-robin fashion.
proxies ('104.156.252.188:8080', '75.64.204.199:8888', '107.191.49.249:8080')
with test {
select
pick '#tagline' take text
from download page 'http://vtss.brockreeve.com/'
}
Identity Column
Specify type as identity and it will auto increment.
create buffer temp(id identity, name string)
insert into temp
select 'test'
insert into temp
select 'test2'
select *
from temp
Storage Buffers
There are three different ways to store results. In memory, files, and sql databases. An example of each is listed below.
In Memory Buffer
Store results in memory. The insert overwrite statement overwrites existing data in the buffer--if any--while insert into just appends to existing data.
create buffer results(type string, folder string, message string, changeDate string)
insert overwrite results
select
case pick '.icon .octicon-file-text'
when null then 'Folder'
else 'File'
end, --folder/file
pick '.content a', --name
pick '.message a', --comment
pick '.age span' --date
from download page 'https://github.com/bitsummation/pickaxe'
where nodes = 'table.files tr.js-navigation-item'
select *
from results
File Buffer
Store results into a file.
create file results(type string, folder string, message string, changeDate string)
with (
fieldterminator = '|',
rowterminator = '\r\n'
)
location 'C:\windows\temp\results.txt'
insert into results
select
case pick '.icon .octicon-file-text'
when null then 'Folder'
else 'File'
end, --folder/file
pick '.content a', --name
pick '.message a', --comment
pick '.age span' --date
from download page 'https://github.com/bitsummation/pickaxe'
where nodes = 'table.files tr.js-navigation-item'
SQL Buffer
Store results in Microsoft SQL Server. The mssql buffer definition must match the sql table structure.
create mssql results(type string, folder string, message string, changeDate string)
with (
connectionstring = 'Server=localhost;Database=scrape;Trusted_Connection=True;',
dbtable = 'Results'
)
insert into results
select
case pick '.icon .octicon-file-text'
when null then 'Folder'
else 'File'
end, --folder/file
pick '.content a', --name
pick '.message a', --comment
pick '.age span' --date
from download page 'https://github.com/bitsummation/pickaxe'
where nodes = 'table.files tr.js-navigation-item'
Javscript Rendered Pages
If a page renders the HTML client side with javascript a simple js hint is all that is needed. Only use if needed as performance is slower.
select
pick '.main-link a' take attribute 'href',
pick '.main-link a',
pick '.posts span', --replies
pick '.views span' --views
from download page 'https://try.discourse.org/' with (js)
where nodes = 'tr.topic-list-item'
For more control, specify the HTML element to wait for -- will wait until javscript renders the element -- and the time to wait before timing out (in seconds).
select
pick '.main-link a' take attribute 'href',
pick '.main-link a',
pick '.posts span', --replies
pick '.views span' --views
from download page 'https://try.discourse.org/' with (js('tr.topic-list-item', 5))
where nodes = 'tr.topic-list-item'
Run Javascript
Run Javascript on the downloaded pages. Must return a javascript object or array of objects. The url variable is given by the framework and stores the url of the downloaded page.
select upc, u
from download page 'https://www.walmart.com/ip/Cheerios-Family-Size-Gluten-Free-Cereal-21-oz/33886599' with (js) => (
"
var primaryProductId = __WML_REDUX_INITIAL_STATE__.product.primaryProduct;
var primaryProduct = __WML_REDUX_INITIAL_STATE__.product.products[primaryProductId];
return { upc:primaryProduct.upc, u:url };
"
)
Sub Queries
select p.title, u.upc
from (
select
pick '.prod-ProductTitle div' as title,
url
from download page 'https://www.walmart.com/ip/Cheerios-Family-Size-Gluten-Free-Cereal-21-oz/33886599'
) p
join (
select upc, url
from download page 'https://www.walmart.com/ip/Cheerios-Family-Size-Gluten-Free-Cereal-21-oz/33886599' with (js) =>
(
"
var primaryProductId = __WML_REDUX_INITIAL_STATE__.product.primaryProduct;
var primaryProduct = __WML_REDUX_INITIAL_STATE__.product.products[primaryProductId];
return { upc:primaryProduct.upc, url:url };
"
)
) u on u.url = p.url
More Examples
Example 1
Capture the commit information from this page.
select
case pick '.icon .octicon-file-text'
when null then 'Folder'
else 'File'
end as type, --folder/file
pick '.content a' as name, --name
pick '.message a' as comment, --comment
pick '.age span' as date --date
from download pag
Related Skills
feishu-drive
343.3k|
things-mac
343.3kManage Things 3 via the `things` CLI on macOS (add/update projects+todos via URL scheme; read/search/list from the local Things database)
clawhub
343.3kUse the ClawHub CLI to search, install, update, and publish agent skills from clawhub.com
postkit
PostgreSQL-native identity, configuration, metering, and job queues. SQL functions that work with any language or driver
