
.. image:: https://raw.githubusercontent.com/bmjjr/transistor/master/img/transistor_logo.png?token=AAgJc9an2d8HwNRHty-6vMZ94VfUGGSIks5b8VHbwA%3D%3D

Web data collection and storage for intelligent use cases.

.. image:: https://img.shields.io/badge/Python-3.6%20%7C%203.7%20%7C%203.8-blue.svg
   :target: https://github.com/bomquote/transistor

.. image:: https://img.shields.io/badge/pypi%20package-0.2.4-blue.svg
   :target: https://pypi.org/project/transistor/0.2.4/

.. image:: https://img.shields.io/pypi/dm/transistor.svg
   :target: https://pypistats.org/packages/transistor

.. image:: https://img.shields.io/badge/Status-Stable-blue.svg
   :target: https://github.com/bomquote/transistor

.. image:: https://img.shields.io/badge/license-MIT-lightgrey.svg
   :target: https://github.com/bomquote/transistor/blob/master/LICENSE

.. image:: https://ci.appveyor.com/api/projects/status/xfg2yedwyrbyxysy/branch/master?svg=true
   :target: https://ci.appveyor.com/project/bmjjr/transistor

.. image:: https://pyup.io/repos/github/bomquote/transistor/shield.svg?t=1542037265283
   :target: https://pyup.io/account/repos/github/bomquote/transistor/
   :alt: Updates

.. image:: https://api.codeclimate.com/v1/badges/0c34950c38db4f38aea6/maintainability
   :target: https://codeclimate.com/github/bomquote/transistor/maintainability
   :alt: Maintainability

.. image:: https://codecov.io/gh/bomquote/transistor/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/bomquote/transistor

==========
transistor
==========

About
=====

The web is full of data. Transistor is a web scraping framework for collecting, storing, and using targeted data from structured web pages.

Transistor's current strengths are in being able to:

- provide an interface to the `Splash <https://github.com/scrapinghub/splash>`_ headless browser / JavaScript rendering service.
- include optional support for the scrapinghub.com `Crawlera <https://scrapinghub.com/crawlera>`_ 'smart' proxy service.
- ingest keyword search terms from a spreadsheet, or use RabbitMQ or Redis as a message broker, transforming keywords into task queues.
- scale one Spider into an arbitrary number of workers combined into a ``WorkGroup``.
- coordinate an arbitrary number of WorkGroups searching an arbitrary number of websites, in one scrape job.
- send out all the WorkGroups concurrently, using gevent-based asynchronous I/O.
- return data from each website for each search term 'task' in your list, for easy website-to-website comparison.
- export data to CSV, XML, JSON, pickle, a file object, and/or your own custom exporter.
- save targeted scrape data to the database of your choice.
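The website-to-website comparison mentioned above amounts to pivoting per-site results by search term. The snippet below is a toy illustration of that idea using only the standard library; the site names and prices are made up and this is not Transistor's actual export format:

```python
from collections import defaultdict

# Toy (search_term, website, price) rows; in Transistor these values
# would come from scraped results, here they are invented.
rows = [
    ("The Martian", "books-a.example", 12.99),
    ("The Martian", "books-b.example", 11.49),
    ("Dune", "books-a.example", 9.99),
    ("Dune", "books-b.example", 10.25),
]

# Pivot into {search_term: {website: price}} for side-by-side comparison.
comparison = defaultdict(dict)
for term, site, price in rows:
    comparison[term][site] = price

# Pick the cheapest source for each search term.
cheapest = {term: min(sites, key=sites.get) for term, sites in comparison.items()}
```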

Suitable use cases include:

- comparing attributes like stock status and price for a list of book titles or part numbers, across multiple websites.
- concurrently processing a large list of search terms on a search engine and then scraping the results, or following links first and then scraping the results.

Development of Transistor is sponsored by `BOM Quote Manufacturing <https://www.bomquote.com>`_. Here is a Medium story from the author about creating Transistor: `That time I coded 90-hours in one week <https://medium.com/bomquote/that-time-i-coded-90-hours-in-one-week-a28732cac754>`_.

Primary goals:

  1. Enable scraping targeted data from a wide range of websites, including sites rendered with JavaScript.
  2. Navigate websites which present logins, custom forms, and other blockers to data collection, like captchas.
  3. Provide asynchronous I/O for task execution, using `gevent <https://github.com/gevent/gevent>`_.
  4. Easily integrate within a web app like `Flask <https://github.com/pallets/flask>`_, `Django <https://github.com/django/django>`_, or other `Python based web frameworks <https://github.com/vinta/awesome-python#web-frameworks>`_.
  5. Provide spreadsheet-based data ingest and export options, like importing a list of search terms from Excel, ODS, or CSV, and exporting data to each as well.
  6. Provide quick and easy integrated task work queues which can be automatically filled with search terms by a simple spreadsheet import.
  7. Integrate with more robust task queues like `Celery <https://github.com/celery/celery>`_, while using `RabbitMQ <https://www.rabbitmq.com/>`_ or `Redis <https://redis.io/>`_ as a message broker, as desired.
  8. Provide hooks for users to persist data via any method they choose, while also supporting our own opinionated choice, which is a `PostgreSQL <https://www.postgresql.org/>`_ database along with `newt.db <https://github.com/newtdb/db>`_.
  9. Contain useful abstractions, classes, and interfaces for scraping and crawling with machine learning assistance (wip, timeline tbd).
  10. Further support data science use cases of the persisted data, where convenient and useful for us to provide in this library (wip, timeline tbd).
  11. Provide a command line interface (low priority wip, timeline tbd).
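As a rough illustration of the spreadsheet-driven ingest in the goals above, the snippet below fills a plain work queue from CSV data using only the standard library. The column header and search terms are hypothetical, and Transistor's own importers and task queues do considerably more than this:

```python
import csv
import io
from queue import Queue

# A stand-in for a spreadsheet export; in practice this would be a
# CSV/ODS/Excel file on disk (the header name here is hypothetical).
csv_text = "search_term\nThe Martian\nDune\nHyperion\n"

# Fill a task queue with one task per keyword row.
tasks = Queue()
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    tasks.put(row["search_term"])

# Workers would now consume from `tasks` until the queue is drained.
```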

Quickstart
==========

First, install Transistor from pypi:

.. code-block:: bash

   pip install transistor

If you have previously installed Transistor, please ensure you are using the latest version:

.. code-block:: bash

   pip install --upgrade transistor

Next, set up Splash, following the Quickstart instructions below. Finally, follow the minimal, abbreviated books_to_scrape example as detailed below.

This example is explained in more detail in the source code found in the examples/books_to_scrape folder, including a full implementation of object persistence with newt.db.

Quickstart: Setup Splash
------------------------

Successful scraping is now a complex affair. Most websites with useful data will rate limit, inspect headers, present captchas, and use JavaScript that must be rendered to get the data you want.

This rules out using simple Python requests scripts for most serious use, so setup becomes much more complicated.

To deal with this, we are going to use `Splash <https://github.com/scrapinghub/splash>`_, "A Lightweight, scriptable browser as a service with an HTTP API".

Transistor also supports the optional use of a smart proxy service from `scrapinghub <https://scrapinghub.com/>`_ called `Crawlera <https://scrapinghub.com/crawlera>`_. The Crawlera smart proxy service helps us:

  • avoid getting our own server IP banned
  • enable regional browsing which is important to us, because data can differ per region on the websites we want to scrape, and we are interested in those differences

The smallest Crawlera plan, C10, costs $25 USD/month. This level is useful but can easily be overly restrictive; the next level up is $100/month.

The easiest way to get set up with Splash is to use `Aquarium <https://github.com/TeamHG-Memex/aquarium>`_, and that is what we are going to do. Using Aquarium requires Docker and Docker Compose.

Windows Setup
~~~~~~~~~~~~~

On Windows, the easiest way to get started with Docker is to use `Chocolatey <https://chocolatey.org/>`_ to install docker-desktop (the successor to docker-for-windows, which has now been deprecated). This first requires `installing Chocolatey <https://chocolatey.org/install>`_.

Then, to install docker-desktop with Chocolatey:

.. code-block:: console

   C:\> choco install docker-desktop

You will likely need to restart your Windows box after installing docker-desktop, even if it doesn't tell you to do so.

All Platforms
~~~~~~~~~~~~~

Install Docker for your platform. Then, for Aquarium, follow the `installation instructions <https://github.com/TeamHG-Memex/aquarium#usage>`_.

After setting up Splash with Aquarium, ensure you set the following environment variables:

.. code-block:: python

   SPLASH_USERNAME = '<username you set during Aquarium setup>'
   SPLASH_PASSWORD = '<password you set during Aquarium setup>'
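If you want to verify from Python that these variables are visible to your scraper process, a quick standard-library check might look like the sketch below; the fallback values are only placeholders, not real defaults:

```python
import os

# Read the Splash credentials set during Aquarium setup. The fallback
# strings below are placeholders so the sketch runs even when unset.
splash_username = os.environ.get("SPLASH_USERNAME", "<placeholder-user>")
splash_password = os.environ.get("SPLASH_PASSWORD", "<placeholder-pass>")

# Report any credential variables missing from the current shell.
missing = [name for name in ("SPLASH_USERNAME", "SPLASH_PASSWORD")
           if name not in os.environ]
if missing:
    print("Warning: not set in this shell:", ", ".join(missing))
```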

Finally, to run the Splash service, cd into the Aquarium repo on your hard drive, and then run ``docker-compose up`` in your command prompt.

Troubleshooting Aquarium and Splash service:

  1. Ensure you are in the aquarium folder when you run the ``docker-compose up`` command.
  2. You may hit initial problems if you did not share your hard drive with Docker.
  3. Share your hard drive with Docker (google is your friend to figure out how to do this).
  4. Try running the ``docker-compose up`` command again.
  5. Note: upon computer/server restart, you need to ensure the Splash service is started, either daemonized or with ``docker-compose up``.

At this point, you should have a Splash service running in your command prompt.
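To sanity-check the running service, you can hit Splash's documented ``render.html`` HTTP endpoint. The snippet below only builds the request URL with the standard library; port 8050 is the usual Splash default, but verify it against your own Aquarium docker-compose output, and send your Splash username/password via HTTP basic auth when you actually fetch it:

```python
from urllib.parse import urlencode

# Build a request URL for Splash's render.html endpoint. Port 8050 is
# the common default; confirm it against your Aquarium configuration.
splash_base = "http://localhost:8050/render.html"
params = {"url": "http://books.toscrape.com", "wait": 1.0}
render_url = f"{splash_base}?{urlencode(params)}"

# Fetching render_url (authenticated) should return the fully
# JavaScript-rendered HTML of the target page.
```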

Crawlera
~~~~~~~~

Using Crawlera is optional and not required for this books_to_scrape quickstart.

But, if you want to use Crawlera with Transistor, first register for the service and buy a subscription at `scrapinghub.com <https://scrapinghub.com>`_.

After registering for Crawlera, create accounts in scrapinghub.com for each region you would like to present a proxied IP address from. For our case, we are set up to handle three regions: ALL for global, China, and USA.

Next, you should set environment variables on your computer/server with the api key for each region you need, like below:

.. code-block:: python

   CRAWLERA_ALL = '<your crawlera account api key for ALL regions>'
   CRAWLERA_CN = '<your crawlera account api key for China region>'
   CRAWLERA_USA = '<your crawlera account api key for USA region>'
   CRAWLERA_REGIONS = 'CRAWLERA_ALL,CRAWLERA_USA,CRAWLERA_CN'

There are some utility functions helpful for working with Crawlera, found in ``transistor/utility/crawlera.py``, which require the CRAWLERA_REGIONS environment variable to be set. CRAWLERA_REGIONS should just be a comma-separated string of whichever region environment variables you have set.
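The comma-separated convention described above is easy to parse. This standalone sketch mimics what such a utility needs to do; it is not the actual code from ``transistor/utility/crawlera.py``, and the API key value is a made-up placeholder:

```python
import os

# Mimic the environment described above (values are placeholders).
os.environ["CRAWLERA_REGIONS"] = "CRAWLERA_ALL,CRAWLERA_USA,CRAWLERA_CN"
os.environ["CRAWLERA_ALL"] = "example-api-key-all"

# Split the comma-separated region list into individual variable names.
regions = [r.strip() for r in os.environ["CRAWLERA_REGIONS"].split(",") if r.strip()]

# Look up whichever region API keys are actually set in the environment.
api_keys = {name: os.environ[name] for name in regions if name in os.environ}
```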

Finally, to use Crawlera, you will need to pass a keyword argument like ``crawlera_user=<your api key>`` into your custom scraper spider, which has been subclassed from the ``SplashScraper`` class.
