
BAScraper

Table of Contents

  1. Introduction
  2. Features
  3. Installation and Basic Usage
  4. Parameters
  5. Rate Limits and Performance
  6. Returned JSON Object Structure

[!WARNING] Usage (classes and functions) has changed drastically, and this README has been rewritten to match. The old docs are in ./BAScraper_old/README_old.md.

This new v0.2.x-a has only been tested to the extent that I personally use it, so full-coverage testing has not been done. It also hasn't been published to PyPI (PyPI is still on v0.1.2), so manually download the newest v0.2-a, and please report any unexpected issues.

An API wrapper for PullPush.io and Arctic-Shift - the 3rd party replacement APIs for Reddit. Nothing special.

After the 2023 Reddit API controversy, PushShift.io (and wrappers built on it, such as PSAW and PMAW) is only available to Reddit admins, and Reddit's official PRAW is honestly useless when trying to get large amounts of data or data from a specific timeframe. This project aims to help with that, since these 3rd-party services didn't have any official/unofficial Python wrappers.

Features

  • Asynchronous operations for better performance (updated from the old multithreaded approach).
  • Support for PullPush.io and Arctic Shift APIs.
  • Parameter customization for subreddit, comment, and submission searches.
  • Integrated rate-limit management.
  • Parameter schemes for data selection.

Also, please respect cool-down times and refrain from requesting very large amounts of data. Doing so stresses the servers and can cause inconvenience for everyone.

For large amounts of data, head to Arctic Shift's academic torrent zst dumps instead.

Links to the services:

Installation and Basic Usage

You can install the package via pip:

pip install BAScraper

Python 3.12+ is required

Usage Example

from BAScraper.BAScraper_async import PullPushAsync, ArcticShiftAsync
import asyncio

ppa = PullPushAsync(log_stream_level="DEBUG", task_num=2)
asa = ArcticShiftAsync(log_stream_level="DEBUG", task_num=10)


async def test1():
    print('TEST 1-1 - PullPushAsync basic fetching')
    result1 = await ppa.fetch(
        mode='submissions',
        subreddit='cars',
        get_comments=True,
        after='2024-07-01',
        before='2024-07-01T06:00:00',
        file_name='test1-1'
    )
    print('test 1 len:', len(result1))

    print('\nTEST 1-2 - PullPushAsync basic comment fetching')
    result2 = await ppa.fetch(
        mode='comments',
        subreddit='cars',
        after='2024-07-01',
        before='2024-07-01T06:00:00',
        file_name='test1-2'
    )
    print('test 2 len:', len(result2))


async def test2():
    print('TEST 2-1 - ArcticShiftAsync basic fetching')
    result1 = await asa.fetch(
        mode='submissions_search',
        subreddit='cars',
        # get_comments=True,  # can be uncommented to fetch comments
        after='2024-07-01',
        before='2024-07-05T03:00:00',
        file_name='test2-1',
        fields=['created_utc', 'title', 'url', 'id'],
        limit=0  # auto
    )
    print('test 1 len:', len(result1))

    print('\nTEST 2-2 - ArcticShiftAsync basic comment fetching')
    result2 = await asa.fetch(
        mode='comments_search',
        subreddit='cars',
        body='bmw honda benz',
        after='2024-07-01',
        before='2024-07-01T12:00:00',
        file_name='test2-2',
        limit=100,
        fields=['created_utc', 'body', 'id'],
    )
    print('test 2 len:', len(result2))

    print('\nTEST 2-3 - ArcticShiftAsync subreddits_search')
    result3 = await asa.fetch(
        mode='subreddits_search',
        subreddit_prefix='what',
        file_name='test2-3',
        limit=1000
    )
    print('test 3 len:', len(result3))

if __name__ == '__main__':
    if input('test pullpush?: ') == 'y':
        asyncio.run(test1())
    if input('test arcticshift?: ') == 'y':
        asyncio.run(test2())

# all results are saved as '<file_name>.json' (e.g. 'test1-1.json') since the `file_name` field was specified.
# they are saved in the current directory since `save_dir` wasn't specified.
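Because runs are persisted to JSON when `file_name` is given, a finished run can be reloaded later without re-fetching. A minimal sketch of that round trip — the file name follows the example above, but the exact JSON structure shown here is an assumption, not BAScraper's documented schema:

```python
import json
import os
import tempfile

# hypothetical saved output: assume items keyed by their Reddit id
# (the structure is an assumption for illustration)
sample = {
    "abc123": {"title": "first post", "created_utc": 1719792000},
    "def456": {"title": "second post", "created_utc": 1719795600},
}

with tempfile.TemporaryDirectory() as d:
    # file_name='test1-1' would produce 'test1-1.json' in save_dir
    path = os.path.join(d, "test1-1.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)

    # reload the saved run instead of hitting the API again
    with open(path, encoding="utf-8") as f:
        results = json.load(f)

print(len(results))
```

Reloading saved output this way also helps respect the services' rate limits while iterating on analysis code.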

[!NOTE] When making multiple requests (as in multiple fetch calls under PullPushAsync), it is highly recommended to make them on the same instance, because all the request-pool-related variables are shared in that case.

Also, the pools recording request status are reset every time a script is re-run, so unexpected soft/hard rate limits may occur when frequently (re-)running scripts. Consider waiting a few seconds or minutes between runs if needed.
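The shared-pool recommendation can be pictured with a plain asyncio semaphore: every fetch launched from the same instance draws on one budget, while a second instance starts a fresh, unsynchronized budget. A toy sketch of the idea (not BAScraper's actual internals — `PacedClient` is a hypothetical stand-in):

```python
import asyncio


class PacedClient:
    """Toy stand-in for one wrapper instance: `task_num` caps in-flight requests."""

    def __init__(self, task_num: int = 3) -> None:
        self._sem = asyncio.Semaphore(task_num)
        self.request_count = 0  # shared state: every fetch on this instance counts here

    async def fetch(self, query: str) -> str:
        async with self._sem:  # all fetches on this instance share one pool
            self.request_count += 1
            await asyncio.sleep(0)  # stand-in for the real HTTP request
            return f"result for {query}"


async def main() -> tuple[list[str], int]:
    client = PacedClient(task_num=2)
    # many fetches on the SAME instance -> one shared request budget
    results = await asyncio.gather(*(client.fetch(q) for q in ("a", "b", "c")))
    return results, client.request_count


results, count = asyncio.run(main())
print(results, count)
```

Two separate `PacedClient` instances would each hold their own semaphore and counter, which is exactly why splitting requests across instances can trip server-side rate limits.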

Parameters

For more info on each of the parameters, as well as additional info (TOS, extra tools, etc.), visit the following links:

Initialization Parameters

for PullPushAsync.__init__ & ArcticShiftAsync.__init__

| Parameter        | Type | Restrictions                                                                   | Required | Default Value                   | Notes                                                     |
|------------------|------|--------------------------------------------------------------------------------|----------|---------------------------------|-----------------------------------------------------------|
| sleep_sec        | int  | Positive int                                                                   | No       | 1                               | Cooldown time between each request.                       |
| backoff_sec      | int  | Positive int                                                                   | No       | 3                               | Backoff time for each failed request.                     |
| max_retries      | int  | Positive int                                                                   | No       | 5                               | Number of retries for failed requests before it gives up. |
| timeout          | int  | Positive int                                                                   | No       | 10                              | Time until a request is considered a timeout error.       |
| pace_mode        | str  | One of 'auto-soft', 'auto-hard', 'manual'                                      | No       | 'auto-hard'                     | Sets the pace to mitigate rate-limiting.                  |
| save_dir         | str  | Valid path                                                                     | No       | os.getcwd() (current directory) | Directory to save the results.                            |
| task_num         | int  | Positive int                                                                   | No       | 3                               | Number of async tasks to be made.                         |
| log_stream_level | str  | One of 'NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'               | No       | 'INFO'                          | Sets the log level for logs streamed on the terminal.     |
| log_level        | str  | Same as log_stream_level                                                       | No       | 'DEBUG'                         | Sets the log level for logging (file).                    |
| duplicate_action | str  | One of 'keep_newest', 'keep_oldest', 'remove', 'keep_original', 'keep_removed' | No       | 'keep_newest'                   | Decides handling of duplicates.                           |
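As a quick sanity check of the defaults above, here is a hypothetical config dataclass mirroring the table (names, defaults, and allowed values are copied from the table; this is not BAScraper's actual implementation):

```python
import os
from dataclasses import dataclass, field


@dataclass
class InitConfig:
    # defaults copied from the initialization-parameters table
    sleep_sec: int = 1
    backoff_sec: int = 3
    max_retries: int = 5
    timeout: int = 10
    pace_mode: str = 'auto-hard'
    save_dir: str = field(default_factory=os.getcwd)
    task_num: int = 3
    log_stream_level: str = 'INFO'
    log_level: str = 'DEBUG'
    duplicate_action: str = 'keep_newest'

    def __post_init__(self) -> None:
        # enforce the table's "Restrictions" column
        if self.pace_mode not in ('auto-soft', 'auto-hard', 'manual'):
            raise ValueError(f"invalid pace_mode: {self.pace_mode!r}")
        if self.duplicate_action not in (
            'keep_newest', 'keep_oldest', 'remove', 'keep_original', 'keep_removed'
        ):
            raise ValueError(f"invalid duplicate_action: {self.duplicate_action!r}")


# mirrors the ArcticShiftAsync(...) call in the usage example above
cfg = InitConfig(task_num=10, log_stream_level='DEBUG')
print(cfg.pace_mode, cfg.task_num)
```

Every parameter is optional, so `InitConfig()` with no arguments reproduces the table's default column exactly.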

Fetch Parameters (fetch)

PullPushAsync.fetch common parameters

| Parameter | Type | Restrictions                                           | Required | Notes                                      |
|-----------|------|--------------------------------------------------------|----------|--------------------------------------------|
| q         | str  | Quoted string for phrases                              | No       | Search query for comments or submissions.  |
| ids       | list | Maximum length: 100                                    | No       | List of IDs to fetch.                      |
| size      | int  | Must be <= 100                                         | No       | Number of results to return.               |
| sort      | str  | Must be one of "asc", "desc"                           | No       | Sorting order.                             |
| sort_type | str  | Must be one of "score", "num_comments", "created_utc"  | No       | Sorting criteria.                          |
| author    | str  | None                                                   | No       | Filter by author.                          |
| subreddit | str  | None                                                   | No       | Filter by subreddit.                       |
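The restrictions column above can be checked before a request ever leaves the client. A hypothetical validation helper illustrating those rules (`build_params` is made up for illustration and is not BAScraper's actual validation code):

```python
def build_params(**kwargs):
    """Validate PullPush-style fetch parameters against the table's restrictions.

    Illustrative only -- not BAScraper's actual internals.
    """
    if 'size' in kwargs and kwargs['size'] > 100:
        raise ValueError('size must be <= 100')
    if 'ids' in kwargs and len(kwargs['ids']) > 100:
        raise ValueError('ids may hold at most 100 entries')
    if 'sort' in kwargs and kwargs['sort'] not in ('asc', 'desc'):
        raise ValueError("sort must be 'asc' or 'desc'")
    if 'sort_type' in kwargs and kwargs['sort_type'] not in (
        'score', 'num_comments', 'created_utc'
    ):
        raise ValueError("sort_type must be 'score', 'num_comments', or 'created_utc'")
    # drop unset values so only explicit filters reach the query string
    return {k: v for k, v in kwargs.items() if v is not None}


params = build_params(subreddit='cars', sort='desc', sort_type='score', size=100)
print(params)
```

Failing fast like this is cheaper than spending a request (and a slot in the rate-limit pool) on a call the API would reject anyway.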
