# BAScraper

An asynchronous Python Reddit API wrapper for fetching posts and comments from Reddit for data analytics. Utilizes PullPush and Arctic-Shift.
## Table of Contents
- Introduction
- Features
- Installation and Basic Usage
- Parameters
- Rate Limits and Performance
- Returned JSON Object Structure
> [!WARNING]
> Usage (classes and functions) has changed drastically, as has this README; the old docs are in `./BAScraper_old/README_old.md`. The new v0.2.x-a is only tested to the extent that I personally use it, so full-coverage testing has not been done. It also hasn't been published to PyPI (PyPI is on v0.1.2), so download manually for the newest v0.2-a, and please report any unexpected issues.
An API wrapper for PullPush.io and Arctic Shift, the third-party replacement APIs for Reddit. Nothing special.

After the 2023 Reddit API controversy, PushShift.io (and wrappers such as PSAW and PMAW) became available only to Reddit admins, and Reddit's PRAW is honestly impractical when trying to get a lot of data, or data from a specific timeframe. This project aims to help with that, since these third-party services didn't have any official or unofficial Python wrappers.
## Features
- Asynchronous operations for better performance (updated from the old multithreaded approach).
- Support for PullPush.io and Arctic Shift APIs.
- Parameter customization for subreddit, comment, and submission searches.
- Integrated rate-limit management.
- Parameter schemes for data selection.
Also, please respect cool-down times and refrain from requesting very large amounts of data; it stresses the server and can cause inconvenience for everyone.

For large amounts of data, head to Arctic Shift's academic torrent zst dumps.
Links to the services:
## Installation and Basic Usage

You can install the package via pip:

```
pip install BAScraper
```

Python 3.12+ is required.
### Usage Example

```python
from BAScraper.BAScraper_async import PullPushAsync, ArcticShiftAsync
import asyncio

ppa = PullPushAsync(log_stream_level="DEBUG", task_num=2)
asa = ArcticShiftAsync(log_stream_level="DEBUG", task_num=10)

async def test1():
    print('TEST 1-1 - PullPushAsync basic fetching')
    result1 = await ppa.fetch(
        mode='submissions',
        subreddit='cars',
        get_comments=True,
        after='2024-07-01',
        before='2024-07-01T06:00:00',
        file_name='test1-1'
    )
    print('test 1 len:', len(result1))

    print('\nTEST 1-2 - PullPushAsync basic comment fetching')
    result2 = await ppa.fetch(
        mode='comments',
        subreddit='cars',
        after='2024-07-01',
        before='2024-07-01T06:00:00',
        file_name='test1-2'
    )
    print('test 2 len:', len(result2))

async def test2():
    print('TEST 2-1 - ArcticShiftAsync basic fetching')
    result1 = await asa.fetch(
        mode='submissions_search',
        subreddit='cars',
        # get_comments=True,  # can be uncommented to fetch comments
        after='2024-07-01',
        before='2024-07-05T03:00:00',
        file_name='test2-1',
        fields=['created_utc', 'title', 'url', 'id'],
        limit=0  # auto
    )
    print('test 1 len:', len(result1))

    print('\nTEST 2-2 - ArcticShiftAsync basic comment fetching')
    result2 = await asa.fetch(
        mode='comments_search',
        subreddit='cars',
        body='bmw honda benz',
        after='2024-07-01',
        before='2024-07-01T12:00:00',
        file_name='test2-2',
        limit=100,
        fields=['created_utc', 'body', 'id'],
    )
    print('test 2 len:', len(result2))

    print('\nTEST 2-3 - ArcticShiftAsync subreddits_search')
    result3 = await asa.fetch(
        mode='subreddits_search',
        subreddit_prefix='what',
        file_name='test2-3',
        limit=1000
    )
    print('test 3 len:', len(result3))

if __name__ == '__main__':
    if input('test pullpush?: ') == 'y':
        asyncio.run(test1())
    if input('test arcticshift?: ') == 'y':
        asyncio.run(test2())

# all results are saved as JSON files named after the `file_name` arguments.
# they are saved in the current directory since `save_dir` wasn't specified.
```
> [!NOTE]
> When making multiple requests (as with multiple calls on `PullPushAsync`), it is highly recommended to make them through the same instance, because all the request-pool-related variables are shared in that case. Also, the pools recording request status are reset every time a script is re-run, so unexpected soft/hard rate limits may occur when frequently (re-)running scripts. Consider waiting a few seconds or minutes before re-running scripts if needed.
## Parameters

For more info on each of the parameters, as well as additional info (TOS, extra tools, etc.), visit the following links:
### Initialization Parameters

For `PullPushAsync.__init__` & `ArcticShiftAsync.__init__`:
| Parameter | Type | Restrictions | Required | Default Value | Notes |
|--------------------|-------|------------------------------------------------------------------------------------------|----------|-----------------------------------|-----------------------------------------------------------|
| sleep_sec | int | Positive int | No | 1 | Cooldown time between each request. |
| backoff_sec | int | Positive int | No | 3 | Backoff time for each failed request. |
| max_retries | int | Positive int | No | 5 | Number of retries for failed requests before it gives up. |
| timeout | int | Positive int | No | 10 | Time until it's considered as timeout error. |
| pace_mode | str | One of 'auto-soft', 'auto-hard', 'manual' | No | 'auto-hard' | Sets the pace to mitigate rate-limiting. |
| save_dir | str | Valid path | No | os.getcwd() (current directory) | Directory to save the results. |
| task_num | int | Positive int | No | 3 | Number of async tasks to be made. |
| log_stream_level | str | One of ['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'] | No | 'INFO' | Sets the log level for logs streamed on the terminal. |
| log_level | str | Same as log_stream_level | No | 'DEBUG' | Sets the log level for logging (file). |
| duplicate_action | str | One of 'keep_newest', 'keep_oldest', 'remove', 'keep_original', 'keep_removed' | No | 'keep_newest' | Decides handling of duplicates. |
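As a sketch of how these initialization parameters might be combined (the parameter names and allowed values are taken from the table above; the commented-out constructor call assumes BAScraper is installed):

```python
# Hypothetical configuration for PullPushAsync.__init__ / ArcticShiftAsync.__init__;
# every key below corresponds to a row in the table above.
init_kwargs = dict(
    sleep_sec=2,                     # 2 s cooldown between requests (default: 1)
    backoff_sec=5,                   # wait 5 s after a failed request (default: 3)
    max_retries=3,                   # give up after 3 failed retries (default: 5)
    pace_mode='auto-soft',           # one of 'auto-soft', 'auto-hard', 'manual'
    save_dir='./results',            # directory where result JSON files are written
    task_num=5,                      # number of concurrent async tasks
    log_stream_level='INFO',         # log level for terminal output
    duplicate_action='keep_oldest',  # keep the oldest copy of duplicate entries
)

# with BAScraper installed, this would construct the client:
# from BAScraper.BAScraper_async import PullPushAsync
# ppa = PullPushAsync(**init_kwargs)
```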
### Fetch Parameters (`fetch`)

`PullPushAsync.fetch` common parameters:
| Parameter | Type | Restrictions | Required | Notes |
|-------------|--------|-------------------------------------------------------------|----------|-------------------------------------------|
| q | str | Quoted string for phrases | No | Search query for comments or submissions. |
| ids | list | Maximum length: 100 | No | List of IDs to fetch. |
| size | int | Must be <= 100 | No | Number of results to return. |
| sort | str | Must be one of "asc", "desc" | No | Sorting order. |
| sort_type | str | Must be one of "score", "num_comments", "created_utc" | No | Sorting criteria. |
| author | str | None | No | Filter by author. |
| subreddit   | str    | None                                                        | No       | Filter by subreddit.                      |
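Assuming the constraints in the table above, a call combining the common fetch parameters might look like this (a sketch only; the author name is hypothetical, and the call itself is commented out since it needs a live client and network access):

```python
# Hypothetical keyword arguments for PullPushAsync.fetch, built from the
# common-parameter table above.
fetch_kwargs = dict(
    mode='comments',
    subreddit='cars',
    q='"electric vehicle"',  # quoted string searches for the exact phrase
    author='example_user',   # hypothetical author name, for illustration
    size=100,                # must be <= 100
    sort='desc',             # 'asc' or 'desc'
    sort_type='score',       # 'score', 'num_comments', or 'created_utc'
)

# with a live client this would run as:
# result = await ppa.fetch(**fetch_kwargs)
```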