Datascience
Curated list of Python resources for data science.
Install / Use
/learn @r0f1/Datascience
Awesome Data Science with Python
A curated list of awesome resources for practicing data science using Python, including not only libraries, but also links to tutorials, code snippets, blog posts and talks.
Core
pandas - Data structures built on top of numpy.
scikit-learn - Core ML library. See also intelex (Intel extension for scikit-learn).
matplotlib - Plotting library.
seaborn - Data visualization library based on matplotlib.
ydata-profiling - Descriptive statistics using ProfileReport.
sklearn_pandas - Helpful DataFrameMapper class.
missingno - Missing data visualization.
rainbow-csv - VSCode plugin to display .csv files with nice colors.
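As a quick illustration of how the first entries fit together, the sketch below builds a small pandas DataFrame on top of a NumPy array and reads the data back out as an array. The column names and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, two numeric columns.
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
df = pd.DataFrame(values, columns=["x", "y"])

# Column-wise means, computed by pandas.
means = df.mean()

# The underlying data is still accessible as a plain NumPy array.
as_array = df.to_numpy()

print(means["x"], means["y"])
print(as_array.shape)
```

From here, `df.describe()` gives a fuller set of descriptive statistics, and the plotting and profiling libraries in this list accept the same DataFrame as input.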
General Python Programming
Advanced Python Features - Generics, Protocols, Structural Pattern Matching and more.
uv - Dependency management.
pdm - For large binary distributions, works with uv.
just - Command runner. Replacement for make.
python-dotenv - Manage environment variables.
structlog - Python logging.
more_itertools - Extension of itertools.
tqdm - Progress bars for for-loops. Also supports pandas apply().
hydra - Configuration management.
Pandas Tricks, Alternatives and Additions
duckdb - Efficiently run SQL queries on pandas DataFrames, duckplyr for R, Great Intro.
ducklake - DuckDB extension for storing data in a data lake.
fireducks - Speedier alternative to pandas with similar API.
pandasvault - Large collection of pandas tricks.
polars - Multi-threaded alternative to pandas.
xarray - Extends pandas to n-dimensional arrays.
mlx - An array framework for Apple silicon.
pandas_flavor - Write custom accessors like .str and .dt.
daft - Distributed DataFrame.
vaex - Out-of-Core DataFrames.
modin - Parallelization library for faster pandas DataFrames.
swifter - Apply any function to a pandas DataFrame faster (works with modin).
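To make the idea of custom accessors (in the style of `.str` and `.dt`) concrete, here is a minimal sketch using the registration decorator that ships with pandas itself, rather than pandas_flavor. The accessor name `spread`, the method `value_range`, and the sample data are all hypothetical.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("spread")
class SpreadAccessor:
    """Hypothetical accessor, reachable as df.spread on any DataFrame."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def value_range(self):
        """Return max minus min for each numeric column."""
        numeric = self._obj.select_dtypes("number")
        return numeric.max() - numeric.min()


# Made-up sample data; the string column is ignored by value_range().
df = pd.DataFrame(
    {"a": [1, 4, 9], "b": [2.0, 2.5, 3.0], "label": ["x", "y", "z"]}
)
ranges = df.spread.value_range()
print(ranges)
```

Grouping helpers under an accessor namespace keeps them from colliding with existing DataFrame attributes, which is the main reason to prefer it over attaching methods directly.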
Tables
great-tables - Display tabular data nicely.
Interactive Dataframe Visualization
pygwalker - Interactive dataframe.
marimo - Visualization and reproducible environment.
lux - DataFrame visualization within Jupyter.
dtale - View and analyze Pandas data structures, integrating with Jupyter.
pandasgui - GUI for viewing, plotting and analyzing Pandas DataFrames.
quak - Scalable, interactive data table, twitter.
Environment and Jupyter
Jupyter Tricks
nteract - Open Jupyter Notebooks with a double-click.
papermill - Parameterize and execute Jupyter notebooks, tutorial.
nbdime - Diff two notebook files. Alternative GitHub app: ReviewNB.
RISE - Turn Jupyter notebooks into presentations.
handcalcs - More convenient way of writing mathematical equations in Jupyter.
notebooker - Productionize and schedule Jupyter Notebooks.
voila - Turn Jupyter notebooks into standalone web applications. Voila grid layout.
Jupyter Alternatives
positron - Data Science IDE.
Deepnote - Data Science platform with real-time collaboration, environment management.
Extraction
textract - Extract text from any document.
Big Data
spark - DataFrame for big data, cheatsheet, tutorial.
dask, dask-ml - Pandas DataFrame for big data and machine learning library, resources, talk1, talk2, notebooks, videos.
h2o - Helpful H2OFrame class for out-of-memory dataframes.
cuDF - GPU DataFrame Library, Intro.
cupy - NumPy-like API accelerated with CUDA.
ray - Flexible, high-performance distributed execution framework.
bottleneck - Fast NumPy array functions written in C.
petastorm - Data access library for parquet files by Uber.
zarr - Distributed NumPy arrays.
NVTabular - Feature engineering and preprocessing library for tabular data by Nvidia.
tensorstore - Reading and writing large multi-dimensional arrays (Google).
Command line tools, CSV
csvkit - Command line tool for CSV files.
csvsort - Sort large csv files.
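Sorting a CSV file that does not fit in memory is usually done with an external merge sort: sort bounded chunks, spill them to disk, then merge. The sketch below shows that general technique using only the standard library. It is not the API or implementation of csvsort or csvkit; the function names, the `chunk_rows` parameter, and the sample data are assumptions made for illustration.

```python
import csv
import heapq
import os
import tempfile


def sort_csv(in_path, out_path, column, chunk_rows=1000):
    """Sort a CSV file by one column (string ordering) with bounded memory."""
    chunk_paths = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        key_index = header.index(column)

        def sort_key(row):
            return row[key_index]

        # Phase 1: read bounded chunks, sort each in memory, spill to disk.
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_rows:
                chunk_paths.append(_write_chunk(chunk, sort_key))
                chunk = []
        if chunk:
            chunk_paths.append(_write_chunk(chunk, sort_key))

    # Phase 2: k-way merge of the already-sorted chunk files.
    files = [open(path, newline="") for path in chunk_paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            writer.writerows(heapq.merge(*readers, key=sort_key))
    finally:
        for f in files:
            f.close()
        for path in chunk_paths:
            os.remove(path)


def _write_chunk(rows, sort_key):
    """Sort one chunk in memory and write it to a temporary CSV file."""
    rows.sort(key=sort_key)
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as tmp:
        csv.writer(tmp).writerows(rows)
    return path


# Demo with made-up data and a tiny chunk size to force several chunks.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "in.csv")
    dst_path = os.path.join(tmp_dir, "out.csv")
    with open(src_path, "w", newline="") as f:
        csv.writer(f).writerows(
            [["name", "city"], ["carol", "Oslo"], ["alice", "Rome"],
             ["dave", "Bonn"], ["bob", "Lima"]]
        )
    sort_csv(src_path, dst_path, column="name", chunk_rows=2)
    with open(dst_path, newline="") as f:
        sorted_rows = list(csv.reader(f))
```

Only one chunk plus one row per chunk file is held in memory at a time, so peak memory depends on `chunk_rows` rather than on the size of the input file.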
Classical Statistics
Books
Lakens - Improving Your Statistical Inferences - Testing, Effect Sizes, Confidence Intervals, Sample Size, Equivalence Testing, Sequential Analysis, Github
Models Demystified - From Linear Regression to Deep Learning. Github.
The Math Behind Artificial Intelligence - Engineering-focused book covering linear algebra, calculus, probability & statistics, and optimization theory with Python examples.
Datasets
Rdatasets - Collection of more than 2000 datasets, stored as csv files (R package).
crimedatasets - Datasets focused on crimes, criminal activities (R package).
educationr - Datasets related to education (performance, learning methods, test scores, absenteeism) (R package).
MedDataSets - Datasets related to medicine, diseases, treatments, drugs, and public health (R package).
oncodatasets - Datasets focused on cancer research, survival rates, genetic studies, biomarkers, epidemiology (R package).
timeseriesdatasets_R - Time series datasets (R package).
usdatasets - US-exclusive datasets (crime, economics, education, finance, energy, healthcare) (R package).
economic datasets - Economic datasets.
p-values
The ASA Statement on p-Values: Context, Process, and Purpose
Greenland - Statistical tests, P-values, confidence intervals, and power: a guide to misinterpretations
Rubin - Inconsistent multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses
Gigerenzer - Mindless Statistics
Rubin - That's not a two-sided test! It's two one-sided tests! (TOST)
Lakens - How were we supposed to move beyo