16 skills found
droyed / Eucl Dist: Euclidean distance computation in Python, with 4x-100x+ speedups over SciPy and scikit-learn. Also leverages the GPU for better performance on specific datasets.
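Speedups of this kind over the generic per-pair routines usually come from recasting pairwise distances as matrix algebra via the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b, so one matrix multiplication replaces the nested loop. A minimal NumPy sketch of that trick (an illustrative assumption, not this repository's actual code):

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Pairwise Euclidean distances between rows of A (m, d) and B (n, d).

    Uses ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b so the inner loop
    becomes a single matrix multiplication.
    """
    sq_a = np.sum(A * A, axis=1)[:, None]   # (m, 1) squared norms
    sq_b = np.sum(B * B, axis=1)[None, :]   # (1, n) squared norms
    d2 = sq_a + sq_b - 2.0 * (A @ B.T)      # (m, n) squared distances
    np.maximum(d2, 0.0, out=d2)             # clamp tiny negatives from round-off
    return np.sqrt(d2)
```

The round-off clamp matters: the subtraction can produce slightly negative squared distances in floating point, which is also why such implementations trade a little accuracy for speed compared to `scipy.spatial.distance.cdist`.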
juancarlospaco / Thatlib: Faster pathlib for Python.
nicholasharris / GPU Parallel Genetic Algorithm Using CUDA With Python Numba: Implementation of a GPU-parallel genetic algorithm using CUDA with Python Numba for significant speedup.
sunsetcoder / Flash Attention Windows: Flash Attention 2 pre-built wheels for Windows. Drop-in replacement for PyTorch attention providing up to 10x speedup and 20x memory reduction. Compatible with Python 3.10 and CUDA 11.7+. No build setup required: just pip install and accelerate your transformer models. Supports modern NVIDIA GPUs (RTX 30/40, A100, H100).
Naereen / Lempel Ziv Complexity: Lempel-Ziv complexity, fast implementations in Python (naive, Numba, or Cython for speedup), open source (MIT).
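The naive variant of this measure can be sketched as a greedy left-to-right parse that counts distinct phrases, where each phrase is grown until it has not been seen before. A minimal version under that assumption (not necessarily the repository's exact code):

```python
def lempel_ziv_complexity(sequence):
    """Naive greedy Lempel-Ziv complexity of a string: the number of
    distinct phrases produced by a left-to-right parse in which each
    candidate phrase is extended until it is new."""
    phrases = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        phrase = sequence[ind:ind + inc]
        if phrase in phrases:
            inc += 1          # seen before: grow the candidate phrase
        else:
            phrases.add(phrase)
            ind += inc        # commit the phrase, start the next one
            inc = 1
    return len(phrases)
```

The Numba and Cython versions mentioned in the description speed up exactly this inner scan, which is quadratic in the worst case for the naive string-slicing form.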
Ambeteco / Faster Os: A 6800% faster drop-in replacement for Python's standard os module: fully rewritten, optimized, and sped-up functions that replace those in os.path.
derlih / Levdist: A Python implementation of the Levenshtein distance algorithm, with an MIT license, type annotations, and speedups.
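The Levenshtein distance such packages compute is the classic edit-distance dynamic program; a minimal pure-Python sketch (the kind of baseline that compiled "speedups" accelerate), assumed for illustration rather than taken from this package:

```python
def levenshtein(a, b):
    """Levenshtein edit distance between strings a and b.

    Classic dynamic programming in O(len(a) * len(b)) time, keeping
    only one rolling row of the DP table in memory.
    """
    if len(a) < len(b):
        a, b = b, a                         # keep the inner row short
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```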
paulgavrikov / Parallel Matplotlib Grid: This Python 3 module helps you speed up generation of subplots in pseudo-parallel mode using matplotlib and multiprocessing. This can be useful if you are dealing with expensive preprocessing or plotting tasks, such as violin plots per subplot.
PhoenixWebGroupJay / Google Diff Match Patch C: The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text. Diff: compare two blocks of plain text and efficiently return a list of differences. Match: given a search string, find its best fuzzy match in a block of plain text, weighted for both accuracy and location. Patch: apply a list of patches onto plain text, making a best effort to apply each patch even when the underlying text doesn't match. Currently available in Java, JavaScript, Dart, C++, C#, Objective-C, Lua, and Python. Regardless of language, each library features the same API and the same functionality, and all versions have comprehensive test harnesses. This library implements Myers' diff algorithm, generally considered the best general-purpose diff, surrounded by a layer of pre-diff speedups and post-diff cleanups that improve both performance and output quality. It also implements a Bitap matching algorithm at the heart of a flexible matching and patching strategy.
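The three operations described (diff, fuzzy match, patch) map loosely onto Python's standard library difflib, which gives a self-contained way to see them in action. This is a rough stdlib analogue of the ideas, not the diff-match-patch API itself:

```python
import difflib

old = "The quick brown fox jumps over the lazy dog"
new = "The quick red fox leaps over the lazy dog"

# Diff: compare two texts and list the differing spans as opcodes.
sm = difflib.SequenceMatcher(None, old, new)
ops = [op for op in sm.get_opcodes() if op[0] != "equal"]

# Match: score how closely a pattern resembles a candidate substring
# (diff-match-patch's Bitap matcher plays a similar fuzzy-scoring role).
ratio = difflib.SequenceMatcher(None, "brown fox", "red fox").ratio()

# "Patch": rebuild the new text from the old one plus the opcodes,
# taking replaced/inserted spans from `new` and equal spans from `old`.
rebuilt = "".join(
    new[j1:j2] if tag in ("replace", "insert") else old[i1:i2]
    for tag, i1, i2, j1, j2 in sm.get_opcodes()
    if tag != "delete"
)
```

Unlike difflib, diff-match-patch's patching is best-effort against text that has drifted from the original, which is what makes it suitable for synchronization.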
sean-engelstad / A2d Shells: Explore ways to speed up linear static, linear buckling, and nonlinear static structural analyses using GPUs in C++ and Python.
scivision / Asyncio Subprocess Examples: Examples of the speedup from Python asyncio subprocesses.
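The speedup with asyncio subprocesses comes from launching external commands concurrently and awaiting them together, so total wall time approaches that of the slowest command rather than the sum. A minimal sketch of the pattern (illustrative, not taken from the repository):

```python
import asyncio
import sys

async def run(cmd_args):
    """Run one subprocess and capture its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd_args, stdout=asyncio.subprocess.PIPE
    )
    out, _ = await proc.communicate()
    return out.decode().strip()

async def main():
    # Three interpreter invocations run concurrently, not back-to-back.
    cmds = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(3)]
    return await asyncio.gather(*(run(c) for c in cmds))

results = asyncio.run(main())
```

`asyncio.gather` is what turns the sequential loop into concurrent execution; each `await proc.communicate()` yields control while its process runs.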
arsenyinfo / Python Speedup Benchmark: No description available.
kitsuyaazuma / Pyconjp2025: Demo code for the PyConJP 2025 talk "Beyond Multiprocessing: A Real-World ML Workload Speedup with Python 3.13+ Free-Threading".
RomanAILabs-Auth / RomaPy: RomaPy is a Python execution engine that uses LLVM JIT compilation to accelerate hot paths at runtime. It delivers massive speedups over CPython and reaches ~20% of native Rust performance on numeric workloads, without rewriting code or leaving Python.
sandyherho / Kelvin Helmholtz 2d Solver: High-performance 2D solver for Kelvin-Helmholtz instability in incompressible flows, featuring Numba JIT compilation for 10-50x speedup and multi-core parallelization. Includes predefined scenarios, NetCDF output, automated visualization, and both a CLI and a Python API for geophysical fluid dynamics studies.
MadhavSingh2236 / Parallization Of CPU And GPU For Plant Disease Detection: Image classification algorithms such as convolutional neural networks (CNNs), used to classify huge image datasets, take a long time to perform convolution operations, increasing the computational demand of image processing. Compared to a CPU, a graphics processing unit (GPU) is a good way to accelerate image processing; parallelizing across multiple CPU cores is another, and increasing system memory (RAM) can also reduce computation time. Comparing the two architectures, a CPU consists of a few cores optimized for sequential processing, whereas a GPU has thousands of relatively simple cores clocked at roughly 1 GHz. Processing real-time images has always been cumbersome, and studies show two ways to attack it: the CPU and the GPU; to obtain the highest possible performance, both have to be used at the same time. The aim of this project is to compare the performance of parallelized CPUs against a GPU for image processing. In recent years, parallel computing and soft computing have become rapidly evolving fields of study, and the demand for parallel processing is increasing day by day.
There are various software tools and libraries for parallelizing programs. For example, OpenMP in C++ is an application programming interface for shared-memory programming that also supports Fortran and C. Python has its own parallel processing module, Multiprocessing, which can spawn multiple processes and lets the programmer fully leverage the computing power of multiple processors. Its main drawbacks are that it cannot handle large numeric data, cannot be used with deep learning frameworks such as Keras (it decreases model accuracy), and does not support shared variables. Python also has a parallel and distributed computing framework called Ray, which can be used to develop emerging AI applications such as image classification and face recognition; parallelizing multiple CPU cores with Ray can increase the speedup of a model significantly. The benchmark image classification algorithm used in this project is a convolutional neural network, and the dataset is the Plant Disease Image Dataset, containing around 30,000 images. The system is configured with 16 GB of RAM, 4 CPU cores, and a Tesla P100 GPU. This project compares the performance of 2-core, 3-core, and 4-core parallelized CPUs with the GPU; the GPU implementation achieves an 80% speedup over the CPU implementation.
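The Multiprocessing module described above spawns separate worker processes, each with its own interpreter, so CPU-bound work sidesteps the GIL, at the cost of pickling arguments and results across process boundaries (the limitation with large numeric data noted above). A minimal sketch of the core pattern; Ray's `@ray.remote` tasks play an analogous role at cluster scale:

```python
from multiprocessing import Pool

def square(x):
    """Stand-in for CPU-bound per-item work (e.g. per-image preprocessing)."""
    return x * x

def parallel_squares(values, workers=4):
    """Map `square` over `values` using a pool of worker processes.

    Each argument and result is pickled to cross the process boundary,
    which is the overhead that makes this approach unattractive for
    large numeric arrays.
    """
    with Pool(processes=workers) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares(range(8)))
```

The `workers` count corresponds to the 2-core, 3-core, and 4-core configurations the project benchmarks.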