vllm-project / vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
ai-dynamo / Dynamo: A datacenter-scale distributed inference serving framework
skyzh / Tiny LLM: A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
ModelTC / LightLLM: LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
containers / RamaLama: RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
awslabs / Multi Model Server: Multi Model Server is a tool for serving neural net models for inference
SeldonIO / MLServer: An inference server for your machine learning models, including support for multiple frameworks, multi-model serving, and more
EricLBuehler / Candle vLLM: An efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server (see the client sketch after this list).
NLPOptimize / Flash Tokenizer: An efficient and optimized tokenizer engine for LLM inference serving
hpcaitech / SwiftInfer: Efficient AI inference & serving
cuckoo-network / Cuckoo: Cuckoo is a decentralized AI model-serving platform, starting with GPU sharing for text-to-image generation and LLM inference.
aws / SageMaker Containers: WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
aws / SageMaker PyTorch Inference Toolkit: Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
psmarter / Mini Infer: An LLM inference engine built from scratch: paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graphs, tensor parallelism, MoE expert parallelism, and OpenAI-compatible serving (a toy paged-KV-cache sketch follows this list).
anyscale / E2E LLM Workflows: Fine-tune an LLM to perform batch inference and online serving.
anyscale / Multimodal AI: Multimodal AI workloads: batch inference, model training, and online serving.
SJTU-IPADS / REEF: REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.
yuanmu97 / Secure Transformer Inference: [NDSS 2026] Secure Transformer Inference is a protocol for serving Transformer-based models securely.
stanford-mast / INFaaS: Model-less inference serving
aerlabsAI / AI Inference Resources: A curated collection of AI inference engineering resources covering LLM serving, GPU kernels, quantization, distributed inference, and production deployment. Compiled from the AER Labs community.
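Several entries above (vLLM, Candle vLLM, Mini Infer) advertise an OpenAI-compatible API, which means the stock `openai` Python client can drive them by pointing its base URL at the local server. A minimal sketch, assuming a server already running on localhost port 8000 (vLLM's default) and a placeholder model id; both values are assumptions to adjust for your setup:

```python
from openai import OpenAI

# Point the official OpenAI client at a local OpenAI-compatible server.
# base_url and model below are assumptions: match them to the host, port,
# and model your serving engine actually loaded.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # e.g. `vllm serve` listens on 8000 by default
    api_key="EMPTY",  # local servers usually ignore the key, but the client requires one
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "In one sentence, what is continuous batching?"}],
)
print(response.choices[0].message.content)
```

Because the request shape is the standard Chat Completions schema, the same snippet works unchanged against any of the OpenAI-compatible servers in this list.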
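Mini Infer's description names the core vLLM-style serving techniques. As a rough illustration of the first one, a paged KV cache splits the cache into fixed-size blocks and maps each sequence's logical token positions onto whatever physical blocks happen to be free, so memory is claimed on demand instead of reserved for a worst-case length. The toy block table below is illustrative only; names and sizes are invented, not taken from any repo above.

```python
# Toy paged-KV-cache block table (illustrative; not the implementation of
# vLLM, Mini Infer, or any other repo listed above).

BLOCK_SIZE = 16  # tokens per physical block (hypothetical choice)


class BlockTable:
    """Maps each sequence's logical token positions to physical cache blocks."""

    def __init__(self, num_physical_blocks: int) -> None:
        self.free_blocks = list(range(num_physical_blocks))
        self.tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for a new token, allocating lazily."""
        table = self.tables.setdefault(seq_id, [])
        block_idx, offset = divmod(position, BLOCK_SIZE)
        if block_idx == len(table):  # crossed into a new logical block
            if not self.free_blocks:
                raise MemoryError("pool exhausted; a real engine would preempt or swap")
            table.append(self.free_blocks.pop())
        return table[block_idx], offset

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))


# Two sequences share one physical pool; blocks are handed out only as needed.
bt = BlockTable(num_physical_blocks=4)
for pos in range(20):      # sequence 0 grows past one block (20 > BLOCK_SIZE)
    bt.append_token(0, pos)
bt.append_token(1, 0)      # sequence 1 claims its own first block
bt.free(0)                 # finishing sequence 0 recycles its two blocks
```

The point of the indirection is block-granular allocation: short and long requests can be batched together without fragmenting GPU memory, which is what makes the continuous batching named in the same entry practical.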