MultiQueryAttention

A simple PyTorch implementation of high-performance Multi-Query Attention.

Multi Query Attention (MQA) is a Python package that implements the Multi-Query self-attention mechanism, in which every query head attends over a single shared key head and value head instead of each head having its own.
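To make the mechanism concrete, here is a minimal sketch of multi-query attention in plain PyTorch. It is an illustration only, not this package's code: the function name multi_query_attention and the weight tensors w_q and w_kv are hypothetical, and masking, dropout, and the output projection are left out.

```python
import math
import torch

def multi_query_attention(x, w_q, w_kv, heads):
    """Illustrative sketch: `heads` query heads, one shared key/value head."""
    batch, seq, d_model = x.shape
    head_dim = d_model // heads

    # Queries get one projection per head: (batch, heads, seq, head_dim)
    q = (x @ w_q).view(batch, seq, heads, head_dim).transpose(1, 2)

    # Keys and values share a single head: (batch, 1, seq, head_dim)
    k, v = (x @ w_kv).split(head_dim, dim=-1)
    k = k.unsqueeze(1)
    v = v.unsqueeze(1)

    # Scaled dot-product scores; k broadcasts across the head dimension
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    weights = torch.softmax(scores, dim=-1)  # (batch, heads, seq, seq)

    out = weights @ v  # (batch, heads, seq, head_dim)
    return out.transpose(1, 2).reshape(batch, seq, d_model), weights

torch.manual_seed(0)
d_model, heads = 512, 8
x = torch.rand(4, 10, d_model)
w_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
w_kv = torch.randn(d_model, 2 * (d_model // heads)) / math.sqrt(d_model)

out, weights = multi_query_attention(x, w_q, w_kv, heads)
print(out.shape)      # torch.Size([4, 10, 512])
print(weights.shape)  # torch.Size([4, 8, 10, 10])
```

Because the key/value projection here has width 2 * head_dim rather than 2 * d_model, the cached keys and values are smaller by a factor of heads than in standard multi-head attention, which is the usual motivation for MQA.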

Installation

Install MultiQueryAttention with pip:

pip install mqa

Usage

Here is a simple example of how to initialize and use the MultiQueryAttention class.

import torch
from mqa import MultiQueryAttention

x = torch.rand(4, 10, 512).to('cuda')

attn = MultiQueryAttention(
    d_model=512,
    heads=8,
    attn_impl="triton",
    attn_pdrop=0.1,
    device="cuda"
)

# Forward pass: returns the output, the attention weights, and the past key/value vectors
output, attn_weights, past_key_values = attn(x)

Class Documentation

MultiQueryAttention

The MultiQueryAttention class is the core component of this package and provides an implementation of the Multi-Query self-attention mechanism.

Initialization

The MultiQueryAttention class is initialized with the following parameters:

d_model: Dimensionality of the input.
heads: Number of attention heads.
attn_impl: Attention implementation to use ('triton', 'flash', or 'torch').
clip_qkv: Optional value used to clip the query, key, and value vectors.
qk_ln: Optional Boolean flag to apply layer normalization to the query and key vectors.
softmax_scale: Optional scaling factor for the softmax function.
attn_pdrop: Dropout probability for the attention mechanism.
norm_type: Type of normalization to use (default is 'low_precision_layernorm').
fc_type: Type of fully connected layer to use (default is 'torch').
verbose: Verbosity level (default is 0).
device: Device to run the computations on (default is None, automatically chosen).
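As a rough illustration of how options such as clip_qkv, qk_ln, and softmax_scale commonly enter an attention computation, consider the sketch below. It is written under assumptions about the usual meaning of these options and may not match how this package applies them internally; the helper attention_weights is hypothetical and is not part of the mqa API.

```python
import math
import torch
import torch.nn.functional as F

def attention_weights(q, k, clip_qkv=None, qk_ln=False, softmax_scale=None):
    """Hypothetical helper showing where each option would typically apply."""
    head_dim = q.size(-1)

    # Assumed meaning of clip_qkv: clamp values to [-clip_qkv, clip_qkv]
    if clip_qkv is not None:
        q = q.clamp(-clip_qkv, clip_qkv)
        k = k.clamp(-clip_qkv, clip_qkv)

    # Assumed meaning of qk_ln: layer-normalize queries and keys
    if qk_ln:
        q = F.layer_norm(q, (head_dim,))
        k = F.layer_norm(k, (head_dim,))

    # Assumed default scale: 1 / sqrt(head_dim), the standard choice
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(head_dim)

    scores = q @ k.transpose(-2, -1) * softmax_scale
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
q = torch.randn(2, 8, 10, 64)  # 8 query heads
k = torch.randn(2, 1, 10, 64)  # 1 shared key head
default = attention_weights(q, k)
tuned = attention_weights(q, k, clip_qkv=1.0, qk_ln=True, softmax_scale=0.05)
print(default.shape)  # torch.Size([2, 8, 10, 10])
```

Each option only changes the scores that feed the softmax, so every row of the result still sums to 1.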

Forward Method

The forward method of the MultiQueryAttention class accepts the following parameters:

x: The input tensor.
past_key_value: Optional tensor containing past key and value vectors.
bias: Optional tensor containing attention bias.
attention_mask: Optional tensor containing the attention mask.
causal: Optional Boolean flag indicating if the attention mechanism is causal (default is True).
needs_weights: Optional Boolean flag indicating if the attention weights are needed (default is False).

The forward method returns three values, in order: the output tensor, the attention weights, and the past key and value vectors.
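The roles of causal and past_key_value can be sketched in plain PyTorch as follows. This is a conceptual illustration under assumed tensor shapes, not the package's implementation; the helpers causal_mask and append_to_cache are hypothetical names.

```python
import torch

def causal_mask(q_len, k_len):
    """True where a query may attend: no key later than the query's own position."""
    # Queries are aligned to the end of the key sequence, as in incremental decoding
    return torch.tril(torch.ones(q_len, k_len), diagonal=k_len - q_len).bool()

def append_to_cache(past_key_value, k, v):
    """Concatenate new keys/values onto cached ones along the sequence axis."""
    if past_key_value is not None:
        past_k, past_v = past_key_value
        k = torch.cat([past_k, k], dim=1)
        v = torch.cat([past_v, v], dim=1)
    return k, v

# Step 1: a 3-token prompt, no cache yet (assumed shape: batch, seq, head_dim)
k1, v1 = torch.randn(2, 3, 64), torch.randn(2, 3, 64)
cache = append_to_cache(None, k1, v1)
prompt_mask = causal_mask(3, 3)  # lower-triangular

# Step 2: one new token attends over the 3 cached positions plus itself
k2, v2 = torch.randn(2, 1, 64), torch.randn(2, 1, 64)
cache = append_to_cache(cache, k2, v2)
step_mask = causal_mask(1, 4)  # all True

print(cache[0].shape)  # torch.Size([2, 4, 64])
```

With a single shared key/value head, the cache holds one tensor pair per layer rather than one per head, so it stays small as the sequence grows.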

Conclusion

The MQA package provides a single MultiQueryAttention class with a choice of attention backends ('triton', 'flash', or 'torch') and optional query/key normalization, clipping, and dropout, so it can be dropped into PyTorch-based projects.
