# llama.vim

Local LLM-assisted text completion.
<img width="485" alt="image" src="https://github.com/user-attachments/assets/a950e38c-3b3f-4c46-94fe-0d6e0f790fc6">

- Fill-in-Middle (FIM) completions

  https://github.com/user-attachments/assets/641a6e72-f1a2-4fe5-b0fd-c2597c6f4cdc

- Instruction-based editing

  https://github.com/user-attachments/assets/68bff15b-2d91-4800-985d-b7f110a0ccb7
## Features

- Auto-suggest on cursor movement in `Insert` mode
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift+Tab`
- Instruction-based editing with `<leader>lli`
- Control max text generation time
- Configure scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
## Installation

### Plugin setup

- vim-plug

  ```vim
  Plug 'ggml-org/llama.vim'
  ```

- Vundle

  ```sh
  cd ~/.vim/bundle
  git clone https://github.com/ggml-org/llama.vim
  ```

  Then add `Plugin 'llama.vim'` to your *.vimrc* in the `vundle#begin()` section.

- lazy.nvim

  ```lua
  {
      'ggml-org/llama.vim',
  }
  ```
### Plugin configuration

You can customize llama.vim by setting the `g:llama_config` variable.

Examples:

- Disable the inline info:

  ```vim
  " put before llama.vim loads
  let g:llama_config = { 'show_info': 0 }
  ```

- Same thing, but setting the option directly:

  ```vim
  let g:llama_config.show_info = v:false
  ```

- Disable auto FIM (Fill-In-the-Middle) completion with lazy.nvim:

  ```lua
  {
      'ggml-org/llama.vim',
      init = function()
          vim.g.llama_config = {
              auto_fim = false,
          }
      end,
  }
  ```

- Configure FIM keymaps:

  ```vim
  let g:llama_config.keymap_fim_trigger     = "<leader>llf"
  let g:llama_config.keymap_fim_accept_full = "<Tab>"
  let g:llama_config.keymap_fim_accept_line = "<S-Tab>"
  let g:llama_config.keymap_fim_accept_word = "<leader>ll]"
  ```

- Configure instruction-based editing keymaps:

  ```vim
  let g:llama_config.keymap_inst_trigger  = "<leader>lli"
  let g:llama_config.keymap_inst_retry    = "<leader>llr"
  let g:llama_config.keymap_inst_continue = "<leader>llc"
  let g:llama_config.keymap_inst_accept   = "<Tab>"
  let g:llama_config.keymap_inst_cancel   = "<Esc>"
  ```

Please refer to `:help llama_config` or the source for the full list of options.
### llama.cpp setup

The plugin requires a llama.cpp server instance to be running at `g:llama_config.endpoint_fim` and/or `g:llama_config.endpoint_inst`.
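If your server runs on a different host or port than the default, the endpoint can be overridden in your config. A minimal sketch (the host, port, and URL path below are illustrative; check `:help llama_config` for the exact defaults in your version):

```vim
" point FIM completions at an explicit server URL
" (host/port are illustrative -- adjust to wherever your
"  llama-server instance is actually listening)
let g:llama_config.endpoint_fim = 'http://127.0.0.1:8012/infill'
```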
#### Mac OS

```sh
brew install llama.cpp
```

#### Windows

```sh
winget install llama.cpp
```

#### Any other OS

Either build from source or use the latest binaries: https://github.com/ggml-org/llama.cpp/releases
### llama.cpp settings

Here are the recommended settings, depending on the amount of VRAM that you have:

- More than 64GB VRAM:

  ```sh
  llama-server --fim-qwen-30b-default
  ```

- More than 16GB VRAM:

  ```sh
  llama-server --fim-qwen-7b-default
  ```

- Less than 16GB VRAM:

  ```sh
  llama-server --fim-qwen-3b-default
  ```

- Less than 8GB VRAM:

  ```sh
  llama-server --fim-qwen-1.5b-default
  ```
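The `--fim-qwen-*-default` flags are convenience presets that fetch a FIM model and start the server with suitable settings. If you prefer to pick the model and flags yourself, a manual invocation might look roughly like the following sketch (the model reference and flag values are illustrative assumptions, not the presets' exact configuration):

```sh
# serve a FIM-capable model from the Hugging Face hub;
# flag values are illustrative, not the preset's exact settings
llama-server \
    -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
    --port 8012 \
    -ngl 99 -fa \
    -ub 1024 -b 1024 \
    --cache-reuse 256
```

The `--cache-reuse` option lets the server reuse previously computed KV-cache chunks, which the plugin's smart context reuse builds on.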
Use `:help llama` for more details.
## Recommended LLMs

The plugin requires FIM-compatible models: HF collection
## Examples

<img width="1758" alt="image" src="https://github.com/user-attachments/assets/8f5748b3-183a-4b7f-90e1-9148f0a58883">

Using llama.vim on M1 Pro (2021) with Qwen2.5-Coder 1.5B Q8_0:

<img width="1512" alt="image" src="https://github.com/user-attachments/assets/0ccb93c6-c5c5-4376-a5a3-cc99fafc5eef">

The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is 15186 tokens and the maximum is 32768. There are 30 chunks in the ring buffer with extra context (out of 64). So far, 1 chunk has been evicted in the current session and there are 0 chunks in queue. The newly computed prompt tokens for this request were 260 and the generated tokens were 24. It took 1245 ms to generate this suggestion after entering the letter `c` on the current line.
Using llama.vim on M2 Ultra with Qwen2.5-Coder 7B Q8_0:
https://github.com/user-attachments/assets/1f1eb408-8ac2-4bd2-b2cf-6ab7d6816754
Demonstrates that the global context is accumulated and maintained across different files and showcases the overall latency when working in a large codebase.
Another example on a small Swift codebase.
## Implementation details

The plugin aims to be very simple and lightweight, while at the same time providing high-quality, performant local FIM completions, even on consumer-grade hardware. Read more about how this is achieved in the following links:

- Initial implementation and technical description: https://github.com/ggml-org/llama.cpp/pull/9787
- Classic Vim support: https://github.com/ggml-org/llama.cpp/pull/9995
## Other IDEs

- VS Code: https://github.com/ggml-org/llama.vscode
