# lmdocs: Generative AI for code documentation :brain: :arrow_right: :computer: :snake:
lmdocs automatically generates documentation for your Python code using LLMs.
( Features | Examples | Quickstart :rocket: | How it works | Additional options :gear: | Caveats and limitations )
:bulb: All documentation for this repository has been generated by lmdocs itself!
## Features

- **Codebase Preservation**: Guarantees no changes to your code
- **Context-aware Documentation**: References relevant documentation from imported libraries and your codebase
- **LLM-Generated Comments**: Understands your code and adds relevant docstrings and comments
- **No dependencies**: Written in pure Python, with no dependencies on any external packages*

*It is recommended that you install the libraries specific to your project before running lmdocs.
## lmdocs in Action :hammer:
<details>
<summary>Example 1 - Fibonacci numbers</summary>

```python
# Original function
def fibonacci(n):
    a, b = 0, 1
    fib_seq = []
    for i in range(n):
        fib_seq.append(a)
        a, b = b, a + b
    return fib_seq
```

```python
# Commented using lmdocs
def fibonacci(n):
    """
    Generates the Fibonacci sequence up to n terms.

    Input:
        n (int): The number of terms in the Fibonacci sequence to generate.

    Returns:
        list: A list containing the first n terms of the Fibonacci sequence.

    Raises:
        ValueError: If n is less than 1.
    """
    a, b = 0, 1   # Initialize two variables to store the last and current term in the sequence
    fib_seq = []  # Initialize an empty list to store the generated Fibonacci sequence

    for i in range(n):     # Generate n terms of the Fibonacci sequence
        fib_seq.append(a)  # Append the current term to the sequence
        # Update the last two terms for the next iteration
        a, b = b, a + b

    return fib_seq  # Return the generated Fibonacci sequence
```
</details>
<details>
<summary>Example 2 - K means</summary>
```python
# Original function
def k_means(X, k, max_iter=300, tol=1e-4, random_state=None):
    np.random.seed(random_state)
    centroids = X[np.random.choice(X.shape[0], k, replace=False), :]
    for _ in range(max_iter):
        distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
        cluster_assignments = np.argmin(distances, axis=0)
        new_centroids = np.array([X[cluster_assignments == i].mean(axis=0) for i in range(k)])
        if np.abs(centroids - new_centroids).sum() < tol:
            break
        centroids = new_centroids
    return cluster_assignments, centroids
```

```python
# Commented using lmdocs
def k_means(X, k, max_iter=300, tol=1e-4, random_state=None):
    '''
    Perform K-Means clustering.

    Input:
        X : array-like of shape (n_samples, n_features)
            The input data.
        k : int
            The number of clusters to form.
        max_iter : int, default=300
            Maximum number of iterations of the k-means algorithm for a single run.
        tol : float, default=1e-4
            Relative tolerance with regards to the Frobenius norm of the difference in the
            cluster centers of two consecutive iterations to declare convergence.
        random_state : int, default=None
            Determines random number generation for centroid initialization. Use an integer to
            get reproducible results.

    Returns:
        tuple : (cluster_assignments, centroids)
            cluster_assignments : array-like of shape (n_samples,)
                Cluster assignments for each sample in the input data.
            centroids : array-like of shape (k, n_features)
                Coordinates of cluster centers.

    Raises:
        ValueError : If k is greater than the number of samples or less than one.
    '''
    np.random.seed(random_state)
    centroids = X[np.random.choice(X.shape[0], k, replace=False), :]

    for _ in range(max_iter):
        distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))  # Calculate Euclidean distance to each centroid
        cluster_assignments = np.argmin(distances, axis=0)  # Assign sample to nearest centroid
        # Recalculate centroids as mean of samples in the same cluster
        new_centroids = np.array([X[cluster_assignments == i].mean(axis=0) for i in range(k)])
        if np.abs(centroids - new_centroids).sum() < tol:  # Check if centroids have converged
            break
        centroids = new_centroids  # Update centroids for next iteration

    return cluster_assignments, centroids
```
</details>
The examples above were generated locally using lmdocs with the DeepSeek coder 6.7B model.
## Quickstart :rocket:
### Using an OpenAI model

```shell
python lmdocs.py <project path> --openai_key <key>
```

Tested with `gpt-3.5-turbo`, `gpt-4-turbo`, and `gpt-4o`.
### Using a local model

```shell
python lmdocs.py <project path> --port <local LLM server port>
```
### Setup

To use local LLMs, you need to set up an OpenAI-compatible server.
You can use local desktop apps like LM Studio, Ollama, GPT4All, llama.cpp, or any other method to set up your LLM server.
Although lmdocs is compatible with any local LLM, I have tested it with the following models:
`deepseek-coder-6.7b-instruct`, `WizardCoder-Python-7B-V1`, `Meta-Llama-3-8B-Instruct`, `Mistral-7B-Instruct-v0.2`, `Phi-3-mini-4k-instruct`
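Once the server is running, you can sanity-check it with a plain OpenAI-style chat-completions request before pointing lmdocs at it. This is a generic sketch, not lmdocs' own code: the `/v1/chat/completions` path is the standard OpenAI-compatible route served by apps like LM Studio and llama.cpp, and the port and model name below are placeholders.

```python
import json
from urllib import request

def build_chat_request(port, prompt, model="local-model"):
    """Build an OpenAI-compatible chat-completions request (URL + JSON body)."""
    url = f"http://localhost:{port}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    })
    return url, body

url, body = build_chat_request(1234, "Write a one-line docstring for: def add(a, b): return a + b")
# To actually send it (requires a running server):
# req = request.Request(url, data=body.encode(), headers={"Content-Type": "application/json"})
# print(json.loads(request.urlopen(req).read())["choices"][0]["message"]["content"])
```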
## How it works
### Step 1: Collect and Analyze Code

Gather all Python files from the project directory and identify all function, class, and method calls.
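The call-site scan can be reproduced with the standard-library `ast` module. This is an illustrative sketch of the idea, not lmdocs' actual implementation, and it only handles simple `name(...)` calls:

```python
import ast

source = """
def helper():
    return 1

def main(xs):
    return helper() + len(xs)
"""

tree = ast.parse(source)
# Walk every node and collect the names of plain function calls
calls = sorted(
    node.func.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
)
# calls == ['helper', 'len']
```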
### Step 2: Create Dependency Graph

Map out the dependencies between the identified calls to create a dependency graph of the entire codebase.
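Continuing the sketch, a minimal dependency graph maps each top-level function to the names it calls. Again, this illustrates the idea rather than lmdocs' code:

```python
import ast
from collections import defaultdict

source = """
def helper():
    return 1

def main(xs):
    return helper() + len(xs)
"""

tree = ast.parse(source)
deps = defaultdict(set)
for fn in tree.body:
    if isinstance(fn, ast.FunctionDef):
        # Record every simple call made inside this function's body
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                deps[fn.name].add(node.func.id)
# deps["main"] == {"helper", "len"}; helper makes no calls, so it is a leaf
```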
### Step 3: Retrieve and Generate Documentation

- For calls with no dependencies, retrieve existing documentation using their `__doc__` attribute
- For calls with dependencies, prompt the LLM to generate documented code, providing the original code and reference documentation for all its dependencies in the prompt
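For leaf calls, this step amounts to reading the `__doc__` attribute that Python attaches to every documented object. The truncation below mirrors the `--ref_doc truncate` strategy (first paragraph only); it is a sketch, not lmdocs' exact code:

```python
# Leaf dependency: reuse the existing docstring instead of asking the LLM
doc = len.__doc__
# The "truncate" strategy keeps only the first paragraph of the docstring
first_paragraph = doc.strip().split("\n\n")[0]
print(first_paragraph)
```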
### Step 4: Verify and Replace Code

- Compare the Abstract Syntax Tree (AST) of the original and generated code
- If they match, replace the original code with the documented code
- If they don't match, retry the generation and verification process (up to three times)
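The equivalence check can also be sketched with `ast`. Comments vanish at parse time, but docstrings do not, so a fair comparison strips leading docstring expressions before comparing the dumps. This is illustrative only; lmdocs may implement the check differently:

```python
import ast

def strip_docstrings(tree):
    """Remove leading docstring expressions so only executable logic is compared."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if (node.body and isinstance(node.body[0], ast.Expr)
                    and isinstance(node.body[0].value, ast.Constant)
                    and isinstance(node.body[0].value.value, str)):
                node.body = node.body[1:]
    return tree

def same_logic(src_a, src_b):
    """True if two sources have identical ASTs modulo comments and docstrings."""
    dump = lambda s: ast.dump(strip_docstrings(ast.parse(s)))
    return dump(src_a) == dump(src_b)

original   = "def f(x):\n    return x + 1\n"
documented = 'def f(x):\n    """Return x plus one."""\n    return x + 1  # add one\n'
# same_logic(original, documented) -> True (docs/comments ignored)
# same_logic(original, "def f(x):\n    return x + 2\n") -> False (logic changed)
```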
## Additional options :gear:
```
usage: lmdocs.py [-h] [-v] [--openai_key OPENAI_KEY] [--openai_key_env OPENAI_KEY_ENV] [--openai_model {gpt-3.5-turbo,gpt-4-turbo,gpt-4o}] [-p PORT]
                 [--ref_doc {truncate,summarize,full}] [--max_retries MAX_RETRIES] [--temperature TEMPERATURE] [--max_tokens MAX_TOKENS]
                 path

positional arguments:
  path                  Path to the file/folder of the project

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Give out verbose logs
  --openai_key OPENAI_KEY
                        Your OpenAI key
  --openai_key_env OPENAI_KEY_ENV
                        Environment variable where the OpenAI key is stored
  --openai_model {gpt-3.5-turbo,gpt-4-turbo,gpt-4o}
                        Which OpenAI model to use. Supported models are ['gpt-3.5-turbo', 'gpt-4-turbo', 'gpt-4o'].
                        gpt-3.5-turbo is used by default
  -p PORT, --port PORT  Port where the local LLM server is hosted
  --ref_doc {truncate,summarize,full}
                        Strategy to process reference documentation. Supported choices are:
                        truncate  - Truncate documentation to the first paragraph
                        summarize - Generate a single summary of the documentation using the given LLM
                        full      - Use the complete documentation (can lead to a very long context length)
                        "truncate" is used as the default strategy
  --max_retries MAX_RETRIES
                        Number of attempts that the LLM gets to generate the documentation for each function/method/class
  --temperature TEMPERATURE
                        Temperature parameter used to sample output from the LLM
  --max_tokens MAX_TOKENS
                        Maximum number of tokens that the LLM is allowed to generate
```
## Caveats and limitations
### Language Support

Only Python 3.0+ is supported.
### Dependency extraction

The `ast` module is used to analyze the Abstract Syntax Tree of every Python file in the codebase.
Only functional and class dependencies are tracked, i.e. only code written within a class, method, or function is tracked and documented.
### Package Dependencies

lmdocs is written in pure Python and does not depend on any other packages.
It is strongly recommended that you install the libraries/packages of the project being documented, so that reference documentation can be extracted.
### Reference documentation extraction

Documentation for functions that have no dependencies is extracted using Python's `__doc__` attribute.
For external libraries (e.g. numpy), the library is imported exactly as it appears in the original code.
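A hypothetical helper shows the idea: the module name comes from the import statement in your code, and the attribute's docstring is read after a dynamic import. This is a sketch only; `external_doc` is not part of lmdocs' API:

```python
import importlib

def external_doc(module_name, attr_name):
    """Import a library by name and return the docstring of one of its attributes."""
    module = importlib.import_module(module_name)
    return getattr(module, attr_name).__doc__

doc = external_doc("math", "sqrt")
# doc -> "Return the square root of x."
```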
Note that, since Python does no