Rebulk

Define simple search patterns in bulk to perform advanced matching on any string

ReBulk

Latest Version MIT License Build Status Coveralls semantic-release

ReBulk is a Python library that performs advanced searches in strings that would be hard to implement using only the re module or string methods.

It includes features such as Patterns, Match and Rule that allow developers to build custom, complex string matchers using a readable and extensible API.

This project is hosted on GitHub: https://github.com/Toilal/rebulk

Install

$ pip install rebulk

Usage

Regular expression, string and function-based patterns are declared in a Rebulk object. It uses a fluent API to chain the string, regex and functional methods, which define the various pattern types.

>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))

Once the Rebulk object is fully configured, call its matches method with an input string to retrieve all Match objects found by the registered patterns.

>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]

If multiple Match objects are found at the same position, only the longest one is kept.

>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
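The "longest match wins" behaviour can be pictured with a small standard-library sketch. This is only an illustration of the idea, not Rebulk's actual conflict-solving code, which is more configurable; the helper name keep_longest is hypothetical.

```python
def keep_longest(spans):
    """Drop any (start, end) span that overlaps a longer span already kept."""
    kept = []
    # Visit longer spans first so they win over shorter overlapping ones.
    for start, end in sorted(spans, key=lambda span: span[0] - span[1]):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


text = "the lakers are from la"
# 'lakers', the 'la' hidden inside it, and the trailing 'la'
candidates = [(4, 10), (4, 6), (20, 22)]
result = keep_longest(candidates)
print([text[start:end] for start, end in result])  # ['lakers', 'la']
```

The 'la' at position 4 is discarded because it overlaps the longer 'lakers' span, while the 'la' at position 20 conflicts with nothing and is kept.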

String Patterns

String patterns use the str.find method to find matches, but return every match in the string rather than only the first. Enable ignore_case to ignore case.

>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]

>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]

>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
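For intuition, here is a minimal sketch of how repeated str.find calls can yield every occurrence. It is not Rebulk's implementation: the find_all helper is hypothetical, and resuming the search after the end of each match (so matches never overlap) is an assumption.

```python
def find_all(haystack, needle, ignore_case=False):
    """Return (start, end) spans of every non-overlapping occurrence of needle."""
    if not needle:
        return []  # an empty needle would otherwise loop forever
    if ignore_case:
        haystack, needle = haystack.lower(), needle.lower()
    spans = []
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        spans.append((start, end))
        # Assumption: resume after the match, so occurrences do not overlap.
        start = haystack.find(needle, end)
    return spans


print(find_all("lalalilala", "la"))                    # [(0, 2), (2, 4), (6, 8), (8, 10)]
print(find_all("LalAlilAla", "la"))                    # [(8, 10)]
print(find_all("LalAlilAla", "la", ignore_case=True))  # [(0, 2), (2, 4), (6, 8), (8, 10)]
```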

You can define several patterns with a single string method call.

>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]

Regular Expression Patterns

Regular Expression patterns are based on a compiled regular expression. The re.finditer method is used to find matches.

If the regex module is available, rebulk can use it instead of the default re module. Enable it with the REBULK_REGEX_ENABLED=1 environment variable.

>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]
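The spans above are the same ones the standard library reports. As a plain-re sketch of the underlying lookup (without any of Rebulk's Match objects), the equivalent finditer call looks like this:

```python
import re

pattern = re.compile(r'l\w')
# Each re.Match exposes the matched text and its (start, end) span.
found = [(m.group(), m.span()) for m in pattern.finditer("lolita")]
print(found)  # [('lo', (0, 2)), ('li', (2, 4))]
```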

You can define several patterns with a single regex method call.

>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]

All keyword arguments from re.compile are supported.

>>> import re  # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]

>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]

>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]

If the regex module is available, repeated captures are supported automatically.

>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]

>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
...                   .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
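The second result mirrors how the standard re module treats a group inside a repetition: only the last repetition is retained. A short check with plain re, independent of Rebulk, shows why only 01 and 04 survive:

```python
import re

m = re.match(r'(\d+)(?:-(\d+))+', "01-02-03-04")
# Group 2 sits inside a repeated group, so re keeps only its final capture.
captured = [(m.group(1), m.span(1)), (m.group(2), m.span(2))]
print(captured)  # [('01', (0, 2)), ('04', (9, 11))]
```

The intermediate values 02 and 03 are matched but not stored, which is the gap that repeated captures in the regex module fill.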
  • abbreviations

    Defined as a list of 2-tuples, where each tuple is an abbreviation. It simply replaces tuple[0] with tuple[1] in the expression.

    >>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")])\
    ...         .matches("Custom_separators using-abbreviations")
    [<Custom_separators:(0, 17)>]
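The substitution described for abbreviations amounts to a textual replace before the expression is compiled. The sketch below reproduces that step with the standard library only; expand_abbreviations is a hypothetical helper, not a Rebulk function, and the separator class `[\W_]+` is used as the replacement.

```python
import re


def expand_abbreviations(expression, abbreviations):
    """Replace each abbreviation key with its full form in the expression."""
    for short, full in abbreviations:
        expression = expression.replace(short, full)
    return expression


expanded = expand_abbreviations(r'Custom-separators', [("-", r"[\W_]+")])
print(expanded)  # Custom[\W_]+separators

match = re.search(expanded, "Custom_separators using-abbreviations")
print(match.group(), match.span())  # Custom_separators (0, 17)
```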

Functional Patterns

Functional Patterns are based on the evaluation of a function.

The function should take the same parameter as the Rebulk.matches method, that is, the input string, and must return at least the start index and end index of the Match object.

>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]

You can also return a dict of keyword arguments for the Match object.

You can define several patterns with a single functional method call, and each function can return multiple matches.
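Because a pattern function is ordinary Python, it can be checked on its own before being passed to functional. The sketch below assumes that a list of (start, end) tuples is an accepted way to return multiple matches; confirm the exact return shapes against the Rebulk documentation. The function name and its fixed offsets are illustrative and tied to this sample sentence.

```python
def split_on_question_mark(string):
    """Return two (start, end) spans before the first '?', or None if absent."""
    index = string.find('?')
    if index == -1:
        return None
    # Assumed return shape: one (start, end) tuple per match.
    return [(0, index - 11), (index - 10, index)]


text = "Why do simple ? Forget about it ..."
spans = split_on_question_mark(text)
values = [text[start:end] for start, end in spans]
print(spans)   # [(0, 3), (4, 14)]
print(values)  # ['Why', 'do simple ']
```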

Chain Patterns

Chain Patterns are an ordered composition of string, functional and regex patterns. A repeater can be set on each part of the chain to define how many times it may repeat.

>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
...             .defaults(children=True, formatter={'episode': int, 'version': int})\
...             .chain()\
...             .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
...             .regex(r'v(?P<version>\d+)').repeater('?')\
...             .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
...             .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict()  # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
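To make the chain's meaning concrete, here is a hand-rolled equivalent using only the re module: one mandatory part, one optional part, then a part repeated zero or more times, each anchored where the previous one ended. It is a sketch of what the chain expresses, not of how Rebulk evaluates it, and parse_chain is a hypothetical name.

```python
import re

HEAD = re.compile(r'e(\d{1,4})', re.IGNORECASE)      # repeater(1)
VERSION = re.compile(r'v(\d+)', re.IGNORECASE)       # repeater('?')
TAIL = re.compile(r'[ex-](\d{1,4})', re.IGNORECASE)  # repeater('*')


def parse_chain(text):
    """Match HEAD once, VERSION at most once, then TAIL as often as possible."""
    head = HEAD.search(text)
    if head is None:
        return None
    episodes = [int(head.group(1))]
    version = None
    pos = head.end()
    found = VERSION.match(text, pos)
    if found:
        version = int(found.group(1))
        pos = found.end()
    while True:
        found = TAIL.match(text, pos)
        if found is None:
            break
        episodes.append(int(found.group(1)))
        pos = found.end()
    return {'episode': episodes, 'version': version}


parsed = parse_chain("This is E14v2-15-16-17")
print(parsed)  # {'episode': [14, 15, 16, 17], 'version': 2}
```

The fluent chain replaces this manual position bookkeeping with declarative parts and repeaters.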

Patterns parameters

All patterns have options that can be given as keyword arguments.

  • validator

    Function to validate the Match value given by the pattern. Can also be a dict, to apply a validator to the pattern named by each key.

    >>> def check_leap_year(match):
    ...     return int(match.value) in [1980, 1984, 1988]
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1982 ...")
    >>> len(matches)
    0
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1984 ...")
    >>> len(matches)
    1
    

Some base validator functions are available in the rebulk.validators module. Most of them must be configured with functools.partial to turn them into functions that accept a single match argument.
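The functools.partial step can be illustrated without Rebulk. In the sketch below, preceded_by is a hypothetical two-argument validator and FakeMatch is a stand-in for a Match object (assumed here to expose value, start, end and input_string); neither is part of the library.

```python
from collections import namedtuple
from functools import partial

# Stand-in for a Rebulk Match, for illustration only.
FakeMatch = namedtuple("FakeMatch", "value start end input_string")


def preceded_by(chars, match):
    """Accept a match at the start of the input or right after one of chars."""
    if match.start == 0:
        return True
    return match.input_string[match.start - 1] in chars


# Bind the first argument so the result takes a single match argument.
after_space = partial(preceded_by, " ")

text = "In year 1984"
whole_number = FakeMatch("1984", 8, 12, text)
inner_digits = FakeMatch("984", 9, 12, text)
print(after_space(whole_number))  # True
print(after_space(inner_digits))  # False
```

A callable built this way has the single-argument shape that the validator option expects.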

  • formatter

    Function to convert the Match value given by the pattern. Can also be a dict, to apply a formatter to the matches named by each key.

    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
    
  • pre_match_processor / post_match_processor

    Function to mutate or invalidate a match generated by a pattern.

    The function takes a single parameter, the Match object. If it returns False, the match is considered invalid. If it returns a match instance, that instance replaces the original match in the process.

  • post_processor

    Function to change the default output of the pattern. Function parameters are Matches list and Pattern object.

  • name

    The name of the pattern. It is automatically passed to Match objects generated by this pattern.

  • tags

    A list of strings that qualify this pattern.

  • value

    Overrides the value property of generated Match objects. Can also be a dict, to apply a value to the pattern named by each key.

  • validate_all

    By default, validator is called for returned Match objects only. Enable this option to validate them all, parent and children included.

  • format_all

    By default, formatter is called for returned Match values only. Enable this option to format them all, parent and children included.

  • disabled

    A function(context) to disable the pattern if returning True.

  • children

    If True, all children Match objects will be retrieved instead of a single parent Match object.

  • private

    If True, Match objects generated from this pattern are available internally only. They will be removed at the end of Rebulk.matches method call.

  • private_parent

    Force parent matches to be returned and flag them as private.

  • private_children

    Force children matches to be returned and flag them as private.

  • private_names

    Match names that will be declared as private.

  • ignore_names

    Match names that will be ignored from the pattern output, after validation.

  • marker

    If true, Match objects generated from this pattern will be markers matches instead of standard matches. They won't be included in Matches sequen
