Rebulk
Define simple search patterns in bulk to perform advanced matching on any string
ReBulk
ReBulk is a Python library that performs advanced searches in strings that would be hard to implement using the re module or string methods alone.
It includes features such as Patterns, Match and Rule that allow
developers to build a custom and complex string matcher using a readable
and extensible API.
This project is hosted on GitHub: https://github.com/Toilal/rebulk
Install
$ pip install rebulk
Usage
Regular expression, string and function based patterns are declared in a
Rebulk object. It uses a fluent API to chain string, regex, and
functional methods to define various pattern types.
>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))
Once the Rebulk object is fully configured, call its matches method
with an input string to retrieve all Match objects found by the registered
patterns.
>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]
If multiple Match objects are found at the same position, only the
longer one is kept.
>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
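The rule above can be pictured with a small stdlib-only sketch. This is not Rebulk's actual conflict solver. It assumes that a shorter match overlapping a longer one is simply dropped, and keep_longest is a hypothetical helper name used for illustration only.

```python
def keep_longest(spans):
    """Keep the longest spans first and drop any span overlapping one already kept.

    Illustrative only: Rebulk's own conflict handling may differ in details.
    """
    kept = []
    # Visit longer spans first; ties are broken by start position.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        overlaps = any(start < k_end and end > k_start for k_start, k_end in kept)
        if not overlaps:
            kept.append((start, end))
    return sorted(kept)


# 'lakers' at (4, 10) and 'la' at (4, 6) and (20, 22) in "the lakers are from la"
result = keep_longest([(4, 10), (4, 6), (20, 22)])
print(result)
```

The shorter 'la' span that starts at the same position as 'lakers' is discarded, while the standalone 'la' at the end of the string survives.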
String Patterns
String patterns are based on the str.find method, but return all matches in the string rather than only the first.
ignore_case can be enabled to ignore case.
>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]
>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]
>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
You can define several patterns with a single string method call.
>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
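To illustrate how repeated str.find calls can yield every occurrence, here is a minimal sketch using only the standard library. It is an assumption about the general technique, not Rebulk's code; find_all_spans is a hypothetical name, and lowercasing both strings is just one possible way to ignore case.

```python
def find_all_spans(needle, haystack, ignore_case=False):
    """Return (start, end) spans of every non-overlapping occurrence of needle."""
    if not needle:
        return []  # an empty needle would otherwise loop forever
    if ignore_case:
        needle, haystack = needle.lower(), haystack.lower()
    spans = []
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        spans.append((start, end))
        # Resume the search right after the previous occurrence.
        start = haystack.find(needle, end)
    return spans


print(find_all_spans('la', "lalalilala"))
print(find_all_spans('la', "LalAlilAla"))
print(find_all_spans('la', "LalAlilAla", ignore_case=True))
```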
Regular Expression Patterns
Regular Expression patterns are based on a compiled regular expression. The re.finditer method is used to find matches.
If the regex module is available, rebulk can use it instead of the default re
module. Enable it with the REBULK_REGEX_ENABLED=1 environment variable.
>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]
You can define several patterns with a single regex method call.
>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
All keyword arguments from re.compile are supported.
>>> import re # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
... .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]
>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
... .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
... .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
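For readers less familiar with re.finditer, the sketch below shows how compiled patterns and keyword arguments such as flags translate into matched text and spans. It uses only the standard library; regex_spans is a hypothetical helper and not part of Rebulk.

```python
import re


def regex_spans(pattern, text, **compile_kwargs):
    """Compile pattern with the given re.compile keyword arguments
    and return (matched_text, (start, end)) for every match."""
    compiled = re.compile(pattern, **compile_kwargs)
    return [(m.group(0), m.span()) for m in compiled.finditer(text)]


print(regex_spans(r'l\w', "lolita"))
print(regex_spans('L[A-Z]KERS', "The LaKeRs are from La", flags=re.IGNORECASE))
```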
If the regex module is available, Rebulk automatically supports repeated captures.
>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]
>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
... .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
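The two-children result comes from how the standard re module treats a repeated group: it only remembers the last capture. The short sketch below demonstrates this with plain re, independently of Rebulk.

```python
import re

match = re.match(r'(\d+)(?:-(\d+))+', "01-02-03-04")

# Group 2 is matched three times ("02", "03", "04"),
# but the re module only keeps its last capture.
groups = match.groups()
spans = [match.span(1), match.span(2)]

print(groups)
print(spans)
```

The third-party regex module exposes every repetition of a group through its captures method, which is what makes the four-children result possible when that module is enabled.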
- abbreviations
  Defined as a list of 2-tuples, where each tuple is an abbreviation. It simply replaces tuple[0] with tuple[1] in the expression.
  >>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")]) \
  ... .matches("Custom_separators using-abbreviations")
  [<Custom_separators:(0, 17)>]
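The replacement described here amounts to a plain text substitution applied to the expression before it is compiled. The sketch below illustrates that idea with the standard library only; expand_abbreviations is a hypothetical helper, and the way Rebulk applies abbreviations internally may differ.

```python
import re


def expand_abbreviations(pattern, abbreviations):
    """Replace each abbreviation key with its expansion in the pattern text."""
    for short, expanded in abbreviations:
        pattern = pattern.replace(short, expanded)
    return pattern


expanded = expand_abbreviations(r'Custom-separators', [("-", r"[\W_]+")])
match = re.search(expanded, "Custom_separators using-abbreviations")

print(expanded)
print(match.group(0), match.span())
```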
Functional Patterns
Functional Patterns are based on the evaluation of a function.
The function should have the same parameters as the Rebulk.matches method,
that is the input string, and must return at least the start index and end
index of the Match object.
>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]
You can also return a dict of keyword arguments for the Match object.
You can define several patterns with a single functional method call,
and the function used can return multiple matches.
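As an illustration of a function that reports several matches at once, the sketch below returns one (start, end) pair per capitalized word. Returning a list of index pairs is an assumption about the accepted return shape, and capitalized_words is a hypothetical function. The example only slices the input string and does not call Rebulk.

```python
def capitalized_words(string):
    """Return a list of (start, end) index pairs, one per capitalized word."""
    spans = []
    start = None
    # A trailing space guarantees the last word is closed inside the loop.
    for index, char in enumerate(string + " "):
        if char.isalpha():
            if start is None:
                start = index
        else:
            if start is not None and string[start].isupper():
                spans.append((start, index))
            start = None
    return spans


text = "Winter is Coming"
spans = capitalized_words(text)
words = [text[start:end] for start, end in spans]
print(spans, words)
```

Such a function takes the input string as its only parameter, which matches the contract described above.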
Chain Patterns
Chain Patterns are an ordered composition of string, functional and regex patterns. A repeater can be set on each chain part to define how many times it may repeat.
>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
... .defaults(children=True, formatter={'episode': int, 'version': int})\
... .chain()\
... .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
... .regex(r'v(?P<version>\d+)').repeater('?')\
... .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
... .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict() # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
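To make the ordering and repetition semantics of the chain concrete, here is a rough stdlib-only equivalent. It is a hand-written sketch, not how Rebulk implements chains: parse_episodes is a hypothetical function, and it assumes each part must start exactly where the previous one ended.

```python
import re

HEAD = re.compile(r'e(?P<episode>\d{1,4})', re.IGNORECASE)     # exactly once
VERSION = re.compile(r'v(?P<version>\d+)', re.IGNORECASE)      # optional ('?')
TAIL = re.compile(r'[ex-](?P<episode>\d{1,4})', re.IGNORECASE)  # zero or more ('*')


def parse_episodes(text):
    """Apply the three parts in order, each starting where the previous one ended."""
    head = HEAD.search(text)
    if head is None:
        return None
    episodes = [int(head.group('episode'))]
    position = head.end()

    version = None
    found = VERSION.match(text, position)
    if found:
        version = int(found.group('version'))
        position = found.end()

    found = TAIL.match(text, position)
    while found:
        episodes.append(int(found.group('episode')))
        position = found.end()
        found = TAIL.match(text, position)

    return {'episode': episodes, 'version': version}


parsed = parse_episodes("This is E14v2-15-16-17")
print(parsed)
```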
Patterns parameters
All patterns have options that can be given as keyword arguments.
- validator
  Function to validate the Match value given by the pattern. Can also be a dict, to use a validator with the pattern named with key.
  >>> def check_leap_year(match):
  ...     return int(match.value) in [1980, 1984, 1988]
  >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
  ... .matches("In year 1982 ...")
  >>> len(matches)
  0
  >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
  ... .matches("In year 1984 ...")
  >>> len(matches)
  1
  Some base validator functions are available in the rebulk.validators
  module. Most of those functions have to be configured using
  functools.partial to map them to a function accepting a single match
  argument.
- formatter
  Function to convert the Match value given by the pattern. Can also be a dict, to use a formatter with matches named with key.
  >>> def year_formatter(value):
  ...     return int(value)
  >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
  ... .matches("In year 1982 ...")
  >>> isinstance(matches[0].value, int)
  True
- pre_match_processor / post_match_processor
  Function to mutate or invalidate a match generated by a pattern.
  The function has a single parameter, which is the Match object. If the function returns False, the match is considered invalid. If the function returns a match instance, that instance replaces the original match in the process.
- post_processor
  Function to change the default output of the pattern. Function parameters are Matches list and Pattern object.
- name
  The name of the pattern. It is automatically passed to Match objects generated by this pattern.
- tags
  A list of strings that qualify this pattern.
- value
  Override value property for generated Match objects. Can also be a dict, to use value with pattern named with key.
- validate_all
  By default, validator is called for returned Match objects only. Enable this option to validate them all, parent and children included.
- format_all
  By default, formatter is called for returned Match values only. Enable this option to format them all, parent and children included.
- disabled
  A function(context) to disable the pattern if returning True.
- children
  If True, all children Match objects will be retrieved instead of a single parent Match object.
- private
  If True, Match objects generated from this pattern are available internally only. They will be removed at the end of Rebulk.matches method call.
- private_parent
  Force parent matches to be returned and flag them as private.
- private_children
  Force children matches to be returned and flag them as private.
- private_names
  Matches names that will be declared as private
- ignore_names
  Matches names that will be ignored from the pattern output, after validation.
- marker
  If true, Match objects generated from this pattern will be markers matches instead of standard matches. They won't be included in Matches sequen
