Lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
lieu is a Python library for deduping places/POIs, addresses, and streets around the world using libpostal's international street address normalization.
Installation
pip install lieu
Note: libpostal and its Python binding are required to use this library; see the libpostal repository for setup instructions.
Input formats
Inputs are expected to be GeoJSON files. The command-line client works on both standard GeoJSON (wrapped in a FeatureCollection) and line-delimited GeoJSON, but for Spark/EMR the input must be line-delimited GeoJSON so it can be effectively split across machines.
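If your data is a standard FeatureCollection and you need line-delimited GeoJSON for Spark/EMR, a short conversion script suffices. The following is a minimal sketch using only the Python standard library (file paths are illustrative):

```python
import json

def to_line_delimited(in_path, out_path):
    """Convert a standard GeoJSON FeatureCollection to line-delimited
    GeoJSON (one feature per line), so the input can be split across
    machines."""
    with open(in_path) as f:
        collection = json.load(f)
    with open(out_path, "w") as f:
        for feature in collection["features"]:
            f.write(json.dumps(feature) + "\n")
```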
Lieu supports two primary schemas, Who's on First and OpenStreetMap, which are mapped to libpostal's tagset.
Geographic qualifiers
For the purposes of blocking/candidate generation (grouping similar items together to narrow the number of pairwise checks lieu has to do down to significantly fewer than N²), we need at least one field that specifies a geographic area within which to compare records, so we don't have to compare every instance of a very common address ("123 Main St") or a very common name ("Ben & Jerry's") with every other instance. As such, at least one of the following fields must be present in all records:
- lat/lon: by default we use a prefix of the geohash of the lat/lon plus its neighbors (to avoid faultlines). See here for the distance each prefix size covers (and multiply those numbers by 3 for neighboring tiles). The default setting is a geohash precision of 6 characters, and since the geohash is only used to block/group candidate pairs together, it's possible for pairs within ~2-3 km of each other with the same name/address to be considered duplicates. This should work reasonably well for real-world place data where the locations may have been recorded with varying devices and degrees of precision.
- postcode: postal codes tend to constrain the geography to a few neighborhoods, and can work well if the data set is for a single country, or for multiple countries where the postcodes do not overlap (although even if they do overlap, e.g. postcodes in the US and Italy, the street names may still be sufficient to disambiguate). The postcode is used in place of the lat/lon when the --use-postcode flag is set.
- city, city_district, suburb, or island: libpostal will use any of the named place tags found in the address components when the --use-city flag is set. Simple normalizations will match e.g. "Saint Louis" with "St Louis" and "IXe Arrondissement" with "9e Arrondissement", but we do not currently have a database-backed method for matching city name variants like "New York City" vs. "NYC", or containment, e.g. suburb="Crown Heights" vs. city_district="Brooklyn". Note: this method does handle tagging differences, so suburb="Harlem" vs. city="Harlem" will match.
- state_district: if addresses are already known to be within a certain small geographic boundary (for instance in the US, county governments are often the purveyors of address-related data), where address dupes within that boundary are rare/unlikely, the state_district tag may be used as well when the --use-containing flag is set.
Note: none of these fields are used in pairwise comparisons, only for blocking/grouping.
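The geohash-based blocking described above can be sketched as follows. This is a self-contained illustration, not lieu's actual implementation: it uses the textbook geohash encoding and groups records by the resulting cell, whereas lieu also adds the 8 neighboring cells to each block to avoid faultlines.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Standard geohash encoding: interleave longitude/latitude bisection
    bits, emitting one base32 character per 5 bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bit, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val > mid:
            ch = ch * 2 + 1
            rng[0] = mid
        else:
            ch = ch * 2
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:
            chars.append(BASE32[ch])
            bit, ch = 0, 0
    return "".join(chars)

def block_candidates(records, precision=6):
    """Group records by geohash prefix so that pairwise comparison only
    happens within a block, not across the whole data set."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[geohash_encode(rec["lat"], rec["lon"], precision)].append(rec)
    return blocks
```

At precision 6, each cell is roughly 1.2 km x 0.6 km, which is why same-name/same-address pairs a few kilometers apart can still land in the same block once neighboring cells are included.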
Name field
For name deduping, each record must contain:
- name: the venue/company/person's name
Note: when the --name-only flag is set, only name and a geo qualifier (see above) are required. This option is useful e.g. for deduping check-in or simple POI data sets of names and lat/lons, though this use case has not been as thoroughly tested and may require some parameter tuning.
Address fields
By default, we assume every record has an address, which is composed of these fields:
- street: street names are used in addresses in most countries. Lieu/libpostal can match a wide variety of variations here including abbreviations like "Main St" vs. "Main Street" in 60+ languages, missing thoroughfare types e.g. simply "Main", missing ordinal types like "W 149th St" vs. "W 149 St", and spacing differences like "Sea Grape Ln" vs. "Seagrape Ln".
- house_number: house number needs to be parsed into its own field. If the source does not separate house number from street, libpostal's parser can be used to extract it. Any subparsing of compound house numbers should be done as a preprocessing step (i.e. 1-3-5 and 3-5 could be the same address in Japan provided that they're both in 1-chome).
Lieu will also handle cases where neither entry has a house number (e.g. England) or where neither entry has a street (e.g. Japan).
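As a preprocessing illustration, here is a simplistic regex-based splitter for sources that ship "123 Main St" in a single field. This is only a stand-in to show the shape of the preprocessing step; real data should go through libpostal's `postal.parser.parse_address`, which handles far more formats and languages.

```python
import re

# Hypothetical helper, not part of lieu: split a leading house number
# (including simple compounds like "1-3-5") off a combined street field.
HOUSE_NUMBER_RE = re.compile(r"^\s*(\d+[A-Za-z]?(?:-\d+)*)\s+(.*)$")

def split_house_number(street_field):
    """Return (house_number, street); house_number is None when no
    leading number is found."""
    m = HOUSE_NUMBER_RE.match(street_field)
    if not m:
        return None, street_field.strip()
    return m.group(1), m.group(2)
```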
Secondary units/sub-building information
Optionally lieu may also compare secondary units when the --with-unit flag is set. In that case, the following fields may be compared as well:
- unit: normalized unit numbers. Lieu can handle many variations in apartment or floor numbers like "Apt 1C" vs. "#1C" vs. "Apt No. 1 C"
- floor: normalized floor numbers. Again, here lieu can handle many variations like "Fl 1" vs. "1st Floor" vs. "1/F".
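The kind of normalization involved can be illustrated with a toy version (these are not lieu's actual rules): strip common designators, drop punctuation and spacing, and lowercase, so the variants above reduce to the same key.

```python
import re

def normalize_unit(unit):
    """Illustrative unit normalization: "Apt 1C", "#1C", and
    "Apt No. 1 C" all reduce to "1c"."""
    s = unit.lower()
    # Drop common designators like "apt", "unit", "no"
    s = re.sub(r"\b(?:apt|apartment|unit|no|number)\b\.?", " ", s)
    s = s.replace("#", " ")
    # Collapse remaining whitespace and stray periods
    s = re.sub(r"[\s.]+", "", s)
    return s
```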
Other details
Lieu will also use the following information to increase the accuracy/quality of the dupes:
- phone: this uses the Python port of Google's libphonenumber to parse phone numbers in various countries, flagging otherwise-matching records for review if they have definitively different phone numbers, and upgrading the dupe classification when the numbers match.
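To keep the example dependency-free, the comparison can be sketched with a naive digits-only normalization. This is a stand-in: lieu actually uses the phonenumbers library (the Python port of libphonenumber), which correctly handles country codes, extensions, and formatting.

```python
import re

def normalized_digits(phone):
    """Naive phone normalization: keep digits only."""
    return re.sub(r"\D", "", phone)

def phones_conflict(phone_a, phone_b):
    """Treat two records' phone numbers as a definite mismatch only when
    both are present and their digits differ; a missing number is not
    evidence either way."""
    a, b = normalized_digits(phone_a or ""), normalized_digits(phone_b or "")
    return bool(a) and bool(b) and a != b
```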
Running locally with the command-line tool
The dedupe_geojson command-line tool will be installed in the environment's bin dir and can be used like so:
dedupe_geojson file1.geojson [files ...] -o /some/output/dir
[--address-only] [--geocode] [--name-only]
[--address-only-candidates] [--dupes-only] [--no-latlon]
[--use-city] [--use-small-containing]
[--use-postal-code] [--no-phone-numbers]
[--no-fuzzy-street-names] [--with-unit]
[--features-db-name FEATURES_DB_NAME]
[--index-type {tfidf,info_gain}]
[--info-gain-index INFO_GAIN_INDEX]
[--tfidf-index TFIDF_INDEX]
[--temp-filename TEMP_FILENAME]
[--output-filename OUTPUT_FILENAME]
[--name-dupe-threshold NAME_DUPE_THRESHOLD]
[--name-review-threshold NAME_REVIEW_THRESHOLD]
Option descriptions:
- --address-only: address duplicates only (ignore names).
- --geocode: only compare entries without a lat/lon to canonicals with lat/lons.
- --name-only: name duplicates only (ignore addresses).
- --address-only-candidates: use the address-only hash keys for candidate generation.
- --dupes-only: only output the dupes.
- --no-latlon: do not use lat/lon and geohashing (if one data set has no lat/lon, for instance).
- --use-city: use the city name as a geo qualifier (for local data sets where city is relatively unambiguous).
- --use-small-containing: use small containing boundaries like county as a geo qualifier (for local data sets).
- --use-postal-code: use the postcode as a geo qualifier (for single-country data sets or cases where postcode is unambiguous).
- --no-phone-numbers: turn off comparison of normalized phone numbers as a postprocessing step (when available), which revises dupe classifications for phone number matches or definite mismatches.
- --no-fuzzy-street-names: do not use fuzzy street name comparison for minor misspellings, etc.; only use libpostal expansion equality.
- --with-unit: include secondary unit/floor comparisons in deduplication (only if both addresses have a unit).
- --features-db-name: path to database to store features for lookup (default='features_db').
- --index-type: choice of {info_gain, tfidf} (default='info_gain').
- --info-gain-index: information gain index filename (default='info_gain.index').
- --tfidf-index: TF-IDF index file (default='tfidf.index').
- --temp-filename: temporary file for near-dupe hashes (default='near_dupes').
- --output-filename: output filename (default='deduped.geojson').
- --name-dupe-threshold: likely-dupe threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.9).
- --name-review-threshold: human review threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.7).
Running on Spark/ElasticMapReduce
It's also possible to dedupe larger/global data sets using Apache Spark and AWS ElasticMapReduce (EMR). Using Spark/EMR should look and feel pretty similar to the command-line script (thanks in large part to the mrjob project by David Marin at Yelp). However, instead of running on your local machine, it spins up a cluster, runs the Spark job, writes the results to S3, shuts down the cluster, and optionally downloads/prints all the results to stdout. There's no need to worry about provisioning the machines or maintaining a standing cluster, and it requires only minimal configuration.
To get started, you'll need to create an Amazon Web Services account and an IAM role that has the permissions required for ElasticMapReduce. Once that's set up, we need to configure the job to use your account:
cd scripts/jobs
cp mrjob.conf.example mrjob.conf
Open up mrjob.conf in your favorite text editor. The config is a YAML file, and under runners.emr there are comments describing the few required fields (e.g. access key and secret, instance types, number of instances, etc.) and some optional ones (AWS region, spot instance bid price, etc.).
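A filled-in config might look roughly like the following (illustrative values only; the option names follow mrjob's EMR runner, but the bundled mrjob.conf.example and mrjob's documentation are the authoritative references):

```yaml
runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    region: us-east-1
    instance_type: m5.xlarge
    num_core_instances: 4
```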