
Lieu

Dedupe/batch geocode addresses and venues around the world with libpostal

Install / Use

/learn @openvenues/Lieu
About this skill

Quality Score: 0/100
Supported Platforms: Universal

README

lieu

lieu is a Python library for deduping places/POIs, addresses, and streets around the world using libpostal's international street address normalization.

Installation

pip install lieu

Note: libpostal and its Python binding (the postal package) are required to use this library; see the libpostal repository for setup instructions.

Input formats

Inputs are expected to be GeoJSON files. The command-line client works on both standard GeoJSON (wrapped in a FeatureCollection) and line-delimited GeoJSON, but for Spark/EMR the input must be line-delimited GeoJSON so it can be effectively split across machines.
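If your data is a standard GeoJSON FeatureCollection, converting it to line-delimited GeoJSON for Spark/EMR is a one-feature-per-line transformation. A minimal sketch (the function name is illustrative, not part of lieu):

```python
import json

def to_line_delimited(in_path, out_path):
    """Convert a standard GeoJSON FeatureCollection to line-delimited
    GeoJSON (one feature per line) so it can be split across machines."""
    with open(in_path) as f:
        collection = json.load(f)
    with open(out_path, "w") as f:
        for feature in collection["features"]:
            f.write(json.dumps(feature) + "\n")
```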

Lieu supports two primary schemas: Who's on First and OpenStreetMap, which are mapped to libpostal's tagset.
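As a simplified illustration of that mapping (not lieu's actual code), common OpenStreetMap `addr:*` tags could be translated to libpostal-style fields like so:

```python
def osm_to_libpostal(props):
    """Map common OpenStreetMap tags to libpostal-style address fields.
    A simplified sketch of the kind of schema mapping lieu performs."""
    mapping = {
        "name": "name",
        "addr:housenumber": "house_number",
        "addr:street": "street",
        "addr:unit": "unit",
        "addr:postcode": "postcode",
        "addr:city": "city",
    }
    return {out: props[tag] for tag, out in mapping.items() if tag in props}
```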

Geographic qualifiers

For the purposes of blocking/candidate generation (grouping similar items together to narrow the number of pairwise checks lieu has to do to significantly fewer than N²), we need at least one field that specifies a geographic area within which to compare records, so we don't need to compare every instance of a very common address ("123 Main St") or a very common name ("Ben & Jerry's") with every other instance. As such, at least one of the following fields must be present in all records:

  • lat/lon: by default we use a prefix of the geohash of the lat/lon plus its neighbors (to avoid fault lines). See here for the distance each prefix size covers (and multiply those numbers by 3 for neighboring tiles). The default setting is a geohash precision of 6 characters, and since the geohash is only used to block or group candidate pairs together, it's possible for pairs within ~2-3 km of each other with the same name/address to be considered duplicates. This should work reasonably well for real-world place data, where locations may have been recorded with varying devices and degrees of precision.
  • postcode: postal codes tend to constrain the geography to a few neighborhoods, and can work well if the data set is for a single country, or for multiple countries whose postcodes do not overlap (even where they do overlap, e.g. postcodes in the US and Italy, the use of street names may still be sufficient to disambiguate). The postcode will be used in place of the lat/lon when the --use-postal-code flag is set.
  • city, city_district, suburb, or island: libpostal will use any of the named place tags found in the address components when the --use-city flag is set. Simple normalizations will match variants like "Saint Louis" with "St Louis" and "IXe Arrondissement" with "9e Arrondissement", but we do not currently have a database-backed method for matching city-name variants like "New York City" vs. "NYC", or containment, e.g. suburb="Crown Heights" vs. city_district="Brooklyn". Note: this method does handle tagging differences, so suburb="Harlem" vs. city="Harlem" will match.
  • state_district: if addresses are already known to lie within a certain small geographic boundary (for instance in the US, county governments are often the purveyors of address-related data), and address dupes within that boundary are rare/unlikely, the state_district tag may be used as well when the --use-small-containing flag is set.

Note: none of these fields are used in pairwise comparisons, only for blocking/grouping.
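The lat/lon blocking described above can be sketched as follows. This is an illustrative stand-in, not lieu's actual code: a minimal pure-Python geohash encoder plus a grouping helper, omitting the neighboring-tile expansion.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Standard geohash encoding: interleave longitude/latitude bits,
    emitting one base-32 character per 5 bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, use_lon, chars = 0, 0, True, []
    while len(chars) < precision:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def block_by_geohash(records, precision=6):
    """Group records by geohash prefix; only records that share a block
    would then be compared pairwise."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[geohash_encode(rec["lat"], rec["lon"], precision)].append(rec)
    return blocks
```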

Name field

For name deduping, each record must contain:

  • name: the venue/company/person's name

Note: when the --name-only flag is set, only name and a geo qualifier (see above) are required. This option is useful e.g. for deduping check-in or simple POI data sets of names and lat/lons, though this use case has not been as thoroughly tested and may require some parameter tuning.

Address fields

By default, we assume every record has an address, which is composed of these fields:

  • street: street names are used in addresses in most countries. Lieu/libpostal can match a wide variety of variations here including abbreviations like "Main St" vs. "Main Street" in 60+ languages, missing thoroughfare types e.g. simply "Main", missing ordinal types like "W 149th St" vs. "W 149 St", and spacing differences like "Sea Grape Ln" vs. "Seagrape Ln".
  • house_number: house number needs to be parsed into its own field. If the source does not separate house number from street, libpostal's parser can be used to extract it. Any subparsing of compound house numbers should be done as a preprocessing step (i.e. 1-3-5 and 3-5 could be the same address in Japan provided that they're both in 1-chome).

Lieu will also handle cases where neither entry has a house number (e.g. England) or where neither entry has a street (e.g. Japan).
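Lieu relies on libpostal's expansions for these street-name equivalences; as a toy illustration of the idea only (nothing like libpostal's actual multilingual normalization), abbreviation expansion might look like:

```python
# Toy abbreviation table -- libpostal's real dictionaries cover 60+ languages.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "ln": "lane",
                 "rd": "road", "w": "west", "e": "east"}

def toy_expand(street):
    """Lowercase, strip periods, and expand known abbreviations."""
    tokens = street.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def streets_match(a, b):
    """Two street names match if their expansions are equal."""
    return toy_expand(a) == toy_expand(b)
```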

Secondary units/sub-building information

Optionally lieu may also compare secondary units when the --with-unit flag is set. In that case, the following fields may be compared as well:

  • unit: normalized unit numbers. Lieu can handle many variations in apartment or floor numbers like "Apt 1C" vs. "#1C" vs. "Apt No. 1 C"
  • floor: normalized floor numbers. Again, here lieu can handle many variations like "Fl 1" vs. "1st Floor" vs. "1/F".
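A toy sketch of the kind of unit normalization involved (not lieu's implementation): strip designators like "Apt", "#", and "No.", collapse spacing, and compare what remains.

```python
import re

def normalize_unit(unit):
    """Strip unit designators and spacing, keeping the identifier:
    'Apt 1C', '#1C', and 'Apt No. 1 C' all normalize to '1C'."""
    cleaned = re.sub(r"(?i)\b(apt|apartment|unit|no|number)\b\.?", " ", unit)
    cleaned = cleaned.replace("#", " ")
    return "".join(cleaned.split()).upper()
```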

Other details

Lieu will also use the following information to increase the accuracy/quality of the dupes:

  • phone: this uses the Python port of Google's libphonenumber to parse phone numbers in various countries, flagging dupes for human review if their phone numbers definitely mismatch, and upgrading classifications when the numbers match.
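Lieu's real comparison goes through the phonenumbers library; a toy stand-in that just compares trailing digits (to tolerate a missing country code) conveys the three-way outcome:

```python
def digits_only(phone):
    return "".join(ch for ch in phone if ch.isdigit())

def phone_signal(a, b):
    """Return 'match', 'mismatch', or 'unknown' for two raw phone strings.
    Illustration only -- real parsing should use libphonenumber."""
    if not a or not b:
        return "unknown"
    da, db = digits_only(a), digits_only(b)
    n = min(len(da), len(db))
    return "match" if da[-n:] == db[-n:] else "mismatch"
```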

Running locally with the command-line tool

The dedupe_geojson command-line tool will be installed in the environment's bin dir and can be used like so:

dedupe_geojson file1.geojson [files ...] -o /some/output/dir
               [--address-only] [--geocode] [--name-only]
               [--address-only-candidates] [--dupes-only] [--no-latlon]
               [--use-city] [--use-small-containing]
               [--use-postal-code] [--no-phone-numbers]
               [--no-fuzzy-street-names] [--with-unit]
               [--features-db-name FEATURES_DB_NAME]
               [--index-type {tfidf,info_gain}]
               [--info-gain-index INFO_GAIN_INDEX]
               [--tfidf-index TFIDF_INDEX]
               [--temp-filename TEMP_FILENAME]
               [--output-filename OUTPUT_FILENAME]
               [--name-dupe-threshold NAME_DUPE_THRESHOLD]
               [--name-review-threshold NAME_REVIEW_THRESHOLD]

Option descriptions:

  • --address-only address duplicates only (ignore names).
  • --geocode only compare entries without a lat/lon to canonicals with lat/lons.
  • --name-only name duplicates only (ignore addresses).
  • --address-only-candidates use the address-only hash keys for candidate generation.
  • --dupes-only only output the dupes.
  • --no-latlon do not use lat/lon and geohashing (if one data set has no lat/lon for instance).
  • --use-city use the city name as a geo qualifier (for local data sets where city is relatively unambiguous).
  • --use-small-containing use the small containing boundaries like county as a geo qualifier (for local data sets).
  • --use-postal-code use the postcode as a geo qualifier (for single-country data sets or cases where postcode is unambiguous).
  • --no-phone-numbers turn off comparison of normalized phone numbers as a postprocessing step (when available), which revises dupe classifications for phone number matches or definite mismatches.
  • --no-fuzzy-street-names do not use fuzzy street name comparison for minor misspellings, etc. Only use libpostal expansion equality.
  • --with-unit include secondary unit/floor comparisons in deduplication (only if both addresses have unit).
  • --features-db-name path to database to store features for lookup (default='features_db').
  • --index-type choice of {info_gain, tfidf}, (default='info_gain').
  • --info-gain-index information gain index filename (default='info_gain.index').
  • --tfidf-index TF-IDF index file (default='tfidf.index').
  • --temp-filename temporary file for near-dupe hashes (default='near_dupes').
  • --output-filename output filename (default='deduped.geojson').
  • --name-dupe-threshold likely-dupe threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.9).
  • --name-review-threshold human review threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.7).

Running on Spark/ElasticMapReduce

It's also possible to dedupe larger/global data sets using Apache Spark and AWS ElasticMapReduce (EMR). Using Spark/EMR should look and feel pretty similar to the command-line script (thanks in large part to the mrjob project by David Marin of Yelp). However, instead of running on your local machine, it spins up a cluster, runs the Spark job, writes the results to S3, shuts down the cluster, and optionally downloads/prints all the results to stdout. There's no need to worry about provisioning the machines or maintaining a standing cluster, and it requires only minimal configuration.

To get started, you'll need to create an Amazon Web Services account and an IAM role that has the permissions required for ElasticMapReduce. Once that's set up, we need to configure the job to use your account:

cd scripts/jobs
cp mrjob.conf.example mrjob.conf

Open up mrjob.conf in your favorite text editor. The config is a YAML file, and under runners.emr there are comments describing the few required fields (e.g. access key and secret, instance types, number of instances, etc.) and some optional ones (AWS region, spot instance bid price, etc.).
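A minimal sketch of what such a config might contain; the field names below are standard mrjob EMR options, but the exact fields in lieu's example config may differ:

```yaml
runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    region: us-east-1
    instance_type: m5.xlarge
    num_core_instances: 4
    # optional: bid for spot instances instead of on-demand, e.g.
    # core_instance_bid_price: '0.25'
```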

GitHub Stars: 84
Category: Development
Updated: 5 months ago
Forks: 22
Languages: Python

Security Score: 97/100 (audited on Oct 12, 2025; no findings)