Geograpy3
Extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.
geograpy3
geograpy3 is a fork of geograpy2, which is itself a fork of geograpy. It inherits most of geograpy2 but solves several problems, such as support for UTF-8, place names with multiple words, and confusion over homonyms. Unlike geograpy2, geograpy3 is compatible with Python 3.
Since geograpy3 0.0.2, cities, countries and regions are matched against a database derived from the corresponding Wikidata entries.
What it is
geograpy extracts place names from a URL or text, and adds context to those names -- for example distinguishing between a country, region or city.
The extraction is a two-step process. The first step is a Natural Language Processing task that analyzes a text for potential mentions of geographic locations. In the second step, the words that represent such locations are looked up using the Locator.
If you already know that your content has geographic information, you might want to use the Locator interface directly.
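As a rough illustration of the two steps described above, the sketch below uses a regular expression in place of NLTK entity recognition and a tiny hand-written dictionary in place of the Wikidata-derived database. The names GAZETTEER, find_candidates and classify are hypothetical and not part of geograpy3; the library itself is considerably more thorough.

```python
import re

# Hypothetical stand-in for the location database: name -> kind of place.
GAZETTEER = {
    "France": "country",
    "Germany": "country",
    "Texas": "region",
    "Paris": "city",
    "Berlin": "city",
}


def find_candidates(text):
    """Step 1: collect capitalized words (or runs of them) as potential place names."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def classify(candidates):
    """Step 2: look each candidate up and sort it into a bucket, skipping repeats."""
    result = {"country": [], "region": [], "city": [], "other": []}
    for name in candidates:
        kind = GAZETTEER.get(name, "other")
        if name not in result[kind]:
            result[kind].append(name)
    return result


text = "Paris is the capital of France. Berlin is in Germany."
places = classify(find_candidates(text))
print(places["country"])  # ['France', 'Germany']
print(places["city"])     # ['Paris', 'Berlin']
```

Keeping the two steps separate is what makes it possible to skip the first one when the input is already known to be a place name.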
Examples/Tutorial
Install & Setup
Grab the package using pip (this will take a few minutes)
pip install geograpy3
geograpy3 uses NLTK for entity recognition, so you'll also need to download the models we're using. Fortunately there's a command that'll take care of this for you.
geograpy-nltk
Command Line Usage
geograpy3 provides a command-line interface for extracting geographic information from text and URLs, and for locating cities.
Extract places from text
geograpy -t "Paris is the capital of France. Berlin is in Germany."
Output:
Countries: ['Germany', 'France']
Regions: []
Cities: ['Paris', 'Berlin']
Other: []
Extract places from a URL
geograpy -u https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay
Locate a city with disambiguation
geograpy -l "Paris, Texas"
Output:
Paris (US-TX(Texas) - US(United States of America))
The locator disambiguates between cities with the same name based on region and country context.
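The idea of disambiguating by context can be sketched in a few lines. This is a toy model, not the Locator's actual implementation: the candidate records, the approximate population figures and the locate function are all assumptions made for illustration. The fallback to the most populous candidate reflects the population-based disambiguation mentioned in the credits.

```python
# Hypothetical candidate records with rough, illustrative population figures.
CANDIDATES = [
    {"city": "Paris", "region": "Île-de-France", "country": "France",
     "population": 2_100_000},
    {"city": "Paris", "region": "Texas", "country": "United States of America",
     "population": 25_000},
    {"city": "Paris", "region": "Tennessee", "country": "United States of America",
     "population": 10_000},
]


def locate(query, candidates=CANDIDATES):
    """Pick the best match for a query such as 'Paris, Texas'."""
    parts = [part.strip() for part in query.split(",")]
    name, context = parts[0], parts[1:]
    matches = [c for c in candidates if c["city"] == name]
    # Keep only candidates whose region or country equals every context term.
    for term in context:
        matches = [c for c in matches if term in (c["region"], c["country"])]
    # If several candidates remain, prefer the most populous one.
    return max(matches, key=lambda c: c["population"], default=None)


print(locate("Paris, Texas")["region"])  # Texas
print(locate("Paris")["country"])        # France
```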
Recreate the database
geograpy -db
This downloads and recreates the location database from Wikidata.
All CLI options
geograpy -h
Options:
- -u URL, --url URL: extract places from the given URL
- -t TEXT, --text TEXT: extract places from the given text
- -l LOCATION, --location LOCATION: locate a city (e.g. 'Paris, Texas')
- -db, --recreateDatabase: recreate the database
- -cm, --correctSpelling: correct typical misspellings
- -d, --debug: show debug information
- -V, --version: show program version
Getting the source code
git clone https://github.com/somnathrakshit/geograpy3
cd geograpy3
scripts/install
Basic Usage
Import the module, give some text or a URL, and presto.
import geograpy
url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url=url)
Now you have access to information about all the places mentioned in the linked article.
- places.countries contains a list of country names
- places.regions contains a list of region names
- places.cities contains a list of city names
- places.other lists everything that wasn't clearly a country, region or city
Note that the other list might be useful for shorter texts, to pull out
information like street names, points of interest, etc., but at the moment it is
a bit messy when scanning longer texts that contain adjectival forms of proper
nouns (like "Russian" instead of "Russia").
But Wait, There's More
In addition to listing the names of discovered places, you'll also get some information about the relationships between places.
- places.country_regions: regions broken down by country
- places.country_cities: cities broken down by country
- places.address_strings: city, region, country strings useful for geocoding
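To make these relationships concrete, here is a minimal sketch that derives similar structures from a list of already-resolved places. The input records are hypothetical, and the exact container types geograpy3 returns are an assumption here (dictionaries keyed by country, plus a list of strings).

```python
from collections import defaultdict

# Hypothetical resolved records: (city, region, country).
records = [
    ("Cleveland", "Ohio", "United States"),
    ("Columbus", "Ohio", "United States"),
    ("Lyon", "Auvergne-Rhône-Alpes", "France"),
]

country_regions = defaultdict(list)
country_cities = defaultdict(list)
address_strings = []

for city, region, country in records:
    # Each region is listed once per country, even if several cities share it.
    if region not in country_regions[country]:
        country_regions[country].append(region)
    country_cities[country].append(city)
    address_strings.append(f"{city}, {region}, {country}")

print(dict(country_regions))
print(dict(country_cities))
print(address_strings[0])  # Cleveland, Ohio, United States
```

Strings in the "city, region, country" form can be passed to a geocoding service as they are.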
Last But Not Least
While a text might mention many places, it's probably focused on one or two, so geograpy3 also breaks down countries, regions and cities by number of mentions.
- places.country_mentions
- places.region_mentions
- places.city_mentions
Each of these returns a list of tuples. The first item in the tuple is the place name and the second item is the number of mentions. For example:
[('Russian Federation', 14), ('Ukraine', 11), ('Lithuania', 1)]
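A list of (name, count) tuples like this is the shape that collections.Counter.most_common() produces, so working with it is straightforward. The sketch below builds such a list from a hypothetical sequence of found names and then picks the most-mentioned one; it does not claim to be how geograpy3 counts mentions internally.

```python
from collections import Counter

# Hypothetical country names in the order they were found in a text.
found = [
    "Ukraine", "Russian Federation", "Ukraine",
    "Lithuania", "Russian Federation", "Russian Federation",
]

# most_common() returns (name, count) tuples sorted by descending count.
country_mentions = Counter(found).most_common()
print(country_mentions)
# [('Russian Federation', 3), ('Ukraine', 2), ('Lithuania', 1)]

# The first tuple is the place the text mentions most often.
top_country, count = country_mentions[0]
print(top_country, count)  # Russian Federation 3
```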
If You're Really Serious
You can of course use each of Geograpy's modules on their own. For example:
from geograpy import extraction
e = extraction.Extractor(url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay')
e.find_geoEntities()
# You can now access all of the places found by the Extractor
print(e.places)
Place context is handled in the places module. For example:
from geograpy import places
pc = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])
pc.set_countries()
print(pc.countries)  # ['United States']
pc.set_regions()
print(pc.regions)  # ['Ohio']
pc.set_cities()
print(pc.cities)  # ['Cleveland']
print(pc.address_strings)  # ['Cleveland, Ohio, United States']
And of course all of the other information shown above (country_regions etc)
is available after the corresponding set_ method is called.
Stackoverflow
Credits
geograpy3 uses the following excellent libraries:
- NLTK for entity recognition
- newspaper4k for text extraction from HTML
- jellyfish for fuzzy text match
- pylodstorage for storage and retrieval of tabular data from SQL and SPARQL sources
geograpy3 uses the following data sources:
- ISO3166ErrorDictionary for common country misspellings via Sara-Jayne Terp
- Wikidata for country/region/city information with disambiguation via population
Hat tip to Chris Albon for the name.
