Keepright
Data Consistency Checks for openstreetmap.org
openstreetmap.org (OSM) provides a wiki-style means of creating a world-wide street map where everybody is encouraged to contribute. This is a collection of scripts that will examine part of the OSM database and try to find errors that should be corrected by users. As a result you get ugly lists of errors and are invited to correct them.
This document explains how to run data consistency checks on your own database and set up a webpage presenting the results.
Online Resources
The official instance of keepright can be found at http://keepright.at
The development instance with all the latest changes can be found at http://osm.mueschelsoft.de/keepright
PREREQUISITES
Packages required on Linux:
  php5
  php5-cli
  apache
  postgis
  postgresql >= 8.3 with matching release of postgis (postgresql-8.3-postgis)
  postgres-client
  php5-mysql
  php5-pgsql
  php5-intl (support for utf-8)
  php5-idn (support for IDN domain names)
  mysql-server
  mysql-client
  phpMyAdmin
  phpPgAdmin
  sun-java7-jre
  wget
  wput
  bzip2
Optional: joe mc
You will need both Postgres and MySQL: the checks require GIS functions (provided by PostGIS), while the error-presentation scripts rely on MySQL. The presentation scripts will not be recoded to use Postgres because you won't find Postgres on many web hosters.
Using sun-java7-jre is not optional: you need the Java runtime by Sun (now Oracle), release 6 or later.
The checks depend on a copy of the OSM database, split up in parts. Using only a subset of the planet file will result in false positives because ways are cut in two at the border. To avoid this, the splitting is done with overlapping borders: the border regions are included in both adjacent dumps. In the end, errors found in the overlapping area are discarded. The planet is currently split into 85 parts, so-called 'schemas'. They are processed sequentially and independently.
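The overlap idea can be sketched in a few lines of Python. This is only an illustration, not keepright's actual code (which is written in PHP): the BBox class, the 0.5 degree margin and the example coordinates are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned rectangle in degrees (hypothetical helper type)."""
    left: float
    bottom: float
    right: float
    top: float

    def expanded(self, margin: float) -> "BBox":
        # The dump area is the core area grown by the overlap margin.
        return BBox(self.left - margin, self.bottom - margin,
                    self.right + margin, self.top + margin)

    def contains(self, lon: float, lat: float) -> bool:
        # Half-open on the right/top edge, so a point on a shared border
        # belongs to exactly one of two adjacent rectangles.
        return self.left <= lon < self.right and self.bottom <= lat < self.top


def keep_errors(errors, core: BBox):
    """Discard errors that lie in the overlap, i.e. outside the core area."""
    return [e for e in errors if core.contains(e["lon"], e["lat"])]


# Assumed example schema: a core area plus 0.5 degrees of overlap per side.
core = BBox(10.0, 45.0, 15.0, 50.0)
dump_area = core.expanded(0.5)

errors = [
    {"id": 1, "lon": 12.0, "lat": 47.0},  # inside the core: kept
    {"id": 2, "lon": 15.2, "lat": 47.0},  # in the overlap: left to the neighbour
]
kept = keep_errors(errors, core)
```

The checks run on everything inside dump_area, so ways crossing the core border are seen complete, but only errors inside the core are reported. The neighbouring schema reports the rest.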
It looks like the osmosis plugin heavily depends on the osmosis version being used. The plugin is tested and works with osmosis_0.42.
THE BIG PICTURE
This is the whole process, from getting the planet file and running the checks to publishing the check results and collecting user comments. error_view is the resulting table containing all errors. It's the source for the map presentation.
backend scripts running on processing servers:
main.php
  update source code
  loop over all database schemas and process them one by one
  finally start all over

process_schema.php
  do all that is necessary for processing a single schema:
  prepare database (create db tables)
  diff-update planet file
  load database with planet file
  run the checks
  export & upload results to web server

prepareDB.php
  create database tables, activate postGIS

planet.php
  call osmosis with options for diff-updating a planet file part
  let osmosis use a custom plugin called 'pl' that creates special dump files

osmosis plugin 'pl'
PostgreSqlMyDatasetDumpWriter.java
  create dump files suitable for loading with COPY commands in PostgreSQL
  the format mainly differs from the current 'snapshot' format in that
  all geometries are in meters instead of lat/lon
  it was established before the current 'snapshot' format evolved and
  cannot be changed with realistic effort any more

prepare_helpertables.php
  update redundant columns

prepare_countries.php
  create structures needed for boundary processing

run-checks.php
  start all the check routines found in config file error_types.php
    0010_*.php
    ...
    9999_*.php
  compare old and new errors, update error states
  rebuild the error_view table

export_errors.php
  export error_view to dump file

webUpdateClient.php
  upload error_view to web server
  start procedures on web server for loading the new file
  communicating with webUpdateServer.php
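The step "compare old and new errors, update error states" could look roughly like the following Python sketch. It is not taken from run-checks.php. The state names "new", "cleared" and "reopened" and the function update_states are assumptions chosen for illustration.

```python
def update_states(old_states, found_now):
    """Return the state of every known error after a check run.

    old_states: dict mapping error id -> state from the previous run
    found_now:  set of error ids reported by the current run
    (state names are assumed for this sketch)
    """
    new_states = {}
    for error_id, state in old_states.items():
        if error_id in found_now:
            # Still reported: an error that was cleared before comes back.
            new_states[error_id] = "reopened" if state == "cleared" else state
        else:
            # No longer reported: treat it as fixed.
            new_states[error_id] = "cleared"
    for error_id in found_now - set(old_states):
        # Reported for the first time.
        new_states[error_id] = "new"
    return new_states


previous = {1: "new", 2: "cleared", 3: "new"}
current = {2, 3, 4}
states = update_states(previous, current)
```

Keeping the old rows instead of dropping and recreating them is what allows an error to carry its history (and any user comments attached to it) from one run to the next.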
frontend scripts running on web server:
report_map.php
myText.js, myTextFormat.js
  main display script including the map and myText layer
  derived from OpenLayers using an extended version of the Text layer

points.php
  deliver error entries to the client browser
  selecting errors matching error type selection and current viewport of map

comment.php
  receive user feedback and store it on the webserver's comments table
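The selection done by points.php (error type selection plus current viewport) amounts to a simple filter. The following Python sketch shows the idea. The field names, the function select_errors and the example error type 10 are assumptions, and the real script works on the MySQL error_view table rather than on in-memory lists.

```python
def select_errors(errors, error_types, viewport):
    """Return the errors of the selected types that lie inside the viewport.

    viewport is (left, bottom, right, top) in degrees (assumed layout).
    """
    left, bottom, right, top = viewport
    return [
        e for e in errors
        if e["error_type"] in error_types
        and left <= e["lon"] <= right
        and bottom <= e["lat"] <= top
    ]


errors = [
    {"id": 1, "error_type": 10, "lon": 16.37, "lat": 48.20},
    {"id": 2, "error_type": 410, "lon": 16.40, "lat": 48.22},
    {"id": 3, "error_type": 410, "lon": 2.35, "lat": 48.85},  # outside viewport
]
visible = select_errors(errors, {410}, (16.0, 48.0, 17.0, 48.5))
```

Filtering on the server keeps the response small: the browser only receives the markers it can actually display.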
PLANET FILE MANAGEMENT & SQUID
The planet file is split into approximately 85 rectangular areas called 'schemas' (this wrong term evolved in the early days, because every part of the planet resides in its own database schema). Have a look at config/planet.odg for the splitting layout.
When processing a file, osmosis will download all diffs since the last update and apply them to the schema's planet file. As the planet diffs include updates for the whole planet, osmosis adds objects that lie outside the current schema's area to the schema's planet file. That is why cutting the schema's planet file has to be repeated after the diff-update.
All of these files are diff-updated individually. That means you always work with the most recent version of each file, but you end up downloading the same diff files over and over. That's where the web proxy squid comes into play: squid caches all web access, so each diff file is fetched from the network once and served from the cache for the remaining schemas. It speeds up your downloads and avoids unnecessary traffic (with 85 schemas, 84 of every 85 downloads are saved).
Setting up squid is quite easy. On Debian/Ubuntu Linux do something like this:
aptitude install squid
change the config file /etc/squid/squid.conf to increase overall cache size to 1000MB if you like (default is 100MB which is a little bit small). Choose cache size big enough to hold all planet diffs that are needed for updating even the oldest schema in the loop (depending on loop cycle time).
cache_dir ufs /var/spool/squid 1000 16 256
restart squid
/etc/init.d/squid restart
tell your osmosis: add this line to ~/.osmosis:
JAVACMD_OPTIONS="-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 "
my ~/.osmosis looks like this:
JAVACMD_OPTIONS=" -Xmx2500m -Djava.io.tmpdir=/media/big_harddisk/tmp/ -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 "
The website check (#410) benefits from an http proxy, too. Have a look at the respective options in your keepright user config file (~/.keepright). Increasing the cache size to 5000MB for use with the website check seems appropriate.
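The sizing rule for the squid cache (big enough to hold all planet diffs needed for the oldest schema in the loop) can be turned into a rough estimate. The Python sketch below is only a back-of-the-envelope helper. The daily diff volume, the loop cycle time and the safety factor are assumed example values, so measure your own before relying on the result.

```python
import math


def required_cache_mb(diff_mb_per_day, loop_cycle_days, safety_factor=1.5):
    """Estimate the squid cache size in MB.

    The oldest schema in the loop needs every diff published during one
    full loop cycle, so the cache has to hold at least that many days.
    """
    return math.ceil(diff_mb_per_day * loop_cycle_days * safety_factor)


def cache_dir_line(size_mb, path="/var/spool/squid"):
    """Format the matching cache_dir line for squid.conf."""
    return f"cache_dir ufs {path} {size_mb} 16 256"


# Assumed example: 50 MB of diffs per day, one loop over all schemas in 10 days.
size = required_cache_mb(diff_mb_per_day=50, loop_cycle_days=10)
line = cache_dir_line(size)
```

If the cache is too small, squid evicts older diffs before the last schema in the loop has asked for them, and they are downloaded again.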
SETTING UP LOCAL DATABASES
[Don't skip this section if you already have a local database!]
This project uses a modified version of the "simple PostgreSQL schema" as specified in osmosis/script/pgsql_simple_schema.sql, which is part of the source distribution of Osmosis. This means that the base tables are the same, but there are additional columns providing redundancy. This redundancy is used to boost performance of the queries as it can save some joins. For example the ways table has the number of nodes, as well as the id and lat/lon of the first and last node as additional columns; in way_nodes you find lat/lon of the nodes.
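To illustrate what the redundant columns on the ways table buy you, here is a small Python sketch that derives them from way_nodes and nodes data. The column names and the function way_summary are assumptions made for this example. In keepright the columns are filled by the osmosis plugin and the helper-table scripts.

```python
def way_summary(way_id, way_nodes, nodes):
    """Build the redundant per-way columns.

    way_nodes: list of (way_id, node_id, sequence_id) tuples
    nodes:     dict mapping node_id -> (lat, lon)
    """
    # Order the member nodes of this way by their sequence number.
    members = sorted(
        (seq, node_id) for (wid, node_id, seq) in way_nodes if wid == way_id
    )
    first_id = members[0][1]
    last_id = members[-1][1]
    return {
        "id": way_id,
        "node_count": len(members),
        "first_node_id": first_id,
        "first_node_lat": nodes[first_id][0],
        "first_node_lon": nodes[first_id][1],
        "last_node_id": last_id,
        "last_node_lat": nodes[last_id][0],
        "last_node_lon": nodes[last_id][1],
    }


nodes = {1: (48.20, 16.37), 2: (48.21, 16.38), 3: (48.22, 16.37)}
way_nodes = [(100, 1, 0), (100, 2, 1), (100, 3, 2), (100, 1, 3)]

row = way_summary(100, way_nodes, nodes)
# With the redundant columns, "is this way closed?" needs no join at all.
is_closed = row["first_node_id"] == row["last_node_id"]
```

A check that looks for non-closed areas, for example, can then compare two columns of the same row instead of joining ways, way_nodes and nodes for every way.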
The downside is, you cannot use a default database. And you have to use a modified version of Osmosis to convert the planet file. A plugin for Osmosis is provided with the sources. It teaches Osmosis a new option --pl that will create dump files with parts of the redundancy needed.
This is the short form of an article on the wiki http://wiki.openstreetmap.org/wiki/Mapnik/PostGIS
Tuning PostgreSQL configuration for performance of OSM databases is an adventure. Since PostgreSQL 9.0 you can use the pgtune tool. It creates a modified version of your postgresql.conf depending on main memory installed and depending on the usage type you provide (DW seems to be the best matching one).
These are setup parameters you could set before starting out manually:
Tune database parameters
  edit /etc/postgresql/8.3/main/postgresql.conf and add/modify these parameters:
    shared_buffers = 1024MB
    work_mem = 128MB
    maintenance_work_mem = 128MB
    wal_buffers = 512kB
    checkpoint_segments = 20
    max_fsm_pages = 1536000
    effective_cache_size = 512MB
    autovacuum = off

Make sure the auto-vacuum daemon is shut down
  joe /etc/crontab
  comment out any auto-vacuum-daemon entry

Tune shmmax kernel parameter
  joe /etc/sysctl.conf
  edit/add the parameter
    kernel.shmmax=300000000
  after that reboot the machine or simply execute
    sysctl -w kernel.shmmax=300000000 && /etc/init.d/postgresql-8.3 restart
  kernel.shmmax has to be at least as large as the shared memory segment
  PostgreSQL requests at startup, so raise it if the server refuses to
  start with your shared_buffers setting.

Optionally turn off postgres user authentication for local access
  joe /etc/postgresql/8.3/main/pg_hba.conf
  Add this line:
    local all all trust
  This is a security risk. You will not need a password when using the
  command line psql shell. Most probably you'll use phppgadmin and won't need this.

As an alternative to turning off local password prompting you may create a .pgpass file
  joe ~/.pgpass
  add a line of this form:
    hostname:port:database:username:password
    127.0.0.1:*:*:keepright:yourpasswordhere
  chmod 0600 ~/.pgpass

Create the new user
  su - postgres
  createuser keepright
  Shall the new role be a superuser? (y/n) y
You needn't create the postgres database, as the updateDB script will do that automatically. But you have to set the password for the keepright user inside postgres.
Still as user postgres start the psql shell:
psql
postgres=# ALTER ROLE keepright WITH PASSWORD 'shhh!';
ALTER ROLE
Just in case the scripts don't work as expected: creating the database and installing postGIS is easy if you're using postgresql>=9.1
CREATE DATABASE osm WITH OWNER = osm;
inside the newly created database just run
CREATE LANGUAGE plpgsql; -- (should already be there)
CREATE EXTENSION postgis;
Wondering why the auto-vacuum daemon is shut off? The daemon will start analyzing and vacuuming tables every few hours to keep index performance at a high level. But this consumes large amounts of IO bandwidth and disturbs normal operation. Vacuuming is done by hand throughout the scripts because there are many temporary tables that need
