Acropolis

The Research & Education Space software stack.

![Current build status][travis] ![Apache 2.0 licensed][license] ![Implemented in C][language] ![Follow @RES_Project][twitter]

This “umbrella” project has been assembled to make it easier to maintain and run tests on all of the individual components which make up the Acropolis stack.

This software was developed as part of the Research & Education Space project and is actively maintained by a development team within BBC Design and Engineering. We hope you’ll find this project useful!

Requirements

In order to build Acropolis in its entirety, you will need:—

  • A working build environment, including a C compiler such as Clang, along with Autoconf, Automake and Libtool
  • The Jansson JSON library
  • libfcgi
  • The client libraries for PostgreSQL and/or MySQL; it may be possible to use MariaDB’s libraries in place of MySQL’s, but this is untested
  • libcurl
  • libxml2
  • The Redland RDF library (a more up-to-date version than your operating system provides may be required for some components to function correctly)

Optionally, you may also wish to install:—

  • The Apache Qpid Proton messaging libraries (used by libmq for AMQP support)
  • CUnit (used to build and run the test suites)
  • xsltproc and the DocBook XSL-NS stylesheets (used when building the documentation)

On a Debian-based system, the following should install all of the necessary dependencies:

$ sudo apt-get install -qq libjansson-dev libmysqlclient-dev libpq-dev libqpid-proton-dev libcurl4-gnutls-dev libxml2-dev librdf0-dev libltdl-dev uuid-dev libfcgi-dev automake autoconf libtool pkg-config libcunit1-ncurses-dev build-essential clang xsltproc docbook-xsl-ns

Acropolis has not yet been ported to non-Unix-like environments, and on macOS it will build and install as shared libraries rather than as a framework.

Much of it ought to build inside Cygwin on Windows, but this is untested.

Contributions for building properly with Visual Studio or Xcode, and so on, are welcome (provided they do not significantly complicate the standard build logic).
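The build itself follows the standard Autotools sequence. The sketch below shows a typical from-scratch build; the repository URL and the `autogen.sh` bootstrap script name are assumptions based on common convention, not confirmed by this README:

```shell
# Clone with submodules, since this umbrella repository aggregates the
# individual components (URL assumed from the project's GitHub organisation).
git clone --recursive https://github.com/bbcarchdev/acropolis.git
cd acropolis

# Regenerate the build system, then configure, build, and install under the
# default /opt/res prefix used throughout this README.
./autogen.sh
./configure --prefix=/opt/res
make
sudo make install
```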

Using Acropolis

Once you have built and installed the Acropolis stack, you probably want to do something with it.

Acropolis consists of a number of individual components, including libraries, command-line tools, web-based servers, and back-end daemons; each is described below.

Note that this repository exists for development and testing purposes only: in a production environment, each component is packaged and deployed individually.

Components

Anansi

Anansi is a web crawler. It uses a relational database to track URLs that will be fetched, their status, and cache IDs. Anansi can operate in resizeable clusters of up to 256 nodes via libcluster.

Anansi has the notion of a processor: a named implementation of the "business logic" of evaluating resources that have been retrieved and using them to add new entries to the queue.

In the Research & Education Space, Anansi is configured to use the lod (Linked Open Data) processor, which:

  • Only accepts resources which can be parsed by librdf
  • Can apply licensing checks (based upon a configured list)
  • Adds any URIs that it finds in retrieved RDF to the crawl queue (allowing spidering to actually occur)

Twine

Twine is a modular RDF-oriented processing engine. It can be configured to do a number of different things, but its main purpose is to fetch some data, convert it if necessary, perform some processing, and then put it somewhere.

Twine can operate as a daemon, which will continuously fetch data to process from a queue of some kind (see libmq), or it can be invoked from the command-line to ingest data directly from a file.
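For example, a one-off ingest from a file might look like the following. The `twine` binary name comes from the project, but the flag and file names here are assumptions for illustration; consult the Twine README for the real options:

```shell
# Hypothetical one-shot ingest: parse an RDF file and run it through the
# configured workflow. The -c flag for selecting a configuration file is an
# assumption, not a documented option.
twine -c /opt/res/etc/twine.conf dataset.trig
```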

Twine is extended through two different kinds of loadable modules (which reside in ${libdir}/twine, by default /opt/res/lib/twine):

  • Handlers are responsible for taking some input and populating an RDF model based upon it. The mechanism is flexible enough to allow, for example, the input data to be a set of URLs which the module should fetch and parse, but it's also used for data conversion.
  • Processors are modules which can manipulate the RDF model in some way, including dealing with storage and output.

Twine ships with a number of modules for interacting with SPARQL servers, XML data ingest via XSLT transform, as well as parsing and outputting RDF. More information can be found in the Twine README.

Twine is always configured with a workflow: a list of processors which should be invoked in turn for each item of data being processed. Like all configuration options, the workflow can be specified on the command-line.
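As a sketch, a workflow running the three Spindle processors in sequence might be configured like this; the section and key names are illustrative only, so refer to the annotated configuration files shipped in the config directory for the real syntax:

```ini
; Illustrative only: invoke the Spindle processors in turn for each item.
[twine]
workflow=spindle-strip, spindle-correlate, spindle-generate
```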

In the Research & Education Space, the Spindle project provides additional Twine modules which implement the key logic of the platform.

Quilt

Quilt is a Linked Data server designed to efficiently serve RDF data in a variety of serialisations, including templated HTML. Like Twine, Quilt is modular (see ${libdir}/quilt); in particular, modules are used to provide engine implementations: the code responsible for populating an RDF model based upon the request parameters (Quilt itself then handles the serving of that model). The Spindle project includes a Quilt module which implements the Research & Education Space public API.

Spindle

Spindle is the core of the Research & Education Space. It includes three processor modules for Twine:

  • spindle-strip uses a rulebase to decide which triples in an RDF model should be retained and which should be discarded.
  • spindle-correlate processes graphs in conjunction with a SPARQL server (and optionally a PostgreSQL database) in order to aggregate co-references: the result is that distinct RDF descriptions of the same things are clustered together.
  • spindle-generate performs indexing of RDF data, including dealing with media resource licensing, and is responsible for generating "proxy" sets of triples which summarise the Research & Education Space's interpretation of the various pieces of source data about each thing they describe.

It also includes a module for Quilt, which uses the data from spindle-correlate and spindle-generate in order to provide the Research & Education Space API.

Running the stack

Annotated configuration files are provided in the config directory which should help get you started. By default, the components expect to find these files in /opt/res/etc, but this can be altered by specifying the --prefix or --sysconfdir options when invoking the top-level configure script.
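For example, to install the stack under a different prefix while keeping the configuration in a custom location (the paths here are illustrative):

```shell
# Configure the whole stack with custom installation and configuration paths,
# then build and install as usual.
./configure --prefix=/usr/local/res --sysconfdir=/etc/res
make && sudo make install
```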

Requirements

You will need:

  • A PostgreSQL database server, with databases for Anansi and Spindle (you do not need to define any tables or views, the schemas will be generated automatically)
  • A RADOS or FakeS3 bucket for Anansi and one for Spindle. Note that pending resolution of libawsclient#1, you can no longer use a real Amazon S3 bucket for storage.
  • A SPARQL server, such as 4store.

In production, the Research & Education Space uses PostgreSQL, RADOS, and 4store. It has been successfully used in development environments with FakeS3 and alternative SPARQL servers.
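Assuming a local PostgreSQL server, the two empty databases can be created with the standard client tools. The database names below are illustrative; they must match the connection details in your Anansi and Spindle configuration:

```shell
# Create empty databases for Anansi and Spindle; as noted above, each
# component generates its own schema automatically on first use.
createdb anansi
createdb spindle
```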

Running Anansi

Important! Do not run the Anansi daemon (crawld) without first carefully checking the configuration to ensure that it doesn’t simply start crawling the web unchecked. If you're using the `l
