
Fileconveyor

File Conveyor is a daemon written in Python to detect, process and sync files. In particular, it's designed to sync files to CDNs. Amazon S3 and Rackspace Cloud Files are supported, as well as any Origin Pull or (S)FTP Push CDN. It was originally written for my bachelor thesis at Hasselt University in Belgium.

Install / Use

/learn @wimleers/Fileconveyor

README

Description

File Conveyor is designed to discover new, changed and deleted files via the operating system's built-in file system monitor. After discovering the files, they can optionally be processed by a chain of processors – you can easily write new ones yourself. After files have been processed, they can also optionally be transported to a server.

Discovery happens through inotify on Linux (with kernel >= 2.6.13), through FSEvents on Mac OS X (>= 10.5) and through polling on other operating systems.
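The polling fallback boils down to periodically snapshotting modification times and diffing snapshots. A minimal sketch of that idea (this is an illustration of the technique, not File Conveyor's actual monitor code):

```python
import os


def scan(root):
    """Map each file under root to its last-modified time."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                snapshot[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between listing and stat
    return snapshot


def diff(old, new):
    """Classify changes between two snapshots."""
    created = [p for p in new if p not in old]
    deleted = [p for p in old if p not in new]
    modified = [p for p in new if p in old and new[p] != old[p]]
    return created, modified, deleted
```

A real poller would call `scan()` on a timer and feed the three lists into the pipeline; inotify and FSEvents make this loop unnecessary on Linux and Mac OS X.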

Processors are simple Python scripts that can change the file's base name (it is impossible to change the path) and apply any sort of processing to the file's contents. Examples are image optimization and video transcoding.
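As a self-contained stand-in (the real base class and method names live in File Conveyor's processors package and are documented in API.txt; the names below are illustrative assumptions), a processor that lowercases the base name could look like:

```python
import os


class Processor:
    """Illustrative stand-in for File Conveyor's processor base class."""

    def __init__(self, input_file):
        self.input_file = input_file

    def run(self):
        raise NotImplementedError


class LowercaseFilename(Processor):
    """Change the base name only -- the directory part must stay intact."""

    def run(self):
        directory, base = os.path.split(self.input_file)
        # Only the base name may change; changing the path is impossible.
        return os.path.join(directory, base.lower())
```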

Transporters are simple threaded abstractions around Django storage systems.
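The threaded-abstraction idea can be sketched as a worker thread draining a queue into anything with a Django-storage-style `save(name, content)` method. This is a generic sketch under that assumption, not File Conveyor's transporter code:

```python
import queue
import threading


class TransporterThread(threading.Thread):
    """Drains a queue of (name, content) pairs into a storage backend.

    `storage` only needs a Django-storage-style save(name, content)
    method; everything else here is an illustrative sketch.
    """

    def __init__(self, storage, work_queue):
        super().__init__(daemon=True)
        self.storage = storage
        self.queue = work_queue

    def run(self):
        while True:
            item = self.queue.get()
            if item is None:  # sentinel: shut down cleanly
                break
            name, content = item
            self.storage.save(name, content)
            self.queue.task_done()
```

Because each transporter runs in its own thread, slow uploads don't block file discovery or processing.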

For a detailed description of the innards of file conveyor, see my bachelor thesis text (find it via http://wimleers.com/tags/bachelor-thesis).

This application was written as part of the bachelor thesis [1] of Wim Leers at Hasselt University [2].

[1] http://wimleers.com/tags/bachelor-thesis
[2] http://uhasselt.be/

<BLINK>IMPORTANT WARNING</BLINK>

I've attempted to provide a solid enough README to get you started, but I'm well aware that it isn't superb. As this is just a bachelor thesis, time was fairly limited, so I've opted to create a solid basis instead of an extremely rigorously documented piece of software. If you cannot find the answer in the README.txt, the INSTALL.txt, or the API.txt files, then please look at my bachelor thesis text instead. If none of those are sufficient, then please contact me.

Upgrading

If you're upgrading from a previous version of File Conveyor, please run upgrade.py.

==============================================================================
| The basics |
==============================================================================

Configuring File Conveyor

The sample configuration file (config.sample.xml) should be self-explanatory. Copy this file to config.xml, which is the file File Conveyor will look for, and edit it to suit your needs. For a detailed description, see my bachelor thesis text (look for the "Configuration file design" section).

Each rule consists of 3 components:

  • filter
  • processorChain
  • destinations

A rule can also be configured to delete source files after they have been synced to the destination(s).

The filter and processorChain components are optional. You must have at least one destination. If you want to use File Conveyor to process files locally, i.e. without transporting them to a server, then use the Symlink or Copy transporter (see below).
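To make the three components concrete, a rule might be laid out roughly like this. The element and attribute names below are guesses inferred from the component names above; copy config.sample.xml for the authoritative schema.

```xml
<!-- Hypothetical sketch only; consult config.sample.xml for the real schema. -->
<rule for="mysite" label="CSS files">
  <filter>
    <paths>themes</paths>
    <extensions>css</extensions>
  </filter>
  <processorChain>
    <processor name="yui_compressor.YUICompressor" />
  </processorChain>
  <destinations>
    <destination server="my-cdn" path="css" />
  </destinations>
</rule>
```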

Starting File Conveyor

File Conveyor must be started by starting its arbitrator (which links everything together: it controls the file system monitor, the processor chains, the transporters and so on). You can start the arbitrator like this:

  python /path/to/fileconveyor/arbitrator.py

Stopping File Conveyor

File Conveyor listens to standard signals to know when it should stop, much like the Apache HTTP server does. Send it the TERM signal to terminate it:

  kill -TERM $(cat ~/.fileconveyor.pid)

You can configure File Conveyor to store the PID file in the more typical /var/run location on *nix:

  • You can change the PID_FILE setting in settings.py to /var/run/fileconveyor.pid. However, this requires File Conveyor to be run with root permissions (writing to /var/run requires them).
  • Alternatively, you can create a new subdirectory of /var/run that no longer requires root permissions. This can be achieved through these commands:
    1. sudo mkdir /var/run/fileconveyor
    2. sudo chown fileconveyor-user /var/run/fileconveyor
    3. sudo chmod 700 /var/run/fileconveyor

  Then, you can change the PID_FILE setting in settings.py to /var/run/fileconveyor/fileconveyor.pid, and you won't need to run File Conveyor with root permissions anymore.

File Conveyor's behavior

Upon startup, File Conveyor starts the file system monitor and then performs a "manual" scan to detect changes since the last time it ran. If you've got a lot of files, this may take a while.

Just for fun, type the following while File Conveyor is syncing:

  killall -9 python

Now File Conveyor is dead. Upon starting it again, you should see something like:

  2009-05-17 03:52:13,454 - Arbitrator - WARNING - Setup: initialized 'pipeline' persistent queue, contains 2259 items.
  2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'files_in_pipeline' persistent list, contains 47 items.
  2009-05-17 03:52:13,455 - Arbitrator - WARNING - Setup: initialized 'failed_files' persistent list, contains 0 items.
  2009-05-17 03:52:13,671 - Arbitrator - WARNING - Setup: moved 47 items from the 'files_in_pipeline' persistent list into the 'pipeline' persistent queue.
  2009-05-17 03:52:13,672 - Arbitrator - WARNING - Setup: moved 0 items from the 'failed_files' persistent list into the 'pipeline' persistent queue.

As you can see, 47 items were still in the pipeline when File Conveyor was killed. They're now simply added to the pipeline queue again and will be processed once more.
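The crash-recovery pattern at work here -- items move from a persistent queue into an in-flight list while being worked on, and anything still in-flight at startup is re-queued -- can be sketched with a small disk-backed state file. This is an illustration of the pattern, not File Conveyor's persistence code (which uses its own persistent queue and list classes):

```python
import json
import os


class PersistentPipeline:
    """Sketch of crash-safe pipeline bookkeeping: in-flight items are
    re-queued on startup, so a kill -9 never loses work."""

    def __init__(self, state_file):
        self.state_file = state_file
        if os.path.exists(state_file):
            with open(state_file) as f:
                state = json.load(f)
        else:
            state = {"pipeline": [], "files_in_pipeline": []}
        # Recovery: anything that was in flight goes back onto the queue.
        state["pipeline"] = state["files_in_pipeline"] + state["pipeline"]
        state["files_in_pipeline"] = []
        self.state = state
        self._save()

    def _save(self):
        with open(self.state_file, "w") as f:
            json.dump(self.state, f)

    def take(self):
        """Move the next item into the in-flight list and persist."""
        item = self.state["pipeline"].pop(0)
        self.state["files_in_pipeline"].append(item)
        self._save()
        return item

    def done(self, item):
        """Mark an in-flight item as fully processed."""
        self.state["files_in_pipeline"].remove(item)
        self._save()
```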

The initial sync

To get a feeling of File Conveyor's speed, you may want to run it in the console and look at its output.

Verifying the synced files

Running the verify.py script will open the synced files database and verify that each synced file actually exists.
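The core of such a check can be sketched as below. The table and column names are assumptions made for illustration; File Conveyor's actual database schema lives in verify.py and the source tree.

```python
import os
import sqlite3


def verify(db_path):
    """Return the recorded input files that no longer exist on disk.

    Assumes a 'synced_files' table with an 'input_file' column --
    a hypothetical schema; see verify.py for the real one.
    """
    conn = sqlite3.connect(db_path)
    missing = []
    for (input_file,) in conn.execute("SELECT input_file FROM synced_files"):
        if not os.path.exists(input_file):
            missing.append(input_file)
    conn.close()
    return missing
```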

==============================================================================
| Processors |
==============================================================================

Addressing processors

You can address a specific processor by first specifying its processor module and then the exact processor name (which is its class name):

  • unique_filename.MD5
  • image_optimizer.KeepMetadata
  • yui_compressor.YUICompressor
  • link_updater.CSSURLUpdater

It works with third-party processors too! Just make sure the third-party package is on the Python path; then you can reference it in config.xml like this:

  • MyProcessorPackage.SomeProcessorClass
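Resolving a "module.ClassName" string to a class is a standard dynamic-import technique, sketched below with the standard library (File Conveyor's own loader may differ in detail):

```python
import importlib


def resolve_processor(spec):
    """Resolve 'some_module.ClassName' to the class object.

    The module must be importable, i.e. on the Python path.
    Generic technique; not File Conveyor's actual loader code.
    """
    module_name, class_name = spec.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, `resolve_processor("collections.OrderedDict")` returns the `OrderedDict` class itself.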

Processor module: filename

Available processors:

  1. SpacesToUnderscores Changes a filename; replaces spaces by underscores. E.g.: this is a test.txt --> this_is_a_test.txt
  2. SpacesToDashes Changes a filename; replaces spaces by dashes. E.g.: this is a test.txt --> this-is-a-test.txt

Processor module: unique_filename

Available processors:

  1. Mtime Changes a filename based on the file's mtime. E.g.: logo.gif --> logo_1240668971.gif
  2. MD5 Changes a filename based on the file's MD5 hash. E.g.: logo.gif --> logo_2f0342a2b9aaf48f9e75aa7ed1d58c48.gif
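The two renaming schemes above can be sketched as plain functions (illustrative sketches of the behavior shown in the examples, not the module's actual code):

```python
import hashlib
import os


def unique_filename_md5(path):
    """Insert the MD5 hex digest of the file's contents before the
    extension, as in logo.gif -> logo_<md5>.gif."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    root, ext = os.path.splitext(path)
    return "%s_%s%s" % (root, digest, ext)


def unique_filename_mtime(path):
    """Insert the integer mtime before the extension, as in
    logo.gif -> logo_1240668971.gif."""
    root, ext = os.path.splitext(path)
    return "%s_%d%s" % (root, int(os.path.getmtime(path)), ext)
```

Hash- or mtime-based names let you serve files with far-future Expires headers, since any content change produces a new filename.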

Processor module: image_optimizer

It's important to note that all metadata is stripped from JPEG images, as that is the most effective way to reduce the image size. However, this might also strip copyright information, i.e. this can also have legal consequences. Choose one of the "keep metadata" classes if you want to avoid this. When optimizing GIF images, they are converted to the PNG format, which also changes their filename.

Available processors:

  1. Max optimizes image files losslessly (GIF, PNG, JPEG, animated GIF)
  2. KeepMetadata same as Max, but keeps JPEG metadata
  3. KeepFilename same as Max, but keeps the original filename (no GIF optimization)
  4. KeepMetadataAndFilename same as Max, but keeps JPEG metadata and the original filename (no GIF optimization)

Processor module: yui_compressor

Warning: this processor is CPU-intensive! Since you typically don't get new CSS and JS files all the time, it's still fine to use this. But the initial sync may cause a lot of CSS and JS files to be processed and thereby cause a lot of load!

Available processors:

  1. YUICompressor Compresses .css and .js files with the YUI Compressor

Processor module: google_closure_compiler

Warning: this processor is CPU-intensive! Since you typically don't get new JS files all the time, it's still fine to use this. But the initial sync may cause a lot of JS files to be processed and thereby cause a lot of load!

Available processors:

  1. GoogleClosureCompiler Compresses .js files with the Google Closure Compiler

Processor module: link_updater

Warning: this processor is CPU-intensive! Since you typically don't get new CSS files all the time, it's still fine to use this. But the initial sync may cause a lot of CSS files to be processed and thereby cause a lot of load! Note that this processor will skip a CSS file if not all files referenced from it have been synced to the CDN yet, which means CSS files may need to be parsed over and over again until the referenced files have been synced.

This processor is a prime candidate for optimization. It uses the cssutils Python module, which validates every CSS property. This is an enormous slowdown: on a 2.66 GHz Core 2 Duo, it causes 100% CPU usage every time it runs. The module also seems to suffer from rather massive memory leaks: memory usage can easily top 30 MB on Mac OS X, where it would never exceed 17 MB without this processor.

This processor will replace all URLs in CSS files with references to their counterparts on the CDN. There are a couple of important gotchas to use this processor module:

  • absolute URLs (http://, https://) are ignored, only relative URLs are processed
  • if a re