Cloudless

Deep learning pipeline for orbital satellite data for detecting clouds

Generate Convert Improve

Install / Use

/learn @BradNeuberg/Cloudless

About this skill

Quality Score

0/100

README

Introduction

This project provides a classifier for detecting clouds in satellite remote sensing data using deep learning. Startups like Planet Labs have launched fleets of nanosats to image much of the earth daily; detecting clouds in these images to ignore or eliminate them is an important pre-processing step to doing interesting work with nanosat imagery. For example, if we are getting daily orbital photos of a location, we might want to detect changes over time, such as for automatically detecting deforestation, counting cars in parking lots, etc. Being able to first detect and eliminate clouds (which change often and could lead to false positives), is therefore important.

This project has three parts:

An annotation tool that takes data from the Planet Labs API and allows users to draw bounding boxes around clouds to bootstrap training data.
A training pipeline that takes annotated data, runs it on EC2 on GPU boxes to fine tune an AlexNet trained model, and then generates validation statistics to relate how well the trained model performs.
A bounding box system that takes the trained cloud classifier and attempts to draw bounding boxes on orbital satellite data.

Note that even though Cloudless is currently focused on cloud detection and localization, the entire pipeline can be used for any other satellite detection task with just a bit of tweaking, such as detecting cars, different biomes, etc. Use the annotation tools to bootstrap training data then run it through the pipeline for your particular task; everything in Cloudless is what you would need for other kinds of orbital computer vision detection tasks.

A full technical report with details on the project can be found in the announcement blog post.

Here's example output of before and after images with detected clouds. Detected clouds are automatically given yellow overlay boxes via the trained deep neural network:

normal image for comparison

cloud detection boxes

This project and its trained model are available under an Apache 2 license; see the license.txt file for details.

Parts of the Cloudless project started as part of Dropbox's Hack Week, with continued work post-Hack Week by Brad Neuberg. Contributors:

This is release 1.0 of Cloudless.

Special thanks to both Dropbox and Planet Labs!

Documentation sections:

Annotation Tool
Training Pipeline
Bounding Box/Inference System

Annotation Tool

The annotation tool makes it possible to label Planet Labs data to feed into the neural network. It's code lives in src/annotate.

Setting it up:

brew install gdal
Install virtualenv and virtualenvwrapper: https://jamie.curle.io/posts/installing-pip-virtualenv-and-virtualenvwrapper-on-os-x/
mkvirtualenv annotate-django
cd src/annotate
pip install -r requirements.txt
./manage.py migrate
echo 'PLANET_KEY="SECRET PLANET LABS KEY"' >> $VIRTUAL_ENV/bin/postactivate

Each time you work with the annotation tool you will ned to re-activate its virtualenv setup:

workon annotate-django

When finished working run:

deactivate

To import imagery into the annotation tool, go into src/annotate and:

Choose your lat/lng and buffer distance (meters) you want (this example is for San Fran) and the directory to download to, then run:

python train/scripts/download_planetlabs.py 37.796105 -122.461349 --buffer 200 --image_type rapideye --dir ../../data/planetlab/images/

Chop up these raw images into 512x512 pixels and add them to the database

./manage.py runscript populate_db --script-args ../../data/planetlab/images/ 512

To begin annotating imagery:

Start the server running:

./manage.py runserver

Go to http://127.0.0.1:8000/train/annotate
Draw bounding boxes on the image.
Hit the "Done" button to submit results to the server
Upon successful submission, the browser will load a new image to annotate

To export annotated imagery so it can be consumed by the training pipeline:

Writes out annotated.json and all the annotated images to a specified directory

./manage.py runscript export --script-args ../../data/planetlab/metadata/

If you need to clear out the database and all its images to restart for some reason:

./manage.py runscript clear

Training Pipeline

Once you've annotated images using the annotation tool, you can bring them into the training pipeline to actually train a neural network using Caffe and your data. Note that the training pipeline does not use virtualenv, so make sure to de-activate any virtualenv environment you've activated for the annotation tool earlier.

There are a series of Python programs to aid in preparing data for training, doing the actual training, and then seeing how well the trained model performs via graphs and statistics, all located in src/cloudless/train.

To setup these tools, you must have CAFFE_HOME defined and have Caffe installed with the Caffe Python bindings setup.

Second, ensure you have all Python requirements installed by going into the cloudless root directory and running:

pip install -r requirements.txt

Third, ensure you have ./src in your PYTHONPATH as well as the Python bindings for Caffe compiled and in your PYTHONPATH as well:

export PYTHONPATH=$PYTHONPATH:/usr/local/caffe/python:./src

Original raw TIFF imagery that will be fed into the annotation tool should go into data/planetlab/images while metadata added via annotation is in data/planetlab/metadata. Data that has been prepared by one of the data prep tools below are saved as LevelDB files into data/leveldb. Raw data is not provided due to copyright concerns; however, access to some processed annotation data is available. See the section "Trained Models and Archived Data" below for more details.

Note: for any of the data preparation, training, or graphing Python scripts below you can add --help to see what command-line options are available to override defaults.

To prepare data that has been labelled via the annotation tool, first run the following from the root directory:

./src/cloudless/train/prepare_data.py --input_metadata data/planetlab/metadata/annotated.json --input_images data/planetlab/metadata --output_images data/planetlab/metadata/bounded --output_leveldb data/leveldb --log_num 1

You can keep incrementing the --log_num option while doing data preparation and test runs in order to have log output get saved for each session for later analysis. If --do_augmentation is it present we augment the data with extra training data manual 90 degree rotations. Testing found, however, that these degrade performance rather than aid performance.

To train using the prepared data, run the following from the root directory:

./src/cloudless/train/train.py --log_num 1

By default this will place the trained, fine-tuned model into logs/latest_bvlc_alexnet_finetuned.caffemodel; this can be changed via the --output_weight_file option.

To generate graphs and verify how well the trained model is performing (note that you should set the log number to be the same as what you set it for train.py):

./src/cloudless/train/test.py --log_num 1 --note "This will get added to graph"

Use the --note property to add extra info to the quality graphs, such as various details on hyperparameter settings, so you can reference them in the future.

You can also predict how well the trained classifier is doing on a single image via the predict.py script:

./src/cloudless/train/predict.py --image examples/cloud.png
./src/cloudless/train/predict.py --image examples/no_cloud.png

The four scripts above all have further options to customize them; add --help as an option when running them.

Training info and graphs go into logs/.

Trained Models and Archived Data

We currently have pretrained weights from the BVLC AlexNet Caffe Model Zoo, in src/caffe_model/bvlc_alexnet. This is trained on ILSVRC 2012, almost exactly as described in ImageNet classification with deep convolutional neural networks by Krizhevsky et al. in NIPS 2012.

Note that the trained AlexNet file is much too large to check into Github (it's about 350MB). You will have to download the file from here and copy it to src/caffe_model/bvlc_alexnet/bvlc_alexnet.caffemodel.

A trained, fine tuned model is available in a shared Dropbox file folder here. Download this and place it into src/caffe_model/bvlc_alexnet/bvlc_alexnet_finetuned.caffemodel. It's current accuracy is 89.69%, while its F1 score is 0.91. See logs/output0005.statistics.txt for full accuracy details. Note that the trained model is available under the same Apache 2 license as the rest of the code.

The same shared Dropbox folder also has the training and validation leveldb databases. Trained models for earlier training runs can be found here, while labelled RapidEye annotation data can be found here. Note that the ann

Related Skills

YC-Killer

2.7k

A library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.

best-practices-researcher

The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app

groundhog

398

Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

isf-agent

a repo for an agent that helps researchers apply for isf funding