Snorkel
Snorkel - Bootstrap your Data Science
Snorkel is a local, ready-in-30-seconds data science workbench for small- to medium-sized data problems.
It is based on Apache Zeppelin, is easy to start and stop, and lets you persist your workspace locally and update your Python or JavaScript dependencies without interrupting your work. It is best suited for early-stage data exploration and prototyping, and comes fully loaded with common Python and JavaScript data science libraries.
How to launch it
On Linux and macOS
- ./build-images.sh: Run once to build the Docker image and install the Python and JavaScript dependencies.
- ./zeppelin.sh --start: Starts the Zeppelin container.
  Default port for Zeppelin is 8080, i.e. http://localhost:8080. Default port for Spark UI is 4040, i.e. http://localhost:4040, once the first Spark job has been started.
- ./zeppelin.sh --stop: Stops the Zeppelin container.
On Windows
Windows scripts are available (.cmd extension). You can execute them from the command prompt or PowerShell, or simply double-click them in File Explorer (or right-click > run).
- build-images.cmd: Run once to build the Docker image and install the Python and JavaScript dependencies.
- start-zeppelin.cmd: Starts the Zeppelin container. A command prompt window will appear; press any key to close it.
  Default port for Zeppelin is 8080, i.e. http://localhost:8080. Default port for Spark UI is 4040, i.e. http://localhost:4040, once the first Spark job has been started.
- stop-zeppelin.cmd: Stops the Zeppelin container. Once again, press any key to close the window.
Custom configuration
Workspace persistence
On first start, the following volumes will be created on the host at the specified default locations and shared with the container:
Host | Container | Description
--- | --- | ---
snorkel/zeppelin/data | /zeppelin/data | Data stored here is available in Zeppelin
snorkel/zeppelin/logs | /zeppelin/logs | Logs
snorkel/zeppelin/notebooks | /zeppelin/notebooks | Notebooks git repo, i.e. your work
snorkel/zeppelin/spark-warehouse | /zeppelin/spark-warehouse | Storage for temporary Spark tables
You can override the location of these volumes by setting the environment variable ZEPPELIN_ROOT_DIR to your preferred location before running ./zeppelin.sh --start.
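As a minimal sketch of that override on Linux or macOS: the directory name below is a hypothetical example, and the snippet assumes you run it from the directory that contains zeppelin.sh.

```shell
# Hypothetical location; pick any directory you can write to.
export ZEPPELIN_ROOT_DIR="$HOME/snorkel-workspace"

# Start Zeppelin if the launcher is in the current directory.
if [ -x ./zeppelin.sh ]; then
  ./zeppelin.sh --start
fi
```

The variable has to be exported in the same shell session that launches the script, otherwise the script will not see it.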
Zeppelin interpreter memory
By default half of the total available memory will be allocated to the Zeppelin interpreters on start.
You can override this value by setting the environment variable ZEPPELIN_MEMORY (the value should be the size in GB, e.g. export ZEPPELIN_MEMORY=8 for 8 GB of memory).
UI ports
By default the Zeppelin UI will run on port 8080 and the Spark UI on port 4040.
You can override these values by setting the environment variables ZEPPELIN_PORT and SPARK_UI_PORT, respectively.
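For instance, if ports 8080 and 4040 are already taken on your machine, something like the following should move both UIs. The port numbers are arbitrary examples, and the snippet assumes zeppelin.sh is in the current directory.

```shell
# Example values only; any free ports should do.
export ZEPPELIN_PORT=8090    # Zeppelin UI would then be at http://localhost:8090
export SPARK_UI_PORT=4050    # Spark UI would then be at http://localhost:4050

# Start Zeppelin if the launcher is in the current directory.
if [ -x ./zeppelin.sh ]; then
  ./zeppelin.sh --start
fi
```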
Add Python and JS dependencies on-the-fly
snorkel/zeppelin/bootstrap/python/requirements.txt lets you define Python pip dependencies.
zeppelin/bootstrap/js and zeppelin/bootstrap/css let you deploy JS and CSS libraries inside Zeppelin.
On Linux and macOS, call ./zeppelin.sh --refresh to refresh your container without restarting it!
Examples
Python dependency
Say you're missing the python web micro-framework Flask. Just add the following line to
snorkel/zeppelin/bootstrap/python/requirements.txt:
Flask==0.12.2
And execute ./zeppelin.sh --refresh. Voilà! Flask is available in your Zeppelin notebook, no restart needed.
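The same two steps can be scripted. This is only a sketch: SNORKEL_DIR is a hypothetical helper variable (Snorkel itself does not read it), and the snippet assumes zeppelin.sh sits at the top of that directory.

```shell
# Assumption: SNORKEL_DIR points at your Snorkel checkout (defaults to ./snorkel).
SNORKEL_DIR="${SNORKEL_DIR:-snorkel}"
REQ="$SNORKEL_DIR/zeppelin/bootstrap/python/requirements.txt"

# Make sure the file exists so grep does not fail on a fresh checkout.
mkdir -p "$(dirname "$REQ")"
touch "$REQ"

# Append the pin only if that exact line is not already present.
grep -qxF 'Flask==0.12.2' "$REQ" || echo 'Flask==0.12.2' >> "$REQ"

# Refresh the running container if the launcher is where we assume it is.
if [ -x "$SNORKEL_DIR/zeppelin.sh" ]; then
  (cd "$SNORKEL_DIR" && ./zeppelin.sh --refresh)
fi
```

Because of the grep check, running the snippet a second time does not add a duplicate line.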
JS libraries
Let's imagine you want to add the mobx library to your dependencies.
There are two ways to add javascript dependencies to your Zeppelin notebook:
- By using unpkg, a fast, global content delivery network for everything on npm:
  Add the following script tag to your code in the notebook's snippet:
  <script src="https://unpkg.com/mobx"></script>
  This will inject the static (non-minified) source code of the library in your browser.
- By using the zeppelin.sh script:
  - Download the source code of the library from any CDN
  - Add the js file to the bootstrap/js folder
  - Execute ./zeppelin.sh --refresh. This will copy the library in the container at a location where Zeppelin can serve it to your browser.
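The second approach could look roughly like this from a terminal. The SNORKEL_DIR variable and the mobx.js file name are assumptions made for illustration, and curl is just one possible download tool.

```shell
# Assumption: SNORKEL_DIR points at your Snorkel checkout (defaults to ./snorkel).
SNORKEL_DIR="${SNORKEL_DIR:-snorkel}"
JS_DIR="$SNORKEL_DIR/zeppelin/bootstrap/js"
mkdir -p "$JS_DIR"

# Download the library from the CDN; -f fails on HTTP errors, -L follows redirects.
if curl -fsSL "https://unpkg.com/mobx" -o "$JS_DIR/mobx.js"; then
  # Refresh the running container if the launcher is where we assume it is.
  if [ -x "$SNORKEL_DIR/zeppelin.sh" ]; then
    (cd "$SNORKEL_DIR" && ./zeppelin.sh --refresh)
  fi
else
  echo "Download failed; the container was not refreshed." >&2
fi
```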
Scala/Java dependency
You can use Zeppelin's built-in dependency interpreter to pull dependencies without leaving your notebook.
For example, if you need the Scala plotting library Vegas, just add the following line in a snippet at the very beginning of your notebook:
%spark.dep
z.load("org.vegas-viz:vegas_2.11:0.3.11")
Do not forget to specify the spark.dep interpreter!
Execute the snippet before running any code (or restart your interpreter and execute the snippet). You can now use the library normally:
import vegas._
...
Dependencies table
The table below lists all the dependencies included in the container.
Library | Version | Licence
--- | --- | ---
matplotlib | 2.0.2 | PSF
NumPy | 1.13.1 | BSD
pandas | 0.20.3 | BSD
python-igraph | 0.7.1.post6 | GPL 2
cairocffi | 0.8.0 | BSD-3-Clause
scikit-learn | 0.19.0 | BSD-3-Clause
SciPy | 0.19.1 | BSD
Seaborn | 0.8.1 | BSD-3-Clause
sklearn | 0.0 | BSD
d3js | 4.10.2 | BSD-3-Clause
leaflet | 1.2.0 | BSD 2-clause
Leaflet.markercluster | 1.1.0 | MIT
Zeppelin | 0.7.3 | Apache-2.0
Docker Compose | 3.3 | Apache-2.0
