# Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Sparkmagic is a set of tools for interactively working with remote Spark clusters in Jupyter notebooks. Sparkmagic interacts with remote Spark clusters through a REST server. Currently there are three server implementations compatible with Sparkmagic:
- Livy - for running interactive sessions on Yarn
- Lighter - for running interactive sessions on Yarn or Kubernetes (only PySpark sessions are supported)
- Ilum - for running interactive sessions on Yarn or Kubernetes
The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.




## Features
- Run Spark code in multiple languages against any remote Spark cluster through Livy
- Automatic SparkContext (`sc`) and HiveContext (`sqlContext`) creation
- Easily execute SparkSQL queries with the `%%sql` magic
- Automatic visualization of SQL queries in the PySpark, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
- Easy access to Spark application information and logs (`%%info` magic)
- Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
- Send local files or dataframes to a remote cluster (e.g. sending a pretrained local ML model straight to the Spark cluster)
- Authenticate to Livy via Basic Access authentication or via Kerberos
## Examples
There are two ways to use sparkmagic. Head over to the examples section for a demonstration of both modes of execution.
### 1. Via the IPython kernel
The sparkmagic library provides a `%%spark` magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook. See the Spark Magics on IPython sample notebook.
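As a brief sketch (the Livy endpoint is configured interactively via the `%manage_spark` widget; the DataFrame code below is illustrative, not taken from the sample notebook), a session might start like this:

```
%load_ext sparkmagic.magics
%manage_spark
```

Then, in a later cell, everything after the `%%spark` magic runs on the remote cluster rather than locally:

```
%%spark
numbers = spark.range(0, 10)
print(numbers.count())
```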
### 2. Via the PySpark and Spark kernels
The sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. See Pyspark and Spark sample notebooks.
### 3. Sending local data to Spark Kernel
See the Sending Local Data to Spark notebook.
## Installation
### Jupyter Notebook 7.x / JupyterLab 3.x
1. Install the library:

   ```bash
   pip install sparkmagic
   ```

2. Make sure that ipywidgets is properly installed by running:

   ```bash
   pip install ipywidgets
   ```

3. (Optional) Install the wrapper kernels. Run `pip show sparkmagic` to see the path where `sparkmagic` is installed, `cd` to that location, and run:

   ```bash
   jupyter-kernelspec install sparkmagic/kernels/sparkkernel
   jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
   jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
   ```

4. (Optional) Modify the configuration file at `~/.sparkmagic/config.json`. Look at the example_config.json.

5. (Optional) Enable the server extension so that clusters can be programmatically changed:

   ```bash
   jupyter server extension enable --py sparkmagic
   ```
### Jupyter Notebook 5.2 or earlier / JupyterLab 1 or 2
1. Install the library:

   ```bash
   pip install sparkmagic
   ```

2. Make sure that ipywidgets is properly installed by running:

   ```bash
   jupyter nbextension enable --py --sys-prefix widgetsnbextension
   ```

3. If you're using JupyterLab 1 or 2, you'll need to run another command:

   ```bash
   jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
   ```

4. (Optional) Install the wrapper kernels. Run `pip show sparkmagic` to see the path where `sparkmagic` is installed, `cd` to that location, and run:

   ```bash
   jupyter-kernelspec install sparkmagic/kernels/sparkkernel
   jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
   jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
   ```

5. (Optional) Modify the configuration file at `~/.sparkmagic/config.json`. Look at the example_config.json.

6. (Optional) Enable the server extension so that clusters can be programmatically changed:

   ```bash
   jupyter serverextension enable --py sparkmagic
   ```
## Authentication Methods
Sparkmagic supports:
- No auth
- Basic authentication
- Kerberos
The Authenticator is the mechanism for authenticating to Livy. The base Authenticator used by itself supports no auth, but it can be subclassed to enable authentication via other methods. Two such examples are the Basic and Kerberos Authenticators.
### Kerberos Authenticator
Kerberos support is implemented via the requests-kerberos package. Sparkmagic expects a Kerberos ticket to be available on the system; requests-kerberos picks the ticket up from the credential cache. For a ticket to be available, the user needs to have run `kinit` beforehand.
### Kerberos Configuration
By default, the `HTTPKerberosAuth` constructor provided by the requests-kerberos package uses the following configuration:

```python
HTTPKerberosAuth(mutual_authentication=REQUIRED)
```

This will not be the right configuration for every context, so custom arguments can be passed to the constructor through the following section of `~/.sparkmagic/config.json`:
```json
{
  "kerberos_auth_configuration": {
    "mutual_authentication": 1,
    "service": "HTTP",
    "delegate": false,
    "force_preemptive": false,
    "principal": "principal",
    "hostname_override": "hostname_override",
    "sanitize_mutual_error_response": true,
    "send_cbt": true
  }
}
```
### Custom Authenticators
You can write custom Authenticator subclasses to enable authentication via other mechanisms. All Authenticator subclasses should override the `Authenticator.__call__(request)` method, which attaches HTTP authentication to the given `Request` object.

Authenticator subclasses that add class attributes used for authentication, such as the [Basic](sparkmagic/sparkmagic/auth/basic.py) authenticator, which adds `username` and `password` attributes, should also override the `__hash__`, `__eq__`, `update_with_widget_values`, and `get_widgets` methods to work with those new attributes. This is necessary for the Authenticator to use those attributes in the authentication process.
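As a minimal sketch, a custom authenticator might look like the following. The `TokenAuthenticator` name and bearer-token scheme are hypothetical; a real plugin would subclass `sparkmagic.auth.customauth.Authenticator`, so the tiny base class below only mimics that interface to keep the example self-contained.

```python
class Authenticator:
    """Stand-in for sparkmagic.auth.customauth.Authenticator (no auth)."""

    def __call__(self, request):
        # Base behavior: attach no authentication and return the request.
        return request


class TokenAuthenticator(Authenticator):
    """Hypothetical authenticator that attaches a bearer token to each Livy request."""

    def __init__(self, token):
        self.token = token

    def __call__(self, request):
        # Attach HTTP authentication to the given request object.
        request.headers["Authorization"] = "Bearer " + self.token
        return request

    # Subclasses that add attributes (like `token`) should also override
    # __eq__ and __hash__ (and, in a real plugin, update_with_widget_values
    # and get_widgets) so sparkmagic can compare and manage credentials.
    def __eq__(self, other):
        return isinstance(other, TokenAuthenticator) and self.token == other.token

    def __hash__(self):
        return hash(("TokenAuthenticator", self.token))
```

The `__call__` signature matches the requests library's pluggable-auth convention: a callable that receives a prepared request, mutates its headers, and returns it.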
### Using a Custom Authenticator with Sparkmagic
If your repository layout is:

```
.
├── LICENSE
├── README.md
├── customauthenticator
│   ├── __init__.py
│   ├── customauthenticator.py
└── setup.py
```
Then, to pip install from this repository, run:

```bash
pip install git+https://git_repo_url/#egg=customauthenticator
```
After installing, you need to register the custom authenticator with Sparkmagic so it can be dynamically imported. This can be done in two different ways:
1. Edit the configuration file at `~/.sparkmagic/config.json` with the following settings:

   ```json
   {
     "authenticators": {
       "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
       "None": "sparkmagic.auth.customauth.Authenticator",
       "Basic_Access": "sparkmagic.auth.basic.Basic",
       "Custom_Auth": "customauthenticator.customauthenticator.CustomAuthenticator"
     }
   }
   ```

   This adds your `CustomAuthenticator` class in `customauthenticator.py` to Sparkmagic. `Custom_Auth` is the authentication type that will be displayed in the `%manage_spark` widget's Auth type dropdown, as well as the Auth type passed as an argument to the `-t` flag in the `%spark add session` magic.

2. Modify the `authenticators` method in `sparkmagic/utils/configuration.py` to return your custom authenticator:

   ```python
   def authenticators():
       return {
           u"Kerberos": u"sparkmagic.auth.kerberos.Kerberos",
           u"None": u"sparkmagic.auth.customauth.Authenticator",
           u"Basic_Access": u"sparkmagic.auth.basic.Basic",
           u"Custom_Auth": u"customauthenticator.customauthenticator.CustomAuthenticator",
       }
   ```
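Once registered, the new auth type can be selected from the `%manage_spark` widget's dropdown or passed to the `-t` flag. As a sketch (the session name and endpoint URL below are placeholders):

```
%spark add -s my_session -l python -u http://livy-endpoint:8998 -t Custom_Auth
```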
## Spark config settings
There are two config options for Spark session settings: `session_configs_defaults` and `session_configs`. `session_configs_defaults` sets defaults that remain in effect unless a user explicitly overrides them. `session_configs` provides defaults that are replaced entirely whenever a user changes them with the `%%configure` magic.
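As a hedged illustration (the keys follow example_config.json, but the specific values here are made up), both options live in `~/.sparkmagic/config.json`:

```json
{
  "session_configs": {
    "driverMemory": "2g",
    "executorCores": 2
  },
  "session_configs_defaults": {
    "conf": {
      "spark.dynamicAllocation.enabled": "true"
    }
  }
}
```

With this config, a `%%configure` call replaces the whole `session_configs` block, while the `session_configs_defaults` entries still apply unless the user overrides those exact keys.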
## HTTP Session Adapters
If you need to customize HTTP request behavior for specific domains by modifying headers, implementing c
