poola

Python package for pooled screen analysis

Install

Install from GitHub for the latest development release:

pip install git+https://github.com/gpp-rnd/poola.git#egg=poola

Or install the most recent distribution from PyPI:

pip install poola

How to use

Additional packages required for this tutorial can be installed using pip install -r requirements.txt

from poola import core as pool
import pandas as pd
import seaborn as sns
import gpplot
import matplotlib.pyplot as plt

To demonstrate the functionality of this module we'll use read counts from Sanson et al. 2018.

supp_reads = 'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07901-8/MediaObjects/41467_2018_7901_MOESM4_ESM.xlsx'
read_counts = pd.read_excel(supp_reads,
                            sheet_name = 'A375_orig_tracr raw reads', 
                            header = None,
                            skiprows = 3, 
                            names = ['sgRNA Sequence', 'pDNA', 'A375_RepA', 'A375_RepB'], 
                            engine='openpyxl')
guide_annotations = pd.read_excel(supp_reads,
                                  sheet_name='sgRNA annotations', 
                                  engine='openpyxl')

The input data has one column with sgRNA sequences and three columns with read counts

read_counts.head()

         sgRNA Sequence  pDNA  A375_RepA  A375_RepB
0  AAAAAAAATCCGGACAATGG   522        729        774
1  AAAAAAAGGATGGTGATCAA   511       1484       1393
2  AAAAAAATGACATTACTGCA   467        375        603
3  AAAAAAATGTCAGTCGAGTG   200        737        506
4  AAAAAACACAAGCAAGACCG   286        672        352

lognorms = pool.lognorm_columns(reads_df=read_counts, columns=['pDNA', 'A375_RepA', 'A375_RepB'])
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' rows due to low pDNA abundance')

Filtered 576 rows due to low pDNA abundance
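
For intuition, the pDNA filter can be read as a z-score cutoff on the log-normalized pDNA values. The snippet below is a minimal plain-Python sketch under that assumption, not poola's implementation. The helper name filter_low_abundance, the toy guide names, and the choice of the sample standard deviation are all hypothetical.

```python
import statistics

def filter_low_abundance(pdna_lognorms, z_low=-3.0):
    """Keep guides whose pDNA z-score is above z_low (illustrative only)."""
    values = list(pdna_lognorms.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return {guide: value for guide, value in pdna_lognorms.items()
            if (value - mean) / sd > z_low}

# Toy data: 19 well-represented guides and one that is nearly absent from the pDNA
toy_pdna = {f"guide_{i}": 4.0 for i in range(19)}
toy_pdna["guide_dropout"] = 0.0

kept = filter_low_abundance(toy_pdna, z_low=-3.0)
print(f"Filtered {len(toy_pdna) - len(kept)} rows due to low pDNA abundance")
```

Guides that are poorly represented in the plasmid library give noisy log-fold changes, which is why they are removed before any comparison with the pDNA reference.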

Note that the column names are unchanged by log-normalization

lognorms.head()

         sgRNA Sequence      pDNA  A375_RepA  A375_RepB
0  AAAAAAAATCCGGACAATGG  4.192756   3.373924   3.521755
1  AAAAAAAGGATGGTGATCAA  4.163726   4.326828   4.312620
2  AAAAAAATGACATTACTGCA  4.041390   2.540624   3.196767
3  AAAAAAATGTCAGTCGAGTG  2.930437   3.388159   2.973599
4  AAAAAACACAAGCAAGACCG  3.388394   3.268222   2.528233
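
The log-normalized values shown here are consistent with a log2(reads per million + 1) transform applied to each column. Below is a minimal sketch of that arithmetic, assuming this is the transform being used. The helper name lognorm is illustrative and not part of poola, and because the toy column contains only five guides, its output will not match the values in the full dataset.

```python
import math

def lognorm(counts):
    """log2(reads per million + 1) for one column of read counts (illustrative)."""
    total = sum(counts)
    return [math.log2(count / total * 1e6 + 1) for count in counts]

# The five pDNA read counts from the head of the input table, treated as a whole column
toy_counts = [522, 511, 467, 200, 286]
toy_lognorms = lognorm(toy_counts)

for count, value in zip(toy_counts, toy_lognorms):
    print(count, round(value, 3))
```

Normalizing by the column total makes samples with different sequencing depths comparable, and the added 1 keeps guides with zero reads finite.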

lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=['A375_RepA', 'A375_RepB'])

We drop the pDNA column after calculating log-fold changes

lfc_df.head()

         sgRNA Sequence  A375_RepA  A375_RepB
0  AAAAAAAATCCGGACAATGG  -0.818831  -0.671000
1  AAAAAAAGGATGGTGATCAA   0.163102   0.148894
2  AAAAAAATGACATTACTGCA  -1.500766  -0.844622
3  AAAAAAATGTCAGTCGAGTG   0.457721   0.043161
4  AAAAAACACAAGCAAGACCG  -0.120172  -0.860161
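
Each log-fold change here is the target column's lognorm minus the reference (pDNA) lognorm for the same sgRNA. The sketch below checks that arithmetic on one row of the lognorm table. The helper name calculate_row_lfcs is hypothetical and only mirrors the argument names used in this tutorial.

```python
def calculate_row_lfcs(lognorm_row, ref_col, target_cols):
    """Subtract the reference lognorm from each target lognorm (illustrative)."""
    return {col: lognorm_row[col] - lognorm_row[ref_col] for col in target_cols}

# Lognorm values for AAAAAAAGGATGGTGATCAA, copied from the lognorm table
row = {"pDNA": 4.163726, "A375_RepA": 4.326828, "A375_RepB": 4.312620}

lfcs = calculate_row_lfcs(row, ref_col="pDNA", target_cols=["A375_RepA", "A375_RepB"])
print({col: round(value, 6) for col, value in lfcs.items()})
```

Because the values are already on a log2 scale, this subtraction is equivalent to taking the log of the ratio of normalized abundances.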

Since we only have two replicates, it's easy to visualize them as a point density plot using gpplot

plt.subplots(figsize=(4,4))
gpplot.point_densityplot(data=lfc_df, x='A375_RepA', y='A375_RepB')
gpplot.add_correlation(data=lfc_df, x='A375_RepA', y='A375_RepB')
sns.despine()

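[Figure: point density plot of A375_RepB versus A375_RepA log-fold changes, annotated with their correlation]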
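
To make the replicate comparison concrete, here is a minimal sketch of a Pearson correlation between two replicate columns. It assumes the annotation on the plot is a Pearson coefficient, which is not stated in this tutorial. The helper pearson_r is hypothetical, and the five rows used are only the head of the table, so the result is not representative of the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five log-fold changes for each replicate, copied from the table
rep_a = [-0.818831, 0.163102, -1.500766, 0.457721, -0.120172]
rep_b = [-0.671000, 0.148894, -0.844622, 0.043161, -0.860161]

r = pearson_r(rep_a, rep_b)
print(round(r, 2))
```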

Since we see a strong correlation, we'll average the log-fold change of each sgRNA across replicates

avg_replicate_lfc_df = pool.average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence', condition_indices=[0],
                                                   sep='_')

After averaging, the dataframe is in long (melted) format: the condition column specifies the experimental condition (A375 here) and the n_obs column specifies the number of replicates that were averaged

avg_replicate_lfc_df.head()

         sgRNA Sequence  condition    avg_lfc  n_obs
0  AAAAAAAATCCGGACAATGG       A375  -0.744916      2
1  AAAAAAAGGATGGTGATCAA       A375   0.155998      2
2  AAAAAAATGACATTACTGCA       A375  -1.172694      2
3  AAAAAAATGTCAGTCGAGTG       A375   0.250441      2
4  AAAAAACACAAGCAAGACCG       A375  -0.490166      2
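
One way to read the condition_indices and sep arguments: each column name is split on sep, and the pieces at condition_indices are joined to form the condition label, so A375_RepA and A375_RepB both map to A375. The sketch below illustrates that reading on a single sgRNA. It is an assumption about the behavior rather than poola's code, and both helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def condition_name(column, condition_indices, sep="_"):
    """Build a condition label from the selected pieces of a column name."""
    parts = column.split(sep)
    return sep.join(parts[i] for i in condition_indices)

def average_replicates(lfc_row, condition_indices, sep="_"):
    """Average one sgRNA's log-fold changes within each condition (illustrative)."""
    grouped = defaultdict(list)
    for column, lfc in lfc_row.items():
        grouped[condition_name(column, condition_indices, sep)].append(lfc)
    return {cond: {"avg_lfc": mean(values), "n_obs": len(values)}
            for cond, values in grouped.items()}

# Log-fold changes for AAAAAAAATCCGGACAATGG, copied from the table of log-fold changes
row = {"A375_RepA": -0.818831, "A375_RepB": -0.671000}
averaged = average_replicates(row, condition_indices=[0], sep="_")
print(averaged)
```

Encoding the condition in the column name this way means replicates only need a consistent naming scheme, not a separate sample sheet.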

It's sometimes helpful to group controls into pseudo-genes so they're easier to compare with target genes. Our annotation file maps from sgRNA sequences to gene symbols

remapped_annotations = pool.group_pseudogenes(annotations=guide_annotations, pseudogene_size=4, 
                                              gene_col='Annotated Gene Symbol', 
                                              control_regex=['NO_CURRENT'])
remapped_annotations.head()

         sgRNA Sequence  Annotated Gene Symbol  Annotated Gene ID
0  AAAAAAAATCCGGACAATGG               SLC25A24              29957
1  AAAAAAAGGATGGTGATCAA                FASTKD3              79072
2  AAAAAAATGACATTACTGCA                  BCAS2              10286
3  AAAAAAATGTCAGTCGAGTG                  GPR18               2841
4  AAAAAACACAAGCAAGACCG                 ZNF470             388566
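
The idea behind pseudo-genes can be sketched in a few lines: guides whose gene symbol matches a control pattern are relabeled in groups of pseudogene_size, while all other guides keep their gene symbol. This is a simplified illustration and not poola's implementation. The sequential assignment, the pseudo-gene naming scheme, and the toy guide and control names are all assumptions.

```python
import re

def group_controls(annotations, pseudogene_size, control_regex):
    """Relabel control guides into fixed-size pseudo-genes (illustrative)."""
    patterns = [re.compile(pattern) for pattern in control_regex]
    remapped = {}
    n_controls = 0
    for guide, gene in annotations.items():
        if any(pattern.search(gene) for pattern in patterns):
            # Hypothetical naming scheme: consecutive controls share a label
            remapped[guide] = f"ctl_pseudogene_{n_controls // pseudogene_size}"
            n_controls += 1
        else:
            remapped[guide] = gene
    return remapped

# Toy annotations: guide IDs and control names are made up for illustration
toy_annotations = {
    "guide_a": "BCAS2",
    "guide_b": "NO_CURRENT_1",
    "guide_c": "NO_CURRENT_2",
    "guide_d": "NO_CURRENT_3",
    "guide_e": "GPR18",
    "guide_f": "NO_CURRENT_4",
    "guide_g": "NO_CURRENT_5",
}
remapped = group_controls(toy_annotations, pseudogene_size=2, control_regex=["NO_CURRENT"])
print(remapped)
```

Grouping controls so that each pseudo-gene has about as many guides as a real gene makes gene-level averages of controls and targets comparable.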

We provide two methods for scaling log-fold change values to controls:

Z-score from a set of negative controls
Scale scores between a set of negative and positive controls

For both scoring methods, you can input either a regex or a list of genes to define control sets
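
As a rough illustration of the two scoring methods and of defining control sets by regex or by list, consider the sketch below. It assumes z-scores use the mean and sample standard deviation of the negative controls, and that scaling maps the negative-control mean to 0 and the positive-control mean to 1. poola's exact conventions (for example mean versus median, or the sign of the scaled score) are not stated here, and every function and gene name in the snippet is hypothetical.

```python
import re
from statistics import mean, stdev

def select_controls(genes, control_regex=None, control_genes=None):
    """Pick controls by regex match or by membership in an explicit list."""
    patterns = [re.compile(pattern) for pattern in (control_regex or [])]
    listed = set(control_genes or [])
    return {gene for gene in genes
            if gene in listed or any(p.search(gene) for p in patterns)}

def z_score(lfcs, neg_ctls):
    """Z-score every gene against the negative-control distribution."""
    neg_values = [lfcs[gene] for gene in neg_ctls]
    mu, sd = mean(neg_values), stdev(neg_values)
    return {gene: (lfc - mu) / sd for gene, lfc in lfcs.items()}

def scale_between(lfcs, neg_ctls, pos_ctls):
    """Scale so negative controls average 0 and positive controls average 1."""
    neg_mu = mean(lfcs[gene] for gene in neg_ctls)
    pos_mu = mean(lfcs[gene] for gene in pos_ctls)
    return {gene: (lfc - neg_mu) / (pos_mu - neg_mu) for gene, lfc in lfcs.items()}

# Toy gene-level log-fold changes with made-up names
toy_lfcs = {"NEG_1": 0.1, "NEG_2": -0.1, "NEG_3": 0.0,
            "ESS_1": -2.1, "ESS_2": -1.9, "GENE_X": -1.0}

neg = select_controls(toy_lfcs, control_regex=["^NEG_"])            # regex input
pos = select_controls(toy_lfcs, control_genes=["ESS_1", "ESS_2"])   # list input

z_scores = z_score(toy_lfcs, neg)
scaled = scale_between(toy_lfcs, neg, pos)
print(round(z_scores["GENE_X"], 2), round(scaled["GENE_X"], 2))
```

Z-scores only need negative controls and express effects in units of control noise, while scaled scores also need positive controls and make screens with different dynamic ranges easier to compare.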

For our set of negative controls, we'll use nonessential genes

nonessential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/non-essential-genes-Hart2014.txt',
                                    names=['gene'])
                      .gene)
annot_guide_lfcs = pool.annotate_guide_lfcs(avg_replicate_lfc_df, remapped_annotations, 'Annotated Gene Symbol',
                                            merge_on='sgRNA Sequence', z_score_neg_ctls=True,
