Polly Notebooks: Reproducible analysis expert

Shubhra Agrawal
July 9, 2020

The interactive notebook has recently become a major fixture in the data science toolbox, especially for sharing analyses. We have integrated notebooks with our platform, Polly, to allow for an easy setup process along with data management and computational resource management capabilities. This article will walk you through the notebook features available in Polly and how to set up the scripting environment quickly!

What are Notebooks?

Polly Python3 Notebook

Interactive notebooks are environments that couple your code with the resulting output in a single document, including text and visualizations. In addition, they allow you to do the following:

  • Maintain a record of your analyses: Did you forget how you normalized some data last month? You can revisit your code anytime!
  • Re-run old code, piece by piece: Notebooks are organized into cells that can be run and re-run independently. You don’t have to run the entire document to see what the manipulated data looks like in cell 2.
  • Keep code and output together: The output of each step is visible right in the document, whether it’s a plot you generated or a data frame you manipulated. This eliminates the need to copy and paste plots into a separate document.

Why choose Polly Notebooks over Local Notebooks?

Polly Notebook provides a Jupyter-like interface on the cloud. Here’s why we recommend Polly Notebooks over local hosting options:

  • Ready-to-code platform: Installing and maintaining environments for every notebook can be a frustrating overhead. We provide custom docker environments that come pre-installed with modules commonly used in bioinformatics.
  • Cloud storage: With Polly Notebooks, you can store your data files and notebooks in a single place that will be ready to run in less than 5 minutes from anywhere in the world. No need to fetch your code from Bitbucket anymore!
  • Share and collaborate on your projects: Polly allows sharing of projects so you can review and refer to notebooks within your team.
  • Resource management: RNA-seq analyses are commonly resource-intensive, whether in terms of RAM or processing power. In such cases, you either have to scramble for bigger resources or compromise on speed by using less processing power. Polly makes it possible to scale up your resources at any time.

The above features help reduce setup costs and barriers to entry in biological data analyses significantly. Combine these features with data storage on the cloud and the ability to share projects and you can heavily boost collaboration within your group and ensure reproducibility of every analysis you perform!

Getting a Polly account

Although Polly remains behind a paywall to cover the storage and processing costs, the team is providing free Polly Notebook trials to selected bioinformaticians for a limited time period. You can register yourself for a free trial by filling out a form here.

Creating your first Polly Notebook

In this section, we will learn how to create and save a Polly Notebook, get familiar with the scripting interface, and explore some other exciting features!

Once you have logged in to Polly, navigate to the default project, and create a new notebook.

Polly Notebooks Dashboard

On the next screen, you will be asked to choose between different environments and resources. Let’s discuss some basic terminology before we make our selection:

  • Kernels are the engines that run your code. While there are hundreds of kernels written for Jupyter, Polly provides the three most commonly used kernels for data science: R, Python2 and Python3.
  • Docker is an isolated environment with the chosen kernel and basic dependencies installed to run your code. Choose the docker environment that is most suitable for your purpose, or request a custom docker. One of the more notable dockers provided by Polly Notebook is the Pollyglot docker, where you can run R, Python and Bash code within the same notebook!
  • Machine type specifies the number of cores and amount of memory required to run your notebook. For basic scripting purposes, Polly Small should be sufficient.
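
Once a notebook is up and running, it can be handy to confirm which interpreter and how many cores the session actually sees. The following is a minimal sketch using only the Python standard library; the helper name `describe_environment` is made up for this illustration and is not part of Polly. Note that inside a container, `os.cpu_count()` may report the host's cores rather than the share allotted to your machine type.

```python
import os
import platform
import sys


def describe_environment():
    """Return basic facts about the interpreter and machine running this code."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        # os.cpu_count() may return None if the count cannot be determined.
        "cpu_cores": os.cpu_count(),
    }


# Print one fact per line, e.g. in the first cell of a new notebook.
for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this in a fresh cell gives a quick sanity check that the kernel you selected is the one executing your code.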

In the following example, we will use the Python3 kernel with a Polly Small machine.

Polly Notebooks interface to choose your docker & machine

Once the notebook is ready, you will see a cell with the message “Welcome to Polly <kernelname> Notebook”. Now let’s see how to run your code.

Beginning coding in Polly Notebook

Running your code in Polly Notebook

All the code in a notebook is organized into cells for easier comprehension. To see how these cells behave, let us add some code in the first cell. Type `print("Hello Jupyter!")` in the first cell and click on the >| button in the toolbar to execute this statement. You can also press `Shift+Enter` for the same effect.

Running your code in Polly Notebook

The output of the first cell is shown right under it and the label to its left is updated from `In [ ]:` to `In [1]:`, indicating that this was the first cell to be executed in the notebook. This is a powerful feature that helps you keep track of your variables if the cells are executed in a non-linear fashion. You can add more cells using the Insert tab and continue coding like you would in a local editor!
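
To see why the execution label matters, here is a toy model of a notebook, written as a rough sketch rather than a description of how Jupyter or Polly is implemented. The `ToyNotebook` class is hypothetical: it keeps one namespace shared by every cell and stamps each cell with a running counter, which is enough to show that the result depends on the order in which cells were run.

```python
class ToyNotebook:
    """A toy model of a notebook kernel: one shared namespace, one run counter."""

    def __init__(self, cells):
        self.cells = list(cells)                 # source code of each cell
        self.labels = [None] * len(self.cells)   # None mirrors the empty `In [ ]:` label
        self.namespace = {}                      # variables shared by every cell
        self.counter = 0

    def run(self, index):
        """Execute one cell and stamp it with the next execution number."""
        exec(self.cells[index], self.namespace)
        self.counter += 1
        self.labels[index] = self.counter


nb = ToyNotebook([
    "x = 10",
    "x = x * 2",
    "result = x + 1",
])

# Run the cells in a non-linear order: cell 1 is executed twice.
for i in (0, 1, 1, 2):
    nb.run(i)

print(nb.labels)               # [1, 3, 4]
print(nb.namespace["result"])  # 41
```

Because the second cell ran twice, `x` was doubled twice and `result` is 41 rather than 21. The labels `[1, 3, 4]` reveal that a cell was re-run, which is the kind of bookkeeping the `In [n]:` label gives you.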

How do I use my project files in a Polly Notebook?

Polly Notebook provides simple methods to fetch files from your project and save new files from your notebook. These methods are applicable to both R and Python notebooks.

  • list_project_file() : Get a list of files available in your Polly project
  • download_project_file('input_file.csv') : Pull your selected file into the current working space
  • save_file_to_project('sample_file.csv') : Save the modified/newly created file to your project on Polly

Here’s an example snippet to see how this would work. Suppose we have uploaded a CSV file named ‘my data file.csv’ to our Polly project. We then create a new notebook as shown above and write the following snippet.

Using project files in Polly Notebooks

As you can see, we first fetched the name of the file using the list method, then downloaded it into our current workspace. We used a pandas method to read this file and then display the contents right in the document.

Let’s manipulate this data and save it as a different file in Polly. We’ll be deleting the ‘id’ column from this CSV using the following snippet.
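
The column-removal step could look roughly like the sketch below. To keep it runnable anywhere, it uses Python's standard `csv` module instead of pandas (in pandas, `DataFrame.drop(columns=['id'])` achieves the same result), and it writes a tiny made-up sample file into a temporary directory in place of the file that `download_project_file` would fetch. The helper `drop_column`, the output file name, and the sample rows are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def drop_column(src_path, dst_path, column):
    """Copy a CSV file from src_path to dst_path, leaving out one column."""
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {src_path}")
        kept = [name for name in reader.fieldnames if name != column]
        rows = [{name: row[name] for name in kept} for row in reader]

    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=kept)
        writer.writeheader()
        writer.writerows(rows)
    return kept


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the downloaded project file; the rows are invented.
    src = os.path.join(workdir, "my data file.csv")
    dst = os.path.join(workdir, "my_data_file_no_id.csv")  # hypothetical name
    with open(src, "w", newline="") as f:
        f.write("id,sample,value\n1,A,0.5\n2,B,0.7\n")

    remaining = drop_column(src, dst, "id")
    with open(dst, newline="") as f:
        output_text = f.read()

print(remaining)  # ['sample', 'value']
print(output_text)
```

In a Polly Notebook, the new file would then be passed to `save_file_to_project` so that it shows up alongside the original in your project.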

Output 1

Let’s use the list method again to see if the file appears in the project.

Output 2

You can also navigate to your project and check if this file has been added there.

Now that your work is done, you can save your notebook using the save button from the toolbar. This will ensure your current state has been stored in Polly for future use.

Installing dependencies using Polly terminal

While our available environments are equipped with the most commonly used bioinformatics modules for R and Python, you may still run into “Module not found” errors while importing a package or library. If you require a package that’s not available in our environment, you can install it using the terminal, just like you would on your local system.

Let’s take a look at the steps involved for both Python and R notebooks.

Installing packages for a Python notebook:

Go to the Polly Offerings tab and open the Terminal. The terminal opens in a new tab automatically and looks something like this:

Output 3

# For installing packages DON’T forget to use sudo. It will not ask for a password.
> sudo pip install <package-name>

# System binaries
> sudo apt install <package-name>

# If the above command outputs ‘package not found’, you can run this command to update the system package indices
> sudo apt-get update

To install a new package like biopython, simply type `sudo pip install biopython`.

Output 4

Once the package installation is successful, you can import the package in your notebook using `import Bio`.
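
If you would rather check for a package before importing it, a small helper along these lines can tell you whether a trip to the terminal is needed. This is only a sketch: `is_installed` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and it expects a top-level import name. Keep in mind that the import name can differ from the name you install, as with `Bio` and `biopython`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "Bio" is the import name of biopython.
for name in ("json", "Bio"):
    if is_installed(name):
        print(f"{name}: available")
    else:
        print(f"{name}: missing, install it from the terminal first")
```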

Installing packages for an R notebook:

Go to the Polly Offerings tab and open the Terminal. The terminal opens in a new tab automatically.

# You can install R package by opening R terminal
> sudo R

# Install packages using the following command
> install.packages(c('<pkg-name>'), dependencies=TRUE, repos=<Enter your choice cran mirror link>)

# For CRAN mirror link: You can use either one of your choice or this one: "https://cran.cnr.berkeley.edu/"

# For importing the library using the terminal, use the following command (Note – You can also call the libraries from the notebook as usual)
> library(<pkg-name>)

Sharing your project

Once your notebook is ready, save it using the save button and go back to the Polly project page.

Sharing your project on Polly Notebooks

Click on the ‘Share’ button and enter your collaborator’s email address. Note that you can only share your projects with other Polly users. To share your project with collaborators who are not yet on Polly, you can download your notebook in various formats like HTML, .ipynb, Markdown, LaTeX, etc. You will find the download option inside your notebook under the “File” menu.

Final thoughts

We presented a brief look into the Polly Notebooks offering. There’s much more that can be done with notebooks, from making beautiful custom dashboards and mini-apps to automating computationally heavy workflows on the cloud. This is just a start. We will be pushing out more tutorials and documentation regularly to help you get the best out of Polly!

Since you are among our early users, we would love to get your feedback on the product. Love it? Hate it? Think it’s just another fish in the pond? We’re all ears at product@elucidata.io.
