Knowledge Repo for Data Science

One of the key problems data scientists face is keeping track of results and sharing a project's progress with colleagues. In data science, decisions are based on results, so sharing a chunk of code is of little use unless the results (graphs, tables, etc.) come with it. Tools like R Markdown and Jupyter (formerly IPython) notebooks have done a great job of producing traceable, interactive results alongside documentation. GitHub, on the other hand, is widely adopted for sharing and reviewing code and writing, but not the results (images and graphs) that the code produces. Knowledge Repository combines these ideas into one system. It focuses on facilitating the sharing of knowledge between data scientists and other technical roles. It provides various data stores for “knowledge posts”, with a particular focus on notebooks to better promote reproducible results.

At a basic level, Knowledge Repo is a Git repository to which knowledge posts written as Jupyter notebooks, R Markdown documents, or plain Markdown files are committed. Every knowledge post must carry a header in a specific format that includes a title, author(s), tags, and a TLDR. Knowledge Repo validates the content by running all of the code and transforms the post into plain text with Markdown syntax.
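
As a rough illustration of the header requirement, the sketch below checks a post for the four fields named above. The `---` delimiters, the field names `authors` and `tldr`, and the helper names `parse_header` and `missing_fields` are assumptions made for this example. It is not Knowledge Repo's own validation code, and it only understands flat `key: value` pairs and simple `- item` lists.

```python
# Minimal header check, assuming a '---' delimited block at the top of the post.
REQUIRED_FIELDS = ("title", "authors", "tags", "tldr")  # assumed field names


def parse_header(text):
    """Return a dict mapping each top-level header key to a list of values."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("post does not start with a '---' header block")
    header = {}
    current = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return header  # closing delimiter reached
        if stripped.startswith("- ") and current is not None:
            # List item belonging to the most recent key.
            header[current].append(stripped[2:].strip())
        elif ":" in line and not line[:1].isspace():
            # New top-level key, optionally with an inline value.
            name, _, value = line.partition(":")
            current = name.strip()
            header[current] = [value.strip()] if value.strip() else []
    raise ValueError("header block is not closed with '---'")


def missing_fields(text):
    """List the required fields that are absent or empty in the header."""
    header = parse_header(text)
    return [field for field in REQUIRED_FIELDS if not header.get(field)]


# Hypothetical post used only for illustration; it has no TLDR.
post = """---
title: Churn analysis
authors:
- alice
tags:
- churn
---
Body text.
"""
print(missing_fields(post))  # prints ['tldr']
```

A check like this could run before a post is submitted for review, so that reviewers only see posts whose headers are complete.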

Essentially, Knowledge Repo provides the following functionality:

Reproducibility: The entire work can be reproduced at any point in time.
Quality: GitHub’s pull requests and peer review improve the quality of the code.
Consumability: With proper documentation and results alongside the code, the whole work is accessible to non-technical readers.
Discoverability: Structured metadata makes it easier to navigate through past research.
Learning: By having previous work easily accessible, it becomes easier to learn from each other.

Elucidata uses Knowledge Repository to keep track of the progress of each data science project. A local knowledge repository is initialized and then linked to a remote Git repository. Depending on the project, one or more knowledge posts are created, each with the special header that Knowledge Repo needs in order to recognize it. Each knowledge post is written so that it can be reproduced on its own. To remove any dependency on a particular machine, the aws.s3 package (for R Markdown) and the boto/boto3 packages (for Jupyter notebooks) are used to pull files directly from AWS S3. The knowledge post is then added to the knowledge repo and submitted to the remote Git repository (Bitbucket, for example). Each post is reviewed by colleagues to improve the quality of the code, and then merged into master. The knowledge repository can be deployed on a server so that the merged posts can be viewed and shared with the relevant stakeholders or clients. Knowledge Repo thus brings the whole analysis for a project into a single place, making it easier to share now and to retrieve later.
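
The S3 step for a Jupyter-based post might look roughly like the sketch below. The helper name `fetch_inputs`, the `data` directory, and the bucket and key names are hypothetical. The only library call relied on is the boto3 S3 client's `download_file(Bucket, Key, Filename)` method, and the client is passed in as an argument so that the helper itself does not need AWS credentials to be imported or tested.

```python
import os


def fetch_inputs(s3_client, bucket, keys, dest_dir="data"):
    """Download each S3 key into dest_dir and return the local file paths.

    s3_client is expected to behave like boto3.client("s3"); files that
    already exist locally are not downloaded again.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    for key in keys:
        # Keep only the file name; keys are assumed to have unique base names.
        local_path = os.path.join(dest_dir, os.path.basename(key))
        if not os.path.exists(local_path):
            s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
        local_paths.append(local_path)
    return local_paths
```

In a notebook with boto3 installed and credentials configured, the first cell could call something like `fetch_inputs(boto3.client("s3"), "my-project-bucket", ["raw/measurements.csv"])`, where the bucket and key are placeholders. Because the inputs are fetched by the post itself, a reviewer can rerun it on any machine that has access to the bucket.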

Even though Knowledge Repository is still a work in progress, it is a great tool for sharing and reviewing the progress of a project.