AronT-TLV edited this page Mar 8, 2018 · 2 revisions

Welcome to FourM's Data Science Knowledge Base Wiki!

The Problem

In the past, it made sense to set up various data analysis tools on analysts' personal computers for development and testing work. Most of these tools are open source, and having them on a personal computer gives students, researchers, and corporate data scientists inexpensive, easy access to them.

This approach has several disadvantages:

  1. Given the open source and fast-moving nature of these tools, installation is complicated and updates arrive constantly. A great deal of system administration work is needed to get the tools installed and keep them up to date.
  2. Moreover, the installation methods and supplementary extensions for these tools change constantly, which only adds to the system administration nightmare.
  3. Even an analyst with the latest and greatest Mac/Windows/Linux personal computer will likely run out of the CPU and/or memory needed to work on interesting problems and data sets.

If you are the organization’s infrastructure person, you don’t want people doing dev and test work on the massive, expensive clusters used for production workloads. Setting up less expensive dev and test infrastructure “labs” can save quite a bit of work over maintaining everyone’s laptop. But points 1 and 2 mean it is still considerable work, work you would rather spend tuning and improving production infrastructure.

If you are a student or researcher dependent on the goodwill of a sysadmin (which is usually not abundant), the path of least resistance is installing everything on your laptop and making do. Anything more sophisticated is likely beyond the reach of the typical student or researcher.

The Solution

Fortunately we now have a great alternative that addresses all three of the listed concerns: containers.

  1. Deploying the latest version of the tool you need, or the specific version your code depends on, is trivial: you simply specify a version tag, or default to latest, when you deploy the container.
  2. Most of the hard work of deployment has already been done by other people. Container deployments are almost always done with some sort of script: infrastructure as code. Predefined scripts (Dockerfiles, Kubernetes deployments, and even Helm charts) make it simple to deploy container versions of nearly any data science tool you might need. Many of these, particularly Dockerfiles, are maintained by the projects themselves, and there are usually alternative versions with different configurations of the tool and its extensions. These can be tweaked if you have very specific requirements.
  3. Container deployments give you control over the resources a container uses, letting you scale up and down as your code and data require.
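As a minimal sketch of points 1 and 3, assuming Docker is installed and using the community Jupyter data science image from Docker Hub (the container name `ds-notebook` is just an illustrative choice):

```shell
# Run a Jupyter data science stack in a container.
# The image tag selects the tool version (point 1); replace "latest"
# with a pinned tag when your code depends on a specific release.
# --cpus and --memory cap the container's resources (point 3).
docker run -d \
  --name ds-notebook \
  -p 8888:8888 \
  --cpus=2 \
  --memory=4g \
  jupyter/datascience-notebook:latest
```

Swapping the tag, or raising the `--cpus`/`--memory` limits, is then a one-line change rather than a reinstall.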

For small workloads, the Docker host on your personal computer may be sufficiently powerful. Installing Docker is easy enough these days, and the edge version has Kubernetes support built in, so it will soon be easy to deploy more complex container configurations on your laptop. But inevitably there will be tools and workloads well beyond your personal computer's capabilities. This is where the public cloud comes in.
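For example, once the built-in Kubernetes is enabled, the same containerized tools can be deployed with standard `kubectl` commands; the deployment name and image here are illustrative choices, not prescribed by this wiki:

```shell
# Deploy a containerized Jupyter notebook on the laptop's local
# Kubernetes cluster, then expose it on a node port.
kubectl create deployment ds-notebook --image=jupyter/datascience-notebook:latest
kubectl expose deployment ds-notebook --type=NodePort --port=8888

# Check that the pod is running (create deployment labels pods app=<name>):
kubectl get pods -l app=ds-notebook
```

The same manifests or commands work unchanged against a cloud cluster, which is what makes the laptop-to-cloud path smooth.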

If you are the organization’s infrastructure person, rather than setting up dev/test labs for data scientists in the cloud, it’s much easier to provide appropriate cloud infrastructure and tools so your users can deploy dev and test containers as needed. You can even build tooling that makes the infrastructure itself available on demand, to minimize cloud costs. After all, users usually don’t need Docker hosts or Kubernetes clusters available 24/7 for dev/test workloads.

If you are a student, almost all the public cloud vendors have an education offering that provides some level of free resources. The public cloud vendors also offer free and/or very low cost services for researchers, if you can’t easily get the necessary resources from your own university’s sysadmin.

This Wiki

This wiki and the associated code repository aim to provide guides and tools that help you deploy data science containers, making this modern solution even easier for data scientists.

How-To Articles

Troubleshooting
