Watson Studio.md

Watson Studio is a collaborative platform for data scientists, built on open source components and IBM added value, and is available in the cloud and on-premises. 

  • Collaborative platform: 

  • Gallery for sharing resources and data sets

  • Simplified communication between different users and job roles

  • Open-source components: Python, Scala, R, SQL, and notebooks (Jupyter and Zeppelin) 

  • IBM added value: Watson Machine Learning, Flow Editor, Decision Optimization, SPSS predictive analytics algorithms, analytics dashboard, and more

  • Watson Studio Cloud, Watson Studio Local, and Watson Studio Desktop

Watson Studio is a collaborative platform for data scientists that is built on open source components and IBM added value, and is available in the cloud or on-premises. The collaborative platform enables users, whether they are data scientists, data engineers, or application developers, to share resources and work together seamlessly within the platform.

Watson Studio is built upon open source components such as Python, Scala, R, SQL, and notebooks.

If the open source tools are not enough for your needs, you can use IBM added value components such as Watson Machine Learning, Flow Editor, Decision Optimization, SPSS predictive analytics algorithms, the analytics dashboard, and more.

Watson Studio is available as three different offerings:

  • Watson Studio Cloud, which is what you use in this course.
  • Watson Studio Local, which is the on-premises version.
  • Watson Studio Desktop, which is a lightweight version that you can install on your notebook computer.

Open-source tools give you a wide range of choices. The downside of having so many choices is that it is hard to know which one to choose. Essentially, you want to pick the correct tool for the job. Watson Studio is no exception.

Watson Studio is designed for a specific persona, but other personas can use it as it relates to their jobs.

Look at the diagram on the slide. Starting from the top and going clockwise, you have the input, analysis, and output phases. Within each phase are the objectives of that phase. Each objective can overlap between various user personas.

Look at the list of personas on the right side: the data engineer, the data scientist, the business analyst, and the app developer. Each persona has primary tools that help them do their job. For example, the data scientist's main tool is Watson Studio. Also, there might be a team of different personas. Whatever the case is, you must decide what tool is correct for the job, regardless of the personas. The definitions of personas can vary between different companies and evolve over time. 

Data science spans multiple industries, but you can see that data analysis that is applied to some of these use cases is not entirely new. In fact, organizations have been doing these types of activities for many years. The advantage that you have with Watson Studio is that you can easily collaborate with other data scientists by using well-known tools that are widely used in the industry.

Watson Studio is built as a collaborative platform. Watson Studio provides an easy way for you to learn how to get started with the platform. You can create state-of-the-art products that are based on the data that you derive by using open source and IBM added value tools. As you innovate, you can collaborate with your team and the community to share and gain insights.

Watson Studio is available in the following offerings:

  • Watson Studio Cloud is available through IBM Cloud as the public cloud option. To use it, you must have an IBM Cloud account. If you do not already have one, you are directed to register with IBM Cloud the first time you attempt to access the Watson Studio environment.
  • IBM Watson Studio Local is a ready-to-use enterprise solution for data scientists and data engineers. It offers a suite of data science tools, such as RStudio, Spark, Jupyter, and Zeppelin notebooks, that are integrated with proprietary IBM technologies. This offering can be installed on a private cloud or on-premises on the organization’s servers. It includes the same tools and features that are available in Watson Studio Cloud. 
  • Watson Studio Desktop is the local version of the tool that you install on your local machine. It enables you to experience the Watson Studio tooling before you decide on one of the other two options.

Watson Studio and other IBM Cloud services

  • Watson Studio is part of Watson Data Platform:

  • All Watson Data Platform services are seamlessly integrated and loosely coupled through IBM Cloud.

  • Examples: 

  • Watson Studio is aware of the Watson Machine Learning deployment service.

  • Watson Studio is aware of Db2 Warehouse on Cloud (formerly dashDB), Analytics Engine, and other data sources.

  • The services do not depend on each other and some can be used as stand-alone (loosely coupled).

Watson Studio is part of the Watson Data Platform. It integrates seamlessly with other tools through IBM Cloud, yet remains loosely coupled with them. Here are some examples of how Watson Studio can work with other tools:

  • Watson Studio is aware of the Watson Machine Learning deployment service.
  • Watson Studio is aware of Db2 Warehouse on Cloud (formerly dashDB), IBM Analytics Engine, and other data sources.

Each service is independent of the other ones, so you use only what you need.

Watson Studio high availability (public cloud)

  • Designed for 24x7 availability: Continuous availability and continuous delivery 

  • Backup and recovery: 

  • Watson Studio is disaster-resistant. 

  • Notebooks in Watson Studio are stored in a three-way Cloudant cluster in multiple geographic zones. 

  • Watson Studio provides integration with GitHub and an interface for downloading notebooks if the customer wants to use their own backup.

Watson Studio is designed for 24x7 availability, that is, continuous delivery and continuous availability. Features and updates are rolled out without downtime. Notebooks in Watson Studio are stored in a three-way Cloudant cluster in multiple geographic zones. Watson Studio also provides integration with GitHub and an interface for downloading notebooks if you want to keep your own backups.

Projects

  • The architecture of Watson Studio is centered around the project. 
  • Projects are a way to organize resources for a specific data science task or goal. 
  • Integrate collaborators, data and analytic assets, tools, Gallery resources, your own data, and so on, to support finding insights for a well-defined or narrow goal.

The architecture of Watson Studio is centered on projects where everything is seamlessly integrated. You create and organize projects to suit your business needs. Projects consist of data assets, collaborators, analytic assets, and Gallery resources that are combined with many open source and added value tools.

Data assets are the files in your object store or connections to sources such as databases, data services, streaming data, and other external files.

Collaborators can be assigned to your projects as admins, editors, or viewers.

Analytic assets are the notebooks and the models that you develop.

Watson Studio offers a suite of open source tools to help you with your job. It also has a suite of added value tools such as Decision Optimization, Watson Machine Learning, and Streaming Analytics.
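The way a project groups its resources can be pictured with a small data model. The following is only a sketch for illustration: the `Project` and `Asset` classes, the asset kinds, and the sample names are hypothetical and are not part of any Watson Studio API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Roles and asset kinds as named in the text; the string values are illustrative.
ROLES = ("admin", "editor", "viewer")
ASSET_KINDS = ("data", "analytic")


@dataclass
class Asset:
    name: str
    kind: str  # "data" (file or connection) or "analytic" (notebook or model)


@dataclass
class Project:
    """Hypothetical in-memory stand-in for a project, not the product's API."""
    name: str
    assets: List[Asset] = field(default_factory=list)
    collaborators: Dict[str, str] = field(default_factory=dict)  # email -> role

    def add_asset(self, name: str, kind: str) -> None:
        if kind not in ASSET_KINDS:
            raise ValueError(f"unknown asset kind: {kind}")
        self.assets.append(Asset(name, kind))

    def add_collaborator(self, email: str, role: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.collaborators[email] = role

    def assets_of_kind(self, kind: str) -> List[str]:
        return [a.name for a in self.assets if a.kind == kind]


project = Project("churn-analysis")
project.add_asset("customers.csv", "data")
project.add_asset("explore.ipynb", "analytic")
project.add_collaborator("ana@example.com", "editor")
```

Everything that belongs to one goal hangs off a single `Project` object, which mirrors the idea that the project is the unit of organization.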

You can think of Watson Studio AI tools in these categories: 

  • Natural language classification 
  • Machine learning
  • Deep learning

Creating a project

When you create a project in Watson Studio, you can create an empty project or preload your project with data and analytical assets:

Create an empty project: Add the data that you want to prepare, analyze, or model. Choose tools based on how you want to work: write code, create a flow on a graphical canvas, or automatically build models.

Create a project from a sample or file: Get started fast by loading existing assets. Choose a project file from your system, or choose a curated sample project.

Watson Studio and Cloud Object Storage

Projects require object storage to store non-database data sources.

  • An IBM Cloud Object Storage instance can be created when the project is created.

  • A project can be associated with an existing Cloud Object Storage instance.

  • Information stored in IBM Cloud Object Storage is encrypted and resilient. 

  • Each project has its own dedicated bucket. 

  • Buckets can be managed from the Watson Studio project interface. 

  • Cloud Object Storage supports two APIs:

  • Swift API: Available through Watson Studio 

  • Amazon Simple Storage Service (S3) API

Object storage provides the space where unstructured data for your project is stored.

An IBM Cloud Object Storage instance can be created at project creation time or a new project can be associated with an existing Cloud Object Storage instance.

Object Storage supports two APIs:

  • The Swift API, which is available through Watson Studio. 
  • The S3 API, where you provide external credentials.

Information that is stored with IBM Cloud Object Storage is encrypted and resilient. Cloud Object Storage uses buckets to organize the data. Each project has its own dedicated bucket.

The Cloud Object Storage buckets can be managed from the Watson Studio project interface.
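When you use the S3 API with your own credentials, access from Python could look roughly like the following sketch. It assumes the `ibm-cos-sdk` package (imported as `ibm_boto3`) is installed wherever `list_bucket_keys` is called. The setting names (`COS_API_KEY`, `COS_INSTANCE_CRN`, `COS_ENDPOINT`, `COS_BUCKET`) are hypothetical placeholders for the values in your own service credentials.

```python
from typing import Dict, List

# Hypothetical setting names; map them to your own service credentials.
REQUIRED = ("COS_API_KEY", "COS_INSTANCE_CRN", "COS_ENDPOINT", "COS_BUCKET")


def cos_settings(env: Dict[str, str]) -> Dict[str, str]:
    """Collect the values an S3-style client needs and fail early if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


def list_bucket_keys(settings: Dict[str, str]) -> List[str]:
    """List object keys in the project's bucket (requires the ibm-cos-sdk package)."""
    # Imported lazily so cos_settings() stays usable without the SDK installed.
    import ibm_boto3
    from ibm_botocore.client import Config

    client = ibm_boto3.client(
        "s3",
        ibm_api_key_id=settings["COS_API_KEY"],
        ibm_service_instance_id=settings["COS_INSTANCE_CRN"],
        config=Config(signature_version="oauth"),
        endpoint_url=settings["COS_ENDPOINT"],
    )
    response = client.list_objects_v2(Bucket=settings["COS_BUCKET"])
    return [obj["Key"] for obj in response.get("Contents", [])]
```

A typical call would be `list_bucket_keys(cos_settings(dict(os.environ)))` after exporting the four values and importing `os`. Keeping the credentials outside the notebook avoids sharing them with every collaborator who can open it.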

Creating a project and defining its storage

Projects include integration with IBM Cloud Object Storage for storing project assets. When a new project is created, a new Cloud Object Storage service is created to provide an unstructured cloud data store. It is also possible to associate a project with an existing Cloud Object Storage.

To create a new project:

  1. Click + New.
  2. Enter a name for the project.
  3. In the Define storage pane, click Add.
  4. Click Refresh.
  5. The Cloud Object Storage service page with the New tab selected is displayed. Click Create.
  6. Click Refresh in the project page.
  7. The newly created Cloud Object Storage instance is associated with the project.
  8. Click Create to create the project.

Watson Studio project tabs

The Watson Studio project page includes tabs with specific information about the project. The following tabs are included:

Overview. Provides basic information about the project such as: 

  • Description 
  • Storage usage 
  • Collaborators 
  • Recent activity

Assets. Lists all the different types of assets that you can add to your project such as:

  • Data assets 
  • Models 
  • Notebooks
  • Dashboards 
  • Experiments 
  • Modeler flows

Environments. In this tab, you can define the hardware size and software configuration for the runtime that is associated with Watson Studio tools. An environment definition defines the hardware and software configuration that you can use to run tools like notebooks, model builder, or the flow editor in Watson Studio. With environments, you have dedicated resources and flexible options for the compute resources that are used by the tools.

Jobs. A job is a way of running assets, such as Data Refinery flows or notebooks, in a project in Watson Studio.

From the Jobs tab of your project, you can: 

  • See the list of the jobs in your project.
  • View the details of each job. You can change the schedule settings of a job and pick a different environment definition.
  • Monitor job runs. 
  • Create jobs.
  • Delete jobs.

Deployments. Contains any models that you created and deployed, for example machine learning models.

Access Control. Here you can add and manage additional collaborators for your project.

Settings. The project settings are where you define key aspects of your project, such as its name and description. Use Settings to add tools and key services, check storage, and define access tokens.

Assets

An asset is an artifact in a project that contains information, or metadata, about data or data analysis.

Data assets are the types of assets that point to data, for example, a file or a data set that is accessed through a connection to an external data source.

Analytical assets are the types of assets that run code to analyze data.

An asset might have associated files or require specific IBM Cloud services. For example, many types of analytical assets require a service.

The information that you can view about an asset and the actions that you can perform on an asset vary by asset type. A few actions apply to all asset types, for example, you can create, edit the properties of, and remove or delete all types of assets. You can edit the contents of most types of assets in projects. For data assets that contain relational data, you can shape and cleanse the data with the Data Refinery tool. For analytical assets, you can edit the contents in the appropriate editor tool. In catalogs, you can edit only the metadata for the asset.

A new project has no assets. By default, the Data assets type is added with all project starters. To add other asset types to your project, click Add to project.

Watson Studio Gallery

Contains samples that you can use in your project: 

  • Run sample notebooks to learn new techniques or to use as templates for your own notebooks.
  • Add sample data sets to your project to analyze with the sample notebooks or your own analytical assets.

The Watson Studio Gallery is a great place to get started. In the Gallery you can find open data sets that are ready to use, including local, state, and government data. You just download the data set and load it into your project.

The Gallery includes sample data sets and notebooks that apply to various topics such as transportation, science and technology, health, law and government, economy and business, and many more. You can add the data set or notebook to your project.

When you sign in to Watson Studio, scroll down in the home page to see the content that is most recently added to the Gallery. You can click Explore to browse the Gallery content from here. Alternatively, you can use the menu to go directly to the Gallery area where you find data sets and notebooks.

Collaborating in a project

  • Project collaborators share knowledge and resources.
  • Admin role is required to manage collaborators (add, remove, and edit). 
  • Collaborators are added at the project level. 
  • Only collaborators in a project can access the project’s data and notebooks assets.

After you create a project, add collaborators to share knowledge and resources.

Project collaborators share knowledge and resources and help one another complete jobs. You must have the Admin role in the project to manage collaborators.

Only the collaborators in your project can access your data, notebooks, and other assets. Collaborators can be removed from a project or have their permissions changed.

To add collaborators to your project:

  1. From your project, click the Access Control tab, and then click Add collaborators.
  2. Add the collaborators who you want to have the same access level:
     • Type email addresses into the Invite field.
     • Copy multiple email addresses, which are separated by commas, and paste them into the Invite field.
  3. Choose the access level for the collaborators and click Add:
     • Viewer: View the project.
     • Editor: Control project assets.
     • Admin: Control project assets, collaborators, and settings.
  4. Add more collaborators with the same or different access levels.
  5. Click Invite.

If the invited users have existing IBM Cloud accounts with Watson Studio activated, they are added to your project immediately. If an invited user does not have an IBM Cloud account, the user receives an email invitation to create an IBM Cloud account and activate Watson Studio. When the user activates Watson Studio, the user can see your project and the user's status on your collaborators list changes from Invited to Active. If necessary, you can resend or cancel an invitation.
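The three access levels and the comma-separated Invite field can be sketched in a few lines of Python. This is an illustration of the rules described above, not the product's implementation. The names `AccessLevel`, `parse_invite_field`, `can_edit_assets`, and `can_manage_collaborators` are hypothetical, and lowercasing and de-duplicating the addresses is an assumption made for the example.

```python
from enum import IntEnum
from typing import List


class AccessLevel(IntEnum):
    """Access levels from the text, ordered from least to most privileged."""
    VIEWER = 1  # view the project
    EDITOR = 2  # control project assets
    ADMIN = 3   # control project assets, collaborators, and settings


def parse_invite_field(text: str) -> List[str]:
    """Split a comma-separated Invite field into unique, trimmed addresses."""
    emails: List[str] = []
    for part in text.split(","):
        email = part.strip().lower()
        if email and email not in emails:
            emails.append(email)
    return emails


def can_edit_assets(level: AccessLevel) -> bool:
    # Editors and admins control project assets; viewers can only view.
    return level >= AccessLevel.EDITOR


def can_manage_collaborators(level: AccessLevel) -> bool:
    # Only the Admin role may add, remove, or edit collaborators.
    return level == AccessLevel.ADMIN
```

Ordering the levels as integers keeps the checks short, because each level includes the permissions of the one below it.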

Watson Studio and Spark

  • You need a Spark environment if:

  • Your notebook includes Spark APIs. 

  • You create machine learning models or model flows with Spark runtimes.

  • Spark environments that are offered under Watson Studio: 

  • Watson Studio users can create Spark environments with varying hardware and software configurations.  

Apache Spark is an open source distributed cluster computing framework that is optimized for extremely fast and large-scale data processing. Spark provides a core data processing engine. Several libraries for SQL, machine learning, graph computation, and stream processing run on Spark and can be used together in an application. To learn more about Apache Spark, see the documentation and examples at spark.apache.org.

If your notebook includes Spark APIs, or you create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors. You can use Spark environments for notebooks, the model builder, and Spark modeler flows.
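To get a feel for what configuring the driver size and the size and number of executors means in terms of resources, here is a minimal sketch. The `NodeSize` and `SparkEnvironment` classes are hypothetical, and the sizes used are illustrative values, not the configurations that Watson Studio actually offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    vcpus: int
    memory_gb: int


@dataclass(frozen=True)
class SparkEnvironment:
    """Hypothetical description of a Spark environment's hardware settings."""
    driver: NodeSize
    executor: NodeSize
    num_executors: int

    def total_vcpus(self) -> int:
        # One driver plus num_executors identical executors.
        return self.driver.vcpus + self.executor.vcpus * self.num_executors

    def total_memory_gb(self) -> int:
        return self.driver.memory_gb + self.executor.memory_gb * self.num_executors


# Illustrative sizes only: a small driver with three larger executors.
env = SparkEnvironment(driver=NodeSize(1, 4), executor=NodeSize(2, 8), num_executors=3)
print(env.total_vcpus(), env.total_memory_gb())  # prints "7 28"
```

Because every executor has the same size, the total grows linearly with the number of executors, which is worth keeping in mind when usage is metered.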

Spark options

In Watson Studio, you can use Spark environments that are offered under Watson Studio.

All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark, and Scala). Each kernel gets a dedicated Spark cluster and Spark executors. Spark environments are offered under Watson Studio and, like default environments, consume capacity unit hours (CUHs) that are tracked.

Spark default environments

Spark environments are offered under Watson Studio. All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark, and Scala). Each kernel gets a dedicated Spark cluster and Spark executors. Like default environments, Spark environments consume capacity unit hours (CUHs) that are tracked.