KaveToolbox is the data analytics toolkit of the KAVE. It is installed with AmbariKave and is also installable stand-alone (http://beta.kave.io). A wiki for the entire KAVE is maintained on the wiki of the cluster installer, AmbariKave.
Examples:
- Installing a python module into anaconda when the first install used root privileges
- Installing some derived library (some ROOT component) which needs ROOT and python integration
Why is this complicated?
- By default the root user does not have the KAVE environment set up: it has the system python and cannot see the KAVE components
- Whatever new modules you try to install as root will then build against the system libraries, not KaveToolbox
What errors might I see?
- Complaints that you don't have the right privileges
- Software you think you've installed does not link or run correctly against KaveToolbox
- Software you think you've installed does not run for your user, but seems to run for the root user
Fix:
- sudo su #changes you to the root user
- source /opt/KaveToolbox/pro/scripts/KaveEnv.sh # the regular kave environment is not automatically fired for the root user
- #install as normal
Example:
sudo su
source /opt/KaveToolbox/pro/scripts/KaveEnv.sh
conda update conda
pip install pymongo
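Before installing anything as root, it can help to confirm that pip now resolves to the KAVE python rather than the system one. The following is only a sketch: `is_under_prefix` and `check_kave_pip` are hypothetical helper names, and it assumes the toolbox was installed under the default /opt prefix.

```shell
# Return 0 if path $1 lives under directory prefix $2.
is_under_prefix() {
    case "$1" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical guard: report whether the pip on PATH sits under the
# install prefix (default /opt, an assumption about your installation).
check_kave_pip() {
    local prefix="${1:-/opt}"
    local pip_path
    pip_path="$(command -v pip 2>/dev/null)" || pip_path=""
    if is_under_prefix "$pip_path" "$prefix"; then
        echo "pip resolves to $pip_path"
    else
        echo "pip is '${pip_path:-not found}', not under $prefix; source KaveEnv.sh first" >&2
        return 1
    fi
}
```

Running `check_kave_pip` after sourcing KaveEnv.sh, and before `pip install`, gives an early warning if root is still using the system python.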
Our libraries:
- RootNotes: using ROOT plotting in ipython notebooks
- StatTools: tool for confidence level calculations
- LogMon : monitor a logfile (e.g. hive logfile)
- gdown : Auto download files from google docs/drive
- geomaps : Utilities to give a postal code-based map, or other geographical-based map in ipython Notebook
Installer for:
- Python through (ana)conda, includes SciPy, numpy, pip etc., (continuum.io)
- ROOT, CERN's data analysis package (root.cern.ch)
- R with integration into iPython notebook (http://nbviewer.ipython.org/github/dboyliao/cookbook-code/blob/master/notebooks/chapter07_stats/08_r.ipynb)
- Additional hadoopy-python modules, dumbo, mrjob, pyleus and pymongo_hadoop (if hadoop is available)
- Pentaho kettle (only if specifically configured, see ReleaseNotes.md for details), graphical process and data management tool (http://community.pentaho.com/projects/data-integration/)
- robomongo (only if specifically configured, see ReleaseNotes.md for details)
Examples of:
- IPython notebooks for RootNotes, StatTools, R and geomaps
CentOS6, CentOS7, Redhat7, Ubuntu 14 and Ubuntu 16 are used for testing, although no guarantees are given.
Only bash is supported as a default shell at the moment; users with a different default shell have reported many problems.
Please get in touch if you would like to make enquiries about this.
KaveToolbox is aimed at making the installation of our key analytics software and libraries seamless, so that one-click deployment is possible and encouraged. It takes the pain out of working out prerequisites and compilation for most of our software. When you just want to get stuck straight into the data, you can bring along your same toolbox. It ensures a common environment to allow for simpler code distribution across all data nodes: "fire and forget" instead of "push and pray".
KaveToolbox recognises two types of distribution:
- Node - no x-windows, needs libraries necessary for linking/running jobs, but no GUI management for that
- Workstation - complete data analysis workstation, with all graphical components, vnc, x-windows, etc.
- Node: 5 GB of disk space for the software, additional 2 GB temp space needed during installation
- Workstation: 7 GB of disk space for the software, additional 2 GB temp space needed during installation
- Node: 1 core, 2 GB of RAM
- Workstation: 2 cores, 4 GB of RAM
- An internet connection (many packages will be downloaded from various sites)
- Centos6/7: review your yum.conf file to make sure you are not excluding packages that need to be installed
Nodes are likely to have even higher requirements from other services, such as Hadoop or Storm.
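The disk figures above can be checked before starting an install. This is a minimal sketch rather than part of the installer: `has_free_gb` and `check_node_space` are hypothetical helpers, and using /opt for the software and /tmp for the temporary space is an assumption.

```shell
# Return 0 if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # df -P prints POSIX-format output; field 4 of line 2 is the free space in KB.
    avail_kb="$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')"
    [ -n "$avail_kb" ] || return 1
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Node figures from the list above: 5 GB for the software, 2 GB temp space.
# Checking /opt and /tmp is an assumption about where those land.
check_node_space() {
    has_free_gb /opt 5 || { echo "cannot confirm 5 GB free for /opt" >&2; return 1; }
    has_free_gb /tmp 2 || { echo "cannot confirm 2 GB free for /tmp" >&2; return 1; }
    echo "disk space looks sufficient for a node install"
}
```

If /opt and /tmp share one filesystem, the two amounts add up, so check for 7 GB in total instead.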
- (2 core + 4 GB RAM)+(1 core + 2 GB RAM)*(number of simultaneous users)
- (100 GB)*(number of all-time users) home directory
- 20 GB "/" free on top of system size, or direct mount of 20 GB as /opt/
- 100 GB "/tmp" size
- GB Ethernet with high upload bandwidth for VNC connections
- We recommend that any servers/services requiring 100% uptime (e.g. Hue/Ganglia/nagios/ldap) are not run on the analysis workstation, since analysis users have erratic usage with very high peaks. Run such services on dedicated servers in the network.
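The sizing rules in this list are simple arithmetic, sketched here for convenience. `workstation_sizing` is a hypothetical helper name; the numbers are taken directly from the list above.

```shell
# Print recommended workstation resources.
#   $1 = number of simultaneous users
#   $2 = number of all-time users
workstation_sizing() {
    local simultaneous="$1" all_time="$2"
    local cores=$((2 + 1 * simultaneous))   # (2 core) + (1 core) * users
    local ram_gb=$((4 + 2 * simultaneous))  # (4 GB) + (2 GB) * users
    local home_gb=$((100 * all_time))       # 100 GB of home directory per user
    echo "cores=$cores ram_gb=$ram_gb home_gb=$home_gb"
}

# Example: 5 simultaneous users, 20 all-time users
workstation_sizing 5 20   # prints: cores=7 ram_gb=14 home_gb=2000
```

The fixed 20 GB for "/" or /opt and 100 GB for /tmp come on top of these figures.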
We also release the software packaged within docker containers. See http://hub.docker.com/r/kave/kavetoolbox. For example:
docker run -it kave/kavetoolbox:3.7-Beta.c7.node /bin/bash
When making a local installation you have two choices:
- Installing a released version from the repos server
- Installing the head, branch or specific tag from GIT
We recommend installing with the default configuration. If you want to modify the configuration, you can create the file /etc/kave/CustomInstall.py. For an example and more information, run the installer with --help.
- 1: Installing the released version, for example, 3.7-Beta
yum -y install wget curl tar zip unzip gzip python
wget http://repos:kaverepos@repos.kave.io/noarch/KaveToolbox/3.7-Beta/kavetoolbox-installer-3.7-Beta.sh
sudo bash kavetoolbox-installer-3.7-Beta.sh [--quiet]
(--quiet is for a quieter install; remove the brackets!) Remember that help is available at this stage with --help. (NB: yum is the standard package manager for Centos/Redhat. To install on Ubuntu the equivalent is apt-get.)
(NB: the repository server uses a semi-private password only as a means of avoiding robots and reducing DoS attacks. This password is intended to be widely known and is used here as an extension of the URL.)
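If you script the download, the installer URL can be built from the version string. This sketch assumes that other releases follow the same URL pattern as the 3.7-Beta example above, which is not guaranteed; `installer_url` is a hypothetical helper name.

```shell
# Build the installer URL for a given release, following the pattern of
# the 3.7-Beta example (assumed to hold for other versions too).
installer_url() {
    local version="$1"
    echo "http://repos:kaverepos@repos.kave.io/noarch/KaveToolbox/${version}/kavetoolbox-installer-${version}.sh"
}

# Usage (not executed here):
#   wget "$(installer_url 3.7-Beta)"
#   sudo bash kavetoolbox-installer-3.7-Beta.sh --quiet
```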
- 2: Installing the head from git, Example given using ssh.
#test ssh keys with
ssh -T git@github.com
#if this works,
git clone git@github.com:KaveIO/KaveToolbox.git
#then install with
sudo ./KaveToolbox/scripts/KaveInstall [--quiet]
(--quiet is for a quieter install; remove the brackets!) Remember that help is available at this stage with --help.
- Then to browse through examples
cd /opt/KaveToolbox/pro/examples
ipython notebook
And/or visit http://nbviewer.ipython.org/
- Optional: Editing configuration files
  - Default will install into directories in /opt
  - Default will not overwrite existing packages
  - Default configurations are well-tested; all of them are read from config/kavedefaults.py
  - To override configurations, create a simple python file in /etc/kave/CustomInstall.py
  - This python file should be used to logically overwrite any property of a service appearing in kavedefaults.py, and it will not be overwritten on re-install/upgrade
  - To override pip requirements, create and edit the file /etc/kave/requirements.txt
  - For an example and more information call ./KaveToolbox/scripts/KaveInstall --help
- Optional: Set mirrors/nearside cache
  - A list of mirrors of where to locate our software can be added to /etc/kave/mirror .
  - The "mirror" file is interpreted line-by-line and should be used to list nearside cache directories or nearside mirrors of the KPMG repository.
  - All mirrors listed here must follow the same directory structure as the main repository, which looks like: mirror/os-version(s)/KaveToolbox/toolbox-version(s)/files.ext
  - See more details below on setting up such a cache
- Optional: Additional installation options
  - The installer script has more options to help steer the installation
  - Take a look at the --help for the KaveInstall script for more details.
  - Examples include automatically cleaning old versions from /opt (--clean-after)
  - Examples include completely cleaning directories before install from /opt (--clean-before)
  - If you want to only select a certain list of components to install, this is possible with command-line arguments, e.g. KaveInstall KaveToolbox anaconda will only install the KaveToolbox scripts and anaconda python, but nothing else
Troubleshooting:
- "Warning: end of file not at end of line" during installation: this means you don't have enough virtual memory for the compilation of root. Modify configuration file for "low memory mode"
- Other errors in root or python installation: if installation fails, it may be due to conflicts with a previous install. Try touch ~/.nokaveEnv and then obtain a clean shell, possibly via ssh
- ProtectNotebooks.sh script: if run as root, it adds a system-wide ipython_notebook_config.py file; if run as a user, it adds a user-level ipython_notebook_config.py file. This file chooses a default port based on username and protects notebooks with the user's login password
- Re-running the installer over a pre-existing installation will only install new software and pick up new configuration changes.
- New software will be installed into versioned directories, to make it easier to track
- In case of an error during installation the install will stop. To complete an incomplete installation, re-run the installer. This will not delete any partially created directories; you will need to do that yourself
- To fix some component within a broken installation, delete any installed directories in /opt (or wherever you specified them to be) and re-run the installer. It will only install those parts you either deleted or that didn't work the first time.
- To perform a complete re-install, remove the relevant directories from /opt, like /opt/root, /opt/kettle etc., or add the --clean-before flag to the script
- To re-install only the core KaveToolbox with any new features, see Updating
- To re-install specific components, add the component names as arguments, e.g. ' KaveInstall eclipse kettle --clean-before '
There are three possible update mechanisms
- Downloading/rerunning the latest install script (from git or from the repository -> 2 methods)
- Running the KaveUpdate script
sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --list
sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --help
sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --quiet
The update script works well for updating between 2.X versions, and can also be used for 1.X, but only with either:
- the --clean-before flag.
- or by moving/removing directories in /opt, e.g. moving /opt/KaveToolbox/ to /opt/KaveToolbox/1.X and /opt/anaconda to /opt/anaconda/2.2 (version of old anaconda install)
The --clean-after flag is a common addition to the update to remove deprecated software after install
If you are trying to upgrade from 1.X to 2.X, either use --clean-before to remove the previous install, or move /opt/KaveToolbox/ to /opt/KaveToolbox/1.X and /opt/anaconda to /opt/anaconda/2.2 before installation
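Moving a directory into a subdirectory of itself takes an intermediate step. The sketch below shows one way to do it, assuming that "moving /opt/KaveToolbox/ to /opt/KaveToolbox/1.X" means the old contents should end up inside a new 1.X subdirectory. `move_into_versioned` is a hypothetical helper, not part of KaveToolbox.

```shell
# Move the contents of directory $1 into a new subdirectory $1/$2.
move_into_versioned() {
    local dir="${1%/}" version="$2"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Rename the old install aside, recreate the parent, then move it back in.
    local aside="${dir}.moving.$$"
    mv "$dir" "$aside"
    mkdir -p "$dir"
    mv "$aside" "$dir/$version"
}

# Usage as root (not executed here), matching the paths in the text:
#   move_into_versioned /opt/KaveToolbox 1.X
#   move_into_versioned /opt/anaconda 2.2
```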
- The correct paths to directly use our tools will be automatically added to your environment provided:
  - you are not the root user
  - you do not have .nokaveEnv in your home directory
  - you use bash as your default shell
- In other cases you will need to get/set the environment manually
source [directory, e.g. /opt/KaveToolbox/pro/scripts]/KaveEnv.sh
- The ASCII-art KAVE banner only shows up for interactive, non-dumb terminals. To turn off the KAVE banner even in that case, do
touch ~/.nokaveBanner
- To disable automatic setting of the environment for this user:
touch ~/.nokaveEnv
- To force setting the environment for this user in case they would normally be skipped, first remove .nokaveEnv, then:
touch ~/.kaveEnv
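The rules above can be summarised as a small decision function. This is one reading of the text, not the logic of KaveEnv.sh itself: `kave_env_is_automatic` is a hypothetical helper, and the precedence (.nokaveEnv wins, then .kaveEnv forces, then the root and shell checks) is an assumption.

```shell
# Return 0 if the KAVE environment would be set automatically.
#   $1 = numeric user id, $2 = home directory, $3 = default shell
kave_env_is_automatic() {
    local uid="$1" home="$2" shell_path="$3"
    # An opt-out marker always disables the automatic environment.
    if [ -e "$home/.nokaveEnv" ]; then return 1; fi
    # A force marker enables it even for users who would be skipped.
    if [ -e "$home/.kaveEnv" ]; then return 0; fi
    # Otherwise: not root, and bash as the default shell.
    if [ "$uid" -eq 0 ]; then return 1; fi
    case "$shell_path" in
        bash|*/bash) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage for the current user (not executed here):
#   kave_env_is_automatic "$(id -u)" "$HOME" "$SHELL" && echo "automatic"
```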
- take a look at the examples!
cd $KAVETOOLBOX/examples
ipython notebook
--> Choose, for example, rootnotes.ipynb
--> Kernel --> Restart
--> Cells --> RunAll
We have migrated to python3 as the default.
Ideally all of your nodes will have access to the internet during installation in order to download software.
If this is not the case, you may be able to implement a near-side cache/mirror of all required software. This is not very easy, but once it is done, you can keep it for later.
- Centos6: Howto
- EPEL: Mirror FAQ , Mirroring
- Ambari: Local Repositories Deploying HDP behind a firewall
Setting up a local near-side cache for the KAVE tool stack is quite easy. First either copy the entire repository website to your own internal apache server, or copy the contents of the directories to your own shared directory visible from every node.
mkdir -p /my/shared/dir
cd /my/shared/dir
wget -r http://repos:kaverepos@repos.kave.io/
Then create a /etc/kave/mirror file on each node with the new top-level directory to try first before looking for our website:
echo "/my/shared/dir" >> /etc/kave/mirror
echo "http://my/local/apache/mirror" >> /etc/kave/mirror
So long as the directory structure of the nearside cache is identical to our website, you can drop, remove or replace any local packages you will never install from this directory structure, and update it as our repo server updates.
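When the mirror file is managed by a provisioning script that may run more than once, plain `echo >>` adds duplicate lines. A minimal sketch of an idempotent alternative follows; `add_mirror` is a hypothetical helper, and the second argument exists only so that it can be tried against a scratch file.

```shell
# Append mirror entry $1 to the mirror file ($2, default /etc/kave/mirror)
# unless an identical line is already present.
add_mirror() {
    local entry="$1"
    local mirror_file="${2:-/etc/kave/mirror}"
    mkdir -p "$(dirname "$mirror_file")"
    touch "$mirror_file"
    # -x matches whole lines, -F treats the entry as a fixed string.
    grep -qxF -- "$entry" "$mirror_file" || echo "$entry" >> "$mirror_file"
}

# Usage as root (not executed here):
#   add_mirror /my/shared/dir
#   add_mirror http://my/local/apache/mirror
```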
You might consider creating a near-side cache and/or configuring your proxy settings correctly. Since we use wget for the downloads, your existing proxy settings (e.g. the http_proxy environment variable) should be sufficient.
Don't forget that the root user/sudo must also communicate over the proxy, and this may mean propagating the right environment variables. Try adding:
Defaults env_keep +="http_proxy"
Defaults env_keep +="https_proxy"
to your sudoers file with visudo
We can't troubleshoot your networking issues for you. If you are trying to install from behind a proxy, check the "How can I install behind a proxy" FAQ, talk with your network administrator, and decide whether you need to set up a nearside cache.
- This is to be expected.
- If you have edited the configuration file to change installed packages or locations, it is quite likely that root will not install from the precompiled version correctly.
- To fix this, revert your copy of kaveconfiguration.py to the default settings and re-install root, or, configure/compile root yourself in this new location like:
cd /root/install/location
./configure [options e.g. linuxx8664gcc --enable-python --enable-mathmore --enable-minuit2 --enable-roofit --fail-on-missing]
make -j numcores
or follow the instructions on the root website to install root yourself
Many different packages are needed, did you maybe run out of space? Or did you ignore kernel packages in your yum.conf?
Check /etc/yum.conf and see if there is anything you are ignoring or forbidding from installing.
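A quick way to spot ignored packages is to look for `exclude=` lines in the yum configuration. This sketch only covers that one directive; `list_yum_excludes` is a hypothetical helper name.

```shell
# Print any exclude= lines from a yum configuration file
# (default /etc/yum.conf). Returns non-zero if there are none.
list_yum_excludes() {
    local conf="${1:-/etc/yum.conf}"
    grep -E '^[[:space:]]*exclude[[:space:]]*=' "$conf"
}

# Usage (not executed here):
#   list_yum_excludes || echo "no packages are excluded in /etc/yum.conf"
```

An entry such as `exclude=kernel*` would explain missing kernel packages during installation.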
This is gnome trying to spawn an x window to have you enter your password. Work around by:
unset SSH_ASKPASS
So long as the prerequisites are already installed (see the yum install commands in kaveconfiguration.py), it is in principle possible to install all the software we package into a local directory. However, that is not implemented yet. It would not permit seamless integration of all users and machines in a network, and it would not be possible to automatically source the environment for all users.
This is usually your local browser which is blocking things:
- Not allowed to run unsafe/unverified scripts (look for the tell-tale icon in the browser toolbar)
- Not allowed to display mixed content (if trying to display https, look for the tell-tale icon in the browser toolbar)
In the first case, you can simply permit scripts to run by clicking on the correct icon and choosing the correct option.
In the second case, there are two solutions. The best is to restart/modify mpld3 with the correct options to switch to https for the javascript part as well. The other option is to allow mixed content in your browser for ipython notebooks.
http://stackoverflow.com/questions/21089935/unable-plot-with-vincent-in-ipython https://mpld3.github.io/modules/API.html https://mpld3.github.io/modules/API.html#mpld3.enable_notebook
On Centos7, for some reason the vnc installation/start does not by default recognise the gnome installation.
To fix this, edit your .vnc/xstartup file to contain:
#!/bin/sh
[ -r /etc/sysconfig/i18n ] && . /etc/sysconfig/i18n
export LANG
export SYSFONT
vncconfig -iconic &
unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS
OS=`uname -s`
if [ $OS = 'Linux' ]; then
case "$WINDOWMANAGER" in
*gnome*)
if [ -e /etc/SuSE-release ]; then
PATH=$PATH:/opt/gnome/bin
export PATH
fi
;;
esac
fi
if [ -x /etc/X11/xinit/xinitrc ]; then
exec /etc/X11/xinit/xinitrc
fi
if [ -f /etc/X11/xinit/xinitrc ]; then
exec sh /etc/X11/xinit/xinitrc
fi
[ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
xsetroot -solid grey
xterm -geometry 80x24+10+10 -ls -title "$VNCDESKTOP Desktop" &
twm &
This is caused by an environment variable being inherited from one user to the next. Simple fix: unset XDG_RUNTIME_DIR .
In some cases we have seen that users have a file ~/.local/share/jupyter/kernels/python2/kernel.json where the wrong python executable is given.
Easy fix, change the name of the python executable to simply 'python' in this file.
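To see which executable a kernel spec points at before editing it, the first entry of the "argv" list can be printed. This is a rough sketch using tr and sed rather than a real JSON parser, so it assumes a conventionally formatted kernel.json; `kernel_python` is a hypothetical helper name.

```shell
# Print the first "argv" entry (the python executable) of a Jupyter
# kernel.json file. Defaults to the python2 kernel path from the text.
kernel_python() {
    local kernel_json="${1:-$HOME/.local/share/jupyter/kernels/python2/kernel.json}"
    # Join the file into one line, then capture the first quoted string
    # that follows "argv": [
    tr -d '\n' < "$kernel_json" |
        sed -n 's/.*"argv"[[:space:]]*:[[:space:]]*\[[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (not executed here):
#   kernel_python
```

If the printed path is not the python you expect, edit the file as described above so that the entry is simply 'python'.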