Glossary in Distributed TensorFlow

In the figure, we take distributed deep learning as an example to explain the terms Client, Cluster, Job, Task, TensorFlow server, Master service, and Worker service in TensorFlow (TF).


In the figure, model parallelism within each model replica and data parallelism among replicas are adopted for distributed deep learning. An example of mapping physical nodes to the TensorFlow glossary is illustrated.

  • The whole system is mapped to a TF cluster.
  • Parameter servers are mapped to a job.
  • Each model replica is mapped to a job.
  • Each physical computing node is mapped to a task within its job.
  • Each task has a TF server, which uses the “Master service” to communicate and coordinate work, and the “Worker service” to compute designated operations in the TF graph on local devices.
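The mapping above can be written down as a cluster definition. Below is a minimal sketch with assumed host names: the dictionary is exactly the structure that `tf.train.ClusterSpec` accepts, and the actual TF calls are left in comments so the sketch stands alone without a TF installation.

```python
# Hypothetical cluster matching the figure: a parameter-server job
# and a worker job (host names are assumptions for illustration).
cluster_def = {
    "ps":     ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# With TensorFlow installed, each physical node would build its server as:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Each physical computing node is one task, identified by (job, index):
for job in sorted(cluster_def):
    for idx, addr in enumerate(cluster_def[job]):
        print("/job:%s/task:%d runs at %s" % (job, idx, addr))
```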

Official Explanation of Glossary in TensorFlow:

Client
A client is typically a program that builds a TensorFlow graph and constructs a `tensorflow::Session` to interact with a cluster. Clients are typically written in Python or C++. A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.
Cluster
A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”. A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel. A cluster is defined by a `tf.train.ClusterSpec` object.
Job
A job comprises a list of “tasks”, which typically serve a common purpose. For example, a job named `ps` (for “parameter server”) typically hosts nodes that store and update variables, while a job named `worker` typically hosts stateless nodes that perform compute-intensive tasks. The tasks in a job typically run on different machines. The set of job roles is flexible: for example, a `worker` may maintain some state.
Master service
An RPC service that provides remote access to a set of distributed devices and acts as a session target. The master service implements the `tensorflow::Session` interface and is responsible for coordinating work across one or more “worker services”. All TensorFlow servers implement the master service.
Task
A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.
TensorFlow server
A process running a `tf.train.Server` instance, which is a member of a cluster and exports a “master service” and a “worker service”.
Worker service
An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements `worker_service.proto`. All TensorFlow servers implement the worker service.
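To make the task-indexing concrete, the sketch below (host addresses are assumptions) resolves a TF device string such as `/job:worker/task:1` to the server address of that task, which is essentially the lookup performed when work is dispatched to a task:

```python
# Assumed cluster definition; addresses are illustrative only.
cluster_def = {"ps": ["ps0:2222"],
               "worker": ["worker0:2222", "worker1:2222"]}

def resolve_task(device):
    """Map a device string like '/job:worker/task:1' to its server address."""
    # '/job:worker/task:1' -> {'job': 'worker', 'task': '1'}
    parts = dict(p.split(":", 1) for p in device.strip("/").split("/"))
    return cluster_def[parts["job"]][int(parts["task"])]

print(resolve_task("/job:worker/task:1"))  # -> worker1:2222
```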

GeForce GTX 1080 + CUDA 8.0 + Ubuntu 16.04 + Caffe/TensorFlow

Install cuda 8.0 referring here.

As the default gcc in Ubuntu 16.04 is very new, you may get a compile error similar to #error -- unsupported GNU version! gcc versions later than 5.3 are not supported!

Try commenting out the #error line in the file /usr/local/cuda/include/host_config.h:

#if __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 3)

//#error -- unsupported GNU version! gcc versions later than 5.3 are not supported!

#endif /* __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 3) */

Update nvidia driver

If you get an error similar to modprobe: ERROR: could not insert 'nvidia_361_uvm': Invalid argument, you may need to remove the old driver version installed with cuda-8.0 and install a newer driver.

Please refer here to remove old driver, then install a new one:

sudo apt-get purge nvidia-*
dkms status
sudo dkms remove bbswitch/0.8 -k 4.4.0-31-generic #do this based on your `dkms status`
sudo ./ # download the compatible driver (GTX 1080 in my case) from nvidia website and install it
sudo reboot # reboot if the driver was not updated
./cuda8-0-samples/bin/x86_64/linux/release/deviceQuery # compile the sample code and check if it works

Install cudnn V5

cuDNN v4 is NOT supported on the GTX 1080. Training may compile and run, but it will not converge.

Install prerequisites of Caffe

Because the gcc version in Ubuntu 16.04 is very new, if any prerequisite installed from apt-get does not work, uninstall it, then compile and install it from source with the default gcc (5.4). The prerequisites that may have problems include protobuf and opencv. E.g., if you have a protobuf error similar to

.build_release/lib/ undefined reference to `google::protobuf::io::CodedOutputStream::WriteVarint64ToArray(unsigned long long, unsigned char*)'

Try uninstalling the protobuf installed by apt-get; it might have been compiled by an older gcc version, so its shared libraries may not be compatible with your default gcc:

sudo apt-get purge libprotobuf-dev protobuf-compiler

then compile protobuf-2.5.0 from source and install it. Please configure with the default gcc (5.4 in my case) when you compile protobuf:

./configure --prefix=/your/path/ CC=/usr/bin/gcc
make
make check
make install

 Compile and test Caffe here.

Please also refer here.

If you use anaconda and get an error like awk: symbol lookup error: $HOME/anaconda2/lib/ undefined symbol: PC

try removing readline from the anaconda lib so that the default system one is used:

conda remove --force readline

Compile and test TensorFlow here.

If there is an error like ERROR: /mnt/tmp/tensorflow/tensorflow/core/BUILD:87:1: //tensorflow/core:protos_all_py: no such attribute 'imports' in 'py_library', a newer version (e.g. 0.2.3) of bazel solves this problem.

git clone
cd bazel/
git tag -l
git checkout tags/0.2.3 # or 0.3.0 etc.
./compile.sh # bootstrap bazel from source

“This will create a bazel binary in bazel-bin/src/bazel. This binary is self-contained, so it can be copied to a directory on the PATH (e.g., /usr/local/bin) or used in-place. ”

Use which bazel to make sure your bazel is updated.

Then build and install:

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
# test
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

# build pip package with gpu support
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
sudo pip install ~/tmp/tensorflow_pkg/tensorflow-0.9.0-py2-none-any.whl

mkdir _python_build
cd _python_build
ln -s ../bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/* .
ln -s ../tensorflow/tools/pip_package/* .
python setup.py develop

cd tensorflow/models/image/mnist

Known issues

  • TensorFlow compiling @ RHEL
ERROR: /home/wwen/github/tensorflow/tensorflow/core/kernels/BUILD:1529:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:depth_space_ops_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/':

Solution: add the cuda include path to third_party/gpus/crosstool/CROSSTOOL by adding a line similar to cxx_builtin_include_directory: "/usr/local/cuda-7.5/include"

  • bazel compiling @ RHEL

If JAVA_HOME is not found when installing bazel from source code, install the Java JDK and make sure the default java is the one you want:

sudo yum install java-1.8.0-openjdk-devel
/usr/sbin/alternatives --config java

Add JAVA_HOME variables in ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin
  • TensorFlow WORK_DIR issue
bazel-bin/inception/download_and_preprocess_imagenet "/fdata/imagenet-data"

If you get the error bazel-bin/inception/download_and_preprocess_imagenet: line 66: bazel-bin/inception/download_and_preprocess_imagenet.runfiles/inception/data/ No such file or directory, change WORK_DIR in ./inception/data/ to WORK_DIR="$0.runfiles/__main__/inception"




Image Segmentation by OpenCV

Watershed, Graphcut, Gabor wavelet, and adaptive threshold with contour methods were explored for vessel segmentation; the best method is adaptive threshold with contours.


Original Retina Image

Python source code:

__author__ = 'pittnuts'
import cv2
from numpy import *
from matplotlib import pyplot as plt
import numpy as np

def build_filters():
    filters = []
    ksize = 31
    for theta in np.arange(0, np.pi, np.pi / 16):
        kern = cv2.getGaborKernel((ksize, ksize), 4.0, theta, 10.0, 0.5, 0, ktype=cv2.CV_32F)
        kern /= 1.5*kern.sum()
        filters.append(kern)  # the append was missing; no filter was kept without it
    return filters

def gabor_segment(img, filters):
    accum = np.zeros_like(img)
    for kern in filters:
        fimg = cv2.filter2D(img, cv2.CV_8UC1, kern)
        #cv2.imshow('filtered retina {}'.format(kern),fimg[::2,::2])
        np.maximum(accum, fimg, accum)
    return accum

if __name__ == "__main__":
    #imagefile = "../data/RetinaFD-L12-1024.jpg"
    imagefile = "../data/RetinaFD-R6-1024.jpg"
    img = cv2.imread(imagefile)
    orig_dimen = img.shape
    cv2.imshow('original retina',img[::2,::2])
    img = img[:,:,1]
    cv2.imshow('green retina',img[::2,::2])

    hist = cv2.calcHist([img],[0],None,[256],[0,256])

    img_segmented = gabor_segment(img,build_filters())
    cv2.imshow('gabor segmentation',img_segmented[::2,::2])

    # findContours expects a binary image; threshold first (Otsu here is an assumed choice)
    ret, img_binary = cv2.threshold(img_segmented, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imshow('threshold segmentation', img_binary[::2,::2])
    im2, contours, hierarchy = cv2.findContours(img_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    # remove small contours (the area threshold is an assumed value)
    for idx in range(len(contours)-1, -1, -1):
        if cv2.contourArea(contours[idx]) < 100.0:
            del contours[idx]
    cv2.waitKey(0)


larger contour area


Build and Install OpenCV 3.0 to Anaconda Python in Ubuntu

Good references:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install build-essential cmake git pkg-config
$ sudo apt-get install libjpeg8-dev libtiff4-dev libjasper-dev libpng12-dev
$ sudo apt-get install libgtk2.0-dev
$ sudo apt-get install libavcodec-dev libavformat-dev libswscale-dev libv4l-dev
$ sudo apt-get install libatlas-base-dev gfortran
$ sudo apt-get install python2.7-dev

# clone source
$ cd ~
$ git clone
$ cd opencv
$ git checkout 3.0.0

# SIFT SURF are in opencv_contrib.git
$ cd ~
$ git clone
$ cd opencv_contrib
$ git checkout 3.0.0

$ cd ~/opencv
$ mkdir build
$ cd build
$ cmake -D CMAKE_BUILD_TYPE=RELEASE \
 -DOPENCV_EXTRA_MODULES_PATH=~/github/opencv_contrib/modules \
 -DCMAKE_INSTALL_PREFIX=$(python -c "import sys; print(sys.prefix)") \
 -DPYTHON_EXECUTABLE=$(which python) \
 -DPYTHON_INCLUDE_DIR=$(python -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
 -DPYTHON_PACKAGES_PATH=$(python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
 ..

# -DCMAKE_INSTALL_PREFIX=/usr/local \ #specify any location you want to install

$ make -j4
$ sudo make install
$ sudo ldconfig

# check if .so are installed in your specified location
# ex. should be installed into
# /home/wew57/anaconda/lib/python2.7/site-packages/

$ python
>>> import cv2
>>> cv2.__version__
'3.0.0' # check that this is the version you installed

# try a demo





optical flow


set(0, 'DefaultAxesFontSize', 20); % set default figure font size
set(0, 'DefaultAxesFontName', 'Times New Roman'); % set default figure font

% draw arrows
drawArrow = @(x,y) quiver( x(1),x(2),y(1)-x(1),y(2)-x(2),0,'-r','LineWidth',2,'MaxHeadSize',0.5);
drawArrow(cur_point,next_point); hold on

High Performance Parallel & Distributed Computing Conferences (deadline)

Google Scholar

Supercomputing (SC, h5-index:49): April 3, 2015

IEEE International Parallel & Distributed Processing Symposium (IPDPS, h5-index: 43): October 9, 2015, October 16, 2015 (In 2015, 2 related papers – kNN and SVM. Likes parallel algorithms in specific domains such as machine learning and network science)

International Conference on Distributed Computing Systems (ICDCS, h5-index: 38): December 12, 2014 (Not match)

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP, h5-index: 34): Sep. 2015

International Conference on Supercomputing (ICS, h5-index: 26): January 9, 2015  (Tiny SESSION: Applications)

ACM International Symposium on High Performance Distributed Computing (HPDC, h5-index:32): January 12, 2015

ACM Symposium on Parallelism in Algorithms and Architectures (SPAA, h5-index:26): January 14

European Conference on Parallel Processing (Euro-Par, h5-index:27): 30 January 2015 (Application orientated)

Linux Shell

# get random lines of a file
shuf -n 4 filename

# sort by number
sort -V $list

# download with retry
wget --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 -t 0  -c

# check file sizes
du -h --max-depth=1 | sort -h

# search for the string 'Net' in the folder ./include
# -r recursively search, -n display line number
grep -nr 'Net' ./include/

# sum numbers in each line
awk '{s+=$1} END {print s}' yourfileORpipeline

# arch 
lscpu # check cpu core count
cat /proc/cpuinfo # print info for each cpu
cat /proc/meminfo # check memory info

# add user
export username=
sudo useradd -m -d /home/$username -s /bin/bash -U $username
sudo passwd $username # initialize password
sudo chage -d 0 $username # force a password change at first login

usermod -a -G $groupname $username #add user to group

apt-get remove packagename #remove the binaries, but not the configuration or data files of the package
apt-get purge packagename # = apt-get --purge remove packagename. remove everything regarding the package but not the dependencies installed with it.
apt-get autoremove # removes orphaned packages, i.e. installed packages that used to be installed as a dependency, but aren't any longer.

# printf+awk to align
awk 'NR%2==1 {printf "%-30s %s\n",$5, $7}' yourfile # print the 5-th and 7-th columns in the odd lines, and align left with width of 30

# check the topology of PCI
lspci -t -v

## a script example to launch executable across servers
## First follow here to add ssh keys in all servers:
##    -
################ Script begins #################
# script to launch a server/client test with ssh
# must be launched from client
# example: runme /home/perftest/rdma_lat -s 10
if [ $# -lt 2 ] ; then
 echo "Usage: runme <server> <test> <test options>"
 exit 3
fi
server=$1
shift # remove <server> from the arguments
echo $*
ssh $server $* & # log into the server and launch the executable
# give the server time to start
sleep 2
$* $server # launch the executable locally with the remote server's ip
status=$?
exit $status
################# script ends ###################

# Switch your current process to the background
# 1. ctrl+z to stop (pause) the program and get back to the shell
bg # to run it in the background
disown -h # so that the process isn't killed when the terminal closes

svn and git notes

Create a branch:

export url=$(svn info | grep URL | awk '{print $2}') #get url
echo $url
export newurl=$url/../branches/path
svn mkdir --parents $newurl -m 'mkdir of branch'
svn copy $url $newurl -m 'create branch' # add "-r466" option to branch from a specific revision
cd your/local/branch/path
svn co $newurl

# remove branch
svn delete -m "Removing obsolete branch of calc project."

svn log # list all commit records
svn diff -r 455:462 # diff two versions




Git flow:

Git Branching – Basic Branching and Merging
Merging vs. Rebasing


# Merging an upstream repository into your fork
git checkout master # Check out the branch you wish to merge to
git pull BRANCH_NAME # This will pull remote files to local and may conflict with your modifications [See how to solve conflicts:]
git merge other-branch # Instead of remote files, this merges the other branch in the same repo into the current branch

# Use tool to merge conflict
sudo apt-get install kdiff3
git mergetool --tool=kdiff3

git status; git commit -m 'merge' # Check and commit the merge
git remote -v # check what is origin URL
git push origin master # Push the merge to your GitHub repository.

# diff working copy and a branch
git diff --name-only master . > diff_files.txt
files=$(cat diff_files.txt)
for file in $files; do
 git diff master:$file $file >> master_scnn.diff
done

git reset HEAD <file> # unstage a file

# track other's (e.g. wenwei202) fork
git remote add wenwei202
git fetch wenwei202 
git checkout --track wenwei202/sfm
git branch sfm wenwei202/sfm # or create the tracking branch without switching to it
git checkout -b sfm wenwei202/sfm
git push -u origin sfm 
git pull wenwei202 sfm # merge other's commits

Git diff: