dispy : Python Framework for Distributed and Parallel Computing

dispy is a Python framework for parallel execution of computations by distributing them across multiple processors in a single machine (SMP), or among many machines in a cluster, grid or cloud. dispy is well suited for the data parallel (SIMD) paradigm, where a computation is evaluated with different (large) datasets independently, with no communication among computation tasks (except for computation tasks sending intermediate results to the client). If communication/cooperation among tasks is needed, the asyncoro framework could be used instead.

dispy works with Python versions 2.7+ and 3.1+ and has been tested on Linux, OS X and Windows; it may work on other platforms too.

Download

dispy can be downloaded from Sourceforge Files; the latest version is released under the MIT license, a common license used for Python modules. Files in the 'dispy-*' package (e.g., dispy-3.5.tar.gz, dispy-3.5.zip) are for use with Python 2.7+, and files in the 'dispy3-*' package (e.g., dispy3-3.5.tar.gz, dispy3-3.5.zip) are for use with Python 3.1+.

dispy is also available through PyPI (Python Package Index), so it can be easily installed with pip as pip install dispy.

Quick Guide

Below is a quick guide on how to use dispy. More details are available in the dispy documentation.

dispy consists of 4 components:

  1. dispy (client) provides two ways of creating "clusters": JobCluster, when only one instance of dispy may run, and SharedJobCluster, when multiple instances may run (in separate processes). If JobCluster is used, the scheduler contained within dispy.py distributes jobs to the server nodes; if SharedJobCluster is used, a separate scheduler (dispyscheduler) must be running (see the sketch after this list).
  2. dispynode executes jobs on behalf of dispy. dispynode must be running on each of the (server) nodes that form the cluster.
  3. dispyscheduler is needed only when SharedJobCluster is used; this provides a scheduler that can be shared by multiple dispy clients simultaneously.
  4. dispynetrelay is needed when nodes are located across different networks. If all nodes are on a local network, or if all remote nodes can be listed in the 'nodes' parameter when creating a cluster, there is no need for dispynetrelay - the scheduler can discover such nodes automatically. However, if there are many nodes on remote network(s), dispynetrelay can be used to relay information about the nodes on those networks to the scheduler, without having to list all nodes in the 'nodes' parameter.
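
For example, a minimal sketch of using SharedJobCluster; 'scheduler_host' is a placeholder name used here for illustration, to be replaced with the machine actually running dispyscheduler.py:

def compute(n):
    return n * 2

if __name__ == '__main__':
    import dispy
    # dispyscheduler.py must already be running on 'scheduler_host';
    # multiple clients can share that scheduler simultaneously
    cluster = dispy.SharedJobCluster(compute, scheduler_node='scheduler_host')
    job = cluster.submit(10)
    print(job())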

As a tutorial, consider the program below, in which the function 'compute' is distributed to nodes on a local network for parallel execution. First, run the dispynode program ('dispynode.py') on each of the nodes on the network, as shown next; then run the client program that follows it.
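
A typical invocation on each node is sketched below; the '-d' option enables debug messages and is optional:

dispynode.py -d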

def compute(n):
    import time, socket
    time.sleep(n)
    host = socket.gethostname()
    return (host, n)

if __name__ == '__main__':
    import dispy, random
    cluster = dispy.JobCluster(compute)
    jobs = []
    for n in range(20):
        job = cluster.submit(random.randint(5,20))
        job.id = n
        jobs.append(job)
    # cluster.wait()  # optional: waits for all submitted jobs to finish
    for job in jobs:
        host, n = job()
        print('%s executed job %s at %s with %s' % (host, job.id, job.start_time, n))
        # other fields of 'job' that may be useful:
        # print job.stdout, job.stderr, job.exception, job.ip_addr, job.start_time, job.end_time
    cluster.stats()
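
The retrieval loop above assumes every job succeeds; as a sketch, a more defensive loop checks each job's status (dispy.DispyJob.Finished indicates successful completion) before unpacking the result:

for job in jobs:
    result = job()  # waits for the job to finish and returns job.result
    if job.status == dispy.DispyJob.Finished:
        host, n = result
        print('%s executed job %s with %s' % (host, job.id, n))
    else:
        # job.result is None on failure; job.exception holds the traceback
        print(job.exception)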

Now run the tutorial program, which creates a cluster with the function 'compute'; this cluster is then used to create 20 jobs, each executing 'compute' with a random number. dispy schedules these jobs on the processors of the nodes running dispynode. The nodes execute each job with the job's arguments in isolation - computations shouldn't depend on global state, such as modules imported outside the computation or global variables. In this case, 'compute' needs the modules 'time' and 'socket', so it imports them itself. The program then retrieves the result of each job with 'job()'. If necessary, a persistence mechanism, such as a file or database, could be used to store/retrieve global state.
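
As a sketch of one way to supply such state, files can be shipped to each node with the 'depends' parameter and read inside the computation (dispy runs each computation in a working directory containing its transferred dependencies); the file name 'data.txt' here is hypothetical:

def compute(n):
    # 'data.txt' was transferred to this node via 'depends' below, so it
    # can be read here instead of relying on client-side global state
    with open('data.txt') as fd:
        data = fd.read()
    return (n, len(data))

if __name__ == '__main__':
    import dispy
    cluster = dispy.JobCluster(compute, depends=['data.txt'])
    job = cluster.submit(1)
    print(job())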

See dispy and examples for more details and examples.

dispy can also be used as a command line tool; in this case the computations can only be programs and the dependencies can only be files. For example:

dispy.py -f /some/file1 -f file2 -a "arg11 arg12" -a "arg21 arg22" -a "arg3" /some/program
This command distributes '/some/program', along with the dependencies '/some/file1' and 'file2', to the nodes, and then executes '/some/program' in parallel with arg11 and arg12 (two arguments to the program), arg21 and arg22 (two arguments), and arg3 (one argument).