Rabit: Reliable Allreduce and Broadcast Interface



Rabit is a lightweight library that provides a fault-tolerant interface for Allreduce and Broadcast. It is designed to support easy implementation of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction. The goal of rabit is to support portable, scalable, and reliable distributed machine learning programs.

The newest version can also be used within DataFlow frameworks such as Flink and Spark.

Features

All these features come from facts about the small rabbit :)

  • Portable: rabit is lightweight and runs everywhere
    • Rabit is a library instead of a framework; a program only needs to link the library to run
    • Rabit only relies on a mechanism to start programs, which is provided by most frameworks
    • You can run rabit programs on many platforms, including Yarn (Hadoop) and MPI, using the same code
  • Scalable and Flexible: rabit runs fast
    • Rabit programs use Allreduce to communicate, and do not suffer the between-iteration cost of the MapReduce abstraction.
    • Programs can call rabit functions in any order, as opposed to frameworks where callbacks are offered and called by the framework, i.e. the inversion of control principle.
    • Programs persist over all the iterations, unless they fail and recover.
  • Reliable: rabit digs burrows to avoid disasters
    • Rabit programs can recover the model and results using synchronous function calls.
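
The Allreduce and recovery ideas in the list above can be sketched without any distributed runtime. The following is a minimal single-process simulation, not rabit's API: SimulatedAllreduceSum, Model, and Train are hypothetical names introduced only for illustration, and the "failure" is simply a branch that discards uncommitted work and reloads the last saved state. It is meant to show why a program that persists across iterations and saves its model after each one can redo a failed iteration and still reach the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-process stand-in for Allreduce(sum): afterwards every simulated
// worker holds the element-wise sum of all workers' buffers.
// Hypothetical helper for illustration; this is NOT a rabit function.
void SimulatedAllreduceSum(std::vector<std::vector<double>>& worker_bufs) {
  if (worker_bufs.empty()) return;
  const std::size_t n = worker_bufs[0].size();
  std::vector<double> total(n, 0.0);
  for (const auto& buf : worker_bufs) {
    assert(buf.size() == n);  // all workers must reduce buffers of equal size
    for (std::size_t i = 0; i < n; ++i) total[i] += buf[i];
  }
  for (auto& buf : worker_bufs) buf = total;
}

// Hypothetical model state that persists across iterations.
struct Model {
  int version = 0;      // number of completed iterations
  double weight = 0.0;  // accumulated result
};

// Runs num_iters iterations over num_workers simulated workers.
// If fail_at >= 0, one failure is simulated during that iteration, after the
// model was partially updated but before the checkpoint was saved.
double Train(int num_workers, int num_iters, int fail_at) {
  Model model;
  Model checkpoint;  // last consistent state, saved after each iteration
  bool failed_once = false;
  while (model.version < num_iters) {
    // Each worker contributes a local statistic: (rank + 1) * (iter + 1).
    std::vector<std::vector<double>> bufs(num_workers,
                                          std::vector<double>(1, 0.0));
    for (int r = 0; r < num_workers; ++r) {
      bufs[r][0] = static_cast<double>((r + 1) * (model.version + 1));
    }
    SimulatedAllreduceSum(bufs);
    model.weight += bufs[0][0];  // every worker now sees the same sum

    if (model.version == fail_at && !failed_once) {
      failed_once = true;
      model = checkpoint;  // recover: drop the uncommitted update
      continue;            // and redo the same iteration
    }
    model.version += 1;
    checkpoint = model;  // commit the iteration
  }
  return model.weight;
}
```

With 3 workers and 2 iterations, Train(3, 2, -1) and Train(3, 2, 1) return the same value, because the iteration that failed is replayed from the last saved model. In a real rabit program the reduction and the checkpointing would go through the library's own calls; see the project documentation for the exact functions.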

Resources