rorodata Data Science Platform User Guide
Our objective is to give data scientists control of their applications from concept to production. Deploy from day one. Build machine learning applications faster, with a small team.
What is rorodata?
rorodata is a cloud based platform to help data scientists build, deploy, and manage end-to-end Machine Learning (ML) applications in production, using a simple, repeatable, flexible, and automated process. rorodata works with a simple set of commands. It creates a repeatable process to fully automate cloud-infrastructure setup, provisioning data science environments including custom packages, and has the flexibility to specify on-demand computing resources for each task or service.
In short, it eliminates all the ad-hoc, error-prone, repetitive tasks that go into productionizing machine learning applications.
Currently, rorodata enables data scientists to do four things:
- Build models in the cloud using Jupyter notebooks
- Train ML models on ad-hoc basis as well as on a time-schedule (periodic retraining)
- Deploy ML models into production and expose prediction APIs as web-services
- Track multiple versions of ML models and associated metadata
How does rorodata work?
The rorodata platform is organized into projects. A project is a machine learning application comprising its code, all associated tasks and services, persistent data volumes, and models. The software environment for the project, the services to run, and the periodic tasks to be scheduled are specified in a simple text file named roro.yml. When a project is deployed, rorodata takes its cue from this roro.yml file and goes through the following process:
- Packages and moves all code to the platform
- Automatically provisions the necessary software environments using a docker image
- Deploys each service into production, with the specified compute power for that service
- Sets up URL endpoints for each service
- Schedules periodic tasks
Useful technical details
Project Directory Structure
The main working directory for a project is referenced as /app. It contains all the files packaged from the local code repository when you execute deploy from the roro client. These files get overwritten after every execution of deploy.
Volumes are persistent data stores and are referenced under /volumes. By default, you will see two volumes under /volumes. Unlike the files in /app, the contents of /volumes do not get overwritten between deploys: these volumes are persistent. Shortly, you will be able to mount your S3 buckets as rorodata volumes and access them directly.
By default, notebooks created in the project (i.e., using roro run:notebook) get created under /volumes.
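Since /app is overwritten on every deploy while /volumes persists, a training script should write its model artifacts to a volume. The sketch below illustrates this; the volume name "data" and the helper functions are assumptions for illustration, not part of the rorodata API:

```python
import os
import pickle

VOLUME_ROOT = "/volumes"  # persistent storage, survives redeploys
APP_ROOT = "/app"         # project code, overwritten on every deploy

def model_path(volume, filename, root=VOLUME_ROOT):
    """Build a path inside a persistent volume (volume name is illustrative)."""
    return os.path.join(root, volume, filename)

def save_model(model, path):
    """Persist a trained model so it is still there after the next deploy."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

A retraining task scheduled via roro.yml could call save_model at the end of each run, and the prediction service would load the artifact from the same volume path.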
Command Line Interface (CLI)
The rorodata CLI client offers a number of simple commands to manage and monitor activity on the rorodata platform. The most commonly used ones are roro create project, roro logs, etc. You can get a full list of these commands and their descriptions by typing roro --help, and access details for a specific command (e.g. cmd1) by typing roro cmd1 --help.
Understanding the roro.yml file
roro.yml is a simple text file that tells the rorodata platform what to do during deployment. Here is a quick explanation of the contents of this example:
project: loan-score-demo
runtime: rorodata/sandbox
services:
  - name: default
    function: train.predict
project: loan-score-demo - This tells the rorodata platform that the local code is being deployed to the project loan-score-demo (in the above example). Note that project names are unique across the entire rorodata platform. Make sure you replace this with your own project name, so that you are deploying to the correct project.
runtime: rorodata/sandbox - In this example, the docker image used to create the runtime environment for the Python code has the short name rorodata/sandbox. During deployment, this docker image is used to create a virtual environment for running code in this project. Note that any Python libraries not part of this runtime can be added by specifying them in a requirements.txt file. Please refer to the rorodata documentation for the specifications of the different runtime docker images available.
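For instance, if your code needs libraries beyond what the runtime provides, a requirements.txt at the project root might look like the fragment below (the package names and pinned versions are purely illustrative):

```
scikit-learn==0.19.1
pandas==0.22.0
```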
services: - This lists all the services exposed by this project as endpoints. In this case, there is one service that exposes the function predict. In general, there can be multiple services in a project, each with its own name and a function attached to that name.
name: default - Any name can be chosen by the user; the endpoint URL is formed as your-project-name--servicename.rorocloud.io. If the name is left as default, then the endpoint URL is formed as your-project-name.rorocloud.io. In this example, the endpoint will be https://loan-score-demo.rorocloud.io (a dummy link for illustration only).
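The naming convention above can be sketched as a small helper. This function is hypothetical, written only to illustrate the URL scheme described in this guide:

```python
def endpoint_url(project, service="default"):
    """Build the rorocloud endpoint URL for a service, per the naming
    convention above: the default service drops the "--servicename" suffix."""
    if service == "default":
        return "https://{}.rorocloud.io".format(project)
    return "https://{}--{}.rorocloud.io".format(project, service)

# endpoint_url("loan-score-demo")           -> "https://loan-score-demo.rorocloud.io"
# endpoint_url("loan-score-demo", "score")  -> "https://loan-score-demo--score.rorocloud.io"
```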
function: train.predict - This is the full path to the function being exposed; because the predict function is in the file train.py (in this example), the entry made in the roro.yml file below name is function: train.predict.
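To make this concrete, here is what a minimal train.py might look like. The function signature and scoring logic are invented for illustration; a real project would load a trained model and compute an actual prediction:

```python
# train.py (hypothetical example): roro.yml's "function: train.predict"
# would expose the predict function below as a web service.

def predict(amount, duration):
    """Toy loan-scoring function. The arithmetic here is a placeholder
    standing in for a real model's prediction."""
    score = 700 - 0.01 * amount + 2 * duration
    return {"score": round(score, 2)}
```

Once deployed, clients would call this function through the service's endpoint URL rather than importing the module directly.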