Deep Learning on the Supercomputer

Installing gcloud on a local machine

1. Install the gcloud SDK on your local machine (I used the Windows Subsystem for Linux, so I chose the apt-get option). Reference: https://cloud.google.com/sdk/downloads

2. Run the following command to set your user account and the default compute region.

 gcloud init

Set up a Google Cloud Storage bucket

3. In the web console (link: https://console.cloud.google.com), click the drop-down menu in the top left-hand corner → click Storage → click Browse, then create a new storage bucket if you don't already have one. Let's call it byu_tf_ml in this example.

Submit a learning job to google cloud

4. On the local machine console, call the following command:

 gcloud ml-engine jobs submit training my_job --package-path trainer --module-name trainer.tf_task --staging-bucket gs://byu_tf_ml --scale-tier BASIC

reference: https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training

This step is a bit tricky. The command “gcloud ml-engine jobs submit training” does three things: 1/ packages up our Python machine learning code, 2/ uploads the package to the cloud platform, and 3/ runs it on cloud machines. Four fields are needed in our example:

a. job: the job ID, my_job in our example. It is the name shown in the web console after the job is submitted.
b. package-path: the local directory that contains the Python source code (trainer in our example).
c. module-name: the main Python module to run, written in dotted form without the .py extension (trainer.tf_task in our example).
d. staging-bucket: the Cloud Storage bucket where the packaged training code is staged (uploaded) before the job runs.
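
The package-path and module-name values above assume a particular layout on disk. As a minimal sketch (the helper name make_trainer_skeleton and the file contents are made up for illustration, not taken from the gcloud documentation), the following Python script creates a trainer directory with an __init__.py, which marks it as a Python package, and a tf_task.py module that would be run as trainer.tf_task. The demo at the bottom writes into a temporary directory so nothing is left behind:

```python
import tempfile
from pathlib import Path


def make_trainer_skeleton(root):
    """Create the trainer/ package layout assumed by the submit command.

    Hypothetical helper for illustration; the file contents are placeholders.
    """
    package_dir = Path(root) / "trainer"
    package_dir.mkdir(parents=True, exist_ok=True)

    # An __init__.py (it may be empty) makes the directory an importable package.
    (package_dir / "__init__.py").touch()

    # The module that --module-name trainer.tf_task refers to.
    task_file = package_dir / "tf_task.py"
    if not task_file.exists():
        task_file.write_text(
            'def main():\n'
            '    print("training code goes here")\n'
            '\n'
            '\n'
            'if __name__ == "__main__":\n'
            '    main()\n'
        )
    return package_dir


if __name__ == "__main__":
    # Try the helper in a throwaway directory and list what it created.
    with tempfile.TemporaryDirectory() as tmp:
        created = make_trainer_skeleton(tmp)
        print(sorted(p.name for p in created.iterdir()))
```

To create the layout for real, call make_trainer_skeleton(".") from the directory where you plan to run the gcloud command, so that --package-path trainer resolves correctly.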

optional:

e. scale-tier: allows fine control over how much computing power the job uses.
f. packages: the paths to any additional packages that your code imports but that are not listed here: https://cloud.google.com/ml-engine/docs/concepts/runtime-version-list
g. job-dir: an argument passed into your program to tell it which google storage directory to use. It has to be in the form of gs://[bucket_name]/[job_dir].
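
Because job-dir arrives as a command-line argument, the main module has to parse it. A minimal sketch of how trainer/tf_task.py might do so with the standard argparse module is shown here; the function name parse_args and the local default "output" are assumptions for illustration:

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments the training job receives."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        default="output",  # assumed fallback for local test runs
        help="Where to write results, e.g. gs://[bucket_name]/[job_dir]",
    )
    # parse_known_args ignores extra arguments instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    # argparse exposes --job-dir as the attribute job_dir.
    print("Writing output under:", args.job_dir)
```

Using parse_known_args rather than parse_args is a design choice: the script keeps working if other arguments are passed along that it does not declare.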

hint:

h. In order to save and retrieve data on google cloud machines, specify the path of input and output as gs://[bucket_name]/[input files/output directory] in your code.
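
One way to keep these paths consistent is to build them in one place. The sketch below is only an illustration: gcs_path is a hypothetical helper, and the file and directory names (data/train.csv, my_job/output) are placeholders. Note that Python's built-in open() does not understand gs:// paths, so the resulting strings are meant to be handed to a library that does, such as TensorFlow's file I/O utilities.

```python
import posixpath


def gcs_path(bucket_name, *parts):
    """Build a gs://[bucket_name]/... path from a bucket name and path parts."""
    if not bucket_name or "/" in bucket_name:
        raise ValueError("bucket_name must be a bare bucket name, e.g. 'byu_tf_ml'")
    # Drop stray leading/trailing slashes so the pieces join cleanly.
    clean = [p.strip("/") for p in parts if p.strip("/")]
    # Cloud Storage paths always use forward slashes, so use posixpath
    # rather than os.path, which would use backslashes on Windows.
    return "gs://" + posixpath.join(bucket_name, *clean)


# Placeholder locations inside the example bucket.
INPUT_FILE = gcs_path("byu_tf_ml", "data", "train.csv")
OUTPUT_DIR = gcs_path("byu_tf_ml", "my_job", "output")

if __name__ == "__main__":
    print(INPUT_FILE)
    print(OUTPUT_DIR)
```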

Check the results

5. In the web console, click the drop-down menu in the top left-hand corner → click ML Engine → click Jobs. You should be able to see the job you submitted.

6. After training finishes, you will be able to see the results and logs under the corresponding job.

More

7. If you want to reuse the trained weights of the model, export it as a SavedModel in your application. Reference: https://cloud.google.com/ml-engine/docs/concepts/prediction-overview

8. I haven't tried TensorBoard yet, but it looks fairly straightforward to set up. Reference: https://cloud.google.com/ml-engine/docs/how-tos/monitor-training#monitoring_with_tensorboard

9. You may want to check out more examples online. Reference: https://cloud.google.com/ml-engine/docs/tutorials/
