Run GVS on Verily Workbench

Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow


Introduction

Genomic Variant Store (GVS) is a WDL-based workflow developed by the Broad Institute. This tutorial shows you how to run GVS in your own workspace.

Step-by-step instructions

1 Create a GCS bucket to hold WDL files

Workbench currently requires that the WDL files for your workflow be stored in a Cloud Storage bucket. For this example, create a new bucket.

  1. In the Resources tab, click “+ Cloud resource” -> “New Cloud Storage bucket”.
  2. Give it a name and click “Create bucket”. This example uses the name workflows_bucket.

Diagram showing dialog for adding a bucket.

2 Create a BigQuery dataset

The GVS workflow requires a BigQuery dataset that it can read from and write to. For this example, create a new BigQuery dataset.

  1. In the Resources tab, click “+ Cloud resource” -> “New BigQuery dataset”.
  2. Give it a name and click “Create dataset”. This example uses the name gvs_1.

Diagram showing dialog for adding a dataset.

3 Get the WDLs into the bucket

The WDLs used are available on GitHub in the verily-src/workbench-examples repository. To get them into the bucket from step 1, set BUCKET_NAME to that bucket's gs:// URL and run:

git clone https://github.com/verily-src/workbench-examples.git
cd workbench-examples/cromwell_setup/gvs_wdls/
gsutil cp *.wdl "$BUCKET_NAME"
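
gsutil treats a destination without the gs:// scheme as a local path, so the copy only reaches the bucket when BUCKET_NAME holds a full gs:// URL. As a minimal sketch, the hypothetical helper bucket_url below (an illustration, not part of the tutorial repository or of gsutil) accepts either a bare bucket name or a gs:// URL and prints the URL form:

```shell
# Hypothetical helper: print a bucket reference as a gs:// URL.
# Accepts "workflows_bucket", "gs://workflows_bucket", or either form
# with a trailing slash.
bucket_url() {
  name="${1%/}"   # drop one trailing slash, if present
  case "$name" in
    gs://*) printf '%s\n' "$name" ;;
    *)      printf 'gs://%s\n' "$name" ;;
  esac
}

# Intended use (commented out because it needs gsutil and a real bucket):
#   gsutil cp *.wdl "$(bucket_url "$BUCKET_NAME")/"
```

Quoting the expansion keeps the command safe if the variable is empty or contains unexpected characters.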

4 Add the Workflow

Navigate to the Workflows section and click “+Add workflow”. Add the WDL named “GvsJointVariantCalling.wdl”.

5 Create a new job

Click the “+New job” button, then continue to the “Prepare inputs” page.

6 Enter the inputs

Enter the following values:

Input Key | Value | Example
--- | --- | ---
GvsJointVariantCalling.call_set_identifier | Any string for this callset. | "my_call_set_1"
GvsJointVariantCalling.dataset_name | The dataset created in step 2. | "gvs_1"
GvsJointVariantCalling.external_sample_names | The list of sample names. | ["2013050218", "2013050219"]
GvsJointVariantCalling.input_vcf_indexes | The list of GCS locations pointing to the VCF index files of each sample. | ["gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi", "gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi"]
GvsJointVariantCalling.input_vcfs | The list of GCS locations pointing to the VCF files of each sample. | ["gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", "gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"]
GvsJointVariantCalling.project_id | The Google Cloud Project ID of the workspace. | "YOUR_PROJECT_ID"
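
For record keeping, the same values can be written out as a Cromwell-style inputs JSON file. The sketch below is only an illustration: the file name gvs_inputs.json and the shell variable names are assumptions, and this tutorial does not state whether the Workbench UI accepts such a file, so treat it as a reference copy of what you type into the form. Note that each index path is the matching VCF path with .tbi appended, in the same order as the sample names.

```shell
# Assumed placeholder values: replace with your own.
PROJECT_ID="YOUR_PROJECT_ID"
DATASET_NAME="gvs_1"
CALL_SET_ID="my_call_set_1"

# Shared parts of the example VCF locations from the inputs above.
VCF_PREFIX="gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
VCF_SUFFIX="phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Write the inputs file; the unquoted EOF lets the shell expand the variables.
cat > gvs_inputs.json <<EOF
{
  "GvsJointVariantCalling.call_set_identifier": "${CALL_SET_ID}",
  "GvsJointVariantCalling.dataset_name": "${DATASET_NAME}",
  "GvsJointVariantCalling.external_sample_names": ["2013050218", "2013050219"],
  "GvsJointVariantCalling.input_vcfs": [
    "${VCF_PREFIX}/ALL.chr18.${VCF_SUFFIX}",
    "${VCF_PREFIX}/ALL.chr19.${VCF_SUFFIX}"
  ],
  "GvsJointVariantCalling.input_vcf_indexes": [
    "${VCF_PREFIX}/ALL.chr18.${VCF_SUFFIX}.tbi",
    "${VCF_PREFIX}/ALL.chr19.${VCF_SUFFIX}.tbi"
  ],
  "GvsJointVariantCalling.project_id": "${PROJECT_ID}"
}
EOF
```

The values use plain straight quotes, which JSON requires; curly quotes copied from a rendered page would not parse.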

7 Monitor the workflow

The Workflows tab has a section where you can monitor jobs as they run and complete.

Diagram showing dialog for monitoring a workflow.

8 Get outputs

Once the workflow completes, browse the workspace bucket and navigate to the folder for the task that produced the sharded VCF outputs.
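
If you prefer the command line to browsing, one option is to list the bucket and keep only the VCF shards. This is a rough sketch under two assumptions: that the shards end in .vcf.gz (with .vcf.gz.tbi indexes), and that YOUR_WORKSPACE_BUCKET stands for your own bucket name. The helper name filter_vcf_outputs is hypothetical.

```shell
# Hypothetical helper: read object paths on stdin, one per line, and keep
# only those that look like compressed VCF shards or their .tbi indexes.
filter_vcf_outputs() {
  grep -E '\.vcf\.gz(\.tbi)?$'
}

# Intended use (commented out because it needs gsutil and a real bucket);
# the ** wildcard makes gsutil list objects at any depth:
#   gsutil ls "gs://YOUR_WORKSPACE_BUCKET/**" | filter_vcf_outputs
```

The filter prints nothing when the listing contains no matching paths, which usually means the workflow has not finished or wrote its outputs under a different extension.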

Last Modified: 26 September 2024