Activity logs in Workbench

How to find and interpret user activity logs for data collections

Prior reading: Data collections

Purpose: This document explains what activity logs are and how you can use them to gain insight about your data collections.



What are activity logs and why are they useful?

Verily Workbench activity logs are time-stamped logs that can associate a user ID, org ID, operation taken (e.g., create, clone, delete, update access), and Workbench object (e.g., workspace, data collection, storage resource, VM, Dataproc cluster).

These logs are useful for data collection owners to measure the impact of their data collections. With these logs, owners can write SQL queries to pull results that answer queries such as the following:

  • How many users have cloned this data collection?
  • What orgs are these users from?
  • How many workspaces reference this data collection?
  • What resources in the data collection have been most frequently cloned?

These logs are also useful to organization admins who want to better understand how active the users who belong to their organization are in Workbench. With these logs, admins can write SQL queries to pull results that answer queries such as the following:

  • How many users are in this org?
  • How many workspaces have users in this org created?
  • How many cloud apps (i.e., notebooks) have users in this org created?

How can I access the activity logs?

Activity logs are accessible at the data collection and organization level.

For data collections, data collection owners (not writers or readers) will see a link to these logs, stored in BigQuery, from the individual data collection page:

Click the Activity logs link to open these logs in the BigQuery UI. The logs are stored in a single BigQuery table:

`workbench-bq-log-sink.workbench_monitoring_data_collection_logs_prod.data_collection_activity_clone_logs`

Using BigQuery, data collection owners can preview and directly query the table via SQL, save query results or open them in Google Sheets, and more. The table provides information about all the data collections for which you are an owner.

In development: For organizations, org admins will have access to similar activity logs. These will be accessible through the org admin-dedicated user interface once launched, but until then need to be directly accessed in BigQuery. Please reach out to workbench-support@verily.com for more details.

Schema of the activity logs table

The information below shows the schema of the data_collection_activity_clone_logs table.

Field name Type Definition
data_collection_id STRING Universally unique identifier (UUID) for data collection
data_collection_user_facing_id STRING Human readable, user-facing ID for data collection
data_collection_org_id STRING UUID for org data collection belongs to
data_collection_org_user_facing_id STRING User-facing ID for org data collection belongs to
data_collection_pod_id STRING UUID for pod data collection belongs to
data_collection_pod_user_facing_id STRING User-facing ID for pod data collection belongs to
workspace_id STRING UUID of destination workspace that data collection was cloned to
org_id STRING UUID for org of destination workspace
org_user_facing_id STRING User-facing ID for org of destination workspace
pod_id STRING UUID of pod of destination workspace
change_date TIMESTAMP Timestamp of clone event
user_id STRING UUID of user who cloned data collection
change_subject_id STRING UUID of specific resource in data collection cloned
change_subject_type STRING Type of resource cloned - referenced/controlled GCS bucket / BigQuery / AWS S3 folder
change_type STRING Type of action taken on subject. This is always CLONE for data collection logs

Querying the activity logs table

BigQuery UI

You can query the activity logs table directly from the BigQuery panel in the Google Cloud console, reached by clicking the Activity logs link from a data collection that you own.

Running queries from a Workbench notebook app

You can also query the table from within a Workbench notebook app, using the bigquery library, e.g., by running something like the following:

import pandas as pd
from google.cloud import bigquery

BQ_dataset = 'workbench-bq-log-sink.workbench_monitoring_data_collection_logs_prod'

job_query_config = bigquery.QueryJobConfig(default_dataset=BQ_dataset)
client = bigquery.Client(default_query_job_config=job_query_config)

Next, define a query on the table:

query = """
SELECT DISTINCT(workspace_id), org_user_facing_id, change_date
FROM `data_collection_activity_clone_logs`
WHERE data_collection_user_facing_id='1000-genomes-data-collection'
ORDER BY change_date DESC
"""

Then, run the query and view a dataframe of the results:

df = client.query(query).result().to_dataframe()
df

Querying on a specific data collection

If you're an owner of multiple data collections, the logs are aggregated in one table. So if you are interested in the usage of a particular data collection, you can first get the user-facing ID of that data collection, and then refine your query to retrieve data for just that data collection. You can find the user-facing ID in the path of the data collection URL like this: https://workbench.verily.com/data-collections/{your-data-collection-user-facing-id}

Example query

"Who is importing my data collection?"

SELECT DISTINCT(workspace_id), org_user_facing_id, change_date
FROM `workbench-bq-log-sink.workbench_monitoring_data_collection_logs_prod.data_collection_activity_clone_logs`
WHERE data_collection_user_facing_id='<your-data-collection-user-facing-id>'
ORDER BY change_date DESC

For example, if your user-facing data collection ID was 1000-genomes-data-collection, the query would look like this:

SELECT DISTINCT(workspace_id), org_user_facing_id, change_date
FROM `workbench-bq-log-sink.workbench_monitoring_data_collection_logs_prod.data_collection_activity_clone_logs`
WHERE data_collection_user_facing_id='1000-genomes-data-collection'
ORDER BY change_date DESC

FAQ

Q: The change_type field is CLONE. Why is this? Are my data collection resources being duplicated?

A: No, data collection cloud resources are not duplicated. They are created as referenced resources in the workspace. The workspace will inherit the policy of the data collection.

Last Modified: 10 December 2024