July 22, 2024


Less than 70 years separate us from one of the greatest discoveries of all time: the double helix structure of DNA. We now know that DNA is a sort of twisted ladder composed of four types of compounds, called bases. These four bases are usually identified by an uppercase letter: adenine (A), guanine (G), cytosine (C), and thymine (T). One of the reasons for the double helix structure is that when these compounds are at the two sides of the ladder, A always bonds with T, and C always bonds with G.

If we unroll the ladder on a table, we’d see two sequences of “letters”, and each of the two sides would carry the same genetic information. For example, here are two series (AGCT and TCGA) bound together:

A – T
G – C
C – G
T – A
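
The pairing rule above can be expressed in a few lines of code. The following is a minimal Python sketch (the name `complement` is only illustrative) that derives the opposite side of the ladder from one strand, position by position:

```python
# Base pairing as described in the text: A bonds with T, and C bonds with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the bases that pair with each base of `strand`, in the same order."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("AGCT"))  # TCGA
```

Applying the function twice returns the original strand, which is another way of saying that both sides carry the same information.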

These series of letters can be very long. For example, the human genome is composed of over three billion letters of code and acts as the biological blueprint of every cell in a person. The information in a person’s genome can be used to create highly personalized treatments to improve the health of individuals and even the entire population. Similarly, genomic data can be used to track infectious diseases, improve diagnosis, and even track epidemics, food pathogens, and toxins. This is the emerging field of environmental genomics.

Accessing genomic data requires genome sequencing, which, with recent advances in technology, can be done for large groups of individuals quickly and more cost-effectively than ever before. In the next five years, genomics datasets are estimated to grow and contain more than a billion sequenced genomes.

How Genomics Data Analysis Works
Genomics data analysis uses a variety of tools that need to be orchestrated as a specific sequence of steps, or a workflow. To facilitate developing, sharing, and running workflows, the genomics and bioinformatics communities have developed specialized workflow definition languages like WDL, Nextflow, CWL, and Snakemake.

However, this process generates petabytes of raw genomic data, and experts in genomics and life science struggle to scale compute and storage resources to handle data at such massive scale.

To process data and provide answers quickly, cloud resources like compute, storage, and networking need to be configured to work together with analysis tools. As a result, scientists and researchers often have to spend valuable time deploying infrastructure and modifying open-source genomics analysis tools instead of making contributions to genomics innovations.

Introducing Amazon Genomics CLI
A couple of months ago, we shared the preview of Amazon Genomics CLI, a tool that makes it easier to process genomics data at petabyte scale on AWS. I’m excited to share that the Amazon Genomics CLI is now an open source project and is generally available today. You can use it with publicly available workflows as a starting point and develop your analysis on top of these.

Amazon Genomics CLI simplifies and automates the deployment of cloud infrastructure, providing you with an easy-to-use command line interface to quickly set up and run genomics workflows on AWS. By removing the heavy lifting from setting up and running genomics workflows in the cloud, software developers and researchers can automatically provision, configure, and scale cloud resources to enable faster and more cost-effective population-level genetics studies, drug discovery cycles, and more.

Amazon Genomics CLI lets you run your workflows on an optimized cloud infrastructure. More specifically, the CLI:

  • Includes improvements to genomics workflow engines to make them integrate better with AWS, removing the burden of manually modifying open-source tools and tuning them to run efficiently at scale. These tools work seamlessly across Amazon Elastic Container Service (Amazon ECS), Amazon DynamoDB, Amazon Elastic File System (Amazon EFS), and Amazon Simple Storage Service (Amazon S3), helping you to scale compute and storage and at the same time optimize your costs using features like EC2 Spot Instances.
  • Eliminates the most time-consuming tasks like provisioning storage and compute capacities, deploying the genomics workflow engines, and tuning the clusters used to execute workflows.
  • Automatically increases or decreases cloud resources based on your workloads, which eliminates the risk of buying too much or too little capacity.
  • Tags resources so that you can use tools like AWS Cost & Usage Report to understand the costs related to your genomics data analysis across multiple AWS services.

The use of Amazon Genomics CLI is based on these three main concepts:

Workflow – These are bioinformatics workflows written in languages like WDL or Nextflow. They can be either single script files or packages of multiple files. These workflow script files are workflow definitions; combined with additional metadata, like the workflow language the definition is written in, they form a workflow specification that is used by the CLI to execute workflows on appropriate compute resources.

Context – A context encapsulates and automates time-consuming tasks to configure and deploy workflow engines, create data access policies, and tune compute clusters (managed using AWS Batch) for operation at scale.

Project – A project links together workflows, datasets, and the contexts used to process them. From a user perspective, it handles resources related to the same problem or used by the same team.

Let’s see how this works in practice.

Using Amazon Genomics CLI
I follow the instructions to install Amazon Genomics CLI on my laptop. Now, I can use the agc command to manage genomic workloads. I see the available options with:

$ agc --help

The first time I use it, I activate my AWS account:

$ agc account activate

This creates the core infrastructure that Amazon Genomics CLI needs to operate, which includes an S3 bucket, a virtual private cloud (VPC), and a DynamoDB table. The S3 bucket is used for durable metadata, and the VPC is used to isolate compute resources.

Optionally, I can bring my own VPC. I can also use one of my named profiles for the AWS Command Line Interface (CLI). In this way, I can customize the AWS Region and the AWS account used by the Amazon Genomics CLI.

I configure my email address in the local settings. This will be used to tag resources created by the CLI:

$ agc configure email me@example.net

There are a few demo projects in the examples folder included by the Amazon Genomics CLI installation. These projects use different engines, such as Cromwell or Nextflow. In the demo-wdl-project folder, the agc-project.yaml file describes the workflows, the data, and the contexts for the Demo project:

name: Demo
schemaVersion: 1
workflows:
  hello:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/hello
  read:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/read
  haplotype:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/haplotype
  words-with-vowels:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/words
data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
contexts:
  myContext:
    engines:
      - type: wdl
        engine: cromwell

  spotCtx:
    requestSpotInstances: true
    engines:
      - type: wdl
        engine: cromwell

For this project, there are four workflows (hello, read, words-with-vowels, and haplotype). The project has read-only access to two S3 buckets and can run workflows using two contexts. Both contexts use the Cromwell engine. One context (spotCtx) uses Amazon EC2 Spot Instances to optimize costs.
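
To make the shape of the project concrete, here is a rough Python sketch that mirrors the description above as a plain dictionary and reports which contexts request Spot Instances. The dictionary layout, the helper name `spot_contexts`, and the first context name (myContext) are assumptions made for illustration; the CLI itself reads agc-project.yaml, not Python.

```python
# Illustrative in-memory view of the Demo project (layout assumed for this sketch).
project = {
    "name": "Demo",
    "workflows": ["hello", "read", "words-with-vowels", "haplotype"],
    "data": [
        {"location": "s3://gatk-test-data", "readOnly": True},
        {"location": "s3://broad-references", "readOnly": True},
    ],
    "contexts": {
        "myContext": {"engine": "cromwell"},  # hypothetical context name
        "spotCtx": {"engine": "cromwell", "requestSpotInstances": True},
    },
}


def spot_contexts(proj: dict) -> list:
    """Return the names of the contexts that ask for EC2 Spot Instances."""
    return [
        name
        for name, ctx in proj["contexts"].items()
        if ctx.get("requestSpotInstances", False)
    ]


print(len(project["workflows"]), "workflows")
print("Spot contexts:", spot_contexts(project))
```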

In the demo-wdl-project folder, I use the Amazon Genomics CLI to deploy the spotCtx context:

$ agc context deploy -c spotCtx

After a few minutes, the context is ready, and I can execute the workflows. Once started, a context incurs about $0.40 per hour of baseline costs. These costs don’t include the resources created to execute workflows. Those resources depend on your specific use case. Contexts have the option to use Spot Instances by adding the requestSpotInstances flag to their configuration.
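
As a quick back-of-the-envelope check, the baseline cost of a deployed context grows linearly with the time it stays up. The sketch below uses the approximate $0.40 per hour figure quoted above; the function name is hypothetical, and the estimate excludes whatever the workflows themselves consume.

```python
BASELINE_USD_PER_HOUR = 0.40  # approximate figure quoted in the text


def baseline_cost(hours: float) -> float:
    """Estimate the baseline cost in USD of keeping one context deployed."""
    return round(BASELINE_USD_PER_HOUR * hours, 2)


print(baseline_cost(2))   # a short working session
print(baseline_cost(24))  # a context left deployed for a full day
```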

I use the CLI to see the status of the contexts of the project:

$ agc context status


Now, let’s look at the workflows included in this project:

$ agc workflow list

2021-09-24T11:15:29+01:00 𝒊  Listing workflows.
WORKFLOWNAME words-with-vowels

The simplest workflow is hello. The content of the hello.wdl file is quite understandable if you know any programming language:

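version 1.0
workflow hello_agc {
    call hello
}
task hello {
    command { echo "Hello Amazon Genomics CLI!" }
    runtime { docker: "ubuntu:latest" }
    output { String out = read_string( stdout() ) }
}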

The hello workflow defines a single task (hello) that prints the output of a command. The task is executed on a specific container image (ubuntu:latest). The output is taken from standard output (stdout), the default file descriptor where a process can write output.

Running workflows is an asynchronous process. After I submit a workflow from the CLI, it is handled entirely in the cloud. I can run multiple workflows at a time. The underlying compute resources will automatically scale, and I will be charged only for what I use.

Using the CLI, I start the hello workflow:

$ agc workflow run hello -c spotCtx

2021-09-24T13:03:47+01:00 𝒊  Running workflow. Workflow name: 'hello', Arguments: '', Context: 'spotCtx'
fcf72b78-f725-493e-b633-7dbe67878e91

The workflow was successfully submitted, and the last line is the workflow execution ID. I can use this ID to reference a specific workflow execution. Now, I check the status of the workflow:

$ agc workflow status

2021-09-24T13:04:21+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	fcf72b78-f725-493e-b633-7dbe67878e91	true	RUNNING	2021-09-24T12:03:53Z	hello

The hello workflow is still running. After a few minutes, I check again:

$ agc workflow status

2021-09-24T13:12:23+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	fcf72b78-f725-493e-b633-7dbe67878e91	true	COMPLETE	2021-09-24T12:03:53Z	hello
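
When scripting around the CLI, the status output can be split into fields. The following Python sketch assumes the tab-separated layout shown above. The field names are illustrative labels rather than official ones, and the fourth column is kept as raw text because its meaning is not stated here.

```python
from typing import NamedTuple


class WorkflowRun(NamedTuple):
    context: str
    run_id: str
    flag: str        # fourth column; meaning not stated in the text
    state: str       # e.g. RUNNING or COMPLETE
    timestamp: str   # timestamp exactly as printed by the CLI
    workflow: str


def parse_status_line(line: str) -> WorkflowRun:
    """Parse one WORKFLOWINSTANCE line from the status output shown above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7 or fields[0] != "WORKFLOWINSTANCE":
        raise ValueError(f"unexpected status line: {line!r}")
    return WorkflowRun(*fields[1:])


sample = (
    "WORKFLOWINSTANCE\tspotCtx\tfcf72b78-f725-493e-b633-7dbe67878e91"
    "\ttrue\tCOMPLETE\t2021-09-24T12:03:53Z\thello"
)
run = parse_status_line(sample)
print(run.workflow, run.state)
```

A script could call this on each line of the output and wait until the state is no longer RUNNING.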

The workflow has terminated and is now complete. I look at the workflow logs:

$ agc logs workflow hello

2021-09-24T13:13:08+01:00 𝒊  Showing the logs for 'hello'
2021-09-24T13:13:12+01:00 𝒊  Showing logs for the latest run of the workflow. Run id: 'fcf72b78-f725-493e-b633-7dbe67878e91'
Fri, 24 Sep 2021 13:07:22 +0100	download: s3://agc-123412341234-eu-west-1/scripts/1a82f9a96e387d78ae3786c967f97cc0 to tmp/tmp.498XAhEOy/batch-file-temp
Fri, 24 Sep 2021 13:07:22 +0100	*** LOCALIZING INPUTS ***
Fri, 24 Sep 2021 13:07:23 +0100	download: s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/script to agc-024700040865-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/script
Fri, 24 Sep 2021 13:07:23 +0100	*** COMPLETED LOCALIZATION ***
Fri, 24 Sep 2021 13:07:23 +0100	Hello Amazon Genomics CLI!
Fri, 24 Sep 2021 13:07:23 +0100	*** DELOCALIZING OUTPUTS ***
Fri, 24 Sep 2021 13:07:24 +0100	upload: ./hello-rc.txt to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-rc.txt
Fri, 24 Sep 2021 13:07:25 +0100	upload: ./hello-stderr.log to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stderr.log
Fri, 24 Sep 2021 13:07:25 +0100	upload: ./hello-stdout.log to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stdout.log
Fri, 24 Sep 2021 13:07:25 +0100	*** COMPLETED DELOCALIZATION ***
Fri, 24 Sep 2021 13:07:25 +0100	*** EXITING WITH RETURN CODE ***
Fri, 24 Sep 2021 13:07:25 +0100	0

In the logs, I find, as expected, the Hello Amazon Genomics CLI! message printed by the workflow.

I can also look at the content of hello-stdout.log on S3 using the information in the log above:

aws s3 cp s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stdout.log -

Hello Amazon Genomics CLI!

It worked! Now, let’s look at more complex workflows. Before I change project, I destroy the context for the Demo project:

$ agc context destroy -c spotCtx

In the gatk-best-practices-project folder, I list the available workflows for the project:

$ agc workflow list

2021-09-24T11:41:14+01:00 𝒊  Listing workflows.
WORKFLOWNAME	bam-to-unmapped-bams
WORKFLOWNAME	cram-to-bam
WORKFLOWNAME	gatk4-basic-joint-genotyping
WORKFLOWNAME	gatk4-data-processing
WORKFLOWNAME	gatk4-germline-snps-indels
WORKFLOWNAME	gatk4-rnaseq-germline-snps-indels
WORKFLOWNAME	interleaved-fastq-to-paired-fastq
WORKFLOWNAME	paired-fastq-to-unmapped-bam
WORKFLOWNAME	seq-format-validation

In the agc-project.yaml file, the gatk4-data-processing workflow points to a local directory with the same name. This is the content of that directory:

$ ls gatk4-data-processing


This workflow processes high-throughput sequencing data with GATK4, a genomic analysis toolkit focused on variant discovery.

The directory contains a MANIFEST.json file. The manifest file describes which file contains the main workflow to execute (there can be more than one WDL file in the directory) and where to find input parameters and options. Here’s the content of the manifest file:

In the gatk-best-practices-project folder, I create a context to run the workflows:

$ agc context deploy -c spotCtx

Then, I start the gatk4-data-processing workflow:

$ agc workflow run gatk4-data-processing -c spotCtx

2021-09-24T12:08:22+01:00 𝒊  Running workflow. Workflow name: 'gatk4-data-processing', Arguments: '', Context: 'spotCtx'

After a couple of hours, the workflow has terminated:

$ agc workflow status

2021-09-24T14:06:40+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	630e2d53-0c28-4f35-873e-65363529c3de	true	COMPLETE	2021-09-24T11:08:28Z	gatk4-data-processing

I look at the logs:

$ agc logs workflow gatk4-data-processing

Fri, 24 Sep 2021 14:02:32 +0100	*** DELOCALIZING OUTPUTS ***
Fri, 24 Sep 2021 14:03:45 +0100	upload: ./NA12878.hg38.bam to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bam
Fri, 24 Sep 2021 14:03:46 +0100	upload: ./NA12878.hg38.bam.md5 to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bam.md5
Fri, 24 Sep 2021 14:03:47 +0100	upload: ./NA12878.hg38.bai to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bai
Fri, 24 Sep 2021 14:03:48 +0100	upload: ./GatherBamFiles-rc.txt to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-rc.txt
Fri, 24 Sep 2021 14:03:49 +0100	upload: ./GatherBamFiles-stderr.log to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-stderr.log
Fri, 24 Sep 2021 14:03:50 +0100	upload: ./GatherBamFiles-stdout.log to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-stdout.log
Fri, 24 Sep 2021 14:03:50 +0100	*** COMPLETED DELOCALIZATION ***
Fri, 24 Sep 2021 14:03:50 +0100	*** EXITING WITH RETURN CODE ***
Fri, 24 Sep 2021 14:03:50 +0100	0

Results have been written to the S3 bucket created during the account activation. The name of the bucket is in the logs, but I can also find it stored as a parameter by AWS Systems Manager. I can save it in an environment variable with the following command:

$ export AGC_BUCKET=$(aws ssm get-parameter \
  --name /agc/_common/bucket \
  --query 'Parameter.Value' \
  --output text)

Using the AWS Command Line Interface (CLI), I can now explore the results on the S3 bucket and get the outputs of the workflow.
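
Judging from the log lines above, task outputs appear to follow a predictable key layout inside the bucket. The Python sketch below rebuilds the S3 URI of a task's output folder from its parts. The layout is inferred from those logs only, and the function and parameter names are hypothetical, so treat it as a convenience for exploration rather than a documented contract.

```python
def task_output_uri(bucket, project, user_id, context, wdl_workflow, run_id, task):
    """Build the S3 URI of the folder holding one task's outputs (layout inferred)."""
    # wdl_workflow is the name declared inside the WDL file (e.g. hello_agc),
    # which can differ from the workflow name used on the command line.
    return (
        f"s3://{bucket}/project/{project}/userid/{user_id}"
        f"/context/{context}/cromwell-execution/{wdl_workflow}/{run_id}/call-{task}/"
    )


uri = task_output_uri(
    bucket="agc-123412341234-eu-west-1",
    project="Demo",
    user_id="danilop20tbvT",
    context="spotCtx",
    wdl_workflow="hello_agc",
    run_id="fcf72b78-f725-493e-b633-7dbe67878e91",
    task="hello",
)
print(uri + "hello-stdout.log")
```

The resulting URI can then be passed to aws s3 ls or aws s3 cp to browse or download the files.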

Before looking at the results, I remove the resources that I don’t need by stopping the context. This will destroy all compute resources, but retain data in S3.

$ agc context destroy -c spotCtx

More examples on configuring different contexts and running additional workflows are provided in the documentation on GitHub.

Availability and Pricing
Amazon Genomics CLI is an open source tool, and you can use it today in all AWS Regions with the exception of AWS GovCloud (US) and Regions located in China. There is no cost for using the Amazon Genomics CLI. You pay for the AWS resources created by the CLI.

With the Amazon Genomics CLI, you can focus on science instead of architecting infrastructure. This gets you up and running faster, enabling research, development, and testing workloads. For production workloads that scale to several thousand parallel workflows, we can provide recommended ways to leverage additional Amazon services, like AWS Step Functions. Just reach out to our account teams for more information.


