Difference between revisions of "CrocoBLAST:Job management"

From WebChem Wiki
Revision as of 06:43, 25 July 2016

CrocoBLAST is built to help you plan your BLAST jobs and run them efficiently. CrocoBLAST operates with the concept of a queue, which is a list of BLAST jobs scheduled to run. Thus, you can plan several BLAST jobs and let CrocoBLAST manage their execution for you.

All CrocoBLAST functionality is available via the command line utility and the graphical user interface. In fact, the graphical user interface does precisely what its name suggests: it provides an interface for the command line utility. In a nutshell, while you can interact with CrocoBLAST via simple commands, you may also use the interface to generate the commands or read the output of such commands.


Create BLAST job

As already mentioned, BLAST takes an input file with unknown sequences and aligns each such sequence against a database of known sequences. To create a job, you must first specify the BLAST program (http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml) you plan to use, which depends on the nature of the unknown sequences in your input file and the nature of the sequences in the reference database. Then, you need to specify the database, the input file, and the output folder. Optionally, you can append further options after --options:

<code> CrocoBLAST -add_to_queue blast_program database input_file output_folder
CrocoBLAST -add_to_queue blast_program database input_file output_folder --options option1 value1 ... </code>
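
As an illustration of how the BLAST program follows from the nature of the sequences, here is a minimal sketch. It assumes that CrocoBLAST accepts the standard NCBI program names (blastn, blastp, blastx, tblastn); the helper functions and the values my_database, queries.fasta and results are hypothetical. The script only prints the resulting command, it does not run it.

```shell
#!/bin/sh
# Pick the standard BLAST program for a query type and a database type.
# Both arguments are either "nucleotide" or "protein".
# (tblastx, which translates both sides, is also valid for
# nucleotide against nucleotide; it is not selected here.)
pick_blast_program() {
  case "$1:$2" in
    nucleotide:nucleotide) echo blastn ;;
    protein:protein)       echo blastp ;;
    nucleotide:protein)    echo blastx ;;
    protein:nucleotide)    echo tblastn ;;
    *) echo "unknown sequence types: $1, $2" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the -add_to_queue command for review.
# Arguments: blast_program database input_file output_folder
build_queue_command() {
  printf 'CrocoBLAST -add_to_queue %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example: nucleotide reads searched against a protein database.
program=$(pick_blast_program nucleotide protein)
build_queue_command "$program" my_database queries.fasta results
```

Printing the command first makes it easy to check the argument order before the job is actually added to the queue.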


Manage databases

To submit a BLAST job, you must specify which database you wish to align against. The first time you indicate a database for a BLAST job, CrocoBLAST will remember it and add it to its index, so that in the future it is easier for you to access this database. You can see which databases are already indexed in CrocoBLAST:

<code> CrocoBLAST -list_databases </code>

You can provide a simple name for each database, which you may later refer to whenever you need to run a BLAST job. There are two ways to add a new database to the CrocoBLAST index.

Retrieve database from NCBI servers

In the most typical scenario, you will use the established reference sequence databases maintained by NCBI (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/). CrocoBLAST allows you to specify the name of such a database, and will download or update the database for you:

<code> CrocoBLAST -add_database --ncbi_download ncbi_database_name output_folder
CrocoBLAST -update_ncbi_database ncbi_database_name output_folder </code>

When adding or updating a database in this manner, you need not worry about the format of the database, as NCBI provides pre-formatted database files.

Add database from your computer

If you have already downloaded the databases from NCBI, or if you do not have an internet connection, you may add database files stored on your computer to the CrocoBLAST index. Remember to provide a unique and representative name for each database you add, so that it is easy to refer to the databases later. If the database files are appropriately formatted (e.g., psq or nsq):

<code> CrocoBLAST -add_database --formated_db nsq_database_file
CrocoBLAST -add_database --formated_db psq_database_file </code>

If your database is in FASTA or FASTQ format, you will need to tell CrocoBLAST the type of sequence it will find in the database:

<code> CrocoBLAST -add_database --sequence_file nucleotide fasta_file database_name output_folder
CrocoBLAST -add_database --sequence_file protein fasta_file database_name output_folder
CrocoBLAST -add_database --sequence_file nucleotide fastq_file database_name output_folder
CrocoBLAST -add_database --sequence_file protein fastq_file database_name output_folder </code>
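
The choice between the two forms above can be sketched as a small helper that looks at the file extension. This is only an illustration: the function name and the extensions .fa and .fq are assumptions, and the helper prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print the -add_database command that matches a file.
# Arguments: file [sequence_type database_name output_folder]
# The last three are only needed for FASTA/FASTQ files.
build_add_database_command() {
  file=$1; seq_type=$2; name=$3; out=$4
  case "$file" in
    *.nsq|*.psq)
      # Already formatted: the sequence type is implied by the file.
      printf 'CrocoBLAST -add_database --formated_db %s\n' "$file"
      ;;
    *.fasta|*.fa|*.fastq|*.fq)
      # Raw sequences: CrocoBLAST must be told what they contain.
      case "$seq_type" in
        nucleotide|protein) ;;
        *) echo "sequence type must be nucleotide or protein" >&2; return 1 ;;
      esac
      printf 'CrocoBLAST -add_database --sequence_file %s %s %s %s\n' \
        "$seq_type" "$file" "$name" "$out"
      ;;
    *)
      echo "unrecognized database file: $file" >&2
      return 1
      ;;
  esac
}

build_add_database_command genomes.nsq
build_add_database_command proteins.fasta protein my_proteins db_folder
```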

Manage CrocoBLAST queue

The efficiency of CrocoBLAST lies in its ability to parallelize the execution of your BLAST jobs. This is related to breaking each big calculation into smaller pieces, and then organizing the execution of the pieces. Having smaller pieces means that you need less memory to run each job, and if you can analyze several pieces at once you can speed up the total calculation time. CrocoBLAST takes care of these things for you.

Execution

Say you have created one or more BLAST jobs and are ready to start munching some sequences. It's easy:

<code> CrocoBLAST -run </code>

This tells CrocoBLAST to take the input file, break it into little fragments, and submit each fragment for sequence alignment as soon as a core becomes free. This means that, if your computer has only one core, the alignment will start only after fragmentation of the input file is complete. However, if your computer has two cores (or one core that supports multi-threading), the alignment will start as soon as at least one fragment has been generated, which means immediately. The alignment of each fragment runs as an independent thread. The more threads you can run simultaneously, the faster your job will finish. This depends on the number and type of cores your computer has.
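
To make the idea of fragmentation concrete, the following sketch splits a FASTA file into pieces that each hold a fixed number of sequences. It is not CrocoBLAST's actual implementation, whose fragment size and file naming are not described here; the function name and the output naming scheme are assumptions made for illustration.

```shell
#!/bin/sh
# Split a FASTA file into fragments of at most N sequences each.
# Arguments: input_fasta sequences_per_fragment output_prefix
# Produces output_prefix_001.fasta, output_prefix_002.fasta, ...
split_fasta() {
  awk -v n="$2" -v prefix="$3" '
    /^>/ {
      # A header line starts a new sequence; open a new fragment
      # every n sequences.
      if (count % n == 0) {
        if (out != "") close(out)
        part++
        out = sprintf("%s_%03d.fasta", prefix, part)
      }
      count++
    }
    # Copy every line that belongs to a sequence into the current fragment.
    out != "" { print > out }
  ' "$1"
}
```

For example, `split_fasta queries.fasta 1000 fragment` would write fragment_001.fasta, fragment_002.fasta and so on, each of which could then be aligned independently.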

When you run CrocoBLAST without any additional options, you will make the most efficient use of your computational resources, as CrocoBLAST will figure out how to best parallelize the calculation on your machine. Nonetheless, if you want to limit the number of threads running simultaneously, you may do so:

<code> CrocoBLAST -run --num_threads number_of_threads </code>
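
One possible way to choose number_of_threads is to leave a single core free for other work. The sketch below assumes that `getconf _NPROCESSORS_ONLN` is available (it is on most Linux and macOS systems) and falls back to one thread otherwise. It prints the command instead of running it.

```shell
#!/bin/sh
# Number of online cores, or 1 if it cannot be determined.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave one core free, but always use at least one thread.
if [ "$cores" -gt 1 ]; then
  threads=$((cores - 1))
else
  threads=1
fi

echo "CrocoBLAST -run --num_threads $threads"
```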

Similarly, you can easily stop or pause the execution at any time. The difference between pause and stop rests with how long you are willing to wait before your computational resources become available, and how much partial output you need. To immediately kill a CrocoBLAST job and free up the memory and cores:

<code> CrocoBLAST -stop </code>

On the other hand, if you are more interested in the output:

<code> CrocoBLAST -pause </code>

This lets CrocoBLAST know that no new threads should be initiated, and the output produced by each running thread will be incorporated in the partial results as soon as the thread finishes. Therefore, you will have to wait until all running threads have completed. Depending on the type of BLAST program you are running, the size of the database, and the similarity between your input sequences and the sequences in the database, you may have to wait a considerable amount of time. However, this will ensure that you can resume the calculation at a later time. To resume, simply tell CrocoBLAST to start munching.

<code> CrocoBLAST -run </code>

It will automatically detect the current state of each job in the queue, and continue from where it left off, unless you have made changes to the queue in the meantime. While CrocoBLAST operates with the concept of a queue, it is important to note that only one job is active at any given time. You can check the current state of the CrocoBLAST queue:

<code> CrocoBLAST -status </code>

This will provide you with information regarding which jobs are queued, with full details regarding the BLAST setup, as well as a description of the progress of the alignment. The progress of each job is reported for three stages: fragmentation of the input file, alignment, and assembly of results. If you want to change anything about the queue (say, pause one job and start another, or change the order of the jobs in the queue), you need to first pause or stop the current run.
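
The sequence for changing the queue during a run can be sketched as follows. The wrapper croco and the function requeue_workflow are hypothetical; by default the commands are only printed (set DRY_RUN=0 to execute them). The sketch assumes that -pause returns only after the running threads have finished. If that is not the case on your installation, check -status before continuing.

```shell
#!/bin/sh
# Print each CrocoBLAST command instead of running it, unless DRY_RUN=0.
croco() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "CrocoBLAST $*"
  else
    CrocoBLAST "$@"
  fi
}

# Pause the current run, add one more job, inspect the queue, resume.
# Arguments: blast_program database input_file output_folder
requeue_workflow() {
  croco -pause                              # keep partial results
  croco -add_to_queue "$1" "$2" "$3" "$4"   # change the queue while paused
  croco -status                             # confirm the new queue
  croco -run                                # continue where it left off
}

requeue_workflow blastn my_database more_queries.fasta more_results
```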