STAR Howto

From TaejoonLab

Indexing The Reference

$ $HOME/src/STAR/STAR-20160723/bin/Linux_x86_64/STAR --runMode genomeGenerate \
    --runThreadN 8 --genomeDir /work/project/db.STAR/HUMAN_ens85_dna_sm \
    --genomeFastaFiles /work/project/pub/ens/85.g/HUMAN_ens85_dna_sm.fa \
    --sjdbGTFfile /work/project/pub/ens/85/Homo_sapiens.GRCh38.85.gtf \
    --sjdbOverhang 99
--runThreadN
the number of threads to be used for genome generation; it has to be set to the number of available cores on the server node.
--runMode genomeGenerate
directs STAR to run the genome index generation job.
--genomeDir
the path to the directory (henceforth called the "genome directory") where the genome indices are stored. This directory has to be created (with mkdir) before the STAR run and needs to have write permissions. The file system needs to have at least 100GB of disk space available for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path will have to be supplied at the mapping step to identify the reference genome.
--genomeFastaFiles
one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called chromosomes) are allowed in each FASTA file. You can rename the chromosomes in chrName.txt, keeping the order of the chromosomes in the file: the names from this file will be used in all output alignment files (such as .sam). Tabs are not allowed in chromosome names, and spaces are not recommended.
--sjdbGTFfile
the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve the accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is highly recommended whenever they are available. Starting from STAR 2.4.1a, the annotations can also be included on the fly at the mapping step.
--sjdbOverhang
the length of the genomic sequence around the annotated junction to be used in constructing the splice junction database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. For reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.
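
Two of the points above lend themselves to a small shell sketch: preparing the genome directory before the run, and deriving max(ReadLength)-1 for --sjdbOverhang from a FASTQ file. The helper names prepare_genome_dir and compute_sjdb_overhang are hypothetical (they are not part of STAR), and the FASTQ is assumed to use the usual four-lines-per-read layout.

```shell
# Sketch only: both helpers are hypothetical and not part of STAR.

# Create the genome directory, check that it is writable, and report
# free space so it can be compared with the ~100GB guideline.
prepare_genome_dir() {
    genome_dir=$1
    mkdir -p "$genome_dir" || return 1
    if [ ! -w "$genome_dir" ]; then
        echo "genome directory is not writable: $genome_dir" >&2
        return 1
    fi
    df -h "$genome_dir"
}

# Read FASTQ records on stdin and print max(ReadLength) - 1.
# Assumes four lines per read, with the sequence on the second line.
compute_sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { if (max > 0) print max - 1; else exit 1 }'
}

# Usage (the FASTQ file name is a placeholder):
#   prepare_genome_dir /work/project/db.STAR/HUMAN_ens85_dna_sm
#   zcat reads_1.fastq.gz | compute_sjdb_overhang
```

The directory helper deliberately does not delete existing files; clearing out an old index is left as a manual step.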


Mapping The Reads

    STAR --genomeDir "$DB" --runThreadN "$NUM_THREADS" --readFilesIn "$FQ1" "$FQ2" \
      --readFilesCommand zcat --outSAMtype SAM \
      --outFilterMultimapNmax 1 --quantMode GeneCounts \
      --sjdbGTFfile "$GTF" \
      --outFileNamePrefix "${BASENAME}." --limitBAMsortRAM 32000000000
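
The mapping command relies on several shell variables that are not defined on this page. As a minimal sketch, they could be set as follows: DB and GTF reuse the paths from the indexing step, while the sample name, the FASTQ file names, and the STAR_BIN override are hypothetical placeholders introduced only for illustration.

```shell
# Sketch only: sample name, FASTQ names and STAR_BIN are assumptions.
DB=/work/project/db.STAR/HUMAN_ens85_dna_sm              # genome directory from the indexing step
GTF=/work/project/pub/ens/85/Homo_sapiens.GRCh38.85.gtf  # annotation from the indexing step
NUM_THREADS=8
BASENAME=sample1                                         # hypothetical sample name
FQ1="${BASENAME}_R1.fastq.gz"                            # hypothetical gzipped read files
FQ2="${BASENAME}_R2.fastq.gz"

# Wraps the mapping command. Setting STAR_BIN=echo prints the command
# instead of running it (a dry run); otherwise STAR is taken from PATH.
run_star_mapping() {
    "${STAR_BIN:-STAR}" --genomeDir "$DB" --runThreadN "$NUM_THREADS" \
        --readFilesIn "$FQ1" "$FQ2" \
        --readFilesCommand zcat --outSAMtype SAM \
        --outFilterMultimapNmax 1 --quantMode GeneCounts \
        --sjdbGTFfile "$GTF" \
        --outFileNamePrefix "${BASENAME}." --limitBAMsortRAM 32000000000
}

# Usage:
#   STAR_BIN=echo; run_star_mapping    # dry run
#   unset STAR_BIN; run_star_mapping   # real run
```

Keeping the variables in one place makes it easier to loop over several samples by changing only BASENAME.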