The FastQC module parses results generated by FastQC, a quality control tool for high-throughput sequence data written by Simon Andrews at the Babraham Institute in Cambridge.
FastQC generates an HTML report, which is what most people use when they run the program. However, it also helpfully generates a file called fastqc_data.txt, which is relatively easy to parse.
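As a rough illustration of why this file is easy to parse, the sketch below splits its text into modules. It relies on the layout FastQC writes: each module starts with a `>>Name<TAB>status` line and ends with `>>END_MODULE`. The function name `parse_fastqc_modules` and the sample text are made up for this example, and the sketch simply drops every line starting with `#`, which is a simplification.

```python
def parse_fastqc_modules(text):
    """Split fastqc_data.txt content into {module name: (status, rows)}."""
    modules = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            name = None
        elif line.startswith(">>"):
            # Module header, e.g. ">>Basic Statistics\tpass"
            name, _, status = line[2:].partition("\t")
            modules[name] = (status, [])
        elif name is not None and line and not line.startswith("#"):
            # Data row inside a module; '#' header lines are skipped
            modules[name][1].append(line.split("\t"))
    return modules


# Hypothetical, heavily shortened sample of a fastqc_data.txt file
example = "\n".join([
    "##FastQC\t0.11.9",
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    "Filename\tmysample.fastq.gz",
    "Total Sequences\t1000",
    ">>END_MODULE",
    ">>Per sequence GC content\twarn",
    "#GC Content\tCount",
    "0\t0.0",
    ">>END_MODULE",
])
modules = parse_fastqc_modules(example)
print(modules["Basic Statistics"][0])
```

This is not how MultiQC itself is implemented; it only shows the general shape of the file.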
A typical run will produce the following files:
    mysample_fastqc.html
    mysample_fastqc/
      Icons/
      Images/
      fastqc.fo
      fastqc_data.txt
      fastqc_report.html
      summary.txt
Sometimes the directory is zipped, with just mysample_fastqc.zip.
The FastQC MultiQC module looks for files called fastqc_data.txt or ending in _fastqc.zip. If the zip files are found, they are read in memory and fastqc_data.txt is parsed.
The directory and zip file are often both present. To speed up MultiQC execution, zip files will be skipped if the file name suggests that they will share a sample name with data that has already been parsed.
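Reading a zip archive in memory, without unpacking it to disk, can be sketched with Python's standard `zipfile` module. The helper name `read_fastqc_data_from_zip` and the tiny archive built here are assumptions for illustration only, not MultiQC's actual code.

```python
import io
import zipfile


def read_fastqc_data_from_zip(zip_bytes):
    """Return the text of the first fastqc_data.txt member, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for member in zf.namelist():
            if member.endswith("fastqc_data.txt"):
                return zf.read(member).decode("utf-8")
    return None


# Build a small stand-in archive in memory for demonstration
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mysample_fastqc/fastqc_data.txt", "##FastQC\t0.11.9\n")

text = read_fastqc_data_from_zip(buf.getvalue())
print(repr(text))
```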
You can customise the patterns used for finding these files in your MultiQC config (see Module search patterns). The code below shows the default file patterns:
    sp:
      fastqc/data:
        fn: "fastqc_data.txt"
      fastqc/zip:
        fn: "*_fastqc.zip"
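To get a feel for which filenames these `fn` patterns select, the sketch below applies shell-style wildcard matching with Python's `fnmatch` module. The helper `match_search_keys` is hypothetical, and MultiQC's own matching logic may differ in detail.

```python
from fnmatch import fnmatchcase

# The default patterns shown above, keyed by search pattern name
patterns = {
    "fastqc/data": "fastqc_data.txt",
    "fastqc/zip": "*_fastqc.zip",
}


def match_search_keys(filename, patterns):
    """Return the search pattern keys whose fn glob matches the filename."""
    return [key for key, pat in patterns.items() if fnmatchcase(filename, pat)]


for fn in ["fastqc_data.txt", "mysample_fastqc.zip", "summary.txt"]:
    print(fn, match_search_keys(fn, patterns))
```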
Sample names are discovered by parsing the line beginning Filename in fastqc_data.txt, not based on the FastQC report names.
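A minimal sketch of this lookup, assuming the `Filename` line is tab-separated as in FastQC's Basic Statistics module. The function name is made up for this example, and any further cleaning of the sample name is left out.

```python
def filename_from_fastqc_data(text):
    """Return the value of the 'Filename' line, or None if absent."""
    for line in text.splitlines():
        if line.startswith("Filename\t"):
            return line.split("\t", 1)[1].strip()
    return None


# Hypothetical excerpt of a fastqc_data.txt file
sample_text = ">>Basic Statistics\tpass\n#Measure\tValue\nFilename\tmysample.fastq.gz\n"
print(filename_from_fastqc_data(sample_text))
```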
Theoretical GC Content
It is possible to plot a dashed line showing the theoretical GC content for a reference genome. MultiQC comes with genome and transcriptome guides for Human and Mouse. You can use these in your reports by adding the following MultiQC config keys (see Configuring MultiQC):
    fastqc_config:
      fastqc_theoretical_gc: "hg38_genome"
Only one theoretical distribution can be plotted. The following guides are available: (txome = transcriptome)
Alternatively, a custom theoretical guide can be used in reports. To do this, create a file with fastqc_theoretical_gc in the filename and place it with your analysis files. It should be tab-delimited with the following format (column 1 = %GC, column 2 = % of genome):
    # FastQC theoretical GC content curve: YOUR REFERENCE NAME
    0	0.005311768
    1	0.004108502
    2	0.004060371
    3	0.005066476
    [...]
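The format above can be read with a few lines of Python. This is a sketch under the stated layout (a `#` comment line carrying the reference name, then two tab-separated numeric columns). The function name `parse_theoretical_gc` is hypothetical, and lines that do not fit the layout are silently skipped.

```python
def parse_theoretical_gc(text):
    """Return (reference name, [(percent GC, percent of genome), ...])."""
    name = None
    curve = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Take the text after the first colon as the reference name
            _, _, label = line.partition(":")
            name = label.strip() or name
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # e.g. a placeholder such as "[...]"
        try:
            curve.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue
    return name, curve


gc_text = (
    "# FastQC theoretical GC content curve: YOUR REFERENCE NAME\n"
    "0\t0.005311768\n"
    "1\t0.004108502\n"
    "2\t0.004060371\n"
    "3\t0.005066476\n"
)
name, curve = parse_theoretical_gc(gc_text)
print(name, len(curve))
```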
You can generate these files using an R package called fastqcTheoreticalGC written by Mike Love. Please see the package readme for more details.
Result files from this package are searched for with the following search pattern (can be customised as described above):
    sp:
      fastqc/theoretical_gc:
        fn: "*fastqc_theoretical_gc*"
If you want to always use a specific custom file for MultiQC reports without having to add it to the analysis directory, add the full file path to the same MultiQC config variable described above:
    fastqc_config:
      fastqc_theoretical_gc: "/path/to/your/custom_fastqc_theoretical_gc.txt"
Changing the order of sections
Remember that it is possible to customise the order in which the different module sections appear in the report if you wish. See the docs for more information.
For example, to show the Status Checks section at the top, use the following config:
    report_section_order:
      fastqc_status_checks:
        order: -1000
File search patterns
    dragen_fastqc:
      fn: "*.fastqc_metrics.csv"
    fastqc/data:
      fn: fastqc_data.txt
    fastqc/zip:
      fn: "*_fastqc.zip"
    fastqc/theoretical_gc:
      fn: "*fastqc_theoretical_gc*"