Welcome to the MultiQC docs.

# Introduction

MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn't run other tools for you - it's designed to be placed at the end of analysis pipelines or to be run manually when you've finished running your tools.

When you launch MultiQC, it recursively searches through any provided file paths and finds files that it recognises. It parses relevant information from these and generates a single stand-alone HTML report file. It also saves a directory of data files with all parsed data for further downstream use.

# Installing MultiQC

## System Python

Before we start - a quick note that using the system-wide installation of Python is not recommended. This often causes problems and it's a little risky to mess with it. If you find yourself prepending sudo to any MultiQC commands, take a step back and think about Python virtual environments / conda instead (see below).

## Installing Python

To see if you have Python installed, run python --version on the command line. MultiQC needs Python version 2.7 or above (Python 3.4+ also supported).

We recommend using virtual environments to manage your Python installation. Our favourite is conda, a cross-platform tool to manage Python environments. You can find installation instructions for Miniconda here.

Once conda is installed, you can create an environment with the following commands:

```
conda create --name py3.6 python=3.6
source activate py3.6
# Windows: activate py3.6
```

You'll want to add the source activate py3.6 line to your .bashrc file so that the environment is loaded every time you open a terminal.

## Installing with conda

If you're using conda as described above, you can install MultiQC from the bioconda channel as follows:

```
conda install -c bioconda multiqc
```

## Installation with pip

This is the easiest way to install MultiQC. pip is the package installer for Python. It comes bundled with recent versions of Python; otherwise you can find installation instructions here.

You can now install MultiQC from PyPI as follows:

```
pip install multiqc
```

If you would like the development version, the command is:

```
pip install git+https://github.com/ewels/MultiQC.git
```

Note that if you have problems with read-only directories, you can install to your home directory with the --user parameter (though it's probably better to use virtual environments, as described above).

```
pip install --user multiqc
```

## Manual installation

If you'd rather not use either of these tools, you can clone the repository and install the code manually:

```
git clone https://github.com/ewels/MultiQC.git
cd MultiQC
python setup.py install
```

git not installed? No problem - just download the flat files:

```
curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
unzip master.zip
cd MultiQC-master
python setup.py install
```

## Updating MultiQC

You can update MultiQC from PyPI at any time by running the following command:

```
pip install --upgrade multiqc
```

To update the development version, use:

```
pip install --force-reinstall git+https://github.com/ewels/MultiQC.git
```

If you cloned the git repo, just pull the latest changes and install:

```
cd MultiQC
git pull
python setup.py install
```

## Installing on Windows

MultiQC has primarily been designed for use on Unix systems (Linux, Mac OSX). However, it should work on Windows too. Indeed, automated continuous integration tests run using AppVeyor to check compatibility at https://ci.appveyor.com/project/ewels/multiqc (see the test config here).

Some users have found that running the multiqc command directly in Windows doesn't work but that using the full path to the program does work. For example:

```
python \path\to\python\scripts\multiqc my_data
```

Note that you may be able to avoid this by adding this directory to your PATH.

## Installing as an environment module

Many people using MultiQC will be working in an HPC environment. Every server / cluster is different, and you're probably best off asking your friendly sysadmin to install MultiQC for you. However, with that in mind, here are a few general tips for installing MultiQC into an environment module system:

MultiQC comes in two parts - the multiqc python package and the multiqc executable script. The former must be available in $PYTHONPATH and the script must be available on the $PATH.

A typical installation procedure with an environment module Python install might look like this (note that $PYTHONPATH must be defined before the pip installation):

```
VERSION=0.7
INST=/path/to/software/multiqc/$VERSION
mkdir $INST
export PYTHONPATH=$INST/lib/python2.7/site-packages
pip install --install-option="--prefix=$INST" multiqc
```

Once installed, you'll need to create an environment module file. Again, these vary a lot between systems, but here's an example:

```
#%Module1.0#####################################################################
##
## MultiQC
##

set components [ file split [ module-info name ] ]
set version [ lindex $components 1 ]
set modroot /path/to/software/multiqc/$version

proc ModulesHelp { } {
    global version modroot
    puts stderr "\tMultiQC - use MultiQC $version"
    puts stderr "\n\tVersion $version\n"
}

module-whatis   "Loads MultiQC environment."

# load required modules
module load python/2.7.6

# only one version at a time
conflict multiqc

# Make the directories available
prepend-path    PATH        $modroot/bin
prepend-path    PYTHONPATH  $modroot/lib/python2.7/site-packages
```

## Using the Docker container

A Docker container based on python:2.7-slim is provided. Specify the volume to bind mount with -v, and the working directory inside the container with -w, or just use -v "$PWD":"$PWD" -w "$PWD" to run in the current directory. For more information, see the Docker documentation.

The usual multiqc command line should work fine:

```
docker run -v "$PWD":"$PWD" -w "$PWD" ewels/multiqc multiqc .
```

## Using MultiQC through Galaxy

### On the main Galaxy instance

The easiest and fastest way to use MultiQC is on the usegalaxy.org main Galaxy instance, where you will find the MultiQC Galaxy tool under the NGS: QC and manipulation tool panel section.

### On your instance

You can install MultiQC on your own Galaxy instance through your Galaxy admin space, by searching the main Toolshed for the MultiQC repository, available under the visualization, statistics and Fastq Manipulation sections.

# Running MultiQC

Once installed, just go to your analysis directory and run multiqc, followed by a list of directories to search. At its simplest, this can just be . (the current working directory):

```
multiqc .
```

That's it! MultiQC will scan the specified directories and produce a report based on details found in any log files that it recognises.

See Using MultiQC Reports for more information about how to use the generated report. For a description of all command line parameters, run multiqc --help.

## Choosing where to scan

You can supply MultiQC with as many directories or files as you like. Above, we supply . - just the current directory, but all of these would work too:

```
multiqc data/
multiqc data/ ../proj_one/analysis/ /tmp/results
multiqc data/*_fastqc.zip
multiqc data/sample_1*
```

You can also ignore files using the -x/--ignore flag (which can be specified multiple times). This takes a string which is matched against filenames, directory names and entire paths using glob expansion:

```
multiqc . --ignore *_R2*
multiqc . --ignore run_two/
multiqc . --ignore */run_three/*/fastqc/*_R2.zip
```

Some modules get sample names from the contents of the file and not the filename (for example, stdout logs can contain multiple samples). In this case, you can skip samples by name instead:

```
multiqc . --ignore-samples sample_3*
```

These strings are matched using glob logic (* and ? are wildcards).

All of these settings can be saved in a MultiQC config file so that you don't have to type them on the command line for every run.

Finally, you can supply a file containing a list of file paths, one per row. MultiQC will only search the listed files:

```
multiqc --file-list my_file_list.txt
```

## Renaming reports

The report is called multiqc_report.html by default. Tab-delimited data files are created in multiqc_data/, containing additional information. You can use a custom name for the report with the -n/--filename parameter, or instruct MultiQC to create them in a subdirectory using the -o/--outdir parameter.

Note that different MultiQC templates may have different defaults.

## Overwriting existing reports

It's quite common to repeatedly create new reports as new analysis results are generated. Instead of manually deleting old reports, you can just specify the -f parameter and MultiQC will overwrite any conflicting report filenames.

## Sample names prefixed with directories

Sometimes, the same samples may be processed in different ways. If MultiQC finds log files with the same sample name, the previous data will be overwritten (this can be inspected by running MultiQC with -v/--verbose). To avoid this, run MultiQC with the -d/--dirs parameter. This will prefix every sample name with the directory path for that log file. As such, sample names should now be unique, and will not overwrite one another.

By default, --dirs will prepend the entire path to each sample name. You can choose which directories are added with the -dd/--dirs-depth parameter. Set this to a positive integer to use that many directories at the end of the path, or a negative integer to take directories from the start of the path. For example:

```
$ multiqc -d .
# analysis_1 | results | type | sample_1 | file.log
# analysis_2 | results | type | sample_2 | file.log
# analysis_3 | results | type | sample_3 | file.log

$ multiqc -d -dd 1 .
# sample_1 | file.log
# sample_2 | file.log
# sample_3 | file.log

$ multiqc -d -dd -1 .
# analysis_1 | file.log
# analysis_2 | file.log
# analysis_3 | file.log
```

## Using different templates

MultiQC is built around a templating system. You can produce reports with different styling by using the -t/--template option. The available templates are listed with multiqc --help.

If you're interested in creating your own custom template, see the writing new templates section.

## PDF Reports

Whilst HTML is definitely the format of choice for MultiQC reports due to the interactive features that it can offer, PDF files are an integral part of some people's workflows. To try to accommodate this, MultiQC has a --pdf command line flag which will try to create a PDF report for you.

To do this, MultiQC uses the simple template. This uses flat plots, has no navigation or toolbar and strips out all JavaScript. The resulting HTML report is pretty basic, but this simplicity is helpful when generating PDFs.

Once the report is generated MultiQC attempts to call Pandoc, a command line tool able to convert documents between different file formats. You must have Pandoc already installed for this to work. If you don't have Pandoc installed, you will get an error message that looks like this:

```
Error creating PDF - pandoc not found. Is it installed? http://pandoc.org/
```

Please note that Pandoc is a complex tool and uses LaTeX / XeLaTeX for PDF generation. Please make sure that you have the latest version of Pandoc and that it can successfully convert basic HTML files to PDF before reporting any errors. Also note that not all plots have flat image equivalents, so some will be missing (at the time of writing: the FastQC sequence content plot, beeswarm dot plots and heatmaps).

## Printing to stdout

If you would like to generate MultiQC reports on the fly, you can print the output to standard out by specifying -n stdout. Note that the data directory will not be generated and the template used must create stand-alone HTML reports.

## Parsed data directory

By default, MultiQC creates a directory alongside the report containing tab-delimited files with the parsed data. This is useful for downstream processing, especially if you're running MultiQC with very large numbers of samples.

Typically, these files are tab-delimited tables. However, you can get JSON or YAML output for easier downstream parsing by specifying -k/--data-format on the command line or data_format in your configuration file.
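With the JSON data format, all parsed data typically ends up in a single multiqc_data.json file inside the data directory, which is easy to load downstream. A minimal sketch in Python; the inline JSON here is a hand-made illustration of the general-statistics structure, which varies between MultiQC versions and the modules run:

```python
import json

# Hypothetical snippet mimicking the shape of multiqc_data.json
# (real files contain many more keys and samples)
raw = """
{
  "report_general_stats_data": [
    {"sample_1": {"percent_duplicates": 12.3},
     "sample_2": {"percent_duplicates": 45.6}}
  ]
}
"""

data = json.loads(raw)

# Flatten the general statistics into {sample: {metric: value}}
general_stats = {}
for section in data["report_general_stats_data"]:
    for sample, metrics in section.items():
        general_stats.setdefault(sample, {}).update(metrics)

print(general_stats["sample_1"]["percent_duplicates"])  # 12.3
```

The same flattening approach works for YAML output, swapping json.loads for a YAML parser.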

You can also choose whether to produce the data by specifying either the --data-dir or --no-data-dir command line flags or the make_data_dir variable in your configuration file. Note that the data directory is never produced when printing the MultiQC report to stdout.

To zip the data directory, use the -z/--zip-data-dir flag.

## Exporting Plots

In addition to the HTML report, it's also possible to get MultiQC to save plots as stand alone files. You can do this with the -p/--export command line flag. By default, plots will be saved in a directory called multiqc_plots as .png, .svg and .pdf files. Raw data for the plots are also saved to files.

You can instruct MultiQC to always export plots by setting the export_plots config option to true, though note that this will add a few seconds to the execution time. The plots_dir_name option changes the default directory name for plots, and export_plot_formats specifies which file formats should be created (these must be supported by MatPlotLib).
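For example, the options named above could be set together in a MultiQC config file like this (the format list is illustrative; use any formats that MatPlotLib supports):

```yaml
export_plots: true
plots_dir_name: 'multiqc_plots'
export_plot_formats:
  - 'png'
  - 'svg'
```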

Note that not all plot types are yet supported, so you may find some plots are missing.

Note: You can always save static image versions of plots from within MultiQC reports, using the Export toolbox in the side bar.

## Choosing which modules to run

Sometimes, it's desirable to choose which MultiQC modules run. This could be because you're only interested in one type of output and want to keep the reports small. Or perhaps the output from one module is misleading in your situation.

You can do this by using -m/--modules to explicitly define which modules you want to run. Alternatively, use -e/--exclude to run all modules except those listed.

You can get a group of modules by using --tag followed by a tag e.g. RNA or DNA.

# Using MultiQC Reports

Once MultiQC has finished, you should have an HTML report file called multiqc_report.html (or something similar, depending on how you ran MultiQC). You can launch this report with open multiqc_report.html on the command line, or by double-clicking the file in a file browser.

## Browser compatibility

MultiQC reports should work in any modern browser. They have been tested using OSX Chrome, Firefox and Safari. If you find any report bugs, please report them as a GitHub issue.

## Report layout

MultiQC reports have three main page sections:

• The navigation menu (left side)
  • Links to the different module sections in the report
  • Click the logo to go to the top of the page
• The toolbox (right side)
  • Contains various tools to modify the report data (see below)
• The report (middle)
  • This is what you came here for, the data!

Note that if you're viewing the report on a mobile device / small window, the content will be reformatted to fit the screen.

## General Statistics table

At the top of every MultiQC report is the 'General Statistics' table. This shows an overview of key values, taken from all modules. The aim of the table is to bring together stats for each sample from across the analysis so that you can see them in one place.

Hovering over column headers will show a longer description, including which module produced the data. Clicking a header will sort the table by that value. Clicking it again will change the sort direction. You can shift-click multiple headers to sort by multiple columns.

Above the table there is a button called 'Configure Columns'. Clicking this will launch a modal window with more detailed information about each column, plus options to show/hide and change the order of columns.

## Plots

Below the General Statistics table, MultiQC modules can plot more extensive data in their own report sections.

### Interactive plots

Plots in MultiQC reports are usually interactive, using the HighCharts JavaScript library.

You can hover the mouse over data to see a tooltip with more information about that dataset. Clicking and dragging on line graphs will zoom into that area.

To reset the zoom, use the button in the top right:

Plots have a grey bar along their base; clicking and dragging this will resize the plot's height:

You can force reports to use interactive plots instead of flat by specifying the --interactive command line option (see below).

### Flat plots

Reports with large numbers of samples may contain flat plots. These are rendered when the MultiQC report is generated using MatPlotLib and are non-interactive (flat) images within the report. The reason for generating these is that large sample numbers can make MultiQC reports very data-intensive and unresponsive (crashing people's browsers in extreme cases). Plotting data in flat images is scalable to any number of samples, however.

Flat plots in MultiQC have been designed to look as similar to their interactive versions as possible. They are also copied to the multiqc_data/multiqc_plots directory.

You can force reports to use flat plots with the --flat command line option.

See the Large sample numbers section of the Configuring MultiQC docs for more on how to customise the flat / interactive plot behaviour.

### Exporting plots

If you want to use the plot elsewhere (eg. in a presentation or paper), you can export it in a range of formats. Just click the menu button in the top right of the plot:

This opens the MultiQC Toolbox Export Plots panel with the current plot selected. You have a range of export options here. When deciding on output format bear in mind that SVG is a vector format, so can be edited in tools such as Adobe Illustrator or the free tool Inkscape. This makes it ideal for use in publications and manual customisation / annotation. The Plot scaling option changes how large the labels are relative to the plot.

### Dynamic plots

Some plots have buttons above them which allow you to change the data that they show or their axis. For example, many bar plots have the option to show the data as percentages instead of counts:

## Toolbox

MultiQC reports come with a 'toolbox', accessible by clicking the buttons on the right hand side of the report:

Active toolbox panels have their button highlighted with a blue outline. You can hide the toolbox by clicking the open panel button a second time, or pressing Escape on your keyboard.

### Highlight Samples

If you run MultiQC with a lot of samples, plots can become very data-heavy. This makes it difficult to find specific samples, or subsets of samples.

To help with this, you can use the Highlight Samples tool to colour datasets of interest. Simply enter some text which will match the samples you want to highlight and press enter (or click the add button). If you like, you can also customise the highlight colour.

To make it easier to match groups of samples, you can use regular expressions by turning on 'Regex mode'. You can test your regexes using the tool at regex101.com, and find a good introduction to regular expressions here. Note that pattern delimiters are not needed (use pattern, not /pattern/).
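As an illustration of the matching logic (the report toolbox evaluates regexes in the browser, but for simple patterns the behaviour is the same as Python's re module, shown here with made-up sample names):

```python
import re

# Hypothetical sample names, as they might appear in a report
samples = ["SRR1067503_1", "SRR1067505_2", "SRR1067510_1"]

# In Regex mode you enter the bare pattern - no /delimiters/
pattern = r"_1$"  # matches sample names ending in _1

highlighted = [s for s in samples if re.search(pattern, s)]
print(highlighted)  # ['SRR1067503_1', 'SRR1067510_1']
```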

Here, we highlight any sample names that end in _1:

Note that a new button appears above the General Statistics table when samples are highlighted, allowing you to sort the table according to highlights.

Search patterns can be changed after creation, just click to edit. To remove, click the grey cross on the right hand side.

Searching for an empty string will match all samples.

### Renaming Samples

Sample names are typically generated based on processed file names. These file names are not always informative. To help with this, you can do a search and replace within sample names. Here, we remove the SRR1067 and _1 parts of the sample names, which are the same for all samples:

## Large sample numbers

MultiQC has been written with the intention of being used for any number of samples. This means that it should work well with 6 samples or 6000. Very large sample numbers are becoming increasingly common, for example with single cell data.

Producing reports with data from many hundreds or thousands of samples provides some challenges, both technically and also in terms of data visualisation and report usability.

One problem with large reports is that the browser can hang when the report is first loaded. This is because it is loading and processing the data for all plots at once. To mitigate this, large reports may show plots as grey boxes with a "Show Plot" button. Clicking this renders the plot as normal and prevents the browser from trying to do everything at once.

By default this behaviour kicks in when a plot has 50 samples or more. This can be customised by changing the num_datasets_plot_limit config option.

### Flat / interactive plots

Reports with many samples start to need a lot of data for plots. This results in inconvenient report file sizes (can be 100s of megabytes) and worse, web browser crashes. To allow MultiQC to scale to these sample numbers, most plot types have two plotting functions in the code base - interactive (using HighCharts) and flat (rendered with MatPlotLib). Flat plots take up the same disk space irrespective of sample number and do not consume excessive resources to display.

By default, MultiQC generates flat plots when there are 100 or more samples. This cutoff can be changed with the plots_flat_numseries config option. The behaviour can also be changed by running MultiQC with the --flat / --interactive command line options, or by setting the plots_force_flat / plots_force_interactive config options to True.

### Tables / Beeswarm plots

Report tables with thousands of samples (table rows) can quickly become impossible to use. To avoid this, tables with large numbers of rows are instead plotted as a Beeswarm plot (aka. a strip chart / jitter plot). These plots have fixed dimensions with any number of samples. Hovering on a dot will highlight the same sample in other rows.

By default, MultiQC starts using beeswarm plots when a table has 500 rows or more. This can be changed by setting the max_table_rows config option.
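All three thresholds described in this section are ordinary config options, so they can be set together in a MultiQC config file. The values below are simply the defaults stated above:

```yaml
num_datasets_plot_limit: 50   # deferred "Show Plot" rendering
plots_flat_numseries: 100     # switch from interactive to flat plots
max_table_rows: 500           # switch from tables to beeswarm plots
```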

## Command-line config

Sometimes it's useful to specify a single small config option just once, where creating a config file for the occasion may be overkill. In these cases you can use the --cl_config option to supply additional config values on the command line.

Config variables should be given as a YAML string. You will usually need to enclose this in quotes. If MultiQC is unable to understand your config you will get an error message saying Could not parse command line config.

As an example, the following command configures the coverage levels used by the Qualimap module (as described in its docs):

```
multiqc ./datadir --cl_config "qualimap_config: { general_stats_coverage: [20,40,200] }"
```

# Customising Reports

MultiQC offers a few ways to customise reports to easily add your own branding and some additional report-level information. These features are primarily designed for core genomics facilities.

Note that much more extensive customisation of reports is possible using custom templates.

## Titles and introductory text

You can specify a custom title for the report using the -i/--title command line option. The -b/--comment option can be used to add a longer comment to the top of the report at run time.

You can also specify the title and comment, as well as a subtitle and the introductory text in your config file:

```yaml
title: "My Title"
subtitle: "A subtitle to go underneath in grey"
intro_text: "MultiQC reports summarise analysis results."
report_comment: "This is a comment about this report."
```

Note that if intro_text is None the template will display the default introduction sentence. Set this to False to hide this, or set it to a string to use your own text.

## Report Logo

You can also add your own custom logo to reports with the following config options:

```yaml
custom_logo: '/abs/path/to/logo.png'
custom_logo_url: 'https://www.example.com'
custom_logo_title: 'Our Institute Name'
```

Only custom_logo is required. The URL makes the logo a link that opens your address in a new browser tab, and the title sets the mouse hover title text.

## Project level information

You can add custom information at the top of reports by adding key:value pairs to the config option report_header_info. Note that if you have a file called multiqc_config.yaml in the working directory, this will automatically be parsed and added to the config. For example, if you have the following saved:

```yaml
report_header_info:
  - Contact E-mail: 'phil.ewels@scilifelab.se'
  - Application Type: 'RNA-seq'
  - Project Type: 'Application'
  - Sequencing Platform: 'HiSeq 2500 High Output V4'
  - Sequencing Setup: '2x125'
```

Then this will be displayed at the top of reports:

Note that you can also specify a path to a config file using -c.

## Bulk sample renaming

Although it is possible to rename samples manually and in bulk using the report toolbox, it's often desirable to embed such renaming patterns into the report so that they can be shared with others. For example, a typical case could be for a sequencing centre that has internal sample IDs and also user-supplied sample names. Or public sample identifiers such as SRA numbers as well as more meaningful names.

It's possible to supply a file with one or more sets of sample names using the --sample-names command line option. This file should be a tab-delimited file with a header row (used for the report button labels) and then any number of renamed sample identifiers. For example:

```
MultiQC Names   Proper Names    AWESOME NAMES
SRR1067503_1    Sample_1        MYBESTSAMP_1
SRR1067505_1    Sample_2        MYBESTSAMP_2
SRR1067510_1    Sample_3        MYBESTSAMP_3
```

If supplied, buttons will be generated at the top of the report with your labels. Clicking these will populate and apply the Toolbox renaming panel.

NB: Sample renaming works with partial substrings - these will be replaced!

It's also possible to supply such renaming patterns within a config file (useful if you're already generating a config file for a run). In this case, you need to set the variables sample_names_rename_buttons and sample_names_rename. For example:

```yaml
sample_names_rename_buttons:
  - "MultiQC Names"
  - "Proper Names"
  - "AWESOME NAMES"
sample_names_rename:
  - ["SRR1067503_1", "Sample_1", "MYBESTSAMP_1"]
  - ["SRR1067505_1", "Sample_2", "MYBESTSAMP_2"]
  - ["SRR1067510_1", "Sample_3", "MYBESTSAMP_3"]
```

## Section comments

Sometimes you may want to add a custom comment above specific sections in the report. You can do this with the config option section_comments, as follows:

```yaml
section_comments:
  featurecounts: 'This comment is for a module header, but should still work'
  star_alignments: 'This new way of commenting above sections is **awesome**!'
```

Comments can be written in Markdown. The section_comments keys should correspond to the HTML IDs of the report section. You can find these by clicking on a navigation link in the report and seeing the #section_id at the end of the browser URL.

## Removing modules or sections

If you don't want an entire module to be used in a MultiQC report, use the -e/--exclude command line flags to skip running that tool.

If you would like to remove just one section of a module report, you can do so with the remove_sections config option as follows:

```yaml
remove_sections:
  - section-id-one
  - second-section-id
```

The section ID is the string appended to the URL when clicking a report section in the navigation. For example, the GATK module has a section with the title "Compare Overlap". When clicking that in the report's left hand side navigation, the web browser URL has #gatk-compare-overlap appended. Here, you would add gatk-compare-overlap to the remove_sections config.

#### Removing General Statistics

The General Statistics table is a bit of a special case in MultiQC, but code has been added to make it behave well with the above mechanism. On the command line, you can specify -e general_stats. Alternatively, you can set the following flag in your MultiQC config:

```yaml
skip_generalstats: true
```

## Order of modules

By default, modules are included in the report in the order specified in config.module_order. Any modules found which aren't in this list are appended at the top of the report.

#### Top modules

To specify certain modules that should always come at the top of the report, you can configure config.top_modules in your MultiQC configuration file. For example, to always have the FastQC module at the top of reports, add the following to your ~/.multiqc_config.yaml file:

```yaml
top_modules:
  - 'fastqc'
```

#### Running modules multiple times

A module can be specified multiple times in either config.module_order or config.top_modules, causing it to be run multiple times. By itself, this will just give you two identical report sections. However, you can also supply configuration options to each run of the module, as follows:

```yaml
top_modules:
  - moduleName:
      name: 'Module (filtered)'
      info: 'This section shows the module with different files'
      path_filters:
        - '*_special.txt'
        - '*_others.txt'
  - moduleName:
      name: 'Module (not-special)'
      path_filters_exclude:
        - '*_special.txt'
```

These options overwrite the defaults that are hardcoded in the module code, with the exception of path_filters and path_filters_exclude, which instead filter the file searches against a list of glob filename patterns:

| Pattern | Meaning |
| ------- | ------- |
| `*` | matches everything |
| `?` | matches any single character |
| `[seq]` | matches any character in seq |
| `[!seq]` | matches any character not in seq |

Note that exclusion supersedes inclusion for the path filters.
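Since these are standard glob patterns, you can preview what a path_filters entry will match using Python's fnmatch module, which implements the same syntax (the file names here are invented):

```python
from fnmatch import fnmatch

# A path_filters pattern from the example above
pattern = "*_special.txt"

print(fnmatch("sample_1_special.txt", pattern))  # True
print(fnmatch("sample_2_others.txt", pattern))   # False
```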

The other available configuration options are:

• name: Section name
• anchor: Section report ID
• target: Intro link text
• href: Intro link URL
• info: Intro text
• extra: Additional HTML after intro.
• custom_config: Custom module-level settings. Translated into config.moduleName, but specifically for this section.

For example, to run the FastQC module twice, before and after adapter trimming, you could use the following config:

```yaml
module_order:
  - fastqc:
      name: 'FastQC (trimmed)'
      info: 'This section of the report shows FastQC results after adapter trimming.'
      target: ''
      path_filters:
        - '*_1_trimmed_fastqc.zip'
  - fastqc:
      name: 'FastQC (raw)'
      path_filters:
        - '*_1_fastqc.zip'
```

Note that if you change the name then you will get duplicate columns in the General Statistics table. If the name is unchanged, the topmost module may overwrite output from the first iteration.

NB: Currently, you cannot list a module name in both top_modules and module_order. Let me know if this is a problem.

## Order of sections

Sometimes it's desirable to customise the order of specific sections in a report, independent of module execution. For example, the custom_content module can generate multiple sections from different input files.

To do this, follow a link in a report navigation to skip to the section you want to move (must be a major section header, not a subheading). Find the ID of that section by looking at the URL. For example, clicking on FastQC changes the URL to multiqc_report.html#fastqc - the ID is the text after (not including) the # symbol.

Next, specify the report_section_order option in your MultiQC config file. Sections in the report are given a number ranging from 10 (section at the bottom of the report), incrementing by +10 for each section. You can change this number (eg. a very low number to always be at the bottom of the report, or a very high one to always be at the top), or you can move a section to before or after another existing section (this has no effect if the other named ID is not in the report).

```yaml
report_section_order:
  section1:
    order: -1000
  section2:
    before: 'othersection'
  section3:
    after: 'diffsection'
```

## Customising plots

Almost every plot in MultiQC reports is created using standard plotting functions and a plot config. You can override any plot config variable you like for any plot to customise how it is generated.

To do this, first find the plot that you would like to customise and copy its unique ID. You can find this by clicking export - the name next to the checkbox is the ID.

Next, you need to find the plot config key(s) that you would like to change. You can find these by reading the MultiQC documentation below.

For example, to set a new limit for the Picard InsertSizeMetrics x-axis, you can use the following:

```yaml
custom_plot_config:
  picard_insert_size:
    xmax: 300
```

You can customise multiple variables for multiple plots:

```yaml
custom_plot_config:
  # Show the percentages tab by default for the FastQC sequence counts plot
  fastqc_sequence_counts_plot:
    cpswitch_c_active: False

  # Only show up to 20bp on the x axis for cutadapt, change the title
  cutadapt_plot:
    xmax: 20
    title: "How many base pairs have been removed from the data"

  # Add a coloured band in the background to show what is a good result
  # Yes I know this doesn't make sense for this plot, it's just an example ;)
  bismark_mbias:
    yPlotBands:
      - from: 0
        to: 40
        color: '#e6c3c3'
      - from: 40
        to: 80
        color: '#e6dcc3'
      - from: 80
        to: 100
        color: '#c3e6c3'
```

## Customising tables

### Hiding columns

Report tables such as the General Statistics table can get quite wide. To help with this, columns in the report can be hidden. Some MultiQC modules include columns which are hidden by default, others may be uninteresting to some users.

To allow customisation of this behaviour, the defaults can be changed by adding to your MultiQC config file. This is done with the table_columns_visible value. Open a MultiQC report and click Configure Columns above a table. Make a note of the Group and ID for the column that you'd like to alter. For example, to make the % Duplicate Reads column from FastQC hidden by default, the Group is FastQC and the ID is percent_duplicates. These are then added to the config as follows:

```yaml
table_columns_visible:
  FastQC:
    percent_duplicates: False
```

Note that you can set these to True to show columns that would otherwise be hidden by default.

### Column order

In the same way, you can force a column to appear at the start or end of the table, or indeed impose a custom ordering on all the columns, by setting the table_columns_placement. High values push columns to the right hand side of the table and low to the left. The default value is 1000. For example:

```yaml
table_columns_placement:
  Samtools:
    reads_mapped: 10
    properly_paired: 1010
    secondary: 1020
```

In this case, since the default placement weighting is 1000, the reads_mapped column will end up as the leftmost column and the other two will end up as the final columns on the right of the table.

The columns are organised by either namespace or table ID, then column ID. In the above example, Samtools is the namespace in the General Statistics table - the text that is at the start of the tooltip. For custom tables, the ID may be easier to use.
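As a quick illustration of the weighting (hypothetical helper code, not MultiQC internals), sorting column IDs by their placement value with a default of 1000 reproduces the behaviour described above:

```python
# Illustrative sketch: columns are sorted by placement weight; any column not
# listed in the config keeps the default weight of 1000.

DEFAULT_PLACEMENT = 1000

def place_columns(columns, placement):
    """columns: list of column IDs; placement: {column_id: weight}."""
    return sorted(columns, key=lambda c: placement.get(c, DEFAULT_PLACEMENT))

cols = ["secondary", "reads_mapped", "properly_paired"]
print(place_columns(cols, {"reads_mapped": 10, "properly_paired": 1010, "secondary": 1020}))
# → ['reads_mapped', 'properly_paired', 'secondary']
```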

### Conditional formatting

It's possible to highlight values in tables based on their value. This is done using the table_cond_formatting_rules config setting. Rules can be applied to every table column, or to specific columns only, using that column's unique ID.

The default rules are as follows:

```yaml
table_cond_formatting_rules:
  all_columns:
    pass:
      - s_eq: 'pass'
      - s_eq: 'true'
    warn:
      - s_eq: 'warn'
      - s_eq: 'unknown'
    fail:
      - s_eq: 'fail'
      - s_eq: 'false'
```

These make any table cells that match the string pass or true have text with a green background, orange for warn, red for fail and so on. There can be multiple tests for each style of formatting - if there is a match for any, it will be applied. The following comparison operators are available:

• s_eq - String exactly equals (case insensitive)
• s_contains - String contains (case insensitive)
• s_ne - String does not equal (case insensitive)
• eq - Value equals
• ne - Value does not equal
• gt - Value is greater than
• lt - Value is less than
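To make the operator semantics concrete, here is a sketch of how one such rule test could be evaluated in Python (hypothetical code, not MultiQC's implementation; string comparisons are case-insensitive, and numeric comparisons simply fail to match non-numeric values):

```python
# Illustrative evaluation of a single conditional-formatting comparison.
# String operators compare case-insensitively; numeric operators return
# False when the cell value can't be cast to a number.

def matches(value, op, target):
    s, t = str(value).lower(), str(target).lower()
    if op == "s_eq":
        return s == t
    if op == "s_contains":
        return t in s
    if op == "s_ne":
        return s != t
    try:
        v, n = float(value), float(target)
    except (TypeError, ValueError):
        return False
    return {"eq": v == n, "ne": v != n, "gt": v > n, "lt": v < n}[op]

print(matches("PASS", "s_eq", "pass"))  # → True
print(matches(85, "gt", 80))            # → True
```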

To have matches for a specific column, use that column's ID instead of all_columns. For example:

```yaml
table_cond_formatting_rules:
  mqc-generalstats-uniquely_mapped_percent:
    pass:
      - gt: 80
    warn:
      - lt: 80
    fail:
      - lt: 70
```

Note that the formatting is done in a specific order - pass/warn/fail by default, so that anything matching both warn and fail will be formatted as fail for example. This can be customised with table_cond_formatting_colours (see below).

To find the unique ID for your column, right click a table cell in a report and inspect its HTML (Inspect in Chrome). It should look something like <td class="data-coloured mqc-generalstats-Assigned">, where the mqc-generalstats-Assigned bit is the unique ID.

I know this isn't the same method of IDs as above and isn't super easy to do. Sorry!

It's possible to highlight matches in any number of colours. MultiQC comes with the following defaults:

```yaml
table_cond_formatting_colours:
  - blue: '#337ab7'
  - lbue: '#5bc0de'
  - pass: '#5cb85c'
  - warn: '#f0ad4e'
  - fail: '#d9534f'
```

These can be overridden or added to with any string / CSS hex colour combinations you like. You can generate hex colour codes with lots of tools, for example http://htmlcolorcodes.com/

Note that the different sets of rules are formatted in order. So if a value matches both pass and fail then it will be formatted as a fail.

## Number base (multiplier)

To make numbers in the General Statistics table easier to read and compare quickly, MultiQC sometimes divides them by one million (typically read counts). If your samples have very low read counts then this can result in the table showing counts of 0.0, which isn't very helpful.

To change this behaviour, you can customise three config variables in your MultiQC config. The defaults are as follows:

```yaml
read_count_multiplier: 0.000001
read_count_prefix: 'M'
read_count_desc: 'millions'
```

So, to show thousands of reads instead of millions, change these to:

```yaml
read_count_multiplier: 0.001
read_count_prefix: 'K'
read_count_desc: 'thousands'
```

The same options are also available for numbers of base pairs:

```yaml
base_count_multiplier: 0.000001
base_count_prefix: 'Mb'
base_count_desc: 'millions'
```
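The arithmetic here is simply multiplication before display. A worked example in plain Python, just to illustrate the effect of the config values above:

```python
# Raw counts are multiplied by the configured multiplier before display,
# so a read count of 1234567 appears as "1.2 M reads" with the defaults.

read_count_multiplier = 0.000001   # default: show millions
raw_reads = 1_234_567

shown = raw_reads * read_count_multiplier
print(f"{shown:.1f} M reads")  # → 1.2 M reads

# With low-coverage samples, switching to thousands avoids unhelpful zeros:
assert round(4_200 * 0.000001, 1) == 0.0   # millions: shows as 0.0
assert round(4_200 * 0.001, 1) == 4.2      # thousands: readable
```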

## Number formatting

By default, the interactive HighCharts plots in MultiQC reports use spaces for thousand separators and points for decimal places (e.g. 1 234 567.89). Different countries have different preferences for this, so you can customise the two using a couple of configuration parameters - decimalPoint_format and thousandsSep_format.

For example, the following config would result in the following alternative number formatting: 1234567,89.

```yaml
decimalPoint_format: ','
thousandsSep_format: ''
```

This formatting currently only applies to the interactive charts. It may be extended to apply elsewhere in the future (submit a new issue if you spot somewhere where you'd like it).
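For illustration, the effect of these two settings can be sketched in Python (a hypothetical formatting helper; in real reports the substitution happens in the HighCharts JavaScript):

```python
# Format a number with configurable decimal point and thousands separator,
# mirroring the default (space + point) and the alternative config above.

def format_number(value, decimal_point=".", thousands_sep=" "):
    whole, _, frac = f"{value:,.2f}".partition(".")
    whole = whole.replace(",", thousands_sep)
    return whole + (decimal_point + frac if frac else "")

print(format_number(1234567.89))  # → 1 234 567.89
print(format_number(1234567.89, decimal_point=",", thousands_sep=""))  # → 1234567,89
```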

## Troubleshooting

One tricky bit that caught me out whilst writing this is the different type casting between Python, YAML and Jinja2 templates. This is especially true when using an empty variable:

```
# Python
my_var = None

# YAML
my_var: null

# Jinja2
if myvar is none # Note - Lower case!
```
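A dependency-free way to see the null → None mapping (shown here with the json stdlib module; PyYAML's yaml.safe_load behaves the same way for YAML's null):

```python
# JSON's null (like YAML's null) deserialises to Python's None, which is why
# Jinja2 templates then need the lowercase test `is none`.
import json

config = json.loads('{"my_var": null}')
assert config["my_var"] is None
print(type(config["my_var"]).__name__)  # → NoneType
```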

# Troubleshooting

Hopefully MultiQC will be easy to use and run without any hitches. If you have any problems, please do get in touch with the developer (Phil Ewels) by e-mail or by submitting an issue on GitHub. Before that, here are a few things previously encountered that may help...

## Not enough samples found

In this scenario, MultiQC finds some logs for the bioinformatics tool in question, but not all of your samples appear in the report. This is the most common question I get regarding MultiQC operation.

Usually, this happens because sample names collide. This happens innocently quite often - MultiQC overwrites previous results with the same name, so you only get the last one seen in the report. You can see warnings about this by running MultiQC in verbose mode with the -v flag, or by looking at the generated log file in multiqc_data/multiqc.log. If you are unsure about which log file ended up in the report, look at multiqc_data/multiqc_sources.txt, which lists each source file used.

To solve this, try running MultiQC with the -d and -s flags. The Clashing sample names section of the docs explains this in more detail.

### Big log files

Another reason that log files can be skipped is if the log filesize is very large. For example, this could happen with very long concatenated standard out files. By default, MultiQC skips any file that is larger than 10MB to keep execution fast. The verbose log output (-v or multiqc_data/multiqc.log) will show you if files are being skipped with messages such as these:

```
[DEBUG  ]  Ignoring file as too large: filename.txt
```

You can configure the threshold and parse your files by changing the log_filesize_limit config option. For example, to parse files up to 2GB in size, add the following to your MultiQC config file:

```yaml
log_filesize_limit: 2000000000
```

## No logs found for a tool

In this case, you have run a bioinformatics tool and have some log files in a directory. When you run MultiQC with that directory, it finds nothing for the tool in question.

There are a couple of things you can check here:

1. Is the tool definitely supported by MultiQC? If not, why not open an issue to request it!
2. Did your bioinformatics tool definitely run properly? I've spent quite a bit of time debugging MultiQC modules only to realise that the output files from the tool were empty or incomplete. If your data is missing, take a look at the raw files and make sure that there's something to see!

If everything looks fine, then MultiQC probably needs extending to support your data. Tools have different versions, different parameters and different output formats that can confuse the parsing code. Please open an issue with your log files and we can get it fixed.

## Error messages about mkl trial mode / licences

In this case you run MultiQC and get something like this:

```
$ multiqc .
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode EXPIRED 2 days ago
    You cannot run mkl without a license any longer.
    A license can be purchased it at: http://continuum.io
    We are sorry for any inconveniences.
    SHUTTING DOWN PYTHON INTERPRETER
```

The mkl library provides optimisations for numpy, a requirement of MatPlotLib. Recent versions of Conda have a bundled version which should come with a licence and remove the warning. See this page for more info. If you already have Conda installed, you can get the updated version by running:

```
conda remove mkl-rt
conda install -f mkl
```

Another way around it is to uninstall mkl. It seems that numpy works fine without it:

```
$ conda remove --features mkl
```

Problem solved! See more here and here.

If you're not using Conda, try installing MultiQC with that instead. You can find instructions here.

## Locale Error Messages

Two MultiQC dependencies have been known to throw errors due to problems with the Python locale settings, or rather the lack of those settings.

MatPlotLib can complain that some strings (such as en_SE) aren't allowed. Running MultiQC gives the following error:

```
$ multiqc --version
# ..long traceback..
  File "/sw/comp/python/2.7.6_milou/lib/python2.7/locale.py", line 443, in _parse_localename
    raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8
```

Click can have a similar problem if the locale isn't set when using Python 3. That generates an error that looks like this:

```
# ..truncated traceback..
  File "click/_unicodefun.py", line 118, in _verify_python3_env
    'for mitigation steps.' + extra)
RuntimeError: Click will abort further execution because Python 3 was
configured to use ASCII as encoding for the environment. Consult
http://click.pocoo.org/python3/ for mitigation steps.
```

You can fix both of these problems by changing your system locale to something that will be recognised. One way to do this is by adding these lines to your .bashrc in your home directory (or .bash_profile):

```bash
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
```

Other locale strings are also fine, as long as the variables are set and valid.

# MultiQC Modules

# Pre-alignment

## Adapter Removal

This program searches for and removes remnant adapter sequences from High-Throughput Sequencing (HTS) data and (optionally) trims low quality bases from the 3' end of reads following adapter removal. AdapterRemoval can analyze both single end and paired end data, and can be used to merge overlapping paired-ended reads into (longer) consensus sequences. Additionally, AdapterRemoval may be used to recover a consensus adapter sequence for paired-ended data, for which this information is not available.

The adapterRemoval module parses *.settings logs generated by Adapter Removal, a tool for rapid adapter trimming, identification, and read merging.

Supported settings file results:

• single end
• paired end noncollapsed
• paired end collapsed

## AfterQC

The AfterQC module parses results generated by AfterQC.
AfterQC can simply go through all fastq files in a folder and then output three folders: good, bad and QC folders, which contain the good reads, bad reads and the QC results of each fastq file/pair.

## bcl2fastq

There are two versions of this software: bcl2fastq for MiSeq and HiSeq sequencing systems running RTA versions earlier than 1.8, and bcl2fastq2 for Illumina sequencing systems running RTA version 1.18.54 and above. This module currently only covers output from the latter.

## BioBloom Tools

BioBloom Tools (BBT) provides the means to create filters for a given reference and then to categorize sequences. This methodology is faster than alignment but does not provide mapping locations. BBT was initially intended to be used for pre-processing and QC applications like contamination detection, but is flexible enough to accommodate other purposes. This tool is intended to be a pipeline component to replace costly alignment steps.

## Cluster Flow

Cluster Flow is a simple and flexible bioinformatics pipeline tool. It's designed to be quick and easy to install, with flexible configuration and simple customization. Cluster Flow is easy enough to set up and use for non-bioinformaticians (given a basic knowledge of the command line), and its simplicity makes it great for low to medium throughput analyses.

The MultiQC module for Cluster Flow parses *_clusterflow.txt logs and finds consensus commands executed by modules in each pipeline run. The Cluster Flow *.run files are also parsed and pipeline information shown (some basic statistics plus the pipeline steps / params used).

## Cutadapt

The Cutadapt module parses results generated by Cutadapt, a tool to find and remove adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

This module should be able to parse logs from a wide range of versions of Cutadapt. It's been tested with log files from v1.2.1, 1.6 and 1.8.
Note that you will need to change the search pattern for very old log files (such as v.1.2) with the following MultiQC config:

```yaml
sp:
  cutadapt:
    contents: 'cutadapt version'
```

See the module search patterns section of the MultiQC documentation for more information.

## ClipAndMerge

An application to clip adapter sequences and merge reads in ancient DNA analysis.

Note that versions < 1.7.8 use the basename of the file path to distinguish samples, whereas newer versions produce logfiles with a sample identifier that gets parsed by MultiQC.

## FastQ Screen

The FastQ Screen module parses results generated by FastQ Screen, a tool that allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

By default, the module creates a plot that emulates the FastQ Screen output with blue and red stacked bars showing unique and multimapping read counts. This plot only works for a handful of samples however, so if # samples * # organisms >= 160, a simpler stacked barplot is shown. This is also shown when generating flat-image plots. To always show this style of plot, add the following line to a MultiQC config file:

```yaml
fastqscreen_simpleplot: true
```

## FastQC

The FastQC module parses results generated by FastQC, a quality control tool for high throughput sequence data written by Simon Andrews at the Babraham Institute.

FastQC generates an HTML report which is what most people use when they run the program. However, it also helpfully generates a file called fastqc_data.txt which is relatively easy to parse.

A typical run will produce the following files:

```
mysample_fastqc.html
mysample_fastqc/
  Icons/
  Images/
  fastqc.fo
  fastqc_data.txt
  fastqc_report.html
  summary.txt
```

Sometimes the directory is zipped, with just mysample_fastqc.zip. The FastQC MultiQC module looks for files called fastqc_data.txt or ending in _fastqc.zip.
If the zip files are found, they are read in memory and fastqc_data.txt parsed.

Note: The directory and zip file are often both present. To speed up MultiQC execution, zip files will be skipped if the file name suggests that they will share a sample name with data that has already been parsed.

You can customise the patterns used for finding these files in your MultiQC config (see Module search patterns). The below code shows the default file patterns:

```yaml
sp:
  fastqc/data:
    fn: 'fastqc_data.txt'
  fastqc/zip:
    fn: '*_fastqc.zip'
```

Note: Sample names are discovered by parsing the line beginning Filename in fastqc_data.txt, not based on the FastQC report names.

### Theoretical GC Content

It is possible to plot a dashed line showing the theoretical GC content for a reference genome. MultiQC comes with genome and transcriptome guides for Human and Mouse. You can use these in your reports by adding the following MultiQC config keys (see Configuring MultiQC):

```yaml
fastqc_config:
  fastqc_theoretical_gc: 'hg38_genome'
```

Only one theoretical distribution can be plotted. The following guides are available: hg38_genome, hg38_txome, mm10_genome, mm10_txome (txome = transcriptome).

Alternatively, a custom theoretical guide can be used in reports. To do this, create a file with fastqc_theoretical_gc in the filename and place it with your analysis files. It should be tab delimited with the following format (column 1 = %GC, column 2 = % of genome):

```
# FastQC theoretical GC content curve: YOUR REFERENCE NAME
0	0.005311768
1	0.004108502
2	0.004060371
3	0.005066476
[...]
```

You can generate these files using an R package called fastqcTheoreticalGC written by Mike Love. Please see the package readme for more details.
Result files from this package are searched for with the following search pattern (can be customised as described above):

```yaml
sp:
  fastqc/theoretical_gc:
    fn: '*fastqc_theoretical_gc*'
```

If you want to always use a specific custom file for MultiQC reports without having to add it to the analysis directory, add the full file path to the same MultiQC config variable described above:

```yaml
fastqc_config:
  fastqc_theoretical_gc: '/path/to/your/custom_fastqc_theoretical_gc.txt'
```

## Fastp

The Fastp module parses results generated by Fastp. Fastp can simply go through all fastq files in a folder and perform a series of quality control and filtering steps. Quality control and reporting are displayed both before and after filtering, allowing for a clear depiction of the consequences of the filtering process. Notably, the latter can be conducted on a variety of parameters including quality scores, length, as well as the presence of adapters, polyG, or polyX tailing.

## FLASh

The FLASh module parses the log messages generated by the FLASh read merger. To create a log file, you can use tee. From the FLASh help:

```
flash reads_1.fq reads_2.fq 2>&1 | tee logfilename.log
```

The sample name is set by the first input filename listed in the log. However, this can be changed to using the first output filename (i.e. if you used FLASh's --output-prefix=PREFIX option) by using the following config:

```yaml
flash:
  use_output_name: true
```

The module can also parse the .hist numeric histograms output by FLASh. Note that the histogram's file format and extension are too generic by themselves, which could result in the accidental parsing of a file output by another tool. To get around this, the MultiQC module only parses files with the filename pattern *flash*.hist. To customise this (for example, enabling it for any file ending in *.hist), use the following config change:

```yaml
sp:
  flash/hist:
    fn: '*.hist'
```

## Flexbar

Flexbar preprocesses high-throughput sequencing data efficiently.
It demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies.

## InterOp

This module parses the output from the InterOp Summary executable and creates a table view. The aim is to replicate the Run & Lane Metrics table from the Illumina Basespace interface. The executable used can easily be installed from the BioConda channel using conda install -c bioconda illumina-interop.

## Jellyfish

JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.

The MultiQC module for Jellyfish parses only *_jf.hist files. The general usage of jellyfish to be parsed by the MultiQC module needs to be:

• gunzip -c file.fastq.gz | jellyfish count -o file.jf -m ...
• jellyfish histo -o file_jf.hist -f file.jf

In case a user wants to customise the matching pattern for jellyfish, then multiqc can be run with the option --cl_config "sp: { jellyfish: { fn: 'PATTERN' } }" where PATTERN is the pattern to be matched. For example:

```
multiqc . --cl_config "sp: { jellyfish: { fn: '*.hist' } }"
```

## KAT

The KAT multiqc module interprets output from KAT distribution analysis json files, which typically contain information such as estimated genome size and heterozygosity rates from your k-mer spectra.

## leeHom

leeHom is a Bayesian maximum a posteriori algorithm for stripping sequencing adapters and merging overlapping portions of reads. The algorithm is mostly aimed at ancient DNA and Illumina data but can be used for any dataset.
## MinIONQC

The MinIONQC module parses results generated by MinIONQC. It uses the sequencing_summary.txt files produced by ONT (Oxford Nanopore Technologies) long-read base-callers to perform QC on the reads. It allows quick-and-easy comparison of data from multiple flowcells. The MultiQC module parses data in the summary.yaml MinIONQC output files.

## Skewer

The Skewer module parses results generated by Skewer, an adapter trimming tool specially designed for processing next-generation sequencing (NGS) paired-end sequences.

## SortMeRNA

SortMeRNA is a tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for fast and sensitive analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data.

The MultiQC module parses the log files, which are created when SortMeRNA is run with the --log option. The default header in the 'General Statistics' table is '% rRNA'. Users can override this using the configuration option:

```yaml
sortmerna:
  tab_header: 'My database hits'
```

## Trimmomatic

The Trimmomatic module parses standard error generated by Trimmomatic, a flexible read trimming tool for Illumina NGS data. StdErr can be captured by directing it to a file, e.g.:

```
trimmomatic command 2> trim_out.log
```

By default, the module generates the sample names based on the command line used by Trimmomatic. If you prefer, you can tell the module to use the filenames as sample names instead. To do so, use the following config option:

```yaml
trimmomatic:
  s_name_filenames: true
```

# Aligners

## BISCUIT

The BISCUIT module parses logs generated by BISCUIT and the QC.sh script included in the BISCUIT software.

## Bismark

The Bismark module parses logs generated by Bismark, a tool to map bisulfite converted sequence reads and determine cytosine methylation states.
## Bowtie 1

The Bowtie 1 module parses results generated by Bowtie, an ultrafast, memory-efficient short read aligner.

## Bowtie 2

The Bowtie 2 module parses results generated by Bowtie 2, an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.

Please note that the Bowtie 2 logs are difficult to parse as they don't contain much extra information (such as what the input data was). A typical log looks like this:

```
314537 reads; of these:
  314537 (100.00%) were paired; of these:
    111016 (35.30%) aligned concordantly 0 times
    193300 (61.46%) aligned concordantly exactly 1 time
    10221 (3.25%) aligned concordantly >1 times
    ----
    111016 pairs aligned concordantly 0 times; of these:
      11377 (10.25%) aligned discordantly 1 time
    ----
    99639 pairs aligned 0 times concordantly or discordantly; of these:
      199278 mates make up the pairs; of these:
        112779 (56.59%) aligned 0 times
        85802 (43.06%) aligned exactly 1 time
        697 (0.35%) aligned >1 times
82.07% overall alignment rate
```

Bowtie 2 logs are from STDERR - some pipelines (such as Cluster Flow) print the Bowtie 2 command before this, so MultiQC looks to see if this can be recognised in the same file. If not, it takes the filename as the sample name.

Bowtie 2 is used by other tools too, so if your log file contains the word bisulfite, MultiQC will assume that this is actually Bismark and ignore the Bowtie 2 logs.

## BBMap

The BBMap module produces summary statistics from the BBMap suite of tools. The module can summarise data from the following BBMap output files (descriptions from command line help output):

• stats - BBDuk filtering statistics.
• covstats (not yet implemented) - Per-scaffold coverage info.
• rpkm (not yet implemented) - Per-scaffold RPKM/FPKM counts.
• covhist - Histogram of # occurrences of each depth level.
• basecov (not yet implemented) - Coverage per base location.
• bincov (not yet implemented) - Print binned coverage per location (one line per X bases).
• scafstats - Statistics on how many reads mapped to which scaffold.
• refstats - Statistics on how many reads mapped to which reference file; only for BBSplit.
• bhist - Base composition histogram by position.
• qhist - Quality histogram by position.
• qchist - Count of bases with each quality value.
• aqhist - Histogram of average read quality.
• bqhist - Quality histogram designed for box plots.
• lhist - Read length histogram.
• gchist - Read GC content histogram.
• indelhist - Indel length histogram.
• mhist - Histogram of match, sub, del, and ins rates by read location.
• statsfile (not yet implemented) - Mapping statistics are printed here.

Additional information on the BBMap tools is available on SeqAnswers.

## HiCUP

The HiCUP module parses results generated by HiCUP (Hi-C User Pipeline), a tool for mapping and performing quality control on Hi-C data.

## HiC-Pro

The HiC-Pro module parses results generated by HiC-Pro, a tool for efficient processing and quality control of Hi-C data.

## HISAT2

HISAT2 is a fast and sensitive alignment program for mapping NGS reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome).

The HISAT2 MultiQC module parses summary statistics generated by versions >= v2.1.0 where the command line option --new-summary has been specified. Note that running HISAT2 without this option (and older versions) gives log output identical to Bowtie 2. These logs are indistinguishable and their summary statistics will appear in MultiQC reports labelled as Bowtie 2. See the GitHub issues on the HISAT2 repository and the MultiQC repository for more information.

HISAT2 does not report the input file names in the log, so MultiQC takes the filename as the sample name. Note that if you specify --summary-file when running HISAT2, the same summary output appears both there and in the stdout. So if you save both with different names you may end up with duplicate samples in your MultiQC report.
## Kallisto

The Kallisto module parses logs generated by Kallisto, a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

Note - MultiQC parses the standard out from Kallisto, not any of its output files (abundance.h5, abundance.tsv, and run_info.json). As such, you must capture the Kallisto stdout to a file when running it in order to use the MultiQC module.

## Longranger

Currently supported Longranger pipelines:

• wgs
• targeted

Usage:

```
longranger wgs --fastqs=/path/to/fastq --id=NA12878
multiqc /path/to/NA12878
```

This module will look for the files _invocation and summary.csv in the NA12878 folder, i.e. the output folder of Longranger in this example. The file summary.csv is required. If the file _invocation is not found, the sample will receive a generic name in the MultiQC report (longranger#1) instead of NA12878 or whatever was given by the --id parameter.

## Salmon

The Salmon module parses results generated by Salmon, a tool for quantifying the expression of transcripts using RNA-seq data.

## STAR

STAR is an ultrafast universal RNA-seq aligner. This MultiQC module parses summary statistics from the Log.final.out log files. Sample names are taken from the filename prefix (sampleNameLog.final.out) when set with --outFileNamePrefix in STAR. If there is no filename prefix, the sample name is set as the name of the directory containing the file.

In addition to this summary log file, the module parses ReadsPerGene.out.tab files generated with --quantMode GeneCounts, if found.

## TopHat

The TopHat module parses results generated by TopHat, a fast splice junction mapper for RNA-Seq reads that aligns RNA-Seq reads to mammalian-sized genomes.

# Post-alignment

## Bamtools

The Bamtools module parses bamtools stats logs generated by Bamtools, a programmer's API and an end-user's toolkit for handling BAM files.
Supported commands: stats

## Bcftools

The Bcftools module parses results generated by Bcftools, a suite of programs for interacting with variant call data.

Supported commands: stats

#### Collapse complementary substitutions

In non-strand-specific data, reporting the total numbers of occurrences for both changes in a complementary pair - like A>C and T>G - might not bring any additional information. To collapse such statistics in the substitutions plot, you can add the following section into your configuration:

```yaml
bcftools:
  collapse_complementary_changes: true
```

MultiQC will sum up all complementary changes and show only A>* and C>* substitutions in the resulting plot.

## biobambam2

Currently, the biobambam2 module only processes output from the bamsormadup command. Not only that, but it cheats by using the module code from Picard/MarkDuplicates. The output is so similar that the code simply sets up a module with a unique name and filename search pattern and then uses the parsing code from the Picard module.

Apart from the behind the scenes coding, this module should work in exactly the same way as all other MultiQC modules.

## BUSCO

BUSCO v2 provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.

The MultiQC module parses the short_summary_[samplename].txt files and plots the proportion of BUSCO types found. MultiQC has been tested with output from BUSCO v1.22 - v2.

## Conpair

Conpair is a fast and robust method dedicated to human tumour-normal studies to perform concordance verification (i.e. samples coming from the same individual), as well as cross-individual contamination level estimation in whole-genome and whole-exome sequencing experiments.

## DamageProfiler

A tool for DNA damage pattern retrieval for ancient DNA analysis and verification.
## DeDup

Improved duplicate removal for merged/collapsed reads in ancient DNA analysis.

## deepTools

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.

The MultiQC module for deepTools parses a number of the text files that deepTools can produce. In particular, the following are supported:

• bamPEFragmentSize --table
• bamPEFragmentSize --outRawFragmentLengths
• estimateReadFiltering
• plotCoverage --outRawCounts (as well as the content written normally to the console)
• plotEnrichment --outRawCounts
• plotFingerprint --outQualityMetrics --outRawCounts
• plotPCA --outFileNameData
• plotCorrelation --outFileCorMatrix
• plotProfile --outFileNameData

Please be aware that some tools (namely, plotFingerprint --outRawCounts and plotCoverage --outRawCounts) are only supported as of deepTools version 2.6. For earlier output from plotCoverage --outRawCounts, you can use #'chr' 'start' 'end' in utils/search_patterns.yaml (see here for more details). Also for these types of files, you may need to increase the maximum file size supported by MultiQC (log_filesize_limit in the MultiQC configuration file). You can find details regarding the configuration file location here.

Note that sample names are parsed from the text files themselves; they are not derived from file names.

## Disambiguate

Disambiguation algorithm for reads aligned to two species (e.g. human and mouse genomes) from Tophat, Hisat2, STAR or BWA mem.
Both a Python and a C++ implementation are offered. The MultiQC module for Disambiguate parses the summary files generated by Disambiguate.

## featureCounts

The featureCounts module parses results generated by featureCounts, a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations.

## GATK

Developed by the Data Science and Data Engineering group at the Broad Institute, the GATK toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Supported tools:

- BaseRecalibrator
- VariantEval

#### BaseRecalibrator

BaseRecalibrator is a tool for detecting systematic errors in read base quality scores of aligned high-throughput sequencing reads. It outputs a base quality score recalibration table that can be used in conjunction with the PrintReads tool to recalibrate base quality scores.

#### VariantEval

VariantEval is a general-purpose tool for variant evaluation. It gives information about the percentage of variants in dbSNP, genotype concordance, Ti/Tv ratios and a lot more.

## goleft indexcov

The goleft indexcov module parses results generated by goleft indexcov. It uses the PED and ROC data files to create diagnostic plots of coverage per sample, helping to identify sample gender and coverage issues.

By default, we attempt to only plot chromosomes using standard human-like naming (chr1, chr2... chrX or 1, 2 ... X), but you can specify chromosomes for detailed ROC plots with alternative naming schemes in your configuration:

```yaml
goleft_indexcov_config:
  chromosomes:
    - I
    - II
    - III
```

## Hap.py

## HiCExplorer

The HiCExplorer module parses results generated by HiCExplorer's hicBuildMatrix, a tool to create an interaction matrix out of mapped Hi-C reads.
## HOMER

HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for motif discovery and next-gen sequencing analysis. HOMER contains many useful tools for analyzing ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and numerous other types of functional genomics sequencing data sets.

The HOMER MultiQC module currently parses output from the findPeaks tool and from tag directories. If you would like support to be added for other HOMER tools, please open a new issue on the MultiQC GitHub page.

#### FindPeaks

The HOMER findPeaks MultiQC module parses the summary statistics found at the top of HOMER peak files. Three key statistics are shown in the General Statistics table; all others are saved to `multiqc_data/multiqc_homer_findpeaks.txt`.

#### TagDirectory

The HOMER tag directory submodule parses tag directory output files, generating a number of diagnostic plots.

## HTSeq

HTSeq is a general purpose Python package that provides infrastructure to process data from high-throughput sequencing assays. `htseq-count` is a tool that is part of the main HTSeq package - it takes a file with aligned sequencing reads plus a list of genomic features, and counts how many reads map to each feature.

## MACS2

MACS2 (Model-based Analysis of ChIP-Seq) is a tool for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions.

The MACS2 MultiQC module reads the header of the `*_peaks.xls` results files and prints the redundancy rates in the General Statistics table. Numerous additional values are parsed and saved to `multiqc_data/multiqc_macs2.txt`.

## methylQA

The methylQA module parses results generated by methylQA, a methylation sequencing data quality assessment tool.

## miRTrace

The miRTrace module parses results generated by miRTrace, a quality control software for small RNA sequencing data. miRTrace performs adapter trimming and discards reads that fail to pass the QC filters.
miRTrace specifically addresses sequencing quality, read length, sequencing depth and miRNA complexity, and also identifies the presence of both miRNAs and undesirable sequences derived from tRNAs, rRNAs or Illumina artifact sequences. miRTrace also profiles clade-specific miRNAs based on a comprehensive catalog of previously identified clade-specific miRNA families. With this information, miRTrace can detect exogenous miRNAs, which could be contamination-derived (e.g. index mis-assignment during sample demultiplexing) or biologically derived (e.g. parasitic RNAs).

## phantompeakqualtools

Used to generate three quality metrics: NSC, RSC, and PBC. The NSC (normalized strand cross-correlation) and RSC (relative strand cross-correlation) metrics use cross-correlation of stranded read density profiles to measure enrichment independently of peak calling. The PBC (PCR bottleneck coefficient) is an approximate measure of library complexity: the ratio of non-redundant, uniquely mappable reads to uniquely mappable reads.

## Peddy

Peddy compares familial relationships and sexes as reported in a PED file with those inferred from a VCF. It samples the VCF at about 25,000 sites (plus chrX) to accurately estimate relatedness, IBS0, heterozygosity, sex and ancestry. It uses the 2504 samples from the 1000 Genomes Project as a background to calibrate the relatedness calculation and to make ancestry predictions. It does this very quickly by sampling, by using C for computationally intensive parts, and by parallelization.

## Picard

The Picard module parses results generated by Picard, a set of Java command line tools for manipulating high-throughput sequencing data.
Supported commands:

- MarkDuplicates
- InsertSizeMetrics
- GcBiasMetrics
- HsMetrics
- OxoGMetrics
- BaseDistributionByCycle
- RnaSeqMetrics
- AlignmentSummaryMetrics
- RrbsSummaryMetrics
- ValidateSamFile
- VariantCallingMetrics

#### InsertSizeMetrics

By default, the insert size plot is smoothed to contain a maximum of 500 data points per sample. This is to prevent the MultiQC report from becoming very large with big datasets. If you would like to customise this value to get a better resolution, you can set the following MultiQC config value with the new maximum number of points:

```yaml
picard_config:
  insertsize_smooth_points: 10000
```

#### Coverage Levels

It's possible to customise the HsMetrics "Target Bases 30X" coverage and WgsMetrics "Fraction of Bases over 30X" values shown in the General Statistics table. These must correspond to field names in the Picard report, such as `PCT_TARGET_BASES_2X` / `PCT_10X`. Any numbers not found in the reports will be ignored.

The coverage levels available for HsMetrics are typically 1, 2, 10, 20, 30, 40, 50 and 100X. The coverage levels available for WgsMetrics are typically 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90 and 100X.

To customise this, add the following to your MultiQC config:

```yaml
picard_config:
  general_stats_target_coverage:
    - 10
    - 50
```

#### ValidateSamFile Search Pattern

Generally, Picard adds identifiable content to the output of its function calls. This is not the case for ValidateSamFile. In order to identify logs, the MultiQC Picard submodule ValidateSamFile will search for filenames that contain 'validatesamfile' or 'ValidateSamFile'.

You can customise the search pattern used by overwriting the `picard/sam_file_validation` pattern in your MultiQC config. For example:

```yaml
sp:
  picard/sam_file_validation:
    fn: '*[Vv]alidate[Ss]am[Ff]ile*'
```

## Preseq

The Preseq module parses results generated by Preseq, a tool that estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing total read count.
When `preseq lc_extrap` is run with the default parameters, the extrapolation points reach 10 billion molecules, making the plot difficult to interpret in most scenarios. It also includes a lot of data in the reports, which can unnecessarily inflate report file sizes. To avoid this, MultiQC trims back the x axis until each dataset shows 80% of its maximum y-value (unique molecules). To disable this feature and show all of the data, add the following to your MultiQC configuration:

```yaml
preseq:
  notrim: true
```

#### Using coverage instead of read counts

Preseq reports its numbers as "molecule counts". This isn't always very intuitive, and it's often easier to talk about sequencing depth in terms of coverage. You can plot the estimated coverage instead by specifying the reference genome or target size, and the read length, in your MultiQC configuration:

```yaml
preseq:
  genome_size: 3049315783
  read_length: 300
```

These parameters make the script take every molecule count and divide it by (genome_size / read_length).

MultiQC comes with effective genome size presets for human and mouse, so you can provide the genome build name instead, like this: `genome_size: hg38_genome`. The following values are supported: `hg19_genome`, `hg38_genome`, `mm10_genome`.

When the genome and read sizes are provided, MultiQC will plot the molecule counts on the X axis ("total" data) and coverages on the Y axis ("unique" data). However, you can customise what to plot on each axis (counts or coverage), e.g.:

```yaml
preseq:
  x_axis: counts
  y_axis: coverage
```

#### Plotting externally calculated read counts

To mark read counts calculated externally from BAM or FastQ files on the plot, create a file with `preseq_real_counts` in the filename and place it with your analysis files. It should be space- or tab-delimited with 2 or 3 columns (column 1 = preseq file name, column 2 = real read count, optional column 3 = real unique read count). For example:

```
Sample_1.preseq.txt 3638261 3638011
Sample_2.preseq.txt 1592394 1592133
[...]
```
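To make the expected layout concrete, here is a minimal sketch of a parser for such a file. This is purely illustrative (the helper name is ours, not part of MultiQC); it just follows the 2-or-3-column format described above:

```python
# Sketch: parse preseq_real_counts lines into a dict keyed by preseq
# file name. Column 1 = preseq file name, column 2 = real read count,
# optional column 3 = real unique read count (None if absent).
def parse_real_counts(lines):
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total = int(fields[1])
        unique = int(fields[2]) if len(fields) > 2 else None
        counts[fields[0]] = (total, unique)
    return counts

example = [
    "Sample_1.preseq.txt 3638261 3638011",
    "Sample_2.preseq.txt 1592394 1592133",
]
print(parse_real_counts(example)["Sample_1.preseq.txt"])  # (3638261, 3638011)
```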
You can generate a line for such a file using samtools:

```bash
echo "Sample_1.preseq.txt "$(samtools view -c -F 4 Sample_1.bam)" "$(samtools view -c -F 1028 Sample_1.bam)
```

## Prokka

The Prokka module analyses summary results from the Prokka annotation pipeline for prokaryotic genomes.

The Prokka module accepts three configuration options:

- `prokka_table`: default False. Show a table in the report.
- `prokka_barplot`: default True. Show a barplot in the report.
- `prokka_fn_snames`: default False. Use filenames for sample names (see below).

Sample names are generated using the first line in the prokka reports:

```
organism: Helicobacter pylori Sample1
```

The module assumes that the first two words are the organism name and the third is the sample name. So the above will give a sample name of `Sample1`. If you prefer, you can set `config.prokka_fn_snames` to True and MultiQC will instead use the log filename as the sample name.

## QoRTs

The QoRTs software package is a fast, efficient, and portable multifunction toolkit designed to assist in the analysis, quality control, and data management of RNA-Seq datasets. Its primary function is to aid in the detection and identification of errors, biases, and artifacts produced by paired-end high-throughput RNA-Seq technology. In addition, it can produce count data designed for use with differential expression and differential exon usage tools, as well as individual-sample and/or group-summary genome track files suitable for use with the UCSC genome browser.

## Qualimap

The Qualimap module parses results generated by Qualimap, a platform-independent application to facilitate the quality control of alignment sequencing data and its derivatives, such as feature counts.

The MultiQC module supports the Qualimap commands BamQC and RNASeq. Note that Qualimap must be run with the `-outdir` option as well as `-outformat HTML` (which is on by default). MultiQC uses files found within the `raw_data_qualimapReport` folder (as well as `genome_results.txt`).
Qualimap adds lots of columns to the General Statistics table. To avoid making the table too wide and bloated, some of these are hidden by default (Error Rate, M Aligned, M Total reads). You can override these defaults in your MultiQC config file - for example, to show Error Rate by default and hide Ins. size by default, add the following:

```yaml
table_columns_visible:
  QualiMap:
    general_error_rate: True
    median_insert_size: False
```

See the relevant section of the documentation for more detail.

In addition, it's possible to customise which coverage thresholds are calculated by the Qualimap BamQC module (default: 1, 5, 10, 30, 50) and which of these are hidden in the General Statistics table when the report loads (default: all hidden except 30X). To do this, add something like the following to your MultiQC config file:

```yaml
qualimap_config:
  general_stats_coverage:
    - 10
    - 20
    - 40
    - 200
    - 30000
  general_stats_coverage_hidden:
    - 10
    - 20
    - 200
```

## QUAST

QUAST evaluates genome assemblies by computing various metrics, including:

- N50: the length for which the collection of all contigs of that length or longer covers at least 50% of the assembly length
- NG50: as N50, but relative to the length of the reference genome
- NA50 and NGA50: as above, but using aligned blocks instead of contigs
- Misassemblies: misassembled and unaligned contigs or contig bases
- Genes and operons covered

The QUAST MultiQC module parses the `report.tsv` files generated by QUAST and adds key metrics to the report General Statistics table. All statistics for all samples are saved to `multiqc_data/multiqc_quast.txt`.

#### Configuration

By default, the QUAST module is configured to work with large de-novo genomes, showing thousands of contigs, mega-base pairs and other sensible defaults.
If these aren't appropriate for your genomes, you can configure them as follows:

```yaml
quast_config:
  contig_length_multiplier: 0.001
  contig_length_suffix: 'Kbp'
  total_length_multiplier: 0.000001
  total_length_suffix: 'Mbp'
  total_number_contigs_multiplier: 0.001
  total_number_contigs_suffix: 'K'
```

The default module values are shown above. See the main MultiQC documentation for more information about how to configure MultiQC.

#### MetaQUAST

The QUAST module will also parse output from MetaQUAST runs (`metaquast.py`). The `combined_reference/report.tsv` file is parsed; the folders `runs_per_reference` and `not_aligned` are ignored. If you want to run MultiQC against auxiliary MetaQUAST runs, you must explicitly pass these files to MultiQC:

```bash
multiqc runs_per_reference/reference_1/report.tsv
```

Note that you can pass as many file paths to MultiQC as you like and use glob expansion (eg. `runs_per_reference/*/report.tsv`).

## RNA-SeQC

The RNA-SeQC module parses results generated by RNA-SeQC (not to be confused with RSeQC, which MultiQC also supports). RNA-SeQC is a Java program which computes a series of quality control metrics for RNA-seq data.

This module shows the Spearman correlation heatmap if both Spearman's and Pearson's are found. To plot Pearson's by default instead, add the following to your MultiQC config file:

```yaml
rna_seqc:
  default_correlation: pearson
```

## RSEM

The RSEM module parses results generated by RSEM, a software package for estimating gene and isoform expression levels from RNA-Seq data.

Supported scripts:

- `rsem-calculate-expression`

This module searches for the `.cnt` file created by RSEM within the directory named `PREFIX.stat`.

## RSeQC

The RSeQC module parses results generated by RSeQC, a package that provides a number of useful modules that can comprehensively evaluate high throughput RNA-seq data.
Supported scripts:

- `bam_stat`
- `gene_body_coverage`
- `infer_experiment`
- `inner_distance`
- `junction_annotation`
- `junction_saturation`
- `read_distribution`
- `read_duplication`
- `read_gc`

You can choose to hide sections of RSeQC output and customise their order. To do this, add and customise the following in your MultiQC config file:

```yaml
rseqc_sections:
  - read_distribution
  - gene_body_coverage
  - inner_distance
  - read_gc
  - read_duplication
  - junction_annotation
  - junction_saturation
  - infer_experiment
  - bam_stat
```

Change the order to rearrange sections, or remove entries to hide them from the report.

## Samblaster

The Samblaster module parses results generated by Samblaster, a tool to mark duplicates and extract discordant and split reads from SAM files.

## Samtools

The Samtools module parses results generated by Samtools, a suite of programs for interacting with high-throughput sequencing data.

Supported commands:

- `stats`
- `flagstat`
- `idxstats`
- `rmdup`

### idxstats

samtools idxstats prints its results to standard out (so there is no consistent file name) and has no header lines (so there is no way to recognise it from the content of the file). As such, idxstats result files must have the string `idxstat` somewhere in the filename.

There are a few MultiQC config options that you can add to customise how the idxstats module works. A typical configuration could look as follows:

```yaml
# Always include these chromosomes in the plot
samtools_idxstats_always:
  - X
  - Y

# Never include these chromosomes in the plot
samtools_idxstats_ignore:
  - MT

# Threshold where chromosomes are ignored in the plot.
# Should be a fraction, default is 0.001 (0.1% of total)
samtools_idxstats_fraction_cutoff: 0.001

# Name of the X and Y chromosomes.
# If not specified, MultiQC will search for any chromosome
# names that look like x, y, chrx or chry (case insensitive search)
samtools_idxstats_xchr: myXchr
samtools_idxstats_ychr: myYchr
```

## Sargasso

The Sargasso module parses results generated by Sargasso, a tool for separating mixed-species RNA-seq reads according to their species of origin.

## Slamdunk

Slamdunk is a tool to analyze data from the SLAM-Seq sequencing protocol. This module should be able to parse logs from v0.2.2-dev onwards.

## SnpEff

The SnpEff module parses results generated by SnpEff, a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

MultiQC parses the summary `.csv` file that is generated by SnpEff. Note that you must run SnpEff with `-csvStats <filename>` for this to be generated. See the SnpEff documentation for more information.

## Supernova

#### Important notes

Due to the size of the `histogram_kmer_count.json` files, MultiQC is likely to skip these files. To be able to display them, you will need to change the MultiQC configuration to allow for larger logfiles; see the MultiQC documentation. For instance, if you run MultiQC as part of an analysis pipeline, you can create a `multiqc_config.yaml` file in the working directory containing the following line:

```yaml
log_filesize_limit: 100000000
```

#### General Notes

The Supernova module parses the reports from an assembly run. As a bare minimum, it requires the file `report.txt`, found in the folder `sampleID/outs/`, to function. Note! If you are anything like the author (@remiolsen), you might only have files (often renamed to, e.g. `sampleID-report.txt`) lying around due to disk space limitations and for ease of sharing with your colleagues. This module will search for `*report*.txt`. If available, the stats in the report file will be superseded by the higher-precision numbers found in the file `sampleID/outs/assembly/stats/summary.json`.
In the same folder, this module will search for the following plots and render them:

- `histogram_molecules.json` -- inferred molecule lengths
- `histogram_kmer_count.json` -- kmer multiplicity

This module has been tested using Supernova versions 1.1.4 and 1.2.0.

## Stacks

#### Very important note

This module will only work with Stacks version 2.1 or greater. Furthermore, this module is designed to only parse some of the output from the `denovo_map` pipeline. If you are missing some functionality, please submit an issue on the MultiQC GitHub page.

## THeTA2

THeTA2 (Tumor Heterogeneity Analysis) is an algorithm that estimates the tumour purity and clonal / subclonal copy number aberrations directly from high-throughput DNA sequencing data.

The THeTA2 MultiQC module plots the % germline and % tumour subclone for each sample. Note that each sample can have multiple maximum likelihood solutions - the MultiQC module plots proportions for the first one in the results file (`*.BEST.results`). Also note that if there are more than 5 tumour subclones, their percentages are summed.

## VCFTools

### Important General Note

Depending on the size and density of the variant data (VCF), some of the stats files generated by vcftools can be very large. If you find that some of your input files are missing, increase `config.log_filesize_limit` so that the large file(s) will not be skipped by MultiQC. Note, however, that this might make MultiQC very slow!

This module parses the outputs from VCFTools' various commands:

### Implemented

- `relatedness2`
  - Plots a heatmap of pairwise sample relatedness.
  - Not to be confused with the similarly-named command `relatedness`.
- `TsTv-by-count`
  - Plots the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs).
- `TsTv-by-qual`
  - Plots the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs).
- `TsTv-summary`
  - Plots a bar graph of the summary counts of each type of transition and transversion SNP.

### To do

VCFTools has a number of outputs not yet supported in MultiQC which would be good to add. Please check GitHub if you'd like these added or (better still) would like to contribute!

## VerifyBAMID

A key step in any genetic analysis is to verify whether the data being generated matches expectations. verifyBamID checks whether reads in a BAM file match previous genotypes for a specific sample. In addition, it detects possible sample mixture from population allele frequency only, which can be particularly useful when genotype data is not available.

Using a mathematical model that relates observed sequence reads to a hypothetical true genotype, verifyBamID tries to decide whether sequence reads match a particular individual or are more likely to be contaminated (including a small proportion of foreign DNA), derived from a closely related individual, or derived from a completely different individual.

This module currently only imports data from the `.selfSM` output. The `chipmix` and `freemix` columns are imported into the General Statistics table. A verifyBAMID section is then added, with a table containing the entire selfSM file. If no chip data was parsed, these columns will not be added to the MultiQC report.

Should you wish to remove one of these columns from the General Statistics table, add the lines below to the `table_columns_visible` section of your config file:

```yaml
table_columns_visible:
  verifyBAMID:
    CHIPMIX: False
    FREEMIX: False
```

This was designed to work with verifyBamID 1.1.3 (January 2018).

# Custom Content

WARNING - This feature is new and very much in a beta state. It is expected to be further developed in future releases, which may break backwards compatibility. There are also probably quite a few bugs. Use at your own risk!

Please report bugs or missing functionality as a new GitHub issue.
# Introduction

Bioinformatics projects often include non-standardised analyses, with results from custom scripts or in-house packages. It can be frustrating to have a MultiQC report describing results from 90% of your pipeline but missing the final key plot. To help with this, MultiQC has a special "custom content" module.

Custom content parsing is a little more restricted than standard modules. Specifically:

- Only one plot per section is possible
- Plot customisation is more limited

All plot types can be generated using custom content - see the test files for examples of how data should be structured.

## Data from a released tool

If your data comes from a released bioinformatics tool, you shouldn't be using this feature of MultiQC! Sure, you can probably get it to work, but it's better if a fully-fledged core MultiQC module is written instead. That way, other users of MultiQC can also benefit from the results parsing.

Note that proper MultiQC modules are more robust and powerful than this custom content feature. You can also write modules in MultiQC plugins if they're not suitable for general release.

## Images

As of MultiQC v1.7, you can import custom images into your MultiQC reports. Simply add `_mqc` to the end of the filename for `.png`, `.jpg` or `.jpeg` files, for example: `my_image_file_mqc.png` or `summary_diagram_mqc.jpeg`. Images will be embedded within the HTML file, so reports will be fully self-contained. Note that this means it's very possible to make the HTML file very, very large if abused!

The report section name and description will be based automatically on the filename.

## MultiQC-specific data file

If you can choose exactly how your data output looks, then the easiest way to parse it is to use a MultiQC-specific format. If the filename ends in `*_mqc.(yaml|json|txt|csv|out)` then it will be found by any standard MultiQC installation with no additional customisation required (v0.9 onwards).
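As a quick illustration of that filename rule, a small checker might look like this. This is a sketch, not MultiQC's actual search implementation; the function name is ours:

```python
import re

# Sketch: check whether a filename matches the custom-content pattern
# *_mqc.(yaml|json|txt|csv|out) described above (illustrative only).
MQC_PATTERN = re.compile(r".+_mqc\.(yaml|json|txt|csv|out)$")

def is_custom_content_file(filename):
    return bool(MQC_PATTERN.match(filename))

print(is_custom_content_file("my_results_mqc.json"))  # True
print(is_custom_content_file("my_results.json"))      # False
```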
These files contain configuration information specifying how the data should be parsed, alongside the data itself. If you want to use YAML, this is an example of how it should look:

```yaml
id: 'my_pca_section'
section_name: 'PCA Analysis'
description: 'This plot shows the first two components from a principal component analysis.'
plot_type: 'scatter'
pconfig:
  id: 'pca_scatter_plot'
  title: 'PCA Plot'
  xlab: 'PC1'
  ylab: 'PC2'
data:
  sample_1: {x: 12, y: 14}
  sample_2: {x: 8, y: 6}
  sample_3: {x: 5, y: 11}
  sample_4: {x: 9, y: 12}
```

The file format can also be JSON:

```json
{
  "id": "custom_data_lineplot",
  "section_name": "Custom JSON File",
  "description": "This plot is a self-contained JSON file.",
  "plot_type": "linegraph",
  "pconfig": {
    "id": "custom_data_linegraph",
    "title": "Output from my JSON file",
    "ylab": "Number of things",
    "xDecimals": false
  },
  "data": {
    "sample_1": { "1": 12, "2": 14, "3": 10, "4": 7, "5": 16 },
    "sample_2": { "1": 9, "2": 11, "3": 15, "4": 18, "5": 21 }
  }
}
```

For maximum compatibility with other tools, you can also use comma-separated or tab-separated files. Include commented header lines with plot configuration in YAML format:

```
# title: 'Output from my script'
# description: 'This output is described in the file header. Any MultiQC installation will understand it without prior configuration.'
# section: 'Custom Data File'
# format: 'tsv'
# plot_type: 'bargraph'
# pconfig:
#     id: 'custom_bargraph_w_header'
#     ylab: 'Number of things'
Category_1	374
Category_2	229
Category_3	39
Category_4	253
```

If no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code. Something will probably be shown, but it may produce unexpected results.

## Data as part of MultiQC config

If you are already using a MultiQC config file to add data to your report (for example, titles / introductory text), you can include data within this file too.
This can be in any MultiQC config file (for example, passed on the command line with `-c my_yaml_file.yaml`). This is useful as you can keep everything contained within a single file (including stuff unrelated to this specific custom content feature of MultiQC).

To be understood by MultiQC, the `custom_data` key must be found. This must contain a section with a unique id, specific to your new report section. Finally, the contents of this second dictionary will look the same as the above stand-alone YAML files. For example:

```yaml
custom_data:
  my_data_type:
    id: 'mqc_config_file_section'
    section_name: 'My Custom Section'
    description: 'This data comes from a single multiqc_config.yaml file'
    plot_type: 'bargraph'
    pconfig:
      id: 'barplot_config_only'
      title: 'MultiQC Config Data Plot'
      ylab: 'Number of things'
    data:
      sample_a:
        first_thing: 12
        second_thing: 14
      sample_b:
        first_thing: 8
        second_thing: 6
      sample_c:
        first_thing: 11
        second_thing: 5
      sample_d:
        first_thing: 12
        second_thing: 9
```

Or to add data to the General Statistics table:

```yaml
custom_data:
  my_genstats:
    plot_type: 'generalstats'
    pconfig:
      - col_1:
          max: 100
          min: 0
          scale: 'RdYlGn'
          suffix: '%'
      - col_2:
          min: 0
    data:
      sample_a:
        col_1: 14.32
        col_2: 1.2
      sample_b:
        col_1: 84.84
        col_2: 1.9
```

Note: Use a list of headers in `pconfig` (keys prepended with `-`) to specify the order of columns in the General Statistics table. See the general statistics docs for more information about configuring data for the General Statistics table.

## Separate configuration and data files

It's not always possible or desirable to include MultiQC configuration within a data file. If this is the case, you can add to the MultiQC configuration to specify how input files should be parsed.

As described in the above Data as part of MultiQC config section, this configuration should be held within a section called `custom_data` with a section-specific id. The only difference is that no `data` subsection is given and a search pattern for the given id must be supplied.
Search patterns are added as with any other module. Ensure that the search pattern key is the same as your `custom_data` section ID.

For example, a MultiQC config file could look as follows:

```yaml
# Other MultiQC config stuff here
custom_data:
  example_files:
    file_format: 'tsv'
    section_name: 'Coverage Decay'
    description: 'This plot comes from files accompanied by a multiqc_config.yaml file for configuration'
    plot_type: 'linegraph'
    pconfig:
      id: 'example_coverage_lineplot'
      title: 'Coverage Decay'
      ylab: 'X Coverage'
      ymax: 100
      ymin: 0
sp:
  example_files:
    fn: 'example_files_*'
```

And work with the following data file, `example_files_Sample_1.txt`:

```
0	98.22076066
1	97.96764159
2	97.78227175
3	97.61262195
[...]
```

As mentioned above - if no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code.

# Configuration

## Order of sections

If you have multiple different Custom Content sections, their order will be random and may vary between runs. To avoid this, you can specify an order in your MultiQC config as follows:

```yaml
custom_content:
  order:
    - first_cc_section
    - second_cc_section
```

Each section name should be the ID assigned to that section. You can explicitly set this (see below), or the Custom Content module will automatically assign an ID. To find out what your custom content section ID is, generate a report and click the side navigation to your section. The browser URL should update and show something that looks like this:

```
multiqc_report.html#my_cc_section
```

The section ID is the part after the `#` (`my_cc_section` in the above example).

Note that any Custom Content sections found that are not specified in the config will be placed at the top of the report.

## Section configuration

See below for how these config options can be specified (either within the data file or in a MultiQC config file).
All of these configuration parameters are optional, and MultiQC will do its best to guess sensible defaults if they are not specified. All possible configuration keys and their default values are shown below:

```yaml
id: null               # Unique ID for report section.
section_anchor: <id>   # Used in report section #soft-links
section_name: <id>     # Nice name used for the report section header
section_href: null     # External URL for the data, to find more information
description: null      # Introductory text to be printed under the section header
file_format: null      # File format of the data (eg. csv / tsv)
plot_type: null        # The plot type to visualise the data with.
                       # generalstats | table | bargraph | linegraph | scatter | heatmap | beeswarm
pconfig: {}            # Configuration for the plot.
```

Note that any custom content data found with the same section `id` will be merged into the same report section / plot. The other section configuration keys are merged for each file, with identical keys overwriting what was previously parsed. This approach means that it's possible to have a single file containing data for multiple samples, but it's also possible to have one file per sample and still have all of them summarised.

If you're using `plot_type: 'generalstats'` then a report section will not be created and most of the configuration keys above are ignored.

Data types `generalstats` and `beeswarm` are only possible by setting the above configuration keys (they can't be guessed from the data format).

## Plot configuration

Configuration of specific plots follows the same syntax as used when writing modules. To find out more, please see the later docs. Specifically, the plot config docs for bar graphs, line graphs, scatter plots, tables, beeswarm plots and heatmaps.

Wherever you see `pconfig`, any key can be used within the above syntax.

## Tricky extras

Because of the way this module works, there are a few specifics that can trip you up. Most of these should probably be fixed one day.
Feel free to complain on Gitter or submit a pull request! I'll try to keep a list here to help the wary...

### Differences between Tables and General Stats

Although they're both tables, note that General Stats configures its columns with a list in the `pconfig` scope (see above example). Files that are just tables use `headers` instead.

### First columns in tables are special

The first column in every table is reserved for the sample name. As such, it shouldn't contain data. All header configuration will be ignored for the first column. The only exception is the name: this can be tweaked using the somewhat tricky `col1_header` field in the `pconfig` scope (see the table docs).

## Linting

MultiQC has been developed to be as forgiving as possible and will handle lots of invalid or ignored configuration. This is useful for most users but can make life difficult when getting MultiQC to work with a new custom content format. To help with this, you can run with the `--lint` flag, which will give explicit warnings about anything that is not optimally configured. For example:

```bash
multiqc --lint test_data
```

# Examples

Probably the best way to get to grips with Custom Content is to see some examples. The MultiQC automated testing runs with a bunch of different files, and I try to add to these all the time. You can see these examples here: https://github.com/ewels/MultiQC_TestData/tree/master/data/custom_content

For example, to see a file which generates a table in a report by itself, have a look at `embedded_config/table_headers_mqc.txt` (link).

# Coding with MultiQC

# Writing New Modules

## Introduction

Writing a new module can at first seem a daunting task. However, MultiQC has been written (and refactored) to provide a lot of functionality as common functions. Provided that you are familiar with writing Python and have a read through the guide below, you should be on your way in no time!
If you have any problems, feel free to contact the author - details here: @ewels

## Core modules / plugins

New modules can either be written as part of MultiQC or in a stand-alone plugin. If your module is for a publicly available tool, please add it to the main program and contribute your code back when complete via a pull request. If your module is for something very niche, which no-one else can use, you can write it as part of a custom plugin. The process is almost identical, though it keeps the code bases separate. For more information about this, see the docs about MultiQC Plugins below.

## Linting

MultiQC has been developed to be as forgiving as possible and will handle lots of invalid or ignored code. This is useful most of the time but can be difficult when writing new MultiQC modules (especially during pull-request reviews). To help with this, you can run with the `--lint` flag, which will give explicit warnings about anything that is not optimally configured. For example:

```bash
multiqc --lint test_data
```

Note that the automated MultiQC continuous integration testing runs in this mode, so you will need to pass all lint tests for those checks to pass. This is required for any pull requests.

## Initial setup

### Submodule

MultiQC modules are Python submodules - as such, they need their own directory in `/multiqc/` with an `__init__.py` file. The directory should share its name with the module. To follow common practice, the module code usually then goes in a separate Python file (also with the same name) which is then imported by `__init__.py`:

```python
from __future__ import absolute_import
from .modname import MultiqcModule
```

### Entry points

Once your submodule files are in place, you need to tell MultiQC that they are available as an analysis module. This is done within `setup.py` using entry points. In `setup.py` you will see some code that looks like this:

```python
entry_points = {
    'multiqc.modules.v1': [
        'bismark = multiqc.modules.bismark:MultiqcModule',
        # [...]
    ]
}
```

Copy one of the existing module lines and change it to use your module name. The order is irrelevant, so stick to alphabetical if in doubt. Once this is done, you will need to update your installation of MultiQC:

```bash
python setup.py develop
```

### MultiQC config

So that MultiQC knows what order modules should be run in, you need to add your module to the core config file. In `multiqc/utils/config_defaults.yaml` you should see a list variable called `module_order`. This contains the names of modules in order of precedence. Add your module here in an appropriate position.

### Documentation

Next up, you need to create a documentation file for your module. The reason for this is twofold: firstly, docs are important to help people use, debug and extend MultiQC (you're reading this, aren't you?). Secondly, having the file there with the appropriate YAML front matter will make the module show up on the MultiQC homepage so that everyone knows it exists. This process is automated once the file is added to the core repository.

This docs file should be placed in `docs/modules/<your_module_name>.md` and should have the following structure:

```md
---
Name: Tool Name
URL: http://www.amazing-bfx-tool.com
Description: >
    This amazing tool does some really cool stuff. You can describe it
    here and split onto multiple lines if you want. Not too long though!
---

Your documentation goes here. Feel free to use markdown and write whatever
you think would be helpful. Please avoid using heading levels 1 to 3.
```

Make a reference to this in the YAML front matter at the top of `docs/README.md` - this allows the website to find the file to build the documentation.

### Readme and Changelog

Last but not least, remember to add your new module to the main `README.md` file and `CHANGELOG.md`, so that people know that it's there. Feel free to add your name to the list of credits at the bottom of the readme.
### MultiqcModule Class

If you've copied one of the other entry point statements, it will have ended in `:MultiqcModule` - this tells MultiQC to try to execute a class or function called `MultiqcModule`. To use the helper functions bundled with MultiQC, you should extend this class from `multiqc.modules.base_module.BaseMultiqcModule`. This will give you access to a number of functions on the `self` namespace. For example:

```python
from multiqc.modules.base_module import BaseMultiqcModule

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # Initialise the parent object
        super(MultiqcModule, self).__init__(
            name='My Module',
            anchor='mymod',
            href="http://www.awesome_bioinfo.com/my_module",
            info="is an example analysis module used for writing documentation."
        )
```

Ok, that should be it! The `__init__()` function will now be executed every time MultiQC runs. Try adding a `print("Hello World!")` statement and see if it appears in the MultiQC logs at the appropriate time...

Note that the `__init__` variables are used to create the header, URL link, analysis module credits and description in the report.

### Logging

Last thing - MultiQC modules have a standardised way of producing output, so you shouldn't really use `print()` statements for your Hello World ;) Instead, use the `logger` module as follows:

```python
import logging
log = logging.getLogger(__name__)
# Initialise your class and so on
log.info('Hello World!')
```

Log messages can come in a range of formats:

- `log.debug`: these only show if MultiQC is run in `-v`/`--verbose` mode
- `log.info`: for more important status updates
- `log.warning`: alert the user about problems that don't halt execution
- `log.error` and `log.critical`: not often used, these are for show-stopping problems

## Step 1 - Find log files

The first thing that your module will need to do is to find analysis log files. You can do this by searching for a filename fragment, or a string within the file.
It's possible to search for both (a match on either will return the file) and also to have multiple possible strings. First, add your default patterns to:

```
MULTIQC_ROOT/multiqc/utils/search_patterns.yaml
```

Each search has a YAML key, with one or more search criteria. The YAML key must begin with the name of your module. If you have multiple search patterns for a single module, follow the module name with a forward slash and then any string. For example, see the `fastqc` module search patterns:

```yaml
fastqc/data:
    fn: 'fastqc_data.txt'
fastqc/zip:
    fn: '_fastqc.zip'
```

The following search criteria sub-keys can then be used:

- `fn`: a glob filename pattern
- `fn_re`: a regex filename pattern
- `contents`: a string to match within the file contents (checked line by line)
- `contents_re`: a regex to match within the file contents (checked line by line)
  - NB: the regex must match the entire line (add `.*` to the start and end of the pattern to avoid this)
- `exclude_fn`: a glob filename pattern which will exclude a file if matched
- `exclude_fn_re`: a regex filename pattern which will exclude a file if matched
- `exclude_contents`: a string which will exclude the file if matched within the file contents (checked line by line)
- `exclude_contents_re`: a regex which will exclude the file if matched within the file contents (checked line by line)
- `num_lines`: the number of lines to search through for the `contents` string. Default: all lines.
- `shared`: by default, once a file has been assigned to a module it is not searched again. Specify `shared: true` when your file can be shared between multiple tools (for example, part of a stdout stream).
- `max_filesize`: files larger than the `log_filesize_limit` config key (default: 10MB) are skipped. If you know your files will be smaller than this and need to search by contents, you can specify a value (in bytes) here to skip any files larger than that limit.

Please try to use `num_lines` and `max_filesize` where possible as they will speed up MultiQC execution time.
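To make the matching semantics concrete, here is a simplified standalone sketch of how the filename and contents criteria combine. The helper names (`file_matches`, `any_pattern_matches`) are hypothetical, not MultiQC's actual implementation, which also handles regexes, exclusions and file-size limits:

```python
import fnmatch

def file_matches(pattern, filename, contents):
    """AND logic: every criterion given in a single pattern dict must match."""
    if 'fn' in pattern and not fnmatch.fnmatch(filename, pattern['fn']):
        return False
    if 'contents' in pattern and not any(
            pattern['contents'] in line for line in contents.splitlines()):
        return False
    return True

def any_pattern_matches(patterns, filename, contents):
    """OR logic: a list of pattern dicts matches if any single dict matches."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    return any(file_matches(p, filename, contents) for p in patterns)
```

For example, `any_pattern_matches([{'fn': 'fastqc_data.txt'}, {'fn': '*_fastqc.zip'}], 'sample1_fastqc.zip', '')` returns `True` because the second pattern's glob matches the filename.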
Note that `exclude_` keys are tested after a file is detected with one or more of the other patterns.

For example, two typical modules could specify search patterns as follows:

```yaml
mymod:
    fn: '_myprogram.txt'
myothermod:
    contents: 'This is myprogram v1.3'
```

You can also supply a list of different patterns for a single log file type if needed. If any of the patterns are matched, the file will be returned:

```yaml
mymod:
    - fn: 'mylog.txt'
    - fn: 'different_fn.out'
```

You can use AND logic by specifying keys within a single list item. For example:

```yaml
mymod:
    fn: 'mylog.txt'
    contents: 'mystring'
myothermod:
    - fn: 'different_fn.out'
      contents: 'This is myprogram v1.3'
    - fn: 'another.txt'
      contents: 'What are these files anyway?'
```

Here, a file must have the filename `mylog.txt` and contain the string `mystring`.

You can match subsets of files by using `exclude_` keys as follows:

```yaml
mymod:
    fn: '*.myprog.txt'
    exclude_fn: 'not_these_*'
myothermod:
    fn: 'mylog.txt'
    exclude_contents:
        - 'trimmed'
        - 'sorted'
```

Note that the `exclude_` patterns can have either a single value or a list of values. They are always considered using OR logic - any matches will reject the file.

Remember that users can overwrite these defaults in their own config files. This is helpful as people have weird and wonderful processing pipelines with their own conventions.

Once your strings are added, you can find files in your module with the base function `self.find_log_files()`, using the key you set in the YAML:

```python
self.find_log_files('mymod')
```

This function yields a dictionary with various information about each matching file.
The `f` key contains the contents of the matching file:

```python
# Find all files for mymod
for myfile in self.find_log_files('mymod'):
    print( myfile['f'] )       # File contents
    print( myfile['s_name'] )  # Sample name (from cleaned filename)
    print( myfile['fn'] )      # Filename
    print( myfile['root'] )    # Directory file was in
```

If `filehandles=True` is specified, the `f` key contains a file handle instead:

```python
for f in self.find_log_files('mymod', filehandles=True):
    # f['f'] is now a filehandle instead of contents
    for l in f['f']:
        print( l )
```

This is good if the file is large, as Python doesn't read the entire file into memory in one go.

## Step 2 - Parse data from the input files

What most MultiQC modules do once they have found matching analysis files is to pass the matched file contents to another function, responsible for parsing the data from the file. How this parsing is done will depend on the format of the log file and the type of data being read. See below for a basic example, based loosely on the preseq module:

```python
class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # [...]
        self.mod_data = dict()
        for f in self.find_log_files('mymod'):
            self.mod_data[f['s_name']] = self.parse_logs(f['f'])

    def parse_logs(self, f):
        data = {}
        for l in f.splitlines():
            s = l.split()
            data[s[0]] = s[1]
        return data
```

### Filtering by parsed sample names

MultiQC users can use the `--ignore-samples` flag to skip sample names that match specific patterns. As sample names are generated in a different way by every module, this filter has to be applied after log parsing. There is a core function to do this task - assuming that your data is in a dictionary with the first key as sample name, pass it through the `self.ignore_samples` function as follows:

```python
self.yourdata = self.ignore_samples(self.yourdata)
```

This will remove any dictionary keys where the sample name matches a user pattern.

### No files found

If your module cannot find any matching files, it needs to raise an exception of type `UserWarning`.
This tells the core MultiQC program that the module found no data. For example:

```python
if len(self.mod_data) == 0:
    raise UserWarning
```

Note that this has to be raised as early as possible, so that it halts the module progress. For example, if no logs are found then the module should not create any files or try to do any computation.

### Custom sample names

Typically, sample names are taken from cleaned log filenames (the default `f['s_name']` value returned). However, if possible, it's better to use the name of the input file (allowing for concatenated log files). To do this, you should use the `self.clean_s_name()` function, as this will prepend the directory name if requested on the command line:

```python
input_fname = s[3]  # Or parsed however
s_name = self.clean_s_name(input_fname, f['root'])
```

This function has already been applied to the contents of `f['s_name']`. `self.clean_s_name()` must be used on sample names parsed from the file contents. Without it, features such as prepending directories (`--dirs`) will not work.

### Identical sample names

If modules find samples with identical names, then the previous sample is overwritten. It's good to print a log statement when this happens, for debugging. However, most of the time it makes sense - programs often create log files and print to stdout, for example.

```python
if f['s_name'] in self.bowtie_data:
    log.debug("Duplicate sample name found! Overwriting: {}".format(f['s_name']))
```

### Printing to the sources file

Finally, once you've found your file, you'll want to add this information to the `multiqc_sources.txt` file in the MultiQC report data directory. This lists every sample name and the file from which the data came. This is especially useful if sample names are being overwritten, as it lists the source used. This code is typically written immediately after the above warning.
If you've used the `self.find_log_files` function, writing to the sources file is as simple as passing the log file variable to the `self.add_data_source` function:

```python
for f in self.find_log_files('mymod'):
    self.add_data_source(f)
```

If you have different files for different sections of the module, or are customising the sample name, you can tweak the fields. The default arguments are as shown:

```python
self.add_data_source(f=None, s_name=None, source=None, module=None, section=None)
```

## Step 3 - Adding to the general statistics table

Now that you have your parsed data, you can start inserting it into the MultiQC report. At the top of every report is the 'General Statistics' table. This contains metrics from all modules, allowing cross-module comparison.

There is a helper function to add your data to this table. It can take a lot of configuration options, but most have sensible defaults. At its simplest, it works as follows:

```python
data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.general_stats_addcols(data)
```

To give more informative table headers and configure things like data scales and colour schemes, you can supply an extra dict:

```python
headers = OrderedDict()
headers['first_col'] = {
    'title': 'First',
    'description': 'My First Column',
    'scale': 'RdYlGn-rev'
}
headers['second_col'] = {
    'title': 'Second',
    'description': 'My Second Column',
    'max': 100,
    'min': 0,
    'scale': 'Blues',
    'suffix': '%'
}
self.general_stats_addcols(data, headers)
```

Here are all options for `headers`, with defaults:

```python
headers['name'] = {
    'namespace': '',                # Module name. Auto-generated for core modules in General Statistics.
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'scale': 'GnBu',                # Colour scale for colour coding. Set to False to disable.
    'suffix': None,                 # Suffix for value (eg. '%')
    'format': '{:,.1f}',            # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False,                # Set to True to hide the column on page load
    'placement': 1000.0,            # Alter the default ordering of columns in the table
}
```

- `namespace`: this prepends the column title in the mouse hover: *Namespace: Title*. The 'Configure Columns' modal displays this under the 'Group' column. It's automatically generated for core modules in the General Statistics table, though this can be overwritten (useful for example with custom content).
- `scale`: colour scales are the names of ColorBrewer palettes. See below for available scales. Add `-rev` to the name of a colour scale to reverse it. Set to `False` to disable colouring and background bars.
- `shared_key`: any string can be specified here; if other columns are found that share the same key, a consistent colour scheme and data scale will be used in the table. Typically this is set to things like `read_count`, so that the read count in a sample can be seen varying across analysis modules.
- `modify`: a Python lambda function to change the data in some way when it is inserted into the table.
- `hidden`: setting this to `True` will hide the column when the report loads. It can then be shown through the 'Configure Columns' modal in the report. This can be useful for data that is only sometimes of interest. For example, some modules show "percentage aligned" on page load but hide "number of reads aligned".
- `placement`: if you feel that the results from your module should appear on the left side of the table, set this value to less than 1000; to move the column right, set it greater than 1000. This value can be any float.

The typical use for the `modify` lambda is to divide large numbers such as read counts, to make them easier to interpret. If handling read counts, there are three config variables that should be used to allow users to change the multiplier for read counts: `read_count_multiplier`, `read_count_prefix` and `read_count_desc`. For example:

```python
'title': '{} Reads'.format(config.read_count_prefix),
'description': 'Number of reads ({})'.format(config.read_count_desc),
'modify': lambda x: x * config.read_count_multiplier,
```

Similar config options apply for base pairs: `base_count_multiplier`, `base_count_prefix` and `base_count_desc`.

A third parameter can be passed to this function, `namespace`. This is usually not needed - MultiQC automatically takes the name of the module that is calling the function and uses this. However, sometimes it can be useful to overwrite this.

### Table colour scales

Colour scales are taken from ColorBrewer2. Colour scales can be reversed by adding the suffix `-rev` to the name. For example, `RdYlGn-rev`. The following scales are available:

## Step 4 - Writing data to a file

In addition to printing data to the General Stats, MultiQC modules typically also write text files to allow people to easily use the data in downstream applications. This also gives the opportunity to output additional data that may not be appropriate for the General Statistics table.

Again, there is a base class function to help you with this - just supply it with a dictionary and a filename:

```python
data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.write_data_file(data, 'multiqc_mymod')
```

If your output has a lot of columns, you can supply the additional argument `sort_cols=True` to have the columns alphabetically sorted.

This function will also pay attention to the default / command-line supplied data format and behave accordingly, so the written file could be a tab-separated file (default), JSON or YAML. Note that any keys with more than 2 levels of nesting will be ignored when being written to tab-separated files.
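To illustrate the flattening described above, here's a minimal standalone sketch of writing such a `{sample: {column: value}}` dictionary as tab-separated text. `write_data_tsv` is a hypothetical helper, not MultiQC's own implementation:

```python
import csv
import io

def write_data_tsv(data, sort_cols=False):
    """Flatten {sample: {column: value}} into a tab-separated table,
    mirroring the behaviour described above (sketch only)."""
    cols = []
    for row in data.values():
        for c in row:
            if c not in cols:
                cols.append(c)   # preserve first-seen column order
    if sort_cols:
        cols = sorted(cols)      # optional alphabetical sorting
    out = io.StringIO()
    w = csv.writer(out, delimiter='\t', lineterminator='\n')
    w.writerow(['Sample'] + cols)
    for s_name in sorted(data):
        w.writerow([s_name] + [data[s_name].get(c, '') for c in cols])
    return out.getvalue()
```

Running this on the `data` dict from the example above would produce a header row `Sample`, `first_col`, `second_col` followed by one row per sample.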
## Step 5 - Create report sections

Great! It's time to start creating sections of the report with more information. To do this, use the `self.add_section()` helper function:

```python
self.add_section(
    name = 'First Module Section',
    anchor = 'mymod-first',
    description = 'My amazing module output, from the first section',
    helptext = "If you're not sure _how_ to interpret the data, we can help!",
    plot = bargraph.plot(data)
)
self.add_section(
    name = 'Second Module Section',
    anchor = 'mymod-second',
    plot = linegraph.plot(data2)
)
self.add_section(
    content = '<p>Some custom HTML.</p>'
)
```

These will automatically be labelled and linked in the navigation (unless the module has only one section or `name` is not specified). Note that `description` and `helptext` are processed as Markdown by default. This can be disabled by passing `autoformat=False` to the function.

## Step 6 - Plot some data

Ok, you have some data; now the fun bit - visualising it! Each of the plot types is described in the Plotting Functions section of the docs.

## Appendices

### User configuration

Instead of hardcoding defaults, it's a great idea to allow users to configure the behaviour of MultiQC module code. It's pretty easy to use the built-in MultiQC configuration settings to do this, so that users can set up their config as described above in the docs.

To do this, just assume that your configuration variables are available in the MultiQC `config` module and have sensible defaults. For example:

```python
from multiqc import config

mymod_config = getattr(config, 'mymod_config', {})
my_custom_config_var = mymod_config.get('my_custom_config_var', 5)
```

You now have a variable `my_custom_config_var` with a default value of 5, but that can be configured by a user as follows:

```yaml
mymod_config:
    my_custom_config_var: 200
```

Please be sure to use a unique top-level config name to avoid clashes - prefixing with your module name is a good idea, as in the example above. Keep all module config options under the same top-level name for clarity.
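Here's the same lookup pattern as a self-contained sketch, using a stand-in object in place of the real `multiqc.config` module so the fallback behaviour can be seen outside of a MultiQC run:

```python
# Stand-in for `from multiqc import config` - purely for illustration.
class _FakeConfig:
    mymod_config = {'my_custom_config_var': 200}  # as if set in a user's YAML config

config = _FakeConfig()

# Note that the attribute name is passed to getattr() as a string
mymod_config = getattr(config, 'mymod_config', {})
my_custom_config_var = mymod_config.get('my_custom_config_var', 5)  # -> 200 (user value)

# An option the user didn't configure falls back to its default
other_var = getattr(config, 'othermod_config', {}).get('some_var', 42)  # -> 42 (default)
```

The two-level fallback (missing top-level key, then missing sub-key) means the module works whether the user configured nothing, everything, or only some options.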
Finally, don't forget to document the usage of your module-specific configuration in `docs/modules/mymodule.md` so that people know how to use it.

### Profiling Performance

It's important that MultiQC runs quickly and efficiently, especially on big projects with large numbers of samples. The recommended method to check this is by using cProfile to profile the code execution. To do this, run MultiQC as follows:

```bash
python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc -f .
```

You can create a `.bashrc` alias to make this easier to run:

```bash
alias profile_multiqc='python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc '
profile_multiqc -f .
```

MultiQC should run as normal, but produce the additional binary file `multiqc_profile.prof`. This can then be visualised with software such as SnakeViz. To install SnakeViz and visualise the results, do the following:

```bash
pip install snakeviz
snakeviz multiqc_profile.prof
```

A web page should open where you can explore the execution times of different nested functions. It's a good idea to run MultiQC with a comparable number of results from other tools (eg. FastQC) to have a reference to compare against for how long the code should take to run.

### Adding Custom CSS / Javascript

If you would like module-specific CSS and / or JavaScript added to the template, just add to the `self.css` and `self.js` dictionaries that come with the `BaseMultiqcModule` class. The key should be the filename that you want your file to have in the generated report folder (this is ignored in the default template, which includes the content of the file directly in the HTML). The dictionary value should be the path to the desired file.
For example, see how it's done in the FastQC module:

```python
self.css = {
    'assets/css/multiqc_fastqc.css':
        os.path.join(os.path.dirname(__file__), 'assets', 'css', 'multiqc_fastqc.css')
}
self.js = {
    'assets/js/multiqc_fastqc.js':
        os.path.join(os.path.dirname(__file__), 'assets', 'js', 'multiqc_fastqc.js')
}
```

# Plotting Functions

MultiQC plotting functions are held within `multiqc.plots` submodules. To use them, simply import the modules you want, eg.:

```python
from multiqc.plots import bargraph, linegraph
```

Once you've done that, you will have access to the corresponding plotting functions:

```python
bargraph.plot()
linegraph.plot()
scatter.plot()
table.plot()
beeswarm.plot()
heatmap.plot()
```

These have been designed to work in a similar manner to each other - you pass a data structure to them, along with optional extras such as categories and configuration options, and they return a string of HTML to add to the report. You can add this to the module introduction or sections as described above. For example:

```python
self.add_section(
    name = 'Module Section',
    anchor = 'mymod_section',
    description = 'This plot shows some really nice data.',
    helptext = 'This longer string (can be **markdown**) helps explain how to interpret the plot',
    plot = bargraph.plot(self.parsed_data, categories, pconfig)
)
```

## Common options

All plots should, as a minimum, have a config with an `id` and a `title`. MultiQC is written to work with sensible defaults, so won't complain if you don't supply these, but it's good practice for usability (the ID is used as a filename when exporting plots, and all plots should have a title when exported). Plot titles should use the format *Module name: Plot name* (this is partly for ease of use within MegaQC and other downstream tools).

## Bar graphs

Simple data can be plotted in bar graphs. Many MultiQC modules make use of stacked bar graphs. Here, the `bargraph.plot()` function comes to the rescue.
A basic example is as follows:

```python
from multiqc.plots import bargraph

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
html_content = bargraph.plot(data)
```

To specify the order of categories in the plot, you can supply a list of dictionary keys. This can also be used to exclude a key from the plot:

```python
cats = ['aligned', 'not_aligned']
html_content = bargraph.plot(data, cats)
```

If `cats` is given as a dict instead of a list, you can specify a nice name and a colour too. Make it an OrderedDict to specify the order:

```python
from collections import OrderedDict
cats = OrderedDict()
cats['aligned'] = {
    'name': 'Aligned Reads',
    'color': '#8bbc21'
}
cats['not_aligned'] = {
    'name': 'Unaligned Reads',
    'color': '#f7a35c'
}
```

Finally, a third variable should be supplied with configuration variables for the plot. The defaults are as follows:

```python
config = {
    # Building the plot
    'id': '<random string>',                 # HTML ID used for plot
    'cpswitch': True,                        # Show the 'Counts / Percentages' switch?
    'cpswitch_c_active': True,               # Initial display with 'Counts' specified? False for percentages.
    'cpswitch_counts_label': 'Counts',       # Label for 'Counts' button
    'cpswitch_percent_label': 'Percentages', # Label for 'Percentages' button
    'logswitch': False,                      # Show the 'Log10' switch?
    'logswitch_active': False,               # Initial display with 'Log10' active?
    'logswitch_label': 'Log10',              # Label for 'Log10' button
    'hide_zero_cats': True,                  # Hide categories where data for all samples is 0
    # Customising the plot
    'title': None,                           # Plot title - should be in format "Module Name: Plot Title"
    'xlab': None,                            # X axis label
    'ylab': None,                            # Y axis label
    'ymax': None,                            # Max y limit
    'ymin': None,                            # Min y limit
    'yCeiling': None,                        # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,                          # Minimum value for automatic axis limit
    'yMinRange': None,                       # Minimum range for axis
    'yDecimals': True,                       # Set to false to only show integer labels
    'ylab_format': None,                     # Format string for y axis labels. Defaults to {value}
    'stacking': 'normal',                    # Set to None to have category bars side by side
    'use_legend': True,                      # Show / hide the legend
    'click_func': None,                      # Javascript function to be called when a point is clicked
    'cursor': None,                          # CSS mouse cursor type.
    'tt_decimals': 0,                        # Number of decimal places to use in the tooltip number
    'tt_suffix': '',                         # Suffix to add after tooltip number
    'tt_percentages': True,                  # Show the percentages of each count in the tooltip
}
```

The keys `id` and `title` should always be passed as a minimum. The `id` is used for the plot name when exporting; if left unset, the Plot Export panel will call the file something like `mqc_hcplot_gtucwirdzx.png` (with some other random string). Plots should always have titles, especially as they can stand by themselves when exported. The title should have the format *Module name: Plot Name*.

### Switching datasets

It's possible to have a single plot with buttons to switch between different datasets. To do this, give a list of data objects (in the same formats as described above).
Also add the following config options to supply names to the buttons:

```python
config = {
    'data_labels': ['Reads', 'Bases']
}
```

You can also customise the y-axis label and min/max values for each dataset:

```python
config = {
    'data_labels': [
        {'name': 'Reads', 'ylab': 'Number of Reads'},
        {'name': 'Bases', 'ylab': 'Number of Base Pairs', 'ymax': 100}
    ]
}
```

If supplying multiple datasets, you can also supply a list of category objects. Make sure that they are in the same order as the data. Categories should contain data keys, so if you're supplying a list of two datasets, you should supply a list of two sets of keys for the categories. MultiQC will try to guess categories from the data keys if categories are missing.

For example, with two datasets supplied as above:

```python
cats = [
    ['aligned_reads', 'unaligned_reads'],
    ['aligned_base_pairs', 'unaligned_base_pairs'],
]
```

Or with additional customisation such as name and colour:

```python
from collections import OrderedDict
cats = [OrderedDict(), OrderedDict()]
cats[0]['aligned_reads'] =        {'name': 'Aligned Reads',        'color': '#8bbc21'}
cats[0]['unaligned_reads'] =      {'name': 'Unaligned Reads',      'color': '#f7a35c'}
cats[1]['aligned_base_pairs'] =   {'name': 'Aligned Base Pairs',   'color': '#8bbc21'}
cats[1]['unaligned_base_pairs'] = {'name': 'Unaligned Base Pairs', 'color': '#f7a35c'}
```

### Interactive / Flat image plots

Note that the `bargraph.plot()` function can generate both interactive JavaScript (HighCharts) powered report plots and flat image plots made using MatPlotLib. This choice is made within the function based on config variables such as the number of data series and command-line flags. Both plot types should come out looking pretty much identical. If you spot something that's missing in the flat image plots, let me know.

## Line graphs

This base function works much like the above, but for two-dimensional data, to produce line graphs.
It expects a dictionary in the following format:

```python
from multiqc.plots import linegraph

data = {
    'sample 1': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    },
    'sample 2': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    }
}
html_content = linegraph.plot(data)
```

Additionally, a config dict can be supplied. The defaults are as follows:

```python
from multiqc.plots import linegraph

config = {
    # Building the plot
    'smooth_points': None,           # Supply a number to limit number of points / smooth data
    'smooth_points_sumcounts': True, # Sum counts in bins, or average? Can supply list for multiple datasets
    'id': '<random string>',         # HTML ID used for plot
    'categories': False,             # Set to True to use x values as categories instead of numbers.
    'colors': dict(),                # Provide dict with keys = sample names and values colours
    'extra_series': None,            # See section below
    # Plot configuration
    'title': None,                   # Plot title - should be in format "Module Name: Plot Title"
    'xlab': None,                    # X axis label
    'ylab': None,                    # Y axis label
    'xCeiling': None,                # Maximum value for automatic axis limit (good for percentages)
    'xFloor': None,                  # Minimum value for automatic axis limit
    'xMinRange': None,               # Minimum range for axis
    'xmax': None,                    # Max x limit
    'xmin': None,                    # Min x limit
    'xLog': False,                   # Use log10 x axis?
    'xDecimals': True,               # Set to false to only show integer labels
    'yCeiling': None,                # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,                  # Minimum value for automatic axis limit
    'yMinRange': None,               # Minimum range for axis
    'ymax': None,                    # Max y limit
    'ymin': None,                    # Min y limit
    'yLog': False,                   # Use log10 y axis?
    'yDecimals': True,               # Set to false to only show integer labels
    'yPlotBands': None,              # Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    'xPlotBands': None,              # Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    'yPlotLines': None,              # Highlighted background lines. See http://api.highcharts.com/highcharts#yAxis.plotLines
    'xPlotLines': None,              # Highlighted background lines. See http://api.highcharts.com/highcharts#xAxis.plotLines
    'xLabelFormat': '{value}',       # Format string for the axis labels
    'yLabelFormat': '{value}',       # Format string for the axis labels
    'tt_label': '{point.x}: {point.y:.2f}', # Use to customise tooltip label, eg. '{point.x} base pairs'
    'pointFormat': None,             # Replace the default HTML for the entire tooltip label
    'click_func': None,              # Javascript function to be called when a point is clicked
    'cursor': None,                  # CSS mouse cursor type. Defaults to pointer when 'click_func' specified
    'reversedStacks': False,         # Reverse the order of the category stacks. Defaults to True for plots with the Log10 option
}
html_content = linegraph.plot(data, config)
```

The keys `id` and `title` should always be passed as a minimum. The `id` is used for the plot name when exporting; if left unset, the Plot Export panel will call the file something like `mqc_hcplot_gtucwirdzx.png` (with some other random string). Plots should always have titles, especially as they can stand by themselves when exported. The title should have the format *Module name: Plot Name*.

### Switching datasets

You can also have a single plot with buttons to switch between different datasets. To do this, just supply a list of data dicts instead (in the same formats as described above). Also add the following config options to supply names to the buttons and graph labels:

```python
config = {
    'data_labels': [
        {'name': 'DS 1', 'ylab': 'Dataset 1', 'xlab': 'x Axis 1'},
        {'name': 'DS 2', 'ylab': 'Dataset 2', 'xlab': 'x Axis 2'}
    ]
}
```

All of these config values are optional; the function will default to sensible values if things are missing. See the cutadapt module plots for an example of this in action.

### Additional data series

Sometimes, it's good to be able to specify specific data series manually. To do this, use `config['extra_series']`. For a single extra line this can be a dict (as below).
For multiple lines, use a list of dicts. For multiple dataset plots, use a list of lists of dicts. For example, to add a dotted x = y reference line:

from multiqc.plots import linegraph
config = {
    'extra_series': {
        'name': 'x = y',
        'data': [[0, 0], [max_x_val, max_y_val]],
        'dashStyle': 'Dash',
        'lineWidth': 1,
        'color': '#000000',
        'marker': { 'enabled': False },
        'enableMouseTracking': False,
        'showInLegend': False,
    }
}
html_content = linegraph.plot(data, config)

## Scatter Plots

Scatter plots work in almost exactly the same way as line plots. Most (if not all) config options are shared between the two. The data structure is similar but not identical:

from multiqc.plots import scatter
data = {
    'sample 1': {
        'x': '<x val>',
        'y': '<y val>'
    },
    'sample 2': {
        'x': '<x val>',
        'y': '<y val>'
    }
}
html_content = scatter.plot(data)

If you want more than one data point per sample, you can supply a list of dictionaries instead. You can also optionally specify point colours and sample name suffixes (these are appended to the sample name):

data = {
    'sample 1': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#a6cee3', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#1f78b4', 'name': 'Type 2' }
    ],
    'sample 2': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#b2df8a', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#33a02c', 'name': 'Type 2' }
    ]
}

Remember that MultiQC reports can contain large numbers of samples, so this plot type is not suitable for large quantities of data - 20,000 genes might look good for one sample, but when someone runs MultiQC with 500 samples, it will crash the browser and be impossible to interpret.

See the above docs about line plots for most config options.
The scatter plot has a handful of unique ones in addition:

pconfig = {
    'marker_colour': 'rgba(124, 181, 236, .5)', # string, base colour of points (recommend rgba / semi-transparent)
    'marker_size': 5,                           # int, size of points
    'marker_line_colour': '#999',               # string, colour of point border
    'marker_line_width': 1,                     # int, width of point border
    'square': False                             # Force the plot to stay square? (Maintain aspect ratio)
}

## Creating a table

Tables work just like the functions above (most like the bar graph function). As a minimum, the function takes a dictionary containing data - the first keys will be sample names (row headers) and each key contained within will be a table column header.

You can also supply a list of key names to restrict the data in the table to certain keys / columns. This also specifies the order that columns should be displayed in. For more customisation, the headers can be supplied as a dictionary. Each key should match the keys used in the data dictionary, but values can customise the output. If you want to specify the order of the columns, you must use an OrderedDict.

Finally, the function accepts a config dictionary as a third parameter. This can set global options for the table (eg. a title) and can also hold default values to customise the output of all table columns.

The default header keys are:

single_header = {
    'namespace': '',               # Name for grouping. Prepends desc and is in Config Columns modal
    'title': '[ dict key ]',       # Short title, table column title
    'description': '[ dict key ]', # Longer description, goes in mouse hover text
    'max': None,                   # Maximum value in range, for bar / colour coding
    'min': None,                   # Minimum value in range, for bar / colour coding
    'ceiling': None,               # Maximum value for automatic bar limit
    'floor': None,                 # Minimum value for automatic bar limit
    'minRange': None,              # Minimum range for automatic bar
    'scale': 'GnBu',               # Colour scale for colour coding. False to disable.
    'colour': '<auto>',            # Colour for column grouping
    'suffix': None,                # Suffix for value (eg. '%')
    'format': '{:,.1f}',           # Value format string - default 1 decimal place
    'shared_key': None,            # See below for description
    'modify': None,                # Lambda function to modify values
    'hidden': False                # Set to True to hide the column on page load
}

A third parameter can be specified with settings for the whole table:

table_config = {
    'namespace': '',                           # Name for grouping. Prepends desc and is in Config Columns modal
    'id': '<random string>',                   # ID used for the table
    'table_title': '<table id>',               # Title of the table. Used in the column config modal
    'save_file': False,                        # Whether to save the table data to a file
    'raw_data_fn': 'multiqc_<table_id>_table', # File basename to use for raw data file
    'sortRows': True,                          # Whether to sort rows alphabetically
    'col1_header': 'Sample Name',              # The header used for the first column
    'no_beeswarm': False                       # Force a table to always be plotted (beeswarm by default if many rows)
}

Header keys such as max, min and scale can also be specified in the table config. These will then be applied to all columns.

A very basic example is shown below:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
    }
}
table_html = table.plot(data)

A more complicated version with ordered columns, defaults and column-specific settings (eg. no decimal places):

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
        'aligned_percent': 98.563952271
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
        'aligned_percent': 14.820411484
    }
}
headers = OrderedDict()
headers['aligned_percent'] = {
    'title': '% Aligned',
    'description': 'Percentage of reads that aligned',
    'suffix': '%',
    'max': 100,
    'format': '{:,.0f}' # No decimal places please
}
headers['aligned'] = {
    'title': '{} Aligned'.format(config.read_count_prefix),
    'description': 'Aligned Reads ({})'.format(config.read_count_desc),
    'shared_key': 'read_count',
    'modify': lambda x: x * config.read_count_multiplier
}
config = {
    'namespace': 'My Module',
    'min': 0,
    'scale': 'GnBu'
}
table_html = table.plot(data, headers, config)

### Table decimal places

You can customise how many decimal places a number has by using the format config key for that column. The default format string is '{:,.1f}', which specifies a float number with a single decimal place. To remove decimals use '{:,.0f}'. To have two decimal places, use '{:,.2f}'.

### Table colour scales

Colour scales are taken from ColorBrewer2. Colour scales can be reversed by adding the suffix -rev to the name. For example, RdYlGn-rev.

## Beeswarm plots (dot plots)

Beeswarm plots work from the exact same data structure as tables, so the usage is just the same. Except instead of calling table, call beeswarm:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
beeswarm_html = beeswarm.plot(data)

The function also accepts the same headers and config parameters.

## Heatmaps

Heatmaps expect data in the structure of a list of lists. Then, a list of sample names for the x-axis, and optionally for the y-axis (defaults to the same as the x-axis).
heatmap.plot(data, xcats, ycats, pconfig)

A simple example:

hmdata = [
    [0.9, 0.87, 0.73, 0.6, 0.2, 0.3],
    [0.87, 1, 0.7, 0.6, 0.9, 0.3],
    [0.73, 0.8, 1, 0.6, 0.9, 0.3],
    [0.6, 0.8, 0.7, 1, 0.9, 0.3],
    [0.2, 0.8, 0.7, 0.6, 1, 0.3],
    [0.3, 0.8, 0.7, 0.6, 0.9, 1],
]
names = [ 'one', 'two', 'three', 'four', 'five', 'six' ]
hm_html = heatmap.plot(hmdata, names)

Much like the other plots, you can change the way that the heatmap looks using a config dictionary:

pconfig = {
    'title': None,                # Plot title - should be in format "Module Name: Plot Title"
    'xTitle': None,               # X-axis title
    'yTitle': None,               # Y-axis title
    'min': None,                  # Minimum value (default: auto)
    'max': None,                  # Maximum value (default: auto)
    'square': True,               # Force the plot to stay square? (Maintain aspect ratio)
    'colstops': [],               # Scale colour stops. See below.
    'reverseColors': False,       # Reverse the order of the colour axis
    'decimalPlaces': 2,           # Number of decimal places for tooltip
    'legend': True,               # Colour axis key enabled or not
    'borderWidth': 0,             # Border width between cells
    'datalabels': True,           # Show values in each cell. Defaults True when less than 20 samples.
    'datalabel_colour': '<auto>', # Colour of text for values. Defaults to auto contrast.
}

The colour stops are a bit special and can be used to define a custom colour scheme. These should be defined as a list of lists, with a number between 0 and 1 and a HTML colour. The default is RdYlBu from ColorBrewer:

pconfig = {
    'colstops': [
        [0, '#313695'],
        [0.1, '#4575b4'],
        [0.2, '#74add1'],
        [0.3, '#abd9e9'],
        [0.4, '#e0f3f8'],
        [0.5, '#ffffbf'],
        [0.6, '#fee090'],
        [0.7, '#fdae61'],
        [0.8, '#f46d43'],
        [0.9, '#d73027'],
        [1, '#a50026'],
    ]
}

## Javascript Functions

The javascript bundled in the default MultiQC template has a number of helper functions to make your life easier.

NB: The MultiQC Python functions make use of these, so it's very unlikely that you'll need to use any of this. But it's here for reference.
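Before moving on to the javascript internals, here is a compact sketch of how the Python plot data structures described above are typically assembled. Everything here is illustrative - the sample names, values and plot ID are invented, and the final plot call is commented out because it needs a MultiQC report context:

```python
# Data for linegraph.plot(): {sample_name: {x_value: y_value}}
data = {
    'sample 1': {0: 0.0, 10: 0.4, 20: 0.9},
    'sample 2': {0: 0.0, 10: 0.3, 20: 0.7},
}

config = {
    'id': 'mymod_coverage_plot',     # hypothetical HTML ID, used for export filenames
    'title': 'My Module: Coverage',  # "Module Name: Plot Title" format
    'ylab': 'Fraction covered',
    'xlab': 'Position',
}

# from multiqc.plots import linegraph
# html_content = linegraph.plot(data, config)
```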
### Plotting line graphs

plot_xy_line_graph (target, ds)

Plots a line graph with multiple series of (x,y) data pairs. Used by the linegraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'xy_line';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];

Multiple datasets can be added in the ['datasets'] array. The supplied variable ds specifies which is plotted (defaults to 0).

Available config options with default vars:

config = {
    title: undefined,          // Plot title
    xlab: undefined,           // X axis label
    ylab: undefined,           // Y axis label
    xCeiling: undefined,       // Maximum value for automatic axis limit (good for percentages)
    xFloor: undefined,         // Minimum value for automatic axis limit
    xMinRange: undefined,      // Minimum range for axis
    xmax: undefined,           // Max x limit
    xmin: undefined,           // Min x limit
    xDecimals: true,           // Set to false to only show integer labels
    yCeiling: undefined,       // Maximum value for automatic axis limit (good for percentages)
    yFloor: undefined,         // Minimum value for automatic axis limit
    yMinRange: undefined,      // Minimum range for axis
    ymax: undefined,           // Max y limit
    ymin: undefined,           // Min y limit
    yDecimals: true,           // Set to false to only show integer labels
    yPlotBands: undefined,     // Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    xPlotBands: undefined,     // Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    tt_label: '{point.x}: {point.y:.2f}', // Use to customise tooltip label, eg. '{point.x} base pairs'
    pointFormat: undefined,    // Replace the default HTML for the entire tooltip label
    click_func: function(){},  // Javascript function to be called when a point is clicked
    cursor: undefined          // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
}

An example of the markup expected, with the function being called:

<div id="my_awesome_line_graph" class="hc-plot"></div>
<script type="text/javascript">
mqc_plots['#my_awesome_line_graph']['plot_type'] = 'xy_line';
mqc_plots['#my_awesome_line_graph']['datasets'] = [
    { name: 'Sample 1', data: [[1, 1.5], [1.5, 3.1], [2, 6.4]] },
    { name: 'Sample 2', data: [[1, 1.7], [1.5, 4.3], [2, 8.4]] },
];
mqc_plots['#my_awesome_line_graph']['config'] = { "title": "Best Plot Ever", "ylab": "Pings", "xlab": "Pongs" };
$(function () {
    plot_xy_line_graph('#my_awesome_line_graph');
});
</script>

### Plotting bar graphs

plot_stacked_bar_graph (target, ds)

Plots a bar graph with multiple series containing multiple categories. Used by the bargraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'bar_graph';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];
mqc_plots[target]['samples'];

All available config options with default vars:

config = {
title: undefined,           // Plot title
xlab: undefined,            // X axis label
ylab: undefined,            // Y axis label
ymax: undefined,            // Max y limit
ymin: undefined,            // Min y limit
yDecimals: true,            // Set to false to only show integer labels
ylab_format: undefined,     // Format string for y axis labels. Defaults to {value}
stacking: 'normal',         // Set to null to have category bars side by side (None in python)
xtype: 'linear',            // Axis type. 'linear' or 'logarithmic'
use_legend: true,           // Show / hide the legend
click_func: undefined,      // Javascript function to be called when a point is clicked
cursor: undefined,          // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
tt_percentages: true,       // Show the percentages of each count in the tooltip
reversedStacks: false,      // Reverse the order of the categories in the stack.
}

An example of the markup expected, with the function being called:

<div id="my_awesome_bar_plot" class="hc-plot"></div>
<script type="text/javascript">
mqc_plots['#my_awesome_bar_plot']['plot_type'] = 'bar_graph';
mqc_plots['#my_awesome_bar_plot']['samples'] = ['Sample 1', 'Sample 2']
mqc_plots['#my_awesome_bar_plot']['datasets'] = [{"data": [4, 7], "name": "Passed Test"}, {"data": [2, 3], "name": "Failed Test"}]
mqc_plots['#my_awesome_bar_plot']['config'] = {
"title": "My Awesome Plot",
"ylab": "# Observations",
"ymin": 0,
"stacking": "normal"
};
$(function () { plot_stacked_bar_graph("#my_awesome_bar_plot"); });
</script>

### Switching counts and percentages

If you're using the plotting functions above, it's easy to add a button which switches between percentages and counts. Just add the following HTML above your plot:

<div class="btn-group switch_group">
<button class="btn btn-default btn-sm active" data-action="set_numbers" data-target="#my_plot">Counts</button>
<button class="btn btn-default btn-sm" data-action="set_percent" data-target="#my_plot">Percentages</button>
</div>

NB: This markup is generated automatically by the Python bargraph.plot() function.

### Switching plot datasets

Much like the counts / percentages buttons above, you can add a button which switches the data displayed in a single plot. Make sure that both datasets are stored in named javascript variables, then add the following markup:

<div class="btn-group switch_group">
<button class="btn btn-default btn-sm active" data-action="set_data" data-ylab="First Data" data-newdata="data_var_1" data-target="#my_plot">Data 1</button>
<button class="btn btn-default btn-sm" data-action="set_data" data-ylab="Second Data" data-newdata="data_var_2" data-target="#my_plot">Data 2</button>
</div>

Note the CSS class active, which specifies which button is 'pressed' on page load. data-ylab and data-xlab can be used to specify the new axis labels. data-newdata should be the name of the javascript object with the new data to be plotted and data-target should be the CSS selector of the plot to change.

### Custom event triggers

Some of the events that take place in the general javascript code trigger jQuery events which you can hook into from within your module's code. This allows you to take advantage of events generated by the global theme whilst keeping your code modular.

$(document).on('mqc_highlights', function(e, f_texts, f_cols, regex_mode){
// This trigger is called when the highlight strings are
// updated. Three variables are given - an array of search
// strings (f_texts), an array of colours with corresponding
// indexes (f_cols) and a boolean var saying whether the
// search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_renamesamples', function(e, f_texts, t_texts, regex_mode){
// This trigger is called when samples are renamed.
// Three variables are given - an array of search
// strings (f_texts), an array of replacements with corresponding
// indexes (t_texts) and a boolean var saying whether the
// search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_hidesamples', function(e, f_texts, regex_mode){
// This trigger is called when the Hide Samples filters change.
// Two variables are given - an array of search strings
// (f_texts) and a boolean saying whether the search should
// be treated as a string or a regex (regex_mode)
});

$('#YOUR_PLOT_ID').on('mqc_plotresize', function(){
// This trigger is called when a plot handle is pulled,
// resizing the height
});

$('#YOUR_PLOT_ID').on('mqc_original_series_click', function(e, name){
// A plot able to show original images has had a point clicked.
// 'name' contains the name of the series that was clicked
});

$('#YOUR_PLOT_ID').on('mqc_original_chg_source', function(e, name){
// A plot with original images has had a request to change the
// original image source (eg. pressing Prev / Next)
});

$('#YOUR_PLOT_ID').on('mqc_plotexport_image', function(e, cfg){
// A trigger to export an image of the plot. cfg contains
// config variables for the requested image.
});

$('#YOUR_PLOT_ID').on('mqc_plotexport_data', function(e, cfg){
// A trigger to export a data file of the plot. cfg contains
// config variables for the requested data.
});

# MultiQC Plugins

MultiQC is written around a system designed for extensibility and plugins. These features allow custom code to be written without polluting the central code base.

Please note that we want MultiQC to grow as a community tool! So if you're writing a module or theme that can be used by others, please keep it within the main MultiQC framework and submit a pull request.

## Entry Points

The plugin system works using setuptools entry points. In setup.py you will see a section of code that looks like this (truncated):

entry_points = {
'multiqc.modules.v1': [
'qualimap = multiqc.modules.qualimap:MultiqcModule',
],
'multiqc.templates.v1': [
'default = multiqc.templates.default',
],
# 'multiqc.cli_options.v1': [
# 'my-new-option = myplugin.cli:new_option'
# ],
# 'multiqc.hooks.v1': [
# 'before_config = myplugin.hooks:before_config',
# 'execution_start = myplugin.hooks:execution_start',
# 'before_modules = myplugin.hooks:before_modules',
# 'after_modules = myplugin.hooks:after_modules',
# 'execution_finish = myplugin.hooks:execution_finish',
# ]
},

These sets of entry points can each be extended to add functionality to MultiQC:

• multiqc.modules.v1 - Defines the module classes. Used to add new modules.
• multiqc.templates.v1 - Defines the templates. Can be used for new templates.
• multiqc.cli_options.v1 - Allows plugins to add new custom command line options.
• multiqc.hooks.v1 - Code hooks for plugins to add new functionality.

Any python package can create entry points with the same names; once installed, MultiQC will find these and run them accordingly. For an example of this in action, see the MultiQC_NGI setup file:

entry_points = {
'multiqc.templates.v1': [
'ngi = multiqc_ngi.templates.ngi',
'genstat = multiqc_ngi.templates.genstat',
],
'multiqc.cli_options.v1': [
'project = multiqc_ngi.cli:pid_option'
],
'multiqc.hooks.v1': [
]
},

Here, two new templates and a new command line option are added.

## Modules

List items added to multiqc.modules.v1 specify new modules. They should be described as follows:

modname = python_mod.dirname.submodname:classname

Once this is done, everything else should be the same as described in the writing modules documentation.
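For instance, a hypothetical plugin package could register a new module like this (all names below are invented for illustration):

```python
# setup.py (excerpt) - registering a plugin module with MultiQC.
# 'mytool' is the module name, 'multiqc_myplugin.modules.mytool' is the
# python path, and 'MultiqcModule' is the class defined there.
entry_points = {
    'multiqc.modules.v1': [
        'mytool = multiqc_myplugin.modules.mytool:MultiqcModule',
    ],
}
```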

## Templates

As above, though no need to specify a class name at the end. See the writing templates documentation for further instructions.

## Command line options

MultiQC handles command line interaction using the click framework. You can use the multiqc.cli_options.v1 entry point to add new click decorators for command line options. For example, the MultiQC_NGI plugin uses the entry point above with the following code in cli.py:

import click
pid_option = click.option('--project', type=str)

The values given from additional command line arguments are parsed by MultiQC and put into config.kwargs. The above plugin later reads the value given by the user with the --project flag in a hook:

if config.kwargs['project'] is not None:
# do some stuff

See the click documentation or the main MultiQC script for more information and examples of adding command line options.
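To sketch the idea a little further, options registered this way can carry types, defaults and help text like any other click option. The flag name and plugin module below are invented:

```python
import click

# Registered via setup.py, e.g.:
#   'multiqc.cli_options.v1': ['batch = myplugin.cli:batch_option']
batch_option = click.option('--batch-id', 'batch_id', type=str, default=None,
                            help="Hypothetical batch identifier for the report")

# MultiQC applies this decorator to its main command, so the parsed
# value later appears in config.kwargs['batch_id'].
```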

## Hooks

Hooks are a little more complicated - these define points in the core MultiQC code where you can run custom functions. This can be useful as your code is able to access data generated by other parts of the program. For example, you could tie into the after_modules hook to insert data processed by MultiQC modules into a database automatically.

Here, the entry point names are the hook titles, described as commented out lines in the core MultiQC setup.py: execution_start, config_loaded, before_modules, after_modules and execution_finish.

These should point to a function in your code which will be executed when that hook fires. Your custom code can import the core MultiQC modules to access configuration and loggers. For example:

#!/usr/bin/env python
""" MultiQC hook functions - we tie into the MultiQC
core here to add in extra functionality. """

import logging
from multiqc.utils import report, config

log = logging.getLogger('multiqc')

def after_modules():
""" Plugin code to run when MultiQC modules have completed  """
num_modules = len(report.modules_output)
status_string = "MultiQC hook - {} modules reported!".format(num_modules)
log.critical(status_string)
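The corresponding entry point for the example above would then look something like this in your plugin's setup.py (the myplugin.hooks module path is hypothetical):

```python
# setup.py (excerpt) - wiring the after_modules() function above
# into the MultiQC hook system.
entry_points = {
    'multiqc.hooks.v1': [
        'after_modules = myplugin.hooks:after_modules',
    ],
}
```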

# Writing New Templates

MultiQC is built around a templating system that uses the Jinja python package. This makes it very easy to create new report templates that fit your needs.

## Core or plugin?

If your template could be of use to others, it would be great if you could add it to the main MultiQC package. You can do this by creating a fork of the MultiQC GitHub repository, adding your template and then creating a pull request to merge your changes back to the main repository.

## Creating a template skeleton

For a new template to be recognised by MultiQC, it must be a python submodule directory with an __init__.py file. This must be referenced in the setup.py installation script as an entry point.

You can see the bundled templates defined in this way:

entry_points = {
'multiqc.templates.v1': [
'default = multiqc.templates.default',
'default_dev = multiqc.templates.default_dev',
'simple = multiqc.templates.simple',
'geo = multiqc.templates.geo',
]
}

Note that these entry points can point to any Python modules, so if you're writing a plugin module you can specify your module name instead. Just make sure that multiqc.templates.v1 is the same.

Once you've added the entry point, remember to install the package again:

python setup.py develop

Using develop tells setuptools to symlink the plugin files instead of copying, so changes made whilst editing files will be reflected when you run MultiQC.

The __init__.py files must define two variables - the path to the template directory and the main jinja template file:

import os

template_dir = os.path.dirname(__file__)
base_fn = 'base.html'

## Child templates

The default MultiQC template contains a lot of code. Importantly, it includes 1448 lines of custom JavaScript (at time of writing) which powers the plotting and dynamic functions in the report. You probably don't want to rewrite all of this for your template, so to make your life easier you can create a child template.

To do this, add an extra variable to your template's __init__.py:

template_parent = 'default'

This tells MultiQC to use the template files from the default template unless a file with the same name is found in your child template. For instance, if you just want to add your own logo in the header of the reports, you can create your own header.html which will overwrite the default header.

Files within the default template have comments at the top explaining what part of the report they generate.
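Putting this together, a child template's __init__.py can be as short as the following sketch:

```python
import os

template_dir = os.path.dirname(__file__)  # this template's directory
base_fn = 'base.html'                     # main jinja template file
template_parent = 'default'               # inherit everything else from the default template
```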

## Extra init variables

There are a few extra variables that can be added to the __init__.py file to change how the report is generated.

Setting output_subdir instructs MultiQC to put the report and its contents into a subdirectory. Set the string to your desired name. Note that this will be prefixed if -p/--prefix is set at run time.

Secondly, setting copy_files lets you copy additional files with your report when it is generated. This is usually used to copy required images or scripts with the report. It should be a list of file or directory paths, relative to the __init__.py file. Directory contents will be copied recursively.

You can also override config options in the template. For example, setting the value of config.plots_force_flat can force the report to only have static image plots.

from multiqc.utils import config

output_subdir = 'multiqc_report'
copy_files = ['assets']
config.plots_force_flat = True

## Jinja template variables

There are a number of variables that you can use within your Jinja template. Two namespaces are available - report and config. You can print these using the Jinja curly brace syntax, eg. {{ config.version }}. See the Jinja2 documentation for more information.

The default MultiQC template includes dependencies in the HTML so that the report is standalone. If you would like to do the same, use the include_file function. For example:

<script>{{ include_file('js/jquery.min.js') }}</script>
<img src="data:image/png;base64,{{ include_file('img/logo.png', b64=True) }}">

## Appendices

### Custom plotting functions

If you don't like the default plotting functions built into MultiQC, you can write your own! If you create a callable variable in a template called either bargraph or linegraph, MultiQC will use that instead. For example:

def custom_linegraph(plotdata, pconfig):
    return '<h1>Awesome line graph here</h1>'
linegraph = custom_linegraph

def custom_bargraph(plotdata, plotseries, pconfig):
    return '<h1>Awesome bar graph here</h1>'
bargraph = custom_bargraph

These particular examples don't do very much, but hopefully you get the idea. Note that you have to set the variable linegraph or bargraph to your function.

# Updating for compatibility

When releasing new versions of MultiQC we aim to maintain compatibility so that your existing modules and plugins will keep working. However, in some cases we have to make changes that require code to be modified. This section summarises the changes by MultiQC release.

MultiQC v1.0 brings a few changes in the way that MultiQC modules and plugins are written. Most are backwards-compatible, but there are a couple that could break external plugins.

#### Module imports

The MultiQC module imports have been refactored to make them less interdependent and fragile. This has a bunch of advantages, notably allowing better, more modular, unit testing (and hopefully more reliable and maintainable code).

All MultiQC modules and plugins will need to change some of their import statements.

There are two things that you probably need to change in your plugin modules to make them work with the updated version of MultiQC, both to do with imports. Instead of this style of importing modules:

from multiqc import config, BaseMultiqcModule, plots

You now need this:

from multiqc import config
from multiqc.plots import bargraph   # Load specific plot types here
from multiqc.modules.base_module import BaseMultiqcModule

Modules that directly reference multiqc.BaseMultiqcModule instead need to reference multiqc.modules.base_module.BaseMultiqcModule.

Secondly, modules that use import plots now need to import the specific plots needed. You will also need to update any plotting functions, removing the plot. prefix.

For example, change this:

import plots
return plots.bargraph.plot(data, keys, pconfig)

to this:

from multiqc.plots import bargraph
return bargraph.plot(data, keys, pconfig)

These changes have been made to simplify the module imports within MultiQC, allowing specific parts of the codebase to be imported into a Python script on their own. This enables small, atomic, clean unit testing.

If you have any questions, please open an issue.

Many thanks to @tbooth at @EdinburghGenomics for his patient work with this.

#### Searching for files

The core find_log_files function has been rewritten and now works a little differently. Instead of searching all analysis files each time it's called (by every module), all files are searched once at the start of the MultiQC execution. This makes MultiQC run much faster.

To use the new syntax, add your search pattern to config.sp using the new before_config plugin hook:

setup.py:

# [..]
'multiqc.hooks.v1': [
    'before_config = myplugin.hooks:before_config'
]

mymodule.py:

from multiqc.utils import config
my_search_patterns = {
'my_plugin/my_mod': {'fn': '*_somefile.txt'},
'my_plugin/my_other_mod': {'fn': '*other_file.txt'},
}
config.update_dict(config.sp, my_search_patterns)

This will add in your search patterns to the default MultiQC config, before user config files are loaded (allowing people to overwrite your defaults as with other modules).

Now, you can find your files much as before, using the string specified above:

for f in self.find_log_files('my_plugin/my_mod'):
# do something

The old syntax (supplying a dict instead of a string to the function without any previous config setup) will still work, but you will get a deprecation notice. This functionality may be removed in the future.
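Since the file objects yielded by find_log_files are plain dictionaries, the parsing logic is easy to exercise on its own. The keys shown below ('f' for file contents, 's_name', 'fn', 'root') match what MultiQC supplies; the log format and search key are invented for illustration:

```python
def parse_logfile(f):
    """Parse one file dict, as yielded by self.find_log_files()."""
    data = {}
    for line in f['f'].splitlines():  # the 'f' key holds the file contents
        if line.startswith('Total reads:'):
            data['total_reads'] = int(line.split(':')[1])
    return data

# A file dict shaped like MultiQC's (contents invented):
example = {
    'f': 'Total reads: 1000\nOther stuff: 42\n',
    's_name': 'sample_1',
    'fn': 'sample_1_somefile.txt',
    'root': '.',
}
parsed = parse_logfile(example)
# parsed == {'total_reads': 1000}
```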

Until now, report sections were added by creating a list called self.sections and adding to it. If you only had a single section, the routine was to instead append to the self.intro string.

These methods have been deprecated in favour of a new function called self.add_section(). For example, instead of the previous:

self.sections = list()
self.sections.append({
'name': 'My Section',
'anchor': 'my-html-id',
'content': '<p>Description of what this plot shows.</p>' +
linegraph.plot(data, pconfig)
})

the syntax is now:

self.add_section(
    name = 'My Section',
    anchor = 'my-html-id',
    description = 'Description of what this plot shows.',
    helptext = 'More extensive help text about how to interpret this.',
    plot = linegraph.plot(data, pconfig)
)

Note that content should now be split up into three new keys: description, helptext and plot. This will allow consistent formatting and future developments with improved module help text. Text is wrapped in <p> tags by the function, so these are no longer needed. Raw content can still be provided in a content string as before if required.

All fields are optional. If name is omitted then the end result will be the same as previously done with self.intro += content.

#### Updated number formatting

A couple of minor updates to how numbers are handled in tables may affect your configs. Firstly, format strings looking like {:.1f} should now be {:,.1f} (note the extra comma). This enables customisable number formatting with separated thousand groups.

Secondly, any table columns reporting a read count should use new config options to allow user-configurable multipliers. For example, instead of this:

headers['read_counts'] = {
    'modify': lambda x: x / 1000000,
    'format': '{:,.2f} M',
}

use the configurable read count options:

headers['read_counts'] = {
    'title': '{} Reads'.format(config.read_count_prefix),
    'description': 'Total read count ({})'.format(config.read_count_desc),
    'modify': lambda x: x * config.read_count_multiplier,
    'shared_key': 'read_count'
}
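As a quick illustration of how these options interact, here is the arithmetic with some assumed config values (a 'millions of reads' setup):

```python
# Assumed config values for a 'millions' read count setup:
read_count_multiplier = 0.000001
read_count_prefix = 'M'
read_count_desc = 'millions'

raw_reads = 23542187
displayed = '{:,.2f} {}'.format(raw_reads * read_count_multiplier,
                                read_count_prefix)
# displayed == '23.54 M'
```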