Welcome to the MultiQC docs.

These docs are bundled with the MultiQC download for your convenience, so you can also read them in your installation or on GitHub.


Using MultiQC

Introduction

MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn't run other tools for you - it's designed to be placed at the end of analysis pipelines or to be run manually when you've finished running your tools.

When you launch MultiQC, it recursively searches through any provided file paths and finds files that it recognises. It parses relevant information from these and generates a single stand-alone HTML report file. It also saves a directory of data files with all parsed data for further downstream use.

Installing MultiQC

Installing Python

To see if you have Python installed, run python --version on the command line. If you see version 2.7 or above, or 3.4 or above, you can skip this step.

We recommend using virtual environments to manage your Python installation. Our favourite is Anaconda, a cross-platform tool to manage Python environments. You can find installation instructions for Anaconda here.

Once Anaconda is installed, you can create an environment with the following commands:

conda create --name py2.7 python=2.7
source activate py2.7
# Windows: activate py2.7

You'll want to add the source activate py2.7 line to your .bashrc file so that the environment is loaded every time you load the terminal.

Installing with conda

If you're using conda as described above, you can install MultiQC from the bioconda channel as follows:

conda install -c bioconda multiqc

Installation with pip

This is the easiest way to install MultiQC. pip is the package manager for the Python Package Index (PyPI). It comes bundled with recent versions of Python; otherwise, you can find installation instructions here.

You can now install MultiQC from PyPI as follows:

pip install multiqc

If you would like the development version, the command is:

pip install git+https://github.com/ewels/MultiQC.git

Note that if you have problems with read-only directories, you can install to your home directory with the --user parameter (though it's probably better to use virtual environments, as described above).

pip install --user multiqc

Manual installation

If you'd rather not use either of these tools, you can clone the code and install the code yourself:

git clone https://github.com/ewels/MultiQC.git
cd MultiQC
python setup.py install

git not installed? No problem - just download the flat files:

curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
unzip master.zip
cd MultiQC-master
python setup.py install

Updating MultiQC

You can update MultiQC from PyPI at any time by running the following command:

pip install --upgrade multiqc

To update the development version, use:

pip install --upgrade --force-reinstall git+https://github.com/ewels/MultiQC.git

If you cloned the git repo, just pull the latest changes and install:

cd MultiQC
git pull
python setup.py install

If you downloaded the flat files, just repeat the installation procedure.

Installing as an environment module

Many people using MultiQC will be working in an HPC environment. Every server / cluster is different, and you're probably best off asking your friendly sysadmin to install MultiQC for you. However, with that in mind, here are a few general tips for installing MultiQC into an environment module system:

MultiQC comes in two parts: the multiqc Python package and the multiqc executable script. The former must be available on $PYTHONPATH and the latter must be available on $PATH.

A typical installation procedure with an environment module Python install might look like this: (Note that $PYTHONPATH must be defined before pip installation.)

VERSION=0.7
INST=/path/to/software/multiqc/$VERSION
module load python/2.7.6
mkdir $INST
export PYTHONPATH=$INST/lib/python2.7/site-packages
pip install --install-option="--prefix=$INST" multiqc

Once installed, you'll need to create an environment module file. Again, these vary between systems a lot, but here's an example:

#%Module1.0#####################################################################
##
## MultiQC
##

set components [ file split [ module-info name ] ]
set version [ lindex $components 1 ]
set modroot /path/to/software/multiqc/$version

proc ModulesHelp { } {
    global version modroot
    puts stderr "\tMultiQC - use MultiQC $version"
    puts stderr "\n\tVersion $version\n"
}
module-whatis   "Loads MultiQC environment."

# load required modules
module load python/2.7.6

# only one version at a time
conflict multiqc

# Make the directories available
prepend-path    PATH        $modroot/bin
prepend-path    PYTHONPATH  $modroot/lib/python2.7/site-packages

Using the Docker container

A Docker container based on python:2.7-slim is provided. Specify the volume to bind-mount with -v and the working directory inside the container with -w, or just use -v "$PWD":"$PWD" -w "$PWD" to run in the current directory. For more information, see the Docker documentation.

The usual multiqc command line should work fine:

docker run -v "$PWD":"$PWD" -w "$PWD" ewels/multiqc multiqc .

Running MultiQC

Once installed, just go to your analysis directory and run multiqc, followed by a list of directories to search. At its simplest, this can just be . (the current working directory):

multiqc .

That's it! MultiQC will scan the specified directories and produce a report based on details found in any log files that it recognises.

See Using MultiQC Reports for more information about how to use the generated report.

For a description of all command line parameters, run multiqc --help.

Choosing where to scan

You can supply MultiQC with as many directories or files as you like. Above, we supply . - just the current directory, but all of these would work too:

multiqc data/
multiqc data/ ../proj_one/analysis/ /tmp/results
multiqc data/*_fastqc.zip
multiqc data/sample_1*

You can also ignore files using the -x/--ignore flag (can be specified multiple times). This takes a string which it matches using glob expansion to filenames, directory names and entire paths:

multiqc . --ignore *_R2*
multiqc . --ignore run_two/
multiqc . --ignore */run_three/*/fastqc/*_R2.zip

Some modules get sample names from the contents of the file and not the filename (for example, stdout logs can contain multiple samples). In this case, you can skip samples by name instead:

multiqc . --ignore-samples sample_3*

These strings are matched using glob logic (* and ? are wildcards).
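If you want to sanity-check a pattern before running MultiQC, Python's fnmatch module implements the same glob semantics. This is a quick local check, not part of MultiQC itself:

```python
from fnmatch import fnmatch

# '*' matches any run of characters, '?' matches exactly one character
samples = ["sample_1", "sample_2", "sample_30", "run_two/sample_1"]

# Which sample names would `--ignore-samples sample_3*` skip?
matches = [s for s in samples if fnmatch(s, "sample_3*")]
```

Here, matches contains only sample_30.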

All of these settings can be saved in a MultiQC config file so that you don't have to type them on the command line for every run.

Finally, you can supply a file containing a list of file paths, one per row. MultiQC will then only search the listed files.

multiqc --file-list my_file_list.txt

Renaming reports

The report is called multiqc_report.html by default. Tab-delimited data files are created in multiqc_data/, containing additional information. You can use a custom name for the report with the -n/--filename parameter, or instruct MultiQC to create them in a subdirectory using the -o/--outdir parameter.

Note that different MultiQC templates may have different defaults.

Overwriting existing reports

It's quite common to repeatedly create new reports as new analysis results are generated. Instead of manually deleting old reports, you can just specify the -f parameter and MultiQC will overwrite any conflicting report filenames.

Sample names prefixed with directories

Sometimes, the same samples may be processed in different ways. If MultiQC finds log files with the same sample name, the previous data will be overwritten (this can be inspected by running MultiQC with -v/--verbose).

To avoid this, run MultiQC with the -d/--dirs parameter. This will prefix every sample name with the directory path for that log file. As such, sample names should now be unique and will not overwrite one another.

By default, --dirs will prepend the entire path to each sample name. You can choose which directories are added with the -dd/--dirs-depth parameter. Set to a positive integer to use that many directories at the end of the path. A negative integer takes directories from the start of the path.

For example:

$ multiqc -d .
# analysis_1 | results | type | sample_1 | file.log
# analysis_2 | results | type | sample_2 | file.log
# analysis_3 | results | type | sample_3 | file.log

$ multiqc -d -dd 1 .
# sample_1 | file.log
# sample_2 | file.log
# sample_3 | file.log

$ multiqc -d -dd -1 .
# analysis_1 | file.log
# analysis_2 | file.log
# analysis_3 | file.log

Using different templates

MultiQC is built around a templating system. You can produce reports with different styling by using the -t/--template option. The available templates are listed with multiqc --help.

If you're interested in creating your own custom template, see the writing new templates section.

PDF Reports

Whilst HTML is definitely the format of choice for MultiQC reports due to the interactive features that it can offer, PDF files are an integral part of some people's workflows. To try to accommodate this, MultiQC has a --pdf command line flag which will try to create a PDF report for you.

To do this, MultiQC uses the simple template. This uses flat plots, has no navigation or toolbar and strips out all JavaScript. The resulting HTML report is pretty basic, but this simplicity is helpful when generating PDFs.

Once the report is generated MultiQC attempts to call Pandoc, a command line tool able to convert documents between different file formats. You must have Pandoc already installed for this to work. If you don't have Pandoc installed, you will get an error message that looks like this:

Error creating PDF - pandoc not found. Is it installed? http://pandoc.org/

Please note that Pandoc is a complex tool and uses LaTeX / XeLaTeX for PDF generation. Please make sure that you have the latest version of Pandoc and that it can successfully convert basic HTML files to PDF before reporting any errors. Also note that not all plots have flat image equivalents, so some will be missing (at time of writing: FastQC sequence content plot, beeswarm dot plots, heatmaps).

Printing to stdout

If you would like to generate MultiQC reports on the fly, you can print the output to standard out by specifying -n stdout. Note that the data directory will not be generated and the template used must create stand-alone HTML reports.

Parsed data directory

By default, MultiQC creates a directory alongside the report containing tab-delimited files with the parsed data. This is useful for downstream processing, especially if you're running MultiQC with very large numbers of samples.

Typically, these files are tab-delimited tables. However, you can get JSON or YAML output for easier downstream parsing by specifying -k/--data-format on the command line or data_format in your configuration file.
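For example, with JSON output the parsed data can be read back using any standard JSON library. The snippet below is a sketch only: the metric names (percent_duplicates, total_sequences) are made up for illustration, and the exact file names and layout inside multiqc_data/ depend on your MultiQC version and the modules that ran:

```python
import json

# Stand-in for the contents of a JSON file from multiqc_data/
raw = """
{
  "sample_1": {"percent_duplicates": 12.3, "total_sequences": 1000000},
  "sample_2": {"percent_duplicates": 9.8,  "total_sequences": 980000}
}
"""

stats = json.loads(raw)
# Pull one metric per sample for downstream processing
dup_rates = {name: values["percent_duplicates"] for name, values in stats.items()}
```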

You can also choose whether to produce the data by specifying either the --data-dir or --no-data-dir command line flags or the make_data_dir variable in your configuration file. Note that the data directory is never produced when printing the MultiQC report to stdout.

To zip the data directory, use the -z/--zip-data-dir flag.

Exporting Plots

In addition to the HTML report, it's also possible to get MultiQC to save plots as stand alone files. You can do this with the -p/--export command line flag. By default, plots will be saved in a directory called multiqc_plots as .png, .svg and .pdf files. Raw data for the plots are also saved to files.

You can instruct MultiQC to always do this by setting the export_plots config option to true, though note that this will add a few seconds on to execution time. The plots_dir_name changes the default directory name for plots and the export_plot_formats specifies what file formats should be created (must be supported by MatPlotLib).

Note that not all plot types are yet supported, so you may find some plots are missing.

Note: You can always save static image versions of plots from within MultiQC reports, using the Export toolbox in the side bar.

Choosing which modules to run

Sometimes, it's desirable to choose which MultiQC modules run. This could be because you're only interested in one type of output and want to keep the reports small. Or perhaps the output from one module is misleading in your situation.

You can do this by using -m/--modules to explicitly define which modules you want to run. Alternatively, use -e/--exclude to run all modules except those listed.

You can get a group of modules by using --tag followed by a tag e.g. RNA or DNA.

Using MultiQC Reports

Once MultiQC has finished, you should have an HTML report file called multiqc_report.html (or something similar, depending on how you ran MultiQC). You can launch this report with open multiqc_report.html on the command line, or by double-clicking the file in a file browser.

Browser compatibility

MultiQC reports should work in any modern browser. They have been tested using OSX Chrome, Firefox and Safari. If you find any report bugs, please report them as a GitHub issue.

Report layout

MultiQC reports have three main page sections:

  • The navigation menu (left side)
    • Links to the different module sections in the report
    • Click the logo to go to the top of the page
  • The toolbox (right side)
    • Contains various tools to modify the report data (see below)
  • The report (middle)
    • This is what you came here for, the data!

Note that if you're viewing the report on a mobile device / small window, the content will be reformatted to fit the screen.

General Statistics table

At the top of every MultiQC report is the 'General Statistics' table. This shows an overview of key values, taken from all modules. The aim of the table is to bring together stats for each sample from across the analysis so that you can see it in one place.

Hovering over column headers will show a longer description, including which module produced the data. Clicking a header will sort the table by that value. Clicking it again will change the sort direction. You can shift-click multiple headers to sort by multiple columns.

sort column

Above the table there is a button called 'Configure Columns'. Clicking this will launch a modal window with more detailed information about each column, plus options to show/hide and change the order of columns.

configure columns

Plots

MultiQC modules can plot more extensive data in the sections below the General Statistics table.

Interactive plots

Plots in MultiQC reports are usually interactive, using the HighCharts JavaScript library.

You can hover the mouse over data to see a tooltip with more information about that dataset. Clicking and dragging on line graphs will zoom into that area.

plot zoom

To reset the zoom, use the button in the top right:

reset zoom

Plots have a grey bar along their base; clicking and dragging this will resize the plot's height:

plot zoom

You can force reports to use interactive plots instead of flat by specifying the --interactive command line option (see below).

Flat plots

Reports with large numbers of samples may contain flat plots. These are rendered when the MultiQC report is generated using MatPlotLib and are non-interactive (flat) images within the report. The reason for generating these is that large sample numbers can make MultiQC reports very data-intensive and unresponsive (crashing people's browsers in extreme cases). Plotting data in flat images is scalable to any number of samples, however.

Flat plots in MultiQC have been designed to look as similar to their interactive versions as possible. They are also copied to multiqc_data/multiqc_plots.

You can force reports to use flat plots with the --flat command line option.

See the Large sample numbers section of the Configuring MultiQC docs for more on how to customise the flat / interactive plot behaviour.

Exporting plots

If you want to use the plot elsewhere (eg. in a presentation or paper), you can export it in a range of formats. Just click the menu button in the top right of the plot:

plot zoom

This opens the MultiQC Toolbox Export Plots panel with the current plot selected. You have a range of export options here. When deciding on output format bear in mind that SVG is a vector format, so can be edited in tools such as Adobe Illustrator or the free tool Inkscape. This makes it ideal for use in publications and manual customisation / annotation. The Plot scaling option changes how large the labels are relative to the plot.

Dynamic plots

Some plots have buttons above them which allow you to change the data that they show or their axis. For example, many bar plots have the option to show the data as percentages instead of counts:

percentage button

Toolbox

MultiQC reports come with a 'toolbox', accessible by clicking the buttons on the right hand side of the report:

toolbox buttons

Active toolbox panels have their button highlighted with a blue outline. You can hide the toolbox by clicking the open panel button a second time, or pressing Escape on your keyboard.

Highlight Samples

If you run MultiQC with a lot of samples, plots can become very data-heavy. This makes it difficult to find specific samples, or subsets of samples.

To help with this, you can use the Highlight Samples tool to colour datasets of interest. Simply enter some text which will match the samples you want to highlight and press enter (or click the add button). If you like, you can also customise the highlight colour.

toolbox highlight

To make it easier to match groups of samples, you can use regular expressions by turning on 'Regex mode'. You can test regexes using a nice tool at regex101.com. See a nice introduction to regexes here. Note that pattern delimiters are not needed (use pattern, not /pattern/).

Here, we highlight any sample names that end in _1:

highlight regex

Note that a new button appears above the General Statistics table when samples are highlighted, allowing you to sort the table according to highlights.

Search patterns can be changed after creation, just click to edit. To remove, click the grey cross on the right hand side.

Searching for an empty string will match all samples.

Renaming Samples

Sample names are typically generated based on processed file names. These file names are not always informative. To help with this, you can do a search and replace within sample names. Here, we remove the SRR1067 and _1 parts of the sample names, which are the same for all samples:

rename samples

Again, regular expressions can be used. See above for details. Note that regex groups can be used - define a group match with parentheses and use the matching value with $1, $2 etc. For example - a search string SRR283(\d{3}) and replace string $1_SRR283 would move the final three digits of matching sample names to the start of the name.
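The same rename can be reproduced outside the report with any regex engine. Note that the toolbox uses JavaScript-style $1 backreferences, whereas Python's re module writes the equivalent as \1. A minimal check of the example above:

```python
import re

# Search string SRR283(\d{3}) with replace string $1_SRR283 in the toolbox
# becomes a \1_SRR283 replacement in Python's re syntax
renamed = re.sub(r"SRR283(\d{3})", r"\1_SRR283", "SRR283001")
```

Here, renamed is "001_SRR283": the final three digits have moved to the start of the name.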

Often, you may have a spreadsheet with filenames and informative sample names. To avoid having to manually enter each name, you can paste from a spreadsheet using the 'bulk import' tool:

bulk rename

Hiding Samples

Sometimes, you want to focus on a subset of samples. To temporarily hide samples from the report, enter a search string as described above into the 'Hide Samples' toolbox panel.

Here, we hide all samples with _trimmed in their sample name: (Note that plots will tell you how many samples have been hidden)

hide samples

Export

This panel allows you to download MultiQC plots as images or as raw data. You can configure the size and characteristics of exported plot images: Width and Height set the output size of the images, scale sets how "zoomed-in" they should look (typically you want the plot to be more zoomed for printing). The tick boxes below these settings allow you to download multiple plots in one go.

When exporting raw data using the Data tab, all tabs of a plot with multiple tabs will be exported. When exporting images, only the currently visible tab will be exported.

Note: You can also save static plot images when you run MultiQC. See Exporting Plots for more information.

Save Settings

To avoid having to re-enter the same toolbox setup repeatedly, you can save your settings using the 'Save Settings' panel. Just pick a name and click save. To load, choose your set of settings and press load (or delete). Loaded settings are applied on top of current settings. All configs are saved in browser local storage - they do not travel with the report and may not work in older browsers.

Configuring MultiQC

Whilst most MultiQC settings can be specified on the command line, MultiQC is also able to parse system-wide and personal config files. At run time, it collects the configuration settings from the following places in this order (overwriting at each step if a conflicting config variable is found):

  1. Hardcoded defaults in MultiQC code
  2. System-wide config in <installation_dir>/multiqc_config.yaml
    • Manual installations only, not pip or conda
  3. User config in ~/.multiqc_config.yaml
  4. File path set in environment variable MULTIQC_CONFIG_PATH
    • For example, define this in your ~/.bashrc file and keep the file anywhere you like
  5. Config file in the current working directory: multiqc_config.yaml
  6. Config file paths specified in the command with --config / -c
    • You can specify multiple files like this, they can have any filename.
  7. Command line config (--cl_config)
  8. Specific command line options (e.g. --force)

You can find an example configuration file with the MultiQC source code, called multiqc_config.example.yaml. If you installed MultiQC with pip or conda you won't have this file locally, but you can find it on GitHub: github.com/ewels/MultiQC.

Sample name cleaning

MultiQC typically generates sample names by taking the input or log file name, and 'cleaning' it. To do this, it uses the fn_clean_exts settings and looks for any matches. If it finds any matches, everything to the right is removed. For example, consider the following config:

fn_clean_exts:
    - '.gz'
    - '.fastq'

This would make the following sample names:

mysample.fastq.gz  ->  mysample
secondsample.fastq.gz_trimming_log.txt  ->  secondsample
thirdsample.fastq_aligned.sam.gz  ->  thirdsample
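The truncation logic can be sketched in a few lines of Python. This is a simplified illustration of the behaviour described above, not MultiQC's actual implementation:

```python
def clean_sample_name(filename, clean_exts):
    """Cut the filename at the first occurrence of each configured string."""
    for ext in clean_exts:
        idx = filename.find(ext)
        if idx > 0:  # never truncate to an empty name
            filename = filename[:idx]
    return filename

# Reproduces the examples above with fn_clean_exts: ['.gz', '.fastq']
clean_sample_name("secondsample.fastq.gz_trimming_log.txt", [".gz", ".fastq"])
```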

There is also a config list called fn_clean_trim which just removes strings if they are present at the start or end of the sample name.

Usually you don't want to overwrite the defaults (though you can). Instead, add to the special variable names extra_fn_clean_exts and extra_fn_clean_trim:

extra_fn_clean_exts:
    - '.myformat'
    - '_processedFile'
extra_fn_clean_trim:
    - '#'
    - '.myext'

Other search types

File name cleaning can also remove exact strings (instead of truncating at the match), and regex patterns can be supplied to remove or keep matching substrings.

truncate (default)

If you just supply a string, the default behaviour is truncation: the filename is cut off from the start of the matching string onwards.

extra_fn_clean_exts:
    - '.fastq'

This rule would produce the following sample names:

mysample.fastq.gz  ->  mysample
thirdsample.fastq_aligned.sam.gz  ->  thirdsample

remove (formerly replace)

The remove type allows you to remove the exact match from the filename.

extra_fn_clean_exts:
    - type: remove
      pattern: .sorted

This rule would produce the following sample names:

secondsample.sorted.deduplicated  ->  secondsample.deduplicated

regex

You can also remove a substring with a regular expression. Here's a good resource to interactively try it out.

extra_fn_clean_exts:
    - type: regex
      pattern: '^processed.'

This rule would produce the following sample names:

processed.thirdsample.processed  ->  thirdsample.processed

regex_keep

If you'd rather keep the match of a regular expression, you can use the regex_keep type. This simplifies things if you can, for example, directly target sample names.

extra_fn_clean_exts:
    - type: regex_keep
      pattern: '[A-Z]{3}[1-9]{2}'

This rule would produce the following sample names:

merged.recalibrated.XZY97.alignment.bam  ->  XZY97

Clashing sample names

This process of cleaning sample names can sometimes result in exact duplicates. A duplicate sample name will overwrite previous results. Warnings showing these events can be seen with verbose logging using the --verbose/-v flag, or in multiqc_data/multiqc.log.

Problems caused by this will typically show up as fewer results than expected. If you're ever unsure about where the data from results within MultiQC reports come from, have a look at multiqc_data/multiqc_sources.txt, which lists the path to the file used for every section of the report.

Directory names

One scenario where clashing names can occur is when the same file is processed in different directories. For example, if sample_1.fastq is processed with four sets of parameters in four different directories, they will all have the same name - sample_1. Only the last will be shown. If the directories are different, this can be avoided with the --dirs/-d flag.

For example, given the following files:

├── analysis_1
│   └── sample_1.fastq.gz.aligned.log
├── analysis_2
│   └── sample_1.fastq.gz.aligned.log
└── analysis_3
    └── sample_1.fastq.gz.aligned.log

Running multiqc -d . will give the following sample names:

analysis_1 | sample_1
analysis_2 | sample_1
analysis_3 | sample_1

Filename truncation

If the problem is with filename truncation, you can also use the --fullnames/-s flag, which disables all sample name cleaning. For example:

├── sample_1.fastq.gz.aligned.log
└── sample_1.fastq.gz.subsampled.fastq.gz.aligned.log

Running multiqc -s . will give the following sample names:

sample_1.fastq.gz.aligned.log
sample_1.fastq.gz.subsampled.fastq.gz.aligned.log

You can turn off sample name cleaning permanently by setting fn_clean_sample_names to false in your config file.
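In a config file, that looks like this:

```yaml
# Disable all sample name cleaning (use full file names as sample names)
fn_clean_sample_names: false
```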

Module search patterns

Many bioinformatics tools have standard output formats, filenames and other signatures. MultiQC uses these to find output; for example, the FastQC module looks for files that end in _fastqc.zip.

This works well most of the time, until someone has an automated processing pipeline that renames things. For this reason, since MultiQC v0.3.2, the file search patterns are loaded as part of the main config. This means that they can be overwritten in <installation_dir>/multiqc_config.yaml or ~/.multiqc_config.yaml. So if you always rename your _fastqc.zip files to _qccheck.zip, MultiQC can still work.

To see the default search patterns, see the search_patterns.yaml file. Copy the section for the program that you want to modify and paste this into your config file. Make sure you make it part of a dictionary called sp as follows:

sp:
    mqc_module:
        fn: _mysearch.txt

Search patterns can specify a filename match (fn) or a file contents match (contents).
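As a hypothetical example of a contents match, the module key and the search string below are placeholders; a module would match any file containing the given string:

```yaml
sp:
    mqc_module:
        contents: 'This is a string printed in my log file'
```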

Ignoring Files

MultiQC begins by indexing all of the files that you specified and building a list of the ones it will use. You can specify files and directories to skip on the command line using -x/--ignore, or for more permanent memory, with the following config file options: fn_ignore_files, fn_ignore_dirs and fn_ignore_paths (the command line option simply adds to all of these).

For example, given the following files:

├── analysis_1
│   └── sample_1.fastq.gz.aligned.log
├── analysis_2
│   └── sample_1.fastq.gz.aligned.log
└── analysis_3
    └── sample_1.fastq.gz.aligned.log

You could specify the following relevant config options:

fn_ignore_files:
    - '*.log'
fn_ignore_dirs:
    - 'analysis_1'
    - 'analysis_2'
fn_ignore_paths:
    - '*/analysis_*/sample_1*'

Note that the searched file paths will usually be relative to the working directory and can be highly variable, so you'll typically want to start patterns with a * to match any preceding directory structure.

Ignoring samples

Some modules get sample names from the contents of the file and not the filename (for example, stdout logs can contain multiple samples). You can skip samples by their resolved sample names (after cleaning) with two config options: sample_names_ignore and sample_names_ignore_re. The first takes a list of strings to be used for glob pattern matching (same behaviour as the command line option --ignore-samples), the latter takes a list of regex patterns. For example:

sample_names_ignore:
    - 'SRR*'
sample_names_ignore_re:
    - '^SR{2}\d{7}_1$'
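To check what the regex form matches, you can try it with Python's re module. This is a local sanity check, not part of MultiQC:

```python
import re

# The pattern from the config above: 'S', two 'R's, seven digits, then '_1'
pattern = re.compile(r"^SR{2}\d{7}_1$")

names = ["SRR1067503_1", "SRR1067503_2", "sample_A"]
ignored = [name for name in names if pattern.match(name)]
```

Here, only SRR1067503_1 is matched and would be ignored.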

Large sample numbers

MultiQC has been written with the intention of being used for any number of samples. This means that it should work well with 6 samples or 6000. Very large sample numbers are becoming increasingly common, for example with single cell data.

Producing reports with data from many hundreds or thousands of samples provides some challenges, both technically and also in terms of data visualisation and report usability.

Disabling on-load plotting

One problem with large reports is that the browser can hang when the report is first loaded. This is because it loads and processes the data for all plots at once. To mitigate this, large reports may show plots as grey boxes with a "Show Plot" button. Clicking this will render the plot as normal and prevents the browser from trying to do everything at once.

By default this behaviour kicks in when a plot has 50 samples or more. This can be customised by changing the num_datasets_plot_limit config option.

Flat / interactive plots

Reports with many samples start to need a lot of data for plots. This results in inconvenient report file sizes (can be 100s of megabytes) and worse, web browser crashes. To allow MultiQC to scale to these sample numbers, most plot types have two plotting functions in the code base - interactive (using HighCharts) and flat (rendered with MatPlotLib). Flat plots take up the same disk space irrespective of sample number and do not consume excessive resources to display.

By default, MultiQC generates flat plots when there are 100 or more samples. This cutoff can be changed by changing the plots_flat_numseries config option. This behaviour can also be changed by running MultiQC with the --flat / --interactive command line options or by setting the plots_force_flat / plots_force_interactive config options to True.
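In a config file, these options look like this (the values shown are illustrative, not the defaults):

```yaml
plots_flat_numseries: 500       # switch to flat plots at 500 samples instead of 100
# plots_force_flat: true        # always use flat plots
# plots_force_interactive: true # always use interactive plots
```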

Tables / Beeswarm plots

Report tables with thousands of samples (table rows) can quickly become impossible to use. To avoid this, tables with large numbers of rows are instead plotted as a Beeswarm plot (aka. a strip chart / jitter plot). These plots have fixed dimensions with any number of samples. Hovering on a dot will highlight the same sample in other rows.

By default, MultiQC starts using beeswarm plots when a table has 500 rows or more. This can be changed by setting the max_table_rows config option.

Command-line config

Sometimes it's useful to specify a single small config option just once, where creating a config file for the occasion may be overkill. In these cases you can use the --cl_config option to supply additional config values on the command line.

Config variables should be given as a YAML string. You will usually need to enclose this in quotes. If MultiQC is unable to understand your config you will get an error message saying Could not parse command line config.

As an example, the following command configures the coverage levels to use for the Qualimap module: (as described in the docs)

multiqc ./datadir --cl_config "qualimap_config: { general_stats_coverage: [20,40,200] }"

Customising Reports

MultiQC offers a few ways to customise reports to easily add your own branding and some additional report-level information. These features are primarily designed for core genomics facilities.

Note that much more extensive customisation of reports is possible using custom templates.

Titles and introductory text

You can specify a custom title for the report using the -i/--title command line option. The -b/--comment option can be used to add a longer comment to the top of the report at run time.
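For example (the title and comment strings here are placeholders):

```shell
multiqc ./data -i "Sequencing Run 42" -b "QC report for the March batch."
```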

You can also specify the title and comment, as well as a subtitle and the introductory text in your config file:

title: "My Title"
subtitle: "A subtitle to go underneath in grey"
intro_text: "MultiQC reports summarise analysis results."
report_comment: "This is a comment about this report."

Note that if intro_text is None the template will display the default introduction sentence. Set this to False to hide this, or set it to a string to use your own text.

To add your own custom logo to reports, you can add the following three lines to your MultiQC configuration file:

custom_logo: '/abs/path/to/logo.png'
custom_logo_url: 'https://www.example.com'
custom_logo_title: 'Our Institute Name'

Only custom_logo is required. Clicking the logo will open custom_logo_url in a new browser tab, and custom_logo_title sets the text shown when hovering the mouse over the logo.

Project level information

You can add custom information at the top of reports by adding key:value pairs to the config option report_header_info. Note that if you have a file called multiqc_config.yaml in the working directory, this will automatically be parsed and added to the config. For example, if you have the following saved:

report_header_info:
    - Contact E-mail: 'phil.ewels@scilifelab.se'
    - Application Type: 'RNA-seq'
    - Project Type: 'Application'
    - Sequencing Platform: 'HiSeq 2500 High Output V4'
    - Sequencing Setup: '2x125'

Then this will be displayed at the top of reports:

[Screenshot: custom report header info shown at the top of a report]

Note that you can also specify a path to a config file using -c.
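For example (the path is a placeholder):

```shell
multiqc ./data -c /path/to/multiqc_config.yaml
```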

Bulk sample renaming

Although it is possible to rename samples manually and in bulk using the report toolbox, it's often desirable to embed such renaming patterns into the report so that they can be shared with others. A typical case is a sequencing centre that has internal sample IDs as well as user-supplied sample names, or public sample identifiers such as SRA numbers alongside more meaningful names.

It's possible to supply a file with one or more sets of sample names using the --sample-names command line option. This file should be a tab-delimited file with a header row (used for the report button labels) and then any number of renamed sample identifiers. For example:

MultiQC Names   Proper Names    AWESOME NAMES
SRR1067503_1    Sample_1    MYBESTSAMP_1
SRR1067505_1    Sample_2    MYBESTSAMP_2
SRR1067510_1    Sample_3    MYBESTSAMP_3

If supplied, buttons will be generated at the top of the report with your labels. Clicking these will populate and apply the Toolbox renaming panel.

NB: Sample renaming works with partial substrings - these will be replaced!

It's also possible to supply such renaming patterns within a config file (useful if you're already generating a config file for a run). In this case, you need to set the variables sample_names_rename_buttons and sample_names_rename. For example:

sample_names_rename_buttons:
    - "MultiQC Names"
    - "Proper Names"
    - "AWESOME NAMES"
sample_names_rename:
    - ["SRR1067503_1", "Sample_1", "MYBESTSAMP_1"]
    - ["SRR1067505_1", "Sample_2", "MYBESTSAMP_2"]
    - ["SRR1067510_1", "Sample_3", "MYBESTSAMP_3"]

Module and section comments

Sometimes you may want to add a custom comment above specific sections in the report. You can do this with the config option section_comments as follows:

section_comments:
    featurecounts: 'This comment is for a module header, but should still work'
    star_alignments: 'This new way of commenting above sections is **awesome**!'

Comments can be written in Markdown. The section_comments keys should correspond to the HTML IDs of the report section. You can find these by clicking on a navigation link in the report and seeing the #section_id at the end of the browser URL.

Order of modules

By default, modules are included in the report in the order specified in config.module_order. Any modules found that aren't in this list are added to the top of the report.
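As a minimal sketch, a custom order could be set like this (module names must match the IDs that MultiQC uses, e.g. fastqc, cutadapt, star):

```yaml
module_order:
    - fastqc
    - cutadapt
    - star
```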

Top modules

To specify certain modules that should always come at the top of the report, you can configure config.top_modules in your MultiQC configuration file. For example, to always have the FastQC module at the top of reports, add the following to your ~/.multiqc_config.yaml file:

top_modules:
    - 'fastqc'

Running modules multiple times

A module can be specified multiple times in either config.module_order or config.top_modules, causing it to be run multiple times. By itself you'll just get two identical report sections. However, you can also supply configuration options to the modules as follows:

top_modules:
    - moduleName:
        name: 'Module (filtered)'
        info: 'This section shows the module with different files'
        path_filters:
            - '*_special.txt'
            - '*_others.txt'
    - moduleName:
        name: 'Module (all)'

These options overwrite the defaults hardcoded in the module code. The exception is path_filters, which instead limits the module's file searches to the given list of glob filename patterns. The available options are:

  • name: Section name
  • anchor: Section report ID
  • target: Intro link text
  • href: Intro link URL
  • info: Intro text
  • extra: Additional HTML after intro.

For example, to run the FastQC module twice, before and after adapter trimming, you could use the following config:

module_order:
    - fastqc:
        name: 'FastQC (trimmed)'
        info: 'This section of the report shows FastQC results after adapter trimming.'
        target: ''
        path_filters:
            - '*_1_trimmed_fastqc.zip'
    - cutadapt
    - fastqc:
        name: 'FastQC (raw)'
        path_filters:
            - '*_1_fastqc.zip'

Note that if you change the name then you will get multiple copies of that module's columns in the General Statistics table. If the name is left unchanged, the topmost module may overwrite output from the earlier iteration.

NB: Currently, you cannot list a module name in both top_modules and module_order. Let me know if this is a problem...

Order of sections

Sometimes it's desirable to customise the order of specific sections in a report, independent of module execution. For example, the custom_content module can generate multiple sections from different input files.

To do this, follow a link in a report navigation to skip to the section you want to move (must be a major section header, not a subheading). Find the ID of that section by looking at the URL. For example, clicking on FastQC changes the URL to multiqc_report.html#fastqc - the ID is the text after (not including) the # symbol.

Next, specify the report_section_order option in your MultiQC config file. Sections in the report are given a number starting at 10 (the section at the bottom of the report), incrementing by +10 for each section above. You can change this number (e.g. a very low number to always be at the bottom of the report, or a very high one to always be at the top), or you can move a section to before or after another existing section (this has no effect if the other named ID is not in the report).

For example, add the following to your MultiQC config file:

report_section_order:
    section1:
        order: -1000
    section2:
        before: 'othersection'
    section3:
        after: 'diffsection'

Customising tables

Report tables such as the General Statistics table can get quite wide. To help with this, columns in the report can be hidden. Some MultiQC modules include columns which are hidden by default, others may be uninteresting to some users.

To allow customisation of this behaviour, the defaults can be changed by adding to your MultiQC config file. This is done with the table_columns_visible value. Open a MultiQC report and click Configure Columns above a table. Make a note of the Group and ID for the column that you'd like to alter. For example, to make the % Duplicate Reads column from FastQC hidden by default, the Group is FastQC and the ID is percent_duplicates. These are then added to the config as follows:

table_columns_visible:
    FastQC:
        percent_duplicates: False

Note that you can set these to True to show columns that would otherwise be hidden by default.
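Conversely, a column that is hidden by default can be revealed by setting it to True. For example (assuming that avg_sequence_length is hidden by default in your version of the FastQC module):

```yaml
table_columns_visible:
    FastQC:
        avg_sequence_length: True
```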

In the same way, you can force a column to appear at the start or end of the table, or indeed impose a custom ordering on all the columns, by setting the table_columns_placement. High values push columns to the right hand side of the table and low to the left. The default value is 1000. For example:

table_columns_placement:
    Samtools:
        reads_mapped: 900
        properly_paired: 1010
        secondary: 1020

In this case, since the default placement weighting is 1000, reads_mapped will end up as the leftmost column and the other two will end up as the final columns on the right of the table.

Number base (multiplier)

To make numbers in the General Statistics table easier to read and compare quickly, MultiQC sometimes divides them by one million (typically read counts). If your samples have very low read counts then this can result in the table showing counts of 0.0, which isn't very helpful.

To change this behaviour, you can customise three config variables in your MultiQC config. The defaults are as follows:

read_count_multiplier: 0.000001
read_count_prefix: 'M'
read_count_desc: 'millions'

So, to show thousands of reads instead of millions, change these to:

read_count_multiplier: 0.001
read_count_prefix: 'K'
read_count_desc: 'thousands'
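The multiplier works by simple multiplication before formatting. A minimal sketch of the idea in Python (illustrative only - not MultiQC's actual code):

```python
# Illustrative sketch of how a read-count multiplier is applied
read_count_multiplier = 0.001  # show counts in thousands
read_count_prefix = 'K'

reads = 45678
formatted = f"{reads * read_count_multiplier:.1f} {read_count_prefix}"
print(formatted)  # 45.7 K
```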

The same options are also available for numbers of base pairs:

base_count_multiplier: 0.000001
base_count_prefix: 'Mb'
base_count_desc: 'millions'

Number formatting

By default, the interactive HighCharts plots in MultiQC reports use spaces for thousand separators and points for decimal places (e.g. 1 234 567.89). Different countries have different preferences for this, so you can customise the two using a couple of configuration parameters - decimalPoint_format and thousandsSep_format.

For example, the following config would result in the following alternative number formatting: 1234567,89.

decimalPoint_format: ','
thousandsSep_format: ''

This formatting currently only applies to the interactive charts. It may be extended to apply elsewhere in the future (submit a new issue if you spot somewhere where you'd like it).

Troubleshooting

One tricky bit that caught me out whilst writing this is the different type casting between Python, YAML and Jinja2 templates. This is especially true when using an empty variable:

# Python
my_var = None
# YAML
my_var: null
# Jinja2
{% if my_var is none %}  {# note the lower-case 'none' #}

Troubleshooting

Hopefully MultiQC will be easy to use and run without any hitches. If you have any problems, please do get in touch with the developer (Phil Ewels) by e-mail or by submitting an issue on github. Before that, here are a few things previously encountered that may help...

Not enough samples found

In this scenario, MultiQC finds some logs for the bioinformatics tool in question, but not all of your samples appear in the report. This is the most common question I get regarding MultiQC operation.

Usually, this happens because sample names collide. This often happens quite innocently - MultiQC overwrites previous results with the same name, so only the last one seen appears in the report. You can see warnings about this by running MultiQC in verbose mode with the -v flag, or by looking at the generated log file in multiqc_data/multiqc.log. If you are unsure about which log file ended up in the report, look at multiqc_data/multiqc_sources.txt, which lists each source file used.

To solve this, try running MultiQC with the -d and -s flags. The Clashing sample names section of the docs explains this in more detail.
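For example, to prepend directory names to sample names and disable automatic sample name cleaning in a single run (./data is a placeholder path):

```shell
multiqc ./data -d -s
```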

Big log files

Another reason that log files can be skipped is if the log filesize is very large. For example, this could happen with very long concatenated standard out files. By default, MultiQC skips any file that is larger than 10MB to keep execution fast. The verbose log output (-v or multiqc_data/multiqc.log) will show you if files are being skipped with messages such as these:

[DEBUG  ]  Ignoring file as too large: filename.txt

You can configure the threshold and parse your files by changing the log_filesize_limit config option. For example, to parse files up to 2GB in size, add the following to your MultiQC config file:

log_filesize_limit: 2000000000

No logs found for a tool

In this case, you have run a bioinformatics tool and have some log files in a directory. When you run MultiQC with that directory, it finds nothing for the tool in question.

There are a couple of things you can check here:

  1. Is the tool definitely supported by MultiQC? If not, why not open an issue to request it!
  2. Did your bioinformatics tool definitely run properly? I've spent quite a bit of time debugging MultiQC modules only to realise that the output files from the tool were empty or incomplete. If your data is missing, take a look at the raw files and make sure that there's something to see!

If everything looks fine, then MultiQC probably needs extending to support your data. Tools have different versions, different parameters and different output formats that can confuse the parsing code. Please open an issue with your log files and we can get it fixed.

Error messages about mkl trial mode / licences

In this case you run MultiQC and get something like this:

$ multiqc .

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode EXPIRED 2 days ago

    You cannot run mkl without a license any longer.
    A license can be purchased it at: http://continuum.io
    We are sorry for any inconveniences.

    SHUTTING DOWN PYTHON INTERPRETER

The mkl library provides optimisations for numpy, a requirement of MatPlotLib. Recent versions of Conda have a bundled version which should come with a licence and remove the warning. See this page for more info. If you already have Conda installed you can get the updated version by running:

conda remove mkl-rt
conda install -f mkl

Another way around it is to uninstall mkl. It seems that numpy works fine without it:

$ conda remove --features mkl

Problem solved! See more here and here.

If you're not using Conda, try installing MultiQC with that instead. You can find instructions here.

Locale Error Messages

Two MultiQC dependencies have been known to throw errors due to problems with the Python locale settings, or rather the lack of those settings.

MatPlotLib can complain that some strings (such as en_SE) aren't allowed. Running MultiQC gives the following error:

$ multiqc --version
# ..long traceback.. #
 File "/sw/comp/python/2.7.6_milou/lib/python2.7/locale.py", line 443, in _parse_localename
   raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

Click can have a similar problem if the locale isn't set when using Python 3. That generates an error that looks like this:

# ..truncated traceback.. #
File "click/_unicodefun.py", line 118, in _verify_python3_env 'for mitigation steps.' + extra)

RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII
as encoding for the environment.  Consult http://click.pocoo.org/python3/ for mitigation steps.

You can fix both of these problems by changing your system locale to something that will be recognised. One way to do this is by adding these lines to your .bashrc in your home directory (or .bash_profile):

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Other locale strings are also fine, as long as the variables are set and valid.

MultiQC Modules

Pre-alignment

Adapter Removal

This program searches for and removes remnant adapter sequences from High-Throughput Sequencing (HTS) data and (optionally) trims low quality bases from the 3' end of reads following adapter removal. AdapterRemoval can analyze both single end and paired end data, and can be used to merge overlapping paired-end reads into (longer) consensus sequences. Additionally, AdapterRemoval may be used to recover a consensus adapter sequence for paired-end data where this information is not available.

The adapterRemoval module parses *.settings logs generated by Adapter Removal, a tool for rapid adapter trimming, identification, and read merging.

Supported settings file results:

  • single end
  • paired end noncollapsed
  • paired end collapsed

AfterQC

The AfterQC module parses results generated by AfterQC. AfterQC can simply go through all fastq files in a folder and output three folders - good, bad and QC - which contain good reads, bad reads and the QC results of each fastq file/pair respectively.

bcl2fastq

There are two versions of this software: bcl2fastq for MiSeq and HiSeq sequencing systems running RTA versions earlier than 1.8, and bcl2fastq2 for Illumina sequencing systems running RTA version 1.18.54 and above. This module currently only covers output from the latter.

BioBloom Tools

BioBloom Tools (BBT) provides the means to create filters for a given reference and then to categorize sequences. This methodology is faster than alignment but does not provide mapping locations. BBT was initially intended to be used for pre-processing and QC applications like contamination detection, but is flexible to accommodate other purposes. This tool is intended to be a pipeline component to replace costly alignment steps.

Cluster Flow

Cluster Flow is a simple and flexible bioinformatics pipeline tool. It's designed to be quick and easy to install, with flexible configuration and simple customization.

Cluster Flow is easy enough for non-bioinformaticians to set up and use (given a basic knowledge of the command line), and its simplicity makes it great for low- to medium-throughput analyses.

The MultiQC module for Cluster Flow parses *_clusterflow.txt logs and finds consensus commands executed by modules in each pipeline run.

The Cluster Flow *.run files are also parsed and pipeline information shown (some basic statistics plus the pipeline steps / params used).

Cutadapt

The Cutadapt module parses results generated by Cutadapt, a tool to find and remove adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

This module should be able to parse logs from a wide range of versions of Cutadapt. It's been tested with log files from v1.2.1, 1.6 and 1.8. Note that you will need to change the search pattern for very old log files (such as v.1.2) with the following MultiQC config:

sp:
    cutadapt:
        contents: 'cutadapt version'

See the module search patterns section of the MultiQC documentation for more information.

FastQ Screen

The FastQ Screen module parses results generated by FastQ Screen, a tool that allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

By default, the module creates a plot that emulates the FastQ Screen output with blue and red stacked bars showing unique and multimapping read counts. This plot only works for a handful of samples however, so if # samples * # organisms >= 160, a simpler stacked barplot is shown. This is also shown when generating flat-image plots.

To always show this style of plot, add the following line to a MultiQC config file:

fastqscreen_simpleplot: true

FastQC

The FastQC module parses results generated by FastQC, a quality control tool for high throughput sequence data written by Simon Andrews at the Babraham Institute.

FastQC generates a HTML report which is what most people use when they run the program. However, it also helpfully generates a file called fastqc_data.txt which is relatively easy to parse.

A typical run will produce the following files:

mysample_fastqc.html
mysample_fastqc/
  Icons/
  Images/
  fastqc.fo
  fastqc_data.txt
  fastqc_report.html
  summary.txt

Sometimes the directory is zipped, with just mysample_fastqc.zip.

The FastQC MultiQC module looks for files called fastqc_data.txt or ending in _fastqc.zip. If zip files are found, they are read into memory and the fastqc_data.txt within is parsed.

Note: The directory and zip file are often both present. To speed up MultiQC execution, zip files will be skipped if the file name suggests that they will share a sample name with data that has already been parsed.

You can customise the patterns used for finding these files in your MultiQC config (see Module search patterns). The below code shows the default file patterns:

sp:
    fastqc/data:
        fn: 'fastqc_data.txt'
    fastqc/zip:
        fn: '*_fastqc.zip'

Note: Sample names are discovered by parsing the line beginning Filename in fastqc_data.txt, not based on the FastQC report names.

Theoretical GC Content

It is possible to plot a dashed line showing the theoretical GC content for a reference genome. MultiQC comes with genome and transcriptome guides for Human and Mouse. You can use these in your reports by adding the following MultiQC config keys (see Configuring MultiQC):

fastqc_config:
    fastqc_theoretical_gc: 'hg38_genome'

Only one theoretical distribution can be plotted. The following guides are available: hg38_genome, hg38_txome, mm10_genome, mm10_txome (txome = transcriptome).

Alternatively, a custom theoretical guide can be used in reports. To do this, create a file with fastqc_theoretical_gc in the filename and place it with your analysis files. It should be tab delimited with the following format (column 1 = %GC, column 2 = % of genome):

# FastQC theoretical GC content curve: YOUR REFERENCE NAME
0   0.005311768
1   0.004108502
2   0.004060371
3   0.005066476
[...]

You can generate these files using an R package called fastqcTheoreticalGC written by Mike Love. Please see the package readme for more details.

Result files from this package are searched for with the following search pattern (can be customised as described above):

sp:
    fastqc/theoretical_gc:
        fn: '*fastqc_theoretical_gc*'

If you want to always use a specific custom file for MultiQC reports without having to add it to the analysis directory, add the full file path to the same MultiQC config variable described above:

fastqc_config:
    fastqc_theoretical_gc: '/path/to/your/custom_fastqc_theoretical_gc.txt'

Flexbar

Flexbar preprocesses high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies.

Jellyfish

JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.

The MultiQC module for Jellyfish parses only *_jf.hist files. For the output to be picked up by the module, Jellyfish needs to be run as follows:

  • gunzip -c file.fastq.gz | jellyfish count -o file.jf -m ...
  • jellyfish histo -o file_jf.hist -f file.jf

To customise the matching pattern for Jellyfish, run MultiQC with the option --cl_config "sp: { jellyfish: { fn: 'PATTERN' } }", where PATTERN is the pattern to be matched. For example:

multiqc . --cl_config "sp: { jellyfish: { fn: '*.hist' } }"

leeHom

leeHom is a Bayesian maximum a posteriori algorithm for stripping sequencing adapters and merging overlapping portions of reads. The algorithm is mostly aimed at ancient DNA and Illumina data but can be used for any dataset.

Skewer

The Skewer module parses results generated by Skewer, an adapter trimming tool specially designed for processing next-generation sequencing (NGS) paired-end sequences.

SortMeRNA

SortMeRNA is a program for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for fast and sensitive analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data.

The MultiQC module parses the log files, which are created when SortMeRNA is run with the --log option.

The default header in the 'General Statistics' table is '% rRNA'. Users can override this using the configuration option:

sortmerna:
    tab_header: 'My database hits'

Trimmomatic

The Trimmomatic module parses results generated by Trimmomatic, a flexible read trimming tool for Illumina NGS data.

Aligners

Bismark

The Bismark module parses logs generated by Bismark, a tool to map bisulfite converted sequence reads and determine cytosine methylation states.

Bowtie 1

The Bowtie 1 module parses results generated by Bowtie, an ultrafast, memory-efficient short read aligner.

Bowtie 2

The Bowtie 2 module parses results generated by Bowtie 2, an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.

Please note that the Bowtie 2 logs are difficult to parse as they don't contain much extra information (such as what the input data was). A typical log looks like this:

314537 reads; of these:
  314537 (100.00%) were paired; of these:
    111016 (35.30%) aligned concordantly 0 times
    193300 (61.46%) aligned concordantly exactly 1 time
    10221 (3.25%) aligned concordantly >1 times
    ----
    111016 pairs aligned concordantly 0 times; of these:
      11377 (10.25%) aligned discordantly 1 time
    ----
    99639 pairs aligned 0 times concordantly or discordantly; of these:
      199278 mates make up the pairs; of these:
        112779 (56.59%) aligned 0 times
        85802 (43.06%) aligned exactly 1 time
        697 (0.35%) aligned >1 times
82.07% overall alignment rate

Bowtie 2 logs are from STDERR - some pipelines (such as Cluster Flow) print the Bowtie 2 command before this, so MultiQC looks to see if this can be recognised in the same file. If not, it takes the filename as the sample name.

Bowtie 2 is used by other tools too, so if your log file contains the word bisulfite, MultiQC will assume that this is actually Bismark and ignore the Bowtie 2 logs.

BBMap

The BBMap module produces summary statistics from the BBMap suite of tools. The module can summarise data from the following BBMap output files (descriptions from bbmap.sh help output):

  • covstats (not yet implemented)
    • Per-scaffold coverage info.
  • rpkm (not yet implemented)
    • Per-scaffold RPKM/FPKM counts.
  • covhist
    • Histogram of # occurrences of each depth level.
  • basecov
    • Coverage per base location.
  • bincov (not yet implemented)
    • Print binned coverage per location (one line per X bases).
  • scafstats
    • Statistics on how many reads mapped to which scaffold.
  • refstats
    • Statistics on how many reads mapped to which reference file; only for BBSplit.
  • bhist
    • Base composition histogram by position.
  • qhist
    • Quality histogram by position.
  • qchist
    • Count of bases with each quality value.
  • aqhist
    • Histogram of average read quality.
  • bqhist
    • Quality histogram designed for box plots.
  • lhist
    • Read length histogram.
  • gchist
    • Read GC content histogram.
  • indelhist
    • Indel length histogram.
  • mhist
    • Histogram of match, sub, del, and ins rates by read location.
  • statsfile (not yet implemented)
    • Mapping statistics are printed here.

Additional information on the BBMap tools is available on SeqAnswers.

HiCUP

The HiCUP module parses results generated by HiCUP, (Hi-C User Pipeline), a tool for mapping and performing quality control on Hi-C data.

HISAT2

HISAT2 is a fast and sensitive alignment program for mapping NGS reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome).

The HISAT2 MultiQC module parses summary statistics generated by versions >= v2.1.0 where the command line option --new-summary has been specified.

Note that running HISAT2 without this option (and older versions) gives log output identical to Bowtie2. These logs are indistinguishable and summary statistics will appear in MultiQC reports labelled as Bowtie2. See GitHub issues on the HISAT2 repository and the MultiQC repository for more information.

HISAT2 does not report the input file names in the log, so MultiQC takes the filename as the sample. Note that if you specify --summary-file when running HISAT2 the same summary output appears both there and in the stdout. So if you save both with different names you may end up with duplicate samples in your MultiQC report.

Kallisto

The Kallisto module parses logs generated by Kallisto, a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

Note - MultiQC parses the standard out from Kallisto, not any of its output files (abundance.h5, abundance.tsv, and run_info.json). As such, you must capture the Kallisto stdout to a file when running to use the MultiQC module.
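For example, a pipeline might capture the log like this (the index and read filenames are placeholders; redirecting both stdout and stderr is a safe way to make sure the run summary is captured):

```shell
kallisto quant -i transcripts.idx -o quant_out \
    reads_1.fastq.gz reads_2.fastq.gz > sample1.kallisto.log 2>&1
```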

Salmon

The Salmon module parses results generated by Salmon, a tool for quantifying the expression of transcripts using RNA-seq data.

STAR

STAR is an ultrafast universal RNA-seq aligner.

This MultiQC module parses summary statistics from the Log.final.out log files. Sample names are taken from the filename prefix (sampleNameLog.final.out) when set with --outFileNamePrefix in STAR. If there is no filename prefix, the sample name is set as the name of the directory containing the file.

In addition to this summary log file, the module parses ReadsPerGene.out.tab files generated with --quantMode GeneCounts, if found.

TopHat

The TopHat module parses results generated by TopHat, a fast splice junction mapper for RNA-Seq reads that aligns RNA-Seq reads to mammalian-sized genomes.

Post-alignment

Bamtools

The Bamtools module parses bamtools stats logs generated by Bamtools, a programmer's API and an end-user's toolkit for handling BAM files.

Supported commands: stats

Bcftools

The Bcftools module parses results generated by Bcftools, a suite of programs for interacting with variant call data.

Supported commands: stats

Collapse complementary substitutions

In non-strand-specific data, reporting the total numbers of occurrences for both changes in a complementary pair - like A>C and T>G - might not bring any additional information. To collapse such statistics in the substitutions plot, add the following section to your configuration:

bcftools:
    collapse_complementary_changes: true

MultiQC will sum up all complementary changes and show only A>* and C>* substitutions in the resulting plot.

BUSCO

BUSCO v2 provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.

The MultiQC module parses the short_summary_[samplename].txt files and plots the proportion of BUSCO types found. MultiQC has been tested with output from BUSCO v1.22 - v2.

Conpair

Conpair is a fast and robust method dedicated to human tumour-normal studies to perform concordance verification (i.e. that samples come from the same individual), as well as cross-individual contamination level estimation in whole-genome and whole-exome sequencing experiments.

Disambiguate

Disambiguation algorithm for reads aligned to two species (e.g. human and mouse genomes) from Tophat, Hisat2, STAR or BWA mem. Both a Python and C++ implementation are offered.

The MultiQC module for Disambiguate parses the summary files generated by Disambiguate.

deepTools

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. deepTools contains useful modules to process the mapped reads data for multiple quality checks, creating normalized coverage files in standard bedGraph and bigWig file formats, that allow comparison between different files (for example, treatment and control). Finally, using such normalized and standardized files, deepTools can create many publication-ready visualizations to identify enrichments and for functional annotations of the genome.

The MultiQC module for deepTools parses a number of the text files that deepTools can produce. In particular, the following are supported:

  • bamPEFragmentSize --table
  • estimateReadFiltering
  • plotCoverage --outRawCounts (as well as the content written normally to the console)
  • plotEnrichment --outRawCounts
  • plotFingerprint --outQualityMetrics --outRawCounts

Please be aware that some tools (namely, plotFingerprint --outRawCounts and plotCoverage --outRawCounts) are only supported as of deepTools version 2.6. For earlier output from plotCoverage --outRawCounts, you can use #'chr' 'start' 'end' in utils/search_patterns.yaml (see here for more details). Also for these types of files, you may need to increase the maximum file size supported by MultiQC (log_filesize_limit in the MultiQC configuration file). You can find details regarding the configuration file location here.

Note that sample names are parsed from the text files themselves, they are not derived from file names.

featureCounts

The featureCounts module parses results generated by featureCounts, a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations.

GATK

Developed by the Data Science and Data Engineering group at the Broad Institute, the GATK toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Supported tools:

  • BaseRecalibrator
  • VariantEval

BaseRecalibrator

BaseRecalibrator is a tool for detecting systematic errors in read base quality scores of aligned high-throughput sequencing reads. It outputs a base quality score recalibration table that can be used in conjunction with the PrintReads tool to recalibrate base quality scores.

VariantEval

VariantEval is a general-purpose tool for variant evaluation. It gives information about percentage of variants in dbSNP, genotype concordance, Ti/Tv ratios and a lot more.

goleft indexcov

The goleft indexcov module parses results generated by goleft indexcov. It uses the PED and ROC data files to create diagnostic plots of coverage per sample, helping to identify sample gender and coverage issues.

By default, we attempt to only plot chromosomes using standard human-like naming (chr1, chr2... chrX or 1, 2 ... X) but you can specify chromosomes for detailed ROC plots for alternative naming schemes in your configuration with:

goleft_indexcov_config:
  chromosomes:
    - I
    - II
    - III

HOMER

HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis. HOMER contains many useful tools for analyzing ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and numerous other types of functional genomics sequencing data sets.

The HOMER MultiQC module currently only parses output from the findPeaks tool. If you would like support to be added for other HOMER tools, please open a new issue on the MultiQC GitHub page.

FindPeaks

The HOMER findPeaks MultiQC module parses the summary statistics found at the top of HOMER peak files. Three key statistics are shown in the General Statistics table, all others are saved to multiqc_data/multiqc_homer_findpeaks.txt.
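Parsing commented header statistics of this kind can be sketched as follows. This is a hypothetical illustration, not MultiQC's actual parser, and it assumes header lines of the form `# key = value`:

```python
# Hypothetical sketch: collect "# key = value" header lines from the top
# of a peak file, stopping at the first non-comment (data) line.
def parse_peak_header(lines):
    stats = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("#"):
            break  # header block ends at the first data line
        if "=" in line:
            key, _, value = line.lstrip("# ").partition("=")
            stats[key.strip()] = value.strip()
    return stats

header = [
    "# HOMER Peaks",
    "# total peaks = 1234",
    "# peak size = 200",
    "chr1\t100\t300",
]
print(parse_peak_header(header))
```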

TagDirectory

The HOMER tag directory submodule parses output files from HOMER tag directories, generating a number of diagnostic plots.

HTSeq

HTSeq is a general purpose Python package that provides infrastructure to process data from high-throughput sequencing assays. htseq-count is a tool that is part of the main HTSeq package - it takes a file with aligned sequencing reads, plus a list of genomic features and counts how many reads map to each feature.

MACS2

MACS2 (Model-based Analysis of ChIP-Seq) is a tool for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions.

The MACS2 MultiQC module reads the header of the *_peaks.xls results files and prints the redundancy rates in the General Statistics table. Numerous additional values are parsed and saved to multiqc_data/multiqc_macs2.txt.

methylQA

The methylQA module parses results generated by methylQA, a methylation sequencing data quality assessment tool.

Peddy

Peddy compares familial-relationships and sexes as reported in a PED file with those inferred from a VCF.

It samples the VCF at about 25000 sites (plus chrX) to accurately estimate relatedness, IBS0, heterozygosity, sex and ancestry. It uses 2504 samples from the 1000 Genomes Project as a background set to calibrate the relatedness calculation and to make ancestry predictions.

It does this very quickly by sampling, by using C for computationally intensive parts, and by parallelization.

Picard

The Picard module parses results generated by Picard, a set of Java command line tools for manipulating high-throughput sequencing data.

Supported commands:

  • MarkDuplicates
  • InsertSizeMetrics
  • GcBiasMetrics
  • HsMetrics
  • OxoGMetrics
  • BaseDistributionByCycle
  • RnaSeqMetrics
  • AlignmentSummaryMetrics
  • RrbsSummaryMetrics

Coverage Levels

It's possible to customise the HsMetrics "Target Bases 30X" coverage and WgsMetrics "Fraction of Bases over 30X" values that are shown in the General Statistics table. These must correspond to field names in the Picard report, such as PCT_TARGET_BASES_2X / PCT_10X. Any numbers not found in the reports will be ignored.

The coverage levels available for HsMetrics are typically 1, 2, 10, 20, 30, 40, 50 and 100X.

The coverage levels available for WgsMetrics are typically 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90 and 100X.

To customise this, add the following to your MultiQC config:

picard_config:
    general_stats_target_coverage:
        - 10
        - 50

Preseq

The Preseq module parses results generated by Preseq, a tool that estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing total read count.

When preseq lc_extrap is run with the default parameters, the extrapolation points reach 10 billion molecules, making the plot difficult to interpret in most scenarios. It also includes a lot of data in the reports, which can unnecessarily inflate report file sizes. To avoid this, MultiQC trims back the x axis until each dataset shows 80% of its maximum y-value (unique molecules).
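The trimming rule can be sketched like this. It's an illustrative example of the logic described above, not MultiQC's actual code: points are kept until a curve first reaches 80% of its maximum unique-molecule count, discarding the long flat tail.

```python
# Illustrative sketch (not MultiQC's actual code): trim a preseq curve
# once it reaches 80% of its maximum y-value.
def trim_curve(points, frac=0.8):
    """points: list of (total_reads, unique_molecules) pairs, sorted by x."""
    cutoff = frac * max(y for _, y in points)
    trimmed = []
    for x, y in points:
        trimmed.append((x, y))
        if y >= cutoff:
            break  # everything beyond this is the flat tail
    return trimmed

curve = [(1e6, 9e5), (1e7, 5e6), (1e8, 9e6), (1e9, 9.9e6), (1e10, 1e7)]
print(trim_curve(curve))
```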

To disable this feature and show all of the data, add the following to your MultiQC configuration:

preseq:
    notrim: true

Using coverage instead of read counts

Preseq reports its numbers as "Molecule counts". This isn't always very intuitive, and it's often easier to talk about sequencing depth in terms of coverage.

You can plot with approximate coverage on the axes instead by specifying the reference genome or target size, and the read length in your MultiQC configuration:

preseq:
    genome_size: 3049315783
    read_length: 300

These parameters make the script take every molecule count and divide it by (genome_size / read_length).
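As a worked example of this conversion (using the genome size and read length from the config snippet above):

```python
# Worked example of the conversion described above: a molecule count is
# divided by (genome_size / read_length) to give approximate fold coverage.
genome_size = 3049315783   # value from the config example above
read_length = 300

def counts_to_coverage(molecule_count):
    return molecule_count / (genome_size / read_length)

# ~10 billion molecules of 300 bp over a human-sized genome works out to
# roughly 980-fold coverage:
print(counts_to_coverage(10e9))
```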

MultiQC comes with effective genome size presets for Human and Mouse, so you can provide the genome build name instead, like this: genome_size: hg38_genome. The following values are supported: hg19_genome, hg38_genome, mm10_genome.

Plotting externally calculated read counts

To mark on the plot the read counts calculated externally from BAM or fastq files, create a file with preseq_real_counts in the filename and place it with your analysis files. It should be space or tab delimited with 2 or 3 columns (column 1 = preseq file name, column 2 = real read count, optional column 3 = real unique read count). For example:

Sample_1.preseq.txt 3638261 3638011
Sample_2.preseq.txt 1592394 1592133
[...]

You can generate a line for such a file using samtools:

echo "Sample_1.preseq.txt "$(samtools view -c -F 4 Sample_1.bam)" "$(samtools view -c -F 1028 Sample_1.bam)

Prokka

The Prokka module analyses summary results from the Prokka annotation pipeline for prokaryotic genomes. The Prokka module accepts three configuration options:

  • prokka_table: default False. Show a table in the report.
  • prokka_barplot: default True. Show a barplot in the report.
  • prokka_fn_snames: default False. Use filenames for sample names (see below).

Sample names are generated using the first line in the prokka reports:

organism: Helicobacter pylori Sample1

The module assumes that the first two words are the organism name and the third is the sample name. So the above will give a sample name of Sample1.

If you prefer, you can set config.prokka_fn_snames to True and MultiQC will instead use the log filename as the sample name.
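The naming rule described above can be sketched as follows. This is an illustrative example, not MultiQC's actual code:

```python
# Illustrative sketch of the rule described above (not MultiQC's actual
# code): the first two words after "organism:" are the organism name,
# anything remaining is the sample name.
def prokka_sample_name(organism_line):
    words = organism_line.split(":", 1)[1].split()
    organism = " ".join(words[:2])   # e.g. "Helicobacter pylori"
    sample = " ".join(words[2:])     # e.g. "Sample1"
    return organism, sample

print(prokka_sample_name("organism: Helicobacter pylori Sample1"))
```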

QoRTs

The QoRTs software package is a fast, efficient, and portable multifunction toolkit designed to assist in the analysis, quality control, and data management of RNA-Seq datasets. Its primary function is to aid in the detection and identification of errors, biases, and artifacts produced by paired-end high-throughput RNA-Seq technology. In addition, it can produce count data designed for use with differential expression and differential exon usage tools, as well as individual-sample and/or group-summary genome track files suitable for use with the UCSC genome browser.

Qualimap

The Qualimap module parses results generated by Qualimap, a platform-independent application to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

The MultiQC module supports the Qualimap commands BamQC and RNASeq. Note that Qualimap must be run with the -outdir option as well as -outformat HTML (which is on by default). MultiQC uses files found within the raw_data_qualimapReport folder (as well as genome_results.txt).

It is possible to customise which coverage thresholds are shown from BamQC in the General Statistics table (default: 1, 5, 10, 30, 50) and which of these are hidden when the report loads (default: all except 30X).

To do this, add something like the following to your MultiQC config file:

qualimap_config:
    general_stats_coverage:
        - 10
        - 20
        - 40
        - 200
        - 30000
    general_stats_coverage_hidden:
        - 10
        - 20
        - 200

QUAST

QUAST evaluates genome assemblies by computing various metrics, including

  • N50, length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
  • NG50, where length of the reference genome is being covered
  • NA50 and NGA50, where aligned blocks instead of contigs are taken
  • Misassemblies, misassembled and unaligned contigs or contigs bases
  • Genes and operons covered

The QUAST MultiQC module parses the report.tsv files generated by QUAST and adds key metrics to the report General Statistics table. All statistics for all samples are saved to multiqc_data/multiqc_quast.txt.

MetaQUAST

The QUAST module will also parse output from MetaQUAST runs (metaquast.py).

The combined_reference/report.tsv file is parsed, and folders runs_per_reference and not_aligned are ignored.

If you want to run MultiQC against auxiliary MetaQUAST runs, you must explicitly pass these files to MultiQC:

multiqc runs_per_reference/reference_1/report.tsv

Note that you can pass as many file paths to MultiQC as you like and use glob expansion (eg. runs_per_reference/*/report.tsv).

RNA-SeQC

The RNA-SeQC module parses results generated by RNA-SeQC (not to be confused with RSeQC, which MultiQC also supports). RNA-SeQC is a Java program which computes a series of quality control metrics for RNA-seq data.

This module shows the Spearman correlation heatmap if both Spearman and Pearson's are found. To plot Pearson's by default instead, add the following to your MultiQC config file:

rna_seqc:
    default_correlation: pearson

RSEM

The RSEM module parses results generated by RSEM, a software package for estimating gene and isoform expression levels from RNA-Seq data.

Supported scripts:

  • rsem-calculate-expression

This module searches for the .cnt file created by RSEM inside the directory named PREFIX.stat.

RSeQC

The RSeQC module parses results generated by RSeQC, a package that provides a number of useful modules that can comprehensively evaluate high throughput RNA-seq data.

Supported scripts:

  • bam_stat
  • gene_body_coverage
  • infer_experiment
  • inner_distance
  • junction_annotation
  • junction_saturation
  • read_distribution
  • read_duplication
  • read_gc

You can choose to hide sections of RSeQC output and customise their order. To do this, add and customise the following to your MultiQC config file:

rseqc_sections:
    - read_distribution
    - gene_body_coverage
    - inner_distance
    - read_gc
    - read_duplication
    - junction_annotation
    - junction_saturation
    - infer_experiment
    - bam_stat

Change the order to rearrange sections, or remove entries to hide those sections from the report.

Samblaster

The Samblaster module parses results generated by Samblaster, a tool to mark duplicates and extract discordant and split reads from sam files.

Samtools

The Samtools module parses results generated by Samtools, a suite of programs for interacting with high-throughput sequencing data.

Supported commands:

  • stats
  • flagstat
  • idxstats
  • rmdup

idxstats

samtools idxstats prints its results to standard out (so there is no consistent file name) and produces no header lines (so the files cannot be recognised from their content). As such, idxstats result files must have the string idxstat somewhere in the filename.

There are a few MultiQC config options that you can add to customise how the idxstats module works. A typical configuration could look as follows:

# Always include these chromosomes in the plot
samtools_idxstats_always:
    - X
    - Y

# Never include these chromosomes in the plot
samtools_idxstats_ignore:
    - MT

# Threshold where chromosomes are ignored in the plot.
# Should be a fraction, default is 0.001 (0.1% of total)
samtools_idxstats_fraction_cutoff: 0.001

# Name of the X and Y chromosomes.
# If not specified, MultiQC will search for any chromosome
# names that look like x, y, chrx or chry (case insensitive search)
samtools_idxstats_xchr: myXchr
samtools_idxstats_ychr: myYchr

Slamdunk

Slamdunk is a tool to analyze data from the SLAM-Seq sequencing protocol.

This module should be able to parse logs from v0.2.2-dev onwards.

SnpEff

The SnpEff module parses results generated by SnpEff, a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

MultiQC parses the summary .csv file that is generated by SnpEff. Note that you must run SnpEff with -csvStats <filename> for this to be generated. See the SnpEff documentation for more information.

Supernova

Important notes

Due to the size of the histogram_kmer_count.json files, MultiQC is likely to skip these files. To be able to display these you will need to change the MultiQC configuration to allow for larger logfiles, see the MultiQC documentation. For instance, if you run MultiQC as part of an analysis pipeline, you can create a multiqc_config.yaml file in the working directory, containing the following line:

log_filesize_limit: 100000000

General Notes

The Supernova module parses the reports from an assembly run. As a bare minimum it requires the file report.txt, found in the folder sampleID/outs/, to function. Note! If you are anything like the author (@remiolsen), you might only have renamed report files (e.g. sampleID-report.txt) lying around due to disk space limitations and for ease of sharing with your colleagues. This module will therefore search for *report*.txt.

If available, the stats in the report file will be superseded by the higher precision numbers found in the file sampleID/outs/assembly/stats/summary.json. In the same folder, this module will search for the following plots and render them:

  • histogram_molecules.json -- Inferred molecule lengths
  • histogram_kmer_count.json -- Kmer multiplicity

This module has been tested using Supernova versions 1.1.4 and 1.2.0.

THeTA2

THeTA2 (Tumor Heterogeneity Analysis) is an algorithm that estimates the tumour purity and clonal / subclonal copy number aberrations directly from high-throughput DNA sequencing data.

The THeTA2 MultiQC module plots the % germline and % tumour subclone for each sample. Note that each sample can have multiple maximum likelihood solutions - the MultiQC module plots proportions for the first one in the results file (*.BEST.results). Also note that if there are more than 5 tumour subclones, their percentages are summed.

VCFTools

Important General Note

  • Depending on the size and density of the variant data (vcf), some of the stat files generated by vcftools can be very large. If you find that some of your input files are missing, increase the config.log_filesize_limit so that the large file(s) will not be skipped by MultiQC. Note, however, that this might make MultiQC very slow!

This module parses the outputs from VCFTools' various commands:

Implemented

  • relatedness2
    • Plots a heatmap of pairwise sample relatedness.
    • Not to be confused with the similarly-named command relatedness
  • TsTv-by-count
    • Plots the transition to transversion ratio as a function of alternative allele count (using only bi-allelic SNPs).
  • TsTv-by-qual
    • Plots the transition to transversion ratio as a function of SNP quality threshold (using only bi-allelic SNPs).
  • TsTv-summary
    • Plots a bargraph of the summary counts of each type of transition and transversion SNPs.

To do

VCFTools has a number of outputs not yet supported in MultiQC which would be good to add. Please check GitHub if you'd like these added or (better still) would like to contribute!

Custom Content

WARNING - This feature is new and is very much in a beta status. It is expected to be further developed in future releases, which may break backwards compatibility. There are also probably quite a few bugs. Use at your own risk! Please report bugs or missing functionality as a new GitHub issue.

Introduction

Bioinformatics projects often include non-standardised analyses, with results from custom scripts or in-house packages. It can be frustrating to have a MultiQC report describing results from 90% of your pipeline but missing the final key plot. To help with this, MultiQC has a special "custom content" module.

Custom content parsing is a little more restricted than standard modules. Specifically:

  • Only one plot per section is possible
  • Plot customisation is more limited

All plot types can be generated using custom content - see the test files for examples of how data should be structured.

Configuration

Order of sections

If you have multiple different Custom Content sections, their order will be random and may vary between runs. To avoid this, you can specify an order in your MultiQC config as follows:

custom_content:
  order:
    - first_cc_section
    - second_cc_section

Each section name should be the ID assigned to that section. You can explicitly set this (see below), or the Custom Content module will automatically assign an ID. To find out what your custom content section ID is, generate a report and click your section in the side navigation. The browser URL should update and show something that looks like this:

multiqc_report.html#my_cc_section

The section ID is the part after the # (my_cc_section in the example above).

Note that any Custom Content sections found that are not specified in the config will be placed at the top of the report.

Section configuration

See below for how these config options can be specified (either within the data file or in a MultiQC config file). All of these configuration parameters are optional, and MultiQC will do its best to guess sensible defaults if they are not specified.

All possible configuration keys and their default values are shown below:

id: null                # Unique ID for report section.
section_anchor: <id>    # Used in report section #soft-links
section_name: <id>      # Nice name used for the report section header
section_href: null      # External URL for the data, to find more information
description: null       # Introductory text to be printed under the section header
file_format: null       # File format of the data (typically csv / tsv - see below for more information)
plot_type: null         # The plot type to visualise the data with.
                        # - Possible options: generalstats | table | bargraph | linegraph | scatter | heatmap | beeswarm
pconfig: {}             # Configuration for the plot. See http://multiqc.info/docs/#plotting-functions

Note that any custom content data found with the same section id will be merged into the same report section / plot. The other section configuration keys are merged for each file, with identical keys overwriting what was previously parsed.

This approach means that it's possible to have a single file containing data for multiple samples, but it's also possible to have one file per sample and still have all of them summarised.
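The merging behaviour described above can be sketched as follows. This is an illustrative example, not MultiQC's actual implementation: data keyed by sample name accumulates across files sharing a section id, while config keys from later files overwrite earlier ones.

```python
# Illustrative sketch (not MultiQC's actual code) of the merge behaviour:
# files sharing a section id contribute to one section; identical config
# keys overwrite what was previously parsed.
def merge_custom_content(sections, parsed_file):
    sec = sections.setdefault(parsed_file["id"], {"config": {}, "data": {}})
    sec["config"].update(parsed_file.get("config", {}))
    sec["data"].update(parsed_file.get("data", {}))  # keyed by sample name
    return sections

sections = {}
merge_custom_content(sections, {"id": "cov", "config": {"section_name": "Coverage"},
                                "data": {"sample_a": 12}})
merge_custom_content(sections, {"id": "cov", "data": {"sample_b": 8}})
print(sections["cov"]["data"])
```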

If you're using plot_type: 'generalstats' then a report section will not be created and most of the configuration keys above are ignored.

Data types generalstats and beeswarm are only possible by setting the above configuration keys (these can't be guessed by data format).

Data formats

MultiQC can parse custom data from a few different sources, in a number of different formats. Which one you use depends on how the data is being produced.

A quick summary of which approach to use looks something like this:

  • Additional data when already using custom MultiQC config files
    • Data as part of MultiQC config
  • Data specifically for MultiQC from a custom script
    • MultiQC-specific data file
  • Data from a custom script which is also used by other processes
    • Separate configuration and data files
    • Add _mqc.txt to the filename and hope that MultiQC guesses correctly
  • Anything more complicated, or data from a released tool
    • Write a proper MultiQC module instead.

For more complete examples of the data formats understood by MultiQC, please see the data/custom_content directory in the MultiQC_TestData GitHub repository.

Data from a released tool

If your data comes from a released bioinformatics tool, you shouldn't be using this feature of MultiQC! Sure, you can probably get it to work, but it's better if a fully-fledged core MultiQC module is written instead. That way, other users of MultiQC can also benefit from results parsing.

Note that proper MultiQC modules are more robust and powerful than this custom-content feature. You can also write modules in MultiQC plugins if they're not suitable for general release.

Data as part of MultiQC config

If you are already using a MultiQC config file to add data to your report (for example, titles / introductory text), you can give data within this file too. This can be in any MultiQC config file (for example, passed on the command line with -c my_yaml_file.yaml). This is useful as you can keep everything contained within a single file (including stuff unrelated to this specific custom content feature of MultiQC).

If you're not using this file for other MultiQC configuration, you're probably better off using a stand-alone YAML file (see section below).

To be understood by MultiQC, the custom_data key must be found. This must contain a section with a unique id, specific to your new report section. This in turn must contain a section called data. Other configuration keys can be held alongside this. For example:

# Other MultiQC config stuff here
custom_data:
    my_data_type:
        id: 'mqc_config_file_section'
        section_name: 'My Custom Section'
        description: 'This data comes from a single multiqc_config.yaml file'
        plot_type: 'bargraph'
        pconfig:
            id: 'barplot_config_only'
            title: 'MultiQC Config Data Plot'
            ylab: 'Number of things'
        data:
            sample_a:
                first_thing: 12
                second_thing: 14
            sample_b:
                first_thing: 8
                second_thing: 6
            sample_c:
                first_thing: 11
                second_thing: 5
            sample_d:
                first_thing: 12
                second_thing: 9

Or to add data to the General Statistics table:

custom_data:
    my_genstats:
        plot_type: 'generalstats'
        pconfig:
            - col_1:
                max: 100
                min: 0
                scale: 'RdYlGn'
                suffix: '%'
            - col_2:
                min: 0
        data:
            sample_a:
                col_1: 14.32
                col_2: 1.2
            sample_b:
                col_1: 84.84
                col_2: 1.9

Note: Use a list of headers in pconfig (keys prepended with -) to specify the order of columns in the table.

See the general statistics docs for more information about configuring data for the General Statistics table.

MultiQC-specific data file

If you can choose exactly how your data output looks, then the easiest way to parse it is to use a MultiQC-specific format. If the filename ends in *_mqc.(yaml|json|txt|csv|out) then it will be found by any standard MultiQC installation with no additional customisation required (v0.9 onwards).
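The filename pattern above can be expressed as a regular expression. This is an illustrative sketch of the matching rule, not MultiQC's actual search code:

```python
import re

# Sketch of the filename rule described above: files ending in _mqc.<ext>
# with one of the listed extensions are picked up automatically.
MQC_PATTERN = re.compile(r"_mqc\.(yaml|json|txt|csv|out)$")

def is_custom_content_file(filename):
    return bool(MQC_PATTERN.search(filename))

print(is_custom_content_file("pca_results_mqc.yaml"))  # matches
print(is_custom_content_file("pca_results.yaml"))      # no _mqc suffix
```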

These files contain configuration information specifying how the data should be parsed, alongside the data itself. If using YAML, this looks just the same as in a MultiQC config file (see above), but without having to be within a custom_data section:

id: 'my_pca_section'
section_name: 'PCA Analysis'
description: 'This plot shows the first two components from a principal component analysis.'
plot_type: 'scatter'
pconfig:
    id: 'pca_scatter_plot'
    title: 'PCA Plot'
    xlab: 'PC1'
    ylab: 'PC2'
data:
    sample_1: {x: 12, y: 14}
    sample_2: {x: 8, y: 6 }
    sample_3: {x: 5, y: 11}
    sample_4: {x: 9, y: 12}

The file format can also be JSON:

{
    "id": "custom_data_lineplot",
    "section_name": "Custom JSON File",
    "description": "This plot is a self-contained JSON file.",
    "plot_type": "linegraph",
    "pconfig": {
        "id": "custom_data_linegraph",
        "title": "Output from my JSON file",
        "ylab": "Number of things",
        "xDecimals": false
    },
    "data": {
        "sample_1": { "1": 12, "2": 14, "3": 10, "4": 7, "5": 16 },
        "sample_2": { "1": 9, "2": 11, "3": 15, "4": 18, "5": 21 }
    }
}

If you want the data to be easy to use with other tools, you can also use comma-separated or tab-separated files. To customise plot output, include commented header lines with plot configuration in YAML format:

# title: 'Output from my script'
# description: 'This output is described in the file header. Any MultiQC installation will understand it without prior configuration.'
# section: 'Custom Data File'
# format: 'tsv'
# plot_type: 'bargraph'
# pconfig:
#    id: 'custom_bargraph_w_header'
#    ylab: 'Number of things'
Category_1    374
Category_2    229
Category_3    39
Category_4    253

If no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code. Something will probably be shown, but it may produce unexpected results.

Separate configuration and data files

It's not always possible or desirable to include MultiQC configuration within a data file. If this is the case, you can add to the MultiQC configuration to specify how input files should be parsed.

As described in the above Data as part of MultiQC config section, this configuration should be held within a section called custom_data with a section-specific id. The only difference is that no data subsection is given and a search pattern for the given id must be supplied.

Search patterns are added as with any other module. Ensure that the search pattern key is the same as your custom_data section ID.

For example:

# Other MultiQC config stuff here
custom_data:
    example_files:
        file_format: 'tsv'
        section_name: 'Coverage Decay'
        description: 'This plot comes from files accompanied by a multiqc_config.yaml file for configuration'
        plot_type: 'linegraph'
        pconfig:
            id: 'example_coverage_lineplot'
            title: 'Coverage Decay'
            ylab: 'X Coverage'
            ymax: 100
            ymin: 0
sp:
    example_files:
        fn: 'example_files_*'

A data file within the MultiQC search directories could then simply look like this:

example_files_Sample_1.txt:

0   98.22076066
1   97.96764159
2   97.78227175
3   97.61262195
[...]

As mentioned above - if no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code.

Coding with MultiQC

Writing New Modules

Introduction

Writing a new module can at first seem a daunting task. However, MultiQC has been written (and refactored) to provide a lot of functionality as common functions.

Provided that you are familiar with writing Python and you have a read through the guide below, you should be on your way in no time!

If you have any problems, feel free to contact the author - details here: @ewels

Core modules / plugins

New modules can either be written as part of MultiQC or in a stand-alone plugin. If your module is for a publicly available tool, please add it to the main program and contribute your code back when complete via a pull request.

If your module is for something very niche, which no-one else can use, you can write it as part of a custom plugin. The process is almost identical, though it keeps the code bases separate. For more information about this, see the docs about MultiQC Plugins below.

Initial setup

Submodule

MultiQC modules are Python submodules - as such, they need their own directory in /multiqc/ with an __init__.py file. The directory should share its name with the module. To follow common practice, the module code usually then goes in a separate python file (also with the same name) which is then imported by __init__.py:

from __future__ import absolute_import
from .modname import MultiqcModule

Entry points

Once your submodule files are in place, you need to tell MultiQC that they are available as an analysis module. This is done within setup.py using entry points. In setup.py you will see some code that looks like this:

entry_points = {
    'multiqc.modules.v1': [
        'bismark = multiqc.modules.bismark:MultiqcModule',
        [...]
    ]
}

Copy one of the existing module lines and change it to use your module name. The order is irrelevant, so stick to alphabetical if in doubt. Once this is done, you will need to update your installation of MultiQC:

python setup.py develop

MultiQC config

So that MultiQC knows what order modules should be run in, you need to add your module to the core config file.

In multiqc/utils/config_defaults.yaml you should see a list variable called module_order. This contains the name of modules in order of precedence. Add your module here in an appropriate position.

Documentation

Next up, you need to create a documentation file for your module. The reason for this is twofold: firstly, docs are important to help people to use, debug and extend MultiQC (you're reading this, aren't you?). Secondly, having the file there with the appropriate YAML front matter will make the module show up on the MultiQC homepage so that everyone knows it exists. This process is automated once the file is added to the core repository.

This docs file should be placed in docs/modules/<your_module_name>.md and should have the following structure:

---
Name: Tool Name
URL: http://www.amazing-bfx-tool.com
Description: >
    This amazing tool does some really cool stuff. You can describe it
    here and split onto multiple lines if you want. Not too long though!
---

Your documentation goes here. Feel free to use markdown and write whatever
you think would be helpful. Please avoid using heading levels 1 to 3.

Make a reference to this in the YAML frontmatter at the top of docs/README.md - this allows the website to find the file to build the documentation.

Readme and Changelog

Last but not least, remember to add your new module to the main README.md file and CHANGELOG.md, so that people know that it's there. Feel free to add your name to the list of credits at the bottom of the readme.

MultiqcModule Class

If you've copied one of the other entry point statements, it will have ended in :MultiqcModule - this tells MultiQC to try to execute a class or function called MultiqcModule.

To use the helper functions bundled with MultiQC, you should extend this class from multiqc.modules.base_module.BaseMultiqcModule. This will give you access to a number of functions on the self namespace. For example:

from multiqc.modules.base_module import BaseMultiqcModule

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # Initialise the parent object
        super(MultiqcModule, self).__init__(name='My Module', anchor='mymod',
        href="http://www.awesome_bioinfo.com/my_module",
        info="is an example analysis module used for writing documentation.")

Ok, that should be it! The __init__() function will now be executed every time MultiQC runs. Try adding a print("Hello World!") statement and see if it appears in the MultiQC logs at the appropriate time...

Note that the __init__ variables are used to create the header, URL link, analysis module credits and description in the report.

Logging

Last thing - MultiQC modules have a standardised way of producing output, so you shouldn't really use print() statements for your Hello World ;)

Instead, use the logger module as follows:

import logging
log = logging.getLogger(__name__)
# Initialise your class and so on
log.info('Hello World!')

Log messages can come in a range of formats:

  • log.debug
    • These only show if MultiQC is run in -v/--verbose mode
  • log.info
    • For more important status updates
  • log.warning
    • Alert user about problems that don't halt execution
  • log.error and log.critical
    • Not often used, these are for show-stopping problems

Step 1 - Find log files

The first thing that your module will need to do is to find analysis log files. You can do this by searching for a filename fragment, or a string within the file. It's possible to search for both (a match on either will return the file), and also to supply multiple possible patterns.

First, add your default patterns to:

MULTIQC_ROOT/multiqc/utils/search_patterns.yaml

Each search has a yaml key, with one or more search criteria.

The yaml key must begin with the name of your module. If you have multiple search patterns for a single module, follow the module name with a forward slash and then any string. For example, see the fastqc module search patterns:

fastqc/data:
    fn: 'fastqc_data.txt'
fastqc/zip:
    fn: '_fastqc.zip'

The following search criteria sub-keys can then be used:

  • fn
    • A glob filename pattern, used with the Python fnmatch function
  • fn_re
    • A regex filename pattern
  • contents
    • A string to match within the file contents (checked line by line)
  • contents_re
    • A regex to match within the file contents (checked line by line)
    • NB: the regex must match the entire line (add .* to the start and end of the pattern if you only want to match part of it)
  • num_lines
    • The number of lines to search through for the contents string. Default: all lines.
  • shared
    • By default, once a file has been assigned to a module it is not searched again. Specify shared: true when your file can be shared between multiple tools (for example, part of a stdout stream).
  • max_filesize
    • Files larger than the log_filesize_limit config value (default: 10MB) are always skipped. If you know your files will be smaller than this and you need to search by contents, you can specify a value here (in bytes): any files larger than this limit will be skipped.

Please try to use num_lines and max_filesize where possible as they will speed up MultiQC execution time.

For example, two typical modules could specify search patterns as follows:

mymod:
    fn: '_myprogram.txt'
myothermod:
    contents: 'This is myprogram v1.3'

You can also supply a list of different patterns for a single log file type if needed. If any of the patterns are matched, the file will be returned:

mymod:
    - fn: 'mylog.txt'
    - fn: 'different_fn.out'

You can use AND logic by specifying keys within a single list item. For example:

mymod:
    fn: 'mylog.txt'
    contents: 'mystring'
myothermod:
    - fn: 'different_fn.out'
      contents: 'This is myprogram v1.3'
    - fn: 'another.txt'
      contents: 'What are these files anyway?'

Here, a file must have the filename mylog.txt and contain the string mystring.

Remember that users can overwrite these defaults in their own config files. This is helpful as people have weird and wonderful processing pipelines with their own conventions.

Once your strings are added, you can find files in your module with the base function self.find_log_files(), using the key you set in the YAML:

self.find_log_files('mymod')

This function yields a dictionary with various information about each matching file. The f key contains the contents of the matching file:

# Find all files for mymod
for myfile in self.find_log_files('mymod'):
    print( myfile['f'] )       # File contents
    print( myfile['s_name'] )  # Sample name (from cleaned filename)
    print( myfile['fn'] )      # Filename
    print( myfile['root'] )    # Directory file was in

If filehandles=True is specified, the f key contains a file handle instead:

for f in self.find_log_files('mymod', filehandles=True):
    # f['f'] is now a filehandle instead of contents
    for l in f['f']:
        print( l )

This is good if the file is large, as Python doesn't read the entire file into memory in one go.

Step 2 - Parse data from the input files

What most MultiQC modules do once they have found matching analysis files is to pass the matched file contents to another function, responsible for parsing the data from the file. How this parsing is done will depend on the format of the log file and the type of data being read. See below for a basic example, based loosely on the preseq module:

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # [...]
        self.mod_data = dict()
        for f in self.find_log_files('mymod'):
            self.mod_data[f['s_name']] = self.parse_logs(f['f'])

    def parse_logs(self, f):
        data = {}
        for l in f.splitlines():
            s = l.split()
            data[s[0]] = s[1]
        return data
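Real log files are rarely this clean. As a sketch (written here as a standalone function rather than a class method, so it is easy to test), the parser can skip blank or malformed lines and convert numeric values so that downstream plotting functions receive numbers:

```python
def parse_logs(f):
    """Parse whitespace-separated key / value pairs from file contents.

    Malformed lines are skipped, and values are converted to floats
    where possible (non-numeric values are kept as strings).
    """
    data = {}
    for l in f.splitlines():
        s = l.split()
        if len(s) < 2:
            continue  # Skip blank or malformed lines
        try:
            data[s[0]] = float(s[1])
        except ValueError:
            data[s[0]] = s[1]  # Keep non-numeric values as strings
    return data
```
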

Filtering by parsed sample names

MultiQC users can use the --ignore-samples flag to skip sample names that match specific patterns. As sample names are generated in a different way by every module, this filter has to be applied after log parsing.

There is a core function to do this task - assuming that your data is in a dictionary with the first key as sample name, pass it through the self.ignore_samples function as follows:

self.yourdata = self.ignore_samples(self.yourdata)

This will remove any dictionary keys where the sample name matches a user pattern.

No files found

If your module cannot find any matching files, it needs to raise an exception of type UserWarning. This tells the core MultiQC program that no log files were found for this module. For example:

if len(self.mod_data) == 0:
    raise UserWarning

Note that this has to be raised as early as possible, so that it halts the module progress. For example, if no logs are found then the module should not create any files or try to do any computation.

Custom sample names

Typically, sample names are taken from cleaned log filenames (the default f['s_name'] value returned). However, if possible, it's better to use the name of the input file (allowing for concatenated log files). To do this, you should use the self.clean_s_name() function, as this will prepend the directory name if requested on the command line:

input_fname = s[3] # Or parsed however
s_name = self.clean_s_name(input_fname, f['root'])

This function has already been applied to the contents of f['s_name'].

self.clean_s_name() must be used on sample names parsed from the file contents. Without it, features such as prepending directories (--dirs) will not work.

Identical sample names

If modules find samples with identical names, then the previous sample is overwritten. It's good to print a log statement when this happens, for debugging. However, most of the time this is expected - for example, programs often write a log file and also print the same output to stdout.

if f['s_name'] in self.bowtie_data:
    log.debug("Duplicate sample name found! Overwriting: {}".format(f['s_name']))

Printing to the sources file

Finally, once you've found your files, you should add this information to the multiqc_sources.txt file in the MultiQC report data directory. This lists every sample name and the file from which its data came. This is especially useful when sample names are being overwritten, as it records which source was used. This code is typically written immediately after the above warning.

If you've used the self.find_log_files function, writing to the sources file is as simple as passing the log file variable to the self.add_data_source function:

for f in self.find_log_files('mymod'):
    self.add_data_source(f)

If you have different files for different sections of the module, or are customising the sample name, you can tweak the fields. The default arguments are as shown:

self.add_data_source(f=None, s_name=None, source=None, module=None, section=None)

Step 3 - Adding to the general statistics table

Now that you have your parsed data, you can start inserting it into the MultiQC report. At the top of every report is the 'General Statistics' table. This contains metrics from all modules, allowing cross-module comparison.

There is a helper function to add your data to this table. It can take a lot of configuration options, but most have sensible defaults. At its simplest, it works as follows:

data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.general_stats_addcols(data)

To give more informative table headers and configure things like data scales and colour schemes, you can supply an extra dict:

headers = OrderedDict()
headers['first_col'] = {
    'title': 'First',
    'description': 'My First Column',
    'scale': 'RdYlGn-rev'
}
headers['second_col'] = {
    'title': 'Second',
    'description': 'My Second Column',
    'max': 100,
    'min': 0,
    'scale': 'Blues',
    'suffix': '%'
}
self.general_stats_addcols(data, headers)

Here are all options for headers, with defaults:

headers['name'] = {
    'namespace': '',                # Module name. Auto-generated for General Statistics.
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'scale': 'GnBu',                # Colour scale for colour coding. Set to False to disable.
    'suffix': None,                 # Suffix for value (eg. '%')
    'format': '{:,.1f}',            # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False,                # Set to True to hide the column on page load
    'placement': 1000.0,            # Alter the default ordering of columns in the table
}
  • namespace
    • This prepends the column title in the mouse hover: Namespace: Title. It's automatically generated for the General Statistics table.
  • scale
    • Colour scales are the names of ColorBrewer palettes. See below for available scales.
    • Add -rev to the name of a colour scale to reverse it
    • Set to False to disable colouring and background bars
  • shared_key
    • Any string can be specified here, if other columns are found that share the same key, a consistent colour scheme and data scale will be used in the table. Typically this is set to things like read_count, so that the read count in a sample can be seen varying across analysis modules.
  • modify
    • A python lambda function to change the data in some way when it is inserted into the table.
  • hidden
    • Setting this to True will hide the column when the report loads. It can then be shown through the Configure Columns modal in the report. This is useful for columns that are only sometimes of interest. For example, some modules show "percentage aligned" on page load but hide "number of reads aligned".
  • placement
    • If you feel that the results from your module should appear at the left side of the table set this value less than 1000. Or to move the column right, set it greater than 1000. This value can be any float.

A typical use for modify is to divide large numbers such as read counts, to make them easier to interpret. If handling read counts, there are three config variables that should be used to allow users to change the multiplier for read counts: read_count_multiplier, read_count_prefix and read_count_desc. For example:

'title': '{} Reads'.format(config.read_count_prefix),
'description': 'Number of reads ({})'.format(config.read_count_desc),
'modify': lambda x: x * config.read_count_multiplier,

Similar config options apply for base pairs: base_count_multiplier, base_count_prefix and base_count_desc.
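Putting these together, a complete read-count column might be defined as in the sketch below. Note the assumptions: the three read_count_* values are hard-coded stand-ins for the corresponding MultiQC config settings (normally accessed via the config module), and total_reads is a hypothetical data key.

```python
from collections import OrderedDict

# Stand-ins for the MultiQC config values (assumptions for this sketch;
# in a real module these come from the MultiQC config module)
read_count_multiplier = 0.000001
read_count_prefix = 'M'
read_count_desc = 'millions'

headers = OrderedDict()
headers['total_reads'] = {
    'title': '{} Reads'.format(read_count_prefix),
    'description': 'Total number of reads ({})'.format(read_count_desc),
    'shared_key': 'read_count',   # Shared scale with other read count columns
    'scale': 'Blues',
    'modify': lambda x: x * read_count_multiplier,
}
```
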

The colour scales are from ColorBrewer2; see the ColorBrewer website for the available palette names.

A third parameter can be passed to this function, namespace. This is usually not needed - MultiQC automatically takes the name of the module that is calling the function and uses this. However, sometimes it can be useful to overwrite this.

Step 4 - Writing data to a file

In addition to printing data to the General Stats, MultiQC modules typically also write to text-files to allow people to easily use the data in downstream applications. This also gives the opportunity to output additional data that may not be appropriate for the General Statistics table.

Again, there is a base class function to help you with this - just supply it with a dictionary and a filename:

data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.write_data_file(data, 'multiqc_mymod')

If your output has a lot of columns, you can supply the additional argument sort_cols = True to have the columns alphabetically sorted.

This function will also pay attention to the default / command line supplied data format and behave accordingly. So the written file could be a tab-separated file (default), JSON or YAML.

Note that any keys with more than 2 levels of nesting will be ignored when being written to tab-separated files.

Step 5 - Create report sections

Great! It's time to start creating sections of the report with more information. To do this, use the self.add_section() helper function:

self.add_section (
    name = 'First Module Section',
    anchor = 'mymod-first',
    description = 'My amazing module output, from the first section',
    helptext = "If you're not sure _how_ to interpret the data, we can help!",
    plot = bargraph.plot(data)
)
self.add_section (
    name = 'Second Module Section',
    anchor = 'mymod-second',
    plot = linegraph.plot(data2)
)
self.add_section (
    content = '<p>Some custom HTML.</p>'
)

These will automatically be labelled and linked in the navigation (unless the module has only one section or name is not specified).

Note that description and helptext are processed as Markdown by default. This can be disabled by passing autoformat=False to the function.

Step 6 - Plot some data

Ok, you have some data, now the fun bit - visualising it! Each of the plot types is described in the Plotting Functions section of the docs.

Appendices

Profiling Performance

It's important that MultiQC runs quickly and efficiently, especially on big projects with large numbers of samples. The recommended method to check this is by using cProfile to profile the code execution. To do this, run MultiQC as follows:

python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc -f .

You can create a .bashrc alias to make this easier to run:

alias profile_multiqc='python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc '
profile_multiqc -f .

MultiQC should run as normal, but produce the additional binary file multiqc_profile.prof. This can then be visualised with software such as SnakeViz.

To install SnakeViz and visualise the results, do the following:

pip install snakeviz
snakeviz multiqc_profile.prof

A web page should open where you can explore the execution times of different nested functions. It's a good idea to run MultiQC with a comparable number of results from other tools (eg. FastQC) to have a reference to compare against for how long the code should take to run.

Adding Custom CSS / Javascript

If you would like module-specific CSS and / or JavaScript added to the template, just add to the self.css and self.js dictionaries that come with the BaseMultiqcModule class. The key should be the filename that you want your file to have in the generated report folder (this is ignored in the default template, which includes the content file directly in the HTML). The dictionary value should be the path to the desired file. For example, see how it's done in the FastQC module:

self.css = {
    'assets/css/multiqc_fastqc.css' :
        os.path.join(os.path.dirname(__file__), 'assets', 'css', 'multiqc_fastqc.css')
}
self.js = {
    'assets/js/multiqc_fastqc.js' :
        os.path.join(os.path.dirname(__file__), 'assets', 'js', 'multiqc_fastqc.js')
}

Plotting Functions

MultiQC plotting functions are held within multiqc.plots submodules. To use them, simply import the modules you want, eg.:

from multiqc.plots import bargraph, linegraph

Once you've done that, you will have access to the corresponding plotting functions:

bargraph.plot()
linegraph.plot()
scatter.plot()
table.plot()
beeswarm.plot()
heatmap.plot()

These have been designed to work in a similar manner to each other - you pass a data structure to them, along with optional extras such as categories and configuration options, and they return a string of HTML to add to the report. You can add this to the module introduction or sections as described above. For example:

self.add_section (
    name = 'Module Section',
    anchor = 'mymod_section',
    description = 'This plot shows some really nice data.',
    helptext = 'This longer string (can be **markdown**) helps explain how to interpret the plot',
    plot = bargraph.plot(self.parsed_data, categories, pconfig)
)

Bar graphs

Simple data can be plotted in bar graphs. Many MultiQC modules make use of stacked bar graphs. Here, the bargraph.plot() function comes to the rescue. A basic example is as follows:

from multiqc.plots import bargraph
data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
html_content = bargraph.plot(data)

To specify the order of categories in the plot, you can supply a list of dictionary keys. This can also be used to exclude a key from the plot.

cats = ['aligned', 'not_aligned']
html_content = bargraph.plot(data, cats)

If cats is given as a dict instead of a list, you can specify a nice name and a colour too. Make it an OrderedDict to specify the order:

from collections import OrderedDict
cats = OrderedDict()
cats['aligned'] = {
    'name': 'Aligned Reads',
    'color': '#8bbc21'
}
cats['not_aligned'] = {
    'name': 'Unaligned Reads',
    'color': '#f7a35c'
}

Finally, a third variable should be supplied with configuration variables for the plot. The defaults are as follows:

config = {
    # Building the plot
    'id': '<random string>',                # HTML ID used for plot
    'cpswitch': True,                       # Show the 'Counts / Percentages' switch?
    'cpswitch_c_active': True,              # Initial display with 'Counts' specified? False for percentages.
    'cpswitch_counts_label': 'Counts',      # Label for 'Counts' button
    'cpswitch_percent_label': 'Percentages', # Label for 'Percentages' button
    'logswitch': False,                     # Show the 'Log10' switch?
    'logswitch_active': False,              # Initial display with 'Log10' active?
    'logswitch_label': 'Log10',             # Label for 'Log10' button
    'hide_zero_cats': True,                 # Hide categories where data for all samples is 0
    # Customising the plot
    'title': None,                          # Plot title
    'xlab': None,                           # X axis label
    'ylab': None,                           # Y axis label
    'ymax': None,                           # Max y limit
    'ymin': None,                           # Min y limit
    'yCeiling': None,                       # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,                         # Minimum value for automatic axis limit
    'yMinRange': None,                      # Minimum range for axis
    'yDecimals': True,                      # Set to false to only show integer labels
    'ylab_format': None,                    # Format string for y axis labels. Defaults to {value}
    'stacking': 'normal',                   # Set to None to have category bars side by side
    'use_legend': True,                     # Show / hide the legend
    'click_func': None,                     # Javascript function to be called when a point is clicked
    'cursor': None,                         # CSS mouse cursor type.
    'tt_decimals': 0,                       # Number of decimal places to use in the tooltip number
    'tt_suffix': '',                        # Suffix to add after tooltip number
    'tt_percentages': True,                 # Show the percentages of each count in the tooltip
}
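Putting the three arguments together, a call might look like the sketch below. The id, title, category names and sample values are illustrative; the final line assumes bargraph has been imported from multiqc.plots as above, so it is shown commented out here:

```python
from collections import OrderedDict

# Hypothetical parsed data: counts per category, per sample
data = {
    'sample 1': {'aligned': 23542, 'not_aligned': 343},
    'sample 2': {'aligned': 1275, 'not_aligned': 7328},
}

# Categories, ordered, with display names and colours
cats = OrderedDict()
cats['aligned'] = {'name': 'Aligned Reads', 'color': '#8bbc21'}
cats['not_aligned'] = {'name': 'Unaligned Reads', 'color': '#f7a35c'}

# Minimal plot config: always set 'id' and 'title'
config = {
    'id': 'mymod_alignment_plot',
    'title': 'My Module: Alignment',
    'ylab': 'Number of Reads',
}

# html_content = bargraph.plot(data, cats, config)
```
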

The keys id and title should always be passed as a minimum. The id is used for the plot name when exporting. If left unset, the Plot Export panel will call the filename mqc_hcplot_gtucwirdzx.png (with some other random string). Plots should always have titles, especially as they can stand by themselves when exported. The title should have the format Modulename: Plot Name

Switching datasets

It's possible to have single plot with buttons to switch between different datasets. To do this, give a list of data objects (same formats as described above). Also add the following config options to supply names to the buttons:

config = {
    'data_labels': ['Reads', 'Bases']
}

You can also customise the y-axis label and min/max values for each dataset:

config = {
    'data_labels': [
        {'name': 'Reads', 'ylab': 'Number of Reads'},
        {'name': 'Bases', 'ylab': 'Number of Base Pairs', 'ymax':100}
    ]
}

If supplying multiple datasets, you can also supply a list of category objects. Make sure that they are in the same order as the data.

Categories should contain data keys, so if you're supplying a list of two datasets, you should supply a list of two sets of keys for the categories. MultiQC will try to guess categories from the data keys if categories are missing.

For example, with two datasets supplied as above:

cats = [
    ['aligned_reads','unaligned_reads'],
    ['aligned_base_pairs','unaligned_base_pairs'],
]

Or with additional customisation such as name and colour:

from collections import OrderedDict
cats = [OrderedDict(), OrderedDict()]
cats[0]['aligned_reads'] =        {'name': 'Aligned Reads',        'color': '#8bbc21'}
cats[0]['unaligned_reads'] =      {'name': 'Unaligned Reads',      'color': '#f7a35c'}
cats[1]['aligned_base_pairs'] =   {'name': 'Aligned Base Pairs',   'color': '#8bbc21'}
cats[1]['unaligned_base_pairs'] = {'name': 'Unaligned Base Pairs', 'color': '#f7a35c'}

Interactive / Flat image plots

Note that the bargraph.plot() function can generate both interactive JavaScript (HighCharts) powered report plots and flat image plots made using MatPlotLib. This choice is made within the function based on config variables such as the number of data series and command line flags.

Note that both plot types should come out looking pretty much identical. If you spot something that's missing in the flat image plots, let me know.

Line graphs

This base function works much like the above, but for two-dimensional data, to produce line graphs. It expects a dictionary in the following format:

from multiqc.plots import linegraph
data = {
    'sample 1': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    },
    'sample 2': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    }
}
html_content = linegraph.plot(data)

Additionally, a config dict can be supplied. The defaults are as follows:

from multiqc.plots import linegraph
config = {
    # Building the plot
    'smooth_points': None,       # Supply a number to limit number of points / smooth data
    'smooth_points_sumcounts': True, # Sum counts in bins, or average? Can supply list for multiple datasets
    'id': '<random string>',     # HTML ID used for plot
    'categories': False,         # Set to True to use x values as categories instead of numbers.
    'colors': dict(),            # Provide dict with keys = sample names and values colours
    'extra_series': None,        # See section below
    # Plot configuration
    'title': None,               # Plot title
    'xlab': None,                # X axis label
    'ylab': None,                # Y axis label
    'xCeiling': None,            # Maximum value for automatic axis limit (good for percentages)
    'xFloor': None,              # Minimum value for automatic axis limit
    'xMinRange': None,           # Minimum range for axis
    'xmax': None,                # Max x limit
    'xmin': None,                # Min x limit
    'xLog': False,               # Use log10 x axis?
    'xDecimals': True,           # Set to false to only show integer labels
    'yCeiling': None,            # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,              # Minimum value for automatic axis limit
    'yMinRange': None,           # Minimum range for axis
    'ymax': None,                # Max y limit
    'ymin': None,                # Min y limit
    'yLog': False,               # Use log10 y axis?
    'yDecimals': True,           # Set to false to only show integer labels
    'yPlotBands': None,          # Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    'xPlotBands': None,          # Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    'yPlotLines': None,          # Highlighted background lines. See http://api.highcharts.com/highcharts#yAxis.plotLines
    'xPlotLines': None,          # Highlighted background lines. See http://api.highcharts.com/highcharts#xAxis.plotLines
    'xLabelFormat': '{value}',   # Format string for the axis labels
    'yLabelFormat': '{value}',   # Format string for the axis labels
    'tt_label': '{point.x}: {point.y:.2f}', # Use to customise tooltip label, eg. '{point.x} base pairs'
    'pointFormat': None,         # Replace the default HTML for the entire tooltip label
    'click_func': None,          # Javascript function to be called when a point is clicked
    'cursor': None,              # CSS mouse cursor type. Defaults to pointer when 'click_func' specified
    'reversedStacks': False,     # Reverse the order of the category stacks. Defaults to True for plots with the Log10 option
}
html_content = linegraph.plot(data, config)

The keys id and title should always be passed as a minimum. The id is used for the plot name when exporting. If left unset, the Plot Export panel will call the filename mqc_hcplot_gtucwirdzx.png (with some other random string). Plots should always have titles, especially as they can stand by themselves when exported. The title should have the format Modulename: Plot Name

Switching datasets

You can also have a single plot with buttons to switch between different datasets. To do this, just supply a list of data dicts instead (same formats as described above). Also add the following config options to supply names to the buttons and graph labels:

config = {
    'data_labels': [
        {'name': 'DS 1', 'ylab': 'Dataset 1', 'xlab': 'x Axis 1'},
        {'name': 'DS 2', 'ylab': 'Dataset 2', 'xlab': 'x Axis 2'}
    ]
}

All of these config values are optional, the function will default to sensible values if things are missing. See the cutadapt module plots for an example of this in action.

Additional data series

Sometimes, it's good to be able to specify specific data series manually. To do this, use config['extra_series']. For a single extra line this can be a dict (as below). For multiple lines, use a list of dicts. For multiple dataset plots, use a list of list of dicts.

For example, to add a dotted x = y reference line:

from multiqc.plots import linegraph
config = {
    'extra_series': {
        'name': 'x = y',
        'data': [[0, 0], [max_x_val, max_y_val]],
        'dashStyle': 'Dash',
        'lineWidth': 1,
        'color': '#000000',
        'marker': { 'enabled': False },
        'enableMouseTracking': False,
        'showInLegend': False,
    }
}
html_content = linegraph.plot(data, config)
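For the multi-line case, extra_series becomes a list of dicts. A sketch (the series names, endpoint values and colours here are purely illustrative):

```python
# Two manually-specified reference series: a diagonal and a horizontal
# threshold line (endpoints are illustrative placeholders)
config = {
    'extra_series': [
        {
            'name': 'x = y',
            'data': [[0, 0], [100, 100]],
            'dashStyle': 'Dash',
            'lineWidth': 1,
            'color': '#000000',
            'showInLegend': False,
        },
        {
            'name': 'Threshold',
            'data': [[0, 50], [100, 50]],
            'dashStyle': 'Dot',
            'lineWidth': 1,
            'color': '#ff0000',
        },
    ]
}
```
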

Scatter Plots

Scatter plots work in almost exactly the same way as line plots. Most (if not all) config options are shared between the two. The data structure is similar but not identical:

from multiqc.plots import scatter
data = {
    'sample 1': {
        'x': '<x val>',
        'y': '<y val>'
    },
    'sample 2': {
        'x': '<x val>',
        'y': '<y val>'
    }
}
html_content = scatter.plot(data)

If you want more than one data point per sample, you can supply a list of dictionaries instead. You can also optionally specify point colours and sample name suffixes (these are appended to the sample name):

data = {
    'sample 1': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#a6cee3', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#1f78b4', 'name': 'Type 2' }
    ],
    'sample 2': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#b2df8a', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#33a02c', 'name': 'Type 2' }
    ]
}

Remember that MultiQC reports can contain large numbers of samples, so this plot type is not suitable for large quantities of data - 20,000 genes might look good for one sample, but when someone runs MultiQC with 500 samples, it will crash the browser and be impossible to interpret.

See the above docs about line plots for most config options. The scatter plot has a handful of unique ones in addition:

pconfig = {
    'marker_colour': 'rgba(124, 181, 236, .5)', # string, base colour of points (recommend rgba / semi-transparent)
    'marker_size': 5,               # int, size of points
    'marker_line_colour': '#999',   # string, colour of point border
    'marker_line_width': 1,         # int, width of point border
    'square': False                 # Force the plot to stay square? (Maintain aspect ratio)
}
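Putting the data structure and config together, a minimal sketch of a complete scatter plot call might look like this. The sample names, values, plot ID and titles below are invented for illustration; inside a MultiQC module the final (commented) line would do the actual plotting.

```python
# Sketch of a complete scatter plot call. All values here are invented.
# Inside a module you would first do: from multiqc.plots import scatter
data = {
    'sample 1': {'x': 41.5, 'y': 32.1},
    'sample 2': {'x': 38.2, 'y': 45.7},
}
pconfig = {
    'id': 'example_scatter',          # hypothetical plot ID
    'title': 'GC content vs coverage',
    'xlab': 'GC (%)',
    'ylab': 'Mean coverage',
    'marker_size': 5,
    'square': False,
}
# html_content = scatter.plot(data, pconfig)
```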

Creating a table

Tables should work just like the functions above (most like the bar graph function). As a minimum, the function takes a dictionary containing data - the first keys will be sample names (row headers) and each key contained within will be a table column header.

You can also supply a list of key names to restrict the data in the table to certain keys / columns. This also specifies the order that columns should be displayed in.

For more customisation, the headers can be supplied as a dictionary. Each key should match the keys used in the data dictionary, but values can customise the output. If you want to specify the order of the columns, you must use an OrderedDict.

Finally, the function accepts a config dictionary as a third parameter. This can set global options for the table (eg. a title) and can also hold default values to customise the output of all table columns.

The default header keys are:

single_header = {
    'namespace': '',                # Name for grouping in table
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'ceiling': None,                # Maximum value for automatic bar limit
    'floor': None,                  # Minimum value for automatic bar limit
    'minRange': None,               # Minimum range for automatic bar
    'scale': 'GnBu',                # Colour scale for colour coding. False to disable.
    'colour': '<auto>',             # Colour for column grouping
    'suffix': None,                 # Suffix for value (eg. '%')
    'format': '{:,.1f}',            # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False                 # Set to True to hide the column on page load
}

A third parameter can be specified with settings for the whole table:

table_config = {
    'namespace': '',                         # Module / section that table is in. Prepends header descriptions.
    'id': '<random string>',                 # ID used for the table
    'table_title': '<table id>',             # Title of the table. Used in the column config modal
    'save_file': False,                      # Whether to save the table data to a file
    'raw_data_fn': 'multiqc_<table_id>_table', # File basename to use for raw data file
    'sortRows': True,                        # Whether to sort rows alphabetically
    'col1_header': 'Sample Name',            # The header used for the first column
    'no_beeswarm': False    # Force a table to always be plotted (beeswarm by default if many rows)
}

Header keys such as max, min and scale can also be specified in the table config. These will then be applied to all columns.

Colour scales are taken from ColorBrewer2; any ColorBrewer2 palette name (eg. 'GnBu', 'Blues', 'OrRd', 'RdYlGn') can be given as the scale.

A very basic example is shown below:

from multiqc.plots import table
data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
    }
}
table_html = table.plot(data)

A more complicated version with ordered columns, defaults and column-specific settings:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
        'aligned_percent': 98.563952271
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
        'aligned_percent': 14.820411484
    }
}
from collections import OrderedDict
headers = OrderedDict()
headers['aligned_percent'] = {
    'title': '% Aligned',
    'description': 'Percentage of reads that aligned',
    'suffix': '%',
    'max': 100,
}
headers['aligned'] = {
    'title': '{} Aligned'.format(config.read_count_prefix),
    'description': 'Aligned Reads ({})'.format(config.read_count_desc),
    'shared_key': 'read_count',
    'modify': lambda x: x * config.read_count_multiplier
}
config = {
    'namespace': 'My Module',
    'min': 0,
    'scale': 'GnBu'
}
table_html = table.plot(data, headers, config)

Beeswarm plots (dot plots)

Beeswarm plots work from the exact same data structure as tables, so usage is just the same; the only difference is that instead of calling table, you call beeswarm:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
beeswarm_html = beeswarm.plot(data)

The function also accepts the same headers and config parameters.
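As a sketch, the header customisation shown for tables carries over directly. The column keys below match the example above; the titles and namespace are invented for illustration.

```python
# Sketch: beeswarm with the same headers / config structure as tables.
# Inside a module: from multiqc.plots import beeswarm
from collections import OrderedDict

data = {
    'sample 1': {'aligned': 23542, 'not_aligned': 343},
    'sample 2': {'aligned': 1275, 'not_aligned': 7328},
}
headers = OrderedDict()
headers['aligned'] = {'title': 'Aligned', 'description': 'Aligned reads'}
headers['not_aligned'] = {'title': 'Not aligned', 'description': 'Unaligned reads'}
config = {'namespace': 'My Module'}   # invented namespace
# beeswarm_html = beeswarm.plot(data, headers, config)
```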

Heatmaps

Heatmaps expect data in the structure of a list of lists. You also supply a list of category names for the x-axis, and optionally a second list for the y-axis (if omitted, the x-axis names are used for both).

heatmap.plot(data, xcats, ycats, pconfig)

A simple example:

hmdata = [
    [0.9, 0.87, 0.73, 0.6, 0.2, 0.3],
    [0.87, 1, 0.7, 0.6, 0.9, 0.3],
    [0.73, 0.8, 1, 0.6, 0.9, 0.3],
    [0.6, 0.8, 0.7, 1, 0.9, 0.3],
    [0.2, 0.8, 0.7, 0.6, 1, 0.3],
    [0.3, 0.8, 0.7, 0.6, 0.9, 1],
]
names = [ 'one', 'two', 'three', 'four', 'five', 'six' ]
hm_html = heatmap.plot(hmdata, names)
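If the y-axis categories differ from the x-axis (for example, samples against metrics rather than a symmetric correlation matrix), pass both lists. A small sketch with invented values:

```python
# Sketch: asymmetric heatmap - rows are samples, columns are metrics.
# Inside a module: from multiqc.plots import heatmap
hmdata = [
    [0.95, 0.12, 0.56],   # values for 'sample 1'
    [0.88, 0.44, 0.21],   # values for 'sample 2'
]
xcats = ['metric A', 'metric B', 'metric C']   # column labels
ycats = ['sample 1', 'sample 2']               # row labels
# hm_html = heatmap.plot(hmdata, xcats, ycats)
```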

Much like the other plots, you can change the way that the heatmap looks using a config dictionary:

pconfig = {
    'title': None,                 # Plot title
    'xTitle': None,                # X-axis title
    'yTitle': None,                # Y-axis title
    'min': None,                   # Minimum value (default: auto)
    'max': None,                   # Maximum value (default: auto)
    'square': True,                # Force the plot to stay square? (Maintain aspect ratio)
    'colstops': [],                # Scale colour stops. See below.
    'reverseColors': False,        # Reverse the order of the colour axis
    'decimalPlaces': 2,            # Number of decimal places for tooltip
    'legend': True,                # Colour axis key enabled or not
    'borderWidth': 0,              # Border width between cells
    'datalabels': True,            # Show values in each cell. Defaults to True when fewer than 20 samples.
    'datalabel_colour': '<auto>',  # Colour of text for values. Defaults to auto contrast.
}

The colour stops are a bit special and can be used to define a custom colour scheme. These should be defined as a list of lists, each containing a number between 0 and 1 and an HTML colour. The default is RdYlBu from ColorBrewer:

pconfig = {
    'colstops': [
        [0, '#313695'],
        [0.1, '#4575b4'],
        [0.2, '#74add1'],
        [0.3, '#abd9e9'],
        [0.4, '#e0f3f8'],
        [0.5, '#ffffbf'],
        [0.6, '#fee090'],
        [0.7, '#fdae61'],
        [0.8, '#f46d43'],
        [0.9, '#d73027'],
        [1, '#a50026'],
    ]
}

Javascript Functions

The javascript bundled in the default MultiQC template has a number of helper functions to make your life easier.

NB: The MultiQC Python functions make use of these, so it's very unlikely that you'll need to use any of this. But it's here for reference.

Plotting line graphs

plot_xy_line_graph (target, ds)

Plots a line graph with multiple series of (x,y) data pairs. Used by the linegraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'xy_line';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];

Multiple datasets can be added in the ['datasets'] array. The supplied variable ds specifies which is plotted (defaults to 0).

Available config options with default vars:

config = {
    title: undefined,            // Plot title
    xlab: undefined,             // X axis label
    ylab: undefined,             // Y axis label
    xCeiling: undefined,         // Maximum value for automatic axis limit (good for percentages)
    xFloor: undefined,           // Minimum value for automatic axis limit
    xMinRange: undefined,        // Minimum range for axis
    xmax: undefined,             // Max x limit
    xmin: undefined,             // Min x limit
    xDecimals: true,             // Set to false to only show integer labels
    yCeiling: undefined,         // Maximum value for automatic axis limit (good for percentages)
    yFloor: undefined,           // Minimum value for automatic axis limit
    yMinRange: undefined,        // Minimum range for axis
    ymax: undefined,             // Max y limit
    ymin: undefined,             // Min y limit
    yDecimals: true,             // Set to false to only show integer labels
    yPlotBands: undefined,       // Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    xPlotBands: undefined,       // Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    tt_label: '{point.x}: {point.y:.2f}', // Use to customise tooltip label, eg. '{point.x} base pairs'
    pointFormat: undefined,      // Replace the default HTML for the entire tooltip label
    click_func: function(){},    // Javascript function to be called when a point is clicked
    cursor: undefined            // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
}

An example of the markup expected, with the function being called:

<div id="my_awesome_line_graph" class="hc-plot"></div>
<script type="text/javascript">
    mqc_plots['#my_awesome_line_graph']['plot_type'] = 'xy_line';
    mqc_plots['#my_awesome_line_graph']['datasets'] = [
        {
            name: 'Sample 1',
            data: [[1, 1.5], [1.5, 3.1], [2, 6.4]]
        },
        {
            name: 'Sample 2',
            data: [[1, 1.7], [1.5, 4.3], [2, 8.4]]
        },
    ];
    mqc_plots['#my_awesome_line_graph']['config'] = {
        "title": "Best Plot Ever",
        "ylab": "Pings",
        "xlab": "Pongs"
    };
    $(function () {
        plot_xy_line_graph('#my_awesome_line_graph');
    });
</script>

Plotting bar graphs

plot_stacked_bar_graph (target, ds)

Plots a bar graph with multiple series containing multiple categories. Used by the bargraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'bar_graph';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];
mqc_plots[target]['samples'];

All available config options with default vars:

config = {
    title: undefined,           // Plot title
    xlab: undefined,            // X axis label
    ylab: undefined,            // Y axis label
    ymax: undefined,            // Max y limit
    ymin: undefined,            // Min y limit
    yDecimals: true,            // Set to false to only show integer labels
    ylab_format: undefined,     // Format string for y axis labels. Defaults to {value}
    stacking: 'normal',         // Set to null to have category bars side by side (None in python)
    xtype: 'linear',            // Axis type. 'linear' or 'logarithmic'
    use_legend: true,           // Show / hide the legend
    click_func: undefined,      // Javascript function to be called when a point is clicked
    cursor: undefined,          // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
    tt_percentages: true,       // Show the percentages of each count in the tooltip
    reversedStacks: false,      // Reverse the order of the categories in the stack.
}

An example of the markup expected, with the function being called:

<div id="my_awesome_bar_plot" class="hc-plot"></div>
<script type="text/javascript">
    mqc_plots['#my_awesome_bar_plot']['plot_type'] = 'bar_graph';
    mqc_plots['#my_awesome_bar_plot']['samples'] = ['Sample 1', 'Sample 2'];
    mqc_plots['#my_awesome_bar_plot']['datasets'] = [{"data": [4, 7], "name": "Passed Test"}, {"data": [2, 3], "name": "Failed Test"}];
    mqc_plots['#my_awesome_bar_plot']['config'] = {
        "title": "My Awesome Plot",
        "ylab": "# Observations",
        "ymin": 0,
        "stacking": "normal"
    };
    $(function () {
        plot_stacked_bar_graph("#my_awesome_bar_plot");
    });
</script>

Switching counts and percentages

If you're using the plotting functions above, it's easy to add a button which switches between percentages and counts. Just add the following HTML above your plot:

<div class="btn-group switch_group">
    <button class="btn btn-default btn-sm active" data-action="set_numbers" data-target="#my_plot">Counts</button>
    <button class="btn btn-default btn-sm" data-action="set_percent" data-target="#my_plot">Percentages</button>
</div>

NB: This markup is generated automatically with the Python self.plot_bargraph() function.

Switching plot datasets

Much like the counts / percentages buttons above, you can add a button which switches the data displayed in a single plot. Make sure that both datasets are stored in named javascript variables, then add the following markup:

<div class="btn-group switch_group">
    <button class="btn btn-default btn-sm active" data-action="set_data" data-ylab="First Data" data-newdata="data_var_1" data-target="#my_plot">Data 1</button>
    <button class="btn btn-default btn-sm" data-action="set_data" data-ylab="Second Data" data-newdata="data_var_2" data-target="#my_plot">Data 2</button>
</div>

Note the CSS class active which specifies which button is 'pressed' on page load. data-ylab and data-xlab can be used to specify the new axes labels. data-newdata should be the name of the javascript object with the new data to be plotted and data-target should be the CSS selector of the plot to change.

Custom event triggers

Some of the events that take place in the general javascript code trigger jQuery events which you can hook into from within your module's code. This allows you to take advantage of events generated by the global theme whilst keeping your code modular.

$(document).on('mqc_highlights', function(e, f_texts, f_cols, regex_mode){
    // This trigger is called when the highlight strings are
    // updated. Three variables are given - an array of search
    // strings (f_texts), an array of colours with corresponding
    // indexes (f_cols) and a boolean var saying whether the
    // search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_renamesamples', function(e, f_texts, t_texts, regex_mode){
    // This trigger is called when samples are renamed
    // Three variables are given - an array of search
    // strings (f_texts), an array of replacements with corresponding
    // indexes (t_texts) and a boolean var saying whether the
    // search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_hidesamples', function(e, f_texts, regex_mode){
    // This trigger is called when the Hide Samples filters change.
    // Two variables are given - an array of search strings
    // (f_texts) and a boolean saying whether the search should
    // be treated as a string or a regex (regex_mode)
});

$('#YOUR_PLOT_ID').on('mqc_plotresize', function(){
    // This trigger is called when a plot handle is pulled,
    // resizing the height
});

$('#YOUR_PLOT_ID').on('mqc_original_series_click', function(e, name){
    // A plot able to show original images has had a point clicked.
    // 'name' contains the name of the series that was clicked
});

$('#YOUR_PLOT_ID').on('mqc_original_chg_source', function(e, name){
    // A plot with original images has had a request to change the
    // original image source (eg. pressing Prev / Next)
});

$('#YOUR_PLOT_ID').on('mqc_plotexport_image', function(e, cfg){
    // A trigger to export an image of the plot. cfg contains
    // config variables for the requested image.
});

$('#YOUR_PLOT_ID').on('mqc_plotexport_data', function(e, cfg){
    // A trigger to export a data file of the plot. cfg contains
    // config variables for the requested data.
});

MultiQC Plugins

MultiQC is written around a system designed for extensibility and plugins. These features allow custom code to be written without polluting the central code base.

Please note that we want MultiQC to grow as a community tool! So if you're writing a module or theme that can be used by others, please keep it within the main MultiQC framework and submit a pull request.

Entry Points

The plugin system works using setuptools entry points. In setup.py you will see a section of code that looks like this (truncated):

entry_points = {
    'multiqc.modules.v1': [
        'qualimap = multiqc.modules.qualimap:MultiqcModule',
    ],
    'multiqc.templates.v1': [
        'default = multiqc.templates.default',
    ],
    # 'multiqc.cli_options.v1': [
        # 'my-new-option = myplugin.cli:new_option'
    # ],
    # 'multiqc.hooks.v1': [
        # 'before_config = myplugin.hooks:before_config',
        # 'config_loaded = myplugin.hooks:config_loaded',
        # 'execution_start = myplugin.hooks:execution_start',
        # 'before_modules = myplugin.hooks:before_modules',
        # 'after_modules = myplugin.hooks:after_modules',
        # 'execution_finish = myplugin.hooks:execution_finish',
    # ]
},

These sets of entry points can each be extended to add functionality to MultiQC:

  • multiqc.modules.v1
    • Defines the module classes. Used to add new modules.
  • multiqc.templates.v1
    • Defines the templates. Can be used for new templates.
  • multiqc.cli_options.v1
    • Allows plugins to add new custom command line options
  • multiqc.hooks.v1
    • Code hooks for plugins to add new functionality

Any Python program can create entry points with the same names; once installed, MultiQC will find these and run them accordingly. For an example of this in action, see the MultiQC_NGI setup file:

entry_points = {
        'multiqc.templates.v1': [
            'ngi = multiqc_ngi.templates.ngi',
            'genstat = multiqc_ngi.templates.genstat',
        ],
        'multiqc.cli_options.v1': [
            'project = multiqc_ngi.cli:pid_option'
        ],
        'multiqc.hooks.v1': [
            'after_modules = multiqc_ngi.hooks:ngi_metadata',
        ]
    },

Here, two new templates are added, along with a new command line option and a new code hook.

Modules

List items added to multiqc.modules.v1 specify new modules. They should be described as follows:

modname = python_mod.dirname.submodname:classname

Once this is done, everything else should be the same as described in the writing modules documentation.
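As a sketch, the entry point registration for a hypothetical plugin package might look like this in its setup.py. All of the names here (mypkg, mytool) are invented; only the multiqc.modules.v1 key and the MultiqcModule class name follow the MultiQC convention.

```python
# setup.py fragment of a hypothetical plugin package - names are invented.
entry_points = {
    'multiqc.modules.v1': [
        # modname = python_mod.dirname.submodname:classname
        'mytool = mypkg.modules.mytool:MultiqcModule',
    ],
}
```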

Templates

As above, though no need to specify a class name at the end. See the writing templates documentation for further instructions.

Command line options

MultiQC handles command line interaction using the click framework. You can use the multiqc.cli_options.v1 entry point to add new click decorators for command line options. For example, the MultiQC_NGI plugin uses the entry point above with the following code in cli.py:

import click
pid_option = click.option('--project', type=str)

The values given from additional command line arguments are parsed by MultiQC and put into config.kwargs. The above plugin later reads the value given by the user with the --project flag in a hook:

if config.kwargs['project'] is not None:
  # do some stuff

See the click documentation or the main MultiQC script for more information and examples of adding command line options.

Hooks

Hooks are a little more complicated - these define points in the core MultiQC code where you can run custom functions. This can be useful as your code is able to access data generated by other parts of the program. For example, you could tie into the after_modules hook to insert data processed by MultiQC modules into a database automatically.

Here, the entry point names are the hook titles, described as commented out lines in the core MultiQC setup.py: execution_start, config_loaded, before_modules, after_modules and execution_finish.

These should point to a function in your code which will be executed when that hook fires. Your custom code can import the core MultiQC modules to access configuration and loggers. For example:

#!/usr/bin/env python
""" MultiQC hook functions - we tie into the MultiQC
core here to add in extra functionality. """

import logging
from multiqc.utils import report, config

log = logging.getLogger('multiqc')

def after_modules():
  """ Plugin code to run when MultiQC modules have completed  """
  num_modules = len(report.modules_output)
  status_string = "MultiQC hook - {} modules reported!".format(num_modules)
  log.critical(status_string)

Writing New Templates

MultiQC is built around a templating system that uses the Jinja python package. This makes it very easy to create new report templates that fit your needs.

Core or plugin?

If your template could be of use to others, it would be great if you could add it to the main MultiQC package. You can do this by creating a fork of the MultiQC GitHub repository, adding your template and then creating a pull request to merge your changes back to the main repository.

If it's a very specific template, you can create a new Python package which acts as a plugin. For more information about this, see the plugins documentation.

Creating a template skeleton

For a new template to be recognised by MultiQC, it must be a python submodule directory with a __init__.py file. This must be referenced in the setup.py installation script as an entry point.

You can see the bundled templates defined in this way:

entry_points = {
    'multiqc.templates.v1': [
        'default = multiqc.templates.default',
        'default_dev = multiqc.templates.default_dev',
        'simple = multiqc.templates.simple',
        'geo = multiqc.templates.geo',
    ]
}

Note that these entry points can point to any Python modules, so if you're writing a plugin you can specify your own module name instead. Just make sure that the multiqc.templates.v1 entry point key stays the same.

Once you've added the entry point, remember to install the package again:

python setup.py develop

Using develop tells setuptools to softlink the plugin files instead of copying, so changes made whilst editing files will be reflected when you run MultiQC.

The __init__.py files must define two variables - the path to the template directory and the main jinja template file:

import os
template_dir = os.path.dirname(__file__)
base_fn = 'base.html'

Child templates

The default MultiQC template contains a lot of code. Importantly, it includes 1448 lines of custom JavaScript (at time of writing) which powers the plotting and dynamic functions in the report. You probably don't want to rewrite all of this for your template, so to make your life easier you can create a child template.

To do this, add an extra variable to your template's __init__.py:

template_parent = 'default'

This tells MultiQC to use the template files from the default template unless a file with the same name is found in your child template. For instance, if you just want to add your own logo in the header of the reports, you can create your own header.html which will overwrite the default header.

Files within the default template have comments at the top explaining what part of the report they generate.

Extra init variables

There are a few extra variables that can be added to the __init__.py file to change how the report is generated.

Setting output_dir instructs MultiQC to put the report and its contents into a subdirectory. Set the string to your desired name. Note that this will be prefixed if -p/--prefix is set at run time.

Secondly, you can copy additional files with your report when it is generated, typically required images or scripts. Set copy_files to a list of file or directory paths, relative to the __init__.py file; directory contents are copied recursively.

You can also override config options in the template. For example, setting the value of config.plots_force_flat can force the report to only have static image plots.

from multiqc.utils import config

output_subdir = 'multiqc_report'
copy_files = ['assets']
config.plots_force_flat = True

Jinja template variables

There are a number of variables that you can use within your Jinja template. Two namespaces are available - report and config. You can print these using the Jinja curly brace syntax, eg. {{ config.version }}. See the Jinja2 documentation for more information.

The default MultiQC template includes dependencies in the HTML so that the report is standalone. If you would like to do the same, use the include_file function. For example:

<script>{{ include_file('js/jquery.min.js') }}</script>
<img src="data:image/png;base64,{{ include_file('img/logo.png', b64=True) }}">

Appendices

Custom plotting functions

If you don't like the default plotting functions built into MultiQC, you can write your own! If you create a callable variable in a template called either bargraph or linegraph, MultiQC will use that instead. For example:

def custom_linegraph(plotdata, pconfig):
    return '<h1>Awesome line graph here</h1>'
linegraph = custom_linegraph

def custom_bargraph(plotdata, plotseries, pconfig):
    return '<h1>Awesome bar graph here</h1>'
bargraph = custom_bargraph

These particular examples don't do very much, but hopefully you get the idea. Note that you have to set the variable linegraph or bargraph to your function.

Updating for compatibility

When releasing new versions of MultiQC we aim to maintain compatibility so that your existing modules and plugins will keep working. However, in some cases we have to make changes that require code to be modified. This section summarises the changes by MultiQC release.

v1.0 Updates

MultiQC v1.0 brings a few changes in the way that MultiQC modules and plugins are written. Most are backwards-compatible, but there are a couple that could break external plugins.

Module imports

New MultiQC module imports have been refactored to make them less inter-dependent and fragile. This has a bunch of advantages, notably allowing better, more modular, unit testing (and hopefully more reliable and maintainable code).

All MultiQC modules and plugins will need to change some of their import statements.

There are two things that you probably need to change in your plugin modules to make them work with the updated version of MultiQC, both to do with imports. Instead of this style of importing modules:

from multiqc import config, BaseMultiqcModule, plots

You now need this:

from multiqc import config
from multiqc.plots import bargraph   # Load specific plot types here
from multiqc.modules.base_module import BaseMultiqcModule

Modules that directly reference multiqc.BaseMultiqcModule instead need to reference multiqc.modules.base_module.BaseMultiqcModule.

Secondly, modules that use import plots now need to import the specific plots needed. You will also need to update any plotting functions, removing the plot. prefix.

For example, change this:

import plots
return plots.bargraph.plot(data, keys, pconfig)

to this:

from multiqc.plots import bargraph
return bargraph.plot(data, keys, pconfig)

These changes have been made to simplify the module imports within MultiQC, allowing specific parts of the codebase to be imported into a Python script on their own. This enables small, atomic, clean unit testing.

If you have any questions, please open an issue.

Many thanks to @tbooth at @EdinburghGenomics for his patient work with this.

Searching for files

The core find_log_files function has been rewritten and now works a little differently. Instead of searching all analysis files each time it's called (by every module), all files are searched once at the start of the MultiQC execution. This makes MultiQC run much faster.

To use the new syntax, add your search pattern to config.sp using the new before_config plugin hook:

setup.py:

# [..]
  'multiqc.hooks.v1': [
    'before_config = myplugin.mymodule:load_config'
  ]

mymodule.py:

from multiqc.utils import config
def load_config():
    my_search_patterns = {
        'my_plugin/my_mod': {'fn': '*_somefile.txt'},
        'my_plugin/my_other_mod': {'fn': '*other_file.txt'},
    }
    config.update_dict(config.sp, my_search_patterns)

This will add in your search patterns to the default MultiQC config, before user config files are loaded (allowing people to overwrite your defaults as with other modules).

Now, you can find your files much as before, using the string specified above:

for f in self.find_log_files('my_plugin/my_mod'):
  # do something
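Each item yielded by find_log_files is a dict carrying the file contents and metadata: as of v1.0 the keys are 'f' (contents), 's_name' (cleaned sample name), 'root' (directory) and 'fn' (filename). A small sketch of typical usage, written as a standalone function for clarity (the per-sample line counting is just an invented example of "do something"):

```python
# Sketch: consume the dicts yielded by self.find_log_files('my_plugin/my_mod').
# Keys 'f', 's_name', 'root' and 'fn' are as documented for MultiQC v1.0.
def parse_found_files(found_files):
    """Collect a per-sample line count from each found log file."""
    results = {}
    for f in found_files:
        results[f['s_name']] = f['f'].count('\n')
    return results

# Inside a module:
# data = parse_found_files(self.find_log_files('my_plugin/my_mod'))
```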

The old syntax (supplying a dict to the function instead of a string, without any previous config setup) still works, but you will get a deprecation notice. This functionality may be removed in the future.

Adding report sections

Until now, report sections were added by creating a list called self.sections and adding to it. If you only had a single section, the routine was to instead append to the self.intro string.

These methods have been deprecated in favour of a new function called self.add_section(). For example, instead of the previous:

self.sections = list()
self.sections.append({
  'name': 'My Section',
  'anchor': 'my-html-id',
  'content': '<p>Description of what this plot shows.</p>' +
             linegraph.plot(data, pconfig)
})

the syntax is now:

self.add_section(
  name = 'My Section',
  anchor = 'my-html-id',
  description = 'Description of what this plot shows.',
  helptext = 'More extensive help text about how to interpret this plot.',
  plot = linegraph.plot(data, pconfig)
)

Note that content should now be split up into three new keys: description, helptext and plot. This will allow consistent formatting and future developments with improved module help text. Text is wrapped in <p> tags by the function, so these are no longer needed. Raw content can still be provided in a content string as before if required.

All fields are optional. If name is omitted then the end result will be the same as previously done with self.intro += content.

Updated number formatting

A couple of minor updates to how numbers are handled in tables may affect your configs. Firstly, format strings looking like {:.1f} should now be {:,.1f} (note the extra comma). This enables customisable number formatting with separated thousand groups.

Secondly, any table columns reporting a read count should use new config options to allow user-configurable multipliers. For example, instead of this:

headers['read_counts'] = {
  'title': 'M Reads',
  'description': 'Read counts (millions)',
  'modify': lambda x: x / 1000000,
  'format': '{:.2f} M',
  'shared_key': 'read_count'
}

you should now use this:

headers['read_counts'] = {
  'title': '{} Reads'.format(config.read_count_prefix),
  'description': 'Total raw sequences ({})'.format(config.read_count_desc),
  'modify': lambda x: x * config.read_count_multiplier,
  'format': '{:,.2f} ' + config.read_count_prefix,
  'shared_key': 'read_count'
}

Not as pretty, but it allows users to configure how read counts are scaled, for example to view counts from low depth data.