Welcome to the MultiQC docs.

These docs are bundled with the MultiQC download for your convenience, so you can also read them in your installation or on GitHub.

Using MultiQC

Installing MultiQC

Installing Python

To see if you have python installed, run python --version on the command line. If you see version 2.7+, 3.4+ or 3.5+ then you can skip this step.
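
The same check can be made from inside Python. This is a minimal sketch, assuming the supported ranges are exactly those listed above (2.7 or later in the 2.x line, 3.4 or later in the 3.x line). The function name is illustrative and is not part of MultiQC.

```python
import sys


def is_supported(version_info):
    """Return True if a (major, minor, ...) tuple is 2.7+ or 3.4+."""
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor >= 7
    if major == 3:
        return minor >= 4
    # Assumption: any other major version is outside the ranges named above.
    return False


# Check the interpreter that is running this snippet.
print(is_supported(sys.version_info))
```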

We recommend using virtual environments to manage your Python installation. Our favourite is Anaconda, a cross-platform tool to manage Python environments. You can find installation instructions for Anaconda here.

Once Anaconda is installed, you can create an environment with the following commands:

conda create --name py2.7 python=2.7
source activate py2.7
# Windows: activate py2.7

You'll want to add the source activate py2.7 line to your .bashrc file so that the environment is loaded every time you load the terminal.

Installing with conda

If you're using conda as described above, you can install MultiQC from the bioconda channel as follows:

conda install -c bioconda multiqc

Installation with pip

This is the easiest way to install MultiQC. pip is the package manager for the Python Package Index. It comes bundled with recent versions of Python; otherwise you can find installation instructions here.

You can now install MultiQC from PyPI as follows:

pip install multiqc

If you would like the development version, the command is:

pip install git+https://github.com/ewels/MultiQC.git

Note that if you have problems with read-only directories, you can install to your home directory with the --user parameter (though it's probably better to use virtual environments, as described above).

pip install --user multiqc

Manual installation

If you'd rather not use either of these tools, you can clone the repository and install it yourself:

git clone https://github.com/ewels/MultiQC.git
cd MultiQC
python setup.py install

git not installed? No problem - just download the flat files:

curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
unzip master.zip
cd MultiQC-master
python setup.py install

Updating MultiQC

You can update MultiQC from PyPI at any time by running the following command:

pip install --upgrade multiqc

To update the development version, use:

pip install --force git+https://github.com/ewels/MultiQC.git

If you cloned the git repo, just pull the latest changes and install:

cd MultiQC
git pull
python setup.py install

If you downloaded the flat files, just repeat the installation procedure.

Installing as an environment module

Many people using MultiQC will be working on an HPC environment. Every server / cluster is different, and you're probably best off asking your friendly sysadmin to install MultiQC for you. However, with that in mind, here are a few general tips for installing MultiQC into an environment module system:

MultiQC comes in two parts - the multiqc python package and the multiqc executable script. The former must be available in $PYTHONPATH and the script must be available on the $PATH.

A typical installation procedure with an environment module Python install might look like this: (Note that $PYTHONPATH must be defined before pip installation.)

VERSION=0.7
INST=/path/to/software/multiqc/$VERSION
module load python/2.7.6
mkdir $INST
export PYTHONPATH=$INST/lib/python2.7/site-packages
pip install --install-option="--prefix=$INST" multiqc

Once installed, you'll need to create an environment module file. Again, these vary between systems a lot, but here's an example:

#%Module1.0#####################################################################
##
## MultiQC
##

set components [ file split [ module-info name ] ]
set version [ lindex $components 1 ]
set modroot /path/to/software/multiqc/$version

proc ModulesHelp { } {
    global version modroot
    puts stderr "\tMultiQC - use MultiQC $version"
    puts stderr "\n\tVersion $version\n"
}
module-whatis   "Loads MultiQC environment."

# load required modules
module load python/2.7.6

# only one version at a time
conflict multiqc

# Make the directories available
prepend-path    PATH        $modroot/bin
prepend-path    PYTHONPATH  $modroot/lib/python2.7/site-packages

Running MultiQC

Once installed, just go to your analysis directory and run multiqc, followed by a list of directories to search. At its simplest, this can just be . (the current working directory):

multiqc .

That's it! MultiQC will scan the specified directories and produce a report based on details found in any log files that it recognises.

See Using MultiQC Reports for more information about how to use the generated report.

For a description of all command line parameters, run multiqc --help.

Choosing where to scan

You can supply MultiQC with as many directories or files as you like. Above, we supply . - just the current directory, but all of these would work too:

multiqc data/
multiqc data/ ../proj_one/analysis/ /tmp/results
multiqc data/*_fastqc.zip
multiqc data/sample_1*

You can also ignore files using the -x/--ignore flag (can be specified multiple times). This takes a string which it matches using glob expansion to filenames, directory names and entire paths:

multiqc . --ignore *_R2*
multiqc . --ignore run_two/
multiqc . --ignore */run_three/*/fastqc/*_R2.zip

Finally, you can supply a file containing a list of file paths, one per row. MultiQC will only search the listed files.

multiqc --file-list my_file_list.txt
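
If the files of interest are scattered across a directory tree, a list like this can be generated with a short script. The sketch below is only an illustration: the helper name, the suffix filter and the output path are assumptions, and the only thing taken from the text above is the format of one path per row.

```python
import os


def write_file_list(search_dir, suffix, out_path):
    """Write every file under search_dir ending in suffix to out_path, one per row."""
    found = []
    for root, _dirs, files in os.walk(search_dir):
        for name in files:
            if name.endswith(suffix):
                found.append(os.path.join(root, name))
    found.sort()  # stable ordering makes the list easier to review
    with open(out_path, "w") as handle:
        for path in found:
            handle.write(path + "\n")
    return found
```

For example, write_file_list("data", "_fastqc.zip", "my_file_list.txt") would produce a file that can then be passed to --file-list.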

Renaming reports

The report is called multiqc_report.html by default. Tab-delimited data files are created in multiqc_data/, containing additional information. You can use a custom name for the report with the -n/--name parameter, or instruct MultiQC to create them in a subdirectory using the -o/--outdir parameter.

Note that different MultiQC templates may have different defaults.

Overwriting existing reports

It's quite common to repeatedly create new reports as new analysis results are generated. Instead of manually deleting old reports, you can just specify the -f parameter and MultiQC will overwrite any conflicting report filenames.

Sample names prefixed with directories

Sometimes, the same samples may be processed in different ways. If MultiQC finds log files with the same sample name, the previous data will be overwritten (this can be inspected by running MultiQC with -v/--verbose).

To avoid this, run MultiQC with the -d/--dirs parameter. This will prefix every sample name with the directory path that the log file was found within. As such, sample names will no longer clash, and data will not be overwritten.

By default, --dirs will prepend the entire path to each sample name. You can choose which directories are added with the -dd/--dirs-depth parameter. Set to a positive integer to use that many directories at the end of the path. A negative integer takes directories from the start of the path.

For example:

$ multiqc -d .
# analysis_1 | results | type | sample_1 | file.log
# analysis_2 | results | type | sample_2 | file.log
# analysis_3 | results | type | sample_3 | file.log

$ multiqc -d -dd 1 .
# sample_1 | file.log
# sample_2 | file.log
# sample_3 | file.log

$ multiqc -d -dd -1 .
# analysis_1 | file.log
# analysis_2 | file.log
# analysis_3 | file.log
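
The depth rule can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not MultiQC's own code. The function name is hypothetical, and the " | " separator is copied from the example output.

```python
def dir_prefix(path, depth=0):
    """Return the directory parts that would prefix a sample name.

    depth > 0 keeps that many directories from the end of the path,
    depth < 0 keeps that many from the start, and 0 keeps them all.
    """
    dirs = path.split("/")[:-1]  # drop the file name itself
    if depth > 0:
        dirs = dirs[-depth:]
    elif depth < 0:
        dirs = dirs[:-depth]
    return " | ".join(dirs)


print(dir_prefix("analysis_1/results/type/sample_1/file.log", 1))   # prints sample_1
print(dir_prefix("analysis_1/results/type/sample_1/file.log", -1))  # prints analysis_1
```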

Using different templates

MultiQC is built around a templating system. You can produce reports with different styling by using the -t/--template option. The available templates are listed with multiqc --help.

If you're interested in creating your own custom template, see the writing new templates section.

PDF Reports

Whilst HTML is definitely the format of choice for MultiQC reports due to the interactive features that it can offer, PDF files are an integral part of some people's workflows. To try to accommodate this, MultiQC has a --pdf command line flag which will try to create a PDF report for you.

To do this, MultiQC uses the simple template. This uses flat plots, has no navigation or toolbar and strips out all JavaScript. The resulting HTML report is pretty basic, but this simplicity is helpful when generating PDFs.

Once the report is generated MultiQC attempts to call Pandoc, a command line tool able to convert documents between different file formats. You must have Pandoc already installed for this to work. If you don't have Pandoc installed, you will get an error message that looks like this:

Error creating PDF - pandoc not found. Is it installed? http://pandoc.org/

Please note that Pandoc is a complex tool and uses LaTeX / XeLaTeX for PDF generation. Please make sure that you have the latest version of Pandoc and that it can successfully convert basic HTML files to PDF before reporting any errors. Also note that not all plots have flat image equivalents, so some will be missing (at time of writing: FastQC sequence content plot, beeswarm dot plots, heatmaps).

Printing to stdout

If you would like to generate MultiQC reports on the fly, you can print the output to standard out by specifying -n stdout. Note that the data directory will not be generated and the template used must create stand-alone HTML reports.

Parsed data directory

By default, MultiQC creates a directory alongside the report containing tab-delimited files with the parsed data. This is useful for downstream processing, especially if you're running MultiQC with very large numbers of samples.

Typically, these files are tab-delimited tables. However, you can get JSON or YAML output for easier downstream parsing by specifying -k/--data-format on the command line or data_format in your configuration file.

You can also choose whether to produce the data by specifying either the --data-dir or --no-data-dir command line flags or the make_data_dir variable in your configuration file. Note that the data directory is never produced when printing the MultiQC report to stdout.

To zip the data directory, use the -z/--zip-data-dir flag.
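
As an illustration of downstream processing, the default tab-delimited files can be read with Python's standard csv module. This is a minimal sketch. The helper name is made up, and the path and column names in the usage note are placeholders rather than guaranteed MultiQC output.

```python
import csv


def read_table(path):
    """Read a tab-delimited file into a list of dicts keyed by column header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Calling read_table on one of the files in multiqc_data/ gives one dictionary per row. All values are strings, so convert numeric columns before doing any arithmetic.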

Exporting Plots

In addition to the HTML report, it's also possible to get MultiQC to save plots as stand alone files. You can do this with the -p/--export command line flag. By default, plots will be saved in a directory called multiqc_plots as .png, .svg and .pdf files.

You can instruct MultiQC to always do this by setting the export_plots config option to true, though note that this will add a few seconds on to execution time. The plots_dir_name option changes the default directory name for plots, and export_plot_formats specifies which file formats should be created (these must be supported by MatPlotLib).

Note that not all plot types are yet supported, so you may find some plots are missing.

Note: You can always save static image versions of plots from within MultiQC reports, using the Export toolbox in the side bar.

Choosing which modules to run

Sometimes, it's desirable to choose which MultiQC modules run. This could be because you're only interested in one type of output and want to keep the reports small. Or perhaps the output from one module is misleading in your situation.

You can do this by using -m/--modules to explicitly define which modules you want to run. Alternatively, use -e/--exclude to run all modules except those listed.
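
When MultiQC is launched from a pipeline script, it can help to assemble the command programmatically. The sketch below only builds an argument list and does not run anything. It assumes that -m and -e are given once per module name, which is an assumption rather than something stated above, and the helper name is hypothetical.

```python
def build_command(search_dirs, modules=None, exclude=None, force=False, outdir=None):
    """Assemble a multiqc command as an argument list (nothing is executed)."""
    cmd = ["multiqc"]
    if force:
        cmd.append("-f")  # overwrite existing reports
    if outdir:
        cmd.extend(["-o", outdir])
    for name in modules or []:
        cmd.extend(["-m", name])  # assumption: one -m per module
    for name in exclude or []:
        cmd.extend(["-e", name])  # assumption: one -e per module
    cmd.extend(search_dirs)
    return cmd
```

The resulting list can be passed to subprocess.call, which avoids any shell quoting problems.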

Using MultiQC Reports

Once MultiQC has finished, you should have an HTML report file called multiqc_report.html (or something similar, depending on how you ran MultiQC). You can launch this report with open multiqc_report.html on the command line, or by double clicking the file in a file browser.

Browser compatibility

MultiQC reports should work in any modern browser. They have been tested using OSX Chrome, Firefox and Safari. If you find any report bugs, please report them as a GitHub issue.

Report layout

MultiQC reports have three main page sections:

  • The navigation menu (left side)
    • Links to the different module sections in the report
    • Click the logo to go to the top of the page
  • The toolbox (right side)
    • Contains various tools to modify the report data (see below)
  • The report (middle)
    • This is what you came here for, the data!

Note that if you're viewing the report on a mobile device / small window, the content will be reformatted to fit the screen.

General Statistics table

At the top of every MultiQC report is the 'General Statistics' table. This shows an overview of key values, taken from all modules. The aim of the table is to bring together stats for each sample from across the analysis so that you can see it in one place.

Hovering over column headers will show a longer description, including which module produced the data. Clicking a header will sort the table by that value. Clicking it again will change the sort direction. You can shift-click multiple headers to sort by multiple columns.

[Screenshot: sort column]

Above the table there is a button called 'Configure Columns'. Clicking this will launch a modal window with more detailed information about each column, plus options to show/hide and change the order of columns.

[Screenshot: configure columns]

Plots

MultiQC modules can plot more extensive data in the sections below the General Statistics table.

Interactive plots

Plots in MultiQC reports are usually interactive, using the HighCharts JavaScript library.

You can hover the mouse over data to see a tooltip with more information about that dataset. Clicking and dragging on line graphs will zoom into that area.

[Screenshot: plot zoom]

To reset the zoom, use the button in the top right:

[Screenshot: reset zoom]

Plots have a grey bar along their base; clicking and dragging this will resize the plot's height:

[Screenshot: plot zoom]

You can force reports to use interactive plots instead of flat by specifying the --interactive command line option (see below).

Flat plots

Reports with large numbers of samples may contain flat plots. These are rendered when the MultiQC report is generated using MatPlotLib and are non-interactive (flat) images within the report. The reason for generating these is that large sample numbers can make MultiQC reports very data-intensive and unresponsive (crashing people's browsers in extreme cases). Plotting data in flat images is scalable to any number of samples, however.

Flat plots in MultiQC have been designed to look as similar to their interactive versions as possible. They are also copied to multiqc_data/multiqc_plots.

You can force reports to use flat plots with the --flat command line option.

See the Large sample numbers section of the Configuring MultiQC docs for more on how to customise the flat / interactive plot behaviour.

Exporting plots

If you want to use the plot elsewhere (eg. in a presentation or paper), you can export it in a range of formats. Just click the menu button in the top right of the plot:

[Screenshot: plot zoom]

This opens the MultiQC Toolbox Export Plots panel with the current plot selected. You have a range of export options here. When deciding on output format bear in mind that SVG is a vector format, so can be edited in tools such as Adobe Illustrator or the free tool Inkscape. This makes it ideal for use in publications and manual customisation / annotation. The Plot scaling option changes how large the labels are relative to the plot.

Dynamic plots

Some plots have buttons above them which allow you to change the data that they show or their axis. For example, many bar plots have the option to show the data as percentages instead of counts:

[Screenshot: percentage button]

Toolbox

MultiQC reports come with a 'toolbox', accessible by clicking the buttons on the right hand side of the report:

[Screenshot: toolbox buttons]

Active toolbox panels have their button highlighted with a blue outline. You can hide the toolbox by clicking the open panel button a second time, or pressing Escape on your keyboard.

Highlight Samples

If you run MultiQC with a lot of samples, plots can become very data-heavy. This makes it difficult to find specific samples, or subsets of samples.

To help with this, you can use the Highlight Samples tool to colour datasets of interest. Simply enter some text which will match the samples you want to highlight and press enter (or click the add button). If you like, you can also customise the highlight colour.

[Screenshot: toolbox highlight]

To make it easier to match groups of samples, you can use regular expressions by turning on 'Regex mode'. You can test regexes using a nice tool at regex101.com. See a nice introduction to regexes here. Note that pattern delimiters are not needed (use pattern, not /pattern/).

Here, we highlight any sample names that end in _1:

[Screenshot: highlight regex]

Note that a new button appears above the General Statistics table when samples are highlighted, allowing you to sort the table according to highlights.

Search patterns can be changed after creation; just click to edit. To remove a pattern, click the grey cross on the right hand side.

Searching for an empty string will match all samples.
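
To try out a search string before typing it into a report, the matching rules can be approximated in Python. This is a rough sketch. The report itself runs in the browser, so Python's re module is only a stand-in for its regex engine, and the function name and sample names are made up for illustration.

```python
import re


def matching_samples(names, pattern, regex_mode=False):
    """Return the sample names that a highlight-style search would select."""
    if regex_mode:
        compiled = re.compile(pattern)
        return [name for name in names if compiled.search(name)]
    # Plain mode: simple substring match, so an empty string matches everything.
    return [name for name in names if pattern in name]


samples = ["SRR1067503_1", "SRR1067503_2", "SRR1067505_1"]
print(matching_samples(samples, "_1$", regex_mode=True))
# prints ['SRR1067503_1', 'SRR1067505_1']
```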

Renaming Samples

Sample names are typically generated based on processed file names. These file names are not always informative. To help with this, you can do a search and replace within sample names. Here, we remove the SRR1067 and _1 parts of the sample names, which are the same for all samples:

[Screenshot: rename samples]

Again, regular expressions can be used. See above for details. Note that regex groups can be used - define a group match with parentheses and use the matching value with $1, $2 etc. For example - a search string SRR283(\d{3}) and replace string $1_SRR283 would move the final three digits of matching sample names to the start of the name.
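
The group example above can be checked outside the report with Python's re module. Note one difference: Python writes group references as \1, \2 in the replacement string, whereas the report toolbox uses $1, $2. The sample name below is made up for illustration.

```python
import re


def rename(name, search, replace):
    """Regex search and replace on a sample name, with group references."""
    return re.sub(search, replace, name)


# Move the final three digits to the start of the name.
print(rename("SRR283123", r"SRR283(\d{3})", r"\1_SRR283"))  # prints 123_SRR283
```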

Often, you may have a spreadsheet with filenames and informative sample names. To avoid having to manually enter each name, you can paste from a spreadsheet using the 'bulk import' tool:

[Screenshot: bulk rename]

Hiding Samples

Sometimes, you want to focus on a subset of samples. To temporarily hide samples from the report, enter a search string as described above into the 'Hide Samples' toolbox panel.

Here, we hide all samples with _trimmed in their sample name: (Note that plots will tell you how many samples have been hidden)

[Screenshot: hide samples]

Export

This tool enables you to configure the size and characteristics of exported plots, as well as allowing you to download some or all of the graphs with a single click. Width and Height set the output size of the images, scale sets how "zoomed-in" they should look (typically you want the plot to be more zoomed for printing). See above for more information about this panel.

Save Settings

To avoid having to re-enter the same toolbox setup repeatedly, you can save your settings using the 'Save Settings' panel. Just pick a name and click save. To load, choose your set of settings and press load (or delete). Loaded settings are applied on top of current settings. All configs are saved in browser local storage - they do not travel with the report and may not work in older browsers.

Configuring MultiQC

Whilst most MultiQC settings can be specified on the command line, MultiQC is also able to parse system-wide and personal config files. At run time, it collects the configuration settings from the following places in this order (overwriting at each step if a conflicting config variable is found):

  1. Hardcoded defaults in MultiQC code
  2. System-wide config in <installation_dir>/multiqc_config.yaml
    • Manual installations only, not pip or conda
  3. User config in ~/.multiqc_config.yaml
  4. Config file in the current working directory: multiqc_config.yaml
  5. Command line options
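
The precedence order above amounts to merging several dictionaries, with later sources winning. The sketch below illustrates the idea with a shallow merge. Whether MultiQC merges nested options in exactly this way is an assumption, and the function name is hypothetical.

```python
def merge_configs(*layers):
    """Merge config dicts in order; later layers overwrite earlier keys."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


defaults = {"title": None, "make_data_dir": True}
user_config = {"title": "My Title"}
command_line = {"make_data_dir": False}
print(merge_configs(defaults, user_config, command_line))
# prints {'title': 'My Title', 'make_data_dir': False}
```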

You can find an example configuration file with the MultiQC source code, called multiqc_config.example.yaml. If you installed MultiQC with pip or conda you won't have this file locally, but you can find it on GitHub: github.com/ewels/MultiQC.

Sample name cleaning

MultiQC typically generates sample names by taking the input or log file name, and 'cleaning' it. To do this, it uses the fn_clean_exts settings and looks for any matches. If it finds any matches, everything to the right is removed. For example, consider the following config:

fn_clean_exts:
    - '.gz'
    - '.fastq'

This would make the following sample names:

mysample.fastq.gz  ->  mysample
secondsample.fastq.gz_trimming_log.txt  ->  secondsample
thirdsample.fastq_aligned.sam.gz  ->  thirdsample

There is also a config list called fn_clean_trim which just removes strings if they are present at the start or end of the sample name.
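
The two cleaning steps can be sketched in Python as follows. This is an approximation of the behaviour described above, not MultiQC's actual implementation. The function name is hypothetical, and the order of the two steps is an assumption.

```python
def clean_sample_name(filename, clean_exts, clean_trim=()):
    """Truncate at each extension match, then trim strings from both ends."""
    name = filename
    for ext in clean_exts:
        # Keep only what precedes the first occurrence of ext.
        name = name.split(ext, 1)[0]
    for chars in clean_trim:
        if not chars:
            continue  # an empty trim string would otherwise wipe the name
        if name.startswith(chars):
            name = name[len(chars):]
        if name.endswith(chars):
            name = name[:-len(chars)]
    return name


exts = [".gz", ".fastq"]
print(clean_sample_name("secondsample.fastq.gz_trimming_log.txt", exts))
# prints secondsample
```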

Usually you don't want to overwrite the defaults (though you can). Instead, add to the special variable names extra_fn_clean_exts and extra_fn_clean_trim:

extra_fn_clean_exts:
    - '.myformat'
    - '_processedFile'
extra_fn_clean_trim:
    - '#'
    - '.myext'

Other search types

File name cleaning can also take strings to remove (instead of removing with truncation). Also regex strings can be supplied to match patterns and remove strings.

Consider the following:

extra_fn_clean_exts:
    - '.fastq'
    - type: 'replace'
      pattern: '.sorted'
    - type: 'regex'
      pattern: '^processed.'

This would make the following sample names:

mysample.fastq.gz  ->  mysample
secondsample.sorted.deduplicated.fastq.gz_processed.txt  ->  secondsample.deduplicated
processed.thirdsample.fastq_aligned.sam.gz  ->  thirdsample
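
To make the three rule types concrete, here is a small Python model that reproduces the names above. It is a sketch only. The function name and the internal "truncate" label for plain strings are assumptions, and MultiQC would also apply its default patterns, which are left out here.

```python
import re


def apply_clean_rules(filename, rules):
    """Apply truncate / replace / regex cleaning rules in order."""
    name = filename
    for rule in rules:
        if isinstance(rule, str):
            # A bare string behaves as a truncation pattern.
            rule = {"type": "truncate", "pattern": rule}
        kind = rule.get("type", "truncate")
        pattern = rule["pattern"]
        if kind == "truncate":
            name = name.split(pattern, 1)[0]
        elif kind == "replace":
            name = name.replace(pattern, "")
        elif kind == "regex":
            name = re.sub(pattern, "", name)
        else:
            raise ValueError("unknown rule type: " + kind)
    return name


RULES = [
    ".fastq",
    {"type": "replace", "pattern": ".sorted"},
    {"type": "regex", "pattern": "^processed."},
]
print(apply_clean_rules("processed.thirdsample.fastq_aligned.sam.gz", RULES))
# prints thirdsample
```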

Clashing sample names

This process of cleaning sample names can sometimes result in exact duplicates. A duplicate sample name will overwrite previous results. Warnings showing these events can be seen with verbose logging using the --verbose/-v flag, or in multiqc_data/multiqc.log.

Problems caused by this will typically show up as fewer results than expected. If you're ever unsure about where the data from results within MultiQC reports come from, have a look at multiqc_data/multiqc_sources.txt, which lists the path to the file used for every section of the report.
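
The overwrite behaviour is easy to picture as a dictionary keyed by sample name. The following sketch uses hypothetical names and paths, and is not MultiQC code. It shows how the last file seen wins and how the clashes could be listed.

```python
def find_clashes(parsed):
    """Given (sample_name, source_path) pairs in parse order, report overwrites."""
    kept = {}
    clashes = []
    for name, source in parsed:
        if name in kept:
            # Record the name, the source being replaced and the new source.
            clashes.append((name, kept[name], source))
        kept[name] = source  # the last file seen wins
    return kept, clashes
```

With two log files that both clean to sample_1, kept holds only the second path and clashes contains one entry describing the overwrite.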

Directory names

One scenario where clashing names can occur is when the same file is processed in different directories. For example, if sample_1.fastq is processed with four sets of parameters in four different directories, they will all have the same name - sample_1. Only the last will be shown. If the directories are different, this can be avoided with the --dirs/-d flag.

For example, given the following files:

├── analysis_1
│   └── sample_1.fastq.gz.aligned.log
├── analysis_2
│   └── sample_1.fastq.gz.aligned.log
└── analysis_3
    └── sample_1.fastq.gz.aligned.log

Running multiqc -d . will give the following sample names:

analysis_1 | sample_1
analysis_2 | sample_1
analysis_3 | sample_1

Filename truncation

If the problem is with filename truncation, you can also use the --fullnames/-s flag, which disables all sample name cleaning. For example:

├── sample_1.fastq.gz.aligned.log
└── sample_1.fastq.gz.subsampled.fastq.gz.aligned.log

Running multiqc -s . will give the following sample names:

sample_1.fastq.gz.aligned.log
sample_1.fastq.gz.subsampled.fastq.gz.aligned.log

You can turn off sample name cleaning permanently by setting fn_clean_sample_names to false in your config file.

Module search patterns

Many bioinformatics tools have standard output formats, filenames and other signatures. MultiQC uses these to find output; for example, the FastQC module looks for files that end in _fastqc.zip.

This works well most of the time, until someone has an automated processing pipeline that renames things. For this reason, as of version v0.3.2 of MultiQC, the file search patterns are loaded as part of the main config. This means that they can be overwritten in <installation_dir>/multiqc_config.yaml or ~/.multiqc_config.yaml. So if you always rename your _fastqc.zip files to _qccheck.zip, MultiQC can still work.

To see the default search patterns, see the search_patterns.yaml file. Copy the section for the program that you want to modify and paste this into your config file. Make sure you make it part of a dictionary called sp as follows:

sp:
    mqc_module:
        fn: _mysearch.txt

Search patterns can specify a filename match (fn) or a file contents match (contents).
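
A rough model of how such a pattern could be tested against a file is shown below. The exact matching rules are assumptions made for illustration: fn is treated as a filename suffix (following the "end in _fastqc.zip" example), contents as a substring of the file text, and both must match when both are given. The function name is hypothetical.

```python
def file_matches(path, pattern):
    """Check one file against a search pattern dict with 'fn' and/or 'contents'."""
    if "fn" in pattern and not path.endswith(pattern["fn"]):
        return False
    if "contents" in pattern:
        # Ignore undecodable bytes so that binary files do not raise errors.
        with open(path, errors="ignore") as handle:
            if pattern["contents"] not in handle.read():
                return False
    return True
```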

Ignoring Files

MultiQC begins by indexing all of the files that you specified and building a list of the ones it will use. You can specify files and directories to skip on the command line using -x/--ignore, or, for a more permanent setting, with the following config file options: fn_ignore_files, fn_ignore_dirs and fn_ignore_paths (the command line option simply adds to all of these).

For example, given the following files:

├── analysis_1
│   └── sample_1.fastq.gz.aligned.log
├── analysis_2
│   └── sample_1.fastq.gz.aligned.log
└── analysis_3
    └── sample_1.fastq.gz.aligned.log

You could specify the following relevant config options:

fn_ignore_files:
    - '*.log'
fn_ignore_dirs:
    - 'analysis_1'
    - 'analysis_2'
fn_ignore_paths:
    - '*/analysis_*/sample_1*'

Note that the searched file paths will usually be relative to the working directory and can be highly variable, so you'll typically want to start patterns with a * to match any preceding directory structure.
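
To experiment with patterns before adding them to a config file, the three options can be approximated with Python's fnmatch module. This is a sketch under the assumption that each option is a list of glob patterns applied to file names, directory names and whole paths respectively. The function name is hypothetical and the real matching may differ in detail.

```python
import fnmatch
import posixpath


def is_ignored(path, ignore_files=(), ignore_dirs=(), ignore_paths=()):
    """Return True if a '/'-separated file path matches any ignore pattern."""
    dirname, filename = posixpath.split(path)
    if any(fnmatch.fnmatchcase(filename, p) for p in ignore_files):
        return True
    for part in dirname.split("/"):
        if any(fnmatch.fnmatchcase(part, p) for p in ignore_dirs):
            return True
    return any(fnmatch.fnmatchcase(path, p) for p in ignore_paths)


log = "./analysis_3/sample_1.fastq.gz.aligned.log"
print(is_ignored(log, ignore_dirs=["analysis_1", "analysis_2"]))  # prints False
print(is_ignored(log, ignore_paths=["*/analysis_*/sample_1*"]))   # prints True
```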

Large sample numbers

MultiQC has been written with the intention of being used for any number of samples. This means that it should work well with 6 samples or 6000. Very large sample numbers are becoming increasingly common, for example with single cell data.

Producing reports with data from many hundreds or thousands of samples provides some challenges, both technically and also in terms of data visualisation and report usability.

Disabling on-load plotting

One problem with large reports is that the browser can hang when the report is first loaded. This is because it is loading and processing the data for all plots at once. To mitigate this, large reports may show plots as grey boxes with a "Show Plot" button. Clicking this will render the plot as normal and prevents the browser from trying to do everything at once.

By default this behaviour kicks in when a plot has 50 samples or more. This can be customised by changing the num_datasets_plot_limit config option.

Flat / interactive plots

Reports with many samples start to need a lot of data for plots. This results in inconvenient report file sizes (can be 100s of megabytes) and worse, web browser crashes. To allow MultiQC to scale to these sample numbers, most plot types have two plotting functions in the code base - interactive (using HighCharts) and flat (rendered with MatPlotLib). Flat plots take up the same disk space irrespective of sample number and do not consume excessive resources to display.

By default, MultiQC generates flat plots when there are 100 or more samples. This cutoff can be changed by changing the plots_flat_numseries config option. This behaviour can also be changed by running MultiQC with the --flat / --interactive command line options or by setting the plots_force_flat / plots_force_interactive config options to True.
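
The decision described above can be summarised as a small function. This is an illustrative sketch, not MultiQC's code. The function and argument names are hypothetical, and letting the flat override win when both overrides are set is an assumption.

```python
def use_flat_plot(n_samples, flat_limit=100, force_flat=False, force_interactive=False):
    """Decide whether a plot should be rendered as a flat image."""
    if force_flat:
        return True  # assumption: forcing flat takes priority
    if force_interactive:
        return False
    # Default rule: flat once the sample count reaches the cutoff.
    return n_samples >= flat_limit


print(use_flat_plot(99))    # prints False
print(use_flat_plot(100))   # prints True
```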

Tables / Beeswarm plots

Report tables with thousands of samples (table rows) can quickly become impossible to use. To avoid this, tables with large numbers of rows are instead plotted as a Beeswarm plot (aka. a strip chart / jitter plot). These plots have fixed dimensions with any number of samples. Hovering on a dot will highlight the same sample in other rows.

By default, MultiQC starts using beeswarm plots when a table has 500 rows or more. This can be changed by setting the max_table_rows config option.

Customising Reports

MultiQC offers a few ways to customise reports to easily add your own branding and some additional report-level information. These features are primarily designed for core genomics facilities.

Note that much more extensive customisation of reports is possible using custom templates.

Titles and introductory text

You can specify a custom title for the report using the -i/--title command line option. The -b/--comment option can be used to add a longer comment to the top of the report at run time.

You can also specify the title and comment, as well as a subtitle and the introductory text in your config file:

title: "My Title"
subtitle: "A subtitle to go underneath in grey"
intro_text: "MultiQC reports summarise analysis results."
report_comment: "This is a comment about this report."

Note that if intro_text is None the template will display the default introduction sentence. Set this to False to hide this, or set it to a string to use your own text.

To add your own custom logo to reports, you can add the following three lines to your MultiQC configuration file:

custom_logo: '/abs/path/to/logo.png'
custom_logo_url: 'https://www.example.com'
custom_logo_title: 'Our Institute Name'

Only custom_logo is needed. The URL will make the logo open up a new web browser tab with your address and the title sets the mouse hover title text.

Project level information

You can add custom information at the top of reports by adding key:value pairs to the config option report_header_info. Note that if you have a file called multiqc_config.yaml in the working directory, this will automatically be parsed and added to the config. For example, if you have the following saved:

report_header_info:
    - Contact E-mail: 'phil.ewels@scilifelab.se'
    - Application Type: 'RNA-seq'
    - Project Type: 'Application'
    - Sequencing Platform: 'HiSeq 2500 High Output V4'
    - Sequencing Setup: '2x125'

Then this will be displayed at the top of reports:

[Screenshot: report project info]

Note that you can also specify a path to a config file using -c.

Order of modules

By default, modules are included in the report in the order specified in config.module_order. Any modules found which aren't in this list are appended at the top of the report. To specify certain modules that should always come at the top of the report, you can configure config.top_modules in your MultiQC configuration file. For example, to always have the FastQC module at the top of reports, add the following to your ~/.multiqc_config.yaml file:

top_modules:
    - 'fastqc'

Customising tables

Report tables such as the General Statistics table can get quite wide. To help with this, columns in the report can be hidden. Some MultiQC modules include columns which are hidden by default, others may be uninteresting to some users.

To allow customisation of this behaviour, the defaults can be changed by adding to your MultiQC config file. This is done with the table_columns_visible value. Open a MultiQC report and click Configure Columns above a table. Make a note of the Group and ID for the column that you'd like to alter. For example, to make the % Duplicate Reads column from FastQC hidden by default, the Group is FastQC and the ID is percent_duplicates. These are then added to the config as follows:

table_columns_visible:
    FastQC:
        percent_duplicates: False

Note that you can set these to True to show columns that would otherwise be hidden by default.

Number formatting

By default, the interactive HighCharts plots in MultiQC reports use spaces for thousand separators and points for decimal places (e.g. 1 234 567.89). Different countries have different preferences for this, so you can customise the two using a couple of configuration parameters - decimalPoint_format and thousandsSep_format.

For example, the following config would result in the following alternative number formatting: 1234567,89.

decimalPoint_format: ','
thousandsSep_format: ''
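
To preview what a given pair of settings will look like, the formatting can be imitated in Python. The report itself does this in the browser, so the function below is only a stand-in. Its name and the fixed two decimal places are assumptions.

```python
def format_number(value, decimal_point=".", thousands_sep=" ", places=2):
    """Format a number with a custom decimal point and thousands separator."""
    text = "{:,.{p}f}".format(value, p=places)  # e.g. '1,234,567.89'
    integer, _, fraction = text.partition(".")
    integer = integer.replace(",", thousands_sep)
    return integer + (decimal_point + fraction if fraction else "")


print(format_number(1234567.89))           # prints 1 234 567.89
print(format_number(1234567.89, ",", ""))  # prints 1234567,89
```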

This formatting currently only applies to the interactive charts. It may be extended to apply elsewhere in the future (submit a new issue if you spot somewhere where you'd like it).

Troubleshooting

One tricky bit that caught me out whilst writing this is the different type casting between Python, YAML and Jinja2 templates. This is especially true when using an empty variable:

# Python
my_var = None
# YAML
my_var: null
# Jinja2
if my_var is none # Note - lower case!

Troubleshooting

Hopefully MultiQC will be easy to use and run without any hitches. If you have any problems, please do get in touch with the developer (Phil Ewels) by e-mail or by submitting an issue on GitHub. Before that, here are a few things previously encountered that may help...

Not enough samples found

In this scenario, MultiQC finds some logs for the bioinformatics tool in question, but not all of your samples appear in the report. This is the most common question I get regarding MultiQC operation.

Usually, this happens because sample names collide. This happens innocently a lot - MultiQC overwrites previous results of the same name and you get the last one seen in the report. You can see warnings about this by running MultiQC in verbose mode with the -v flag, or looking at the generated log file in multiqc_data/multiqc.log. If you are unsure about what log file ended up in the report, look at multiqc_data/multiqc_sources.txt which lists each source file used.

To solve this, try running MultiQC with the -d and -s flags. The Clashing sample names section of the docs explains this in more detail.
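
If you'd like to check for collisions yourself before running MultiQC, the idea can be sketched in a few lines of Python. This is an assumption-laden illustration: clean_name is a hypothetical stand-in for sample name cleaning (it is not MultiQC's real cleaning code), and the file paths are made up.

```python
import posixpath
from collections import defaultdict


def clean_name(path):
    """Hypothetical cleaning: drop the directory and everything after the first dot."""
    return posixpath.basename(path).split(".")[0]


def find_collisions(paths, namer):
    """Group log files by the sample name they produce and keep the clashes."""
    sources = defaultdict(list)
    for path in paths:
        sources[namer(path)].append(path)
    return {name: files for name, files in sources.items() if len(files) > 1}


logs = ["run1/sample_1.log", "run2/sample_1.log", "run2/sample_2.log"]

# Cleaned names clash: both sample_1 logs map to the same name
print(find_collisions(logs, clean_name))
# Keeping directories and full file names avoids the clash
print(find_collisions(logs, lambda path: path))
```

Any name reported by find_collisions would end up with only one of its files in the report.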

No logs found for a tool

In this case, you have run a bioinformatics tool and have some log files in a directory. When you run MultiQC with that directory, it finds nothing for the tool in question.

There are a couple of things you can check here:

  1. Is the tool definitely supported by MultiQC? If not, why not open an issue to request it!
  2. Did your bioinformatics tool definitely run properly? I've spent quite a bit of time debugging MultiQC modules only to realise that the output files from the tool were empty or incomplete. If your data is missing, take a look at the raw files and make sure that there's something to see!

If everything looks fine, then MultiQC probably needs extending to support your data. Tools have different versions, different parameters and different output formats that can confuse the parsing code. Please open an issue with your log files and we can get it fixed.

Error messages about mkl trial mode / licences

In this case you run MultiQC and get something like this:

$ multiqc .

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode EXPIRED 2 days ago

    You cannot run mkl without a license any longer.
    A license can be purchased it at: http://continuum.io
    We are sorry for any inconveniences.

    SHUTTING DOWN PYTHON INTERPRETER

The mkl library provides optimisations for numpy, a requirement of MatPlotLib. Recent versions of Conda have a bundled version which should come with a licence and remove the warning. See this page for more info. If you already have Conda installed you can get the updated version by running:

conda remove mkl-rt
conda install -f mkl

Another way around it is to uninstall mkl. It seems that numpy works fine without it:

$ conda remove --features mkl

Problem solved! See more here and here.

If you're not using Conda, try installing MultiQC with that instead. You can find instructions here.

Locale Error Messages

A more obscure problem is an error from the MatPlotLib python library about locale settings. Some strings (such as en_SE) aren't allowed. Running MultiQC gives the following error:

$ multiqc --version
# ..long traceback.. #
 File "/sw/comp/python/2.7.6_milou/lib/python2.7/locale.py", line 443, in _parse_localename
   raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

You can fix this by changing the locale to something that will be recognised. One way to do this is by adding these lines to your .bashrc in your home directory (or .bash_profile):

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

MultiQC Modules

Pre-alignment

Cutadapt

The Cutadapt module parses results generated by Cutadapt, a tool to find and remove adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

This module should be able to parse logs from a wide range of versions of Cutadapt. It's been tested with log files from v1.2.1, 1.6 and 1.8. Note that you will need to change the search pattern for very old log files (such as v.1.2) with the following MultiQC config:

sp:
    cutadapt:
        contents: 'cutadapt version'

See the module search patterns section of the MultiQC documentation for more information.

FastQC

The FastQC module parses results generated by FastQC, a quality control tool for high throughput sequence data written by Simon Andrews at the Babraham Institute.

FastQC generates a HTML report which is what most people use when they run the program. However, it also helpfully generates a file called fastqc_data.txt which is relatively easy to parse.

A typical run will produce the following files:

mysample_fastqc.html
mysample_fastqc/
  Icons/
  Images/
  fastqc.fo
  fastqc_data.txt
  fastqc_report.html
  summary.txt

Sometimes the directory is zipped, with just mysample_fastqc.zip.

The FastQC MultiQC module looks for files called fastqc_data.txt or ending in _fastqc.zip. If the zip files are found, they are read in memory and fastqc_data.txt parsed.

Note: The directory and zip file are often both present. To speed up MultiQC execution, zip files will be skipped if the file name suggests that they will share a sample name with data that has already been parsed.

You can customise the patterns used for finding these files in your MultiQC config using the following as a base (see the docs for more details):

sp:
    fastqc:
        data:
            fn: 'fastqc_data.txt'
        zip:
            fn: '_fastqc.zip'

Note: Sample names are discovered by parsing the line beginning Filename in fastqc_data.txt, not based on the FastQC report names.
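
As a rough illustration of that lookup, the following Python sketch pulls the Filename value out of the text of a fastqc_data.txt file. The function name and the abbreviated example content are assumptions for demonstration, and the sketch returns the raw value without any further sample name cleaning.

```python
def fastqc_filename(fastqc_data_text):
    """Return the value of the Filename line, or None if it is missing."""
    for line in fastqc_data_text.splitlines():
        if line.startswith("Filename"):
            # Fields in fastqc_data.txt are tab separated
            fields = line.split("\t", 1)
            if len(fields) == 2:
                return fields[1].strip()
    return None


# Abbreviated, made-up excerpt of a fastqc_data.txt file
example = (
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\tmysample.fastq.gz\n"
    "File type\tConventional base calls\n"
)
print(fastqc_filename(example))
```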

Theoretical GC Content

It is possible to plot a dashed line showing the theoretical GC content for a reference genome. MultiQC comes with genome and transcriptome guides for Human and Mouse. You can use these in your reports by adding the following MultiQC config keys (see Configuring MultiQC):

fastqc_config:
    fastqc_theoretical_gc: 'hg38_genome'

Only one theoretical distribution can be plotted. The following guides are available: hg38_genome, hg38_txome, mm10_genome, mm10_txome (txome = transcriptome).

Alternatively, a custom theoretical guide can be used in reports. To do this, create a file with fastqc_theoretical_gc in the filename and place it with your analysis files. It should be tab delimited with the following format (column 1 = %GC, column 2 = % of genome):

# FastQC theoretical GC content curve: YOUR REFERENCE NAME
0   0.005311768
1   0.004108502
2   0.004060371
3   0.005066476
[...]

You can generate these files using an R package called fastqcTheoreticalGC written by Mike Love. Please see the package readme for more details.
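
If you want to sanity-check a custom guide before using it, a small parser is enough. The sketch below assumes the two-column format shown above with an optional comment header; parse_theoretical_gc is a hypothetical name and this is not the code that MultiQC itself uses.

```python
def parse_theoretical_gc(text):
    """Parse a theoretical GC file into (reference name, list of points)."""
    name = None
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header comment, e.g. "# FastQC theoretical GC content curve: NAME"
            if ":" in line:
                name = line.split(":", 1)[1].strip()
            continue
        # Column 1 = %GC, column 2 = % of genome
        gc_percent, genome_percent = line.split()[:2]
        points.append((float(gc_percent), float(genome_percent)))
    return name, points


example = (
    "# FastQC theoretical GC content curve: YOUR REFERENCE NAME\n"
    "0\t0.005311768\n"
    "1\t0.004108502\n"
)
name, points = parse_theoretical_gc(example)
print(name, points)
```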

If you want to always use your custom file for MultiQC reports without having to add it to the analysis directory, add the full file path to the same MultiQC config variable described above:

fastqc_config:
    fastqc_theoretical_gc: '/path/to/your/custom_fastqc_theoretical_gc.txt'

FastQ Screen

The FastQ Screen module parses results generated by FastQ Screen, a tool that allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

Skewer

The Skewer module parses results generated by Skewer, an adapter trimming tool specially designed for processing next-generation sequencing (NGS) paired-end sequences.

Trimmomatic

The Trimmomatic module parses results generated by Trimmomatic, a flexible read trimming tool for Illumina NGS data.

Aligners

Bismark

The Bismark module parses logs generated by Bismark, a tool to map bisulfite converted sequence reads and determine cytosine methylation states.

Bowtie 1

The Bowtie 1 module parses results generated by Bowtie, an ultrafast, memory-efficient short read aligner.

Bowtie 2

The Bowtie 2 module parses results generated by Bowtie 2, an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.

Please note that the Bowtie 2 logs are difficult to parse as they don't contain much extra information (such as what the input data was). A typical log looks like this:

314537 reads; of these:
  314537 (100.00%) were paired; of these:
    111016 (35.30%) aligned concordantly 0 times
    193300 (61.46%) aligned concordantly exactly 1 time
    10221 (3.25%) aligned concordantly >1 times
    ----
    111016 pairs aligned concordantly 0 times; of these:
      11377 (10.25%) aligned discordantly 1 time
    ----
    99639 pairs aligned 0 times concordantly or discordantly; of these:
      199278 mates make up the pairs; of these:
        112779 (56.59%) aligned 0 times
        85802 (43.06%) aligned exactly 1 time
        697 (0.35%) aligned >1 times
82.07% overall alignment rate

Bowtie 2 logs are from STDERR - some pipelines (such as Cluster Flow) print the Bowtie 2 command before this, so MultiQC looks to see if this can be recognised in the same file. If not, it takes the filename as the sample name.

Bowtie 2 is used by other tools too, so if your log file contains the word bisulfite, MultiQC will assume that this is actually Bismark and ignore the Bowtie 2 logs.
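
To give a feel for what parsing such a log involves, here is a minimal Python sketch that extracts the total read count and overall alignment rate with regular expressions, and skips logs that mention bisulfite. The function name and returned keys are assumptions for illustration and do not reflect the module's real internals.

```python
import re


def parse_bowtie2_log(text):
    """Extract a couple of summary values from a Bowtie 2 STDERR log."""
    # Logs mentioning bisulfite are assumed to belong to Bismark instead
    if "bisulfite" in text:
        return None
    stats = {}
    total = re.search(r"^(\d+) reads; of these:", text, re.MULTILINE)
    rate = re.search(r"^([\d.]+)% overall alignment rate", text, re.MULTILINE)
    if total:
        stats["total_reads"] = int(total.group(1))
    if rate:
        stats["overall_alignment_rate"] = float(rate.group(1))
    return stats


example_log = (
    "314537 reads; of these:\n"
    "  314537 (100.00%) were paired; of these:\n"
    "    111016 (35.30%) aligned concordantly 0 times\n"
    "82.07% overall alignment rate\n"
)
print(parse_bowtie2_log(example_log))
```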

HiCUP

The HiCUP module parses results generated by HiCUP (Hi-C User Pipeline), a tool for mapping and performing quality control on Hi-C data.

Kallisto

The Kallisto module parses logs generated by Kallisto, a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

STAR

STAR is an ultrafast universal RNA-seq aligner.

This MultiQC module parses summary statistics from the Log.final.out log files. Sample names are taken from the filename prefix (sampleNameLog.final.out) when this is set with --outFileNamePrefix in STAR. If there is no filename prefix, the sample name is set as the name of the directory containing the file.

In addition to this summary log file, the module parses ReadsPerGene.out.tab files generated with --quantMode GeneCounts, if found.
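
The sample name rule for Log.final.out files can be sketched as follows. This is a simplified illustration with a hypothetical function name and made-up paths; it applies no further name cleaning, so a prefix such as sampleA_ would be returned as-is.

```python
import posixpath


def star_sample_name(path):
    """Use the filename prefix if there is one, else the parent directory name."""
    directory, filename = posixpath.split(path)
    suffix = "Log.final.out"
    if filename.endswith(suffix):
        prefix = filename[: -len(suffix)]
    else:
        prefix = filename
    if prefix:
        return prefix
    # No prefix was set, so fall back to the directory containing the file
    return posixpath.basename(directory)


print(star_sample_name("results/sampleALog.final.out"))
print(star_sample_name("results/sampleB/Log.final.out"))
```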

Salmon

The Salmon module parses results generated by Salmon, a tool for quantifying the expression of transcripts using RNA-seq data.

TopHat

The TopHat module parses results generated by TopHat, a fast splice junction mapper for RNA-Seq reads that aligns RNA-Seq reads to mammalian-sized genomes.

Post-alignment

Bamtools

The Bamtools module parses bamtools stats logs generated by Bamtools, a programmer's API and an end-user's toolkit for handling BAM files.

Supported commands: stats

Bcftools

The Bcftools module parses results generated by Bcftools, a suite of programs for interacting with variant call data.

Supported commands: stats

featureCounts

The featureCounts module parses results generated by featureCounts, a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations.

GATK

Developed by the Data Science and Data Engineering group at the Broad Institute, the GATK toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Supported tools: VariantEval

VariantEval

VariantEval is a general-purpose tool for variant evaluation. It gives information about percentage of variants in dbSNP, genotype concordance, Ti/Tv ratios and a lot more.

HTSeq

HTSeq is a general purpose Python package that provides infrastructure to process data from high-throughput sequencing assays. htseq-count is a tool that is part of the main HTSeq package - it takes a file with aligned sequencing reads, plus a list of genomic features and counts how many reads map to each feature.

methylQA

The methylQA module parses results generated by methylQA, a methylation sequencing data quality assessment tool.

Peddy

Peddy compares familial-relationships and sexes as reported in a PED file with those inferred from a VCF.

It samples the VCF at about 25000 sites (plus chrX) to accurately estimate relatedness, IBS0, heterozygosity, sex and ancestry. It uses 2504 samples from the 1000 Genomes Project as backgrounds to calibrate the relatedness calculation and to make ancestry predictions.

It does this very quickly by sampling, by using C for computationally intensive parts, and by parallelization.

Picard

The Picard module parses results generated by Picard, a set of Java command line tools for manipulating high-throughput sequencing data.

Supported commands:

  • MarkDuplicates
  • InsertSizeMetrics
  • GcBiasMetrics
  • HsMetrics
  • OxoGMetrics
  • BaseDistributionByCycle
  • RnaSeqMetrics
  • AlignmentSummaryMetrics
  • RrbsSummaryMetrics

Coverage Levels

It's possible to customise the HsMetrics "Target Bases 30X" coverage and WgsMetrics "Fraction of Bases over 30X" that are shown in the general statistics table. The values must correspond to field names in the Picard report, such as PCT_TARGET_BASES_2X / PCT_10X. Any numbers not found in the reports will be ignored.

The coverage levels available for HsMetrics are typically 1, 2, 10, 20, 30, 40, 50 and 100X.

The coverage levels available for WgsMetrics are typically 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90 and 100X.

To customise this, add the following to your MultiQC config:

picard_config:
    general_stats_target_coverage:
        - 10
        - 50
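
The relationship between the configured numbers and the report field names can be sketched like this. The field name patterns are taken from the examples above (PCT_TARGET_BASES_2X and PCT_10X); the function itself is a hypothetical illustration, not the module's code.

```python
def coverage_fields(levels, report_fields, tool="HsMetrics"):
    """Map coverage levels to Picard field names, dropping any not in the report."""
    templates = {
        "HsMetrics": "PCT_TARGET_BASES_{}X",
        "WgsMetrics": "PCT_{}X",
    }
    wanted = [templates[tool].format(level) for level in levels]
    # Levels without a matching field in the report are ignored
    return [field for field in wanted if field in report_fields]


hs_fields = {"PCT_TARGET_BASES_10X", "PCT_TARGET_BASES_30X", "PCT_TARGET_BASES_50X"}
print(coverage_fields([10, 50, 75], hs_fields))
print(coverage_fields([10], {"PCT_10X"}, tool="WgsMetrics"))
```

Here 75 is silently dropped because the example report has no PCT_TARGET_BASES_75X field.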

Preseq

The Preseq module parses results generated by Preseq, a tool that estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing total read count.

Prokka

The Prokka module analyses summary results from the Prokka annotation pipeline for prokaryotic genomes.
The Prokka module accepts two configuration options:

  • prokka_table: default False. Show a table in the report.
  • prokka_barplot: default True. Show a barplot in the report.

Qualimap

The Qualimap module parses results generated by Qualimap, a platform-independent application to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

Supported commands: BamQC, RNASeq

It is possible to customise which coverage thresholds are shown from BamQC in the General Statistics table (default: 1, 5, 10, 30, 50) and which of these are hidden when the report loads (default: all except 30X).

To do this, add something like the following to your MultiQC config file:

qualimap_config:
    general_stats_coverage:
        - 10
        - 20
        - 40
        - 200
        - 30000
    general_stats_coverage_hidden:
        - 10
        - 20
        - 200

QUAST

QUAST evaluates genome assemblies by computing various metrics, including

  • N50, length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
  • NG50, where length of the reference genome is being covered
  • NA50 and NGA50, where aligned blocks instead of contigs are taken
  • Misassemblies, misassembled and unaligned contigs or contig bases
  • Genes and operons covered

The QUAST MultiQC module parses the report.tsv files generated by QUAST and adds key metrics to the report General Statistics table. All statistics for all samples are saved to multiqc_data/multiqc_quast.txt.

RSeQC

The RSeQC module parses results generated by RSeQC, a package that provides a number of useful modules that can comprehensively evaluate high throughput RNA-seq data.

Supported scripts:

  • bam_stat
  • gene_body_coverage
  • infer_experiment
  • inner_distance
  • junction_annotation
  • junction_saturation
  • read_distribution
  • read_duplication
  • read_gc

Samblaster

The Samblaster module parses results generated by Samblaster, a tool to mark duplicates and extract discordant and split reads from sam files.

Samtools

The Samtools module parses results generated by Samtools, a suite of programs for interacting with high-throughput sequencing data.

Supported commands:

  • stats
  • flagstat
  • idxstats
  • rmdup

idxstats

The samtools idxstats command prints its results to standard out (no consistent file name) and has no header lines (no way to recognise from content of file). As such, idxstats result files must have the string idxstat somewhere in the filename.

There are a few MultiQC config options that you can add to customise how the idxstats module works. A typical configuration could look as follows:

# Always include these chromosomes in the plot
samtools_idxstats_always:
    - X
    - Y

# Never include these chromosomes in the plot
samtools_idxstats_ignore:
    - MT

# Threshold where chromosomes are ignored in the plot.
# Should be a fraction, default is 0.001 (0.1% of total)
samtools_idxstats_fraction_cutoff: 0.001

# Name of the X and Y chromosomes.
# If not specified, MultiQC will search for any chromosome
# names that look like x, y, chrx or chry (case insensitive search)
samtools_idxstats_xchr: myXchr
samtools_idxstats_ychr: myYchr
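
To show how these options might interact, here is a minimal Python sketch that reads idxstats output (tab-separated: name, length, mapped reads, unmapped reads) and decides which chromosomes to keep. It assumes that the cutoff is applied to each chromosome's share of all mapped reads and that ignore takes priority over always; both are guesses for illustration rather than a description of the module's exact logic.

```python
def select_chromosomes(idxstats_text, always=(), ignore=(), fraction_cutoff=0.001):
    """Pick chromosomes to plot from samtools idxstats output."""
    counts = {}
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        # Skip malformed lines and the "*" row used for unplaced reads
        if len(fields) < 3 or fields[0] == "*":
            continue
        counts[fields[0]] = int(fields[2])
    total = sum(counts.values())
    selected = {}
    for name, mapped in counts.items():
        if name in ignore:
            continue
        above_cutoff = total > 0 and mapped / total >= fraction_cutoff
        if name in always or above_cutoff:
            selected[name] = mapped
    return selected


example = "chr1\t1000\t9990\t0\nX\t500\t5\t0\nMT\t16\t5\t0\nchrUn\t10\t0\t0\n*\t0\t0\t12\n"
print(select_chromosomes(example, always=("X", "Y"), ignore=("MT",)))
```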

Slamdunk

Slamdunk is a tool to analyze data from the SLAM-Seq sequencing protocol.

This module should be able to parse logs from v0.2.2-dev onwards.

SnpEff

The SnpEff module parses results generated by SnpEff, a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

Custom Content

WARNING - This feature is new and is very much in a beta status. It is expected to be further developed in future releases, which may break backwards compatibility. There are also probably quite a few bugs. Use at your own risk! Please report bugs or missing functionality as a new GitHub issue.

Introduction

Bioinformatics projects often include non-standardised analyses, with results from custom scripts or in-house packages. It can be frustrating to have a MultiQC report describing results from 90% of your pipeline but missing the final key plot. To help with this, MultiQC has a special "custom content" module.

Custom content parsing is a little more restricted than standard modules. Specifically:

  • Only one plot per section is possible
  • Plot customisation is more limited

All plot types can be generated using custom content - see the test files for examples of how data should be structured.

Configuration

Data should typically be submitted alongside some configuration, to specify how MultiQC should parse and display the data. All of these configuration parameters are optional, and MultiQC will do its best to guess sensible defaults if they are not specified.

All possible configuration keys and their default values are shown below:

id: null                # Unique ID for report section.
section_anchor: <id>    # Used in report section #soft-links
section_name: <id>      # Nice name used for the report section header
section_href: null      # External URL for the data, to find more information
description: null       # Introductory text to be printed under the section header
file_format: null       # File format of the data (typically csv / tsv - see below for more information)
plot_type: null         # The plot type to visualise the data with.
                        # - Possible options: generalstats | table | bargraph | linegraph | scatter | heatmap | beeswarm
pconfig: {}             # Configuration for the plot. See http://multiqc.info/docs/#plotting-functions

Note that any custom content data found with the same section id will be merged into the same report section / plot. The other section configuration keys are merged for each file, with identical keys overwriting what was previously parsed.

This approach means that it's possible to have a single file containing data for multiple samples, but it's also possible to have one file per sample and still have all of them summarised.
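
The merging behaviour described above can be pictured with a short Python sketch. The function and the dictionary layout are hypothetical and only mirror the description: data with the same id is combined, while other keys are overwritten by whichever file is parsed last.

```python
def merge_custom_content(parsed_files):
    """Merge parsed custom content files into one entry per section id."""
    sections = {}
    for parsed in parsed_files:
        section = sections.setdefault(parsed["id"], {"config": {}, "data": {}})
        for key, value in parsed.items():
            if key == "data":
                # Sample data from every file ends up in the same section
                section["data"].update(value)
            elif key != "id":
                # Identical config keys overwrite what was parsed before
                section["config"][key] = value
    return sections


files = [
    {"id": "my_section", "section_name": "First", "data": {"sample_a": {"x": 1}}},
    {"id": "my_section", "section_name": "Second", "data": {"sample_b": {"x": 2}}},
]
merged = merge_custom_content(files)
print(merged)
```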

If you're using plot_type: 'generalstats' then a report section will not be created and most of the configuration keys above are ignored.

Data types generalstats and beeswarm are only possible by setting the above configuration keys (these can't be guessed by data format).

Data formats

MultiQC can parse custom data from a few different sources, in a number of different formats. Which one you use depends on how the data is being produced.

A quick summary of which approach to use looks something like this:

  • Additional data when already using custom MultiQC config files
    • Data as part of MultiQC config
  • Data specifically for MultiQC from a custom script
    • MultiQC-specific data file
  • Data from a custom script which is also used by other processes
    • Separate configuration and data files
    • Add _mqc.txt to filename and hope that MultiQC guesses correctly
  • Anything more complicated, or data from a released tool
    • Write a proper MultiQC module instead.

For more complete examples of the data formats understood by MultiQC, please see the data/custom_content directory in the MultiQC_TestData GitHub repository.

Data from a released tool

If your data comes from a released bioinformatics tool, you shouldn't be using this feature of MultiQC! Sure, you can probably get it to work, but it's better if a fully-fledged core MultiQC module is written instead. That way, other users of MultiQC can also benefit from results parsing.

Note that proper MultiQC modules are more robust and powerful than this custom-content feature. You can also write modules in MultiQC plugins if they're not suitable for general release.

Data as part of MultiQC config

If you are already using a MultiQC config file to add data to your report (for example, titles / introductory text), you can give data within this file too. This can be in any MultiQC config file (for example, passed on the command line with -c my_yaml_file.yaml). This is useful as you can keep everything contained within a single file (including stuff unrelated to this specific custom content feature of MultiQC).

If you're not using this file for other MultiQC configuration, you're probably better off using a stand-alone YAML file (see section below).

To be understood by MultiQC, the custom_data key must be found. This must contain a section with a unique id, specific to your new report section. This in turn must contain a section called data. Other configuration keys can be held alongside this. For example:

# Other MultiQC config stuff here
custom_data:
    my_data_type:
        id: 'mqc_config_file_section'
        section_name: 'My Custom Section'
        description: 'This data comes from a single multiqc_config.yaml file'
        plot_type: 'bargraph'
        pconfig:
            id: 'barplot_config_only'
            title: 'MultiQC Config Data Plot'
            ylab: 'Number of things'
        data:
            sample_a:
                first_thing: 12
                second_thing: 14
            sample_b:
                first_thing: 8
                second_thing: 6
            sample_c:
                first_thing: 11
                second_thing: 5
            sample_d:
                first_thing: 12
                second_thing: 9

Or to add data to the General Statistics table:

custom_data:
    my_genstats:
        plot_type: 'generalstats'
        pconfig:
            - col_1:
                max: 100
                min: 0
                scale: 'RdYlGn'
                format: '{:.1f}%'
            - col_2:
                min: 0
        data:
            sample_a:
                col_1: 14.32
                col_2: 1.2
            sample_b:
                col_1: 84.84
                col_2: 1.9

Note: Use a list of headers in pconfig (keys prepended with -) to specify the order of columns in the table.

See the general statistics docs for more information about configuring data for the General Statistics table.

MultiQC-specific data file

If you can choose exactly how your data output looks, then the easiest way to parse it is to use a MultiQC-specific format. If the filename ends in *_mqc.(yaml|json|txt|csv|out) then it will be found by any standard MultiQC installation with no additional customisation required (v0.9 onwards).

These files contain configuration information specifying how the data should be parsed, alongside the data. If using YAML, this looks just the same as if in a MultiQC config file (see above), but without having to be within a custom_data section:

id: 'my_pca_section'
section_name: 'PCA Analysis'
description: 'This plot shows the first two components from a principal component analysis.'
plot_type: 'scatter'
pconfig:
    id: 'pca_scatter_plot'
    title: 'PCA Plot'
    xlab: 'PC1'
    ylab: 'PC2'
data:
    sample_1: {x: 12, y: 14}
    sample_2: {x: 8, y: 6 }
    sample_3: {x: 5, y: 11}
    sample_4: {x: 9, y: 12}

The file format can also be JSON:

{
    "id": "custom_data_lineplot",
    "section_name": "Custom JSON File",
    "description": "This plot is a self-contained JSON file.",
    "plot_type": "linegraph",
    "pconfig": {
        "id": "custom_data_linegraph",
        "title": "Output from my JSON file",
        "ylab": "Number of things",
        "xDecimals": false
    },
    "data": {
        "sample_1": { "1": 12, "2": 14, "3": 10, "4": 7, "5": 16 },
        "sample_2": { "1": 9, "2": 11, "3": 15, "4": 18, "5": 21 }
    }
}
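
If your custom script is written in Python, producing such a file only needs the standard library. The sketch below is one possible approach: write_mqc_json and the my_script file name are made up for this example, and the keys are the ones shown in the JSON above.

```python
import json
import os
import tempfile


def write_mqc_json(directory, name, section):
    """Write a section dictionary to <name>_mqc.json and return the path."""
    path = os.path.join(directory, name + "_mqc.json")
    with open(path, "w") as handle:
        json.dump(section, handle, indent=4)
    return path


section = {
    "id": "custom_data_lineplot",
    "section_name": "Custom JSON File",
    "plot_type": "linegraph",
    "pconfig": {"id": "custom_data_linegraph", "ylab": "Number of things"},
    "data": {"sample_1": {"1": 12, "2": 14}, "sample_2": {"1": 9, "2": 11}},
}

# A temporary directory stands in for your analysis directory
out_dir = tempfile.mkdtemp()
path = write_mqc_json(out_dir, "my_script", section)
print(path)
```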

If you want the data to be easy to use with other tools, you can also use a comma-separated or tab-separated file. To customise plot output, include commented header lines with plot configuration in YAML format:

# title: 'Output from my script'
# description: 'This output is described in the file header. Any MultiQC installation will understand it without prior configuration.'
# section: 'Custom Data File'
# format: 'tsv'
# plot_type: 'bargraph'
# pconfig:
#    id: 'custom_bargraph_w_header'
#    ylab: 'Number of things'
Category_1    374
Category_2    229
Category_3    39
Category_4    253
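
A script can build this kind of file by prefixing each header line with a hash and then writing the tab-separated rows. The following Python sketch is only an illustration: build_mqc_tsv is a hypothetical helper, and the header is limited to flat keys (nested keys such as pconfig are left out to keep it simple).

```python
def build_mqc_tsv(header_lines, rows):
    """Return file contents: commented YAML header followed by tab-separated rows."""
    lines = ["# " + line for line in header_lines]
    lines += ["{}\t{}".format(label, value) for label, value in rows]
    return "\n".join(lines) + "\n"


header = [
    "title: 'Output from my script'",
    "format: 'tsv'",
    "plot_type: 'bargraph'",
]
rows = [("Category_1", 374), ("Category_2", 229)]
contents = build_mqc_tsv(header, rows)
print(contents)
```

Saving contents to a file whose name ends in _mqc.txt would match the naming scheme described above.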

If no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code. Something will probably be shown, but it may produce unexpected results.

Separate configuration and data files

It's not always possible or desirable to include MultiQC configuration within a data file. If this is the case, you can add to the MultiQC configuration to specify how input files should be parsed.

As described in the above Data as part of MultiQC config section, this configuration should be held within a section called custom_data with a section-specific id. The only difference is that no data subsection should be present and the sp key must be present.

The sp key tells MultiQC how to find files. This should have fn and/or contents under it, as described in the Module search patterns section of the docs.

For example:

# Other MultiQC config stuff here
custom_data:
    example_files:
        sp:
            fn: 'example_files_*'
        file_format: 'tsv'
        section_name: 'Coverage Decay'
        description: 'This plot comes from files accompanied by a multiqc_config.yaml file for configuration'
        plot_type: 'linegraph'
        pconfig:
            id: 'example_coverage_lineplot'
            title: 'Coverage Decay'
            ylab: 'X Coverage'
            ymax: 100
            ymin: 0

A data file within the MultiQC search directories could then simply look like this:

example_files_Sample_1.txt:

0   98.22076066
1   97.96764159
2   97.78227175
3   97.61262195
[...]

As mentioned above - if no configuration is given, MultiQC will do its best to guess how to visualise your data appropriately. To see examples of typical file structures which are understood, see the test data used to develop this code.

Coding with MultiQC

Writing New Templates

MultiQC is built around a templating system that uses the Jinja python package. This makes it very easy to create new report templates that fit your needs.

Core or plugin?

If your template could be of use to others, it would be great if you could add it to the main MultiQC package. You can do this by creating a fork of the MultiQC GitHub repository, adding your template and then creating a pull request to merge your changes back to the main repository.

If it's a very specific template, you can create a new Python package which acts as a plugin. For more information about this, see the plugins documentation.

Creating a template skeleton

For a new template to be recognised by MultiQC, it must be a python submodule directory with an __init__.py file. This must be referenced in the setup.py installation script as an entry point.

You can see the bundled templates defined in this way:

entry_points = {
    'multiqc.templates.v1': [
        'default = multiqc.templates.default',
        'default_dev = multiqc.templates.default_dev',
        'simple = multiqc.templates.simple',
        'geo = multiqc.templates.geo',
    ]
}

Note that these entry points can point to any Python modules, so if you're writing a plugin module you can specify your module name instead. Just make sure that multiqc.templates.v1 is the same.

Once you've added the entry point, remember to install the package again:

python setup.py develop

Using develop tells setuptools to softlink the plugin files instead of copying, so changes made whilst editing files will be reflected when you run MultiQC.

The __init__.py files must define two variables - the path to the template directory and the main jinja template file:

import os

template_dir = os.path.dirname(__file__)
base_fn = 'base.html'

Child templates

The default MultiQC template contains a lot of code. Importantly, it includes 1448 lines of custom JavaScript (at time of writing) which powers the plotting and dynamic functions in the report. You probably don't want to rewrite all of this for your template, so to make your life easier you can create a child template.

To do this, add an extra variable to your template's __init__.py:

template_parent = 'default'

This tells MultiQC to use the template files from the default template unless a file with the same name is found in your child template. For instance, if you just want to add your own logo in the header of the reports, you can create your own header.html which will overwrite the default header.
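
The fallback rule for child templates can be sketched in Python as below. This is a simplified picture of the behaviour described above, using a hypothetical function and temporary directories in place of real template packages.

```python
import os
import tempfile


def resolve_template_file(filename, child_dir, parent_dir):
    """Prefer the child template's copy of a file, else use the parent's."""
    child_path = os.path.join(child_dir, filename)
    if os.path.isfile(child_path):
        return child_path
    return os.path.join(parent_dir, filename)


# Stand-ins for the default (parent) template and a child template
parent_dir = tempfile.mkdtemp()
child_dir = tempfile.mkdtemp()
for name in ("base.html", "header.html"):
    open(os.path.join(parent_dir, name), "w").close()
# The child only overrides the header
open(os.path.join(child_dir, "header.html"), "w").close()

print(resolve_template_file("header.html", child_dir, parent_dir))
print(resolve_template_file("base.html", child_dir, parent_dir))
```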

Files within the default template have comments at the top explaining what part of the report they generate.

Extra init variables

There are a few extra variables that can be added to the __init__.py file to change how the report is generated.

Setting output_subdir instructs MultiQC to put the report and its contents into a subdirectory. Set the string to your desired name. Note that this will be prefixed if -p/--prefix is set at run time.

Secondly, you can copy additional files with your report when it is generated by setting copy_files. This is usually used to copy required images or scripts with the report. These should be a list of file or directory paths, relative to the __init__.py file. Directory contents will be copied recursively.

You can also override config options in the template. For example, setting the value of config.plots_force_flat can force the report to only have static image plots.

from multiqc.utils import config

output_subdir = 'multiqc_report'
copy_files = ['assets']
config.plots_force_flat = True

Jinja template variables

There are a number of variables that you can use within your Jinja template. Two namespaces are available - report and config. You can print these using the Jinja curly brace syntax, eg. {{ config.version }}. See the Jinja2 documentation for more information.

The default MultiQC template includes dependencies in the HTML so that the report is standalone. If you would like to do the same, use the include_file function. For example:

<script>{{ include_file('js/jquery.min.js') }}</script>
<img src="data:image/png;base64,{{ include_file('img/logo.png', b64=True) }}">
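
To illustrate the idea behind embedding files, here is a rough Python sketch of a helper that returns a file's contents as text, or base64 encoded for binary files such as images. It is not MultiQC's implementation: the base_dir parameter and the example files are assumptions made for this demonstration.

```python
import base64
import os
import tempfile


def include_file(path, b64=False, base_dir="."):
    """Return file contents as text, or as a base64 string if b64 is True."""
    full_path = os.path.join(base_dir, path)
    if b64:
        # Binary files are encoded so they can be placed in a data: URI
        with open(full_path, "rb") as handle:
            return base64.b64encode(handle.read()).decode("ascii")
    with open(full_path, "r") as handle:
        return handle.read()


# Made-up files standing in for template assets
asset_dir = tempfile.mkdtemp()
with open(os.path.join(asset_dir, "script.js"), "w") as handle:
    handle.write("var x = 1;")
with open(os.path.join(asset_dir, "logo.png"), "wb") as handle:
    handle.write(b"\x89PNG")

print(include_file("script.js", base_dir=asset_dir))
print(include_file("logo.png", b64=True, base_dir=asset_dir))
```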

Appendices

Custom plotting functions

If you don't like the default plotting functions built into MultiQC, you can write your own! If you create a callable variable in a template called either bargraph or linegraph, MultiQC will use that instead. For example:

def custom_linegraph(plotdata, pconfig):
    return '<h1>Awesome line graph here</h1>'
linegraph = custom_linegraph

def custom_bargraph(plotdata, plotseries, pconfig):
    return '<h1>Awesome bar graph here</h1>'
bargraph = custom_bargraph

These particular examples don't do very much, but hopefully you get the idea. Note that you have to set the variable linegraph or bargraph to your function.

Writing New Modules

Introduction

Writing a new module can at first seem a daunting task. However, MultiQC has been written (and refactored) to provide a lot of functionality as common functions.

Provided that you are familiar with writing Python and you have a read through the guide below, you should be on your way in no time!

If you have any problems, feel free to contact the author - details here: @ewels

Core modules / plugins

New modules can either be written as part of MultiQC or in a stand-alone plugin. If your module is for a publicly available tool, please add it to the main program and contribute your code back when complete via a pull request.

If your module is for something very niche, which no-one else can use, you can write it as part of a custom plugin. The process is almost identical, though it keeps the code bases separate. For more information about this, see the docs about MultiQC Plugins below.

Initial setup

Submodule

MultiQC modules are Python submodules - as such, they need their own directory in /multiqc/modules/ with an __init__.py file. The directory should share its name with the module. To follow common practice, the module code usually then goes in a separate Python file (also with the same name) which is then imported by __init__.py:

from __future__ import absolute_import
from .modname import MultiqcModule

Entry points

Once your submodule files are in place, you need to tell MultiQC that they are available as an analysis module. This is done within setup.py using entry points. In setup.py you will see some code that looks like this:

entry_points = {
    'multiqc.modules.v1': [
        'bismark = multiqc.modules.bismark:MultiqcModule',
        [...]
    ]
}

Copy one of the existing module lines and change it to use your module name. The order is irrelevant, so stick to alphabetical if in doubt. Once this is done, you will need to update your installation of MultiQC:

python setup.py develop

MultiQC config

So that MultiQC knows what order modules should be run in, you need to add your module to the core config file.

In multiqc/utils/config_defaults.yaml you should see a list variable called module_order. This contains the name of modules in order of precedence. Add your module here in an appropriate position.

Documentation

Next up, you need to create a documentation file for your module. The reason for this is twofold: firstly, docs are important to help people to use, debug and extend MultiQC (you're reading this, aren't you?). Secondly, having the file there with the appropriate YAML front matter will make the module show up on the MultiQC homepage so that everyone knows it exists. This process is automated once the file is added to the core repository.

This docs file should be placed in docs/modules/<your_module_name>.md and should have the following structure:

---
Name: Tool Name
URL: http://www.amazing-bfx-tool.com
Description: >
    This amazing tool does some really cool stuff. You can describe it
    here and split onto multiple lines if you want. Not too long though!
---

Your documentation goes here. Feel free to use markdown and write whatever
you think would be helpful. Please avoid using heading levels 1 to 3.

Readme and Changelog

Last but not least, remember to add your new module to the main README.md file and CHANGELOG.md, so that people know that it's there. Feel free to add your name to the list of credits at the bottom of the readme.

MultiqcModule Class

If you've copied one of the other entry point statements, it will have ended in :MultiqcModule - this tells MultiQC to try to execute a class or function called MultiqcModule.

To use the helper functions bundled with MultiQC, you should extend this class from multiqc.modules.base_module.BaseMultiqcModule. This will give you access to a number of functions on the self namespace. For example:

from multiqc.modules.base_module import BaseMultiqcModule

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # Initialise the parent object
        super(MultiqcModule, self).__init__(name='My Module', anchor='mymod',
        href="http://www.awesome_bioinfo.com/my_module",
        info="is an example analysis module used for writing documentation.")

Ok, that should be it! The __init__() function will now be executed every time MultiQC runs. Try adding a print("Hello World!") statement and see if it appears in the MultiQC logs at the appropriate time...

Note that the __init__ variables are used to create the header, URL link, analysis module credits and description in the report.

Logging

Last thing - MultiQC modules have a standardised way of producing output, so you shouldn't really use print() statements for your Hello World ;)

Instead, use the logger module as follows:

import logging
log = logging.getLogger(__name__)
# Initialise your class and so on
log.info('Hello World!')

Log messages can come in a range of formats:

  • log.debug
    • These only show if MultiQC is run in -v/--verbose mode
  • log.info
    • For more important status updates
  • log.warning
    • Alert user about problems that don't halt execution
  • log.error and log.critical
    • Not often used, these are for show-stopping problems
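
The effect of these levels can be tried out with a small stand-alone sketch. It uses only the Python standard library; make_logger is a hypothetical helper written for this illustration (MultiQC sets up its own logging for you), and the verbose argument only mimics what -v/--verbose does:

```python
import io
import logging

def make_logger(name, stream, verbose=False):
    """Return a logger writing to `stream`; DEBUG messages only pass when verbose."""
    logger = logging.getLogger(name)
    logger.handlers = []          # start clean if this name was used before
    logger.propagate = False      # keep the output confined to our stream
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    return logger

# Normal run: debug messages are dropped
quiet_out = io.StringIO()
log = make_logger('mymod_quiet', quiet_out)
log.debug('Parsed 3 files')
log.info('Found 3 reports')
log.warning('Skipping empty file')

# Verbose run: debug messages are kept
verbose_out = io.StringIO()
vlog = make_logger('mymod_verbose', verbose_out, verbose=True)
vlog.debug('Parsed 3 files')

print(quiet_out.getvalue())
print(verbose_out.getvalue())
```

In a real module you only need the two lines shown earlier (import logging and logging.getLogger(__name__)).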

Step 1 - Find log files

The first thing that your module will need to do is to find analysis log files. You can do this by searching for a filename fragment, or a string within the file. It's possible to search for both (a match on either will return the file) and also to have multiple strings possible.

First, add your default patterns to:

MULTIQC_ROOT/multiqc/utils/search_patterns.yaml

You should see the patterns for all other modules to give you an idea, but you want a yaml key with the name of your module, then either fn or contents for strings to match against filenames or file contents:

mymod:
    fn: _myprogram.txt
myothermod:
    contents: This is myprogram v1.3

Note that if you want to find multiple log files, you can nest these dictionaries (though they must end with either fn or contents). For example, see the FastQC module:

fastqc:
    data:
        fn: fastqc_data.txt
    zip:
        fn: _fastqc.zip

You can supply a list of strings if needed, eg. the bismark module:

bismark:
    align:
        fn:
            - _PE_report.txt
            - _SE_report.txt

The value of adding these strings here is that they can be overwritten by users in their own config files. This is helpful as people have weird and wonderful processing pipelines with their own file naming conventions.

Once your strings are added, you can access them in your module with config.sp['mymod']. Next, use the base function self.find_log_files() to look for your files like this:

self.find_log_files(config.sp['mymod'], filehandles=False)

This will recursively search the analysis directories looking for a matching file name (if the fn key is there) or a text string held within a file (if the contents key is there). Contents matching is only done on files smaller than config.log_filesize_limit (default 1MB). Note that both fn and contents can be used in combination if required - files will be returned if anything matches (OR not AND).

This function yields a dictionary with various information about matching files. The f key contains the contents of matching files by default.

# Find all files for mymod
for f in self.find_log_files(config.sp['mymod']):
    print(f['f'])        # File contents
    print(f['s_name'])   # Sample name (from cleaned fn)
    print(f['root'])     # Directory file was in
    print(f['fn'])       # Filename

If filehandles=True is specified, the f key contains a file handle instead:

# Find all files which contain the string 'My Statistic:'
# Return a filehandle instead of the file contents
for f in self.find_log_files(config.sp['mymod'], filehandles=True):
    line = f['f'].readline()  # f['f'] is now a filehandle instead of contents
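
The matching rules described above (filename fragment OR file contents, with a size limit on content searches) can be sketched as a stand-alone function. This is only an illustration of the logic, not MultiQC's own code: file_matches, as_list and SIZE_LIMIT are hypothetical names, and SIZE_LIMIT is an assumed stand-in for config.log_filesize_limit.

```python
import os
import tempfile

SIZE_LIMIT = 1000000  # assumed stand-in for config.log_filesize_limit

def as_list(value):
    """Allow a single string or a list of strings in a pattern."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def file_matches(path, pattern, size_limit=SIZE_LIMIT):
    """True if the filename contains an 'fn' fragment OR the file holds a 'contents' string."""
    fname = os.path.basename(path)
    if any(frag in fname for frag in as_list(pattern.get('fn'))):
        return True
    needles = as_list(pattern.get('contents'))
    # Only read files that are smaller than the size limit
    if needles and os.path.getsize(path) < size_limit:
        with open(path, 'r') as fh:
            text = fh.read()
        return any(needle in text for needle in needles)
    return False

# Build three small example files in a temporary directory
tmpdir = tempfile.mkdtemp()
by_name = os.path.join(tmpdir, 'sampleA_myprogram.txt')
by_text = os.path.join(tmpdir, 'run.log')
neither = os.path.join(tmpdir, 'notes.txt')
for path, text in [(by_name, 'x'), (by_text, 'This is myprogram v1.3\n'), (neither, 'nothing')]:
    with open(path, 'w') as fh:
        fh.write(text)

pattern = {'fn': '_myprogram.txt', 'contents': 'This is myprogram v1.3'}
matched = sorted(os.path.basename(p) for p in (by_name, by_text, neither)
                 if file_matches(p, pattern))
print(matched)
```

One file matches on its name, one on its contents, and the third is skipped because neither key matches.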

Step 2 - Parse data from the input files

What most MultiQC modules do once they have found matching analysis files is to pass the matched file contents to another function, responsible for parsing the data from the file. How this parsing is done will depend on the format of the log file and the type of data being read. See below for a basic example, based loosely on the preseq module:

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # [...]
        self.mod_data = dict()
        for f in self.find_log_files(config.sp['mymod']):
            self.mod_data[f['s_name']] = self.parse_logs(f['f'])

    def parse_logs(self, f):
        data = {}
        for l in f.splitlines():
            s = l.split()
            data[s[0]] = s[1]
        return data
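
To try the parsing step without running MultiQC, the same idea can be written as a plain function and fed an example string. This is a sketch under assumptions: the log format (whitespace-separated key and value, with # comment lines) is invented for the illustration, and the function also skips short lines and converts numbers, which the minimal example above does not.

```python
def parse_logs(text):
    """Parse 'key value' lines into a dict, converting values to float where possible."""
    data = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip comments, blank lines and anything without a key and a value
        if len(fields) < 2 or line.startswith('#'):
            continue
        try:
            data[fields[0]] = float(fields[1])
        except ValueError:
            data[fields[0]] = fields[1]
    return data

example_log = """# myprogram v1.3
total_reads 1000
mapped_reads 914

aligner bowtie2
"""
parsed = parse_logs(example_log)
print(parsed)
```

Converting to numbers at parse time means that later steps (tables and plots) do not have to guess whether a value is text or a number.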

No files found

If your module cannot find any matching files, it needs to raise an exception of type UserWarning. This tells the core MultiQC program that the module found nothing to report. For example:

if len(self.mod_data) == 0:
    log.debug("Could not find any data in {}".format(config.analysis_dir))
    raise UserWarning

Note that this has to be raised as early as possible, so that it halts the module progress. For example, if no logs are found then the module should not create any files or try to do any computation.

Custom sample names

Typically, sample names are taken from cleaned log filenames (the default f['s_name'] value returned). However, if possible, it's better to use the name of the input file (allowing for concatenated log files). To do this, you should use the self.clean_s_name() function, as this will prepend the directory name if requested on the command line:

input_fname = s[3] # Or parsed however
s_name = self.clean_s_name(input_fname, f['root'])

This function has already been applied to the contents of f['s_name'].

self.clean_s_name() must be used on sample names parsed from the file contents. Without it, features such as prepending directories (--dirs) will not work.

Identical sample names

If modules find samples with identical names, then the previous sample is overwritten. It's good to print a log statement when this happens, for debugging. However, most of the time it makes sense - programs often create log files and print to stdout for example.

if f['s_name'] in self.bowtie_data:
    log.debug("Duplicate sample name found! Overwriting: {}".format(f['s_name']))
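
If you store samples in several places, the check can be wrapped in a small helper. This is a minimal sketch: add_sample is a hypothetical function name, not part of the MultiQC base class.

```python
import logging

log = logging.getLogger(__name__)

def add_sample(store, s_name, parsed):
    """Store parsed data under s_name; return True if an existing entry was overwritten."""
    overwritten = s_name in store
    if overwritten:
        log.debug("Duplicate sample name found! Overwriting: {}".format(s_name))
    store[s_name] = parsed
    return overwritten

store = {}
first = add_sample(store, 'sample_1', {'reads': 100})
second = add_sample(store, 'sample_1', {'reads': 250})
print(first, second, store)
```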

Printing to the sources file

Finally, once you've found your file, add this information to the multiqc_sources.txt file in the MultiQC report data directory. This lists every sample name and the file that its data came from. This is especially useful if sample names are being overwritten, as it lists the source used. This code is typically written immediately after the above warning.

If you've used the self.find_log_files function, writing to the sources file is as simple as passing the log file variable to the self.add_data_source function:

for f in self.find_log_files(config.sp['mymod']):
    self.add_data_source(f)

If you have different files for different sections of the module, or are customising the sample name, you can tweak the fields. The default arguments are as shown:

self.add_data_source(f=None, s_name=None, source=None, module=None, section=None)

Step 4 - Adding to the general statistics table

Now that you have your parsed data, you can start inserting it into the MultiQC report. At the top of every report is the 'General Statistics' table. This contains metrics from all modules, allowing cross-module comparison.

There is a helper function to add your data to this table. It can take a lot of configuration options, but most have sensible defaults. At its simplest, it works as follows:

data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.general_stats_addcols(data)

To give more informative table headers and configure things like data scales and colour schemes, you can supply an extra dict:

from collections import OrderedDict
headers = OrderedDict()
headers['first_col'] = {
    'title': 'First',
    'description': 'My First Column',
    'scale': 'RdYlGn-rev'
}
headers['second_col'] = {
    'title': 'Second',
    'description': 'My Second Column',
    'max': 100,
    'min': 0,
    'scale': 'Blues',
    'format': '{:.1f}%'
}
self.general_stats_addcols(data, headers)

Here are all options for headers, with defaults:

headers['name'] = {
    'namespace': '',                # Module name. Auto-generated for General Statistics.
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'scale': 'GnBu',                # Colour scale for colour coding
    'format': '{:.1f}',             # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False                 # Set to True to hide the column on page load
}
  • namespace
    • This prepends the column title in the mouse hover: Namespace: Title. It's automatically generated for the General Statistics table.
  • scale
    • Colour scales are the names of ColorBrewer palettes. See the chroma.js documentation for a list of available colour scales
    • Add -rev to the name of a colour scale to reverse it
  • shared_key
    • Any string can be specified here, if other columns are found that share the same key, a consistent colour scheme and data scale will be used in the table. Typically this is set to things like read_count, so that the read count in a sample can be seen varying across analysis modules.
  • modify
    • A python lambda function to change the data in some way when it is inserted into the table. Typically, this is used to divide numbers to show millions: 'modify': lambda x: x / 1000000
  • hidden
    • Setting this to True will hide the column when the report loads. It can then be shown through the Configure Columns modal in the report. This is helpful for data that is only sometimes useful. For example, some modules show "percentage aligned" on page load but hide "number of reads aligned".

A third parameter can be passed to this function, namespace. This is usually not needed - MultiQC automatically takes the name of the module that is calling the function and uses this. However, sometimes it can be useful to overwrite this.
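
To see how the modify and format options interact, here is a small stand-alone sketch. It assumes that the modify function is applied first and the format string second; format_cell is a hypothetical helper for this illustration, not a MultiQC function.

```python
def format_cell(value, header):
    """Apply the optional 'modify' callable, then the 'format' string."""
    modify = header.get('modify')
    if modify is not None:
        value = modify(value)
    return header.get('format', '{:.1f}').format(value)

reads_header = {
    'title': 'M Reads',
    # Divide by a float so that Python 2 does not use integer division
    'modify': lambda x: x / 1000000.0,
    'format': '{:.1f}'
}
percent_header = {'title': '% Aligned', 'format': '{:.1f}%'}

reads_cell = format_cell(2500000, reads_header)
percent_cell = format_cell(78.234, percent_header)
print(reads_cell, percent_cell)
```

Note that the format strings here expect numbers, so it is usually easier to store numeric values in your data and leave any units to 'format'.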

Step 5 - Writing data to a file

In addition to printing data to the General Stats, MultiQC modules typically also write to text-files to allow people to easily use the data in downstream applications. This also gives the opportunity to output additional data that may not be appropriate for the General Statistics table.

Again, there is a base class function to help you with this - just supply it with a dictionary and a filename:

data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.write_data_file(data, 'multiqc_mymod')

If your output has a lot of columns, you can supply the additional argument sort_cols = True to have the columns alphabetically sorted.

This function will also pay attention to the default / command line supplied data format and behave accordingly. So the written file could be a tab-separated file (default), JSON or YAML.

Note that any keys with more than 2 levels of nesting will be ignored when being written to tab-separated files.
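
The shape of the tab-separated output can be illustrated with a short sketch. This is not the code behind self.write_data_file(): to_tsv is a hypothetical function, and the 'Sample' header name and the handling of missing values are assumptions made for the example.

```python
def to_tsv(data, sort_cols=False):
    """Render a two-level dict (sample -> column -> value) as tab-separated text."""
    cols = []
    for sample_data in data.values():
        for key, value in sample_data.items():
            # Deeper nesting cannot be shown in a flat table, so skip it
            if isinstance(value, dict):
                continue
            if key not in cols:
                cols.append(key)
    if sort_cols:
        cols = sorted(cols)
    lines = ['\t'.join(['Sample'] + cols)]
    for s_name in sorted(data):
        row = [s_name] + [str(data[s_name].get(col, '')) for col in cols]
        lines.append('\t'.join(row))
    return '\n'.join(lines) + '\n'

data = {
    'sample_2': {'second_col': '66.3%', 'first_col': 138.3},
    'sample_1': {'first_col': 91.4, 'second_col': '78.2%', 'details': {'note': 'nested'}},
}
tsv = to_tsv(data, sort_cols=True)
print(tsv)
```

The nested 'details' entry is left out of the table, which mirrors the note above about deeply nested keys.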

Step 6 - Create report sections

Great! It's time to start creating sections of the report with more information. If you only have one plot / section to create, just add it to the introduction. For example (content to be replaced with a brilliant plot in a later part of the docs):

self.intro += 'My amazing module output'

If you have multiple plots to show (eg. the Qualimap and FastQC modules), you can create a list to hold the sections:

self.sections = list()
self.sections.append({
    'name': 'First Module Section',
    'anchor': 'mymod_first',
    'content': 'My amazing module output, from the first section'
})
self.sections.append({
    'name': 'Second Module Section',
    'anchor': 'mymod_second',
    'content': 'My amazing module output, from the second section'
})

These will automatically be labelled and linked in the navigation.

Step 7 - Plot some data

Ok, you have some data, now the fun bit - visualising it! Each of the plot types is described in the Plotting Functions section of the docs.

Appendices

Profiling Performance

It's important that MultiQC runs quickly and efficiently, especially on big projects with large numbers of samples. The recommended method to check this is by using cProfile to profile the code execution. To do this, run MultiQC as follows:

python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc -f .

You can create a .bashrc alias to make this easier to run:

alias profile_multiqc='python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc '
profile_multiqc -f .

MultiQC should run as normal, but produce the additional binary file multiqc_profile.prof. This can then be visualised with software such as SnakeViz.

To install SnakeViz and visualise the results, do the following:

pip install snakeviz
snakeviz multiqc_profile.prof

A web page should open where you can explore the execution times of different nested functions. It's a good idea to run MultiQC with a comparable number of results from other tools (eg. FastQC) to have a reference to compare against for how long the code should take to run.
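
If you would rather inspect a profile in the terminal, the standard library pstats module can read the same data. The sketch below profiles a toy function (slow_sum is a made-up stand-in for a MultiQC run) and prints the five most expensive calls by cumulative time:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Toy workload standing in for a MultiQC run."""
    return sum(i * i for i in range(n))

# Collect profiling data for the toy workload
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# Sort by cumulative time and show the top five entries
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats('cumulative').print_stats(5)
report = buf.getvalue()
print(report)
```

To read the file produced by the command above instead, pass its path to pstats.Stats, eg. pstats.Stats('multiqc_profile.prof').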

Adding Custom CSS / Javascript

If you would like module-specific CSS and / or JavaScript added to the template, just add to the self.css and self.js dictionaries that come with the BaseMultiqcModule class. The key should be the filename that you want your file to have in the generated report folder (this is ignored in the default template, which includes the content file directly in the HTML). The dictionary value should be the path to the desired file. For example, see how it's done in the FastQC module:

self.css = {
    'assets/css/multiqc_fastqc.css' :
        os.path.join(os.path.dirname(__file__), 'assets', 'css', 'multiqc_fastqc.css')
}
self.js = {
    'assets/js/multiqc_fastqc.js' :
        os.path.join(os.path.dirname(__file__), 'assets', 'js', 'multiqc_fastqc.js')
}

Plotting Functions

MultiQC plotting functions are held within multiqc.plots submodules. To use them, simply import the modules you want, eg.:

from multiqc.plots import bargraph, linegraph

Each plot type lives in its own submodule, so import the ones that you need (scatter, table, beeswarm and heatmap are imported in the same way). Each one provides a plot() function:

bargraph.plot()
linegraph.plot()
scatter.plot()
table.plot()
beeswarm.plot()
heatmap.plot()

These have been designed to work in a similar manner to each other - you pass a data structure to them, along with optional extras such as categories and configuration options, and they return a string of HTML to add to the report. You can add this to the module introduction or sections as described above. For example:

self.sections.append({
    'name': 'Module Section',
    'anchor': 'mymod_section',
    'content': bargraph.plot(self.parsed_data, categories, pconfig)
})

Bar graphs

Simple data can be plotted in bar graphs. Many MultiQC modules make use of stacked bar graphs. Here, the bargraph.plot() function comes to the rescue. A basic example is as follows:

from multiqc.plots import bargraph
data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
html_content = bargraph.plot(data)

To specify the order of categories in the plot, you can supply a list of dictionary keys. This can also be used to exclude a key from the plot.

cats = ['aligned', 'not_aligned']
html_content = bargraph.plot(data, cats)

If cats is given as a dict instead of a list, you can specify a nice name and a colour too. Make it an OrderedDict to specify the order:

from collections import OrderedDict
cats = OrderedDict()
cats['aligned'] = {
    'name': 'Aligned Reads',
    'color': '#8bbc21'
}
cats['not_aligned'] = {
    'name': 'Unaligned Reads',
    'color': '#f7a35c'
}
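
To make the effect of the category list concrete, the sketch below turns the data and cats structures into one data series per category. It is an illustration only and not how bargraph.plot() is implemented; build_series is a hypothetical function, and filling missing values with 0 is an assumption.

```python
from collections import OrderedDict

def build_series(data, cats):
    """Return sample names and one series per category, in the order given by cats."""
    if not isinstance(cats, dict):
        # A plain list of keys: use the key itself as the display name
        cats = OrderedDict((key, {'name': key}) for key in cats)
    samples = sorted(data)
    series = []
    for key, conf in cats.items():
        series.append({
            'name': conf.get('name', key),
            'color': conf.get('color'),
            'data': [data[s].get(key, 0) for s in samples],
        })
    return samples, series

data = {
    'sample 1': {'aligned': 23542, 'not_aligned': 343, 'discarded': 12},
    'sample 2': {'not_aligned': 7328, 'aligned': 1275},
}
cats = OrderedDict()
cats['aligned'] = {'name': 'Aligned Reads', 'color': '#8bbc21'}
cats['not_aligned'] = {'name': 'Unaligned Reads', 'color': '#f7a35c'}

samples, series = build_series(data, cats)
print(samples)
print(series)
```

The 'discarded' key is not listed in cats, so it does not appear in the result. This is the same mechanism that lets you exclude a key from the plot.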

Finally, a third variable can be supplied with configuration variables for the plot. The defaults are as follows:

config = {
    # Building the plot
    'id': '<random string>',                # HTML ID used for plot
    'cpswitch': True,                       # Show the 'Counts / Percentages' switch?
    'cpswitch_c_active': True,              # Initial display with 'Counts' specified? False for percentages.
    'cpswitch_counts_label': 'Counts',      # Label for 'Counts' button
    'cpswitch_percent_label': 'Percentages', # Label for 'Percentages' button
    'logswitch': False,                     # Show the 'Log10' switch?
    'logswitch_active': False,              # Initial display with 'Log10' active?
    'logswitch_label': 'Log10',             # Label for 'Log10' button
    'hide_zero_cats': True,                 # Hide categories where data for all samples is 0
    # Customising the plot
    'title': None,                          # Plot title
    'xlab': None,                           # X axis label
    'ylab': None,                           # Y axis label
    'ymax': None,                           # Max y limit
    'ymin': None,                           # Min y limit
    'yCeiling': None,                       # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,                         # Minimum value for automatic axis limit
    'yMinRange': None,                      # Minimum range for axis
    'yDecimals': True,                      # Set to false to only show integer labels
    'ylab_format': None,                    # Format string for y axis labels. Defaults to {value}
    'stacking': 'normal',                   # Set to None to have category bars side by side
    'use_legend': True,                     # Show / hide the legend
    'click_func': None,                     # Javascript function to be called when a point is clicked
    'cursor': None,                         # CSS mouse cursor type.
    'tt_decimals': 0,                       # Number of decimal places to use in the tooltip number
    'tt_suffix': '',                        # Suffix to add after tooltip number
    'tt_percentages': True,                 # Show the percentages of each count in the tooltip
}

Switching datasets

It's possible to have a single plot with buttons to switch between different datasets. To do this, give a list of data objects (same formats as described above). Also add the following config options to supply names to the buttons:

config = {
    'data_labels': ['Reads', 'Bases']
}

You can also customise the y-axis label and min/max values for each dataset:

config = {
    'data_labels': [
        {'name': 'Reads', 'ylab': 'Number of Reads'},
        {'name': 'Bases', 'ylab': 'Number of Base Pairs', 'ymax':100}
    ]
}

If supplying multiple datasets, you can also supply a list of category objects. Make sure that they are in the same order as the data.

Categories should contain data keys, so if you're supplying a list of two datasets, you should supply a list of two sets of keys for the categories. MultiQC will try to guess categories from the data keys if categories are missing.

For example, with two datasets supplied as above:

cats = [
    ['aligned_reads','unaligned_reads'],
    ['aligned_base_pairs','unaligned_base_pairs'],
]

Or with additional customisation such as name and colour:

from collections import OrderedDict
cats = [OrderedDict(), OrderedDict()]
cats[0]['aligned_reads'] =        {'name': 'Aligned Reads',        'color': '#8bbc21'}
cats[0]['unaligned_reads'] =      {'name': 'Unaligned Reads',      'color': '#f7a35c'}
cats[1]['aligned_base_pairs'] =   {'name': 'Aligned Base Pairs',   'color': '#8bbc21'}
cats[1]['unaligned_base_pairs'] = {'name': 'Unaligned Base Pairs', 'color': '#f7a35c'}

Interactive / Flat image plots

Note that the bargraph.plot() function can generate both interactive JavaScript (HighCharts) powered report plots and flat image plots made using MatPlotLib. This choice is made within the function based on config variables such as number of dataseries and command line flags.

Note that both plot types should come out looking pretty much identical. If you spot something that's missing in the flat image plots, let me know.

Line graphs

This base function works much like the above, but for two-dimensional data, to produce line graphs. It expects a dictionary in the following format:

from multiqc.plots import linegraph
data = {
    'sample 1': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    },
    'sample 2': {
        '<x val 1>': '<y val 1>',
        '<x val 2>': '<y val 2>',
    }
}
html_content = linegraph.plot(data)
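
As a concrete example of this format, the sketch below builds line graph data from raw values. The read lengths are invented for the illustration and length_histogram is a hypothetical helper; only the resulting dictionary layout (sample name, then x value to y value) comes from the description above.

```python
from collections import Counter

def length_histogram(lengths):
    """Count how often each length occurs: {x value: y value}."""
    counts = Counter(lengths)
    return {x: counts[x] for x in sorted(counts)}

data = {
    'sample 1': length_histogram([50, 50, 51, 75]),
    'sample 2': length_histogram([50, 75, 75]),
}
print(data)
```

Keeping the x and y values numeric (rather than strings) lets the plot treat the x axis as a number line.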

Additionally, a config dict can be supplied. The defaults are as follows:

from multiqc.plots import linegraph
config = {
    # Building the plot
    'smooth_points': None,       # Supply a number to limit number of points / smooth data
    'smooth_points_sumcounts': True, # Sum counts in bins, or average? Can supply list for multiple datasets
    'id': '<random string>',     # HTML ID used for plot
    'categories': False,         # Set to True to use x values as categories instead of numbers.
    'colors': dict(),            # Provide dict with keys = sample names and values colours
    'extra_series': None,        # See section below
    # Plot configuration
    'title': None,               # Plot title
    'xlab': None,                # X axis label
    'ylab': None,                # Y axis label
    'xCeiling': None,            # Maximum value for automatic axis limit (good for percentages)
    'xFloor': None,              # Minimum value for automatic axis limit
    'xMinRange': None,           # Minimum range for axis
    'xmax': None,                # Max x limit
    'xmin': None,                # Min x limit
    'xLog': False,               # Use log10 x axis?
    'xDecimals': True,           # Set to false to only show integer labels
    'yCeiling': None,            # Maximum value for automatic axis limit (good for percentages)
    'yFloor': None,              # Minimum value for automatic axis limit
    'yMinRange': None,           # Minimum range for axis
    'ymax': None,                # Max y limit
    'ymin': None,                # Min y limit
    'yLog': False,               # Use log10 y axis?
    'yDecimals': True,           # Set to false to only show integer labels
    'yPlotBands': None,          # Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    'xPlotBands': None,          # Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    'yPlotLines': None,          # Highlighted background lines. See http://api.highcharts.com/highcharts#yAxis.plotLines
    'xPlotLines': None,          # Highlighted background lines. See http://api.highcharts.com/highcharts#xAxis.plotLines
    'tt_label': '{point.x}: {point.y:.2f}', # Use to customise tooltip label, eg. '{point.x} base pairs'
    'pointFormat': None,         # Replace the default HTML for the entire tooltip label
    'click_func': None,          # Javascript function to be called when a point is clicked
    'cursor': None,              # CSS mouse cursor type. Defaults to pointer when 'click_func' specified
    'reversedStacks': False      # Reverse the order of the category stacks. Defaults True for plots with Log10 option
}
html_content = linegraph.plot(data, config)

Switching datasets

You can also have a single plot with buttons to switch between different datasets. To do this, just supply a list of data dicts instead (same formats as described above). Also add the following config options to supply names to the buttons and graph labels:

config = {
    'data_labels': [
        {'name': 'DS 1', 'ylab': 'Dataset 1'},
        {'name': 'DS 2', 'ylab': 'Dataset 2'}
    ]
}

All of these config values are optional - the function will default to sensible values if things are missing. See the cutadapt module plots for an example of this in action.

Additional data series

Sometimes, it's good to be able to add particular data series manually. To do this, use config['extra_series']. For a single extra line this can be a dict (as below). For multiple lines, use a list of dicts. For multiple dataset plots, use a list of lists of dicts.

For example, to add a dotted x = y reference line:

from multiqc.plots import linegraph
config = {
    'extra_series': {
        'name': 'x = y',
        'data': [[0, 0], [max_x_val, max_y_val]],
        'dashStyle': 'Dash',
        'lineWidth': 1,
        'color': '#000000',
        'marker': { 'enabled': False },
        'enableMouseTracking': False,
        'showInLegend': False,
    }
}
html_content = linegraph.plot(data, config)
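
The end point of the reference line has to be worked out from your data. One way to do this is sketched below; reference_line is a hypothetical helper and the data values are invented. It uses the same limit for both coordinates so that the line really is x = y, which is a choice made for this example.

```python
def reference_line(data):
    """Build an 'extra_series' dict for a dashed x = y line spanning the data."""
    xs = [x for series in data.values() for x in series]
    ys = [y for series in data.values() for y in series.values()]
    # Use one shared limit so the line runs along the diagonal
    limit = max(max(xs), max(ys))
    return {
        'name': 'x = y',
        'data': [[0, 0], [limit, limit]],
        'dashStyle': 'Dash',
        'lineWidth': 1,
        'color': '#000000',
        'marker': {'enabled': False},
        'enableMouseTracking': False,
        'showInLegend': False,
    }

data = {
    'sample 1': {1: 2.0, 5: 4.5},
    'sample 2': {2: 1.0, 8: 6.0},
}
config = {'extra_series': reference_line(data)}
print(config['extra_series']['data'])
```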

Scatter Plots

Scatter plots work in almost exactly the same way as line plots. Most (if not all) config options are shared between the two. The data structure is similar but not identical:

from multiqc.plots import scatter
data = {
    'sample 1': {
        'x': '<x val>',
        'y': '<y val>'
    },
    'sample 2': {
        'x': '<x val>',
        'y': '<y val>'
    }
}
html_content = scatter.plot(data)

If you want more than one data point per sample, you can supply a list of dictionaries instead. You can also optionally specify point colours and sample name suffixes (these are appended to the sample name):

data = {
    'sample 1': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#a6cee3', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#1f78b4', 'name': 'Type 2' }
    ],
    'sample 2': [
        { 'x': '<x val>', 'y': '<y val>', 'color': '#b2df8a', 'name': 'Type 1' },
        { 'x': '<x val>', 'y': '<y val>', 'color': '#33a02c', 'name': 'Type 2' }
    ]
}

Remember that MultiQC reports can contain large numbers of samples, so this plot type is not suitable for large quantities of data - 20,000 genes might look good for one sample, but when someone runs MultiQC with 500 samples, it will crash the browser and be impossible to interpret.
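
A scatter plot with one point per sample keeps the data volume small. The sketch below pairs up two per-sample metrics into the structure described above; scatter_points is a hypothetical helper and the metric values are made up for the illustration.

```python
def scatter_points(x_metric, y_metric):
    """Combine two {sample: value} dicts into {sample: {'x': ..., 'y': ...}}."""
    return {
        s_name: {'x': x_metric[s_name], 'y': y_metric[s_name]}
        for s_name in sorted(x_metric)
        if s_name in y_metric  # a point needs both coordinates
    }

fraction_aligned = {'sample 1': 0.91, 'sample 2': 0.66, 'sample 3': 0.50}
read_counts = {'sample 1': 23542, 'sample 2': 1275}

data = scatter_points(fraction_aligned, read_counts)
print(data)
```

Samples that are missing either value are left out, rather than plotted with a placeholder.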

See the above docs about line plots for most config options. The scatter plot has a handful of unique ones in addition:

pconfig = {
    'marker_colour': 'rgba(124, 181, 236, .5)', # string, base colour of points (recommend rgba / semi-transparent)
    'marker_size': 5,               # int, size of points
    'marker_line_colour': '#999',   # string, colour of point border
    'marker_line_width': 1,         # int, width of point border
    'square': False                 # Force the plot to stay square? (Maintain aspect ratio)
}

Creating a table

Tables should work just like the functions above (most like the bar graph function). As a minimum, the function takes a dictionary containing data - the first keys will be sample names (row headers) and each key contained within will be a table column header.

You can also supply a list of key names to restrict the data in the table to certain keys / columns. This also specifies the order that columns should be displayed in.

For more customisation, the headers can be supplied as a dictionary. Each key should match the keys used in the data dictionary, but values can customise the output. If you want to specify the order of the columns, you must use an OrderedDict.

Finally, the function accepts a third parameter, a config dictionary. This can set global options for the table (eg. a title) and can also hold default values to customise the output of all table columns.

The default header keys and table config options are:

single_header = {
    'namespace': '',                # Name for grouping in table
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'scale': 'GnBu',                # Colour scale for colour coding
    'colour': '<auto from palette>',# Colour for column grouping
    'format': '{:.1f}',             # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False                 # Set to True to hide the column on page load
}
table_config = {
    'id': '<random string>',                 # ID used for the table
    'table_title': '<table id>',             # Title of the table. Used in the column config modal
    'save_file': False,                      # Whether to save the table data to a file
    'raw_data_fn':'multiqc_<table_id>_table', # File basename to use for raw data file
    'no_beeswarm': False    # Force a table to always be plotted (beeswarm by default if many rows)
}

Finally, a third parameter can be specified with table settings.

tconfig = {
    'id': None,             # ID for the table in the HTML
    'table_title': None,    # Title printed above the table
    'save_file': False,     # Save the data in the table to a file in `multiqc_data`
    'raw_data_fn': None,    # Filename to use if saving data file
    'no_beeswarm': False    # Force a table to always be plotted (beeswarm by default if many rows)
}

A very basic example is shown below:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
    }
}
table_html = table.plot(data)

A more complicated version with ordered columns, defaults and column-specific settings:

from collections import OrderedDict

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
        'aligned_percent': 98.563952271
    },
    'sample 2': {
        'aligned': 1275,
        'not_aligned': 7328,
        'aligned_percent': 14.820411484
    }
}
headers = OrderedDict()
headers['aligned_percent'] = {
    'title': '% Aligned',
    'description': 'Percentage of reads that aligned',
    'format': '{:.1f}%',
    'suffix': '%',
    'max': 100,
}
headers['aligned'] = {
    'title': 'M Aligned',
    'description': 'Aligned Reads (millions)',
    'format': '{:.2f}',
    'shared_key': 'read_count',
    'modify': lambda x: x / 1000000
}
config = {
    'namespace': 'My Module',
    'min': 0,
    'scale': 'GnBu'
}
table_html = table.plot(data, headers, config)
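
To see how the 'modify' and 'format' header keys interact, the sketch below applies them to single values by hand. It assumes that 'modify' is applied before 'format'. The helper render_cell is hypothetical and only illustrates the idea; it is not MultiQC's own rendering code.

```python
def render_cell(value, header):
    """Apply an optional 'modify' callable, then the 'format' string."""
    modify = header.get('modify')
    if modify is not None:
        value = modify(value)
    # '{:.1f}' is the documented default format string
    return header.get('format', '{:.1f}').format(value)

aligned_header = {'format': '{:.2f}', 'modify': lambda x: x / 1000000}
percent_header = {'format': '{:.1f}%'}

aligned_text = render_cell(23542, aligned_header)
percent_text = render_cell(98.563952271, percent_header)
print(aligned_text, percent_text)
```

With these inputs the raw read count is scaled to millions before it is rounded, which is why the column title in the example above reads 'M Aligned'.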

Beeswarm plots (dot plots)

Beeswarm plots work from exactly the same data structure as tables, so the usage is the same, except that you call beeswarm instead of table:

data = {
    'sample 1': {
        'aligned': 23542,
        'not_aligned': 343,
    },
    'sample 2': {
        'not_aligned': 7328,
        'aligned': 1275,
    }
}
beeswarm_html = beeswarm.plot(data)

The function also accepts the same headers and config parameters.

Heatmaps

Heatmaps expect data as a list of lists, followed by a list of sample names for the x-axis and, optionally, a list for the y-axis (which defaults to the same as the x-axis).

heatmap.plot(data, xcats, ycats, pconfig)

A simple example:

hmdata = [
    [0.9, 0.87, 0.73, 0.6, 0.2, 0.3],
    [0.87, 1, 0.7, 0.6, 0.9, 0.3],
    [0.73, 0.8, 1, 0.6, 0.9, 0.3],
    [0.6, 0.8, 0.7, 1, 0.9, 0.3],
    [0.2, 0.8, 0.7, 0.6, 1, 0.3],
    [0.3, 0.8, 0.7, 0.6, 0.9, 1],
]
names = [ 'one', 'two', 'three', 'four', 'five', 'six' ]
hm_html = heatmap.plot(hmdata, names)
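
Because the data and the category lists are passed separately, it is easy for their sizes to drift apart. The sketch below is a small sanity check that could be run before plotting. It assumes that each inner list is one row (one y-axis category) holding one value per x-axis category. The helper check_heatmap_shape is hypothetical and not part of MultiQC.

```python
def check_heatmap_shape(data, xcats, ycats=None):
    """Return True if data has one row per y category and one value per x category."""
    if ycats is None:
        ycats = xcats  # mirrors the documented default for the y-axis
    if len(data) != len(ycats):
        return False
    return all(len(row) == len(xcats) for row in data)

# Hypothetical non-square example: two rows, three columns
small_data = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.7],
]
ok = check_heatmap_shape(small_data, ['a', 'b', 'c'], ['row 1', 'row 2'])
bad = check_heatmap_shape(small_data, ['a', 'b', 'c'])
print(ok, bad)
```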

Much like the other plots, you can change the way that the heatmap looks using a config dictionary:

pconfig = {
    'title': None,                 # Plot title
    'xTitle': None,                # X-axis title
    'yTitle': None,                # Y-axis title
    'min': None,                   # Minimum value (default: auto)
    'max': None,                   # Maximum value (default: auto)
    'square': True,                # Force the plot to stay square? (Maintain aspect ratio)
    'colstops': [],                # Scale colour stops. See below.
    'reverseColors': False,        # Reverse the order of the colour axis
    'decimalPlaces': 2,            # Number of decimal places for tooltip
    'legend': True,                # Colour axis key enabled or not
    'borderWidth': 0,              # Border width between cells
    'datalabels': True,            # Show values in each cell. Defaults True when fewer than 20 samples.
    'datalabel_colour': '<auto>',  # Colour of text for values. Defaults to auto contrast.
}

The colour stops are a bit special and can be used to define a custom colour scheme. These should be defined as a list of lists, each containing a number between 0 and 1 and an HTML colour. The default is RdYlBu from ColorBrewer:

pconfig = {
    'colstops': [
        [0, '#313695'],
        [0.1, '#4575b4'],
        [0.2, '#74add1'],
        [0.3, '#abd9e9'],
        [0.4, '#e0f3f8'],
        [0.5, '#ffffbf'],
        [0.6, '#fee090'],
        [0.7, '#fdae61'],
        [0.8, '#f46d43'],
        [0.9, '#d73027'],
        [1, '#a50026'],
    ]
}
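
If you only have a list of colours and want them spaced evenly, the stops can be generated rather than written out. This is a minimal sketch; the helper even_colstops is hypothetical and the three colours are simply taken from the default scale above.

```python
def even_colstops(colours):
    """Spread a list of HTML colours evenly between 0 and 1."""
    if len(colours) < 2:
        raise ValueError('need at least two colours')
    last = len(colours) - 1
    # float() keeps the division correct on Python 2 as well as Python 3
    return [[round(float(i) / last, 3), colour] for i, colour in enumerate(colours)]

pconfig = {
    'colstops': even_colstops(['#313695', '#ffffbf', '#a50026'])
}
print(pconfig['colstops'])
```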

Javascript Functions

The javascript bundled in the default MultiQC template has a number of helper functions to make your life easier.

NB: The MultiQC Python functions make use of these, so it's very unlikely that you'll need to use any of this. But it's here for reference.

Plotting line graphs

plot_xy_line_graph (target, ds)

Plots a line graph with multiple series of (x,y) data pairs. Used by the linegraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'xy_line';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];

Multiple datasets can be added in the ['datasets'] array. The supplied variable ds specifies which is plotted (defaults to 0).

Available config options with default vars:

config = {
    title: undefined,            // Plot title
    xlab: undefined,             // X axis label
    ylab: undefined,             // Y axis label
    xCeiling: undefined,         // Maximum value for automatic axis limit (good for percentages)
    xFloor: undefined,           // Minimum value for automatic axis limit
    xMinRange: undefined,        // Minimum range for axis
    xmax: undefined,             // Max x limit
    xmin: undefined,             // Min x limit
    xDecimals: true,             // Set to false to only show integer labels
    yCeiling: undefined,         // Maximum value for automatic axis limit (good for percentages)
    yFloor: undefined,           // Minimum value for automatic axis limit
    yMinRange: undefined,        // Minimum range for axis
    ymax: undefined,             // Max y limit
    ymin: undefined,             // Min y limit
    yDecimals: true,             // Set to false to only show integer labels
    yPlotBands: undefined,       // Highlighted background bands. See http://api.highcharts.com/highcharts#yAxis.plotBands
    xPlotBands: undefined,       // Highlighted background bands. See http://api.highcharts.com/highcharts#xAxis.plotBands
    tt_label: '{point.x}: {point.y:.2f}', // Use to customise tooltip label, eg. '{point.x} base pairs'
    pointFormat: undefined,      // Replace the default HTML for the entire tooltip label
    click_func: function(){},    // Javascript function to be called when a point is clicked
    cursor: undefined            // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
}

An example of the markup expected, with the function being called:

<div id="my_awesome_line_graph" class="hc-plot"></div>
<script type="text/javascript">
    mqc_plots['#my_awesome_line_graph']['plot_type'] = 'xy_line';
    mqc_plots['#my_awesome_line_graph']['datasets'] = [
        {
            name: 'Sample 1',
            data: [[1, 1.5], [1.5, 3.1], [2, 6.4]]
        },
        {
            name: 'Sample 2',
            data: [[1, 1.7], [1.5, 4.3], [2, 8.4]]
        },
    ];
    mqc_plots['#my_awesome_line_graph']['config'] = {
        "title": "Best Plot Ever",
        "ylab": "Pings",
        "xlab": "Pongs"
    };
    $(function () {
        plot_xy_line_graph('#my_awesome_line_graph');
    });
</script>
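
If you do need to produce this markup yourself from Python, serialising the data with the json module avoids hand-written quoting mistakes. The sketch below is one possible approach and not how MultiQC itself generates the markup. The helper line_graph_markup is hypothetical, and assigning the whole mqc_plots entry in one statement is an assumption made for brevity.

```python
import json

def line_graph_markup(plot_id, datasets, config):
    """Return HTML that registers and draws an xy_line plot (sketch only)."""
    target = '#' + plot_id
    plot = {'plot_type': 'xy_line', 'datasets': datasets, 'config': config}
    return '\n'.join([
        '<div id="{}" class="hc-plot"></div>'.format(plot_id),
        '<script type="text/javascript">',
        '    mqc_plots[{}] = {};'.format(json.dumps(target),
                                         json.dumps(plot, sort_keys=True)),
        '    $(function () {{ plot_xy_line_graph({}); }});'.format(json.dumps(target)),
        '</script>',
    ])

html = line_graph_markup(
    'my_awesome_line_graph',
    [{'name': 'Sample 1', 'data': [[1, 1.5], [1.5, 3.1], [2, 6.4]]}],
    {'title': 'Best Plot Ever', 'ylab': 'Pings', 'xlab': 'Pongs'},
)
print(html)
```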

Plotting bar graphs

plot_stacked_bar_graph (target, ds)

Plots a bar graph with multiple series containing multiple categories. Used by the bargraph.plot() python function.

Data and configuration must be added to the document level mqc_plots variable on page load, using the target as the key. The variables used are as follows:

mqc_plots[target]['plot_type'] = 'bar_graph';
mqc_plots[target]['config'];
mqc_plots[target]['datasets'];
mqc_plots[target]['samples'];

All available config options with default vars:

config = {
    title: undefined,           // Plot title
    xlab: undefined,            // X axis label
    ylab: undefined,            // Y axis label
    ymax: undefined,            // Max y limit
    ymin: undefined,            // Min y limit
    yDecimals: true,            // Set to false to only show integer labels
    ylab_format: undefined,     // Format string for y axis labels. Defaults to {value}
    stacking: 'normal',         // Set to null to have category bars side by side (None in python)
    xtype: 'linear',            // Axis type. 'linear' or 'logarithmic'
    use_legend: true,           // Show / hide the legend
    click_func: undefined,      // Javascript function to be called when a point is clicked
    cursor: undefined,          // CSS mouse cursor type. Defaults to pointer when 'click_func' specified
    tt_percentages: true,       // Show the percentages of each count in the tooltip
    reversedStacks: false,      // Reverse the order of the categories in the stack.
}

An example of the markup expected, with the function being called:

<div id="my_awesome_bar_plot" class="hc-plot"></div>
<script type="text/javascript">
    mqc_plots['#my_awesome_bar_plot']['plot_type'] = 'bar_graph';
    mqc_plots['#my_awesome_bar_plot']['samples'] = ['Sample 1', 'Sample 2']
    mqc_plots['#my_awesome_bar_plot']['datasets'] = [{"data": [4, 7], "name": "Passed Test"}, {"data": [2, 3], "name": "Failed Test"}]
    mqc_plots['#my_awesome_bar_plot']['config'] = {
        "title": "My Awesome Plot",
        "ylab": "# Observations",
        "ymin": 0,
        "stacking": "normal"
    };
    $(function () {
        plot_stacked_bar_graph("#my_awesome_bar_plot");
    });
</script>
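
The 'samples' and 'datasets' variables hold the same information as a per-sample dictionary, but arranged by category. The sketch below shows one way to convert between the two layouts in Python. The helper to_bar_datasets is hypothetical and not part of MultiQC; the numbers are those used in the markup above.

```python
def to_bar_datasets(counts, categories):
    """Convert {sample: {category: count}} into (samples, datasets)."""
    samples = sorted(counts)
    datasets = [
        # One series per category, with one value per sample (0 if missing)
        {'name': cat, 'data': [counts[s].get(cat, 0) for s in samples]}
        for cat in categories
    ]
    return samples, datasets

counts = {
    'Sample 1': {'Passed Test': 4, 'Failed Test': 2},
    'Sample 2': {'Passed Test': 7, 'Failed Test': 3},
}
samples, datasets = to_bar_datasets(counts, ['Passed Test', 'Failed Test'])
print(samples)
print(datasets)
```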

Switching counts and percentages

If you're using the plotting functions above, it's easy to add a button which switches between percentages and counts. Just add the following HTML above your plot:

<div class="btn-group switch_group">
    <button class="btn btn-default btn-sm active" data-action="set_numbers" data-target="#my_plot">Counts</button>
    <button class="btn btn-default btn-sm" data-action="set_percent" data-target="#my_plot">Percentages</button>
</div>

NB: This markup is generated automatically by the Python bargraph.plot() function.

Switching plot datasets

Much like the counts / percentages buttons above, you can add a button which switches the data displayed in a single plot. Make sure that both datasets are stored in named javascript variables, then add the following markup:

<div class="btn-group switch_group">
    <button class="btn btn-default btn-sm active" data-action="set_data" data-ylab="First Data" data-newdata="data_var_1" data-target="#my_plot">Data 1</button>
    <button class="btn btn-default btn-sm" data-action="set_data" data-ylab="Second Data" data-newdata="data_var_2" data-target="#my_plot">Data 2</button>
</div>

Note the CSS class active which specifies which button is 'pressed' on page load. data-ylab and data-xlab can be used to specify the new axes labels. data-newdata should be the name of the javascript object with the new data to be plotted and data-target should be the CSS selector of the plot to change.

Custom event triggers

Some of the events that take place in the general javascript code trigger jQuery events which you can hook into from within your module's code. This allows you to take advantage of events generated by the global theme whilst keeping your code modular.

$(document).on('mqc_highlights', function(e, f_texts, f_cols, regex_mode){
    // This trigger is called when the highlight strings are
    // updated. Three variables are given - an array of search
    // strings (f_texts), an array of colours with corresponding
    // indexes (f_cols) and a boolean var saying whether the
    // search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_renamesamples', function(e, f_texts, t_texts, regex_mode){
    // This trigger is called when samples are renamed
    // Three variables are given - an array of search
    // strings (f_texts), an array of replacements with corresponding
    // indexes (t_texts) and a boolean var saying whether the
    // search should be treated as a string or a regex (regex_mode)
});

$(document).on('mqc_hidesamples', function(e, f_texts, regex_mode){
    // This trigger is called when the Hide Samples filters change.
    // Two variables are given - an array of search strings
    // (f_texts) and a boolean saying whether the search should
    // be treated as a string or a regex (regex_mode)
});

$('#YOUR_PLOT_ID').on('mqc_plotresize', function(){
    // This trigger is called when a plot handle is pulled,
    // resizing the height
});

$('#YOUR_PLOT_ID').on('mqc_original_series_click', function(e, name){
    // A plot able to show original images has had a point clicked.
    // 'name' contains the name of the series that was clicked
});

$('#YOUR_PLOT_ID').on('mqc_original_chg_source', function(e, name){
    // A plot with original images has had a request to change the
    // original image source (eg. pressing Prev / Next)
});

MultiQC Plugins

MultiQC is written around a system designed for extensibility and plugins. These features allow custom code to be written without polluting the central code base.

Please note that we want MultiQC to grow as a community tool! So if you're writing a module or theme that can be used by others, please keep it within the main MultiQC framework and submit a pull request.

Entry Points

The plugin system works using setuptools entry points. In setup.py you will see a section of code that looks like this (truncated):

entry_points = {
    'multiqc.modules.v1': [
        'qualimap = multiqc.modules.qualimap:MultiqcModule',
    ],
    'multiqc.templates.v1': [
        'default = multiqc.templates.default',
    ],
    # 'multiqc.cli_options.v1': [
        # 'my-new-option = myplugin.cli:new_option'
    # ],
    # 'multiqc.hooks.v1': [
        # 'execution_start = myplugin.hooks:execution_start',
        # 'config_loaded = myplugin.hooks:config_loaded',
        # 'before_modules = myplugin.hooks:before_modules',
        # 'after_modules = myplugin.hooks:after_modules',
        # 'execution_finish = myplugin.hooks:execution_finish',
    # ]
},

These sets of entry points can each be extended to add functionality to MultiQC:

  • multiqc.modules.v1
    • Defines the module classes. Used to add new modules.
  • multiqc.templates.v1
    • Defines the templates. Can be used for new templates.
  • multiqc.cli_options.v1
    • Allows plugins to add new custom command line options
  • multiqc.hooks.v1
    • Code hooks for plugins to add new functionality

Any Python program can create entry points with the same names. Once it is installed, MultiQC will find these and run them accordingly. For an example of this in action, see the MultiQC_NGI setup file:

entry_points = {
        'multiqc.templates.v1': [
            'ngi = multiqc_ngi.templates.ngi',
            'genstat = multiqc_ngi.templates.genstat',
        ],
        'multiqc.cli_options.v1': [
            'project = multiqc_ngi.cli:pid_option'
        ],
        'multiqc.hooks.v1': [
            'after_modules = multiqc_ngi.hooks:ngi_metadata',
        ]
    },

Here, two new templates are added, a new command line option and a new code hook.
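
As a further illustration, a plugin that adds one module and one hook might declare its entry points as in the sketch below. Every name starting with my_plugin is a placeholder for your own package and does not refer to real code.

```python
# Hypothetical plugin: all 'my_plugin' names are placeholders
plugin_entry_points = {
    'multiqc.modules.v1': [
        'my_tool = my_plugin.modules.my_tool:MultiqcModule',
    ],
    'multiqc.hooks.v1': [
        'after_modules = my_plugin.hooks:after_modules',
    ],
}

# In the plugin's setup.py this dictionary would be handed to setuptools, eg.
# setup(name='my_plugin', packages=find_packages(), entry_points=plugin_entry_points)
```

Only the entry point group names (the dictionary keys) need to match those listed above; the package layout is up to you.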

Modules

List items added to multiqc.modules.v1 specify new modules. They should be described as follows:

'modname = python_mod.dirname.submodname:classname'

Once this is done, everything else should be the same as described in the writing modules documentation.

Templates

Templates are declared in the same way as modules, except that there is no need to specify a class name at the end. See the writing templates documentation for further instructions.
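
To make the two entry point formats concrete, the sketch below splits a module-style string and a template-style string into their parts. The helper split_entry_point and the my_plugin names are hypothetical and not part of MultiQC or setuptools.

```python
def split_entry_point(spec):
    """Split 'name = package.module:attr' into (name, module, attr)."""
    name, _, target = spec.partition('=')
    module, _, attr = target.strip().partition(':')
    return name.strip(), module, attr

# A module entry point names a class after the colon
module_parts = split_entry_point('my_tool = my_plugin.modules.my_tool:MultiqcModule')

# A template entry point stops at the package, so the last part is empty
template_parts = split_entry_point('my_theme = my_plugin.templates.my_theme')

print(module_parts)
print(template_parts)
```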

Command line options

MultiQC handles command line interaction using the click framework. You can use the multiqc.cli_options.v1 entry point to add new click decorators for command line options. For example, the MultiQC_NGI plugin uses the entry point above with the following code in cli.py:

import click
pid_option = click.option('--project', type=str)

The values given from additional command line arguments are parsed by MultiQC and put into config.kwargs. The above plugin later reads the value given by the user with the --project flag in a hook:

if config.kwargs['project'] is not None:
    pass  # Replace with your own handling of the --project value

See the click documentation or the main MultiQC script for more information and examples of adding command line options.

Hooks

Hooks are a little more complicated - these define points in the core MultiQC code where you can run custom functions. This can be useful as your code is able to access data generated by other parts of the program. For example, you could tie into the after_modules hook to insert data processed by MultiQC modules into a database automatically.

Here, the entry point names are the hook titles, described as commented out lines in the core MultiQC setup.py: execution_start, config_loaded, before_modules, after_modules and execution_finish.

These should point to a function in your code which will be executed when that hook fires. Your custom code can import the core MultiQC modules to access configuration and loggers. For example:

#!/usr/bin/env python
""" MultiQC hook functions - we tie into the MultiQC
core here to add in extra functionality. """

import logging
from multiqc.utils import report, config

log = logging.getLogger('multiqc')

def after_modules():
  """ Plugin code to run when MultiQC modules have completed  """
  num_modules = len(report.modules_output)
  status_string = "MultiQC hook - {} modules reported!".format(num_modules)
  log.critical(status_string)
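
If the idea of a hook is unfamiliar, the general pattern can be sketched in a few lines: functions are registered under a hook name and the core program calls every registered function when it reaches that point. The code below illustrates this pattern only. It is not MultiQC's implementation, and hooks, register_hook and run_hook are hypothetical names.

```python
# Maps a hook name to the list of functions registered for it
hooks = {}

def register_hook(name, func):
    """Remember func so that it runs when the named hook fires."""
    hooks.setdefault(name, []).append(func)

def run_hook(name):
    """Call every function registered for this hook and collect the results."""
    return [func() for func in hooks.get(name, [])]

def after_modules():
    return 'after_modules fired'

register_hook('after_modules', after_modules)

results = run_hook('after_modules')       # runs the one registered function
unused = run_hook('execution_finish')     # nothing registered, so nothing runs
print(results, unused)
```

In MultiQC the registration step is handled by the multiqc.hooks.v1 entry points, so a plugin only needs to supply the function.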

Updating for compatibility

When releasing new versions of MultiQC we aim to maintain compatibility so that your existing modules and plugins will keep working. However, in some cases we have to make changes that require code to be modified. This section summarises the changes by MultiQC release.

v1.0 Updates

MultiQC v1.0 brings a few changes that could break external plugins. These are:

  • Module import refactoring to allow a new testing environment
    • This should allow better, more modular, unit testing. This should equate to more reliable and maintainable code
    • All modules need to change some of their import statements. This includes plugin modules outside of the core MultiQC package.
    • Many thanks to @tbooth at @EdinburghGenomics for his patient work with this.

There are two things that you probably need to change in your plugin modules to make them work with the updated version of MultiQC, both to do with imports. Instead of this style of importing modules:

from multiqc import config, BaseMultiqcModule, plots

You now need this:

from multiqc import config
from multiqc.plots import bargraph   # Load specific plot types here
from multiqc.modules.base_module import BaseMultiqcModule

Modules that directly reference multiqc.BaseMultiqcModule instead need to reference multiqc.modules.base_module.BaseMultiqcModule.

Secondly, modules that use import plots now need to import the specific plots needed. You will also need to update any plotting function calls, removing the plots. prefix.

For example, change this:

import plots
return plots.bargraph.plot(data, keys, pconfig)

to this:

from multiqc.plots import bargraph
return bargraph.plot(data, keys, pconfig)

These changes have been made to simplify the module imports within MultiQC, allowing specific parts of the codebase to be imported into a Python script on their own. This enables small, atomic, clean unit testing.

If you have any questions, please open an issue.
