Kaiju: Fast and sensitive taxonomic classification for metagenomics

About Kaiju

Kaiju is a program for fast and sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments.

Each sequencing read is assigned to a taxon in the NCBI taxonomy by comparing it to a reference protein database containing microbial and viral protein sequences. Because classification is done at the protein level, Kaiju achieves higher sensitivity than methods based on nucleotide comparison.

Several reference protein databases can be used, such as complete genomes from NCBI RefSeq or the microbial subset of the NCBI BLAST non-redundant protein database nr, optionally also including fungi and microbial eukaryotes.

Reads are translated into amino acid sequences, which are then searched against the database. The search uses a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform to find maximum exact matches (MEMs), optionally allowing mismatches.
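The search described above can be illustrated with a minimal Python sketch. It is a toy, not Kaiju's implementation: it assumes six-frame translation with the standard genetic code, builds a naive in-memory FM-index, and uses plain (unmodified) backward search without mismatches. The function names, the toy database string, and the example read are all hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_fragments(read, min_len=3):
    """Translate a DNA read in six frames and split each frame at stop codons."""
    fragments = []
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons with ambiguous bases become "X" (an assumption of this sketch).
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            fragments += [f for f in protein.split("*") if len(f) >= min_len]
    return fragments


def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and occurrence counts."""
    text += "$"  # terminator; sorts before all amino acid letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    C, total = {}, 0  # C[c]: number of characters in text smaller than c
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    occ = {c: [0] for c in alphabet}  # occ[c][i]: count of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, C, occ


def longest_exact_match(fragment, index):
    """Return (length, start in fragment, database positions) of the longest exact match."""
    sa, C, occ = index
    best = (0, 0, [])
    for end in range(len(fragment), 0, -1):
        lo, hi, start = 0, len(sa), end
        # Backward search: extend the match to the left one character at a time.
        while start > 0 and fragment[start - 1] in C:
            ch = fragment[start - 1]
            new_lo, new_hi = C[ch] + occ[ch][lo], C[ch] + occ[ch][hi]
            if new_lo >= new_hi:
                break
            lo, hi, start = new_lo, new_hi, start - 1
        if end - start > best[0]:
            best = (end - start, start, sorted(sa[i] for i in range(lo, hi)))
    return best


index = build_fm_index("MKVLAAGMKV")  # hypothetical toy protein database
read = "ATGAAAGTTCTGGCTGCCGGT"        # hypothetical read
for fragment in six_frame_fragments(read):
    print(fragment, longest_exact_match(fragment, index))
```

A real index would not store the full suffix array or per-position occurrence tables; the sketch keeps them only to make the backward search step easy to follow.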

Kaiju can also be used for querying any custom protein database without taxonomic classification, using either protein or nucleotide queries.

Kaiju is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).

Pre-built Kaiju indexes for various reference databases can also be downloaded.

Download

The latest version of Kaiju's source code can be downloaded from GitHub either as a compressed archive or by cloning the repository with git:

git clone https://github.com/bioinformatics-centre/kaiju.git

Please refer to the README file for installation and usage instructions.

The source code is available under the GNU General Public License version 3.

News and Release notes

2024-07-31

The web server at kaiju.binf.ku.dk was shut down after more than 8 years of operation.

Version 1.10.1

2024-03-03

fix download of refseq_nr files in kaiju-makedb

Version 1.10.0

2023-11-25

update proGenomes to v3
update RVDB-prot to v26.0
add refseq_nr and refseq_ref databases
remove Mar databases
statically linked Linux binaries are available on GitHub

2023-07-01

Pre-built indexes are now hosted on AWS S3 through the Open Data Sponsorship Program

2023-06-01

Kaiju indexes for 2023 are available

Version 1.9.2

2022-11-19

small bug fixes for the command line option -a
statically linked Linux binaries are available on GitHub

Version 1.9.0

2022-05-12

set default E-value to 0.01 in Greedy mode
fix MAR databases downloads
fix RefSeq plasmid downloads
update RVDB-prot to v23.0
statically linked Linux binaries are available on GitHub

Version 1.8.2

2021-10-26

make downloading RefSeq genomes work again in kaiju-makedb
statically linked Linux binaries are available on GitHub

Version 1.8.1

2021-10-13

update RVDB-prot to v22.0
use curl for downloading virus/plasmid sequences from RefSeq FTP server
statically linked Linux binaries are available on GitHub

2021-08-05

new web server with updated databases and option for setting E-value

Version 1.8

2021-08-05

add kaiju-multi
add option -l to kaiju2krona
update RVDB-prot to v21.0
better handling of downloading virus and plasmid sequences from RefSeq
statically linked Linux binaries are available on GitHub

Version 1.7.4

2020-11-04

update RVDB-prot to v20.0
fix bug in RefSeq download
statically linked Linux binaries are available on GitHub

Version 1.7.3

2020-01-16

update RVDB-prot to v17.0
add list with excluded accession numbers for NR database, based on Breitwieser et al., 2019
add option -s to kaiju-mergeOutputs
statically linked Linux binaries are available on GitHub

Version 1.7.2

2019-07-12

fix download of virus genomes for source databases viruses, refseq, progenomes
add source database fungi, which contains all fungi assemblies from NCBI RefSeq
statically linked Linux binaries are available on GitHub

Version 1.7.1

2019-06-27

update download of virus genomes for source databases viruses, refseq, progenomes
statically linked Linux binaries are available on GitHub

Version 1.7.0

2019-04-28

replace makeDB.sh with kaiju-makedb
replace kaijuReport with kaiju2table
rename addTaxonNames to kaiju-addTaxonNames
rename mergeOutputs to kaiju-mergeOutputs
add RVDB-prot as reference database
statically linked Linux binaries are available on GitHub

Version 1.6.3

2018-10-01

new options for makeDB.sh to only download plasmids (-l) or viruses (-v)
extend list of downloaded files for viruses and plasmids in makeDB.sh
updates to marDB
statically linked Linux binaries are available on GitHub

Version 1.6.2

2018-02-24

fixed crash of makeDB.sh on MacOS
statically linked Linux binaries are available on GitHub

Version 1.6.1

2018-02-16

fixed bug in MacOS compilation
statically linked Linux binaries are available on GitHub

Version 1.6.0

2018-01-09

changed default search parameters to Greedy mode with 3 allowed mismatches and enabled SEG filter (can be disabled with the new option -X)
E-value is calculated in Greedy mode and can be used as a threshold for classification using option -E
Kaiju can now also open gzip-compressed FASTQ/A input files
the MarDB database can be selected in makeDB.sh using option -m
statically linked Linux binaries are available on GitHub

Version 1.5.0

2017-02-20

add accession numbers to database identifiers
Kaiju's long output format via option -v will now print the accession numbers of the matched database sequences in column 6
Important: Kaiju 1.5.0 does not work with index files (.fmi) from previous versions and previous versions will not work with index files made with Kaiju 1.5.0
statically linked Linux binaries are available on GitHub
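As an illustration of the long output format mentioned above, the following Python sketch reads tab-separated output and collects the accession numbers from column 6. The layout of the other columns (classification status, read name, taxon ID in columns 1 to 3) and the comma-separated list format are assumptions of this sketch, and the sample lines and accession strings are invented; please check the README for the authoritative format.

```python
import io


def parse_kaiju_verbose(handle):
    """Yield one record per output line; accessions come from column 6 if present."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        accessions = []
        if len(fields) >= 6:
            # Assumed: comma-separated list, possibly with a trailing comma.
            accessions = [a for a in fields[5].split(",") if a]
        yield {
            "classified": fields[0] == "C",  # assumed: "C" or "U" in column 1
            "read": fields[1],
            "taxon_id": int(fields[2]),
            "accessions": accessions,
        }


# Hypothetical sample output; the accession strings are placeholders.
sample = (
    "C\tread1\t562\t85\t562,\tACC_0001,ACC_0002,\tMKVLAAG,\n"
    "U\tread2\t0\n"
)
for record in parse_kaiju_verbose(io.StringIO(sample)):
    print(record["read"], record["classified"], record["accessions"])
```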

Version 1.4.5

2017-01-27

various small bug fixes and improvements
add option -c lowest to mergeOutputs
add option -p to kaijuReport for printing the full taxon path in the report
add option -l to kaijuReport for selecting specific ranks for taxon path
statically linked Linux binaries are available on GitHub

Version 1.4.4

2016-10-31

add option -p to makeDB.sh for using the representative set from proGenomes as a reference database
add option -r to makeDB.sh for using RefSeq complete genomes as a reference database
remove default option in makeDB.sh. Now, one of the options -r, -p, -n, or -e has to be used.
add option for static linking
statically linked Linux binaries are available on GitHub

Version 1.4.3

2016-10-19

adjust makeDB.sh to the new folder structure of RefSeq genomes on the NCBI FTP server
increase precision of percentage numbers in kaijuReport

Version 1.4.2

2016-09-03

change makeDB.sh and convertNR to accommodate the removal of GI numbers from the NCBI BLAST nr database

Version 1.4.1

2016-07-06

fix bug in calculation of percentage of unassigned reads in kaijuReport
fix parsing of names of taxon ranks from nodes.dmp in kaijuReport and addTaxonNames

Version 1.4

2016-05-17

add BLAST's SEG low complexity filter via option -x
add option -p to kaiju for protein sequence input, which disables the translation from nucleotide to amino acids
add option -e to makeDB.sh for including proteins from fungi and microbial eukaryotes when using the nr database
add program addTaxonNames for extending the output file by taxon names or taxon paths

Version 1.3.1

2016-04-19

fix read name detection for paired-end (PE) reads using the newer Illumina naming standard

Version 1.3

2016-04-13

add makeDB.sh for downloading genomes and building reference database and Kaiju index
add convertNR for using the BLAST nr database
fix overflow bug for large DBs
improved file type detection
update README and minor cosmetic changes
Paper published

2016-02-23

web server launched

Version 1.2

2016-01-11

update for Greedy mode, which is now approximately twice as fast without any change in accuracy.
add kaijup, for searching protein queries against a protein database without taxonomy classification, so that the database names of best matches are printed.

Version 1.1

2015-12-18

new implementation of the FM-index, which speeds up queries against the BWT.
the distance between suffix array checkpoints can be set using the -e option for mkbwt, which allows trading off database size vs speed.
MEM mode now only prints the matching sequence(s) in verbose mode instead of the full query sequence.
add kaijux, for searching translated reads against a protein database without taxonomy classification, so that the database names of best matches are printed.
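The size versus speed trade-off of suffix array checkpoints noted above can be sketched generically in Python. This is a textbook sampled suffix array, not Kaiju's code: only every `step`-th text position is stored, and other positions are recovered by walking the LF-mapping until a stored checkpoint is reached. The names `build_sampled_index`, `locate`, and `step` are hypothetical, and how Kaiju's -e value maps to the checkpoint distance is not modelled here.

```python
def build_sampled_index(text, step):
    """Build a toy BWT index that keeps suffix array entries only every `step` positions."""
    text += "$"  # terminator; sorts before all amino acid letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    C, total = {}, 0  # C[c]: number of characters in text smaller than c
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    seen, rank = {}, []  # rank[i]: occurrences of bwt[i] in bwt[:i]
    for ch in bwt:
        rank.append(seen.get(ch, 0))
        seen[ch] = rank[-1] + 1
    # Checkpoints: keep only rows whose text position is a multiple of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    # full_sa is kept only so the sketch can verify itself; a real index drops it.
    return {"bwt": bwt, "C": C, "rank": rank, "samples": samples, "full_sa": sa}


def locate(row, index):
    """Recover the text position of a BWT row; also return the number of LF steps used."""
    steps = 0
    while row not in index["samples"]:
        ch = index["bwt"][row]
        row = index["C"][ch] + index["rank"][row]  # LF-mapping: one position to the left
        steps += 1
    return index["samples"][row] + steps, steps


for step in (1, 2, 4):
    index = build_sampled_index("MKVLAAGMKV", step)  # hypothetical toy database
    worst = max(locate(row, index)[1] for row in range(len(index["bwt"])))
    print(step, len(index["samples"]), worst)
```

A larger distance stores fewer checkpoints, so the index is smaller, but each lookup may need more LF steps.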

Version 1.0.1

2015-11-24

add kaijuReport
bugfix in kaiju for FASTA input files, which caused the last input line to be read twice

Version 1.0

2015-11-16

initial release
preprint paper on bioRxiv published

Previous releases can be downloaded here.

Contact

For questions, bug reports or more information about Kaiju, please contact Peter Menzel. Bug reports can also be filed in GitHub's issue tracker.

Citation

Menzel P., Ng K.L., Krogh A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257

Read the "Behind the Paper" post on the Nature Microbiology Community.

The program is being developed by Peter Menzel and Anders Krogh at the Bioinformatics Centre, a part of the Section for Computational and RNA Biology at the University of Copenhagen.