About Kaiju
Kaiju is a program for fast and sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments.
Each sequencing read is assigned to a taxon in the NCBI taxonomy by comparing it to a reference protein database containing microbial and viral protein sequences. By classifying at the protein level, Kaiju achieves higher sensitivity than methods based on nucleotide comparison.
Several reference protein databases can be used, such as complete genomes from NCBI RefSeq or the microbial subset of the NCBI BLAST non-redundant protein database nr, optionally also including fungi and microbial eukaryotes.
Reads are translated into amino acid sequences, which are then searched against the database by finding maximum exact matches (MEMs), optionally allowing mismatches. The search uses a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform.
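The translate-then-search idea described above can be illustrated with a minimal Python sketch. It is not Kaiju's implementation: it builds a naive FM-index from a sorted suffix array, uses the textbook backward search rather than Kaiju's modified one, reports only the longest exact match that ends at the end of a fragment instead of all MEMs, and does not handle mismatches. Translating in all six reading frames with the standard genetic code is an assumption of the sketch, and the function names, the toy database string, and the example read are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translate(read):
    """Translate a DNA read in all six reading frames; stop codons become '*'."""
    frames = []
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases are translated to 'X'.
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def build_fm_index(text):
    """Naive FM-index: BWT from a sorted suffix array, plus C and Occ tables."""
    text += "$"  # sentinel, assumed absent from the database and smaller than any letter
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text that sort before ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return c_table, occ, len(bwt)


def longest_suffix_match(fragment, index):
    """Backward search: extend a match leftwards from the end of the fragment.

    Returns (length of the longest matching suffix, number of database hits).
    """
    c_table, occ, n = index
    lo, hi, length = 0, n, 0
    for ch in reversed(fragment):
        if ch not in c_table:
            break
        new_lo = c_table[ch] + occ[ch][lo]
        new_hi = c_table[ch] + occ[ch][hi]
        if new_lo >= new_hi:  # interval is empty: the match cannot be extended
            break
        lo, hi, length = new_lo, new_hi, length + 1
    return length, (hi - lo if length else 0)


if __name__ == "__main__":
    index = build_fm_index("MKVLAAGIVMKVL")  # toy "protein database"
    for frame in six_frame_translate("ATGAAAGTGCTG"):
        for fragment in filter(None, frame.split("*")):
            print(fragment, longest_suffix_match(fragment, index))
```

A production index would store sampled checkpoints instead of full occurrence tables and a compressed suffix array; the full tables here trade memory for readability.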
Kaiju can also be used for querying any custom protein database without taxonomic classification, using either protein or nucleotide queries.
Kaiju is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).
Pre-built kaiju indexes for various reference databases can also be downloaded.
Download
The latest version of Kaiju's source code can be downloaded from GitHub, either as a compressed archive or by cloning the repository via git:
git clone https://github.com/bioinformatics-centre/kaiju.git
Please refer to the README file for installation and usage instructions.
The source code is available under the GNU General Public License 3.
News and Release notes
2024-07-31
- The web server at kaiju.binf.ku.dk was shut down after more than 8 years of operation.
Version 1.10.1
2024-03-03
- fix download of refseq_nr files in kaiju-makedb
Version 1.10.0
2023-11-25
- update proGenomes to v3
- update RVDB-prot to v26.0
- add refseq_nr and refseq_ref databases
- remove Mar databases
- statically linked Linux binaries are available on GitHub
2023-07-01
- Pre-built indexes are now hosted on AWS S3 through the Open Data Sponsorship Program
2023-06-01
Version 1.9.2
2022-11-19
- small bugfixes regarding using command line option -a
- statically linked Linux binaries are available on GitHub
Version 1.9.0
2022-05-12
- set default E-value to 0.01 in Greedy mode
- fix MAR databases downloads
- fix RefSeq plasmid downloads
- update RVDB-prot to v23.0
- statically linked Linux binaries are available on GitHub
Version 1.8.2
2021-10-26
- make downloading RefSeq genomes work again in kaiju-makedb
- statically linked Linux binaries are available on GitHub
Version 1.8.1
2021-10-13
- update RVDB-prot to v22.0
- use curl for downloading virus/plasmid sequences from RefSeq FTP server
- statically linked Linux binaries are available on GitHub
2021-08-05
- new web server with updated databases and option for setting E-value
Version 1.8
2021-08-05
- add kaiju-multi
- add option -l to kaiju2krona
- update RVDB-prot to v21.0
- better handling of downloading virus and plasmid sequences from RefSeq
- statically linked Linux binaries are available on GitHub
Version 1.7.4
2020-11-04
- update RVDB-prot to v20.0
- fix bug in RefSeq download
- statically linked Linux binaries are available on GitHub
Version 1.7.3
2020-01-16
- update RVDB-prot to v17.0
- add list with excluded accession numbers for NR database, based on Breitwieser et al., 2019
- add option -s to kaiju-mergeOutputs
- statically linked Linux binaries are available on GitHub
Version 1.7.2
2019-07-12
- fix download of virus genomes for source databases viruses, refseq, progenomes
- add source database fungi, which contains all fungi assemblies from NCBI RefSeq
- statically linked Linux binaries are available on GitHub
Version 1.7.1
2019-06-27
- update download of virus genomes for source databases viruses, refseq, progenomes
- statically linked Linux binaries are available on GitHub
Version 1.7.0
2019-04-28
- replace makeDB.sh with kaiju-makedb
- replace kaijuReport with kaiju2table
- rename addTaxonNames to kaiju-addTaxonNames
- rename mergeOutputs to kaiju-mergeOutputs
- add RVDB-prot as reference database
- statically linked Linux binaries are available on GitHub
Version 1.6.3
2018-10-01
- new options for makeDB.sh to only download plasmids (-l) or viruses (-v)
- extend list of downloaded files for viruses and plasmids in makeDB.sh
- updates to marDB
- statically linked Linux binaries are available on GitHub
Version 1.6.2
2018-02-24
- fixed crash of makeDB.sh on MacOS
- statically linked Linux binaries are available on GitHub
Version 1.6.1
2018-02-16
- fixed bug in MacOS compilation
- statically linked Linux binaries are available on GitHub
Version 1.6.0
2018-01-09
- changed default search parameters to Greedy mode with 3 allowed mismatches and enabled SEG filter (can be disabled with the new option -X)
- E-value is calculated in Greedy mode and can be used as a threshold for classification using option -E
- Kaiju can now also open gzip-compressed FASTQ/A input files
- the MarDB database can be selected in makeDB.sh using option -m
- statically linked Linux binaries are available on GitHub
Version 1.5.0
2017-02-20
- add accession numbers to database identifiers
- Kaiju's long output format via option -v will now print the accession numbers of the matched database sequences in column 6
- Important: Kaiju 1.5.0 does not work with index files (.fmi) from previous versions and previous versions will not work with index files made with Kaiju 1.5.0
- statically linked Linux binaries are available on GitHub
Version 1.4.5
2017-01-27
- various small bug fixes and improvements
- add option -c lowest to mergeOutputs
- add option -p to kaijuReport for printing the full taxon path in the report
- add option -l to kaijuReport for selecting specific ranks for taxon path
- statically linked Linux binaries are available on GitHub
Version 1.4.4
2016-10-31
- add option -p to makeDB.sh for using the representative set from proGenomes as a reference database
- add option -r to makeDB.sh for using RefSeq complete genomes as a reference database
- remove default option in makeDB.sh. Now, one of the options -r, -p, -n, or -e has to be used.
- add option for static linking
- statically linked Linux binaries are available on GitHub
Version 1.4.3
2016-10-19
- adjust makeDB.sh to the new folder structure of RefSeq genomes on the NCBI FTP server
- increase precision of percentage numbers in kaijuReport
Version 1.4.2
2016-09-03
- change makeDB.sh and convertNR to accommodate the removal of GI numbers from the NCBI BLAST nr database
Version 1.4.1
2016-07-06
- fix bug in calculation of percentage of unassigned reads in kaijuReport
- fix parsing of names of taxon ranks from nodes.dmp in kaijuReport and addTaxonNames
Version 1.4
2016-05-17
- add BLAST's SEG low complexity filter via option -x
- add option -p to kaiju for protein sequence input, which disables the translation from nucleotide to amino acids
- add option -e to makeDB.sh for including proteins from fungi and microbial eukaryotes when using the nr database
- add program addTaxonNames for extending the output file by taxon names or taxon paths
Version 1.3.1
2016-04-19
- fix read name detection on PE reads for newer Illumina name standard
Version 1.3
2016-04-13
- add makeDB.sh for downloading genomes and building reference database and Kaiju index
- add convertNR for using the BLAST nr database
- fix overflow bug for large DBs
- improved file type detection
- update README and minor cosmetic changes
- Paper published
2016-02-23
- web server launched
Version 1.2
2016-01-11
- update for Greedy mode, which is now approximately twice as fast without change of accuracy.
- add kaijup, for searching protein queries against a protein database without taxonomy classification, so that the database names of best matches are printed.
Version 1.1
2015-12-18
- new implementation of the FM-index, which improves the speed of queries against the BWT.
- the distance between suffix array checkpoints can be set using the -e option for mkbwt, which allows trading off database size vs speed.
- MEM mode now only prints the matching sequence(s) in verbose mode instead of the full query sequence.
- add kaijux, for searching translated reads against a protein database without taxonomy classification, so that the database names of best matches are printed.
Version 1.0.1
2015-11-24
- add kaijuReport
- bugfix in kaiju for FASTA input files, which caused the last input line to be read twice
Version 1.0
2015-11-16
- initial release
- preprint paper on bioRxiv published
Previous releases can be downloaded here.
Contact
For questions, bug reports or more information about Kaiju, please contact Peter Menzel. Bug reports can also be filed in GitHub's issue tracker.
Citation
Read the "Behind the Paper" post on the Nature Microbiology Community.
The program is being developed by Peter Menzel and Anders Krogh at the Bioinformatics Centre, a part of the Section for Computational and RNA Biology at the University of Copenhagen.