ruby-biosyntax

:dna: bioSyntax - Syntax highlighting for biological data formats - for Ruby.

Powered by libbiosyntax.

Installation

gem install biosyntax

ANSI coloring

require "biosyntax"

hl = BioSyntax.fastq

File.foreach("reads.fastq", chomp: false) do |line|
  print hl.colorize(line)
end

colorize returns a string with ANSI SGR escape sequences using the built-in libbiosyntax colors.
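
For orientation, an SGR sequence has the form ESC [ <params> m, and a colored region is text placed between a color-setting sequence and a reset. The sketch below does not use the gem. The wrap_sgr and strip_sgr helpers and the color code 32 are illustrative assumptions, not the actual libbiosyntax palette. It shows roughly what such output looks like and how one might strip it when writing to a file instead of a terminal.

```ruby
# Hypothetical helpers for illustration; not part of the biosyntax API.
SGR_PATTERN = /\e\[[0-9;]*m/

# Wrap text in an SGR color sequence followed by a reset (code 0).
def wrap_sgr(text, code)
  "\e[#{code}m#{text}\e[0m"
end

# Remove every SGR sequence, leaving only the plain text.
def strip_sgr(text)
  text.gsub(SGR_PATTERN, "")
end

colored = wrap_sgr("ACGT", 32) # 32 = green foreground, chosen arbitrarily
plain = strip_sgr(colored)
```

A script could apply something like strip_sgr, or skip colorize entirely, when $stdout.tty? is false.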

A highlighter is stateful: how it colors a line can depend on the lines it has already seen. Use one highlighter per input stream, especially for FASTQ and WIG. Call reset before reusing the same object for another stream.

hl = BioSyntax.fastq
# process one file...
hl.reset
# process another file...
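
To see why state matters, consider FASTQ: each record is four lines (header, sequence, + separator, qualities), and a quality line may itself begin with @, so a line cannot be classified without knowing its position in the record. The toy classifier below is not the gem's implementation. ToyFastqState and its labels are hypothetical, and it assumes unwrapped four-line records.

```ruby
# Toy model of a stateful line classifier (hypothetical; not BioSyntax code).
# Assumes each FASTQ record is exactly four lines.
class ToyFastqState
  LABELS = %i[header sequence separator quality].freeze

  def initialize
    reset
  end

  # Forget the position within the current record.
  def reset
    @index = 0
  end

  # Label a line by its position in the record, then advance.
  def classify(_line)
    label = LABELS[@index]
    @index = (@index + 1) % LABELS.size
    label
  end
end

state = ToyFastqState.new
record = ["@read1\n", "ACGT\n", "+\n", "@III\n"]
labels = record.map { |line| state.classify(line) }
# The last line starts with "@" but is labeled by position, not by prefix.

# A stream that ends mid-record leaves stale state behind...
state.classify("@read2\n")
# ...so reset before feeding lines from a different stream.
state.reset
first_label = state.classify("@other\n")
```

Without the reset call, the first line of the second stream would be treated as the second line of an unfinished record.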

Highlight spans

require "biosyntax"

hl = BioSyntax.vcf
line = "chr1\t42\trs1\tA\tT\t99\tPASS\tDP=10;AF=0.5\n"

spans = hl.highlight(line)

spans.each do |span|
  puts [span.start, span.end, span.kind.name, span.scope].join("\t")
end

A span uses byte offsets into the input line:

span.start   # byte offset at the start of the highlighted range
span.end     # byte offset just after the highlighted range
span.length  # byte length
span.kind    # BioSyntax::Kind
span.scope   # e.g. "biosyntax.chrom"
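
Because the offsets count bytes rather than characters, String#byteslice is the safe way to recover the highlighted text. Ordinary indexing with line[start...end] counts characters and drifts on multibyte input. In this sketch, ToySpan is a hypothetical stand-in with the same start, end, and length readers so the example runs without the gem.

```ruby
# Stand-in for a span (hypothetical); real spans come from hl.highlight(line).
ToySpan = Struct.new(:start, :end) do
  def length
    self.end - start
  end
end

# "é" occupies two bytes in UTF-8, so byte and character offsets diverge.
line = "#é\tchr1\n"
span = ToySpan.new(4, 8) # byte range covering "chr1"

by_bytes = line.byteslice(span.start, span.length) # slices by byte offset
by_chars = line[span.start...span.end]             # slices by character index
```

Here by_bytes holds the intended field, while by_chars is shifted by one character because of the two-byte "é" earlier in the line.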

Formats and metadata

Create highlighters with BioSyntax.<format> or BioSyntax[format]. Factory method names replace hyphens in a format name with underscores, so fasta-nt becomes BioSyntax.fasta_nt.

BioSyntax.vcf
BioSyntax.fastq
BioSyntax.fasta_nt
BioSyntax[:"fasta-nt"]
BioSyntax["bam"]       # canonical format is :sam

Useful metadata:

BioSyntax::FORMAT_NAMES # array of canonical format names
BioSyntax::FORMATS      # { name => BioSyntax::Format }
BioSyntax::KIND_NAMES   # array of kind names
BioSyntax::KINDS        # { name => BioSyntax::Kind }
BioSyntax::SCOPES       # { scope => [BioSyntax::Kind, ...] }

BioSyntax::Format::VCF
BioSyntax::Kind::CHROM

BioSyntax.format_supported?(:vcf)  # true
BioSyntax.format_name(:bam)        # :sam
BioSyntax.guess_format("a.vcf.gz") # :vcf

The metadata is generated from libbiosyntax at load time. The Ruby side does not maintain a separate hand-written table of formats or kinds.

Examples

This gem does not install a CLI. See examples/ for small scripts:

ruby examples/bcat.rb sample.vcf
ruby examples/bcat.rb -l fastq reads.fastq
ruby examples/bcat.rb -l
ruby examples/inspect_spans.rb sample.vcf

bcat.rb guesses the format from the file name when possible. Use -l / --language to pass a format explicitly. Passing -l without an argument prints the supported format names.
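
A flag with an optional argument is one way to get both behaviors from a single option. The sketch below uses Ruby's standard OptionParser and is an assumption about how such a script could be structured, not a copy of bcat.rb. The parse_cli helper and the option keys are hypothetical.

```ruby
require "optparse"

# Hypothetical CLI parsing sketch; not taken from examples/bcat.rb.
# Returns [options, remaining_file_arguments].
def parse_cli(argv)
  options = { language: nil, list: false }

  parser = OptionParser.new do |opts|
    # "[FORMAT]" in brackets marks the argument as optional.
    opts.on("-l", "--language [FORMAT]", "Format name; list formats if omitted") do |fmt|
      if fmt.nil?
        options[:list] = true     # bare -l: caller should print format names
      else
        options[:language] = fmt  # -l fastq: use this format explicitly
      end
    end
  end

  files = parser.parse(argv) # non-destructive; returns the leftover arguments
  [options, files]
end

options, files = parse_cli(["-l", "fastq", "reads.fastq"])
```

When language is nil and list is false, a script would fall back to guessing, for example with BioSyntax.guess_format on the file name.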

Development tasks

bundle exec rake -T
bundle exec rake test
bundle exec rake build
bundle exec yard doc

The native extension is built with rake-compiler. Temporary build products are written under tmp/, and the compiled extension is copied to lib/biosyntax/.

Updating vendored libbiosyntax

This gem vendors the C source of libbiosyntax and builds it into the Ruby extension. It does not require a system libbiosyntax shared library. The vendored C source lives under:

ext/biosyntax/biosyntax.c
ext/biosyntax/biosyntax.h

When libbiosyntax is updated, refresh the vendored files and run the test suite:

bundle exec rake update:libbiosyntax
bundle exec rake

License

biosyntax vendors libbiosyntax, which is licensed under the GNU General Public License version 3 only. This gem is therefore distributed under GPL-3.0-only. See LICENSE.md.

This project is inspired by the original bioSyntax project: https://github.com/bioSyntax/bioSyntax