htsinfer package

HTSinfer project root

Submodules

htsinfer.cli module

Command-line interface client.

htsinfer.cli.main() None

Entry point for CLI executable.

htsinfer.cli.parse_args() argparse.Namespace

Parse CLI arguments.

Returns

Parsed CLI arguments.

htsinfer.cli.setup_logging(verbosity: str = 'INFO') None

Configure logging.

Parameters

verbosity – Level of logging verbosity.
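A minimal sketch of what such a configuration step might look like, assuming the verbosity string names one of the levels listed under htsinfer.models.LogLevels. The helper name `configure_logging` and the log format are illustrative assumptions, not HTSinfer's actual implementation.

```python
import logging

# Level names as listed under htsinfer.models.LogLevels
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_logging(verbosity: str = "INFO") -> int:
    """Configure the root logger and return the numeric level applied."""
    try:
        level = LEVELS[verbosity.upper()]
    except KeyError as exc:
        raise ValueError(f"Unknown log level: {verbosity}") from exc
    logging.basicConfig(
        level=level,
        format="[%(asctime)s] %(levelname)s: %(message)s",  # assumed format
        force=True,  # replace handlers from any earlier configuration
    )
    return level
```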

htsinfer.exceptions module

Custom exceptions.

exception htsinfer.exceptions.FileProblem

Bases: Exception

Exception raised when file could not be opened or parsed.

exception htsinfer.exceptions.InconsistentFastqIdentifiers

Bases: Exception

Exception raised when inconsistent FASTQ sequence identifiers were encountered.

exception htsinfer.exceptions.KallistoProblem

Bases: Exception

Exception raised when running kallisto index and quant commands.

exception htsinfer.exceptions.MetadataWarning

Bases: Exception

Exception raised when metadata could not be determined.

exception htsinfer.exceptions.StarProblem

Bases: Exception

Exception raised when running STAR index and quant commands.

exception htsinfer.exceptions.UnknownFastqIdentifier

Bases: Exception

Exception raised when a FASTQ sequence identifier of unknown format was encountered.

exception htsinfer.exceptions.WorkEnvProblem

Bases: Exception

Exception raised when the work environment could not be set up or cleaned.

htsinfer.get_library_source module

Infer library source from sample data.

class htsinfer.get_library_source.GetLibSource(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], transcripts_file: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/results_htsinfer'), tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), min_match_pct: float = 2, min_freq_ratio: float = 2)

Bases: object

Determine the source of a single- or paired-end FASTQ sequencing library.

Parameters
  • paths – Tuple of one or two paths for single-end or paired-end library files.

  • transcripts_file – File path to an uncompressed transcripts file in FASTA format. Expected to contain |-separated sequence identifier lines that contain an organism short name and a taxon identifier in the fourth and fifth columns, respectively. Example sequence identifier: rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029

  • out_dir – Path to directory where output is written to.

  • tmp_dir – Path to directory where temporary output is written to.

  • min_match_pct – Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.

  • min_freq_ratio – Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.

paths

Tuple of one or two paths for single-end or paired-end library files.

transcripts_file

File path to an uncompressed transcripts file in FASTA format. Expected to contain |-separated sequence identifier lines that contain an organism short name and a taxon identifier in the fourth and fifth columns, respectively. Example sequence identifier: rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029

out_dir

Path to directory where output is written to.

tmp_dir

Path to directory where temporary output is written to.

min_match_pct

Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.

min_freq_ratio

Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.
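The interplay of min_match_pct and min_freq_ratio can be sketched as follows. This is an illustration under the assumption that both thresholds must be met by the most frequent source. The function name `pick_source` and the input dictionary are hypothetical and not part of HTSinfer.

```python
from typing import Dict, Optional


def pick_source(
    match_pct: Dict[str, float],
    min_match_pct: float = 2,
    min_freq_ratio: float = 2,
) -> Optional[str]:
    """Return the dominant source, or None if the thresholds are not met."""
    # Rank candidate sources by their percentage of consistent reads
    ranked = sorted(match_pct.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    top_name, top_pct = ranked[0]
    # The best candidate must reach the minimum percentage
    if top_pct < min_match_pct:
        return None
    # The best candidate must clearly outrank the runner-up
    if len(ranked) > 1:
        second_pct = ranked[1][1]
        if second_pct > 0 and top_pct / second_pct < min_freq_ratio:
            return None
    return top_name
```

With the defaults, a library where 80% of reads match one source and 10% another yields the first source, while a 45% versus 40% split yields no call.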

create_kallisto_index() pathlib.Path

Build Kallisto index from FASTA file of target sequences.

Returns

Path to Kallisto index.

Raises

KallistoProblem – Kallisto index could not be created.

evaluate() htsinfer.models.ResultsSource

Infer read source.

Returns

Source results object.

get_source(fastq: pathlib.Path, index: pathlib.Path) htsinfer.models.Source

Determine source of a single sequencing library file.

Parameters
  • fastq – Path to FASTQ file.

  • index – Path to Kallisto index.

Returns

Source of library file.

static get_source_expression(kallisto_dir: pathlib.Path) pandas.core.frame.DataFrame

Return percentages of total expression per read source.

Parameters

kallisto_dir – Directory containing Kallisto quantification results.

Returns

Data frame with columns source_ids (a tuple of source short name and taxon identifier, e.g., ("hsapiens", 9606)) and tpm, signifying the percentages of total expression per read source. The data frame is sorted by total expression in descending order.

Raises

FileProblem – Kallisto quantification results could not be processed.
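The aggregation described above can be sketched without pandas. The example below swaps the data frame for plain tuples and assumes that each row carries a |-separated target identifier (short name and taxon identifier in the fourth and fifth columns) together with a TPM value. The function name `source_expression` and the normalization to percentages are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SourceId = Tuple[str, int]


def source_expression(
    rows: Iterable[Tuple[str, float]],
) -> List[Tuple[SourceId, float]]:
    """Sum TPM per (short name, taxon ID) and express it as a percentage."""
    totals: Dict[SourceId, float] = defaultdict(float)
    for target_id, tpm in rows:
        fields = target_id.split("|")
        # Fourth and fifth columns hold the short name and taxon identifier
        totals[(fields[3], int(fields[4]))] += tpm
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    percentages = [
        (source, 100 * tpm / grand_total) for source, tpm in totals.items()
    ]
    # Most highly expressed source first
    return sorted(percentages, key=lambda item: item[1], reverse=True)
```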

run_kallisto_quantification(fastq: pathlib.Path, index: pathlib.Path) pathlib.Path

Run Kallisto quantification on individual sequencing library file.

Parameters
  • fastq – Path to FASTQ file.

  • index – Path to Kallisto index.

Returns

Path to output directory.

Raises

KallistoProblem – Kallisto quantification failed.

htsinfer.get_library_stats module

Infer library statistics from sample data.

class htsinfer.get_library_stats.GetLibStats(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'))

Bases: object

Determine library statistics of a single- or paired-end sequencing library.

Parameters
  • paths – Tuple of one or two paths for single-end or paired-end library files.

  • tmp_dir – Path to directory where temporary output is written to.

paths

Tuple of one or two paths for single-end or paired-end library files.

tmp_dir

Path to directory where temporary output is written to.

evaluate() htsinfer.models.ResultsStats

Infer read statistics.

Returns

Statistics results object.

static fastq_get_min_max_read_length(fastq: pathlib.Path) Tuple[int, int]

Get minimum and maximum read lengths in a FASTQ file.

Parameters

fastq – Path to FASTQ file.

Returns

Tuple of minimum and maximum read lengths in input file.

Raises

FileProblem – Could not process FASTQ file.
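A minimal sketch of how minimum and maximum read lengths could be collected, assuming plain four-line FASTQ records. It reads from any iterable of lines rather than a file path, and the name `min_max_read_length` is hypothetical. HTSinfer's own parser may differ.

```python
from typing import Iterable, Tuple


def min_max_read_length(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (min, max) sequence length over four-line FASTQ records."""
    lengths = [
        len(line.strip())
        for index, line in enumerate(lines)
        if index % 4 == 1  # the sequence is the second line of each record
    ]
    if not lengths:
        raise ValueError("No FASTQ records found")
    return min(lengths), max(lengths)
```

An open file handle can be passed directly, e.g. `min_max_read_length(open(path))`.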

htsinfer.get_library_type module

Infer mate information from sample data.

class htsinfer.get_library_type.GetFastqType(path: pathlib.Path)

Bases: object

Determine type (single/paired) information for an individual FASTQ sequencing library.

Parameters

path – File path to read library.

path

File path to read library.

seq_ids

List of sequence identifier prefixes of the provided read library, i.e., the fragments up until the mate information, if available, as defined by a named capture group prefix in a regular expression to extract mate information.

seq_id_format

The sequence identifier format of the read library, as identified by inspecting the first read and matching one of the available regular expressions for the different identifier formats.

result

The current best guess for the type of the provided library.

Examples

>>> lib_type = GetFastqType(
...     path="tests/files/first_mate.fastq"
... ).evaluate()
<OutcomesType.first_mate: 'first_mate'>

evaluate() None

Decide library type.

Raises

NoMetadataDetermined – Type information could not be determined.
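The role of the named capture group prefix can be illustrated with a small sketch. The pattern below covers only identifiers that end in /1 or /2. It is an assumed example format, not one of HTSinfer's actual regular expressions, and `split_identifier` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for identifiers such as "@read_1/1" or "@read_1/2"
MATE_SUFFIX = re.compile(r"^@(?P<prefix>\S+?)/(?P<mate>[12])$")


def split_identifier(seq_id: str) -> Tuple[str, Optional[int]]:
    """Return the identifier prefix and the mate number, if any."""
    seq_id = seq_id.strip()
    match = MATE_SUFFIX.match(seq_id)
    if match is None:
        # No mate information: treat the whole identifier as the prefix
        return seq_id.lstrip("@"), None
    return match.group("prefix"), int(match.group("mate"))
```

Collecting the prefixes of two files in this way makes it possible to compare them record by record when checking whether the files are mates of each other.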

class htsinfer.get_library_type.GetLibType(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None)

Bases: object

Determine type (single/paired) information for a single or a pair of FASTQ sequencing libraries.

Parameters
  • path_1 – Path to single-end library or first mate file.

  • path_2 – Path to second mate file.

path_1

Path to single-end library or first mate file.

path_2

Path to second mate file.

results

Results container for storing library type information for the provided files, as well as the mate relationship between the two files, if applicable.

Examples

>>> GetLibType(
...     path_1="tests/files/first_mate.fastq"
... ).evaluate()
ResultsType(file_1=<OutcomesType.single: 'single'>, file_2=<OutcomesType.not_available: 'not_available'>, relationship=<OutcomesTypeRelationship.not_available: 'not_available'>)
>>> GetLibType(
...     path_1="tests/files/first_mate.fastq",
...     path_2="../tests/test_files/second_mate.fastq",
... ).evaluate()
ResultsType(file_1=<OutcomesType.first_mate: 'first_mate'>, file_2=<OutcomesType.second_mate: 'second_mate'>, relationship=<OutcomesTypeRelationship.split_mates: 'split_mates'>)
('first_mate', 'second_mate', 'split_mates')

evaluate() None

Decide type information and mate relationship.

htsinfer.get_read_layout module

Infer adapter sequences present in reads.

class htsinfer.get_read_layout.GetAdapter3(path: pathlib.Path, adapter_file: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api'), min_match_pct: float = 2, min_freq_ratio: float = 2)

Bases: object

Determine 3’ adapter sequence for an individual FASTQ library.

Parameters
  • path – File path to read library.

  • adapter_file – Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

  • out_dir – Path to directory where output is written to.

  • min_match_pct – Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

  • min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

path

File path to read library.

adapter_file

Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

out_dir

Path to directory where output is written to.

min_match_pct

Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

min_freq_ratio

Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

adapters

List of adapter sequences.

trie

Trie data structure of adapter sequences.

adapter_counts

Dictionary of adapter sequences and corresponding count percentages.

result

The most frequent adapter sequence in FASTQ file.

Examples

>>> GetAdapter3(
...     path="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
... ).evaluate()
<"AAAAAAAAAAAAAAA">

evaluate() None

Search for adapter sequences and validate result confidence constraints.
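The count percentages stored in adapter_counts can be approximated with a short sketch. Note that it uses plain substring search instead of the trie mentioned above, which is simpler but slower for many adapters. The function name `adapter_percentages` and its inputs are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def adapter_percentages(
    reads: Iterable[str],
    adapters: List[str],
) -> Dict[str, float]:
    """Return, per adapter, the percentage of reads that contain it."""
    counts = {adapter: 0 for adapter in adapters}
    total = 0
    for read in reads:
        total += 1
        for adapter in adapters:
            # Substring search stands in for the trie lookup here
            if adapter in read:
                counts[adapter] += 1
    if total == 0:
        return {adapter: 0.0 for adapter in adapters}
    return {adapter: 100 * n / total for adapter, n in counts.items()}
```

The resulting percentages would then be checked against min_match_pct and min_freq_ratio before an adapter is reported.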

class htsinfer.get_read_layout.GetReadLayout(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None, adapter_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/adapter_fragments.txt'), out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), min_match_pct: float = 2, min_freq_ratio: float = 2)

Bases: object

Determine the adapter sequence present in the FASTQ sequencing libraries.

Parameters
  • path_1 – Path to single-end library or first mate file.

  • path_2 – Path to second mate file.

  • adapter_file – Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

  • out_dir – Path to directory where output is written to.

  • min_match_pct – Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

  • min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

path_1

Path to single-end library or first mate file.

path_2

Path to second mate file.

adapter_file

Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

out_dir

Path to directory where output is written to.

min_match_pct

Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

min_freq_ratio

Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

results

Results container for storing adapter sequence information for the provided files.

Examples

>>> GetReadLayout(
...     path_1="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
... ).evaluate()
ResultsLayout(
    file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
    file_2=<Layout().adapt_3: None>,
)
>>> GetReadLayout(
...     path_1="tests/files/sra_sample_1.fastq",
...     path_2="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
...     min_match_pct=2,
...     min_freq_ratio=1,
... ).evaluate()
ResultsLayout(
    file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
    file_2=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
)

evaluate() None

Decide adapter sequence.

htsinfer.get_read_orientation module

Infer read orientation from sample data.

class htsinfer.get_read_orientation.GetOrientation(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], library_type: htsinfer.models.ResultsType, library_source: htsinfer.models.ResultsSource, transcripts_file: pathlib.Path, tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), threads_star: int = 1, min_mapped_reads: int = 20, min_fraction: float = 0.75)

Bases: object

Determine library strandedness and relative read orientation of a single- or paired-end sequencing library.

Parameters
  • paths – Tuple of one or two paths for single-end or paired-end library files.

  • library_type – ResultsType object with library type and mate relationship.

  • library_source – ResultsSource object with source information on each library file.

  • transcripts_file – File path to an uncompressed transcripts file in FASTA format.

  • tmp_dir – Path to directory where temporary output is written to.

  • threads_star – Number of threads to run STAR with.

  • source – Source (organism, tissue, etc.) of the sequencing library.

  • min_mapped_reads – Minimum number of mapped reads for deeming the read orientation result reliable.

  • min_fraction – Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

  • mate_relationship – Type/mate relationship between the provided files.

paths

Tuple of one or two paths for single-end or paired-end library files.

library_type

ResultsType object with library type and mate relationship.

library_source

ResultsSource object with source information on each library file.

transcripts_file

File path to an uncompressed transcripts file in FASTA format.

tmp_dir

Path to directory where temporary output is written to.

threads_star

Number of threads to run STAR with.

source

Source (organism, tissue, etc.) of the sequencing library.

min_mapped_reads

Minimum number of mapped reads for deeming the read orientation result reliable.

min_fraction

Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

mate_relationship

Type/mate relationship between the provided files.

create_star_index(fasta: pathlib.Path, index_string_size: int = 5) pathlib.Path

Prepare STAR index.

Parameters
  • fasta – Path to FASTA file of sequence records to create index from.

  • index_string_size – Size of SA pre-indexing string, in nucleotides.

Returns

Path to directory containing STAR index.

Raises

StarProblem – STAR index could not be created.

evaluate() htsinfer.models.ResultsOrientation

Infer read orientation.

Returns

Orientation results object.

static generate_star_alignments(commands: Dict[pathlib.Path, List[str]]) None

Align reads to index with STAR.

Parameters

commands – Dictionary of output paths and corresponding STAR commands.

Raises

StarProblem – Generating alignments failed.

static get_fasta_size(fasta: pathlib.Path) int

Get size of FASTA file in total nucleotides.

Parameters

fasta – Path to FASTA file.

Returns

Total number of nucleotides of all records.

Raises

FileProblem – Could not open FASTA file for reading.
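Counting the total number of nucleotides amounts to summing the lengths of all non-header lines. The sketch below works on any iterable of lines. The name `fasta_size` is hypothetical and error handling is omitted.

```python
from typing import Iterable


def fasta_size(lines: Iterable[str]) -> int:
    """Return the total number of nucleotides across all FASTA records."""
    return sum(
        len(line.strip())
        for line in lines
        if not line.startswith(">")  # skip sequence identifier lines
    )
```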

static get_frequencies(*items: Any) Dict[Any, float]

Get frequencies of arguments as fractions of the number of all arguments.

Parameters

*items – Items to get frequencies for.

Returns

Dictionary of arguments and their frequencies.
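A sketch of the described behavior using collections.Counter. The function name `frequencies` is chosen for illustration and the actual implementation may differ.

```python
from collections import Counter
from typing import Any, Dict


def frequencies(*items: Any) -> Dict[Any, float]:
    """Map each distinct argument to its share of all arguments."""
    counts = Counter(items)
    total = len(items)
    # An empty call yields an empty dictionary (no division takes place)
    return {item: count / total for item, count in counts.items()}
```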

static get_star_index_string_size(ref_size: int) int

Get length of STAR SA pre-indexing string.

Cf. https://github.com/alexdobin/STAR/blob/51b64d4fafb7586459b8a61303e40beceeead8c0/doc/STARmanual.pdf

Parameters

ref_size – Size of genome/transcriptome reference in nucleotides.

Returns

Size (in nucleotides) of SA pre-indexing string.
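The STAR manual recommends scaling the SA pre-indexing string length for small references to min(14, log2(GenomeLength)/2 - 1). A sketch of that formula follows. Rounding down and the lower bound of 1 are assumptions made here so that the result is a usable positive integer, and `star_index_string_size` is a hypothetical name.

```python
import math


def star_index_string_size(ref_size: int) -> int:
    """Return min(14, log2(ref_size) / 2 - 1), rounded down, at least 1."""
    if ref_size < 1:
        raise ValueError("Reference size must be positive")
    size = math.floor(math.log2(ref_size) / 2 - 1)
    # Lower bound of 1 is an assumption for very small references
    return max(1, min(14, size))
```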

prepare_star_alignment_commands(index_dir: pathlib.Path) Dict[pathlib.Path, List[str]]

Prepare STAR alignment commands.

Parameters

index_dir – Path to directory containing STAR index.

Returns

Dictionary of output paths and corresponding STAR commands.

process_alignments(star_dirs: List[pathlib.Path]) htsinfer.models.ResultsOrientation

Determine read orientation of one or two single-ended or one paired-end sequencing library.

Parameters

star_dirs – List of one or two paths to STAR output directories.

Returns

Read orientation state of library or libraries.

process_paired(sam: pathlib.Path) htsinfer.models.ResultsOrientation

Determine read orientation of a paired-ended sequencing library.

Parameters

sam – Path to SAM file.

Returns

Read orientation state of each mate and orientation state relationship of library.

process_single(sam: pathlib.Path) htsinfer.models.StatesOrientation

Determine read orientation of a single-ended sequencing library.

Parameters

sam – Path to SAM file.

Returns

Read orientation state of library.
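How min_mapped_reads and min_fraction might gate a single-end call can be sketched from SAM flags alone. The example assumes reads were aligned against transcripts, so that a read on the forward strand supports the state "SF" and a read on the reverse strand supports "SR". How HTSinfer decides on "U" (unstranded) is not covered, and the function name is hypothetical.

```python
from typing import Iterable, Optional

UNMAPPED = 0x4  # SAM flag bit: read is unmapped
REVERSE_STRAND = 0x10  # SAM flag bit: read maps to the reverse strand


def single_end_orientation(
    flags: Iterable[int],
    min_mapped_reads: int = 20,
    min_fraction: float = 0.75,
) -> Optional[str]:
    """Return "SF" or "SR" if supported by enough reads, else None."""
    forward = reverse = 0
    for flag in flags:
        if flag & UNMAPPED:
            continue
        if flag & REVERSE_STRAND:
            reverse += 1
        else:
            forward += 1
    mapped = forward + reverse
    # Too few mapped reads: the result is not considered reliable
    if mapped == 0 or mapped < min_mapped_reads:
        return None
    if forward / mapped >= min_fraction:
        return "SF"
    if reverse / mapped >= min_fraction:
        return "SR"
    return None
```

Because min_fraction must be above 0.5, at most one of the two states can pass the threshold.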

subset_transcripts_by_organism() pathlib.Path

Filter FASTA file of transcripts by current sources.

The filtered file contains records from the indicated sources.

Typically, this is one source. However, if two input files were supplied that originate from different sources (i.e., not from a valid paired-end library), the filtered file may contain records from two different sources. If no source is supplied (because it could not be inferred), no filtering is done.

Returns

Path to filtered FASTA file.

Raises

FileProblem – Could not open input/output FASTA file for reading/writing.
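The filtering step can be sketched as a generator over FASTA lines, assuming the |-separated identifier format with the organism short name in the fourth column. The name `filter_fasta` and the use of a set of short names are assumptions. The real method works on files and on the inferred library sources.

```python
from typing import Iterable, Iterator, Set


def filter_fasta(lines: Iterable[str], short_names: Set[str]) -> Iterator[str]:
    """Yield only the records whose organism short name is requested."""
    keep = not short_names  # no source given: keep every record
    for line in lines:
        if line.startswith(">") and short_names:
            fields = line[1:].strip().split("|")
            # Fourth column holds the organism short name
            keep = len(fields) > 3 and fields[3] in short_names
        if keep:
            yield line
```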

static sum_dicts(*dicts: Dict[Any, float]) Dict[Any, float]

Sum of dictionaries with numeric values.

Parameters

*dicts – Dictionaries to sum up.

Returns

Dictionary with union of keys of input dictionaries and all values added up.
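A sketch of the described summation. The name `add_dicts` is used to make clear that this is an illustration rather than the method itself.

```python
from collections import defaultdict
from typing import Any, Dict


def add_dicts(*dicts: Dict[Any, float]) -> Dict[Any, float]:
    """Return a dictionary with the union of keys and summed values."""
    totals: Dict[Any, float] = defaultdict(float)
    for mapping in dicts:
        for key, value in mapping.items():
            totals[key] += value
    return dict(totals)
```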

htsinfer.htsinfer module

Main module.

class htsinfer.htsinfer.HtsInfer(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), cleanup_regime: htsinfer.models.CleanupRegimes = CleanupRegimes.DEFAULT, records: int = 0, threads: int = 1, transcripts_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/transcripts.fasta.gz'), read_layout_adapter_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/adapter_fragments.txt'), read_layout_min_match_pct: float = 2, read_layout_min_freq_ratio: float = 2, lib_source_min_match_pct: float = 2, lib_source_min_freq_ratio: float = 2, read_orientation_min_mapped_reads: int = 20, read_orientation_min_fraction: float = 0.75)

Bases: object

Determine sequencing library metadata.

Parameters
  • path_1 – Path to single-end library or first mate file.

  • path_2 – Path to second mate file.

  • out_dir – Path to directory where output is written to.

  • tmp_dir – Path to directory where temporary output is written to.

  • cleanup_regime – Which data to keep after run concludes; one of CleanupRegimes.

  • records – Number of input file records to process; set to 0 to process all records.

  • threads – Number of threads to run STAR with.

  • transcripts_file – File path to transcripts FASTA file.

  • read_layout_adapter_file – Path to text file containing 3’ adapter sequences to scan for (one sequence per line).

  • read_layout_min_match_pct – Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.

  • read_layout_min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

  • lib_source_min_match_pct – Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.

  • lib_source_min_freq_ratio – Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.

  • read_orientation_min_mapped_reads – Minimum number of mapped reads for deeming the read orientation result reliable.

  • read_orientation_min_fraction – Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

path_1

Path to single-end library or first mate file.

path_2

Path to second mate file.

out_dir

Path to directory where output is written to.

run_id

Random string identifier for HTSinfer run.

tmp_dir

Path to directory where temporary output is written to.

cleanup_regime

Which data to keep after run concludes; one of CleanupRegimes.

records

Number of input file records to process.

threads

Number of threads to run STAR with.

transcripts_file

File path to transcripts FASTA file.

read_layout_adapter_file

Path to text file containing 3’ adapter sequences to scan for (one sequence per line).

read_layout_min_match_pct

Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.

read_layout_min_freq_ratio

Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

lib_source_min_match_pct

Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.

lib_source_min_freq_ratio

Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.

read_orientation_min_mapped_reads

Minimum number of mapped reads for deeming the read orientation result reliable.

read_orientation_min_fraction

Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

path_1_processed

Path to processed path_1 file.

path_2_processed

Path to processed path_2 file.

transcripts_file_processed

Path to processed transcripts_file file.

state

State of the run; one of RunStates.

results

Results container for storing determined library metadata.

clean_up()

Clean up work environment.

evaluate()

Determine library metadata.

get_library_source() htsinfer.models.ResultsSource

Determine library source.

Returns

Library source results.

get_library_stats()

Determine library statistics.

get_library_type()

Determine library type.

get_read_layout()

Determine read layout.

get_read_orientation()

Determine read orientation.

prepare_env()

Set up work environment.

print()

Print results to STDOUT.

process_inputs()

Process and validate inputs.

htsinfer.models module

Data models.

class htsinfer.models.CleanupRegimes(value)

Bases: enum.Enum

Enumerator of cleanup regimes.

DEFAULT = 'default'
KEEP_ALL = 'keep_all'
KEEP_NONE = 'keep_none'
KEEP_RESULTS = 'keep_results'

class htsinfer.models.Layout(*, adapt_3: str = None)

Bases: pydantic.main.BaseModel

Read layout of a single sequencing file.

Parameters

adapt_3 – Adapter sequence ligated to 3’-end of sequence.

adapt_3

Adapter sequence ligated to 3’-end of sequence.

Type

Optional[str]

adapt_3: Optional[str]

class htsinfer.models.LogLevels(value)

Bases: enum.Enum

Log level enumerator.

CRITICAL = 50
DEBUG = 10
ERROR = 40
INFO = 20
WARN = 30
WARNING = 30

class htsinfer.models.ReadLength(*, min: int = None, max: int = None)

Bases: pydantic.main.BaseModel

Read length of a sequencing file.

Parameters
  • min – Minimum read length.

  • max – Maximum read length.

min

Minimum read length.

Type

Optional[int]

max

Maximum read length.

Type

Optional[int]

max: Optional[int]
min: Optional[int]

class htsinfer.models.Results(*, library_stats: htsinfer.models.ResultsStats = ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None)), file_2=Stats(read_length=ReadLength(min=None, max=None))), library_type: htsinfer.models.ResultsType = ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>), library_source: htsinfer.models.ResultsSource = ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None)), read_orientation: htsinfer.models.ResultsOrientation = ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>), read_layout: htsinfer.models.ResultsLayout = ResultsLayout(file_1=Layout(adapt_3=None), file_2=Layout(adapt_3=None)))

Bases: pydantic.main.BaseModel

Container class for aggregating results from the different inference functionalities.

Parameters
  • library_type – Library type inference results.

  • library_source – Library source inference results.

  • read_orientation – Read orientation inference results.

  • read_layout – Read layout inference results.

library_source: htsinfer.models.ResultsSource
library_stats: htsinfer.models.ResultsStats
library_type: htsinfer.models.ResultsType
read_layout: htsinfer.models.ResultsLayout
read_orientation: htsinfer.models.ResultsOrientation

class htsinfer.models.ResultsLayout(*, file_1: htsinfer.models.Layout = Layout(adapt_3=None), file_2: htsinfer.models.Layout = Layout(adapt_3=None))

Bases: pydantic.main.BaseModel

Container class for read layout of a sequencing library.

Parameters
  • file_1 – Adapter sequence present in first file.

  • file_2 – Adapter sequence present in second file.

file_1

Adapter sequence present in first file.

Type

htsinfer.models.Layout

file_2

Adapter sequence present in second file.

Type

htsinfer.models.Layout

file_1: htsinfer.models.Layout
file_2: htsinfer.models.Layout

class htsinfer.models.ResultsOrientation(*, file_1: htsinfer.models.StatesOrientation = StatesOrientation.not_available, file_2: htsinfer.models.StatesOrientation = StatesOrientation.not_available, relationship: htsinfer.models.StatesOrientationRelationship = StatesOrientationRelationship.not_available)

Bases: pydantic.main.BaseModel

Container class for aggregating library orientation.

Parameters
  • file_1 – Read orientation of first file.

  • file_2 – Read orientation of second file.

  • relationship – Orientation type relationship between the provided files.

file_1

Read orientation of first file.

Type

htsinfer.models.StatesOrientation

file_2

Read orientation of second file.

Type

htsinfer.models.StatesOrientation

file_1: htsinfer.models.StatesOrientation
file_2: htsinfer.models.StatesOrientation
relationship: htsinfer.models.StatesOrientationRelationship

class htsinfer.models.ResultsSource(*, file_1: htsinfer.models.Source = Source(short_name=None, taxon_id=None), file_2: htsinfer.models.Source = Source(short_name=None, taxon_id=None))

Bases: pydantic.main.BaseModel

Container class for aggregating library source.

Parameters
  • file_1 – Library source of the first file.

  • file_2 – Library source of the second file.

file_1

Library source of the first file.

Type

htsinfer.models.Source

file_2

Library source of the second file.

Type

htsinfer.models.Source

file_1: htsinfer.models.Source
file_2: htsinfer.models.Source

class htsinfer.models.ResultsStats(*, file_1: htsinfer.models.Stats = Stats(read_length=ReadLength(min=None, max=None)), file_2: htsinfer.models.Stats = Stats(read_length=ReadLength(min=None, max=None)))

Bases: pydantic.main.BaseModel

Container class for aggregating library statistics information.

Parameters
  • file_1 – Library statistics for the first file.

  • file_2 – Library statistics for the second file.

file_1

Library statistics for the first file.

Type

htsinfer.models.Stats

file_2

Library statistics for the second file.

Type

htsinfer.models.Stats

file_1: htsinfer.models.Stats
file_2: htsinfer.models.Stats

class htsinfer.models.ResultsType(*, file_1: htsinfer.models.StatesType = StatesType.not_available, file_2: htsinfer.models.StatesType = StatesType.not_available, relationship: htsinfer.models.StatesTypeRelationship = StatesTypeRelationship.not_available)

Bases: pydantic.main.BaseModel

Container class for aggregating library type and mate relationship information.

Parameters
  • file_1 – Library type of the first file.

  • file_2 – Library type of the second file.

  • relationship – Type/mate relationship between the provided files.

file_1

Library type of the first file.

Type

htsinfer.models.StatesType

file_2

Library type of the second file.

Type

htsinfer.models.StatesType

relationship

Type/mate relationship between the provided files.

Type

htsinfer.models.StatesTypeRelationship

file_1: htsinfer.models.StatesType
file_2: htsinfer.models.StatesType
relationship: htsinfer.models.StatesTypeRelationship

class htsinfer.models.RunStates(value)

Bases: enum.IntEnum

Enumerator of run states and exit codes.

ERROR = 2
OKAY = 0
WARNING = 1

class htsinfer.models.SeqIdFormats(value)

Bases: enum.Enum

An enumeration.

class htsinfer.models.Source(*, short_name: str = None, taxon_id: int = None)

Bases: pydantic.main.BaseModel

Library source of an individual sequencing file.

Parameters
  • short_name – Library source short name, e.g., “hsapiens”.

  • taxon_id – Library source taxon identifier, e.g., 9606.

short_name

Library source short name, e.g., “hsapiens”.

Type

Optional[str]

taxon_id

Library source taxon identifier, e.g., 9606.

Type

Optional[int]

short_name: Optional[str]
taxon_id: Optional[int]

class htsinfer.models.StatesOrientation(value)

Bases: enum.Enum

Enumerator of read orientation types for individual library files. Cf. https://salmon.readthedocs.io/en/latest/library_type.html

not_available

Orientation type information is not available for a given file, either because no file was provided, the file could not be parsed, or an orientation type has not yet been assigned.

stranded_forward

Reads are stranded and come from the forward strand.

stranded_reverse

Reads are stranded and come from the reverse strand.

unstranded

Reads are unstranded.

not_available = None
stranded_forward = 'SF'
stranded_reverse = 'SR'
unstranded = 'U'

class htsinfer.models.StatesOrientationRelationship(value)

Bases: enum.Enum

Enumerator of read orientation type relationships for paired-ended libraries. Cf. https://salmon.readthedocs.io/en/latest/library_type.html

inward_stranded_forward

Mates are oriented toward each other, the library is stranded, and first mates come from the forward strand.

inward_stranded_reverse

Mates are oriented toward each other, the library is stranded, and first mates come from the reverse strand.

inward_unstranded

Mates are oriented toward each other and the library is unstranded.

not_available

Orientation type relationship information is not available, likely because only a single file was provided or because the orientation type relationship has not been or could not be evaluated.

inward_stranded_forward = 'ISF'
inward_stranded_reverse = 'ISR'
inward_unstranded = 'IU'
not_available = None
class htsinfer.models.StatesType(value)

Bases: enum.Enum

Possible outcomes of determining the sequencing library type of an individual FASTQ file.

file_problem

There was a problem with opening or parsing the file.

first_mate

All of the sequence identifiers of the processed file indicate that the library represents the first mate of a paired-end library.

mixed_mates

All of the sequence identifiers of the processed file include mate information. However, the file includes at least one record for each mate, indicating that the library represents a mixed mate library.

not_available

Library type information is not available for a given file because no file was provided, the file could not be parsed, a library type has not yet been assigned, or the processed file contains records with sequence identifiers of an unknown format, of different formats, or that are inconsistent in that they indicate that the library is both single-ended and paired-ended at the same time.

second_mate

All of the sequence identifiers of the processed file indicate that the library represents the second mate of a paired-end library.

single

All of the sequence identifiers of the processed file indicate that the library represents a single-end library.

first_mate = 'first_mate'
mixed_mates = 'mixed_mates'
not_available = None
second_mate = 'second_mate'
single = 'single'
class htsinfer.models.StatesTypeRelationship(value)

Bases: enum.Enum

Possible outcomes of determining the sequencing library type/mate relationship between two FASTQ files.

not_available

Mate relationship information is not available, likely because only a single file was provided or because the mate relationship has not yet been evaluated.

not_mates

The library type information of the files is not compatible, either because the files provided are not a pair of first and second mate files, or because the files do not have compatible sequence identifiers.

split_mates

One of the provided files represents the first and the other the second mates of a paired-end library.

not_available = None
not_mates = 'not_mates'
split_mates = 'split_mates'
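
The relationship states described above can be illustrated by deriving them from the library types of two files. This is a simplified sketch under stated assumptions: the enumerators are local copies of the listed members, the `relate` function is hypothetical, and it does not compare sequence identifiers between the files, which the package description names as a further criterion.

```python
import enum


class StatesType(enum.Enum):
    """Local copy of the documented library type states."""
    first_mate = 'first_mate'
    mixed_mates = 'mixed_mates'
    not_available = None
    second_mate = 'second_mate'
    single = 'single'


class StatesTypeRelationship(enum.Enum):
    """Local copy of the documented mate relationship states."""
    not_available = None
    not_mates = 'not_mates'
    split_mates = 'split_mates'


def relate(file_1: StatesType, file_2: StatesType) -> StatesTypeRelationship:
    """Hypothetical helper: classify the relationship of two files."""
    # Assumption: without a type for both files, nothing can be evaluated.
    if StatesType.not_available in (file_1, file_2):
        return StatesTypeRelationship.not_available
    # One first-mate file and one second-mate file, in either order.
    if {file_1, file_2} == {StatesType.first_mate, StatesType.second_mate}:
        return StatesTypeRelationship.split_mates
    return StatesTypeRelationship.not_mates
```
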
class htsinfer.models.Stats(*, read_length: htsinfer.models.ReadLength = ReadLength(min=None, max=None))

Bases: pydantic.main.BaseModel

Library statistics of an individual sequencing file.

Parameters

read_length – Tuple of minimum and maximum length of reads in library.

read_length

Tuple of minimum and maximum length of reads in library.

Type

htsinfer.models.ReadLength

read_length: htsinfer.models.ReadLength

htsinfer.subset_fastq module

FASTQ subsetting, extraction and validation.

class htsinfer.subset_fastq.SubsetFastq(path: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), records: int = 0)

Bases: object

Subset, uncompress and validate a FASTQ file.

Parameters
  • path – Path to FASTQ file.

  • out_dir – Path to directory where output is written to.

  • records – Number of input file records to process; set to 0 to process all records.

path

Path to FASTQ file.

out_dir

Path to directory where output is written to.

records

Number of input file records to process.

out_path

Path of the uncompressed, filtered output file.

n_processed

Total number of processed records.

Raises

FileProblem – The input file could not be parsed or the output file could not be written.

process()

Uncompress, subset and validate files.
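
The general idea of uncompressing, subsetting and validating a FASTQ file can be sketched with the standard library alone. The function below is an illustration and not the class's actual implementation: the name `subset_fastq`, the gzip detection via magic number, the specific validation checks and the use of ValueError (rather than FileProblem) are all assumptions.

```python
import gzip
from itertools import islice
from pathlib import Path


def subset_fastq(path: Path, out_path: Path, records: int = 0) -> int:
    """Copy the first `records` FASTQ records (0 = all) to `out_path`.

    Returns the number of records written; raises ValueError on a
    malformed record. Hypothetical stand-in for illustration only.
    """
    # Detect gzip compression by its two-byte magic number.
    with open(path, 'rb') as handle:
        is_gzip = handle.read(2) == b'\x1f\x8b'
    opener = gzip.open if is_gzip else open
    n_processed = 0
    with opener(path, 'rt') as _in, open(out_path, 'w') as _out:
        while records == 0 or n_processed < records:
            # A FASTQ record spans four lines: id, sequence, '+', qualities.
            record = list(islice(_in, 4))
            if not record:
                break
            if (
                len(record) < 4
                or not record[0].startswith('@')
                or not record[2].startswith('+')
                or len(record[1].rstrip('\n')) != len(record[3].rstrip('\n'))
            ):
                raise ValueError(f"Malformed FASTQ record #{n_processed + 1}")
            _out.writelines(record)
            n_processed += 1
    return n_processed
```

As in the documented class, passing `records=0` processes the whole input file.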

htsinfer.utils module

Utilities used across multiple HTSinfer modules.

htsinfer.utils.convert_dict_to_df(dic: Dict, col_headers: Optional[Tuple[str, str]] = None, sort: bool = False, sort_by: int = 0, sort_ascending: bool = True) pandas.core.frame.DataFrame

Convert dictionary to two-column data frame.

Parameters
  • dic – Dictionary to convert.

  • col_headers – Tuple of column headers. Length MUST match the number of data frame columns.

  • sort – Whether the resulting data frame is supposed to be sorted.

  • sort_by – Column index used for sorting. Ignored if sort is False.

  • sort_ascending – Whether the data frame is supposed to be sorted in ascending order. Ignored if sort is False.

Returns

Data frame prepared from dictionary.

Raises

ValueError – Raised if number of provided column headers does not match the number of data frame columns.

htsinfer.utils.validate_top_score(vector: List[float], min_value: float = 2, min_ratio: float = 2, accept_zero: bool = True, rev_sorted: bool = True) bool

Validates whether (1) the maximum value of a numeric list is equal to or higher than a specified minimum value AND (2) that the ratio of the first and second highest values of the list is higher than a specified minimum ratio.

If the passed list/vector does NOT contain at least two items, the function returns False.

Parameters
  • vector – List of numbers.

  • min_value – Minimum value that the highest value of the list must reach for validation to pass.

  • min_ratio – Minimum ratio of the highest and second highest values of the list required for validation to pass.

  • accept_zero – Whether to accept a top score (i.e., return True) if the second highest value in the provided list is zero. If not set to True, False is returned in these cases.

  • rev_sorted – Whether the list of numbers is sorted in descending numeric order.

Returns

Whether the values in the list satisfy the min_value and min_ratio constraints.

Raises

ValueError – Raised if one of the list items cannot be interpreted as a number.
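
The checks described for validate_top_score can be sketched as follows. This is a reading of the documented behavior, not the library's code: the function name `top_score_is_valid` is hypothetical, and the handling of boundary cases (a ratio exactly equal to min_ratio fails here, following the "higher than" wording) is an assumption.

```python
from typing import List


def top_score_is_valid(
    vector: List[float],
    min_value: float = 2,
    min_ratio: float = 2,
    accept_zero: bool = True,
    rev_sorted: bool = True,
) -> bool:
    """Hypothetical re-statement of the documented validation rules."""
    try:
        values = [float(item) for item in vector]
    except (TypeError, ValueError) as exc:
        raise ValueError("list items must be numbers") from exc
    # Fewer than two items: nothing to compare the top score against.
    if len(values) < 2:
        return False
    if not rev_sorted:
        values.sort(reverse=True)
    top, second = values[0], values[1]
    # (1) The highest value must reach the minimum value.
    if top < min_value:
        return False
    # A zero runner-up makes the ratio undefined; defer to accept_zero.
    if second == 0:
        return accept_zero
    # (2) Assumption: the ratio must be strictly higher than min_ratio.
    return top / second > min_ratio
```

For example, `top_score_is_valid([10, 2])` passes both checks, whereas `top_score_is_valid([10, 8])` fails the ratio check.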