htsinfer package¶

HTSinfer project root

Submodules¶

htsinfer.cli module¶

Command-line interface client.

htsinfer.cli.main() → None¶: Entry point for CLI executable.

htsinfer.cli.parse_args() → argparse.Namespace¶

Parse CLI arguments.

Returns: Parsed CLI arguments.

htsinfer.cli.setup_logging(verbosity: str = 'INFO') → None¶

Configure logging.

Parameters: verbosity – Level of logging verbosity.
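A minimal sketch of how a verbosity name such as 'INFO' could be mapped to a numeric logging level using only the standard library. The function name configure_logging_sketch and the log format are assumptions made for illustration; they are not taken from HTSinfer itself.

```python
import logging


def configure_logging_sketch(verbosity: str = "INFO") -> int:
    """Set the root logger level from a level name; return the numeric level."""
    # getLevelName returns an int for known names and a string otherwise
    level = logging.getLevelName(verbosity.upper())
    if not isinstance(level, int):
        raise ValueError(f"Unknown verbosity: {verbosity}")
    # Hypothetical log format, chosen only for this example
    logging.basicConfig(format="[%(asctime)s] %(levelname)s: %(message)s")
    logging.getLogger().setLevel(level)
    return level
```

Validating the name up front means a typo in the verbosity argument is reported immediately rather than silently leaving the default level in place.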

htsinfer.exceptions module¶

Custom exceptions.

exception htsinfer.exceptions.FileProblem¶

Bases: Exception

Exception raised when file could not be opened or parsed.

exception htsinfer.exceptions.InconsistentFastqIdentifiers¶

Bases: Exception

Exception raised when inconsistent FASTQ sequence identifiers were encountered.

exception htsinfer.exceptions.KallistoProblem¶

Bases: Exception

Exception raised when running kallisto index or quant commands fails.

exception htsinfer.exceptions.MetadataWarning¶

Bases: Exception

Exception raised when metadata could not be determined.

exception htsinfer.exceptions.StarProblem¶

Bases: Exception

Exception raised when running STAR index or alignment commands fails.

exception htsinfer.exceptions.UnknownFastqIdentifier¶

Bases: Exception

Exception raised when a FASTQ sequence identifier of unknown format was encountered.

exception htsinfer.exceptions.WorkEnvProblem¶

Bases: Exception

Exception raised when the work environment could not be set up or cleaned.

htsinfer.get_library_source module¶

Infer library source from sample data.

class htsinfer.get_library_source.GetLibSource(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], transcripts_file: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/results_htsinfer'), tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), min_match_pct: float = 2, min_freq_ratio: float = 2)¶

Bases: object

Determine the source of a single- or paired-end FASTQ sequencing library.

Parameters

paths – Tuple of one or two paths for single-end and paired-end library files.
transcripts_file – File path to an uncompressed transcripts file in FASTA format. Expected to contain |-separated sequence identifier lines that contain an organism short name and a taxon identifier in the fourth and fifth columns, respectively. Example sequence identifier: rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029
out_dir – Path to directory where output is written to.
tmp_dir – Path to directory where temporary output is written to.
min_match_pct – Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.
min_freq_ratio – Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.

Attributes:

paths: Tuple of one or two paths for single-end and paired-end library files.
transcripts_file: File path to an uncompressed transcripts file in FASTA format. Expected to contain |-separated sequence identifier lines that contain an organism short name and a taxon identifier in the fourth and fifth columns, respectively. Example sequence identifier: rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029
out_dir: Path to directory where output is written to.
tmp_dir: Path to directory where temporary output is written to.
min_match_pct: Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.
min_freq_ratio: Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.
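The identifier layout expected in transcripts_file can be illustrated with a short sketch that pulls the organism short name and taxon identifier out of the fourth and fifth |-separated columns. The helper name parse_source_from_header is hypothetical and not part of the documented API.

```python
from typing import Tuple


def parse_source_from_header(header: str) -> Tuple[str, int]:
    """Extract (short name, taxon ID) from a |-separated FASTA header line."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) < 5:
        raise ValueError(f"Expected at least 5 |-separated fields: {header}")
    # Fourth column: organism short name; fifth column: taxon identifier
    return fields[3], int(fields[4])


print(parse_source_from_header(">rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029"))
# → ('apisum', 7029)
```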

create_kallisto_index() → pathlib.Path¶

Build Kallisto index from FASTA file of target sequences.

Returns: Path to Kallisto index.
Raises: KallistoProblem – Kallisto index could not be created.

evaluate() → htsinfer.models.ResultsSource¶

Infer read source.

Returns: Source results object.

get_source(fastq: pathlib.Path, index: pathlib.Path) → htsinfer.models.Source¶

Determine source of a single sequencing library file.

Parameters

fastq – Path to FASTQ file.
index – Path to Kallisto index.

Returns

Source of library file.

static get_source_expression(kallisto_dir: pathlib.Path) → pandas.core.frame.DataFrame¶

Return percentages of total expression per read source.

Parameters

kallisto_dir – Directory containing Kallisto quantification results.

Returns

Data frame with columns source_ids (a tuple of source short name and taxon identifier, e.g., (“hsapiens”, 9606)) and tpm, signifying the percentages of total expression per read source. The data frame is sorted by total expression in descending order.

Raises

FileProblem – Kallisto quantification results could not be processed.
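The aggregation described above can be sketched without pandas: sum the TPM values of all targets that share a source, express each sum as a percentage of the total, and sort in descending order. The function name source_percentages and the use of a plain dictionary in place of Kallisto's output file are assumptions made for this example; the documented method returns a pandas data frame.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SourceId = Tuple[str, int]


def source_percentages(tpm_by_target: Dict[str, float]) -> List[Tuple[SourceId, float]]:
    """Sum TPM per source and express it as a percentage of the total."""
    totals: Dict[SourceId, float] = defaultdict(float)
    for target_id, tpm in tpm_by_target.items():
        fields = target_id.split("|")
        # Fourth and fifth columns hold the short name and taxon identifier
        totals[(fields[3], int(fields[4]))] += tpm
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = [(src, 100 * tpm / grand_total) for src, tpm in totals.items()]
    # Most highly expressed source first
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```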

run_kallisto_quantification(fastq: pathlib.Path, index: pathlib.Path) → pathlib.Path¶

Run Kallisto quantification on individual sequencing library file.

Parameters

fastq – Path to FASTQ file.
index – Path to Kallisto index.

Returns

Path to output directory.

Raises

KallistoProblem – Kallisto quantification failed.

htsinfer.get_library_stats module¶

Infer library statistics from sample data.

class htsinfer.get_library_stats.GetLibStats(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'))¶

Bases: object

Determine library statistics of a single- or paired-end sequencing library.

Parameters

paths – Tuple of one or two paths for single-end and paired-end library files.
tmp_dir – Path to directory where temporary output is written to.

paths¶: Tuple of one or two paths for single-end and paired-end library files.

tmp_dir¶: Path to directory where temporary output is written to.

evaluate() → htsinfer.models.ResultsStats¶

Infer read statistics.

Returns: Statistics results object.

static fastq_get_min_max_read_length(fastq: pathlib.Path) → Tuple[int, int]¶

Get minimum and maximum read lengths in a FASTQ file.

Parameters: fastq – Path to FASTQ file.
Returns: Tuple of minimum and maximum read lengths in input file.
Raises: FileProblem – Could not process FASTQ file.
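As an illustration, the minimum and maximum read lengths of a FASTQ file can be found in a single pass. This sketch assumes an uncompressed file with exactly four lines per record; the function name min_max_read_length and the use of ValueError (rather than FileProblem) are choices made for this example only.

```python
from pathlib import Path
from typing import Optional, Tuple


def min_max_read_length(fastq: Path) -> Tuple[int, int]:
    """Return (min, max) read length of a four-line-per-record FASTQ file."""
    shortest: Optional[int] = None
    longest = 0
    with open(fastq, encoding="ascii") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of each four-line record
            if line_number % 4 == 1:
                length = len(line.rstrip("\r\n"))
                shortest = length if shortest is None else min(shortest, length)
                longest = max(longest, length)
    if shortest is None:
        raise ValueError(f"No records found in {fastq}")
    return shortest, longest
```

Tracking a running minimum and maximum avoids holding all read lengths in memory, which matters for large libraries.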

htsinfer.get_library_type module¶

Infer mate information from sample data.

class htsinfer.get_library_type.GetFastqType(path: pathlib.Path)¶

Bases: object

Determine type (single/paired) information for an individual FASTQ sequencing library.

Parameters: path – File path to read library.

path¶: File path to read library.

seq_ids¶: List of sequence identifier prefixes of the provided read library, i.e., the fragments up until the mate information, if available, as defined by a named capture group prefix in a regular expression to extract mate information.

seq_id_format¶: The sequence identifier format of the read library, as identified by inspecting the first read and matching one of the available regular expressions for the different identifier formats.

result¶: The current best guess for the type of the provided library.

Examples

>>> lib_type = GetFastqType(
...     path="tests/files/first_mate.fastq"
... ).evaluate()
<OutcomesType.first_mate: 'first_mate'>

evaluate() → None¶

Decide library type.

Raises: NoMetadataDetermined – Type information could not be determined.
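The role of the named capture group prefix can be sketched with one regular expression. The pattern below only covers identifiers that end in "/1" or "/2"; it is a hypothetical stand-in for the identifier formats that HTSinfer actually recognizes, and the function name classify_identifiers is likewise an assumption.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern: identifier prefix followed by "/1" or "/2"
MATE_PATTERN = re.compile(r"^@(?P<prefix>\S+?)/(?P<mate>[12])$")


def classify_identifiers(seq_ids: Iterable[str]) -> Optional[str]:
    """Classify FASTQ identifier lines by the mate information they carry."""
    mates = set()
    for seq_id in seq_ids:
        match = MATE_PATTERN.match(seq_id.strip())
        if match is None:
            return None  # identifier does not fit the assumed format
        mates.add(match.group("mate"))
    if mates == {"1"}:
        return "first_mate"
    if mates == {"2"}:
        return "second_mate"
    if mates == {"1", "2"}:
        return "mixed_mates"
    return None  # no identifiers were supplied
```

In this sketch the prefix group captures everything before the mate suffix, which is the part two mate files would need to share for their records to be paired.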

class htsinfer.get_library_type.GetLibType(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None)¶

Bases: object

Determine type (single/paired) information for a single or a pair of FASTQ sequencing libraries.

Args:

path_1: Path to single-end library or first mate file.
path_2: Path to second mate file.

Attributes:

path_1: Path to single-end library or first mate file.
path_2: Path to second mate file.
results: Results container for storing library type information for the provided files, as well as the mate relationship between the two files, if applicable.

Examples:

>>> GetLibType(
...     path_1="tests/files/first_mate.fastq"
... ).evaluate()
ResultsType(file_1=<OutcomesType.single: 'single'>, file_2=<OutcomesType.not_available: 'not_available'>, relationship=<OutcomesTypeRelationship.not_available: 'not_available'>)

>>> GetLibType(
...     path_1="tests/files/first_mate.fastq",
...     path_2="../tests/test_files/second_mate.fastq",
... ).evaluate()
ResultsType(file_1=<OutcomesType.first_mate: 'first_mate'>, file_2=<OutcomesType.second_mate: 'second_mate'>, relationship=<OutcomesTypeRelationship.split_mates: 'split_mates'>)

(‘first_mate’, ‘second_mate’, ‘split_mates’)

evaluate() → None¶: Decide type information and mate relationship.

htsinfer.get_read_layout module¶

Infer adapter sequences present in reads.

class htsinfer.get_read_layout.GetAdapter3(path: pathlib.Path, adapter_file: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api'), min_match_pct: float = 2, min_freq_ratio: float = 2)¶

Bases: object

Determine 3’ adapter sequence for an individual FASTQ library.

Parameters

path – File path to read library.
adapter_file – Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.
out_dir – Path to directory where output is written to.
min_match_pct – Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.
min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

path¶: File path to read library.

adapter_file¶: Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

out_dir¶: Path to directory where output is written to.

min_match_pct¶: Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

min_freq_ratio¶: Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

adapters¶: List of adapter sequences.

trie¶: Trie data structure of adapter sequences.

adapter_counts¶: Dictionary of adapter sequences and corresponding count percentages.

result¶: The most frequent adapter sequence in FASTQ file.

Examples

>>> GetAdapter3(
...     path="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
... ).evaluate()
<"AAAAAAAAAAAAAAA">

evaluate() → None¶: Search for adapter sequences and validate result confidence constraints.
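The two confidence constraints, min_match_pct and min_freq_ratio, can be sketched as follows. This example uses a plain substring search rather than the trie mentioned above, and the function names adapter_percentages and pick_adapter are hypothetical; how HTSinfer handles ties or a missing runner-up is not stated here, so that handling is an assumption.

```python
from typing import Dict, List, Optional


def adapter_percentages(reads: List[str], adapters: List[str]) -> Dict[str, float]:
    """Percentage of reads containing each adapter (plain substring search)."""
    if not reads:
        return {adapter: 0.0 for adapter in adapters}
    return {
        adapter: 100 * sum(adapter in read for read in reads) / len(reads)
        for adapter in adapters
    }


def pick_adapter(
    percentages: Dict[str, float],
    min_match_pct: float = 2,
    min_freq_ratio: float = 2,
) -> Optional[str]:
    """Return the most frequent adapter if it passes both thresholds."""
    ranked = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_match_pct:
        return None
    # Assumption: the ratio test is skipped when no runner-up was observed
    if len(ranked) > 1 and ranked[1][1] > 0:
        if ranked[0][1] / ranked[1][1] < min_freq_ratio:
            return None
    return ranked[0][0]
```

Requiring both an absolute and a relative threshold guards against reporting an adapter that is merely the best of several weak candidates.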

class htsinfer.get_read_layout.GetReadLayout(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None, adapter_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/adapter_fragments.txt'), out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), min_match_pct: float = 2, min_freq_ratio: float = 2)¶

Bases: object

Determine the adapter sequence present in the FASTQ sequencing libraries.

Parameters

path_1 – Path to single-end library or first mate file.
path_2 – Path to second mate file.
adapter_file – Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.
out_dir – Path to directory where output is written to.
min_match_pct – Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.
min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

path_1¶: Path to single-end library or first mate file.

path_2¶: Path to second mate file.

adapter_file¶: Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.

out_dir¶: Path to directory where output is written to.

min_match_pct¶: Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.

min_freq_ratio¶: Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

results¶: Results container for storing adapter sequence information for the provided files.

Examples

>>> GetReadLayout(
...     path_1="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
... ).evaluate()
ResultsLayout(
    file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
    file_2=<Layout().adapt_3: None>,
)
>>> GetReadLayout(
...     path_1="tests/files/sra_sample_1.fastq",
...     path_2="tests/files/sra_sample_2.fastq",
...     adapter_file="data/adapter_fragments.txt",
...     min_match_pct=2,
...     min_freq_ratio=1,
... ).evaluate()
ResultsLayout(
    file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
    file_2=<Layout().adapt_3: "AAAAAAAAAAAAAAA">,
)

evaluate() → None¶: Decide adapter sequence.

htsinfer.get_read_orientation module¶

Infer read orientation from sample data.

class htsinfer.get_read_orientation.GetOrientation(paths: Tuple[pathlib.Path, Optional[pathlib.Path]], library_type: htsinfer.models.ResultsType, library_source: htsinfer.models.ResultsSource, transcripts_file: pathlib.Path, tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), threads_star: int = 1, min_mapped_reads: int = 20, min_fraction: float = 0.75)¶

Bases: object

Determine library strandedness and relative read orientation of a single- or paired-end sequencing library.

Parameters

paths – Tuple of one or two paths for single-end and paired-end library files.
library_type – ResultsType object with library type and mate relationship.
library_source – ResultsSource object with source information on each library file.
transcripts_file – File path to an uncompressed transcripts file in FASTA format.
tmp_dir – Path to directory where temporary output is written to.
threads_star – Number of threads to run STAR with.
source – Source (organism, tissue, etc.) of the sequencing library.
min_mapped_reads – Minimum number of mapped reads for deeming the read orientation result reliable.
min_fraction – Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.
mate_relationship – Type/mate relationship between the provided files.

paths¶: Tuple of one or two paths for single-end and paired-end library files.

library_type¶: ResultsType object with library type and mate relationship.

library_source¶: ResultsSource object with source information on each library file.

transcripts_file¶: File path to an uncompressed transcripts file in FASTA format.

tmp_dir¶: Path to directory where temporary output is written to.

threads_star¶: Number of threads to run STAR with.

source¶: Source (organism, tissue, etc.) of the sequencing library.

min_mapped_reads¶: Minimum number of mapped reads for deeming the read orientation result reliable.

min_fraction¶: Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

mate_relationship¶: Type/mate relationship between the provided files.

create_star_index(fasta: pathlib.Path, index_string_size: int = 5) → pathlib.Path¶

Prepare STAR index.

Parameters

fasta – Path to FASTA file of sequence records to create index from.
index_string_size – Size of SA pre-indexing string, in nucleotides.

Returns

Path to directory containing STAR index.

Raises

StarProblem – STAR index could not be created.

evaluate() → htsinfer.models.ResultsOrientation¶

Infer read orientation.

Returns: Orientation results object.

static generate_star_alignments(commands: Dict[pathlib.Path, List[str]]) → None¶

Align reads to index with STAR.

Parameters: commands – Dictionary of output paths and corresponding STAR commands.
Raises: StarProblem – Generating alignments failed.

static get_fasta_size(fasta: pathlib.Path) → int¶

Get size of FASTA file in total nucleotides.

Parameters: fasta – Path to FASTA file.
Returns: Total number of nucleotides of all records.
Raises: FileProblem – Could not open FASTA file for reading.

static get_frequencies(*items: Any) → Dict[Any, float]¶

Get frequencies of arguments as fractions of the number of all arguments.

Parameters: *items – Items to get frequencies for.
Returns: Dictionary of arguments and their frequencies.
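A minimal sketch of such a frequency calculation, using collections.Counter from the standard library. The function name frequencies is an assumption for this example and the body is not copied from HTSinfer.

```python
from collections import Counter
from typing import Any, Dict


def frequencies(*items: Any) -> Dict[Any, float]:
    """Map each distinct argument to its share of all arguments."""
    counts = Counter(items)
    total = len(items)
    # With no arguments the comprehension is empty, so no division occurs
    return {item: count / total for item, count in counts.items()}


print(frequencies("SF", "SF", "SR", "SF"))
# → {'SF': 0.75, 'SR': 0.25}
```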

static get_star_index_string_size(ref_size: int) → int¶

Get length of STAR SA pre-indexing string.

Cf. https://github.com/alexdobin/STAR/blob/51b64d4fafb7586459b8a61303e40beceeead8c0/doc/STARmanual.pdf

Parameters: ref_size – Size of genome/transcriptome reference in nucleotides.
Returns: Size (in nucleotides) of SA pre-indexing string.
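The STAR manual recommends scaling the --genomeSAindexNbases parameter for small references down to min(14, log2(GenomeLength)/2 - 1). The sketch below applies that formula; rounding down and the lower bound of 1 are assumptions of this example, and whether HTSinfer computes the value in exactly this way is not stated here.

```python
import math


def star_index_string_size(ref_size: int) -> int:
    """Suggested SA pre-indexing string length for a reference of ref_size nt."""
    if ref_size < 1:
        raise ValueError("Reference size must be a positive integer")
    # Formula from the STAR manual, rounded down to a whole number of bases
    suggested = int(math.log2(ref_size) / 2 - 1)
    # Assumption: never return less than 1, never more than STAR's default 14
    return max(1, min(14, suggested))
```

For a reference of roughly one million nucleotides (2**20) this yields 9, and for very large references it is capped at 14.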

prepare_star_alignment_commands(index_dir: pathlib.Path) → Dict[pathlib.Path, List[str]]¶

Prepare STAR alignment commands.

Parameters: index_dir – Path to directory containing STAR index.
Returns: Dictionary of output paths and corresponding STAR commands.

process_alignments(star_dirs: List[pathlib.Path]) → htsinfer.models.ResultsOrientation¶

Determine read orientation of one or two single-ended sequencing libraries or one paired-end sequencing library.

Parameters: star_dirs – List of one or two paths to STAR output directories.
Returns: Read orientation state of library or libraries.

process_paired(sam: pathlib.Path) → htsinfer.models.ResultsOrientation¶

Determine read orientation of a paired-ended sequencing library.

Parameters

sam – Path to SAM file.

Returns

Read orientation state of each mate and orientation state relationship of library.

process_single(sam: pathlib.Path) → htsinfer.models.StatesOrientation¶

Determine read orientation of a single-ended sequencing library.

Parameters: sam – Path to SAM file.
Returns: Read orientation state of library.
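How min_mapped_reads and min_fraction might interact can be sketched with plain counts of reads mapped to each strand. The function name orientation_from_counts is hypothetical, and treating the case where neither strand reaches min_fraction as unstranded is an assumption of this example rather than documented behavior. The returned strings match the StatesOrientation values 'SF', 'SR' and 'U'.

```python
from typing import Optional


def orientation_from_counts(
    forward: int,
    reverse: int,
    min_mapped_reads: int = 20,
    min_fraction: float = 0.75,
) -> Optional[str]:
    """Decide an orientation state from forward/reverse mapped read counts."""
    total = forward + reverse
    # Too few mapped reads: no reliable call (None stands for not_available)
    if total == 0 or total < min_mapped_reads:
        return None
    if forward / total >= min_fraction:
        return "SF"
    if reverse / total >= min_fraction:
        return "SR"
    # Assumption: neither strand dominates, so report unstranded
    return "U"
```

Because min_fraction must be above 0.5, at most one of the two stranded states can satisfy it for any given library.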

subset_transcripts_by_organism() → pathlib.Path¶

Filter FASTA file of transcripts by current sources.

The filtered file contains records from the indicated sources. Typically, this is one source. However, if two input files were supplied that originate from different sources (i.e., not from a valid paired-ended library), it may contain records from two different sources. If no source is supplied (because it could not be inferred), no filtering is done.

Returns: Path to filtered FASTA file.
Raises: FileProblem – Could not open input/output FASTA file for reading/writing.
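The filtering described above can be sketched as a generator over FASTA lines that keeps a record when the fourth |-separated column of its header matches one of the requested short names. The function name filter_fasta_by_source is hypothetical, and operating on an iterable of lines rather than on files is a simplification made for this example.

```python
from typing import Iterable, Iterator, Set


def filter_fasta_by_source(lines: Iterable[str], short_names: Set[str]) -> Iterator[str]:
    """Yield FASTA lines of records whose source short name is requested."""
    # With no source supplied, every record is kept
    keep = not short_names
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].strip().split("|")
            # Fourth column holds the organism short name
            keep = not short_names or (len(fields) > 3 and fields[3] in short_names)
        if keep:
            yield line
```

Passing two short names covers the case of two input files from different sources, and an empty set leaves the input unfiltered.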

static sum_dicts(*dicts: Dict[Any, float]) → Dict[Any, float]¶

Sum of dictionaries with numeric values.

Parameters: *dicts – Dictionaries to sum up.
Returns: Dictionary with union of keys of input dictionaries and all values added up.
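A minimal sketch of summing dictionaries key by key, so that the result holds the union of all keys. The name sum_dicts_sketch is an assumption used to keep this example distinct from the documented method.

```python
from typing import Any, Dict


def sum_dicts_sketch(*dicts: Dict[Any, float]) -> Dict[Any, float]:
    """Add up values per key across all input dictionaries."""
    total: Dict[Any, float] = {}
    for mapping in dicts:
        for key, value in mapping.items():
            # Keys missing so far start from zero
            total[key] = total.get(key, 0) + value
    return total
```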

htsinfer.htsinfer module¶

Main module.

class htsinfer.htsinfer.HtsInfer(path_1: pathlib.Path, path_2: Optional[pathlib.Path] = None, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), tmp_dir: pathlib.Path = PosixPath('/tmp/tmp_htsinfer'), cleanup_regime: htsinfer.models.CleanupRegimes = CleanupRegimes.DEFAULT, records: int = 0, threads: int = 1, transcripts_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/transcripts.fasta.gz'), read_layout_adapter_file: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/data/adapter_fragments.txt'), read_layout_min_match_pct: float = 2, read_layout_min_freq_ratio: float = 2, lib_source_min_match_pct: float = 2, lib_source_min_freq_ratio: float = 2, read_orientation_min_mapped_reads: int = 20, read_orientation_min_fraction: float = 0.75)¶

Bases: object

Determine sequencing library metadata.

Parameters

path_1 – Path to single-end library or first mate file.
path_2 – Path to second mate file.
out_dir – Path to directory where output is written to.
tmp_dir – Path to directory where temporary output is written to.
cleanup_regime – Which data to keep after run concludes; one of CleanupRegimes.
records – Number of input file records to process; set to 0 to process all records.
threads – Number of threads to run STAR with.
transcripts_file – File path to transcripts FASTA file.
read_layout_adapter_file – Path to text file containing 3’ adapter sequences to scan for (one sequence per line).
read_layout_min_match_pct – Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.
read_layout_min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
lib_source_min_match_pct – Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.
lib_source_min_freq_ratio – Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.
read_orientation_min_mapped_reads – Minimum number of mapped reads for deeming the read orientation result reliable.
read_orientation_min_fraction – Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

path_1¶: Path to single-end library or first mate file.

path_2¶: Path to second mate file.

out_dir¶: Path to directory where output is written to.

run_id¶: Random string identifier for HTSinfer run.

tmp_dir¶: Path to directory where temporary output is written to.

cleanup_regime¶: Which data to keep after run concludes; one of CleanupRegimes.

records¶: Number of input file records to process.

threads¶: Number of threads to run STAR with.

transcripts_file¶: File path to transcripts FASTA file.

read_layout_adapter_file¶: Path to text file containing 3’ adapter sequences to scan for (one sequence per line).

read_layout_min_match_pct¶: Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.

read_layout_min_freq_ratio¶: Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.

lib_source_min_match_pct¶: Minimum percentage of reads that are consistent with a given source in order for it to be considered the library’s source.

lib_source_min_freq_ratio¶: Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.

read_orientation_min_mapped_reads¶: Minimum number of mapped reads for deeming the read orientation result reliable.

read_orientation_min_fraction¶: Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.

path_1_processed¶: Path to processed path_1 file.

path_2_processed¶: Path to processed path_2 file.

transcripts_file_processed¶: Path to processed transcripts_file file.

state¶: State of the run; one of RunStates.

results¶: Results container for storing determined library metadata.

clean_up()¶: Clean up work environment.

evaluate()¶: Determine library metadata.

get_library_source() → htsinfer.models.ResultsSource¶

Determine library source.

Returns: Library source results.

get_library_stats()¶: Determine library statistics.

get_library_type()¶: Determine library type.

get_read_layout()¶: Determine read layout.

get_read_orientation()¶: Determine read orientation.

prepare_env()¶: Set up work environment.

print()¶: Print results to STDOUT.

process_inputs()¶: Process and validate inputs.

htsinfer.models module¶

Data models.

class htsinfer.models.CleanupRegimes(value)¶

Bases: enum.Enum

Enumerator of cleanup regimes.

DEFAULT = 'default'¶

KEEP_ALL = 'keep_all'¶

KEEP_NONE = 'keep_none'¶

KEEP_RESULTS = 'keep_results'¶

class htsinfer.models.Layout(*, adapt_3: str = None)¶

Bases: pydantic.main.BaseModel

Read layout of a single sequencing file.

Parameters: adapt_3 – Adapter sequence ligated to 3’-end of sequence.

adapt_3¶

Adapter sequence ligated to 3’-end of sequence.

Type: Optional[str]

adapt_3: Optional[str]¶

class htsinfer.models.LogLevels(value)¶

Bases: enum.Enum

Log level enumerator.

CRITICAL = 50¶

DEBUG = 10¶

ERROR = 40¶

INFO = 20¶

WARN = 30¶

WARNING = 30¶

class htsinfer.models.ReadLength(*, min: int = None, max: int = None)¶

Bases: pydantic.main.BaseModel

Read length of a sequencing file.

Parameters

min – Minimum read length.
max – Maximum read length.

min¶

Minimum read length.

Type: Optional[int]

max¶

Maximum read length.

Type: Optional[int]

max: Optional[int]¶

min: Optional[int]¶

class htsinfer.models.Results(*, library_stats: htsinfer.models.ResultsStats = ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None)), file_2=Stats(read_length=ReadLength(min=None, max=None))), library_type: htsinfer.models.ResultsType = ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>), library_source: htsinfer.models.ResultsSource = ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None)), read_orientation: htsinfer.models.ResultsOrientation = ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>), read_layout: htsinfer.models.ResultsLayout = ResultsLayout(file_1=Layout(adapt_3=None), file_2=Layout(adapt_3=None)))¶

Bases: pydantic.main.BaseModel

Container class for aggregating results from the different inference functionalities.

Parameters

library_stats – Library statistics inference results.
library_type – Library type inference results.
library_source – Library source inference results.
read_orientation – Read orientation inference results.
read_layout – Read layout inference results.

library_source: htsinfer.models.ResultsSource¶

library_stats: htsinfer.models.ResultsStats¶

library_type: htsinfer.models.ResultsType¶

read_layout: htsinfer.models.ResultsLayout¶

read_orientation: htsinfer.models.ResultsOrientation¶

class htsinfer.models.ResultsLayout(*, file_1: htsinfer.models.Layout = Layout(adapt_3=None), file_2: htsinfer.models.Layout = Layout(adapt_3=None))¶

Bases: pydantic.main.BaseModel

Container class for read layout of a sequencing library.

Parameters

file_1 – Adapter sequence present in first file.
file_2 – Adapter sequence present in second file.

file_1¶

Adapter sequence present in first file.

Type: htsinfer.models.Layout

file_2¶

Adapter sequence present in second file.

Type: htsinfer.models.Layout

file_1: htsinfer.models.Layout¶

file_2: htsinfer.models.Layout¶

class htsinfer.models.ResultsOrientation(*, file_1: htsinfer.models.StatesOrientation = StatesOrientation.not_available, file_2: htsinfer.models.StatesOrientation = StatesOrientation.not_available, relationship: htsinfer.models.StatesOrientationRelationship = StatesOrientationRelationship.not_available)¶

Bases: pydantic.main.BaseModel

Container class for aggregating library orientation.

Parameters

file_1 – Read orientation of first file.
file_2 – Read orientation of second file.
relationship – Orientation type relationship between the provided files.

file_1¶

Read orientation of first file.

Type: htsinfer.models.StatesOrientation

file_2¶

Read orientation of second file.

Type: htsinfer.models.StatesOrientation

file_1: htsinfer.models.StatesOrientation¶

file_2: htsinfer.models.StatesOrientation¶

relationship: htsinfer.models.StatesOrientationRelationship¶

class htsinfer.models.ResultsSource(*, file_1: htsinfer.models.Source = Source(short_name=None, taxon_id=None), file_2: htsinfer.models.Source = Source(short_name=None, taxon_id=None))¶

Bases: pydantic.main.BaseModel

Container class for aggregating library source.

Parameters

file_1 – Library source of the first file.
file_2 – Library source of the second file.

file_1¶

Library source of the first file.

Type: htsinfer.models.Source

file_2¶

Library source of the second file.

Type: htsinfer.models.Source

file_1: htsinfer.models.Source¶

file_2: htsinfer.models.Source¶

class htsinfer.models.ResultsStats(*, file_1: htsinfer.models.Stats = Stats(read_length=ReadLength(min=None, max=None)), file_2: htsinfer.models.Stats = Stats(read_length=ReadLength(min=None, max=None)))¶

Bases: pydantic.main.BaseModel

Container class for aggregating library statistics information.

Parameters

file_1 – Library statistics for the first file.
file_2 – Library statistics for the second file.

file_1¶

Library statistics for the first file.

Type: htsinfer.models.Stats

file_2¶

Library statistics for the second file.

Type: htsinfer.models.Stats

file_1: htsinfer.models.Stats¶

file_2: htsinfer.models.Stats¶

class htsinfer.models.ResultsType(*, file_1: htsinfer.models.StatesType = StatesType.not_available, file_2: htsinfer.models.StatesType = StatesType.not_available, relationship: htsinfer.models.StatesTypeRelationship = StatesTypeRelationship.not_available)¶

Bases: pydantic.main.BaseModel

Container class for aggregating library type and mate relationship information.

Parameters

file_1 – Library type of the first file.
file_2 – Library type of the second file.
relationship – Type/mate relationship between the provided files.

file_1¶

Library type of the first file.

Type: htsinfer.models.StatesType

file_2¶

Library type of the second file.

Type: htsinfer.models.StatesType

relationship¶

Type/mate relationship between the provided files.

Type: htsinfer.models.StatesTypeRelationship

file_1: htsinfer.models.StatesType¶

file_2: htsinfer.models.StatesType¶

relationship: htsinfer.models.StatesTypeRelationship¶

class htsinfer.models.RunStates(value)¶

Bases: enum.IntEnum

Enumerator of run states and exit codes.

ERROR = 2¶

OKAY = 0¶

WARNING = 1¶

class htsinfer.models.SeqIdFormats(value)¶

Bases: enum.Enum

An enumeration.

class htsinfer.models.Source(*, short_name: str = None, taxon_id: int = None)¶

Bases: pydantic.main.BaseModel

Library source of an individual sequencing file.

Parameters

short_name – Library source short name, e.g., “hsapiens”.
taxon_id – Library source taxon identifier, e.g., 9606.

short_name¶

Library source short name, e.g., “hsapiens”.

Type: Optional[str]

taxon_id¶

Library source taxon identifier, e.g., 9606.

Type: Optional[int]

short_name: Optional[str]¶

taxon_id: Optional[int]¶

class htsinfer.models.StatesOrientation(value)¶

Bases: enum.Enum

Enumerator of read orientation types for individual library files. Cf. https://salmon.readthedocs.io/en/latest/library_type.html

not_available¶: Orientation type information is not available for a given file, either because no file was provided, the file could not be parsed, or an orientation type has not yet been assigned.

stranded_forward¶: Reads are stranded and come from the forward strand.

stranded_reverse¶: Reads are stranded and come from the reverse strand.

unstranded¶: Reads are unstranded.

not_available = None¶

stranded_forward = 'SF'¶

stranded_reverse = 'SR'¶

unstranded = 'U'¶

class htsinfer.models.StatesOrientationRelationship(value)¶

Bases: enum.Enum

Enumerator of read orientation type relationships for paired-ended libraries. Cf. https://salmon.readthedocs.io/en/latest/library_type.html

inward_stranded_forward¶: Mates are oriented toward each other, the library is stranded, and first mates come from the forward strand.

inward_stranded_reverse¶: Mates are oriented toward each other, the library is stranded, and first mates come from the reverse strand.

inward_unstranded¶: Mates are oriented toward each other and the library is unstranded.

not_available¶: Orientation type relationship information is not available, likely because only a single file was provided or because the orientation type relationship has not been or could not be evaluated.

inward_stranded_forward = 'ISF'¶

inward_stranded_reverse = 'ISR'¶

inward_unstranded = 'IU'¶

not_available = None¶

class htsinfer.models.StatesType(value)¶

Bases: enum.Enum

Possible outcomes of determining the sequencing library type of an individual FASTQ file.

file_problem¶: There was a problem with opening or parsing the file.

first_mate¶: All of the sequence identifiers of the processed file indicate that the library represents the first mate of a paired-end library.

mixed_mates¶: All of the sequence identifiers of the processed file include mate information. However, the file includes at least one record for each mate, indicating that the library represents a mixed mate library.

not_available¶: Library type information is not available for a given file, either because no file was provided, the file could not be parsed, a library type has not yet been assigned, or the processed file contains records with sequence identifiers that are of an unknown format, of different formats, or inconsistent in that they indicate the library represents both a single-end and a paired-end library at the same time.

second_mate¶: All of the sequence identifiers of the processed file indicate that the library represents the second mate of a paired-end library.

single¶: All of the sequence identifiers of the processed file indicate that the library represents a single-end library.

first_mate = 'first_mate'¶

mixed_mates = 'mixed_mates'¶

not_available = None¶

second_mate = 'second_mate'¶

single = 'single'¶

class htsinfer.models.StatesTypeRelationship(value)¶

Bases: enum.Enum

Possible outcomes of determining the sequencing library type/mate relationship between two FASTQ files.

not_available¶: Mate relationship information is not available, likely because only a single file was provided or because the mate relationship has not yet been evaluated.

not_mates¶: The library type information of the files is not compatible, either because the provided files are not a pair of first and second mate files, or because the files do not have compatible sequence identifiers.

split_mates¶: One of the provided files represents the first and the other the second mates of a paired-end library.

not_available = None¶

not_mates = 'not_mates'¶

split_mates = 'split_mates'¶
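
The outcomes described above can be sketched as a function of the two per-file library types. This is a simplified illustration that mirrors the enums locally. The helper relate is hypothetical, it omits the file_problem member, and it does not perform the check for compatible sequence identifiers mentioned under not_mates.

```python
import enum


class StatesType(enum.Enum):
    """Local mirror of the documented values, for illustration only."""
    not_available = None
    first_mate = "first_mate"
    second_mate = "second_mate"
    mixed_mates = "mixed_mates"
    single = "single"


class StatesTypeRelationship(enum.Enum):
    """Local mirror of the documented values, for illustration only."""
    not_available = None
    not_mates = "not_mates"
    split_mates = "split_mates"


def relate(file_1: StatesType, file_2: StatesType) -> StatesTypeRelationship:
    """Hypothetical helper; ignores the identifier compatibility check."""
    # No relationship can be derived if either type is unknown or missing
    if StatesType.not_available in (file_1, file_2):
        return StatesTypeRelationship.not_available
    # One first-mate file and one second-mate file, in either order
    if {file_1, file_2} == {StatesType.first_mate, StatesType.second_mate}:
        return StatesTypeRelationship.split_mates
    return StatesTypeRelationship.not_mates
```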

class htsinfer.models.Stats(*, read_length: htsinfer.models.ReadLength = ReadLength(min=None, max=None))¶

Bases: pydantic.main.BaseModel

Library statistics of an individual sequencing file.

Parameters: read_length – Tuple of minimum and maximum length of reads in library.

read_length¶

Tuple of minimum and maximum length of reads in library.

Type: htsinfer.models.ReadLength

read_length: htsinfer.models.ReadLength¶

htsinfer.subset_fastq module¶

FASTQ subsetting, extraction and validation.

class htsinfer.subset_fastq.SubsetFastq(path: pathlib.Path, out_dir: pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/latest/docs/api/results_htsinfer'), records: int = 0)¶

Bases: object

Subset, uncompress and validate a FASTQ file.

Parameters

path – Path to FASTQ file.
out_dir – Path to directory where output is written to.
records – Number of input file records to process; set to 0 to process all records.

path¶: Path to FASTQ file.

out_dir¶: Path to directory where output is written to.

records¶: Number of input file records to process.

out_path¶: Path of the uncompressed, filtered output file.

n_processed¶: Total number of processed records.

Raises: FileProblem – The input file could not be parsed or the output file could not be written.

process()¶: Uncompress, subset and validate files.
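
To illustrate what uncompressing, subsetting and validating a FASTQ file can involve, here is a standard-library sketch. It is not the SubsetFastq implementation: the function names read_fastq and subset_fastq are hypothetical, the validation is limited to checking the "@" and "+" marker lines, and a ValueError is raised where the class documents FileProblem.

```python
import gzip
import itertools
from pathlib import Path
from typing import Iterator, Tuple

Record = Tuple[str, str, str, str]


def read_fastq(path: Path) -> Iterator[Record]:
    """Yield four-line FASTQ records from a plain or gzip-compressed file."""
    # Detect gzip compression from the two-byte magic number
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        while True:
            lines = [handle.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:
                return  # end of file
            # Minimal validation: header and separator marker lines
            if not lines[0].startswith("@") or not lines[2].startswith("+"):
                raise ValueError(f"Malformed FASTQ record: {lines[0]!r}")
            yield (lines[0], lines[1], lines[2], lines[3])


def subset_fastq(path: Path, out_path: Path, records: int = 0) -> int:
    """Write the first `records` records (all if 0); return the number written."""
    stream = read_fastq(path)
    if records > 0:
        stream = itertools.islice(stream, records)
    n_processed = 0
    with open(out_path, "w") as out:
        for record in stream:
            out.write("\n".join(record) + "\n")
            n_processed += 1
    return n_processed
```

As with the records parameter above, passing 0 processes the whole input.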

htsinfer.utils module¶

Utilities used across multiple HTSinfer modules.

htsinfer.utils.convert_dict_to_df(dic: Dict, col_headers: Optional[Tuple[str, str]] = None, sort: bool = False, sort_by: int = 0, sort_ascending: bool = True) → pandas.core.frame.DataFrame¶

Convert dictionary to two-column data frame.

Parameters

dic – Dictionary to convert.
col_headers – Tuple of column headers. Length MUST match the number of data frame columns.
sort – Whether the resulting data frame is supposed to be sorted.
sort_by – Column index used for sorting. Ignored if sort is False.
sort_ascending – Whether the data frame is supposed to be sorted in ascending order. Ignored if sort is False.

Returns

Data frame prepared from dictionary.

Raises

ValueError – Raised if number of provided column headers does not match the number of data frame columns.

htsinfer.utils.validate_top_score(vector: List[float], min_value: float = 2, min_ratio: float = 2, accept_zero: bool = True, rev_sorted: bool = True) → bool¶

Validate whether (1) the maximum value of a numeric list is equal to or higher than a specified minimum value AND (2) the ratio of the first and second highest values of the list is higher than a specified minimum ratio.

If the passed list/vector does NOT contain at least two items, the function returns False.

Parameters

vector – List of numbers.
min_value – Minimum value that the highest item of the list must reach for validation to pass.
min_ratio – Minimum ratio of the highest to the second highest item of the list required for validation to pass.
accept_zero – Whether to accept a top score (i.e., return True) if the second highest value in the provided list is zero. If not set to True, False is returned in these cases.
rev_sorted – Whether the list of numbers is sorted in descending numeric order.

Returns

Whether the list satisfies the min_value and min_ratio constraints.

Raises

ValueError – Raised if one of the list items cannot be interpreted as a number.
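
The documented checks can be sketched in plain Python as follows. This is a re-implementation written from the description above, not HTSinfer's code; the name validate_top_score_sketch is hypothetical. It assumes that the min_value check is applied before the zero check and that a ratio exactly equal to min_ratio does not pass.

```python
from typing import List


def validate_top_score_sketch(
    vector: List[float],
    min_value: float = 2,
    min_ratio: float = 2,
    accept_zero: bool = True,
    rev_sorted: bool = True,
) -> bool:
    """Sketch of the documented behaviour; not HTSinfer's implementation."""
    # Fewer than two items: nothing to compare the top score against
    if len(vector) < 2:
        return False
    # float() raises ValueError for non-numeric strings such as "abc"
    values = [float(item) for item in vector]
    if not rev_sorted:
        values.sort(reverse=True)
    top, second = values[0], values[1]
    # Condition (1): top score must reach the minimum value
    if top < min_value:
        return False
    # Assumption: a zero runner-up passes only if accept_zero is set
    if second == 0:
        return accept_zero
    # Condition (2): ratio must be strictly higher than min_ratio
    return top / second > min_ratio
```

For example, a vector of [10, 2] passes with the defaults because 10 >= 2 and 10 / 2 > 2, whereas [10, 6] fails the ratio check.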