htsinfer package
HTSinfer project root
Submodules
htsinfer.cli module
Command-line interface client.
- htsinfer.cli.main() None
Entry point for CLI executable.
- htsinfer.cli.parse_args() Namespace
Parse CLI arguments.
- Returns:
Parsed CLI arguments.
- htsinfer.cli.setup_logging(verbosity: str = 'INFO') None
Configure logging.
- Parameters:
verbosity – Level of logging verbosity.
htsinfer.exceptions module
Custom exceptions.
- exception htsinfer.exceptions.CutadaptProblem
Bases:
Exception
Exception raised when running cutadapt commands.
- exception htsinfer.exceptions.FileProblem
Bases:
Exception
Exception raised when file could not be opened or parsed.
- exception htsinfer.exceptions.InconsistentFastqIdentifiers
Bases:
Exception
Exception raised when inconsistent FASTQ sequence identifiers were ecountered.
- exception htsinfer.exceptions.KallistoProblem
Bases:
Exception
Exception raised when running kallisto index and quant commands.
- exception htsinfer.exceptions.MetadataWarning
Bases:
Exception
Exception raised when metadata could not be determined.
- exception htsinfer.exceptions.SamFileProblem
Bases:
Exception
Exception raised when an invalid sam file is encountered.
- exception htsinfer.exceptions.StarProblem
Bases:
Exception
Exception raised when running STAR index and quant commands.
- exception htsinfer.exceptions.TranscriptsFastaProblem
Bases:
Exception
Exception raised when an invalid transcripts fasta file is passed.
- exception htsinfer.exceptions.UnknownFastqIdentifier
Bases:
Exception
Exception raised when a FASTQ sequence identifier of unknown format was ecountered.
- exception htsinfer.exceptions.UnsupportedSampleSourceException
Bases:
Exception
Exception raised when taxonomy ID is not supported.
- exception htsinfer.exceptions.WorkEnvProblem
Bases:
Exception
Exception raised when the work environment could not be set up or cleaned.
htsinfer.get_library_source module
Infer library source from sample data.
- class htsinfer.get_library_source.GetLibSource(config: Config)
Bases:
object
Determine the source of FASTQ sequencing of a single- or paired-end seguencing library.
- Parameters:
config – Container class for all arguments used in inference and results produced by the class.
- Attrubutes:
- paths: Tuple of one or two paths for single-end and paired end library
files.
- transcripts_file: File path to an uncompressed transcripts file in
FASTA format. Expected to contain |-separated sequence identifier lines that contain an organism short name and a taxon identifier in the fourth and fifth columns, respectively. Example sequence identifier: rpl-13|ACYPI006272|ACYPI006272-RA|apisum|7029
out_dir: Path to directory where output is written to. tmp_dir: Path to directory where temporary output is written to. min_match_pct: Minimum percentage of reads that are consistent with a
given source in order for it to be considered as the to be considered the library’s source.
- min_freq_ratio: Minimum frequency ratio between the first and second
most frequent source in order for the former to be considered the library’s source.
tax_id: Taxonomy ID of the sample source.
- create_kallisto_index() Path
Build Kallisto index from FASTA file of target sequences.
- Returns:
Path to Kallisto index.
- Raises:
KallistoProblem – Kallisto index could not be created.
- evaluate() ResultsSource
Infer read source.
- Returns:
Source results object.
- get_source(fastq: Path, index: Path) Source
Determine source of a single sequencing library file.
- Parameters:
fastq – Path to FASTQ file.
index – Path to Kallisto index.
- Returns:
Source of library file.
- static get_source_expression(kallisto_dir: Path) DataFrame
Return percentages of total expression per read source.
- Parameters:
kallisto_dir – Directory containing Kallisto quantification results.
- Returns:
- Data frame with columns source_ids (a tuple of source short name
and taxon identifier, e.g., (“hsapiens”, 9606)) and tpm, signifying the percentages of total expression per read source. The data frame is sorted by total expression in descending order.
- Raises:
FileProblem – Kallisto quantification results could not be processed.
- static get_source_name(taxon_id: int, transcripts_file: Path) str
Return name of the source organism, based on tax ID.
- Parameters:
taxon_id – Taxonomy ID of a given organism.
transcripts_file – Path to FASTA file containing transcripts.
- Returns:
Short name of the organism belonging to the given tax ID.
- Raises:
FileProblem – Could not process input FASTA file.
UnsupportedSampleSourceException – Taxon ID is not supported.
- run_kallisto_quantification(fastq: Path, index: Path) Path
Run Kallisto quantification on individual sequencing library file.
- Parameters:
fastq – Path to FASTQ file.
index – Path to Kallisto index.
- Returns:
Path to output directory.
- Raises:
KallistoProblem – Kallisto quantification failed.
htsinfer.get_library_stats module
Infer read orientation from sample data.
- class htsinfer.get_library_stats.GetLibStats(config: Config)
Bases:
object
Determine library statistics of a single- or paired-end sequencing library.
- Parameters:
config – Container class for all arguments used in inference and results produced by the class.
- paths
Tuple of one or two paths for single-end and paired end library files.
- tmp_dir
Path to directory where temporary output is written to.
- evaluate() ResultsStats
Infer read statistics.
- Returns:
Statistics results object.
- static fastq_get_stats_read_length(fastq: Path) Tuple[int, int, float, int, int]
Get number of records in a FASTQ file.
- Parameters:
fastq – Path to FASTQ file.
- Returns:
Tuple of minimum and maximum read lengths in input file.
- Raises:
FileProblem – Could not process FASTQ file.
htsinfer.get_library_type module
Infer mate information from sample data.
- class htsinfer.get_library_type.GetFastqType(path: Path)
Bases:
object
Determine type (single/paired) information for an individual FASTQ sequencing library.
- Parameters:
path – File path to read library.
- path
File path to read library.
- seq_ids
List of sequence identifier prefixes of the provided read library, i.e., the fragments up until the mate information, if available, as defined by a named capture group
prefix
in a regular expression to extract mate information.
- seq_id_format
The sequence identifier format of the read library, as identified by inspecting the first read and matching one of the available regular expressions for the different identifier formats.
- result
The current best guess for the type of the provided library.
Examples
>>> lib_type = GetFastqType( ... path="tests/files/first_mate.fastq" ... ).evaluate() <OutcomesType.first_mate: 'first_mate'>
- evaluate() None
Decide library type.
- Raises:
NoMetadataDetermined – Type information could not be determined.
- class htsinfer.get_library_type.GetLibType(config: Config, mapping: Mapping)
Bases:
object
- Determine type (single/paired) information for a single or a pair of
FASTQ sequencing libraries.
- Args:
- config: Container class for all arguments used in inference
and results produced by the class.
- Attributes:
path_1: Path to single-end library or first mate file. path_2: Path to second mate file. results: Results container for storing library type information for
the provided files, as well as the mate relationship between the two files, if applicable.
- Examples:
>>> GetLibType( ... path_1="tests/files/first_mate.fastq" ... ).evaluate() ResultsType(file_1=<OutcomesType.single: 'single'>, file_2=<OutcomesTyp
e.not_available: ‘not_available’>, relationship=<OutcomesTypeRelationship.not_a vailable: ‘not_available’>)
>>> GetLibType( ... path_1="tests/files/first_mate.fastq", ... path_2="../tests/test_files/second_mate.fastq", ... ).evaluate() ResultsType(file_1=<OutcomesType.first_mate: 'first_mate'>, file_2=<Out
comesType.second_mate: ‘second_mate’>, relationship=<OutcomesTypeRelationship.s plit_mates: ‘split_mates’>)
(‘first_mate’, ‘second_mate’, ‘split_mates’)
- class AlignedSegment
Bases:
object
Placeholder class for mypy “Missing attribute” error in _compare_alignments(), the actual object used is pysam.AlignedSegment class.
- evaluate() None
Decide type information and mate relationship.
htsinfer.get_read_layout module
Infer adapter sequences present in reads.
- class htsinfer.get_read_layout.GetAdapter3(path: Path, adapter_file: Path, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api'), min_match_pct: float = 0.1, min_freq_ratio: float = 2)
Bases:
object
Determine 3’ adapter sequence for an individual FASTQ library.
- Parameters:
path – File path to read library.
adapter_file – Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.
out_dir – Path to directory where output is written to.
min_match_pct – Minimum percentage of reads that contain a given adapter Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.
min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
- path
File path to read library.
- adapter_file
Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.
- out_dir
Path to directory where output is written to.
- min_match_pct
Minimum percentage of reads that contain a given adapter Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.
- min_freq_ratio
Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
- adapters
List of adapter sequences.
- trie
Trie data structure of adapter sequences.
- adapter_counts
Dictionary of adapter sequences and corresponding count percentages.
- result
The most frequent adapter sequence in FASTQ file.
Examples
>>> GetAdapter3( ... path_1="tests/files/sra_sample_2.fastq", ... adapter_file="data/adapter_fragments.txt", ... ).evaluate() <"AAAAAAAAAAAAAAA">
- evaluate() None
Search for adapter sequences and validate result confidence constraints.
- class htsinfer.get_read_layout.GetReadLayout(config: Config)
Bases:
object
Determine the adapter sequence present in the FASTQ sequencing libraries.
- Parameters:
config – Container class for all arguments used in inference and results produced by the class.
- path_1
Path to single-end library or first mate file.
- path_2
Path to second mate file.
- adapter_file
Path to text file containing 3’ adapter sequences (one sequence per line) to scan for.
- out_dir
Path to directory where output is written to.
- min_match_pct
Minimum percentage of reads that contain a given adapter sequence in order for it to be considered as the library’s 3’-end adapter.
- min_freq_ratio
Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
- results
Results container for storing adapter sequence information for the provided files.
Examples
>>> GetReadLayout( ... path_1="tests/files/sra_sample_2.fastq", ... adapter_file="data/adapter_fragments.txt", ... ).evaluate() ResultsLayout( file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">, file_2=<Layout().adapt_3: None>, ) >>> GetReadLayout( ... path_1="tests/files/sra_sample_1.fastq", ... path_2="tests/files/sra_sample_2.fastq", ... adapter_file="data/adapter_fragments.txt", ... min_match_pct=2, ... min_freq_ratio=1, ... ).evaluate() ResultsLayout( file_1=<Layout().adapt_3: "AAAAAAAAAAAAAAA">, file_2=<Layout().adapt_3: "AAAAAAAAAAAAAAA">, )
- evaluate() None
Decide adapter sequence.
- get_poly_a(file=PosixPath('.')) float
Run cutadapt and parse report
htsinfer.get_read_orientation module
Infer read orientation from sample data.
- class htsinfer.get_read_orientation.GetOrientation(config: Config, mapping: Mapping)
Bases:
object
Determine library strandedness and relative read orientation of a single- or paired-end seguencing library.
- Parameters:
config – Container class for all arguments used in inference and results produced by the class.
- paths
Tuple of one or two paths for single-end and paired end library files.
- library_type
ResultsType object with library type and mate relationship.
- library_source
ResultsSource object with source information on each library file.
- transcripts_file
File path to an uncompressed transcripts file in FASTA format.
- tmp_dir
Path to directory where temporary output is written to.
- threads_star
Number of threads to run STAR with.
- min_mapped_reads
Minimum number of mapped reads for deeming the read orientation result reliable.
- min_fraction
Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.
- evaluate() ResultsOrientation
Infer read orientation.
- Returns:
Orientation results object.
- static get_frequencies(*items: Any) Dict[Any, float]
Get frequencies of arguments as fractions of the number of all arguments.
- Parameters:
*items – Items to get frequencies for.
- Returns:
Dictionary of arguments and their frequencies.
- process_alignments(star_dirs: List[Path]) ResultsOrientation
Determine read orientation of one or two single-ended or one paired-end sequencing library.
- Parameters:
star_dirs – List of one or two paths to STAR output directories.
- Returns:
Read orientation state of library or libraries.
- process_paired(sam: Path) ResultsOrientation
Determine read orientation of a paired-ended sequencing library.
- Parameters:
sam – Path to SAM file.
- Returns:
- Read orientation state of each mate and orientation state
relationship of library.
- process_single(sam: Path) StatesOrientation
Determine read orientation of a single-ended sequencing library. :param sam: Path to SAM file.
- Returns:
Read orientation state of library.
- Raises:
Sam file could not be processed. –
- static sum_dicts(*dicts: Dict[Any, float]) Dict[Any, float]
Sum of dictionaries with numeric values.
- Parameters:
*dicts – Dictionaries to sum up.
- Returns:
Dictionary with union of keys of input dictionaries and all values added up.
htsinfer.htsinfer module
Main module.
- class htsinfer.htsinfer.HtsInfer(config: Config)
Bases:
object
Determine sequencing library metadata.
- Parameters:
config – Container class for all arguments used in inference and results produced by the class.
- config
Container class for all arguments used in inference and results produced by the class.
- run_id
Random string identifier for HTSinfer run.
- state
State of the run; one of RunStates.
- clean_up()
Clean up work environment.
- evaluate()
Determine library metadata.
- get_library_source() ResultsSource
Determine library source.
- Returns:
Library source results.
- get_library_stats()
Determine library statistics.
- get_library_type()
Determine library type.
- get_read_layout()
Determine read layout.
- get_read_orientation()
Determine read orientation.
- prepare_env()
Set up work environment.
- print()
Print results to STDOUT.
- process_inputs()
Process and validate inputs.
htsinfer.mapping module
Mapping FASTQ’s and managing the outputs of STAR.
- class htsinfer.mapping.Mapping(config: Config)
Bases:
object
Map FASTQ file/s and manage outputs.
- Parameters:
path – Path to FASTQ file.
- path_1
Path to single-end library or first mate file.
- path_2
Path to second mate file.
- Raises:
FileProblem – The input file could not be parsed or the output file could not be written.
- create_star_index(fasta: Path, chr_bin_bits: int = 18, index_string_size: int = 5) Path
Prepare STAR index.
- Parameters:
fasta – Path to FASTA file of sequence records to create index from.
index_string_size – Size of SA pre-indexing string, in nucleotides.
- Returns:
Path to directory containing STAR index.
- Raises:
StarProblem – STAR index could not be created.
- evaluate()
Align FASTQ files to reference transcripts with STAR.
- static generate_star_alignments(commands: Dict[Path, List[str]]) None
Align reads to index with STAR.
- Parameters:
commands – Dictionary of output paths and corresponding STAR commands.
- Raises:
StarProblem – Generating alignments failed.
- static get_fasta_size(fasta: Path) int
Get size of FASTA file in total nucleotides.
- Parameters:
fasta – Path to FASTA file.
- Returns:
Total number of nucleotides of all records.
- Raises:
FileProblem – Could not open FASTA file for reading.
- static get_star_chr_bin_bits(ref_size: int, transcripts: Path) int
Get size of bins for STAR genome storage.
- Parameters:
ref_size – Size of genome/transcriptome reference in nucleotides.
transcripts – Path to filtered FASTA transcripts file.
- Returns:
Number of bins for genome storage.
- static get_star_index_string_size(ref_size: int) int
Get length of STAR SA pre-indexing string.
Cf. https://github.com/alexdobin/STAR/blob/51b64d4fafb7586459b8a61303e40beceeead8c0/doc/STARmanual.pdf
- Parameters:
ref_size – Size of genome/transcriptome reference in nucleotides.
- Returns:
Size (in nucleotides) of SA pre-indexing string.
- prepare_star_alignment_commands(index_dir: Path) Dict[Path, List[str]]
Prepare STAR alignment commands.
Input FASTQ files are assumed to be sorted according to reference names or coordinates, the order of input reads is kept with the option “PairedKeepInputOrder”, no additional sorting of aligned reads is done.
- Parameters:
index_dir – Path to directory containing STAR index.
- Returns:
Dictionary of output paths and corresponding STAR commands.
- subset_transcripts_by_organism() Path
Filter FASTA file of transcripts by current sources.
- The filtered file contains records from the indicated sources.
Typically, this is one source. However, for if two input files were supplied that are originating from different sources (i.e., not from a valid paired-ended library), it may be from two different sources. If no source is supplied (because it could not be inferred), no filtering is done.
- Returns:
Path to filtered FASTA file.
- Raises:
FileProblem – Could not open input/output FASTA file for reading/writing.
htsinfer.models module
Data models.
- class htsinfer.models.Args(*, path_1: Path = PosixPath('.'), path_2: Path | None = None, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api/results_htsinfer'), tmp_dir: Path = PosixPath('/tmp/tmp_htsinfer'), cleanup_regime: CleanupRegimes = CleanupRegimes.DEFAULT, records: int = 1000000, threads: int = 1, tax_id: int | None = None, transcripts_file: Path = PosixPath('.'), read_layout_adapter_file: Path = PosixPath('.'), read_layout_min_match_pct: float = 0.1, read_layout_min_freq_ratio: float = 2, lib_source_min_match_pct: float = 2, lib_source_min_freq_ratio: float = 2, lib_type_max_distance: int = 1000, lib_type_mates_cutoff: float = 0.85, read_orientation_min_mapped_reads: int = 20, read_orientation_min_fraction: float = 0.75, path_1_processed: Path = PosixPath('.'), path_2_processed: Path | None = None, t_file_processed: Path = PosixPath('.'))
Bases:
BaseModel
Configuration model for CLI arguments.
- Parameters:
path_1 – Path to single-end library or first mate file.
path_2 – Path to second mate file.
out_dir – Path to directory where output is written to.
tmp_dir – Path to directory where temporary output is written to.
cleanup_regime – Which data to keep after run concludes; one of CleanupRegimes.
records – Number of input file records to process; set to 0 to process all records.
threads – Number of threads to run STAR with.
tax_id – Taxonomy ID of the sample source.
transcripts_file – File path to transcripts FASTA file.
read_layout_adapter_file – Path to text file containing 3’ adapter sequences to scan for (one sequence per line).
read_layout_min_match_pct – Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.
read_layout_min_freq_ratio – Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
lib_source_min_match_pct – Minimum percentage of reads that are consistent with a given source in order for it to be considered as the to be considered the library’s source.
lib_source_min_freq_ratio – Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.
read_orientation_min_mapped_reads – Minimum number of mapped reads for deeming the read orientation result reliable.
read_orientation_min_fraction – Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.
- path_1
Path to single-end library or first mate file.
- Type:
pathlib.Path
- path_2
Path to second mate file.
- Type:
pathlib.Path | None
- out_dir
Path to directory where output is written to.
- Type:
pathlib.Path
- run_id
Random string identifier for HTSinfer run.
- tmp_dir
Path to directory where temporary output is written to.
- Type:
pathlib.Path
- cleanup_regime
Which data to keep after run concludes; one of CleanupRegimes.
- records
Number of input file records to process.
- Type:
int
- threads
Number of threads to run STAR with.
- Type:
int
- transcripts_file
File path to transcripts FASTA file.
- Type:
pathlib.Path
- read_layout_adapter_file
Path to text file containing 3’ adapter sequences to scan for (one sequence per line).
- Type:
pathlib.Path
- read_layout_min_match_pct
Minimum percentage of reads that contain a given adapter in order for it to be considered as the library’s 3’-end adapter.
- Type:
float
- read_layout_min_freq_ratio
Minimum frequency ratio between the first and second most frequent adapter in order for the former to be considered as the library’s 3’-end adapter.
- Type:
float
- lib_source_min_match_pct
Minimum percentage of reads that are consistent with a given source in order for it to be considered as the to be considered the library’s source.
- Type:
float
- lib_source_min_freq_ratio
Minimum frequency ratio between the first and second most frequent source in order for the former to be considered the library’s source.
- Type:
float
- lib_type_max_distance
Upper limit on the difference in the reference sequence coordinates between the two mates to be considered as coming from a single fragment. (Used only when sequence identifiers do not match)
- Type:
int
- lib_type_mates_cutoff
Minimum fraction of mates that can be mapped to compatible loci and are considered concordant pairs / all mates. (Used only when sequence identifiers do not match)
- Type:
float
- read_orientation_min_mapped_reads
Minimum number of mapped reads for deeming the read orientation result reliable.
- Type:
int
- read_orientation_min_fraction
Minimum fraction of mapped reads required to be consistent with a given read orientation state in order for that orientation to be reported. Must be above 0.5.
- Type:
float
- path_1_processed
Path to processed path_1 file.
- Type:
pathlib.Path
- path_2_processed
Path to processed path_2 file.
- Type:
pathlib.Path | None
- t_file_processed
Path to processed transcripts_file file.
- Type:
pathlib.Path
- state
State of the run; one of RunStates.
- results
Results container for storing determined library metadata.
- cleanup_regime: CleanupRegimes
- lib_source_min_freq_ratio: float
- lib_source_min_match_pct: float
- lib_type_mates_cutoff: float
- lib_type_max_distance: int
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'cleanup_regime': FieldInfo(annotation=CleanupRegimes, required=False, default=<CleanupRegimes.DEFAULT: 'default'>), 'lib_source_min_freq_ratio': FieldInfo(annotation=float, required=False, default=2), 'lib_source_min_match_pct': FieldInfo(annotation=float, required=False, default=2), 'lib_type_mates_cutoff': FieldInfo(annotation=float, required=False, default=0.85), 'lib_type_max_distance': FieldInfo(annotation=int, required=False, default=1000), 'out_dir': FieldInfo(annotation=Path, required=False, default=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api/results_htsinfer')), 'path_1': FieldInfo(annotation=Path, required=False, default=PosixPath('.')), 'path_1_processed': FieldInfo(annotation=Path, required=False, default=PosixPath('.')), 'path_2': FieldInfo(annotation=Union[Path, NoneType], required=False), 'path_2_processed': FieldInfo(annotation=Union[Path, NoneType], required=False), 'read_layout_adapter_file': FieldInfo(annotation=Path, required=False, default=PosixPath('.')), 'read_layout_min_freq_ratio': FieldInfo(annotation=float, required=False, default=2), 'read_layout_min_match_pct': FieldInfo(annotation=float, required=False, default=0.1), 'read_orientation_min_fraction': FieldInfo(annotation=float, required=False, default=0.75), 'read_orientation_min_mapped_reads': FieldInfo(annotation=int, required=False, default=20), 'records': FieldInfo(annotation=int, required=False, default=1000000), 't_file_processed': FieldInfo(annotation=Path, required=False, default=PosixPath('.')), 'tax_id': FieldInfo(annotation=Union[int, NoneType], required=False), 'threads': FieldInfo(annotation=int, required=False, default=1), 'tmp_dir': FieldInfo(annotation=Path, required=False, default=PosixPath('/tmp/tmp_htsinfer')), 'transcripts_file': FieldInfo(annotation=Path, required=False, default=PosixPath('.'))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- out_dir: Path
- path_1: Path
- path_1_processed: Path
- path_2: Path | None
- path_2_processed: Path | None
- read_layout_adapter_file: Path
- read_layout_min_freq_ratio: float
- read_layout_min_match_pct: float
- read_orientation_min_fraction: float
- read_orientation_min_mapped_reads: int
- records: int
- t_file_processed: Path
- tax_id: int | None
- threads: int
- tmp_dir: Path
- transcripts_file: Path
- class htsinfer.models.CleanupRegimes(value)
Bases:
Enum
Enumerator of cleanup regimes.
- DEFAULT = 'default'
- KEEP_ALL = 'keep_all'
- KEEP_NONE = 'keep_none'
- KEEP_RESULTS = 'keep_results'
- class htsinfer.models.Config(*, args: ~htsinfer.models.Args = Args(path_1=PosixPath('.'), path_2=None, out_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api/results_htsinfer'), tmp_dir=PosixPath('/tmp/tmp_htsinfer'), cleanup_regime=<CleanupRegimes.DEFAULT: 'default'>, records=1000000, threads=1, tax_id=None, transcripts_file=PosixPath('.'), read_layout_adapter_file=PosixPath('.'), read_layout_min_match_pct=0.1, read_layout_min_freq_ratio=2, lib_source_min_match_pct=2, lib_source_min_freq_ratio=2, lib_type_max_distance=1000, lib_type_mates_cutoff=0.85, read_orientation_min_mapped_reads=20, read_orientation_min_fraction=0.75, path_1_processed=PosixPath('.'), path_2_processed=None, t_file_processed=PosixPath('.')), results: ~htsinfer.models.Results = Results(library_stats=ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)), file_2=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None))), library_source=ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None)), library_type=ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>), read_orientation=ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>), read_layout=ResultsLayout(file_1=Layout(adapt_3=None, polyA_frac=None), file_2=Layout(adapt_3=None, polyA_frac=None))))
Bases:
BaseModel
Configuration model for CLI arguments and inference results.
- Parameters:
args – Container class for CLI arguments.
results – Container class for aggregating results from the different inference functionalities.
- args
Container class for CLI arguments.
- Type:
- results
Container class for aggregating results from the different inference functionalities.
- Type:
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'args': FieldInfo(annotation=Args, required=False, default=Args(path_1=PosixPath('.'), path_2=None, out_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api/results_htsinfer'), tmp_dir=PosixPath('/tmp/tmp_htsinfer'), cleanup_regime=<CleanupRegimes.DEFAULT: 'default'>, records=1000000, threads=1, tax_id=None, transcripts_file=PosixPath('.'), read_layout_adapter_file=PosixPath('.'), read_layout_min_match_pct=0.1, read_layout_min_freq_ratio=2, lib_source_min_match_pct=2, lib_source_min_freq_ratio=2, lib_type_max_distance=1000, lib_type_mates_cutoff=0.85, read_orientation_min_mapped_reads=20, read_orientation_min_fraction=0.75, path_1_processed=PosixPath('.'), path_2_processed=None, t_file_processed=PosixPath('.'))), 'results': FieldInfo(annotation=Results, required=False, default=Results(library_stats=ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)), file_2=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None))), library_source=ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None)), library_type=ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>), read_orientation=ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>), read_layout=ResultsLayout(file_1=Layout(adapt_3=None, polyA_frac=None), file_2=Layout(adapt_3=None, polyA_frac=None))))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class htsinfer.models.Layout(*, adapt_3: str | None = None, polyA_frac: float | None = None)
Bases:
BaseModel
Read layout of a single sequencing file.
- Parameters:
adapt_3 – Adapter sequence ligated to 3’-end of sequence.
polyA_frac – Fraction of reads containing polyA tails.
- adapt_3
Adapter sequence ligated to 3’-end of sequence.
- Type:
str | None
- polyA_frac
Fraction of reads containing polyA tails.
- Type:
float | None
- adapt_3: str | None
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'adapt_3': FieldInfo(annotation=Union[str, NoneType], required=False), 'polyA_frac': FieldInfo(annotation=Union[float, NoneType], required=False)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- polyA_frac: float | None
- class htsinfer.models.LogLevels(value)
Bases:
Enum
Log level enumerator.
- CRITICAL = 50
- DEBUG = 10
- ERROR = 40
- INFO = 20
- WARN = 30
- WARNING = 30
- class htsinfer.models.ReadLength(*, min: int | None = None, max: int | None = None, mean: float | None = None, median: int | None = None, mode: int | None = None)
Bases:
BaseModel
Read length of a sequencing file.
- Parameters:
min – Minimum read length.
max – Maximum read length.
mean – Mean of read lengths.
median – Median of read lengths.
mode – Mode of read length.
- min
Minimum read length.
- Type:
int | None
- max
Maximum read length.
- Type:
int | None
- mean
Mean of read lengths.
- Type:
float | None
- median
Median of read lengths.
- Type:
int | None
- mode
Mode of read length.
- Type:
int | None
- max: int | None
- mean: float | None
- median: int | None
- min: int | None
- mode: int | None
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'max': FieldInfo(annotation=Union[int, NoneType], required=False), 'mean': FieldInfo(annotation=Union[float, NoneType], required=False), 'median': FieldInfo(annotation=Union[int, NoneType], required=False), 'min': FieldInfo(annotation=Union[int, NoneType], required=False), 'mode': FieldInfo(annotation=Union[int, NoneType], required=False)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class htsinfer.models.Results(*, library_stats: ~htsinfer.models.ResultsStats = ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)), file_2=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None))), library_source: ~htsinfer.models.ResultsSource = ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None)), library_type: ~htsinfer.models.ResultsType = ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>), read_orientation: ~htsinfer.models.ResultsOrientation = ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>), read_layout: ~htsinfer.models.ResultsLayout = ResultsLayout(file_1=Layout(adapt_3=None, polyA_frac=None), file_2=Layout(adapt_3=None, polyA_frac=None)))
Bases:
BaseModel
Container class for aggregating results from the different inference functionalities.
- Parameters:
library_type – Library type inference results.
library_source – Library source inference results.
orientation – Read orientation inference results.
read_layout – Read layout inference results.
type – Library type inference results.
source – Library source inference results.
read_orientation – Read orientation inference results.
read_layout – Read layout inference results.
- library_source: ResultsSource
- library_stats: ResultsStats
- library_type: ResultsType
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'library_source': FieldInfo(annotation=ResultsSource, required=False, default=ResultsSource(file_1=Source(short_name=None, taxon_id=None), file_2=Source(short_name=None, taxon_id=None))), 'library_stats': FieldInfo(annotation=ResultsStats, required=False, default=ResultsStats(file_1=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)), file_2=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)))), 'library_type': FieldInfo(annotation=ResultsType, required=False, default=ResultsType(file_1=<StatesType.not_available: None>, file_2=<StatesType.not_available: None>, relationship=<StatesTypeRelationship.not_available: None>)), 'read_layout': FieldInfo(annotation=ResultsLayout, required=False, default=ResultsLayout(file_1=Layout(adapt_3=None, polyA_frac=None), file_2=Layout(adapt_3=None, polyA_frac=None))), 'read_orientation': FieldInfo(annotation=ResultsOrientation, required=False, default=ResultsOrientation(file_1=<StatesOrientation.not_available: None>, file_2=<StatesOrientation.not_available: None>, relationship=<StatesOrientationRelationship.not_available: None>))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- read_layout: ResultsLayout
- read_orientation: ResultsOrientation
- class htsinfer.models.ResultsLayout(*, file_1: Layout = Layout(adapt_3=None, polyA_frac=None), file_2: Layout = Layout(adapt_3=None, polyA_frac=None))
Bases:
BaseModel
Container class for read layout of a sequencing library.
- Parameters:
file_1 – Adapter sequence present in first file.
file_2 – Adapter sequence present in second file.
- file_1
Adapter sequence present in first file.
- Type:
- file_2
Adapter sequence present in second file.
- Type:
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'file_1': FieldInfo(annotation=Layout, required=False, default=Layout(adapt_3=None, polyA_frac=None)), 'file_2': FieldInfo(annotation=Layout, required=False, default=Layout(adapt_3=None, polyA_frac=None))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class htsinfer.models.ResultsOrientation(*, file_1: StatesOrientation | None = StatesOrientation.not_available, file_2: StatesOrientation | None = StatesOrientation.not_available, relationship: StatesOrientationRelationship | None = StatesOrientationRelationship.not_available)
Bases:
BaseModel
- Container class for aggregating library orientation.
- Args:
file_1: Read orientation of first file. file_2: Read orientation of second file. relationship: Orientation type relationship between the provided files.
- file_1
Read orientation of first file.
- Type:
- file_2
Read orientation of second file.
- Type:
- file_1: StatesOrientation | None
- file_2: StatesOrientation | None
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'file_1': FieldInfo(annotation=Union[StatesOrientation, NoneType], required=False, default=<StatesOrientation.not_available: None>), 'file_2': FieldInfo(annotation=Union[StatesOrientation, NoneType], required=False, default=<StatesOrientation.not_available: None>), 'relationship': FieldInfo(annotation=Union[StatesOrientationRelationship, NoneType], required=False, default=<StatesOrientationRelationship.not_available: None>)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- relationship: StatesOrientationRelationship | None
- class htsinfer.models.ResultsSource(*, file_1: Source = Source(short_name=None, taxon_id=None), file_2: Source = Source(short_name=None, taxon_id=None))
Bases:
BaseModel
Container class for aggregating library source.
- Parameters:
file_1 – Library source of the first file.
file_2 – Library source of the second file.
- file_1
Library source of the first file.
- Type:
- file_2
Library source of the second file.
- Type:
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'file_1': FieldInfo(annotation=Source, required=False, default=Source(short_name=None, taxon_id=None)), 'file_2': FieldInfo(annotation=Source, required=False, default=Source(short_name=None, taxon_id=None))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class htsinfer.models.ResultsStats(*, file_1: Stats = Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)), file_2: Stats = Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)))
Bases:
BaseModel
Container class for aggregating library statistics information.
- Parameters:
file_1 – Library statistics for the first file.
file_2 – Library statistics for the second file.
- file_1
Library statistics for the first file.
- Type:
- file_2
Library statistics for the second file.
- Type:
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'file_1': FieldInfo(annotation=Stats, required=False, default=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None))), 'file_2': FieldInfo(annotation=Stats, required=False, default=Stats(read_length=ReadLength(min=None, max=None, mean=None, median=None, mode=None)))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class htsinfer.models.ResultsType(*, file_1: StatesType | None = StatesType.not_available, file_2: StatesType | None = StatesType.not_available, relationship: StatesTypeRelationship | None = StatesTypeRelationship.not_available)
Bases:
BaseModel
Container class for aggregating library type and mate relationship information.
- Parameters:
file_1 – Library type of the first file.
file_2 – Library type of the second file.
relationship – Type/mate relationship between the provided files.
- file_1
Library type of the first file.
- Type:
htsinfer.models.StatesType | None
- file_2
Library type of the second file.
- Type:
htsinfer.models.StatesType | None
- relationship
Type/mate relationship between the provided files.
- Type:
- file_1: StatesType | None
- file_2: StatesType | None
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'file_1': FieldInfo(annotation=Union[StatesType, NoneType], required=False, default=<StatesType.not_available: None>), 'file_2': FieldInfo(annotation=Union[StatesType, NoneType], required=False, default=<StatesType.not_available: None>), 'relationship': FieldInfo(annotation=Union[StatesTypeRelationship, NoneType], required=False, default=<StatesTypeRelationship.not_available: None>)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- relationship: StatesTypeRelationship | None
- class htsinfer.models.RunStates(value)
Bases:
IntEnum
Enumerator of run states and exit codes.
- ERROR = 2
- OKAY = 0
- WARNING = 1
- class htsinfer.models.SeqIdFormats(value)
Bases:
Enum
An enumeration.
- class htsinfer.models.Source(*, short_name: str | None = None, taxon_id: int | None = None)
Bases:
BaseModel
Library source of an individual sequencing file.
- Parameters:
short_name – Library source short name, e.g., “hsapiens”.
taxon_id – Library source taxon identifer, e.g., 9606.
- short_name
Library source short name, e.g., “hsapiens”.
- Type:
str | None
- taxon_id
Library source taxon identifer, e.g., 9606.
- Type:
int | None
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'short_name': FieldInfo(annotation=Union[str, NoneType], required=False), 'taxon_id': FieldInfo(annotation=Union[int, NoneType], required=False)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- short_name: str | None
- taxon_id: int | None
- class htsinfer.models.StatesOrientation(value)
Bases:
Enum
Enumerator of read orientation types for individual library files. Cf. https://salmon.readthedocs.io/en/latest/library_type.html
- not_available
Orientation type information is not available for a given file, either because no file was provided, the file could not be parsed, an orientation type has not yet been assigned.
- stranded_forward
Reads are stranded and come from the forward strand.
- stranded_reverse
Reads are stranded and come from the reverse strand.
- unstranded
Reads are unstranded.
- not_available = None
- stranded_forward = 'SF'
- stranded_reverse = 'SR'
- unstranded = 'U'
- class htsinfer.models.StatesOrientationRelationship(value)
Bases:
Enum
Enumerator of read orientation type relationships for paired-ended libraries. Cf. https://salmon.readthedocs.io/en/latest/library_type.html
- inward_stranded_forward
Mates are oriented toward each other, the library is stranded, and first mates come from the forward strand.
- inward_stranded_reverse
Mates are oriented toward each other, the library is stranded, and first mates come from the reverse strand.
- inward_unstranded
Mates are oriented toward each other and the library is unstranded.
- not_available
Orientation type relationship information is not available, likely because only a single file was provided or because the orientation type relationship has not been or could not be evaluated.
- inward_stranded_forward = 'ISF'
- inward_stranded_reverse = 'ISR'
- inward_unstranded = 'IU'
- not_available = None
- class htsinfer.models.StatesType(value)
Bases:
Enum
Possible outcomes of determining the sequencing library type of an individual FASTQ file.
- file_problem
There was a problem with opening or parsing the file.
- first_mate
All of the sequence identifiers of the processed file counts indicate that the library represents the first mate of a paired-end library.
- mixed_mates
All of the sequence identifiers of the processed file include mate information. However, the file includes at least one record for either mate, indicating that the library represents a mixed mate library.
- not_available
Library type information is not available for a given file, either because no file was provided, the file could not be parsed, a library type has not yet been assigned, the processed file contains records with sequence identifiers of an unknown format, of different formats or that are inconsistent in that they indicate the library represents both a single-ended and paired-ended library at the same time.
- second_mate
All of the sequence identifiers of the processed file indicate that the library represents the second mate of a paired-end library.
- single
All of the sequence identifiers of the processed file indicate that the library represents a single-end library.
- first_mate = 'first_mate'
- first_mate_assumed = 'first_mate_assumed'
- mixed_mates = 'mixed_mates'
- not_available = None
- second_mate = 'second_mate'
- second_mate_assumed = 'second_mate_assumed'
- single = 'single'
- class htsinfer.models.StatesTypeRelationship(value)
Bases:
Enum
Possible outcomes of determining the sequencing library type/mate relationship between two FASTQ files.
- not_available
Mate relationship information is not available, likely because only a single file was provided or because the mate relationship has not yet been evaluated.
- not_mates
The library type information of the files is not compatible, either because not a pair of first and second mate files was provided, or because the files do not compatible sequence identifiers.
- split_mates
One of the provided files represents the first and the other the second mates of a paired-end library.
- not_available = None
- not_mates = 'not_mates'
- split_mates = 'split_mates'
- class htsinfer.models.Stats(*, read_length: ReadLength = ReadLength(min=None, max=None, mean=None, median=None, mode=None))
Bases:
BaseModel
Library statistics of an individual sequencing file.
- Parameters:
read_length – Tuple of mininimum, maximum, mean, median and mode of lengths of reads in library.
- read_length
Tuple of mininimum, maximum, mean, median and mode of lengths of reads in library.
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'read_length': FieldInfo(annotation=ReadLength, required=False, default=ReadLength(min=None, max=None, mean=None, median=None, mode=None))}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- read_length: ReadLength
htsinfer.subset_fastq module
FASTQ subsetting, extraction and validation.
- class htsinfer.subset_fastq.SubsetFastq(path: Path, out_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/htsinfer/checkouts/stable/docs/api/results_htsinfer'), records: int = 0)
Bases:
object
Subset, uncompress and validate a FASTQ file.
- Parameters:
path – Path to FASTQ file.
out_dir – Path to directory where output is written to.
records – Number of input file records to process; set to 0 to process all records.
- path
Path to FASTQ file.
- out_dir
Path to directory where output is written to.
- records
Number of input file records to process.
- out_path
Path for uncompressed, filtered path file.
- n_processed
Total number of processed records.
- Raises:
FileProblem – The input file could not be parsed or the output file could not be written.
- process()
Uncompress, subset and validate files.
htsinfer.utils module
Utilities used across multiple HTSinfer modules.
- htsinfer.utils.convert_dict_to_df(dic: Dict, col_headers: Tuple[str, str] | None = None, sort: bool = False, sort_by: int = 0, sort_ascending: bool = True) DataFrame
Convert dictionary to two-column data frame.
- Parameters:
dic – Dictionary to convert.
col_headers – List of column headers. Length MUST match number of dictionary keys/data frame columns.
sort – Whether the resulting data frame is supposed to be sorted.
sort_by – Column index used for sorting. Ignored if sort is False.
sort_ascending – Whether the data frame is supposed to be sorted in ascending order. Ignored if sort is False.
- Returns:
Data frame prepared from dictionary.
- Raises:
ValueError – Raised if number of provided column headers does not match the number of data frame columns.
- htsinfer.utils.validate_top_score(vector: List[float], min_value: float = 2, min_ratio: float = 2, accept_zero: bool = True, rev_sorted: bool = True) bool
Validates whether (1) the maximum value of a numeric list is equal to or higher than a specified minimum value AND (2) that the ratio of the first and second highest values of the list is higher than a specified minimum ratio.
If the passed list/vector does NOT contain at least two items, the function returns False.
- Parameters:
vector – List of numbers.
min_value – Minimum value required in first row of column_index for validation to pass.
min_ratio – Minimum ratio of first and second rows of column_index required for validation to pass.
accept_zero – Whether to accept a top score (i.e., return True) if the second highest value in the provided list is zero. If not set to True, False is returned in these cases.
rev_sorted – Whether the list of numbers is sorted in descencing numeric order.
- Returns:
Whether data frame data satisfies the min_value and min_ratio constraints for value in column column_index.
- Raises:
ValueError – Raised if one of the list items can not be interpreted as a number.