phages2050.features.transformers package

Submodules

phages2050.features.transformers.kmers module

class phages2050.features.transformers.kmers.GenomeAvgTransformer(gensim_model: Union[gensim.models.fasttext.FastText, gensim.models.word2vec.Word2Vec])[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Average k-mer embeddings to represent a bacteriophage genome as a single word-embedding vector.

Most pre-trained Word2vec or fastText models provide numerical representations of individual words but not of entire documents. This class averages the vectors of all k-mers in a DNA sequence, so the resulting bacteriophage vector is the centroid of its k-mers in feature space.

average_word_vectors(words: List[str], vocabulary: Set[T]) → numpy.array[source]

Return a fixed-length numeric vector for each DNA sequence.
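The averaging step can be illustrated with a minimal sketch. It is not the package's implementation: a plain dictionary stands in for the gensim model, and the function name average_kmer_vectors, the dim parameter, and the zero-vector fallback for sequences with no known k-mers are all assumptions made for illustration.

```python
import numpy as np

def average_kmer_vectors(kmers, embeddings, dim):
    """Return the centroid of the embedding vectors of the known k-mers."""
    # Keep only k-mers that have an embedding (i.e. are in the vocabulary).
    known = [embeddings[k] for k in kmers if k in embeddings]
    if not known:
        # Assumption: fall back to a zero vector when no k-mer is known.
        return np.zeros(dim, dtype=float)
    # Element-wise mean over all known k-mer vectors.
    return np.mean(known, axis=0)

# Toy 2-dimensional embeddings; a trained model would supply real vectors.
toy_embeddings = {
    "ATG": np.array([1.0, 0.0]),
    "TGC": np.array([0.0, 1.0]),
}

# "GCA" has no embedding here, so it is skipped when averaging.
vector = average_kmer_vectors(["ATG", "TGC", "GCA"], toy_embeddings, dim=2)
print(vector)
```

Whatever the number of k-mers in a sequence, the result always has the dimensionality of the embedding, which is what makes it usable as a fixed-length feature vector.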

averaged_word_vectorizer(column_with_kmers_seqs) → numpy.array[source]

Apply the averaged-vector transformer to each k-mer sequence and return the result as an array of numeric values.

transform(column_with_kmers_seqs: pandas.core.series.Series) → pandas.core.frame.DataFrame[source]

Apply the averaged-vector transformer to each k-mer sequence and return a pandas DataFrame of fixed-length numeric vectors.

class phages2050.features.transformers.kmers.KMersTransformer(size: int = 6)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

The k-mer transformer extracts the set of words that are subsequences of a fixed length (size, 6 by default) contained within a biological sequence.

Each word is called a k-mer and is composed of nucleotides (i.e. A, T, G, and C).

Example:

    fname = 'NC_001604.fasta'
    fr = FastaReader(fname)
    sample = fr.to_df()
    kmt = KMersTransformer()
    kmt.transform(sample)

transform(df: pandas.core.frame.DataFrame) → pandas.core.series.Series[source]

Apply the k-mer transformer to each DNA sequence and return a Series of k-mer strings.
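As a rough illustration of what k-mer extraction involves, the sketch below splits one sequence into words of length size. The helper name extract_kmers is hypothetical, and two details are assumptions not confirmed by this page: that the window slides one nucleotide at a time (overlapping k-mers) and that the k-mers are joined into a single space-separated string.

```python
def extract_kmers(sequence, size=6):
    """Return all overlapping k-mers of length `size`, joined by spaces."""
    # Normalize case so that 'atg' and 'ATG' yield the same k-mer.
    sequence = sequence.upper()
    # Slide a window of width `size` along the sequence, one base at a time.
    kmers = [sequence[i:i + size] for i in range(len(sequence) - size + 1)]
    # A sequence shorter than `size` produces no k-mers and an empty string.
    return " ".join(kmers)

print(extract_kmers("ATGCGTAC", size=6))
```

Under these assumptions, a sequence of length n yields n - size + 1 k-mers, so an 8-base sequence with size=6 gives three words.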

Module contents