phages2050.features.transformers package
Submodules
phages2050.features.transformers.kmers module
class phages2050.features.transformers.kmers.GenomeAvgTransformer(gensim_model: Union[gensim.models.fasttext.FastText, gensim.models.word2vec.Word2Vec])
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Average k-mers to represent a bacteriophage with a word embedding.
Most pre-trained Word2vec or fastText models provide numerical representations of individual words but not of entire documents. This class averages the vectors of all k-mers of a DNA sequence, so that the resulting bacteriophage vector is the centroid of its k-mers in feature space.
average_word_vectors(words: List[str], vocabulary: Set[T]) → numpy.array
Return a fixed-length numeric vector for each DNA sequence.
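The centroid idea described above can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: the function name average_kmer_vectors, the embeddings dictionary and the num_features parameter are hypothetical, and plain Python lists stand in for the numpy.array returned by the real method. The sketch also assumes that k-mers missing from the vocabulary are skipped.

```python
from typing import Dict, List, Set


def average_kmer_vectors(
    kmers: List[str],
    embeddings: Dict[str, List[float]],
    vocabulary: Set[str],
    num_features: int,
) -> List[float]:
    """Return the centroid of the embedding vectors of the known k-mers.

    Hypothetical helper: k-mers absent from the vocabulary are ignored,
    and a zero vector is returned when no k-mer is known.
    """
    total = [0.0] * num_features
    n_known = 0
    for kmer in kmers:
        if kmer in vocabulary:
            vector = embeddings[kmer]
            for i in range(num_features):
                total[i] += vector[i]
            n_known += 1
    if n_known == 0:
        return total
    return [value / n_known for value in total]


# Toy 2-dimensional embeddings (made-up values, for illustration only).
toy_embeddings = {
    "ATG": [1.0, 0.0],
    "TGC": [0.0, 1.0],
}
genome_vector = average_kmer_vectors(
    ["ATG", "TGC", "GCA"],  # "GCA" is not in the vocabulary and is skipped
    toy_embeddings,
    set(toy_embeddings),
    num_features=2,
)
print(genome_vector)  # [0.5, 0.5]
```

Whatever the number of k-mers in a genome, the result has the same length as a single embedding vector, which is what makes it usable as a fixed-length feature for downstream estimators.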
class phages2050.features.transformers.kmers.KMersTransformer(size: int = 6)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
The k-mer transformer extracts the set of words, i.e. subsequences of length size (6 by default), contained within a biological sequence.
Each word is called a k-mer and is composed of nucleotides (i.e. A, T, G, and C).
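Example:
from phages2050.features.transformers.kmers import KMersTransformer
# FastaReader must be imported as well; its module path is not shown here.
fname = 'NC_001604.fasta'
fr = FastaReader(fname)
sample = fr.to_df()
kmt = KMersTransformer()
kmt.transform(sample)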
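To make the notion of a k-mer concrete, the sketch below splits a short sequence into overlapping words. It is an illustration only: extract_kmers is a hypothetical helper, and the sliding window with a step of one nucleotide is an assumption, since the description above does not state whether KMersTransformer produces overlapping or non-overlapping k-mers.

```python
from typing import List


def extract_kmers(sequence: str, size: int = 6) -> List[str]:
    """Return all overlapping subsequences of length `size`.

    Hypothetical helper: assumes a sliding window that advances by one
    nucleotide. Sequences shorter than `size` yield an empty list.
    """
    sequence = sequence.upper()
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


kmers = extract_kmers("ATGCGTAC", size=6)
print(kmers)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Under this assumption a sequence of length n yields n - size + 1 k-mers, so an 8-nucleotide sequence gives three 6-mers.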