Updated Wiki: The Manual of CULO-tuple Pipeline

Manual of the CULO-tuple pipeline on the Linux platform

Required running environment

Linux operating system;
Python 2.7 or higher;
R 2.7 or higher;
A tuple counting tool (e.g., DSK or Jellyfish).

Processing procedure

Step 1: Download the pipeline source code and test data to your workspace directory.

Step 2: Count the 40-length sequence tuples of each metagenomic sample by scanning its short reads with DSK. The commands are as follows:

Compiling: make omp=1 k=40. This command compiles DSK for parallel running mode; k denotes the tuple length. For serial mode, omit the “omp” option.
Tuple counting: ./dsk Sample_A_01.fa 40
Format transformation: ./parese Sample_A_01.solid_kmers_binary > Sample_A_01_40-tuple.txt
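
To make the counting step concrete, the following is a minimal Python sketch of what a tuple counter computes. It is not how DSK works internally; the function name count_tuples and the in-memory list of reads are assumptions made for illustration. Real counters are built for large data sets and may also treat a tuple and its reverse complement as one, which this sketch does not do.

```python
from collections import Counter


def count_tuples(reads, k=40):
    """Count every length-k substring (tuple) over a collection of reads.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    counts = Counter()
    for read in reads:
        read = read.strip().upper()
        # Slide a window of width k along the read; reads shorter than k
        # produce an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Toy reads with k=3 so that the result is easy to inspect.
    for sequence, count in sorted(count_tuples(["ACGTAC", "CGTA"], k=3).items()):
        print(sequence, count)
```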

Step 3: Filter out the rarely occurring tuples of each sample.

./tuple_filtering.py -f Sample_A_01_40-tuple.txt -n x

Where x is the occurrence threshold: only tuples that occur more than x times are retained.
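
The filtering rule can be sketched as follows. This is an illustration, not the code of tuple_filtering.py; it assumes each input line holds a tuple sequence and its count separated by whitespace, and the function name filter_tuples is hypothetical.

```python
def filter_tuples(lines, threshold):
    """Keep the tuples whose count is strictly greater than threshold.

    Assumes each line looks like "<tuple sequence> <count>" (an assumption
    about the file layout, not taken from the pipeline source).
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sequence, count = fields[0], int(fields[1])
        if count > threshold:
            kept.append((sequence, count))
    return kept


if __name__ == "__main__":
    sample = ["AAA 1", "CCC 3", "", "GGG 2"]
    print(filter_tuples(sample, 2))
```

Note that a tuple occurring exactly x times is dropped, because the comparison is strict.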

Step 4: Sort each filtered sample by tuple sequence.

./tuple_quickSort.py -f xxxx.filtered.txt -p outputDir

Where the -f option takes the filtered sample from the previous step, and the -p option sets the output directory for the sorted files (default: the current directory).

Step 5: Integrate all the samples into a feature matrix.

This step uses the Linux join command as follows:

join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 sample-01.txt sample-02.txt > new-01.txt

First, each pair of distinct samples is joined into a new sample. The procedure is then repeated on the new samples until all samples are integrated into a single feature matrix. For more details on the join command, see its manual page (man join).
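
The repeated joining can be sketched in Python as below. This is an in-memory illustration of the same full outer join with 0 as the fill value, not a replacement for the join command; the names join_samples and build_feature_matrix, and the use of dictionaries, are assumptions made for the example.

```python
def join_samples(left, right, left_width, right_width, fill=0):
    """Full outer join of two tables keyed by tuple sequence.

    left and right map a tuple sequence to a list of counts; a key that is
    missing on one side is padded with the fill value on that side.
    """
    joined = {}
    for key in sorted(set(left) | set(right)):
        joined[key] = (left.get(key, [fill] * left_width)
                       + right.get(key, [fill] * right_width))
    return joined


def build_feature_matrix(samples, fill=0):
    """Join the samples one by one into a tuple-by-sample matrix.

    samples is a list of dicts mapping tuple sequence -> count.
    """
    matrix, width = {}, 0
    for sample in samples:
        table = {key: [count] for key, count in sample.items()}
        matrix = join_samples(matrix, table, width, 1, fill)
        width += 1
    return matrix


if __name__ == "__main__":
    a = {"AAC": 3, "ACG": 5}
    b = {"ACG": 2, "TTT": 7}
    c = {"AAC": 1}
    for sequence, counts in sorted(build_feature_matrix([a, b, c]).items()):
        print(sequence, counts)
```

When working with the join command itself, keep in mind that it expects both inputs to be sorted on the join field, which is why Step 4 precedes this step. Also, the -o list above names one count column per input, so it presumably has to be extended once the inputs carry several sample columns.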

Step 6: Merge features that have identical patterns.

Frequency type

./merging_frequency.py -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_frequency.txt

Boolean type

./merging_boolean.py -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_boolean.txt

Where the -f option takes the feature matrix from the previous step, and the -o option sets the output file name.
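
One plausible reading of this step is sketched below; it is an assumption, not the code of the merging scripts. Here "identical pattern" is taken to mean that two tuples have the same row across all samples: the same counts for the frequency type, or the same presence/absence vector for the Boolean type. The function name merge_identical_patterns is hypothetical.

```python
from collections import OrderedDict


def merge_identical_patterns(matrix, boolean=False):
    """Group the tuples whose rows are identical across all samples.

    matrix maps a tuple sequence to its list of per-sample counts.
    With boolean=True, counts are first reduced to presence (1) or
    absence (0), so rows need only agree on where the tuple occurs.
    Returns a list of (pattern, [tuple sequences]) pairs in first-seen order.
    """
    groups = OrderedDict()
    for sequence, counts in matrix.items():
        if boolean:
            pattern = tuple(1 if c > 0 else 0 for c in counts)
        else:
            pattern = tuple(counts)
        groups.setdefault(pattern, []).append(sequence)
    return list(groups.items())


if __name__ == "__main__":
    matrix = OrderedDict([("AAC", [3, 0]), ("ACG", [3, 0]),
                          ("TTT", [1, 0]), ("GGG", [0, 2])])
    print(merge_identical_patterns(matrix))
    print(merge_identical_patterns(matrix, boolean=True))
```

Under this reading, the Boolean type merges at least as aggressively as the frequency type, since tuples with different counts can still share a presence/absence pattern.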

Step 7: Calculate the TF-IDF weights and TC values of the merged features.

./TF-IDF.py -f xxx_MergedFeatureMatrix.txt -n N

./TC.py -w weight_matrix.txt -t total_w.txt -o xxx_TF-IDF_TC.txt

Where the -f option takes the merged feature matrix (frequency or Boolean) from the previous step, and N is the number of samples.
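
For orientation, the sketch below computes TF-IDF weights in the common textbook form, weight = tf * log(N / df), where tf is the count of a feature in a sample and df is the number of samples containing it. Whether TF-IDF.py uses exactly this variant is an assumption; the function name tf_idf and the dictionary layout are hypothetical.

```python
import math


def tf_idf(matrix, n_samples):
    """Weight each count by log(N / df), the textbook TF-IDF form.

    matrix maps a feature to its list of per-sample counts; n_samples is N.
    This may differ from the weighting used by the pipeline script.
    """
    weights = {}
    for feature, counts in matrix.items():
        # Document frequency: in how many samples the feature occurs.
        df = sum(1 for c in counts if c > 0)
        idf = math.log(float(n_samples) / df) if df else 0.0
        weights[feature] = [c * idf for c in counts]
    return weights


if __name__ == "__main__":
    matrix = {"f1": [2, 0], "f2": [1, 1]}
    for feature, row in sorted(tf_idf(matrix, 2).items()):
        print(feature, row)
```

In this form, a feature present in every sample gets weight 0 everywhere, because log(N / N) = 0, so only features that discriminate between samples keep a non-zero weight.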

Step 8: Rank the TC values of the merged features.