Manual for the CULO-tuple pipeline on Linux
Required running environment
- Linux operating system;
- Python 2.7 or higher;
- R 2.7 or higher;
- A tuple counting tool (e.g., DSK or Jellyfish).
Processing procedure
Step 1: Download the pipeline source code and test data to your workspace directory.
Step 2: Count the tuples of length 40 for each metagenomic sample by scanning its short reads with DSK. The commands are as follows:
- Compiling: make omp=1 k=40. This command compiles DSK for parallel mode; k denotes the tuple length. For serial mode, omit the "omp" option.
- Tuple counting: ./dsk Sample_A_01.fa 40
- Format transformation: ./parese Sample_A_01.solid_kmers_binary > Sample_A_01_40-tuple.txt
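To make clear what kind of result Step 2 produces, here is a minimal Python sketch of tuple counting on a toy read set. It is not how DSK works internally (DSK is a disk-based counter meant for large data sets); the function name count_tuples and the toy reads are hypothetical, and the sketch does not consider reverse complements.

```python
from collections import Counter

def count_tuples(reads, k):
    """Count every length-k substring (tuple) across a list of reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; reads shorter
        # than k contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy example with k=3 instead of 40 so the result is easy to inspect.
reads = ["ACGTAC", "CGTACG"]
for tup, n in sorted(count_tuples(reads, 3).items()):
    print(tup, n)
```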
Step 3: Filter out the rarely occurring tuples for each sample
./tuple_filtering.py -f Sample_A_01_40-tuple.txt -n x
Where x is the threshold: only tuples whose occurrence count is greater than x are retained.
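The thresholding rule can be sketched in a few lines of Python. This is only an illustration and not the code of tuple_filtering.py; it assumes that each line of the tuple file holds a tuple and its count separated by whitespace, and the function name filter_tuples is hypothetical.

```python
def filter_tuples(lines, threshold):
    """Keep 'tuple count' lines whose count is strictly greater than threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        tup, count = fields[0], int(fields[1])
        if count > threshold:
            kept.append((tup, count))
    return kept

# With threshold 2, only tuples seen at least 3 times survive.
print(filter_tuples(["AAA 1", "AAC 3", "AAG 2"], 2))
```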
Step 4: Sort each filtered sample by its tuple sequence signature.
./tuple_quickSort.py -f xxxx.filtered.txt -p outputDir
Where the -f option takes the filtered sample from the previous step, and the -p option specifies the output directory for the sorted files (default: the current directory).
Step 5: Integrate all the samples into a feature matrix.
For this step we use the Linux join command as follows:
join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 sample-01.txt sample-02.txt > new-01.txt
First, every two distinct samples are joined into a new sample; this procedure is then repeated on the new samples until all samples are integrated into one feature matrix. For more details on the usage of the join command, see its manual page (man join).
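The repeated pairwise joining can be sketched in Python as a full outer join that fills missing counts with 0, which is what the -a1 -a2 and -e '0' options do in the join command. This is an illustrative in-memory sketch, not part of the pipeline; the function names outer_join and build_feature_matrix are hypothetical, and at least one sample is assumed.

```python
def outer_join(left, right, left_width, right_width):
    """Full outer join of two tuple -> list-of-counts dicts, filling gaps with 0."""
    joined = {}
    for key in set(left) | set(right):
        joined[key] = (left.get(key, [0] * left_width)
                       + right.get(key, [0] * right_width))
    return joined

def build_feature_matrix(samples):
    """samples: non-empty list of dicts mapping tuple -> count, one per sample."""
    # Start with one single-column table per sample.
    tables = [({k: [v] for k, v in s.items()}, 1) for s in samples]
    # Join tables two at a time until only one remains.
    while len(tables) > 1:
        merged = []
        for i in range(0, len(tables) - 1, 2):
            (a, wa), (b, wb) = tables[i], tables[i + 1]
            merged.append((outer_join(a, b, wa, wb), wa + wb))
        if len(tables) % 2 == 1:
            merged.append(tables[-1])  # the odd table waits for the next round
        tables = merged
    return tables[0][0]

samples = [{"AAA": 2, "AAC": 1}, {"AAC": 4}, {"AAG": 5}]
matrix = build_feature_matrix(samples)
for tup in sorted(matrix):
    print(tup, matrix[tup])
```

Two points about the join command itself are worth keeping in mind. It expects both inputs to be sorted on the join field, which Step 4 provides. Also, the field list -o 0,1.2,2.2 names only one count column per input, so when joining intermediate files that already hold several samples the list has to name every count column (for example 0,1.2,1.3,2.2,2.3 for two files of two samples each).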
Step 6: Merge features with identical patterns
- Frequency type
./merging_frequency.py -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_frequency.txt
- Boolean type
./merging_boolean.py -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_boolean.txt
Where the -f option takes the feature matrix from the previous step, and the -o option specifies the output file name.
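One plausible reading of "identical pattern" is that tuples whose rows in the feature matrix are the same across all samples are grouped into a single merged feature. The Python sketch below illustrates that reading only; it is an assumption, not the documented behavior of merging_frequency.py or merging_boolean.py, and the function name merge_identical_patterns is hypothetical.

```python
from collections import OrderedDict

def merge_identical_patterns(matrix, boolean=False):
    """Group tuples whose row pattern across samples is identical.

    matrix: dict mapping tuple -> list of counts (one per sample).
    Returns an OrderedDict mapping pattern -> list of tuples sharing it.
    """
    groups = OrderedDict()
    for tup in sorted(matrix):
        row = matrix[tup]
        if boolean:
            # Presence/absence pattern: any non-zero count becomes 1.
            pattern = tuple(1 if c > 0 else 0 for c in row)
        else:
            # Frequency pattern: the counts themselves must match.
            pattern = tuple(row)
        groups.setdefault(pattern, []).append(tup)
    return groups

matrix = {"AAA": [2, 0], "AAC": [2, 0], "AAG": [5, 0], "AAT": [0, 1]}
print(dict(merge_identical_patterns(matrix)))
print(dict(merge_identical_patterns(matrix, boolean=True)))
```

Under this reading the Boolean type merges more aggressively than the frequency type, because tuples only need to agree on which samples they occur in.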
Step 7: Calculate the TF-IDF weights and TC values of the merged features
./TF-IDF.py -f xxx_MergedFeatureMatrix.txt -n N
./TC.py -w weight_matrix.txt -t total_w.txt -o xxx_TF-IDF_TC.txt
Where the -f option takes the merged feature matrix (frequency or Boolean) from the previous step, and N is the number of samples.
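This manual does not give the formulas used by TF-IDF.py and TC.py. As a rough sketch, the code below uses a common textbook form of TF-IDF (count multiplied by log(N / df)) and assumes that TC stands for "term contribution", taken here as the sum of products of a feature's weights over all ordered pairs of distinct samples. Both choices are assumptions for illustration, and the function names tf_idf and term_contribution are hypothetical.

```python
import math

def tf_idf(matrix, n_samples):
    """matrix: feature -> list of counts. Returns feature -> list of weights."""
    weights = {}
    for feat, row in matrix.items():
        df = sum(1 for c in row if c > 0)  # number of samples with the feature
        # Assumed weighting: raw count times log(N / df).
        idf = math.log(float(n_samples) / df) if df else 0.0
        weights[feat] = [c * idf for c in row]
    return weights

def term_contribution(row):
    """Sum of w_i * w_j over all ordered pairs of distinct samples (i != j)."""
    total = sum(row)
    # (sum w)^2 counts every pair including i == j, so subtract the squares.
    return total * total - sum(w * w for w in row)

weights = tf_idf({"f1": [1, 1, 0, 0], "f2": [3, 3, 3, 3]}, 4)
for feat in sorted(weights):
    print(feat, term_contribution(weights[feat]))
```

With this weighting, a feature present in every sample gets an IDF of 0 and therefore a TC of 0, while a feature confined to a single sample also gets a TC of 0 because it has no pair of non-zero weights.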
Step 8: Rank the TC values of the merged features
./TC_ranking.py -f xxx_TF-IDF_TC.txt -n N
Step 9: Select the merged features whose TC values are greater than the inflection point
./feature_selecting -f xxx_MergedFeatureMatrix.txt -t xxx_TF-IDF_TC_topRanking.txt -o xxx_SelectedFeatureMatrix.txt
Step 10: Calculate the dissimilarity between samples
- Frequency type
./scalar_producDis.py -f xxx_SelectedFeatureMatrix.txt -s xxx_samplelist.txt
- Boolean type
./hammingDis.py -f xxx_SelectedFeatureMatrix.txt -s xxx_samplelist.txt
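For the Boolean type, the Hamming distance between two samples is the number of selected features that are present in one sample and absent in the other. The Python sketch below illustrates this definition; it is not the code of hammingDis.py, the function names are hypothetical, and whether the script normalizes the distance by the number of features is not stated in this manual.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length Boolean vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def pairwise_hamming(columns):
    """columns: dict mapping sample name -> Boolean feature vector.

    Returns a dict mapping each unordered sample pair (p, q), p < q,
    to its Hamming distance.
    """
    names = sorted(columns)
    return {(p, q): hamming_distance(columns[p], columns[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

samples = {"A": [1, 0, 1, 1], "B": [1, 1, 0, 1], "C": [1, 0, 1, 1]}
for pair, dist in sorted(pairwise_hamming(samples).items()):
    print(pair, dist)
```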