Spark kmer computation
Published: 2019-06-14


  • Input file: a file in fa (FASTA) format
  • Output: the kmer frequencies, plus each kmer and its count
    1. Convert the fq.gz file into a fa file:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os
import gzip
import sys

# Convert a fastq file (plain or gzipped) into a dict of {read id: sequence},
# then write that dict out in fa format.
# Which transformation will be needed later? filter or map come to mind first.
# Example of one 4-line fastq record:
'''
@r261DFRAAXX100204:1:100:10494:3070/1
ACTGCATCCTGGAAAGAATCAATGGTGGCCGGAAAGTGTTTTTCAAATACAAGAGTGACAATGTGCCCTGTTGTTT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCACCCCCCCCCCCCCCCCCCCCCCCC
'''

# Read a gzip-compressed fastq file directly, yielding one line at a time
def read_gz_file(path):
    if os.path.exists(path):
        with gzip.open(path, 'rt') as pf:
            for line in pf:
                yield line
    else:
        print('the path [{}] does not exist!'.format(path))

# Parse the fastq into {read id: sequence}; flag % 4 tracks the position
# inside each 4-line record (1 = header, 2 = sequence, 3 = '+', 0 = quality)
def ReadFastq(fastq):
    flag = 1
    dict_fq = {}
    if fastq.endswith('gz'):
        lines = read_gz_file(fastq)
    else:
        lines = open(fastq, 'r')
    for line in lines:
        line = line.strip()
        if flag % 4 == 1:
            id = line
        elif flag % 4 == 2:
            dict_fq[id] = line
        flag += 1
    return dict_fq

# Write the dict out as fa: replace the leading '@' of the fastq header with
# '>' so the downstream Spark script can filter header lines on '>'
def convert_to_fa(dict_hash, output):
    with open(output, 'w') as fr:
        for read_id in dict_hash.keys():
            fr.write('>' + read_id.lstrip('@') + '\n')
            fr.write(dict_hash[read_id] + '\n')

if __name__ == '__main__':
    input = sys.argv[1]
    output = sys.argv[2]
    dic_fa = ReadFastq(input)
    convert_to_fa(dic_fa, output)
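To sanity-check the conversion, the two functions above can be exercised on a tiny gzipped fastq written on the fly. The sketch below assumes the script has been saved as fq2fa.py (a file name used here only for illustration) so that it can be imported; the toy reads are made up.

# Minimal check of the fastq -> fa conversion.
# fq2fa.py is a hypothetical name for the script above, saved in the current directory.
import gzip
from fq2fa import ReadFastq, convert_to_fa

records = [
    ("@read1/1", "ACTGCATCCTGGAAAGAATCAATGG", "+", "IIIIIIIIIIIIIIIIIIIIIIIII"),
    ("@read2/1", "TTTTCAAATACAAGAGTGACAATGT", "+", "IIIIIIIIIIIIIIIIIIIIIIIII"),
]
with gzip.open("toy.fq.gz", "wt") as fh:
    for rec in records:
        fh.write("\n".join(rec) + "\n")

convert_to_fa(ReadFastq("toy.fq.gz"), "toy.fa")
with open("toy.fa") as fa:
    print(fa.read())   # expect two '>' headers, each followed by its sequence

From the command line the script itself would be run the same way: python fq2fa.py both.fq.gz both.fa (with both.fq.gz standing in for the real input).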

2. Break the reads into kmers and count the kmer frequencies

#!/usr/bin/env python
# coding=utf-8
from pyspark import SparkConf, SparkContext

input_fasta_file = '/home/yueyao/Spark/00.data/both.fa'
conf = SparkConf().setMaster("local").setAppName("Yue Yao app")
sc = SparkContext(conf=conf)
fasta_file = sc.textFile(input_fasta_file)

# Transform the fasta file: filter out the read-name (header) lines
reads_fa = fasta_file.filter(lambda line: ">" not in line)

# Break a read into kmers (k = 25 here) and return them as a list
def map_file(line):
    seq_lis = []
    for i in range(len(line) - 25 + 1):
        sub_seq = line[i:i + 25]
        seq_lis.append(sub_seq)
    return seq_lis

kmer_list = reads_fa.flatMap(map_file)

# Count every kmer: (kmer, 1) pairs reduced by key
kmer_count = kmer_list.map(lambda kmer: (kmer, 1))
kmer_total_count = kmer_count.reduceByKey(lambda a, b: a + b)

# Filter out kmers that contain N
kmer_not_contain_N = kmer_total_count.filter(lambda pair: "N" not in pair[0])
kmer_key = kmer_not_contain_N.keys()

# Kmer histogram: for each frequency, count how many distinct kmers have it
kmer_vari_count = kmer_not_contain_N.map(lambda kmer_vari: (kmer_vari[1], 1))
kmer_histo = kmer_vari_count.reduceByKey(lambda a, b: a + b)

# Write out the kmer frequency results
kmer_histo.saveAsTextFile('Kmer25.histo')
kmer_not_contain_N.saveAsTextFile('kmer25')
kmer_key.saveAsTextFile('kmer25_key')
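The counting logic itself can be checked locally without Spark. The sketch below mirrors the flatMap / reduceByKey steps with collections.Counter on a single made-up read, using k = 5 instead of 25 so the output stays small.

# Local sketch of the kmer pipeline (toy data, k = 5, no Spark)
from collections import Counter

k = 5
read = "ACGTACGTACGTNACGT"   # made-up read; the N exercises the filter step

# flatMap(map_file): slide a window of length k over the read
kmers = [read[i:i + k] for i in range(len(read) - k + 1)]

# map(kmer -> (kmer, 1)) + reduceByKey: count each kmer
kmer_counts = Counter(kmers)

# filter: drop kmers containing N
kmer_counts = {km: c for km, c in kmer_counts.items() if "N" not in km}

# map(count -> (count, 1)) + reduceByKey: the kmer frequency histogram
histo = Counter(kmer_counts.values())

print(kmer_counts)   # {'ACGTA': 2, 'CGTAC': 2, 'GTACG': 2, 'TACGT': 2}
print(histo)         # Counter({2: 4}): 4 distinct kmers, each seen twice

The Spark job writes the same two kinds of result with saveAsTextFile: the (kmer, count) pairs and the (frequency, number_of_kmers) histogram. On a cluster it would normally be launched with spark-submit instead of running with setMaster("local").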

Reposted from: https://www.cnblogs.com/raisok/p/10917678.html
