Commit c691f287 authored by Gökçe Aydos
clarifications

parent 1ca940f7
Pipeline #13829 passed with stage
in 6 minutes and 39 seconds
%% Cell type:markdown id:f5187258 tags:
As a programmer in a biomedical company, you often work with sequencing data. One day you decide to analyze the raw data that you get from one of the sequencers. The sequencer generates sequences of `A`, `G`, `C`, `T`, where each sequence is 60 to 65 nucleotides long. After some preprocessing, the data look like this:
```python
[
'CATGAGCGCTGGTGAGGGTAGTCATCAGCGTAAGCTGCGGGGTGTATGCCTCGACTACAG',
'ACCGCATGTCGCCGTACGCCTGCTTATGCCAGCACAAGGACTCGGCAACTATATGAATTATGA',
'CAACACCGATGGTGCTAGCCTACTTTGTGGGCATTCCGCAGGAGAGGCCTTGGCACCTCTA',
# ...
]
```
You notice that you do not have access to actual data from the sequencer. As a lazy but smart programmer you decide to generate some sample data in the previous format.
# Q (10p)
You write code which generates 50 sequences which are 60 to 65 nucleotides long and assigns the list of sequences to the variable `reads`. In other words, a read is a string of nucleotides.
%% Cell type:code id:4cac9ab0 tags:
``` python
### BEGIN SOLUTION
import random
# randrange(60, 66) yields a length from 60 to 65 inclusive
reads = [
    ''.join(random.choices('AGCT', k=random.randrange(60, 66)))
    for _ in range(50)
]
### END SOLUTION
reads
```
%% Output
['TAAATAGAGAGATGCATGCATGGAGCAATCAGGCTAGGGAAGATATTCCCGTCCCGGTTAAAACT',
'TGGAGGGACGACGGGAACCGTGGTCAATCCCAGGAACTTGGCGTATAGGCGCATAGCTAGCGAAT',
'GTGCCCTTGCCAGAATACGAGACTCTCAGTTCGCTTGACCCATTTTAGGGGTAAGGAGTAAAAT',
'CAGCGCATACTCGCGGCCTACTCGGCTTTACCACGCTCACTATTACGGGACAGGTTACATAAC',
'ACTTAGAAGGTGCTGTCTGTCAGTTGCTTACCATCGGGTACCTGGCGTGCTACTGTATATACGCT',
'TGGAGCAGTGGCCTATCACGAAGCCCGAGGAAACCCCAACCTCTATGAACACTCACGACATTCGG',
'TTTCGTATGTTCGCTCCAAGTCAACCTTCGAACGGTGGAGGACTTTTTCACCATGCCGTGACAGC',
'TAGAGACTGTATACGCTATGCCATAGAAACCTCCGCAGTGCCCGATCAATACCATGATGCCAGC',
'TGAAGTCTCGTTAGCATAAGCTAGTTTGGCTCATAACTCAGATTGTTGACTCCCGGGCGCC',
'TTCGAAACCGGTACCGTTCCGGGCTAAAGCCAGGAATCAGCGCTTCCTTGTATCTTGCCTTTTG',
'GTACTACAAGTAGAACGCACATGTTGACGTGCGACCTCACTAGCTCTGCACGTACCCGGGGT',
'ATACGAACAGAATTGCTGAACCGATTGATGAGTGTGGTGAAGTTAATGACTTATTAATCC',
'CCCAACCAATTAAAATCCCAATAAACAGGAAGCCAGCTATACTAATCCTCCACACCAAGATCTC',
'GAGAGACCATGCTCGGTCGGTTGCGAAGAGTCGCGCACCGGATTTACTACGGGCCAATATG',
'CAGGATGATTACTGGCGCGTAACCAGCTGTACTTTACTTTTCTAGACAAGAGGTGGTATGC',
'CGCGTCTAAACCCGGTGGGCATCACTGCCAGAAACCGGTGTGGTACCTAGACGCTGACCG',
'CTAGGCGTAATTTCGCACAGAGATTGCTTCCTCAATTGAAGCCTGGCAGGTCGACTGTTCG',
'ACAGCGTCAATTTTATGGGAATAGCTGACGCTTCGTGGCTATGAACGAGACGTCGCTTTTAA',
'GCGTCCTCAAGGTCGCCTGTATTTGCGTGTGTCTGGAACGGGTCTCGTACGTCTATTTTATGA',
'TAATTGTCCCCAACTATACTTATATGGCACTAACATACGCACGTAAAAGAGTAAACCGGGT',
'TGGTCGTTTTGTGGTCCGAACCTGAGGGCGCTTTGGACCATAGCTCTGCCATCTTAGTCCT',
'ATGGTGACTGAAGCTCTCCCTCGGGGTACGCTGCCCATATCCTGGAGTACGGACTCAATCCTCC',
'TACATCTAGAAGTGAGCTTCGGATCGTATCGCGATTAAGGTTGACGGCTGCGCATACACCGATG',
'TGTAGTCGCGTCGAGACCTATCACCGACATCACCTGGGCGTGGATAATGACAAACCCAATC',
'TGCCCGATTCTAACTGAGGTGCCAATGAAGGTGTCTGTCCGGTTCACTGAGAAATCCGAGTTAAC',
'GTTTTAGCTCTGGGAGCCTAACACCCTCTCGCTGAACGTTTGCTTGTTGTCCGTTACTCCCAGG',
'ACCGGATCGGAACCAACCCGACACTACCGTTTGCTATCGGTAAGACTCAGCCAATAGCAAT',
'TTTCTGCTATTATAGAGCATGCGTTGGCATGTACAAGTATTGGGCAAATCAAGCTTTAGA',
'ATAAGTGCGCAGCTCAGCTTCCAAGATATAAGGATCCTACGAGTTCCGTTGCCAAGATACAAAC',
'AAGCCCTGTAAGACGTAAGTTCTTCATGGGAGACGCTACACTGTCCGTTATGGTGGACGAT',
'GAAGGGAATCCACATGTAATGCGATCGACATGAGTCCCTGCTGCCGATGGTCTGTGATGGCT',
'TTTCATGCACAAATAAGCCTATGATTGTCGCTCTAGAGTAGGTCTACGATTTAAATCTCCGTT',
'AGACATCACTCAGCTTTAGATTGTCACTATGAGGTAGGCGGGGAAGGGCCGAATCTACTGTATCG',
'ACACTAATAAGCGCTCTATTCAACTACTAATGACCGAACAAGCAATGCACCGCTTGCTGATCT',
'ACGATAACGAACGAGGAGTACGCTACCCCATTAACACCCATGATGAAGCAGTGAGGCTATCG',
'TCCGCGGTCCCAATTACCAGGCATCTCAGATCTAATTGTTAACTGTTGTGTGTTTCGCGAGGTGC',
'ATCGTTCCTGCGCCGCTGCAGCATTTAGTTATGCTGGTCCTGATGAATTTGAGCGAGGAAT',
'TGATTGAGTGGCTCGTGATCTATTCCAGGTTGCCTAGCAAAATTGGATATAAGTCCTGGGGGA',
'GAAAACAGACTGTGAGCACGACGCAGGACTAACACTTTATCGTGCGACGGCCGGTAGTCAGAGGC',
'GCGGTGTCTGGGAATAGTGCACAGATCAATGACTTCGGTTGCACGAATTATACAGATACTCG',
'CCAAATGAAATGCAGTATTCTCCTCGGTCATCCCTGATGTGGTCCAGAGAGATGGCTGAGCGC',
'TATCGCTACTTGTACCTACACACAATAACCGCCCTAGAAGAACGGACTGCAACTTTTGTCTGT',
'TTCACCAAGGCTGTAAAACATTCATCTCGCTCATATGTAGGTTCACCGGCGACACCAAGA',
'CGAACACGGACATCCCCGAATCTTTATTCGTCAGAATTTTTTGTTAGGGTCTAGCATTCAACC',
'TGGGTGAAGGGACAGGGAGAAATATCTCAACTCCTCAGCACGCCTGTGGCGCCAAAACCT',
'AGCTAACAGGGGGGTCGATGCCATTTTCTTCAGATCATATCGGTATTCGCTTGGATCGAC',
'ATCCTGGTGGTCCAGCAACAGAACACTAACTGGAATCAAATCTCCGCCCTTGGGTTGCACCGCA',
'GCTAGAAATGCGCTAGCTTACCGTACTAGCGGGCAAGCCTTCACTCAGTTCTGTTAAGCTGAT',
'TGGAGTATGGGGGTAGAAACTATCCCGTGGCAGTCACATTTGATCGTACGATGCGCCTTGATT',
'AATCGGCAGGTACGTTTACTTCGCGGATCGAATTACGTGGTTCTACATGACGTCGCGCGT']
%% Cell type:markdown id:55b9e10c tags:
You test your code:
- There should be 50 sequences
%% Cell type:code id:980265c2 tags:
``` python
assert len(reads) == 50
```
%% Cell type:markdown id:3e094d4a tags:
- The sequences should be roughly random; in other words, the sequences should be different from each other
%% Cell type:code id:fe8d25c3 tags:
``` python
assert len(reads) == len(set(reads))
```
%% Cell type:markdown id:dac7dd63 tags:
- Each sequence length should be 60 to 65
%% Cell type:code id:16e06f35 tags:
``` python
assert all(
    60 <= len(read) <= 65
    for read in reads
)
```
%% Cell type:markdown id:4cbb6e51 tags:
- The lengths of the random sequences should vary; in other words, every possible length from 60 to 65 should occur at least once
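
Because the lengths are random, this check can fail by chance even for correct code. The following sketch estimates how often, assuming (as in a `random.randrange(60, 66)` based generator) that each of the six lengths is equally likely for every read. The function name `prob_all_lengths_seen` is hypothetical and only used for this illustration.

```python
from math import comb

def prob_all_lengths_seen(n_reads=50, n_lengths=6):
    """Probability that n_reads uniform draws cover all n_lengths values."""
    # Inclusion-exclusion over the number k of lengths that never occur
    return sum(
        (-1) ** k * comb(n_lengths, k) * ((n_lengths - k) / n_lengths) ** n_reads
        for k in range(n_lengths + 1)
    )

p = prob_all_lengths_seen()
print(f"P(all six lengths occur) = {p:.6f}")
print(f"P(check fails by chance) = {1 - p:.6f}")
```

Under this assumption the failure probability is well below 1%, so a failing check is far more likely to point to a bug than to bad luck.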
%% Cell type:code id:fb465add tags:
``` python
assert set(len(read) for read in reads) == set(range(60, 66))
```
%% Cell type:markdown id:f75d3863 tags:
- The sequences should only contain `AGCT`
%% Cell type:code id:3ec8854c tags:
``` python
assert all(
    # subset test: no characters other than A, G, C, T are allowed
    set(read) <= set('AGCT') for read in reads
)
```
%% Cell type:markdown id:5738b545 tags:
🎊 Sweet! Now you have some simulated data for your next data analysis endeavors.
(If you could not generate random sequences, then use the sequences from the example and assign them to `reads`).
## Q (10p)
You want to search for complementary DNA strands in these sequences. Two sequences are complementary if `A` meets `T` and `G` meets `C` (and vice versa) when you put the two sequences side by side, *regardless of the sequence direction*. In other words, your function should also return `True` in the case of a reverse complement. Your code should also accept sequences of different lengths and cut away the rightmost nucleotides of the longer sequence. Refer to the examples in the test cells.
First you begin by writing a function which tests if two sequences are complementary.
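
To make the two directions concrete, here is a small sketch that builds the complement and the reverse complement of a short sequence. The helpers `complement` and `reverse_complement` are hypothetical names for this illustration and are not required by the exercise.

```python
# Hypothetical helpers for illustration; they are not part of the exercise code
COMPLEMENT = str.maketrans('AGCT', 'TCGA')

def complement(seq):
    # Swap A<->T and G<->C position by position
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    # Complement first, then read the result backwards
    return complement(seq)[::-1]

print(complement('ACC'))          # TGG
print(reverse_complement('ACC'))  # GGT
```

Both `'TGG'` and `'GGT'` should therefore count as complementary to `'ACC'`.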
%% Cell type:code id:a52adaf5 tags:
``` python
def are_complementary(read1, read2):
    ### BEGIN SOLUTION
    complements = dict(A='T', G='C', C='G', T='A')
    # zip() stops at the shorter sequence, so surplus nucleotides are ignored
    if all(n1 == complements[n2] for n1, n2 in zip(read1, read2)):
        return True
    # Otherwise compare against read2 read backwards (reverse complement)
    elif all(n1 == complements[n2] for n1, n2 in zip(read1, read2[::-1])):
        return True
    else:
        return False
    ### END SOLUTION
```
%% Cell type:markdown id:cb6b47c0 tags:
You test your function:
- for sequences of the same length (not reverse complement)
%% Cell type:code id:afeedb74 tags:
``` python
assert are_complementary('A', 'T')
assert are_complementary('T', 'A')
assert are_complementary('G', 'C')
assert are_complementary('C', 'G')
assert not are_complementary('A', 'G')
assert not are_complementary('A', 'C')
assert are_complementary('AC', 'TG')
assert not are_complementary('AC', 'TA')
assert are_complementary('ACC', 'TGG')
assert are_complementary(
'GTAAACGGAGTAACTGTTCTAGTCTATTTGATATTCCTGTGCCACCATGGATAAGACTAAACT',
'CATTTGCCTCATTGACAAGATCAGATAAACTATAAGGACACGGTGGTACCTATTCTGATTTGA'
)
### BEGIN HIDDEN TESTS
def _complement(nucleotides):
    d = dict(A='T', G='C', C='G', T='A')
    return ''.join(d[n] for n in nucleotides)
import random
_rand_seq = ''.join(random.choices('AGCT', k=100))
assert are_complementary(_rand_seq, _complement(_rand_seq))
### END HIDDEN TESTS
```
%% Cell type:markdown id:ed4ebaac tags:
- for sequences where one of the sequences is read backwards (reverse complement, same length)
%% Cell type:code id:e8dbde0f tags:
``` python
assert are_complementary('AC', 'GT')
assert are_complementary('ACC', 'GGT')
assert are_complementary(
'GTAAACGGAGTAACTGTTCTAGTCTATTTGATATTCCTGTGCCACCATGGATAAGACTAAACT',
'AGTTTAGTCTTATCCATGGTGGCACAGGAATATCAAATAGACTAGAACAGTTACTCCGTTTAC'
)
```
%% Cell type:markdown id:5a47ff06 tags:
- for sequences that have a different length. For simplicity, you only align them at the beginning (including the backwards direction)
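
Aligning at the beginning can be pictured with Python's built-in `zip()`, which pairs items from the left and stops at the shorter input. This is only a sketch of one possible way to handle the length difference. The variable names are chosen for this illustration.

```python
short, longer = 'AC', 'TGAAAA'

# Forward direction: pair positions from the left, the surplus 'AAAA' is dropped
forward_pairs = list(zip(short, longer))
print(forward_pairs)    # [('A', 'T'), ('C', 'G')]

# Backward direction: reverse the second sequence first, then pair from the left
backward_pairs = list(zip('AG', 'ACT'[::-1]))
print(backward_pairs)   # [('A', 'T'), ('G', 'C')]
```

Note that in the backward direction the nucleotides dropped by this approach are the rightmost ones of the *reversed* sequence, that is, the first `A` of `'ACT'` as originally written.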
%% Cell type:code id:e3776614 tags:
``` python
assert are_complementary('AC', 'TGA')
assert are_complementary('AC', 'TGAAAA')
assert are_complementary('AC', 'T')
assert are_complementary('ACC', 'TGGA')
assert are_complementary('ACC', 'TGGAAAA')
### BEGIN HIDDEN TESTS
_rand_seq = ''.join(random.choices('AGCT', k=random.randrange(60, 66)))
assert are_complementary(_rand_seq, _complement(_rand_seq))
### END HIDDEN TESTS
```
%% Cell type:markdown id:a0f8c87f tags:
Finally you can search for complementary sequences in your simulated data!
It's lunchtime and you want to leave the rest of the analysis for the afternoon. Right after standing up, you notice that you will very likely not find any complementary sequences in your data.
# Q (3p)
What could be the reason? Just use your intuition.
%% Cell type:markdown id:d94c17d9 tags:
We generate random sequences which are at least 60 nucleotides long. For two of them to be complementary, the second sequence would have to contain exactly the matching nucleotide at each of at least 60 positions, which has a probability of at most $\left(\frac{1}{4}\right)^{60} \approx 7.5 \cdot 10^{-37}$. This is very unlikely.
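
The estimate can be checked numerically. The sketch below assumes independent, uniformly distributed nucleotides and uses a simple union bound over all pairs of the 50 reads and both directions, so the last number is an upper bound rather than an exact probability.

```python
from math import comb

p_pair = (1 / 4) ** 60     # one alignment of two reads over 60 positions
n_pairs = comb(50, 2)      # number of unordered pairs among 50 reads
# Union bound: each pair can match forwards or backwards
p_any_upper = 2 * n_pairs * p_pair

print(f"{p_pair:.2e}")       # 7.52e-37
print(n_pairs)               # 1225
print(f"{p_any_upper:.2e}")  # about 1.8e-33
```

Even after accounting for all 1225 pairs, a chance match remains practically impossible.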
%% Cell type:markdown id:6f453b35 tags:
# Q (5p)
After noticing the problem, you decide to add two sequences to your simulated sequences `reads`, where
- the first one is complementary to the first sequence in `reads` and has the same length (51st sequence)
- the second one is complementary to the second sequence in `reads` and two nucleotides longer (52nd sequence)
%% Cell type:code id:d2740ba7 tags:
``` python
### BEGIN SOLUTION
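d = dict(A='T', G='C', C='G', T='A')
# 51st read: complement of the first read, same length
reads.append(''.join(d[nuc] for nuc in reads[0]))
# 52nd read: complement of the second read plus two random nucleotides
reads.append(
    ''.join(d[nuc] for nuc in reads[1])
    + random.choice('AGCT') + random.choice('AGCT')
)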
### END SOLUTION
reads
```
%% Output
['TAAATAGAGAGATGCATGCATGGAGCAATCAGGCTAGGGAAGATATTCCCGTCCCGGTTAAAACT',
'TGGAGGGACGACGGGAACCGTGGTCAATCCCAGGAACTTGGCGTATAGGCGCATAGCTAGCGAAT',
'GTGCCCTTGCCAGAATACGAGACTCTCAGTTCGCTTGACCCATTTTAGGGGTAAGGAGTAAAAT',
'CAGCGCATACTCGCGGCCTACTCGGCTTTACCACGCTCACTATTACGGGACAGGTTACATAAC',
'ACTTAGAAGGTGCTGTCTGTCAGTTGCTTACCATCGGGTACCTGGCGTGCTACTGTATATACGCT',
'TGGAGCAGTGGCCTATCACGAAGCCCGAGGAAACCCCAACCTCTATGAACACTCACGACATTCGG',
'TTTCGTATGTTCGCTCCAAGTCAACCTTCGAACGGTGGAGGACTTTTTCACCATGCCGTGACAGC',
'TAGAGACTGTATACGCTATGCCATAGAAACCTCCGCAGTGCCCGATCAATACCATGATGCCAGC',
'TGAAGTCTCGTTAGCATAAGCTAGTTTGGCTCATAACTCAGATTGTTGACTCCCGGGCGCC',
'TTCGAAACCGGTACCGTTCCGGGCTAAAGCCAGGAATCAGCGCTTCCTTGTATCTTGCCTTTTG',
'GTACTACAAGTAGAACGCACATGTTGACGTGCGACCTCACTAGCTCTGCACGTACCCGGGGT',
'ATACGAACAGAATTGCTGAACCGATTGATGAGTGTGGTGAAGTTAATGACTTATTAATCC',
'CCCAACCAATTAAAATCCCAATAAACAGGAAGCCAGCTATACTAATCCTCCACACCAAGATCTC',
'GAGAGACCATGCTCGGTCGGTTGCGAAGAGTCGCGCACCGGATTTACTACGGGCCAATATG',
'CAGGATGATTACTGGCGCGTAACCAGCTGTACTTTACTTTTCTAGACAAGAGGTGGTATGC',
'CGCGTCTAAACCCGGTGGGCATCACTGCCAGAAACCGGTGTGGTACCTAGACGCTGACCG',
'CTAGGCGTAATTTCGCACAGAGATTGCTTCCTCAATTGAAGCCTGGCAGGTCGACTGTTCG',
'ACAGCGTCAATTTTATGGGAATAGCTGACGCTTCGTGGCTATGAACGAGACGTCGCTTTTAA',
'GCGTCCTCAAGGTCGCCTGTATTTGCGTGTGTCTGGAACGGGTCTCGTACGTCTATTTTATGA',
'TAATTGTCCCCAACTATACTTATATGGCACTAACATACGCACGTAAAAGAGTAAACCGGGT',
'TGGTCGTTTTGTGGTCCGAACCTGAGGGCGCTTTGGACCATAGCTCTGCCATCTTAGTCCT',
'ATGGTGACTGAAGCTCTCCCTCGGGGTACGCTGCCCATATCCTGGAGTACGGACTCAATCCTCC',
'TACATCTAGAAGTGAGCTTCGGATCGTATCGCGATTAAGGTTGACGGCTGCGCATACACCGATG',
'TGTAGTCGCGTCGAGACCTATCACCGACATCACCTGGGCGTGGATAATGACAAACCCAATC',
'TGCCCGATTCTAACTGAGGTGCCAATGAAGGTGTCTGTCCGGTTCACTGAGAAATCCGAGTTAAC',
'GTTTTAGCTCTGGGAGCCTAACACCCTCTCGCTGAACGTTTGCTTGTTGTCCGTTACTCCCAGG',
'ACCGGATCGGAACCAACCCGACACTACCGTTTGCTATCGGTAAGACTCAGCCAATAGCAAT',
'TTTCTGCTATTATAGAGCATGCGTTGGCATGTACAAGTATTGGGCAAATCAAGCTTTAGA',
'ATAAGTGCGCAGCTCAGCTTCCAAGATATAAGGATCCTACGAGTTCCGTTGCCAAGATACAAAC',
'AAGCCCTGTAAGACGTAAGTTCTTCATGGGAGACGCTACACTGTCCGTTATGGTGGACGAT',
'GAAGGGAATCCACATGTAATGCGATCGACATGAGTCCCTGCTGCCGATGGTCTGTGATGGCT',
'TTTCATGCACAAATAAGCCTATGATTGTCGCTCTAGAGTAGGTCTACGATTTAAATCTCCGTT',
'AGACATCACTCAGCTTTAGATTGTCACTATGAGGTAGGCGGGGAAGGGCCGAATCTACTGTATCG',
'ACACTAATAAGCGCTCTATTCAACTACTAATGACCGAACAAGCAATGCACCGCTTGCTGATCT',
'ACGATAACGAACGAGGAGTACGCTACCCCATTAACACCCATGATGAAGCAGTGAGGCTATCG',
'TCCGCGGTCCCAATTACCAGGCATCTCAGATCTAATTGTTAACTGTTGTGTGTTTCGCGAGGTGC',
'ATCGTTCCTGCGCCGCTGCAGCATTTAGTTATGCTGGTCCTGATGAATTTGAGCGAGGAAT',
'TGATTGAGTGGCTCGTGATCTATTCCAGGTTGCCTAGCAAAATTGGATATAAGTCCTGGGGGA',
'GAAAACAGACTGTGAGCACGACGCAGGACTAACACTTTATCGTGCGACGGCCGGTAGTCAGAGGC',
'GCGGTGTCTGGGAATAGTGCACAGATCAATGACTTCGGTTGCACGAATTATACAGATACTCG',
'CCAAATGAAATGCAGTATTCTCCTCGGTCATCCCTGATGTGGTCCAGAGAGATGGCTGAGCGC',
'TATCGCTACTTGTACCTACACACAATAACCGCCCTAGAAGAACGGACTGCAACTTTTGTCTGT',
'TTCACCAAGGCTGTAAAACATTCATCTCGCTCATATGTAGGTTCACCGGCGACACCAAGA',
'CGAACACGGACATCCCCGAATCTTTATTCGTCAGAATTTTTTGTTAGGGTCTAGCATTCAACC',
'TGGGTGAAGGGACAGGGAGAAATATCTCAACTCCTCAGCACGCCTGTGGCGCCAAAACCT',
'AGCTAACAGGGGGGTCGATGCCATTTTCTTCAGATCATATCGGTATTCGCTTGGATCGAC',
'ATCCTGGTGGTCCAGCAACAGAACACTAACTGGAATCAAATCTCCGCCCTTGGGTTGCACCGCA',
'GCTAGAAATGCGCTAGCTTACCGTACTAGCGGGCAAGCCTTCACTCAGTTCTGTTAAGCTGAT',
'TGGAGTATGGGGGTAGAAACTATCCCGTGGCAGTCACATTTGATCGTACGATGCGCCTTGATT',
'AATCGGCAGGTACGTTTACTTCGCGGATCGAATTACGTGGTTCTACATGACGTCGCGCGT',
'ATTTATCTCTCTACGTACGTACCTCGTTAGTCCGATCCCTTCTATAAGGGCAGGGCCAATTTTGA',
'ACCTCCCTGCTGCCCTTGGCACCAGTTAGGGTCCTTGAACCGCATATCCGCGTATCGATCGCTTAAG']
%% Cell type:markdown id:7a04c83a tags:
You test your code:
- there should be 52 sequences in `reads`
%% Cell type:code id:b9db42ce tags:
``` python
assert len(reads) == 52
```
%% Cell type:markdown id:ad879d35 tags:
- the 1st and 51st sequences should be complementary
%% Cell type:code id:4a24d9e8 tags:
``` python
assert are_complementary(reads[0], reads[50])
```
%% Cell type:markdown id:5e2494c4 tags:
- the 2nd and 52nd sequences should be complementary
%% Cell type:code id:7d66f0a9 tags:
``` python
assert are_complementary(reads[1], reads[51])
```
%% Cell type:markdown id:d7b0ddfd tags:
- the two sequences of the first pair should have the same length
%% Cell type:code id:f1c17e8d tags:
``` python
assert len(reads[0]) == len(reads[50])
```
%% Cell type:markdown id:f63df1ba tags:
- the 52nd sequence should have two more nucleotides than the second sequence
%% Cell type:code id:9934c9cf tags:
``` python
assert len(reads[51]) - len(reads[1]) == 2
```
%% Cell type:markdown id:3db079ac tags:
Great! 🙌
Even though you are hungry and have a bit of a headache, you want to test your function `are_complementary()` on your simulated dataset `reads`.
## Q (10p)
Write the function `complementary_pairs_in()` which finds all pairs of complementary sequences. The output should be a set of index pair tuples, e.g., `{ (0, 50), (1, 51) }`.
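
One possible way to enumerate every pair of indices exactly once is `itertools.combinations` over `enumerate(...)`. The sketch below only lists the candidate index pairs for a toy list and does not test complementarity. The list `toy` is made up for this illustration.

```python
import itertools
from math import comb

toy = ['AG', 'TCA', 'ACT']

# Every unordered pair appears once, with the smaller index first
index_pairs = [
    (i, j)
    for (i, _), (j, _) in itertools.combinations(enumerate(toy), 2)
]
print(index_pairs)   # [(0, 1), (0, 2), (1, 2)]
print(comb(52, 2))   # 1326 pairs to check for 52 reads
```

Carrying the index along with each sequence avoids looking it up afterwards, which also keeps the indices correct if the same sequence occurs twice.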
%% Cell type:code id:76011951 tags:
``` python
def complementary_pairs_in(sequences):
    ### BEGIN SOLUTION
    import itertools
    # enumerate() keeps each sequence's own index, so duplicate sequences
    # are not mapped to the index of their first occurrence
    return {
        (i, j)
        for (i, read1), (j, read2)
        in itertools.combinations(enumerate(sequences), 2)
        if are_complementary(read1, read2)
    }
    ### END SOLUTION
```
%% Cell type:markdown id:ae9004fd tags:
You test your function on `reads`:
%% Cell type:code id:373b0b8e tags:
``` python
print(complementary_pairs_in(reads))
assert complementary_pairs_in(reads) == {(0,50), (1, 51)}
```
%% Cell type:markdown id:727d58de tags:
You double-check your function by using another dataset:
%% Cell type:code id:2edfab22 tags:
``` python
complementary_test_reads = [
'AG',
'TCA',
'ACT',
'AAAA',
]
print(complementary_pairs_in(complementary_test_reads))
assert complementary_pairs_in(complementary_test_reads) == {
(0, 1),
(0, 2),
}
```
%% Output
{(0, 1), (0, 2)}
%% Cell type:markdown id:7d461f6d tags:
🎉
Feeling more confident with Python than half an hour ago, you go to lunch. It was worth the headache!