{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# An introduction to solving biological problems with Python\n", "\n", "## Session 1.3: Collections Lists and Strings\n", "\n", "- [Tuples](#Tuples), [Lists](#Lists) and [Manipulating tuples and lists](#Manipulating-tuples-and-lists) | [Exercise 1.3.1](#Exercise-1.3.1)\n", "- [String manipulations](#String-manipulations) | [Exercise 1.3.2](#Exercise-1.3.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as the basic data types we introduced earlier, you will very commonly want to store and operate on collections of values, and Python has several _data structures_ that you can use to do this. The general idea is that you can place several items into a single collection and then refer to that collection as a whole. Which one you use will depend on the problem you are trying to solve.\n", "\n", "## Tuples\n", "\n", "- Can contain any number of items\n", "- Can contain different types of items\n", "- __Cannot__ be altered once created (they are _immutable_)\n", "- Items have a defined order\n", "\n", "A tuple is created by using round brackets around the items it contains, with commas separating the individual elements." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = (123, 54, 92) # tuple of 3 integers\n", "b = () # empty tuple\n", "c = (\"Ala\",) # tuple of a single string (note the trailing \",\")\n", "d = (2, 3, False, \"Arg\", None) # a tuple of mixed types\n", "\n", "print(a)\n", "print(b)\n", "print(c)\n", "print(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can of course use variables in tuples and other data structures:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = 1.2\n", "y = -0.3\n", "z = 0.9\n", "t = (x, y, z)\n", "\n", "print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tuples can be _packed_ and _unpacked_ with a convenient syntax. The number of variables used to unpack the tuple must match the number of elements in the tuple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = 2, 3, 4 # tuple packing\n", "print('t is', t)\n", "x, y, z = t # tuple unpacking\n", "print('x is', x)\n", "print('y is', y)\n", "print('z is', z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lists\n", "\n", "- Can contain any number of items\n", "- Can contain different types of items\n", "- __Can__ be altered once created (they are _mutable_)\n", "- Items have a particular order\n", "\n", "Lists are created with square brackets around their items:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [1, 3, 9]\n", "b = [\"ATG\"]\n", "c = []\n", "\n", "print(a)\n", "print(b)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists and tuples can contain other lists and tuples, or any other type of collection:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matrix = [[1, 0], [0, 2]]\n", "print(matrix)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "You can convert between tuples and lists with the `tuple()` and `list()` functions. Note that these create a new collection with the same items, and leave the original unaffected." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = (1, 4, 9, 16) # A tuple of numbers\n", "b = ['G','C','A','T'] # A list of characters\n", "\n", "print(a)\n", "print(b)\n", "\n", "l = list(a) # Make a list based on a tuple\n", "print(l)\n", "\n", "t = tuple(b) # Make a tuple based on a list\n", "print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating tuples and lists\n", "\n", "Once your data is in a list or tuple, Python supports a number of ways you can access elements of the list and manipulate the list in useful ways, such as sorting the data.\n", "\n", "Tuples and lists can generally be used in very similar ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index access\n", "\n", "You can access individual elements of the collection using their _index_. Note that the first element is at index 0. Negative indices count backwards from the end, so index -1 is the last element." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = (123, 54, 92, 87, 33)\n", "x = [123, 54, 92, 87, 33]\n", "\n", "print('t is', t)\n", "print('t[0] is', t[0])\n", "print('t[2] is', t[2])\n", "\n", "print('x is', x)\n", "print('x[-1] is', x[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slices\n", "\n", "You can also access a range of items, known as a _slice_, from inside lists and tuples using a colon `:` to indicate the beginning and end of the slice inside the square brackets. **Note that the slice notation `[a:b]` includes positions from `a` up to _but not including_ `b`**." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = (123, 54, 92, 87, 33)\n", "x = [123, 54, 92, 87, 33]\n", "print('t[1:3] is', t[1:3])\n", "print('x[2:] is', x[2:])\n", "print('x[:-1] is', x[:-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `in` operator\n", "You can check if a value is in a tuple or list with the `in` operator, and you can negate this with `not`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = (123, 54, 92, 87, 33)\n", "x = [123, 54, 92, 87, 33]\n", "print('123 in', x, 123 in x)\n", "print('234 in', t, 234 in t)\n", "print('999 not in', x, 999 not in x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `len()` and `count()` functions\n", "You can get the length of a list or tuple with the built-in `len()` function, and you can count how many times a particular value occurs in a list or tuple with the `.count()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = (123, 54, 92, 87, 33)\n", "x = [123, 54, 92, 87, 33]\n", "print(\"length of t is\", len(t))\n", "print(\"number of 33s in x is\", x.count(33))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying lists\n", "You can alter lists in place, but not tuples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = [123, 54, 92, 87, 33]\n", "print(x)\n", "x[2] = 33\n", "print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tuples _cannot_ be altered once they have been created. If you try to do so, you'll get an error." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = (123, 54, 92, 87, 33)\n", "print(t)\n", "t[1] = 4 # raises TypeError: tuples are immutable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add elements to the end of a list with `append()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = [123, 54, 92, 87, 33]\n", "x.append(101)\n", "print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or insert values at a certain position with `insert()`, by supplying the desired position as well as the new value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = [123, 54, 92, 87, 33]\n", "x.insert(3, 1111)\n", "print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can remove the first occurrence of a value with `remove()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = [123, 54, 92, 87, 33]\n", "x.remove(123)\n", "print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and delete values by index with `del`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = [123, 54, 92, 87, 33]\n", "print(x)\n", "del x[0]\n", "print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's often useful to be able to combine lists, which can be done with `extend()`. Using `append()` here would instead add the whole second list as a single element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [1,2,3]\n", "b = [4,5,6]\n", "a.extend(b)\n", "print(a)\n", "a.append(b)\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plus symbol `+` concatenates two lists. 
Unlike `extend()`, it builds a new list and leaves both operands unchanged, so the result has to be assigned to a variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [1, 2, 3]\n", "b = [4, 5, 6]\n", "a = a + b\n", "print(a)" ] }, { 
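"cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of how `+` and `extend()` differ (the names `first`, `second`, `alias` and `combined` are made up for this illustration): `+` builds a new list and leaves its operands alone, while `extend()` changes the existing list in place, so every name that refers to that list sees the change." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first = [1, 2, 3]\n", "second = [4, 5, 6]\n", "alias = first # a second name for the same list object\n", "\n", "combined = first + second # + builds a brand new list\n", "print(combined) # [1, 2, 3, 4, 5, 6]\n", "print(first) # [1, 2, 3] (unchanged)\n", "\n", "first.extend(second) # extend() modifies the existing list in place\n", "print(first) # [1, 2, 3, 4, 5, 6]\n", "print(alias) # [1, 2, 3, 4, 5, 6] (same object as first)\n", "print(second) # [4, 5, 6] (unchanged)" ] }, {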
"cell_type": "markdown", "metadata": {}, "source": [ "Slice syntax can be used on the left hand side of an assignment to replace a subregion of a list. The replacement does not need to be the same length as the slice it replaces:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [1, 2, 3, 4, 5, 6]\n", "a[1:3] = [9, 9, 9, 9]\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can change the order of elements in a list with `reverse()` and `sort()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [1, 3, 5, 4, 2]\n", "a.reverse()\n", "print(a)\n", "a.sort()\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that both of these change the list itself. If you want a sorted copy of the list while leaving the original untouched, use `sorted()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = [2, 5, 7, 1]\n", "b = sorted(a)\n", "print(a)\n", "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting help from the official Python documentation\n", "\n", "The most useful information is online on the https://www.python.org/ website, which should be used as a reference guide.\n", "\n", "- [Python 3 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3\n", " - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)\n", " - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the documentation of all libraries included within Python as well as built-in functions and data types like:\n", " - [Text Sequence Type — `str`](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)\n", " - [Numeric Types — `int`, `float`](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex)\n", " - [Sequence Types — `list`, 
`tuple`](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range)\n", " - [Set Types — `set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)\n", " - [Mapping Types — `dict`](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict)\n", " \n", "### Getting help directly from within Python using `help()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(list.insert)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(list.count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.3.1\n", "\n", "1. Create a list of DNA codons for the protein sequence CLYSY based on the codon variables you defined previously.\n", "2. Print the DNA sequence of the protein to the screen.\n", "3. Print the DNA codon of the last amino acid in the protein sequence.\n", "4. Create two more variables containing the DNA sequence of a stop codon and a start codon, and replace the first element of the DNA sequence with the start codon and append the stop codon to the end of the DNA sequence. Print out the resulting DNA sequence." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String manipulations\n", "\n", "Strings are a lot like tuples of characters, and individual characters and substrings can be accessed and manipulated using operations similar to those we introduced above.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = \"ATGTCATTTGT\"\n", "print(text[0])\n", "print(text[-2])\n", "print(text[0:6])\n", "print(\"ATG\" in text)\n", "print(\"TGA\" in text)\n", "print(len(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as with tuples, trying to assign a value to an element of a string results in an error, because strings are immutable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = \"ATGTCATTTGT\"\n", "text[0:2] = \"CCC\" # raises TypeError: strings are immutable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python provides a number of useful methods that let you manipulate strings.\n", "\n", "The `in` operator lets you check if a substring is contained within a larger string, but it does not tell you where the substring is located. This is often useful to know, and Python provides the `.find()` method, which returns the index of the first occurrence of the search string, and the `.rfind()` method, which returns the index of the last occurrence.\n", "\n", "If the search string is not found in the string, both these methods return -1." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dna = \"ATGTCACCGTTT\"\n", "index = dna.find(\"TCA\")\n", "print(\"TCA is at position:\", index)\n", "index = dna.rfind('C')\n", "print(\"The last Cytosine is at position:\", index)\n", "print(\"Position of a stop codon:\", dna.find(\"TGA\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we are reading text from files (which we will see later on), often there is unwanted whitespace at the start or end of the string. 
We can remove leading whitespace with the `.lstrip()` method, trailing whitespace with `.rstrip()`, and whitespace from both ends with `.strip()`.\n", "\n", "All of these methods return a changed copy of the string and leave the original untouched, so if you want to replace the original you can assign the result of the method call to the original variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = \" Chromosome Start End \"\n", "print(len(s), s)\n", "s = s.lstrip()\n", "print(len(s), s)\n", "s = s.rstrip()\n", "print(len(s), s)\n", "s = \" Chromosome Start End \"\n", "s = s.strip()\n", "print(len(s), s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can split a string into a list of substrings using the `.split()` method, supplying the delimiter as an argument to the method. If you don't supply any delimiter, the method will split the string on whitespace by default (which is very often what you want!)\n", "\n", "To split a string into its component characters you can simply convert the string to a list with `list()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seq = \"ATG TCA CCG GGC\"\n", "codons = seq.split(\" \")\n", "print(codons)\n", "\n", "bases = list(seq) # a string converted into a list of characters (spaces included)\n", "print(bases)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.split()` is the counterpart to the `.join()` method, which joins the elements of a list into a single string. 
This only works if all the elements are strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seq = \"ATG TCA CCG GGC\"\n", "codons = seq.split(\" \")\n", "print(codons)\n", "print(\"|\".join(codons))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also saw earlier that the `+` operator lets you concatenate strings together into a larger string.\n", "\n", "Note that concatenation with `+` only works when both values are strings. 
If you want to concatenate a string with an integer (or some other type), you first have to convert the integer to a string with the `str()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = \"chr\"\n", "chrom_number = 2\n", "print(s + str(chrom_number))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find out more about the `split()` and `join()` methods, you can look them up in the online Python documentation, starting from [www.python.org](http://www.python.org), or use the built-in `help()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(str.split)\n", "help(str.join)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.3.2\n", "\n", "1. Create a string variable with your full name in it, with your first and last name (and any middle names) separated by a space. Split the string into a list, and print out your surname.\n", "2. Check if your surname contains the letter \"E\", and print out the position of this letter in the string. Try a few other letters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next session\n", "\n", "Go to our next notebook: [python_basic_1_4](python_basic_1_4.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }