diff --git a/intro1.ipynb b/intro1.ipynb new file mode 100644 index 0000000..e9b4084 --- /dev/null +++ b/intro1.ipynb @@ -0,0 +1,2408 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "nbpresent": { + "id": "dc7a1635-0bbd-4bf7-a07e-7a36f58e258b" + }, + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# An introduction to solving biological problems with Python: 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "nbpresent": { + "id": "21082cb9-e1b9-4fe9-80d5-9d9e8418937b" + }, + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Learning objectives\n", + "- **Recall** how to print, create variables and save Python code in files\n", + "- **List** the most common data types in Python\n", + "- **Explain** how to use different types of collections\n", + "- **Use and compare** these concepts in different code examples \n", + "- **Propose and create** solutions using these concepts in different exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 1.1: Variables\n", + "\n", + "-----\n", + "\n", + " - Printing values\n", + " - Using variables\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Printing values\n", + "\n", + "The first bit of Python syntax we're going to learn is the `print()` function. It lets us print messages to the user, and also see what Python thinks is the value of some expression (very useful when debugging your programs).\n", + "\n", + "We will go into details later on, but for now just note that to print some text you have to enclose it in \"quotation marks\". \n", + "\n", + "We will go into detail on the arithmetic operations supported in Python shortly, but you can try exploring Python's calculating abilities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(\"Hello from python!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(34)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(2 + 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "To print multiple expressions, separate them with commas. Python will insert a space between each element and a newline at the end of the message (in Python 3 you can suppress the newline by passing `end=''` to `print()`; the trailing-comma trick only works with the Python 2 print statement)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "print(\"The answer:\", 42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 1.1.1\n", + "\n", + "1. In Jupyter, insert a new cell below this one to print your name. Execute the code by pressing `run cell` from the menu bar or use your keyboard `Ctrl-Enter`.\n", + "2. Now do the same using the interpreter" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Using variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "-" + } + }, + "source": [ + "In the print commands above we have directly operated on values such as text strings and numbers. 
When programming we will typically want to deal with rather more complex expressions where it is useful to be able to assign a name to an expression, especially if we are trying to deal with multiple values at the same time.\n", + "\n", + "We can give a name to a value using _variables_; the name is apt because the values stored in a variable can _vary_. Unlike some other languages, the type of value assigned to a variable can also change (this is one of the reasons why Python is known as a _dynamic_ language).\n", + "\n", + "A variable can be assigned a simple value..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 3\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "... or the outcome of a more complex expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 2 + 2\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A variable can be called whatever you like, as long as the name starts with a letter or an underscore, contains only letters, digits and underscores (no spaces), and is not a reserved keyword; it should also be meaningful. You assign a value to a variable with the **`=` operator**. Note that this is different to mathematical equality (which we will come to later...)\n", + "\n", + "You can print a variable to see what Python thinks its current value is." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "serine = \"TCA\"\n", + "print(serine, \"codes for serine\")\n", + "serine = \"TCG\"\n", + "print(\"as does\", serine)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the interactive interpreter you don't have to print everything; if you type a variable name (or just a value), the interpreter will automatically print out what Python thinks the value is. Note though that this is not the case if your code is in a file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "3 + 4" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 5\n", + "3 * x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Variables can be used on the right hand side of an assignment as well, in which case they will be evaluated before the value is assigned to the variable on the left hand side." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 5\n", + "y = x * 3\n", + "print(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or just `y` in the interpreter and in Jupyter notebook" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use the current value of a variable itself in an assignment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y = y + 1\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In fact this is such a common idiom that there are special operators that will do this implicitly (more on these later)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y += 1\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 1.1.2\n", + "\n", + "In the interpreter:\n", + "\n", + "1. Create a variable and assign it the string value of your first name, assign your age to another variable (you are free to lie!), print out a message saying how old you are\n", + "2. 
Use the addition operator to add 10 to your age and print out a message saying how old you will be in 10 years time" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Part 1.2: Simple data types\n", + "\n", + "-----\n", + "\n", + " - Simple data types\n", + " - Comments\n", + " - Arithmetic\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Simple data types" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python (and computers in general) treats different types of data differently. Python has 4 main basic data types. Types are useful to constrain some operations to a certain category of variables. For example it doesn't really make sense to try to divide a string.\n", + "\n", + "We will see some examples of these in use shortly, but for now let's see all of the basic types available in python." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Integers\n", + "\n", + "Integers represent whole numbers, as you would use when counting items, and can be positive or negative." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "i = -7\n", + "j = 123\n", + "print(i, j)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Floats\n", + "\n", + "Floating point numbers, often simply referred to as floats, are numbers expressed in the decimal system, i.e. 2.1, 999.998, -0.000004 etc. The value 2.0 would also be interpreted as a floating point number, but the value 2, without the decimal point will not; it will be interpreted as an integer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 3.14159\n", + "y = -42.3\n", + "print(x * y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Floating point numbers can also carry an e suffix that states which power of ten they operate at." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "k = 1.5e3\n", + "l = 3e-2\n", + "print(k)\n", + "print(l)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Strings\n", + "\n", + "Strings represent text, i.e. \"strings\" of characters. They can be delimited by single quotes or double quotes , but you have to use the same delimiter at both ends. Unlike some programming languages, such as Perl, there is no difference between the two types of quote, although using one type does allow the other type to appear inside the string as a regular character.\n", + "\n", + "Normally a python statement ends at the end of the line, but if you want to type a string over several lines you can enclose it in triple quotation marks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "s = \"ATGTCGTCTACAACACT\"\n", + "t = 'Serine'\n", + "u = \"It's a string with apostrophes\"\n", + "v = \"\"\"A string that extends\n", + "over multiple lines\"\"\"\n", + "print(v)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Booleans\n", + "\n", + "Boolean values represent truth or falsehood, as used in logical operations, for example. Not surprisingly, there are only two values, and in Python they are called True and False." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "a = True\n", + "b = False\n", + "print(a, b)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The None object\n", + "\n", + "The None object is a special built-in value which can be thought of as **representing nothingness or that something is undefined**. For example, it can be used to indicate that a variable exists, but has not yet been set to anything specific." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "z = None\n", + "print(z)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Object type\n", + "\n", + "You can check what type Python thinks an expression is with the `type()` function, which you call by writing the name `type` immediately followed by parentheses enclosing the expression you want to check (either a variable or a value), e.g. `type(3)`. 
(This is the general form for calling functions; we'll see lots more examples of functions later...)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "a = True\n", + "print(a, \"is of\", type(a))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "i = -7\n", + "print(i, \"is of\", type(i))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 12.7893\n", + "print(x, \"is of\", type(x))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "s = \"ATGTCGTCTACAACACT\"\n", + "print(s, \"is of\", type(s))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "z = None\n", + "print(z, \"is of\", type(z))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comments\n", + "\n", + "When you are writing a program it is often convenient to annotate your code to remind you what you intended it to do. In programming these annotations are known as _comments_. You can include a comment in Python by prefixing some text with a # character. All text following the # on that line will then be ignored by the interpreter. You can start a comment on its own line, or you can include it at the end of a line of code.\n", + "\n", + "It is also often useful to temporarily remove some code from a script without deleting it. This is known as _commenting out_ some code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print(\"Hi\") # this comment will be ignored\n", + "# as will this\n", + "print(\"Bye\")\n", + "# print(\"Never seen\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Arithmetic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python supports all the standard arithmetical operations on numerical types, and mostly uses a similar syntax to several other computer languages:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 4.5\n", + "y = 2\n", + "\n", + "print('x', x, 'y', y)\n", + "print('addition x + y =', x + y) \n", + "print('subtraction x - y =', x - y) \n", + "print('multiplication x * y =', x * y) \n", + "print('division x / y =', x / y) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 4.5\n", + "y = 2\n", + "\n", + "print('x', x, 'y', y)\n", + "print('division x / y =', x / y)\n", + "print('floored division x // y =', x // y) \n", + "print('modulus (remainder of x/y) x % y =', x % y) \n", + "print('exponentiation x ** y =', x ** y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As usual in maths, division and multiplication have higher precedence than addition and subtraction, but arithmetic expressions can be grouped using parentheses to override the default precedence" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 13\n", + "y = 5\n", + "\n", + "print('x * (2 + y) =', x * (2 + y))\n", + "print('(x * 2) + y =', (x * 2) + y)\n", + "print('x * 2 + y =', x * 2 + y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can mix (some) types in 
arithmetic expressions and Python will apply rules to decide the type of the result\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "13 + 5.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can force Python to use a particular type by converting an expression explicitly, using helpfully named functions: float, int, str etc." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "float(3) + float(7)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "int(3.14159) + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The addition operator `+` also allows you to concatenate strings together." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print('number' + str(3))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Division in Python 2 sometimes trips up new (and experienced!) programmers: if you divide two integers you only get an integer result, so to get a floating point result you have to explicitly cast at least one of the arguments to a float. In Python 3, `/` always returns a floating point result and `//` performs floored division." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print(\"3/4 =\", 3/4) # 0.75 in Python 3; in Python 2 you would get 0\n", + "print(\"3.0/4 =\", 3.0/4)\n", + "print(\"float(3)/4 =\", float(3)/4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are a few shortcut assignment statements that make modifying variables directly faster to type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 3\n", + "x += 1 # equivalent to x = x + 1\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 2\n", + "y = 10\n", + "y *= x\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These shortcut (augmented assignment) operators are available for all the arithmetic and bitwise operators, e.g. `-=`, `/=`, `//=`, `%=` and `**=`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 1.2.1\n", + "\n", + "In the interpreter:\n", + "\n", + "Assign numerical values to 2 variables, calculate the mean of these two variables and store the result in another variable. Print out the result to the screen." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 1.2.2\n", + "\n", + "Create a new Python file to solve these exercises. It is good practice to create a new file each time you solve a new problem.\n", + "\n", + "1. Look up the genetic code. Create four string variables that store possible DNA encodings of serine (S), leucine (L), tyrosine (Y) and cysteine (C). Where multiple codings are available, just pick one for now.\n", + "2. Create a variable containing a possible DNA sequence for the protein sequence SYLYC. (Note that the addition operator + allows you to concatenate strings together.) Print the DNA sequence.\n", + "3. 
Include a comment in your file to remind you of the purpose of the script." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Part 1.3\n", + "\n", + "-------\n", + "\n", + " - Tuples\n", + " - Lists\n", + " - Manipulating tuples and lists\n", + " - String manipulations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As well as the basic data types we introduced above, very commonly you will want to store and operate on collections of values, and Python has several _data structures_ that you can use to do this. The general idea is that you can place several items into a single collection and then refer to that collection as a whole. Which one you use will depend on what problem you are trying to solve.\n", + "\n", + "## Tuples\n", + "\n", + "- Can contain any number of items\n", + "- Can contain different types of items\n", + "- __Cannot__ be altered once created (they are _immutable_)\n", + "- Items have a defined order\n", + "\n", + "A tuple is created by using round brackets around the items it contains, with commas separating the individual elements." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = (123, 54, 92) # tuple of 3 integers\n", + "b = () # empty tuple\n", + "c = (\"Ala\",) # tuple of a single string (note the trailing \",\")\n", + "d = (2, 3, False, \"Arg\", None) # a tuple of mixed types\n", + "\n", + "print(a)\n", + "print(b)\n", + "print(c)\n", + "print(d)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can of course use variables in tuples and other data structures" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 1.2\n", + "y = -0.3\n", + "z = 0.9\n", + "t = (x, y, z)\n", + "\n", + "print(t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tuples can be _packed_ and _unpacked_ with a convenient syntax. The number of variables used to unpack the tuple must match the number of elements in the tuple." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = 2, 3, 4 # tuple packing\n", + "print('t is', t)\n", + "x, y, z = t # tuple unpacking\n", + "print('x is', x)\n", + "print('y is', y)\n", + "print('z is', z)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lists\n", + "\n", + "- Can contain any number of items\n", + "- Can contain different types of items\n", + "- __Can__ be altered once created (they are _mutable_)\n", + "- Items have a particular order\n", + "\n", + "Lists are created with square brackets around their items:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 3, 9]\n", + "b = [\"ATG\"]\n", + "c = []\n", + "\n", + "print(a)\n", + "print(b)\n", + "print(c)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lists and tuples can contain other lists and tuples, or any other type of collection:" + ] + }, 
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "matrix = [[1, 0], [0, 2]]\n", + "print(matrix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can convert between tuples and lists with the tuple and list functions. Note that these create a new collection with the same items, and leave the original unaffected." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = (1, 4, 9, 16) # A tuple of numbers\n", + "b = ['G','C','A','T'] # A list of characters\n", + "\n", + "print(a)\n", + "print(b)\n", + "\n", + "l = list(a) # Make a list based on a tuple \n", + "print(l)\n", + "\n", + "t = tuple(b) # Make a tuple based on a list\n", + "print(t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manipulating tuples and lists\n", + "\n", + "Once your data is in a list or tuple, python supports a number of ways you can access elements of the list and manipulate the list in useful ways, such as sorting the data.\n", + "\n", + "Tuples and lists can generally be used in very similar ways." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Index access\n", + "\n", + "You can access individual elements of the collection using their _index_, note that the first element is at index 0. Negative indices count backwards from the end." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (123, 54, 92, 87, 33)\n", + "x = [123, 54, 92, 87, 33]\n", + "\n", + "print('t is', t)\n", + "print('t[0] is', t[0])\n", + "print('t[2] is', t[2])\n", + "\n", + "print('x is', x)\n", + "print('x[-1] is', x[-1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Slices\n", + "\n", + "You can also access a range of items, known as a _slice_, from inside lists and tuples using a colon `:` to indicate the beginning and end of the slice inside the square brackets. **Note that the slice notation `[a:b]` includes positions from `a` up to _but not including_ `b`**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (123, 54, 92, 87, 33)\n", + "x = [123, 54, 92, 87, 33]\n", + "print('t[1:3] is', t[1:3])\n", + "print('x[2:] is', x[2:])\n", + "print('x[:-1] is', x[:-1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `in` operator\n", + "You can check if a value is in a tuple or list with the `in` operator, and you can negate this with `not in`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (123, 54, 92, 87, 33)\n", + "x = [123, 54, 92, 87, 33]\n", + "print('123 in', x, 123 in x)\n", + "print('234 in', t, 234 in t)\n", + "print('999 not in', x, 999 not in x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `len()` function and `count()` method\n", + "You can get the length of a list or tuple with the built-in `len()` function, and you can count how many times a particular element occurs in a list or tuple with the `.count()` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (123, 54, 92, 87, 33)\n", + "x = [123, 54, 92, 87, 33]\n", + "print(\"length of t is\", len(t))\n", + "print(\"number of 33s in x is\", x.count(33))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Modifying lists\n", + "You can alter lists in place, but not tuples" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "print(x)\n", + "x[2] = 33\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tuples _cannot_ be altered once they have been created, if you try to do so, you'll get an error." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (123, 54, 92, 87, 33)\n", + "print(t)\n", + "t[1] = 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can add elements to the end of a list with append()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "x.append(101)\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or insert values at a certain position with insert(), by supplying the desired position as well as the new value" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "x.insert(3, 1111)\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can remove values with remove()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "x.remove(123)\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": 
[ + "and delete values by index with del" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "print(x)\n", + "del x[0]\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It's often useful to be able to combine lists together, which can be done with extend() (whereas append() would add the whole list as a single element in the list)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3]\n", + "b = [4, 5, 6]\n", + "a.extend(b) # adds each element of b to a\n", + "print(a)\n", + "a.append(b) # adds b itself as a single nested element\n", + "print(a)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The plus symbol + also concatenates lists, but unlike extend() it creates a new list rather than modifying the original in place:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3]\n", + "b = [4, 5, 6]\n", + "a = a + b\n", + "print(a)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slice syntax can be used on the left hand side of an assignment operation to replace a subregion of a list (the replacement does not need to be the same length as the slice)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3, 4, 5, 6]\n", + "a[1:3] = [9, 9, 9, 9]\n", + "print(a)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can change the order of elements in a list" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 3, 5, 4, 2]\n", + "a.reverse()\n", + "print(a)\n", + "a.sort()\n", + "print(a)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that both of these change the list in place; if you want a sorted copy of the list while leaving the original untouched, use sorted()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, 
+ "outputs": [], + "source": [ + "a = [2, 5, 7, 1]\n", + "b = sorted(a)\n", + "print(a)\n", + "print(b)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Getting help from the official Python documentation\n", + "\n", + "The most useful information is online on https://www.python.org/ website and should be used as a reference guide.\n", + "\n", + "- [Python 3.5.2 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3\n", + " - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)\n", + " - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the documentation of all libraries included within Python as well as built-in functions and data types like:\n", + " - [Text Sequence Type — `str`](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)\n", + " - [Numeric Types — `int`, `float`](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex)\n", + " - [Sequence Types — `list`, `tuple`](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range)\n", + " - [Set Types — `set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)\n", + " - [Mapping Types — `dict`](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict)\n", + " \n", + "### Getting help directly from within Python using `help()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(len)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(list)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(list.insert)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(list.count)" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 1.3.1\n", + "\n", + "1. Create a list of DNA codons for the protein sequence CLYSY based on the codon variables you defined previously.\n", + "2. Print the DNA sequence of the protein to the screen.\n", + "3. Print the DNA codon of the last amino acid in the protein sequence.\n", + "4. Create two more variables containing the DNA sequence of a stop codon and a start codon, and replace the first element of the DNA sequence with the start codon and append the stop codon to the end of the DNA sequence. Print out the resulting DNA sequence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## String manipulations\n", + "\n", + "Strings are a lot like tuples of characters, and individual characters and substrings can be accessed and manipulated using similar operations we introduced above.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = \"ATGTCATTTGT\"\n", + "print(text[0])\n", + "print(text[-2])\n", + "print(text[0:6])\n", + "print(\"ATG\" in text)\n", + "print(\"TGA\" in text)\n", + "print(len(text))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just as with tuples, trying to assign a value to an element of a string results in an error" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text = \"ATGTCATTTGT\"\n", + "text[0:2] = \"CCC\" " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python provides a number of useful functions that let you manipulate strings\n", + "\n", + "The in operator lets you check if a substring is contained within a larger string, but it does not tell you where the substring is located. 
This is often useful to know, so Python provides the .find() method, which returns the index of the first occurrence of the search string, and the .rfind() method, which searches from the end of the string and returns the index of the last occurrence.\n", + "\n", + "If the search string is not found in the string, both these methods return -1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = \"ATGTCACCGTTT\"\n", + "index = dna.find(\"TCA\")\n", + "print(\"TCA is at position:\", index)\n", + "index = dna.rfind('C')\n", + "print(\"The last Cytosine is at position:\", index)\n", + "print(\"Position of a stop codon:\", dna.find(\"TGA\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we are reading text from files (which we will see later on), often there is unwanted whitespace at the start or end of the string. We can remove leading whitespace with the .lstrip() method, trailing whitespace with .rstrip(), and whitespace from both ends with .strip().\n", + "\n", + "All of these methods return a copy of the changed string, so if you want to replace the original you can assign the result of the method call to the original variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = \" Chromosome Start End \"\n", + "print(len(s), s)\n", + "s = s.lstrip()\n", + "print(len(s), s)\n", + "s = s.rstrip()\n", + "print(len(s), s)\n", + "s = \" Chromosome Start End \"\n", + "s = s.strip()\n", + "print(len(s), s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can split a string into a list of substrings using the .split() method, supplying the delimiter as an argument to the method. 
If you don't supply any delimiter the method will split the string on whitespace by default (which is very often what you want!)\n", + "\n", + "To split a string into its component characters you can simply _cast_ the string to a list " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seq = \"ATG TCA CCG GGC\"\n", + "codons = seq.split(\" \")\n", + "print(codons)\n", + "\n", + "bases = list(seq) # each character of the string (spaces included) becomes a list element\n", + "print(bases)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + ".split() is the counterpart of the .join() method, which joins the elements of a list into a single string; .join() only works if all the elements are strings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seq = \"ATG TCA CCG GGC\"\n", + "codons = seq.split(\" \")\n", + "print(codons)\n", + "print(\"|\".join(codons))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We also saw earlier that the + operator lets you concatenate strings together into a larger string.\n", + "\n", + "Note that + cannot join a string to a value of another type. If you want to concatenate a string with an integer (or some other type), first you have to cast the integer to a string with the str() function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = \"chr\"\n", + "chrom_number = 2\n", + "print(s + str(chrom_number))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To find out more about the `split()` and `join()` methods, we can look them up in the online Python documentation, starting from [www.python.org](http://www.python.org), or use the built-in `help()` function."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(str.split)\n", + "help(str.join)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 1.3.2\n", + "\n", + "1. Create a string variable with your full name in it, with your first and last name (and any middle names) separated by a space. Split the string into a list, and print out your surname.\n", + "2. Check if your surname contains the letter \"E\", and print out the position of this letter in the string. Try a few other letters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Part 1.4: Sets and dictionaries\n", + "\n", + "-------\n", + "\n", + " - Sets\n", + " - Dictionaries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sets\n", + "\n", + "- Sets contain unique elements, i.e. no repeats are allowed\n", + "- The elements in a set do not have an order\n", + "- Sets cannot contain elements which can be internally modified (e.g. 
lists and dictionaries)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "l = [1, 2, 3, 2, 3] # list of 5 values\n", + "s = set(l) # set of 3 unique values\n", + "print(s)\n", + "e = set() # empty set\n", + "print(e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sets are very similar to lists and tuples and you can use many of the same operators and functions, except they are **inherently unordered**, so they don't have an index, and can only contain _unique_ values, so adding a value already in the set will have no effect" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = set([1, 2, 3, 2, 3])\n", + "print(s)\n", + "print(\"number in set:\", len(s))\n", + "s.add(4)\n", + "print(s)\n", + "s.add(3)\n", + "print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can remove specific elements from the set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = set([1, 2, 3, 2, 3])\n", + "print(s)\n", + "s.remove(3)\n", + "print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can do all the expected logical operations on sets, such as taking the union or intersection of 2 sets with the | _or_ and & _and_ operators " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s1 = set([2, 4, 6, 8, 10])\n", + "s2 = set([4, 5, 6, 7])\n", + "\n", + "print(\"Union:\", s1 | s2)\n", + "print(\"Intersection:\", s1 & s2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 1.4.1\n", + "\n", + "1. Given the protein sequence \"MPISEPTFFEIF\", split the sequence into its component amino acid codes and use a set to establish the unique amino acids in the protein and print out the result." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dictionaries\n", + "\n", + "Lists are useful in many contexts, but often we have some data that has no inherent order and that we want to access by some useful name rather than an index. For example, as a result of some experiment we may have a set of genes and corresponding expression values. We could put the expression values in a list, but then we'd have to remember which index in the list corresponded to which gene and this would quickly get complicated.\n", + "\n", + "For these situations a _dictionary_ is a very useful data structure.\n", + "\n", + "Dictionaries:\n", + "\n", + "- Contain a mapping of keys to values (like a word and its corresponding definition in a dictionary)\n", + "- The keys of a dictionary are unique, i.e. they cannot repeat\n", + "- The values of a dictionary can be of any data type\n", + "- The keys of a dictionary cannot be an internally modifiable type (e.g. lists, but you can use tuples)\n", + "- Dictionaries are accessed by key, not by position, so do not rely on the order of their entries (Python 3.7 and later keep insertion order; earlier versions make no guarantee)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "print(dna)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can access values in a dictionary using the key inside square brackets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "print(\"A represents\", dna[\"A\"])\n", + "print(\"G represents\", dna[\"G\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An error is triggered if a key is absent from the dictionary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = 
{\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "print(\"What about N?\", dna[\"N\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can access values safely with the .get() method, which gives back None if the key is absent; you can also supply a default value to return instead" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "print(\"What about N?\", dna.get(\"N\"))\n", + "print(\"With a default value:\", dna.get(\"N\", \"unknown\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can check if a key is in a dictionary with the in operator, and you can negate this with not" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "\"T\" in dna" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "\"Y\" not in dna" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The len() function gives back the number of (key, value) pairs in the dictionary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": \"Guanine\", \"T\": \"Thymine\"}\n", + "print(len(dna))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can introduce new entries in the dictionary by assigning a value with a new key:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {\"A\": \"Adenine\", \"C\": \"Cytosine\", \"G\": 
\"Guanine\", \"T\": \"Thymine\"}\n", + "dna['Y'] = 'Pyrimidine'\n", + "print(dna)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can change the value for an existing key by reassigning it:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}\n", + "dna['Y'] = 'Cytosine or Thymine'\n", + "print(dna)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can delete entries from the dictionary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}\n", + "del dna['Y']\n", + "print(dna)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can get a list of all the keys (in arbitrary order) using the inbuilt .keys() function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}\n", + "print(list(dna.keys()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And equivalently get a list of the values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}\n", + "print(list(dna.values()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And a list of tuples containing (key, value) pairs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}\n", + "print(list(dna.items()))" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 1.4.2\n", + "\n", + "1. Print out the names of the amino acids that would be produced by the DNA sequence \"GTT GCA CCA CAA CCG\" ([See the DNA codon table](https://en.wikipedia.org/wiki/DNA_codon_table)). Split this string into the individual codons and then use a dictionary to map between codon sequences and the amino acids they encode.\n", + "2. Print each codon and its corresponding amino acid.\n", + "3. Why couldn't we build a dictionary where the keys are names of amino acids and the values are the DNA codons?\n", + "\n", + "### Advanced exercise 1.4.3\n", + "\n", + "- Starting with an empty dictionary, count the abundance of different residue types present in the 1-letter lysozyme protein sequence (http://www.uniprot.org/uniprot/B2R4C5.fasta) and print the results to the screen in alphabetical key order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "anaconda-cloud": {}, + "celltoolbar": "Slideshow", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + }, + "nbpresent": { + "slides": { + "152c5a3b-78f9-4183-bce2-379a4012baf6": { + "id": "152c5a3b-78f9-4183-bce2-379a4012baf6", + "layout": "grid", + "prev": "5613e857-5b4e-42e4-9feb-df0440592ca2", + "regions": { + "20d6059c-7745-410d-a5fb-0b91cacbc2e2": { + "attrs": { + "height": 0.6666666666666666, + "pad": 0.01, + "treemap:weight": 1, + "width": 0.5, + "x": 0, + "y": 0 + }, + "id": "20d6059c-7745-410d-a5fb-0b91cacbc2e2" + }, + "300e6ccd-ecf4-425e-8574-3debe305aafb": { + "attrs": { + "height": 0.3333333333333333, + "pad": 0.01, + "treemap:weight": 1, + 
"width": 1, + "x": 0, + "y": 0.6666666666666666 + }, + "content": { + "cell": "9814e8d7-60e0-43e6-aee0-3c33cc2cc809", + "part": "whole" + }, + "id": "300e6ccd-ecf4-425e-8574-3debe305aafb" + }, + "df2dd6ff-570b-4b75-9cb7-1ff1dbdd4f55": { + "attrs": { + "height": 0.6666666666666666, + "pad": 0.01, + "treemap:weight": 1, + "width": 0.5, + "x": 0.5, + "y": 0 + }, + "id": "df2dd6ff-570b-4b75-9cb7-1ff1dbdd4f55" + } + } + }, + "2586ca7d-5091-40ea-b566-ccc5fbf833c6": { + "id": "2586ca7d-5091-40ea-b566-ccc5fbf833c6", + "prev": "f001d476-5814-4664-a722-f04f5d23cd52", + "regions": { + "d6011048-43db-4990-a82e-768683aa4fe5": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "ceb5f5a0-a5e8-435e-ae16-23c2ba8c6ab2", + "part": "whole" + }, + "id": "d6011048-43db-4990-a82e-768683aa4fe5" + } + } + }, + "27ee4130-d0bb-4287-b8fe-75a7b0ecf178": { + "id": "27ee4130-d0bb-4287-b8fe-75a7b0ecf178", + "prev": "2586ca7d-5091-40ea-b566-ccc5fbf833c6", + "regions": { + "7a689d66-0c9d-4492-928b-f35bfd2ffc4c": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "e6c2e441-eb7b-4a4c-9c9c-b88cc9a2527f", + "part": "whole" + }, + "id": "7a689d66-0c9d-4492-928b-f35bfd2ffc4c" + } + } + }, + "2de0c027-7a07-4f7e-8594-a98d36125372": { + "id": "2de0c027-7a07-4f7e-8594-a98d36125372", + "prev": "75e76bd9-24ae-4c42-b6bc-5f58a0550ba8", + "regions": { + "868fd842-e6fb-48b2-9ac5-95e8fe20927e": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "0e25dad3-add0-466e-8f71-e771d6ec4500", + "part": "whole" + }, + "id": "868fd842-e6fb-48b2-9ac5-95e8fe20927e" + } + } + }, + "5613e857-5b4e-42e4-9feb-df0440592ca2": { + "id": "5613e857-5b4e-42e4-9feb-df0440592ca2", + "prev": "564dae42-4185-46c1-b156-e503f475e25c", + "regions": { + "17e888b0-050b-406a-a5a3-0d5c1605b8df": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": 
"f5bcbcb5-4352-4674-a7b6-c8e576220422", + "part": "whole" + }, + "id": "17e888b0-050b-406a-a5a3-0d5c1605b8df" + } + } + }, + "564dae42-4185-46c1-b156-e503f475e25c": { + "id": "564dae42-4185-46c1-b156-e503f475e25c", + "prev": "ba285213-f645-4314-afd5-0a656fa35631", + "regions": { + "328d4d72-cd9e-4e5b-aaa8-175833f5bfdb": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "8a4ac456-6c4b-4249-8662-b1cabfd7cee4", + "part": "whole" + }, + "id": "328d4d72-cd9e-4e5b-aaa8-175833f5bfdb" + } + } + }, + "6ff94ac3-8ded-442e-ae43-aa0a5c14d468": { + "id": "6ff94ac3-8ded-442e-ae43-aa0a5c14d468", + "prev": "27ee4130-d0bb-4287-b8fe-75a7b0ecf178", + "regions": { + "ad759b3a-6080-4356-a9fd-87f2b1b90bc2": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "8458de53-35b5-405e-a372-5db5d2e2c2c5", + "part": "whole" + }, + "id": "ad759b3a-6080-4356-a9fd-87f2b1b90bc2" + } + } + }, + "75e76bd9-24ae-4c42-b6bc-5f58a0550ba8": { + "id": "75e76bd9-24ae-4c42-b6bc-5f58a0550ba8", + "prev": "152c5a3b-78f9-4183-bce2-379a4012baf6", + "regions": { + "4afd3b41-071f-44eb-a8f6-9a7f780041c2": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "62fdd00c-a006-4f11-b9dc-e2ca072225d7", + "part": "whole" + }, + "id": "4afd3b41-071f-44eb-a8f6-9a7f780041c2" + } + } + }, + "8c46fa2c-d5dc-4ef7-8d99-f504e2c3a4a1": { + "id": "8c46fa2c-d5dc-4ef7-8d99-f504e2c3a4a1", + "prev": "e2f5626f-0d60-47cb-967f-0edababb0329", + "regions": { + "af33776f-ec36-45be-a627-39573a78b1d6": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "0d61b4b4-163f-47fe-80f1-092287218273", + "part": "whole" + }, + "id": "af33776f-ec36-45be-a627-39573a78b1d6" + } + } + }, + "ae3f4c01-80dc-4add-889a-05c74f7155a5": { + "id": "ae3f4c01-80dc-4add-889a-05c74f7155a5", + "prev": "6ff94ac3-8ded-442e-ae43-aa0a5c14d468", + "regions": { + 
"15f00a98-7b04-439d-996d-851b773b060a": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "96ca5c44-2cfc-471c-8da7-39870c822e20", + "part": "whole" + }, + "id": "15f00a98-7b04-439d-996d-851b773b060a" + } + } + }, + "ba285213-f645-4314-afd5-0a656fa35631": { + "id": "ba285213-f645-4314-afd5-0a656fa35631", + "prev": "8c46fa2c-d5dc-4ef7-8d99-f504e2c3a4a1", + "regions": { + "6cddb9f2-8e39-4010-8fab-3e70b3a8993f": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "b878a4f9-4345-4abb-81f4-5a731c639ab8", + "part": "whole" + }, + "id": "6cddb9f2-8e39-4010-8fab-3e70b3a8993f" + } + } + }, + "cd587236-8a19-444d-8b18-69d782dbf725": { + "id": "cd587236-8a19-444d-8b18-69d782dbf725", + "prev": null, + "regions": { + "ef377bfe-ff45-49db-b471-f79ecb10b580": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "dc7a1635-0bbd-4bf7-a07e-7a36f58e258b", + "part": "whole" + }, + "id": "ef377bfe-ff45-49db-b471-f79ecb10b580" + } + } + }, + "e2f5626f-0d60-47cb-967f-0edababb0329": { + "id": "e2f5626f-0d60-47cb-967f-0edababb0329", + "prev": "ae3f4c01-80dc-4add-889a-05c74f7155a5", + "regions": { + "eef49fa0-0f9b-4228-8fb8-79e079bf7682": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "9110098b-9675-4d64-adf3-c947073d4c4d", + "part": "whole" + }, + "id": "eef49fa0-0f9b-4228-8fb8-79e079bf7682" + } + } + }, + "f001d476-5814-4664-a722-f04f5d23cd52": { + "id": "f001d476-5814-4664-a722-f04f5d23cd52", + "prev": "cd587236-8a19-444d-8b18-69d782dbf725", + "regions": { + "5a176076-c5a5-4b50-ab2c-9cd0baedad45": { + "attrs": { + "height": 0.8, + "width": 0.8, + "x": 0.1, + "y": 0.1 + }, + "content": { + "cell": "53eee250-b3d0-4262-ad09-e87fb2acf82e", + "part": "whole" + }, + "id": "5a176076-c5a5-4b50-ab2c-9cd0baedad45" + } + } + } + }, + "themes": { + "default": "c6b5d1ad-d691-4000-9f62-de7fc0e83644", + 
"theme": { + "586a6e7a-f661-4d6c-90d0-1392715bea27": { + "id": "586a6e7a-f661-4d6c-90d0-1392715bea27", + "palette": { + "19cc588f-0593-49c9-9f4b-e4d7cc113b1c": { + "id": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c", + "rgb": [ + 252, + 252, + 252 + ] + }, + "31af15d2-7e15-44c5-ab5e-e04b16a89eff": { + "id": "31af15d2-7e15-44c5-ab5e-e04b16a89eff", + "rgb": [ + 68, + 68, + 68 + ] + }, + "50f92c45-a630-455b-aec3-788680ec7410": { + "id": "50f92c45-a630-455b-aec3-788680ec7410", + "rgb": [ + 155, + 177, + 192 + ] + }, + "c5cc3653-2ee1-402a-aba2-7caae1da4f6c": { + "id": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "rgb": [ + 43, + 126, + 184 + ] + }, + "efa7f048-9acb-414c-8b04-a26811511a21": { + "id": "efa7f048-9acb-414c-8b04-a26811511a21", + "rgb": [ + 25.118061674008803, + 73.60176211453744, + 107.4819383259912 + ] + } + }, + "rules": { + "blockquote": { + "color": "50f92c45-a630-455b-aec3-788680ec7410" + }, + "code": { + "font-family": "Anonymous Pro" + }, + "h1": { + "color": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "font-family": "Lato", + "font-size": 8 + }, + "h2": { + "color": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "font-family": "Lato", + "font-size": 6 + }, + "h3": { + "color": "50f92c45-a630-455b-aec3-788680ec7410", + "font-family": "Lato", + "font-size": 5.5 + }, + "h4": { + "color": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "font-family": "Lato", + "font-size": 5 + }, + "h5": { + "font-family": "Lato" + }, + "h6": { + "font-family": "Lato" + }, + "h7": { + "font-family": "Lato" + }, + "pre": { + "font-family": "Anonymous Pro", + "font-size": 4 + } + }, + "text-base": { + "font-family": "Merriweather", + "font-size": 4 + } + }, + "c6b5d1ad-d691-4000-9f62-de7fc0e83644": { + "backgrounds": { + "dc7afa04-bf90-40b1-82a5-726e3cff5267": { + "background-color": "31af15d2-7e15-44c5-ab5e-e04b16a89eff", + "id": "dc7afa04-bf90-40b1-82a5-726e3cff5267" + } + }, + "id": "c6b5d1ad-d691-4000-9f62-de7fc0e83644", + "palette": { + "19cc588f-0593-49c9-9f4b-e4d7cc113b1c": { + "id": 
"19cc588f-0593-49c9-9f4b-e4d7cc113b1c", + "rgb": [ + 252, + 252, + 252 + ] + }, + "31af15d2-7e15-44c5-ab5e-e04b16a89eff": { + "id": "31af15d2-7e15-44c5-ab5e-e04b16a89eff", + "rgb": [ + 68, + 68, + 68 + ] + }, + "50f92c45-a630-455b-aec3-788680ec7410": { + "id": "50f92c45-a630-455b-aec3-788680ec7410", + "rgb": [ + 197, + 226, + 245 + ] + }, + "c5cc3653-2ee1-402a-aba2-7caae1da4f6c": { + "id": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "rgb": [ + 43, + 126, + 184 + ] + }, + "efa7f048-9acb-414c-8b04-a26811511a21": { + "id": "efa7f048-9acb-414c-8b04-a26811511a21", + "rgb": [ + 25.118061674008803, + 73.60176211453744, + 107.4819383259912 + ] + } + }, + "rules": { + "a": { + "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c" + }, + "blockquote": { + "color": "50f92c45-a630-455b-aec3-788680ec7410", + "font-size": 3 + }, + "code": { + "font-family": "Anonymous Pro" + }, + "h1": { + "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c", + "font-family": "Merriweather", + "font-size": 8 + }, + "h2": { + "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c", + "font-family": "Merriweather", + "font-size": 6 + }, + "h3": { + "color": "50f92c45-a630-455b-aec3-788680ec7410", + "font-family": "Lato", + "font-size": 5.5 + }, + "h4": { + "color": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c", + "font-family": "Lato", + "font-size": 5 + }, + "h5": { + "font-family": "Lato" + }, + "h6": { + "font-family": "Lato" + }, + "h7": { + "font-family": "Lato" + }, + "li": { + "color": "50f92c45-a630-455b-aec3-788680ec7410", + "font-size": 3.25 + }, + "pre": { + "font-family": "Anonymous Pro", + "font-size": 4 + } + }, + "text-base": { + "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c", + "font-family": "Lato", + "font-size": 4 + } + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/intro2.ipynb b/intro2.ipynb new file mode 100644 index 0000000..e1e4ce5 --- /dev/null +++ b/intro2.ipynb @@ -0,0 +1,1830 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + 
"slide_type": "slide" + } + }, + "source": [ + "# An introduction to solving biological problems with Python: 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Learning objectives\n", + "\n", + "- **Recall** what we've learned so far on variables, common data types and collections\n", + "- **Propose and create** solutions using these concepts in an exercise\n", + "- **Use** conditions to execute specific code block\n", + "- **Employ** loops to repeat code block\n", + "- **Practice** reading and writing files with Python\n", + "- **Solve** more complex exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Recap\n", + "\n", + "- Simple data types, Collections\n", + "- Functions used so far..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Simple data types" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Integer: 1\n", + "Float 3.14\n", + "True\n" + ] + } + ], + "source": [ + "## Integer\n", + "i = 1\n", + "print('Integer:', i)\n", + "## Float\n", + "x = 3.14\n", + "print('Float', x)\n", + "## Boolean\n", + "print(True)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ATGTCGTCTACAACACTspam's\n", + "ATGTCGTCTACAACACT spam's\n" + ] + } + ], + "source": [ + "## String\n", + "s0 = '' # empty string\n", + "s1 = 'ATGTCGTCTACAACACT' # single quotes\n", + "s2 = \"spam's\" # double quotes\n", + "print(s1 + s2) # concatenate\n", + "print(s1, s2) # print" + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Collections" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A tuple: (2, 3, 4, 5)\n", + "First element of tuple: 2\n" + ] + } + ], + "source": [ + "## Tuple - immutable\n", + "my_tuple = (2, 3, 4, 5)\n", + "print('A tuple:', my_tuple)\n", + "print('First element of tuple:', my_tuple[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A list: [2, 3, 4, 5]\n", + "First element of list: 2\n", + "Appended list: [2, 3, 4, 5, 12]\n", + "Modified list: [45, 3, 4, 5, 12]\n" + ] + } + ], + "source": [ + "## List\n", + "my_list = [2, 3, 4, 5]\n", + "print('A list:', my_list)\n", + "print('First element of list:', my_list[0])\n", + "my_list.append(12)\n", + "print('Appended list:', my_list)\n", + "my_list[0] = 45\n", + "print('Modified list:', my_list)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Here is a string: ATGTCATTT\n", + "First character: A\n", + "Slice text[1:3]: TG\n", + "Number of characters in text 9\n" + ] + } + ], + "source": [ + "## String - immutable, tuple of characters\n", + "text = \"ATGTCATTT\"\n", + "print('Here is a string:', text)\n", + "print('First character:', text[0])\n", + "print('Slice text[1:3]:', text[1:3])\n", + "print('Number of characters in text', len(text))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A set: {1, 2, 4, 5, 6}\n" 
+ ] + } + ], + "source": [ + "## Set - unique unordered elements\n", + "my_set = set([1,2,2,2,2,4,5,6,6,6])\n", + "print('A set:', my_set)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A dictionary: {'A': 'Adenine', 'C': 'Cytosine', 'G': 'Guanine', 'T': 'Thymine'}\n", + "Value associated to key C: Cytosine\n" + ] + } + ], + "source": [ + "## Dictionary\n", + "my_dictionary = {\"A\": \"Adenine\", \n", + " \"C\": \"Cytosine\", \n", + " \"G\": \"Guanine\", \n", + " \"T\": \"Thymine\"}\n", + "print('A dictionary:', my_dictionary)\n", + "print('Value associated to key C:', my_dictionary['C'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Functions used so far..." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "There are 5 elements in the list ['A', 'C', 'A', 'T', 'G']\n", + "There are 2 letter A in the list ['A', 'C', 'A', 'T', 'G']\n", + "['ATG', 'TCA', 'CCG', 'GGC']\n" + ] + } + ], + "source": [ + "my_list = ['A', 'C', 'A', 'T', 'G']\n", + "print('There are', len(my_list), 'elements in the list', my_list)\n", + "print('There are', my_list.count('A'), 'letter A in the list', my_list)\n", + "print(\"ATG TCA CCG GGC\".split())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "# Part 2.1: Conditional execution\n", + "\n", + "-----\n", + "\n", + " - Code blocks\n", + " - Conditional execution\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Program control and logic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A program will normally run by executing the stated commands, one after the other in sequential order. 
Frequently however, you will need the program to deviate from this. There are several ways of diverting from the line-by-line paradigm:\n", + "\n", + "- With conditional statements. Here you can check if some statement or expression is true, and if it is then you continue on with the following block of code, otherwise you might skip it or execute a different bit of code.\n", + "\n", + "- By performing repetitive loops through the same block of code, where each time through the loop different values may be used for the variables.\n", + "\n", + "- Through the use of functions (subroutines) where the program’s execution jumps from a particular line of code to an entirely different spot, even in a different file or module, to do a task before (usually) jumping back again. Functions are covered in the next session, so we will not discuss them yet.\n", + "\n", + "- By checking if an error or exception occurs, i.e. something illegal has happened, and executing different blocks of code accordingly" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Code blocks" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With all of the means by which Python code execution can jump about we naturally need to be aware of the boundaries of the block of code we jump into, so that it is clear at what point the job is done, and program execution can jump back again. In essence it is required that the end of a function, loop or conditional statement be defined, so that we know the bounds of their respective code blocks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python uses indentation to show which statements are in a block of code, other languages use specific `begin` and `end` statements or curly braces `{}`. 
It doesn't matter how much indentation you use, but the whole block must be consistent, i.e., if the first statement is indented by four spaces, the rest of the block must be indented by the same amount. The Python style guide recommends using 4-space indentation. Use spaces, rather than tabs, since different editors display tab characters with different widths.\n", + "\n", + "The use of indentation to delineate code blocks is illustrated in an abstract manner in the following scheme: \n", + "\n", + "Statement 1:\n", + "\n", + " Command A – in the block of statement 1\n", + " Command B – in the block of statement 1\n", + " \n", + " Statement 2:\n", + " Command C – in the block of statement 2\n", + " Command D – in the block of statement 2\n", + " \n", + " Command E – back in the block of statement 1\n", + "\n", + "Command F – outside all statement blocks\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conditional execution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The if statement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A conditional if statement is used to specify that some block of code should only be executed if some associated test is upheld; a conditional expression evaluates to True. This might also involve subsidiary checks using the elif statement to use an alternative block if the previous expression turns out to be False. There can even be a final else statement to do something if none of the checks are passed. 
\n", + "\n", + "The following uses statements that test whether a number is less than zero, greater than zero or otherwise equal to zero and will print out a different message in each case:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = -3\n", + "\n", + "if x > 0:\n", + " print(\"Value is positive\")\n", + "\n", + "elif x < 0:\n", + " print(\"Value is negative\")\n", + "\n", + "else:\n", + " print(\"Value is zero\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The general form of writing out such combined conditional statements is as follows:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n",
+    "if conditionalExpression1:\n",
+    "    # codeBlock1\n",
+    "\n",
+    "elif conditionalExpression2:\n",
+    "    # codeBlock2\n",
+    "\n",
+    "elif conditionalExpressionN:\n",
+    "    # codeBlockN\n",
+    "    +any number of additional elif statements, then finally:\n",
+    "\n",
+    "else:\n",
+    "    # codeBlockE\n",
+    "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The elif block is optional, and we can use as many as we like. The else block is also optional, so will only have the if statement, which is a fairly common situation. It is often good practice to include else where possible though, so that you always catch cases that do not pass, otherwise values might go unnoticed, which might not be the desired behaviour." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Placeholders are needed for “empty” code blocks:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "gene = \"BRCA2\"\n", + "geneExpression = -1.2\n", + "\n", + "if geneExpression < 0:\n", + " print(gene, \"is downregulated\")\n", + " \n", + "elif geneExpression > 0:\n", + " print(gene, \"is upregulated\")\n", + " \n", + "else:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For very simple conditional checks, you can write the `if` statement on a single line as a single expression, and the result will be the expression before the `if` if the condition is true or the expression after the `else` otherwise.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = 11\n", + "\n", + "if x < 10:\n", + " s = \"Yes\"\n", + "else:\n", + " s = \"No\"\n", + "print(s)\n", + "\n", + "# Could also be written onto one line\n", + "s = \"Yes\" if x < 10 else \"No\"\n", + "print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comparisons and truth" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With conditional execution the question naturally arises as to which expressions are deemed to be true and which false. 
For the Python boolean values True and False the answer is (hopefully) obvious. The logical states of truth and falsehood that result from conditional checks like “Is x greater than 5?” or “Is y in this list?” are equally clear. When comparing values Python has the standard comparison (or relational) operators, some of which we have already seen:\n", + "\n", + "|Operator |\tDescription |\tExample |\n", + "|---------|-------------|-----------|\n", + "|`==` |\t equality |\t1 == 2 # False |\n", + "|`!=` |\t inequality |\t1 != 2 # True |\n", + "| `<` |\t less than |\t1 < 2 # True |\n", + "| `<=` |\t less than or equal to |\t2 <= 2 # True |\n", + "| `>` |\t greater than |\t1 > 2 # False |\n", + "| `>=` |\t greater than or equal to |\t1 >= 1 # True |\n", + "\n", + "Comparison operations can be combined, for example to check if a value is within a range." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = -5\n", + "\n", + "if x > 0 and x < 10:\n", + " print(\"In range A\")\n", + " \n", + "elif x < 0 or x > 10:\n", + " print(\"In range B\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python has two additional comparison operators, `is` and `is not`.\n", + "\n", + "These compare whether two objects are the same object, whereas `==` and `!=` compare whether values are the same.\n", + "\n", + "There is a simple rule of thumb to tell when to use `==` or `is`:\n", + "- `==` is for **value** equality. Use it to check if two objects have the same value.\n", + "- `is` is for **reference** equality. 
Use it to check if two references refer to the same object.\n", + "\n", + "*Something to note*: you will get unexpected and inconsistent results if you mistakenly use `is` to compare integers for value equality:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = 500\n", + "b = 500\n", + "print(a == b) # True\n", + "print(a is b) # may be True or False, depending on how the code is run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another example with lists `x`, `y`, and `z`:\n", + "- `y` being a copy of `x`\n", + "- and `z` being another reference to the same object as `x`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = [123, 54, 92, 87, 33]\n", + "y = x[:] # y is a copy of x\n", + "z = x\n", + "print(x)\n", + "print(y)\n", + "print(z)\n", + "print(\"Are values of y and x the same?\", y == x)\n", + "print(\"Are objects y and x the same?\", y is x)\n", + "print(\"Are values of z and x the same?\", z == x)\n", + "print(\"Are objects z and x the same?\", z is x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Let's change x\n", + "x[1] = 23\n", + "print(x)\n", + "print(y)\n", + "print(z)\n", + "print(\"Are values of y and x the same?\", y == x)\n", + "print(\"Are objects y and x the same?\", y is x)\n", + "print(\"Are values of z and x the same?\", z == x)\n", + "print(\"Are objects z and x the same?\", z is x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python even expressions that do not involve an obvious boolean value can be assigned a status of \"truthfulness\"; the value of an item itself can be forced to be considered as either True or False inside an if statement. 
For the Python built-in types discussed in this chapter the following are deemed to be False in such a context:\n", + "\n", + "| False value | Description | \n", + "|-------------|-------------|\n", + "| `None` |\tnull value (absence of a value) |\n", + "| `False` |\tFalse boolean |\n", + "| `0`\t| 0 integer |\n", + "| `0.0` |\t0.0 floating point |\n", + "| `\"\"` |\tempty string |\n", + "| `()` |\tempty tuple |\n", + "| `[]` |\tempty list |\n", + "| `{}` |\tempty dictionary |\n", + "| `set()` |\tempty set |\n", + "\n", + "Everything else is deemed to be True in a conditional context." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "x = '' # An empty string\n", + "y = ['a'] # A list with one item\n", + "\n", + "if x:\n", + " print(\"x is true\")\n", + "else: \n", + " print(\"x is false\") \n", + "\n", + "if y:\n", + " print(\"y is true\")\n", + "else:\n", + " print(\"y is false\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.1.1\n", + "\n", + "Create an `if..elif..else` block that will compare a variable containing your age to another variable containing another person's age and print a statement which says if you are younger, older or the same age as that person.\n", + "\n", + "## Exercises 2.1.2\n", + "\n", + "Use an `if` statement to check if a variable containing a DNA sequence (e.g. `dna = \"ATGGCGGTCGAATAG\"`) contains a stop codon. First check for just one possible stop codon, then extend your code to look for any of the 3 stop codons (`TAG`, `TAA`, `TGA`). Hint: recall that the `in` operator lets you check if a string contains some substring, and returns `True` or `False` accordingly." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next session\n", + "\n", + "Go to our next notebook: [python_basic_2_2](python_basic_2_2.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "# Part 2.2: Loops\n", + "\n", + "- The for loop\n", + "- The while loop\n", + "- Skipping and breaking loops\n", + "- More looping using `range()` and `enumerate()`\n", + "- Filtering in loops\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loops" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When an operation needs to be repeated multiple times, for example on all of the items in a list, we \n", + "avoid having to type (or copy and paste) repetitive code by creating a loop. There are two ways of creating loops in Python, the for loop and the while loop." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The for loop" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The for loop in Python iterates over each item in a sequence (such as a list or tuple) in the order that they appear in the sequence. What this means is that a variable (code in the below example) is set to each item from the sequence of values in turn, and each time this happens the indented block of code is executed again." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "codeList = ['NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06991']\n", + "\n", + "for code in codeList:\n", + " print(code)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A for loop can iterate over the individual characters in a string:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "dnaSequence = 'ATGGTGTTGCC'\n", + "\n", + "for base in dnaSequence:\n", + " print(base)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And also over the keys of a dictionary: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "rnaMassDict = {\"G\":345.21, \"C\":305.18, \"A\":329.21, \"U\":302.16}\n", + "\n", + "for x in rnaMassDict:\n", + " print(x, rnaMassDict[x])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Any variables that are defined before the loop can be accessed from inside the loop. 
So for example to calculate the summation of the items in a list of values we could define the total initially to be zero and add each value to the total in the loop:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "total = 0\n", + "values = [1, 2, 4, 8, 16]\n", + "\n", + "for v in values:\n", + " total = total + v\n", + " # total += v\n", + " print(total)\n", + "\n", + "print(total)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Naturally we can combine a for loop with an if statement, noting that we need two indentation levels, one for the outer loop and another for the conditional blocks:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "geneExpression = {\n", + " 'Beta-Catenin': 2.5, \n", + " 'Beta-Actin': 1.7, \n", + " 'Pax6': 0, \n", + " 'HoxA2': -3.2\n", + "}\n", + "\n", + "for gene in geneExpression:\n", + " if geneExpression[gene] < 0:\n", + " print(gene, \"is downregulated\")\n", + " \n", + " elif geneExpression[gene] > 0:\n", + " print(gene, \"is upregulated\")\n", + " \n", + " else:\n", + " print(\"No change in expression of \", gene)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.2.1\n", + "\n", + "1. Create a sequence where each element is an individual base of DNA. Make the sequence 15 bases long.\n", + "2. Print the length of the sequence.\n", + "3. Create a `for` loop to output every base of the sequence on a new line." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The while loop" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition to the for loop that operates on a collection of items, there is a while loop that simply repeats while some statement evaluates to True and stops when it is False. 
Note that if the tested expression never evaluates to False then you have an “infinite loop”, which is not good.\n", + "\n", + "In this example we generate a series of numbers by doubling a value after each iteration, until a limit is reached: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "value = 0.25\n", + "while value < 8:\n", + " value = value * 2\n", + " print(value)\n", + "\n", + "print(\"final value:\", value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What's going on here is that the value is doubled in each iteration and once it gets to 8 the while test fails (8 is not less than 8) and that last value is preserved. Note that if the test were instead `value <= 8` then we would get one more doubling and the value would reach 16." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.2.2\n", + "\n", + "1. Reuse the 15-base sequence created in the previous exercise, where each element is an individual base of DNA.\n", + "2. Create a while loop similar to the one above that starts at the third base in the sequence and outputs every third base until the 12th." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Skipping and breaking loops" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python has two ways of affecting the flow of the for or while loop inside the block. The continue statement means that the rest of the code in the block is skipped for this particular item in the collection, i.e. jump to the next iteration. 
In this example negative numbers are left out of a summation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "values = [10, -5, 3, -1, 7]\n", + "\n", + "total = 0\n", + "for v in values:\n", + " if v < 0:\n", + " continue # Skip this iteration \n", + " total += v\n", + "\n", + "print(total)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The other way of affecting a loop is with the break statement. In contrast to the continue statement, this immediately causes all looping to finish, and execution is resumed at the next statement _after_ the loop." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "geneticCode = {'TAT': 'Tyrosine', 'TAC': 'Tyrosine',\n", + " 'CAA': 'Glutamine', 'CAG': 'Glutamine',\n", + " 'TAG': 'STOP'}\n", + "\n", + "sequence = ['CAG','TAC','CAA','TAG','TAC','CAG','CAA']\n", + "\n", + "for codon in sequence:\n", + " if geneticCode[codon] == 'STOP':\n", + " break # Quit looping at this point\n", + " else:\n", + " print(geneticCode[codon])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Looping gotchas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An internal counter is used to keep track of which item is used next, and this is incremented on each iteration. When this counter has reached the length of the sequence the loop terminates. This means that if you delete the current item from the sequence, the next item will be skipped (since it gets the index of the current item which has already been treated). Likewise, if you insert an item in a sequence before the current item, the current item will be treated again the next time through the loop. This can lead to nasty bugs that can be avoided by making a temporary copy using a slice of the whole sequence." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "**When looping, never modify the collection!** Always create a copy of it first.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## More looping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `range()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you would like to iterate over a numeric sequence then this is possible by combining the `range()` function and a for loop." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print(list(range(10)))\n", + "\n", + "print(list(range(5, 10)))\n", + "\n", + "print(list(range(0, 10, 3)))\n", + "\n", + "print(list(range(7, 2, -2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looping through ranges " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "for x in range(8):\n", + " print(x*x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "squares = []\n", + "for x in range(8):\n", + " s = x*x\n", + " squares.append(s)\n", + " \n", + "print(squares)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using `enumerate()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Given a sequence, `enumerate()` allows you to iterate over the sequence generating a tuple containing each value along with a corresponding index." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "letters = ['A','C','G','T']\n", + "for index, letter in enumerate(letters):\n", + " print(index, letter)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "numbered_letters = list(enumerate(letters))\n", + "print(numbered_letters)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filtering in loops" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "city_pops = {\n", + " 'London': 8200000,\n", + " 'Cambridge': 130000,\n", + " 'Edinburgh': 420000,\n", + " 'Glasgow': 1200000\n", + "}\n", + "\n", + "big_cities = []\n", + "for city in city_pops:\n", + " if city_pops[city] >= 1000000:\n", + " big_cities.append(city)\n", + "\n", + "print(big_cities)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "total = 0\n", + "for city in city_pops:\n", + " total += city_pops[city]\n", + "print(\"total population:\", total)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "pops = list(city_pops.values())\n", + "print(\"total population:\", sum(pops))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Formatting strings\n", + "\n", + "Constructing more complex strings from a mix of variables of different types can be cumbersome, and sometimes you want more control over how values are interpolated into a string. 
Python provides a powerful mechanism for formatting strings: the built-in `.format()` string method, which uses \"replacement fields\" surrounded by curly braces `{}`. Each field starts with an optional field name, followed by a colon `:`, and finishes with a format specification. \n", + "\n", + "There are lots of these specifiers, but here are 3 useful ones:\n", + "\n", + " d: decimal integer\n", + " f: floating point number\n", + " s: string\n", + "\n", + "You can specify the number of decimal places to use in a floating point number with, e.g. `.2f` to use 2 decimal places or `+.2f` to use 2 decimal places and always show the sign." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print('{:.2f}'.format(0.4567))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "geneExpression = {\n", + " 'Beta-Catenin': 2.5, \n", + " 'Beta-Actin': 1.7, \n", + " 'Pax6': 0, \n", + " 'HoxA2': -3.2\n", + "}\n", + "\n", + "for gene in geneExpression:\n", + " print('{:s}\\t{:+.2f}'.format(gene, geneExpression[gene])) # s is optional\n", + " # could also be written using variable names\n", + " #print('{gene:s}\\t{exp:+.2f}'.format(gene=gene, exp=geneExpression[gene]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.2.3\n", + "\n", + "1. Let's calculate the GC content of a DNA sequence. Use the 15-base sequence you created for the exercises above. Create a variable, `gc`, which we will use to count the number of Gs or Cs in our sequence.\n", + "2. Output every base of the sequence alongside its index on a new line.\n", + "3. Create a loop to iterate over the bases in your sequence. If the base is a G or the base is a C, add one to your `gc` variable.\n", + "4. 
When the loop is done, divide the number of GC bases by the length of the sequence and multiply by 100 to get the GC percentage. Format the result to only display 2 decimal places." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 2.3: Files\n", + "\n", + "-----\n", + "\n", + "- Using files\n", + "- Reading from files\n", + "- Writing to files\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data input and output (I/O)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So far, all the data we have been working with has been written by us into our scripts, and the results of our computations have just been displayed in the terminal output. In the real world data will be supplied by the user of our programs (who may be you!) by some means, and we will often want to save the results of some analysis somewhere more permanent than just printing it to the screen. In this session we cover reading data into our programs from files on disk; we also discuss writing data out to files. \n", + "\n", + "There are, of course, many other ways of accessing data, such as querying a database or retrieving data from a network such as the internet. We don't cover these here, but Python has excellent support for interacting with databases and networks either in the standard library or using external modules." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them. 
\n", + "\n", + "We will also frequently want to be able to print output to files rather than always printing out results to the terminal.\n", + "\n", + "Python supports all of these modes of operations on files, and provides a number of useful functions and syntax to make dealing with files straightforward." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Opening files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To open a file, python provides the `open` function, which takes a filename as its first argument and returns a _file object_ which is python's internal representation of the file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "path = \"data/datafile.txt\"\n", + "fileObj = open( path )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`open` takes an optional second argument specifying the _mode_ in which the file is opened, either for reading, writing or appending.\n", + "\n", + "It defaults to `'r'` which means open for reading in text mode. Other common values are `'w'` for writing (truncating the file if it already exists) and `'a'` for appending." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "open( \"data/myfile.txt\", \"r\" ) # open for reading, default" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "open( \"data/myfile.txt\", \"w\" ) # open for writing (existing files will be overwritten)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "open( \"data/myfile.txt\", \"a\" ) # open for appending" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Closing files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To close a file once you finished with it, you can call the `.close` method on a file object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fileObj.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Mode modifiers\n", + "\n", + "These mode strings can include some extra modifier characters to deal with issues with files across multiple platforms.\n", + "\n", + "`'b'`: binary mode, e.g. `'rb'`. 
End-of-line characters are not translated to the platform-specific convention.\n", + "\n", + "|Character | Meaning |\n", + "|----------|---------|\n", + "|`'r'` |\topen for reading (default) |\n", + "|`'w'` |\topen for writing, truncating the file first |\n", + "|`'x'` |\topen for exclusive creation, failing if the file already exists |\n", + "|`'a'` |\topen for writing, appending to the end of the file if it exists |\n", + "|`'b'` |\tbinary mode |\n", + "|`'t'` |\ttext mode (default) |\n", + "|`'+'` |\topen a disk file for updating (reading and writing) |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reading from files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once we have opened a file for reading, file objects provide a number of methods for accessing the data in a file. The simplest of these is the `.read` method that reads the entire contents of the file into a string variable.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fileObj = open( \"data/datafile.txt\" )\n", + "print(fileObj.read()) # everything\n", + "fileObj.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that this means the entire file will be read into memory. 
If you are operating on a large file and don't actually need all the data at the same time this is rather inefficient.\n", + "\n", + "Frequently, we just need to operate on individual lines of the file, and you can use the `.readline` method to read a line from a file and return it as a python string.\n", + "\n", + "File objects internally keep track of your current location in a file, so to get following lines from the file you can call this method multiple times.\n", + "\n", + "It is important to note that the string representing each line will have a trailing newline `\"\\n\"` character, which you may want to remove with the `.rstrip` string method.\n", + "\n", + "Once the end of the file is reached, `.readline` will return an empty string `''`. This is different from an apparently empty line in a file, as even an empty line will contain a newline character. Recall that the empty string is considered as `False` in python, so you can readily check for this condition with an `if` statement etc." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# one line at a time\n", + "fileObj = open( \"data/datafile.txt\" )\n", + "print(\"1st line:\", fileObj.readline())\n", + "print(\"2nd line:\", fileObj.readline())\n", + "print(\"3rd line:\", fileObj.readline())\n", + "print(\"4th line:\", fileObj.readline())\n", + "fileObj.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To read in all lines from a file as a list of strings containing the data from each line, use the `.readlines` method (though note that this will again read all data into memory)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# all lines\n", + "fileObj = open( \"data/datafile.txt\" )\n", + "\n", + "lines = fileObj.readlines()\n", + "\n", + "print(\"The file has\", len(lines), \"lines\")\n", + "\n", + "fileObj.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looping over the lines in a file is a very common operation and Python lets you iterate over a file using a `for` loop just as if it were a list of strings. This does not read all data into memory at once, and so is much more efficient than reading the file with `.readlines` and then looping over the resulting list." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# as an iterable\n", + "fileObj = open( \"data/datafile.txt\" )\n", + "\n", + "for line in fileObj:\n", + " print(line.rstrip().upper())\n", + "\n", + "fileObj.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The with statement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is important that files are closed when they are no longer required, but writing ``fileObj.close()`` is tedious (and more importantly, easy to forget). An alternative syntax is to open the files within a ``with`` statement, in which case the file will automatically be closed at the end of the `with` block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# fileObj will be closed when leaving the block\n", + "with open( \"data/datafile.txt\" ) as fileObj:\n", + " for ( i, line ) in enumerate( fileObj, start = 1 ):\n", + " print( i, line.strip() )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.3.1\n", + "\n", + "Write a script that reads a file containing many lines of nucleotide sequence. 
For each line in the file, print out the line number, the length of the sequence and the sequence (there is an example file in `data/dna.txt` in the course materials)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Writing to files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once a file has been opened for writing, you can use the `.write()` method on a file object to write data to the file.\n", + "\n", + "The argument to the `.write()` method must be a string, so if you want to write out numerical data to a file you will have to convert it to a string beforehand, for example with `str()`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
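As a minimal sketch of the string-only rule described above, the code below uses `io.StringIO` as an in-memory stand-in for a file opened for writing (an assumption made here so that nothing is written to disk) and a made-up `score` value. Passing the number directly raises a `TypeError`, whereas converting it with `str()` or formatting it first works:

```python
import io

# io.StringIO behaves like a text file opened for writing, but lives in memory
# (a stand-in chosen for this sketch; a real script would use open(..., 'w')).
output = io.StringIO()

# Hypothetical numerical value to write out.
score = 2.2362

try:
    # .write() only accepts strings, so a float is rejected.
    output.write(score)
except TypeError as error:
    print('write() refused the number:', error)

# Convert explicitly with str() ...
output.write(str(score))
output.write(' ')
# ... or control the number of decimal places while converting.
output.write('{:.2f}'.format(score))

written = output.getvalue()
print(written)
```

Either form produces a string, so the choice depends only on how many decimal places the output should show.
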
**Remember** to include a newline character `\\n` to separate the lines of your output: unlike the `print()` function, `.write()` does not add one by default.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "read_counts = {\n", + " 'BRCA2': 43234,\n", + " 'FOXP2': 3245,\n", + " 'SORT1': 343792\n", + "}\n", + "\n", + "with open( \"out.txt\", \"w\" ) as output:\n", + " output.write(\"GENE\\tREAD_COUNT\\n\")\n", + "\n", + " for gene in read_counts:\n", + " line = \"\\t\".join( [ gene, str(read_counts[gene]) ] )\n", + " output.write(line + \"\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the output file, open a terminal window, go to the directory where the file has been written, and print the content of the file using the `cat` command or open it using your favourite editor:\n", + "\n", + "```bash\n", + "cat out.txt\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Be cautious when opening a file for writing: if the file already exists, python will happily let you overwrite its contents without warning." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.3.2\n", + "\n", + "Create a script that writes the values of a list of numbers to a file, with each number on a separate line." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 2.4: Delimited files\n", + "\n", + "------\n", + "\n", + "- Data formats\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data formats" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Bioinformaticians love creating endless new file formats for their data, but there is one very common standard format, the delimited text file,\n", + "that it is good to get used to parsing."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Delimited file example:\n", + "```\n", + "X 169008682 1 111267453 1.0976\n", + "2 8265484 5 69763543 4.9825\n", + "MT 10924 MT 81934 7.2357\n", + "3 127 8 10908776 1.2509\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Reading delimited files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the various string manipulation techniques covered earlier to process delimited files in a fairly straightforward way. Here we loop through a file with columns delimited by spaces, reading the data for each row into a list, and storing each of these lists in a main results list." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view an example of a delimited file, open a terminal window, go to the course directory, and print the content of the file using the `cat` command or open it using your favourite editor:\n", + "\n", + "```bash\n", + "cat data/mydata.txt\n", + "```\n", + "\n", + "```\n", + "Index Organism Score\n", + "1 Human 1.076\n", + "2 Mouse 1.202\n", + "3 Frog 2.2362\n", + "4 Fly 0.9853\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "results = []\n", + "\n", + "with open(\"data/mydata.txt\", \"r\") as data:\n", + " header = data.readline()\n", + " for line in data:\n", + " results.append(line.split())\n", + " \n", + " \n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we show a slightly more complicated example where we are reading the results into a more convenient data structure, a list of dictionaries with the dictionary keys corresponding to the column headers and the values to the values from each line. We also convert the columns to an appropriate type as we go."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "results = []\n", + "\n", + "with open(\"data/mydata.txt\", \"r\") as data:\n", + " header = data.readline()\n", + " for line in data:\n", + " idx, org, score = line.split()\n", + " row = {'Index': int(idx), 'Organism': org, 'Score': float(score)}\n", + " results.append(row)\n", + " \n", + "print(results)\n", + "print('Score of first row:', results[0]['Score'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Writing delimited files" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Writing out a delimited file is also straightforward using the `join` method. Here, as an example, we will recreate our original file from above, but this time we will delimit the columns with a comma." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076}, \n", + " {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202}, \n", + " {'Organism': 'Frog', 'Index': 3, 'Score': 2.2362}, \n", + " {'Organism': 'Fly', 'Index': 4, 'Score': 0.9853}]\n", + "\n", + "with open('data/mydata.csv', 'w') as output:\n", + " # write a header\n", + " header = \",\".join(['Index', 'Organism', 'Score'])\n", + " output.write(header + \"\\n\")\n", + " for row in mydata:\n", + " line = \",\".join([str(row['Index']), row['Organism'], str(row['Score'])])\n", + " output.write(line + \"\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the output file, open a terminal window, go to the course directory, and print the content of the file using the `cat` command or open it using your favourite editor:\n", + "\n", + "```bash\n", + "cat data/mydata.csv\n", + "```\n", + "\n", + "```\n", + "Index,Organism,Score\n", + "1,Human,1.076\n", + "2,Mouse,1.202\n", + 
"3,Frog,2.2362\n", + "4,Fly,0.9853\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Last but not least\n", + "\n", + "### A big thank you!\n", + "\n", + "### Remember...\n", + "- Our course webpage: http://pycam.github.io\n", + "- The Python website: https://www.python.org/ \n", + "- To fill in the course survey ;-)\n", + "- To come to our next course 'Working with Python: functions and modules' and register at https://training.csx.cam.ac.uk/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.4.1\n", + "\n", + "Write a script that reads a tab-delimited file with 4 columns (gene, chromosome, start and end coordinates), computes each gene's length and stores it in a dictionary, and writes the results to a new tab-separated file. You can find a data file in `data/genes.txt` in the course materials." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.4.2 \n", + "\n", + "Read the lyrics of Imagine (John Lennon, 1971) from the file `data/imagine.txt`. Split the text into words. Print the total number of words, and the number of distinct words. Calculate the frequency of each distinct word and store the result in a dictionary. Print each distinct word along with its frequency. Find the most frequent word longer than 3 characters in the song and print it with its frequency.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises 2.4.3 \n", + "#### Real life example\n", + "\n", + "You have a tab-separated file, `data/yeast_genes.txt`, which contains information about all the yeast (*S. cerevisiae*) genes:\n", + "\n", + "`Systematic_name\tStandard_name\tChromosome\tStart\tEnd\n", + "YBR127C VMA2 chrII 491269 492822\n", + "YBR128C ATG14 chrII 493081 494115\n", + "...\n", + "`\n", + "\n", + "For every gene, its location and coordinates are recorded. \n", + "You should read through the file and store the data in an appropriate structure.\n", + "Then answer these questions:\n", + "\n", + "- How many genes are there in *S. cerevisiae*?\n", + "- Which is the longest and which is the shortest gene?\n", + "- How many genes per chromosome? Print the number of genes per chromosome.\n", + "- For each chromosome, what is the longest and what is the shortest gene?\n", + "- For each chromosome, how many genes are on the Watson strand and how many are on the Crick strand?\n", + "\n", + "**Bonus** \n", + "\n", + "- What is the chromosome with the highest gene density? You can calculate the length of each chromosome assuming that it starts at 1 and ends at the end (if on the Watson strand) or at the start (if on the Crick strand) of its last gene. Then you can calculate the total length of all the genes on each chromosome and the ratio of coding to noncoding regions." + ] + } + ], + "metadata": { + "celltoolbar": "Slideshow", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}