
Python tutorial - written version



Attached is a written version of the Python IRC tutorial we had two and
a half weeks ago on #debian-women. I'd appreciate help in putting it on
the D-W wiki; I'm somewhat busy right now with other things and don't
have the time to convert it to wiki notation.

-- 
I'm a Luddite with neophilia
This is a summary of the IRC tutorial on Python on #debian-women,
February 11, 2006.

This version of the tutorial is not a direct transcription of the IRC
log. Instead, it expands a little bit on some points, so if you were
part of the IRC tutorial, you may still want to read this one.

The structure of the tutorial was that I gave examples and said
something about them, and then there were questions and discussion
until we moved to the next one.


* About Python in general
-------------------------

From the Python FAQ: "Python is an interpreted, interactive,
object-oriented programming language. It incorporates modules,
exceptions, dynamic typing, very high level dynamic data types, and
classes."

"Interpreted" means, in practice, that you can just run Python
programs stored in files without having to compile them first, so
it's similar to shell scripts in that way, and unlike C programs.

"Interactive" means that you can start the Python interpreter
and start feeding it statements, one at a time, and it will
execute them and print out any results. Those familiar with
Lisp's Read-Eval-Print Loop (REPL) will be familiar with this.
Ditto for BASIC.

"Object-oriented" means that Python favors OOP, but it doesn't
force it. Python is more relaxed about paradigms than, say, Java.

Of the other things the FAQ lists, "dynamic typing" is perhaps
the most interesting. It comes as a bit of a shock to those whose
only languages are like C or Java, which are statically typed:
variables have a type that is explicitly declared, and therefore
every expression also has a type that can be analyzed at 
compile time. Therefore, the compiler can find some errors before
the program starts.

In dynamic typing, variables don't have types, values do. A variable
can be "bound" to values of different types at different points in
time, and therefore all type checking is done at run-time. This makes
some things nicer to do, but can make it harder to find type errors.
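
As a small illustration of the point above (a sketch only; the variable
names are made up for this example), the same variable can hold values
of different types, and a type error only shows up when the offending
line actually runs:

```python
# The same variable can be bound to values of different types over time.
thing = 42
first_type = type(thing).__name__      # "int"
thing = "forty-two"
second_type = type(thing).__name__     # "str"

# A type error is only detected when the offending expression is executed.
try:
    result = "number: " + 42           # adding a string and an integer fails
except TypeError:
    result = "caught a TypeError at run time"
```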

Dynamic typing tends to be good for small to medium sized programs,
and rapid development and prototyping. Static typing is good for
large programs.

Python is often described as a scripting language, but that doesn't
mean it isn't a general purpose language. Python, like Perl, is often
used for quick hacks, and it is good for that since it is high-level
and interpreted, and has some nifty features and libraries for certain
kinds of sysadmin-like tasks. At the same time, it is often used for
so-called real application development.


* Links
-------

http://python.org/

http://docs.python.org/lib/lib.html

file:///usr/share/doc/python2.3/html/lib/index.html
(you need python2.3-doc for that, but it is very nice to have)


* The hello world program
-------------------------

---- 8< ----
# Save to file "hello.py" and run with "python hello.py"
print "hello, world"
---- 8< ----

Save the above (between the scissor lines) to a file called "hello.py",
and run it with the command "python hello.py". If that works, then you
know you have a working Python installation, and know how to use it.

The first line is a comment: it starts with a hash sign ("#") and
continues to the end of the line.

The second line is a simple statement that prints out a string, plus
a newline. "print" is the simplest way of producing output to the
standard output. You can print out any number of values, just separate
them with commas, and there'll be spaces between the values.


* Command line arguments
------------------------

---- 8< ----
# Run this as: python hello2.py yourname
import sys
print "greetings,", sys.argv[1]
---- 8< ----

We have here a way to import a library module, and a way to access
command line arguments. "sys" is one of the modules in the Python
standard library. In that module there is an array, or really a list,
called argv, which contains the command line arguments of the Python
program, similar to C's argv argument to the main function.

Lists (or arrays) start indexing at zero. Thus, sys.argv[0] is the
first element of the list; like in C, it is the name of the program
being run. sys.argv[1] is the first actual argument. Note that since
Python is dynamically typed, list elements don't all need to be of
the same type.

Run the program as "python hello2.py darling", and it will print
out "greetings, darling".

If you don't give it a command line argument, the program will try to
use sys.argv[1] when it doesn't exist, and this causes a run time error,
an exception, and the python interpreter prints out a long, nasty error
message. Like this:

---- 8< ----
liw@esme$ python hello2.py
greetings,
Traceback (most recent call last):
  File "hello2.py", line 3, in ?
    print "greetings,", sys.argv[1]
IndexError: list index out of range
---- 8< ----

The exception traceback has a record (two lines) per entry in the call
stack, with the place where the exception was raised at the bottom
(i.e., the main program at the top). By reading it carefully and
analyzing what was called where, you can (usually!) figure out what was
wrong. It's possible for a program to catch exceptions.
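
Catching an exception could look roughly like the sketch below. The
function name "first_argument" and its default value are invented for
this illustration; only the "try"/"except" statement and IndexError are
Python's own.

```python
def first_argument(argv, default="my nameless friend"):
    # argv[1] raises IndexError when no argument was given;
    # catch it and fall back to a default instead of crashing.
    try:
        return argv[1]
    except IndexError:
        return default

with_name = "greetings, %s" % first_argument(["hello2.py", "darling"])
without_name = "greetings, %s" % first_argument(["hello2.py"])
```

In a real script you would pass sys.argv instead of the hand-made lists.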



* "if"
------

We continue to be inordinately fond of "hello, world" examples.

---- 8< ----
# Run this as: python hello3.py yourname
import sys
print "greetings,",
if len(sys.argv) == 2:
    print sys.argv[1]
else:
    print "my nameless friend"
---- 8< ----

A comma after the last argument to "print" will prevent it from printing
a newline.

"len(foo)" returns the length of a list "foo", that is, the number of
elements in it. Indexes, therefore, go from 0 to len(foo)-1. len() is
a very fast, constant-time function.

We use len to check that there was an argument given on the command
line and if not, we substitute a generic greeting.

The other big thing about this example is the "if" statement. This is
where we learn about Python's use of indentation to mark blocks. The
"then" and "else" parts of the "if" statement are both marked by 
indenting them more than the "if" statement. Python does not have
explicit block markers, it's always done with indentation. A block is
a series of statements with the same indentation; empty lines and comments
are ignored, of course.

Tabs are expanded, by default to every 8 spaces, but that is configurable.
Using any other value is likely to cause trouble when sharing code with
others. Python programmers tend to prefer to use spaces only, and no tabs
at all.

"if" and other statements that introduce blocks end with a colon.
The colon is there mostly for stylistic reasons, but it is mandatory,
and occasionally it helps the parser to catch syntax errors.

It is possible to put several statements on a line by separating them
with a semicolon, but it is considered very bad style.



* "while"
---------

One more "hello" example.

---- 8< ----
# Run this as: python hello4.py name1 name2 name3 ...
import sys
print "greetings,",
if len(sys.argv) == 1:
    print "my nameless friend"
elif len(sys.argv) == 2:
    print sys.argv[1]
else:
    i = 1
    last_index = len(sys.argv) - 1
    while i < last_index:
        print sys.argv[i] + ",",
        i = i + 1
    print "and", sys.argv[last_index]
---- 8< ----

In this example we get variables, "elif", and "while". Variables work
pretty much as you'd expect. Variables are not declared, but it is an
error to use a variable that has not yet been assigned to in the local
scope, or a surrounding scope. An assignment creates a local variable
(it is possible to use global variables, but we'll skip that now). Thus,
if a typo changed the last statement of the while loop above to "i
= j + 1", Python would raise an exception, but "j = i + 1" would run
without complaint and cause an infinite loop.

When I say "a variable is assigned a value", what I really mean is
that a variable gets a reference to the value. All variables are
references.
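
The following sketch (with made-up variable names) shows what "all
variables are references" means in practice: assigning one variable to
another does not copy the value.

```python
names = ["ann", "bea"]
alias = names              # no copy is made: both names refer to one list
alias.append("cat")        # so the change is visible through either name
same_object = alias is names

alias = ["dee"]            # rebinding alias makes it refer to a new list;
                           # the list that names refers to is untouched
```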

"elif" is a contraction of "else if", and should likewise be pretty
clear. There is no "switch" statement in Python, instead a long "if: ...
elif: ... elif: ... else: ..." statement is used.

All the usual integer operators work: +, -, *, /, %, <, >, <=, >=, ==,
!=. There are no ++ or -- operators, and += and similar ones are only
used in assignment statements (assignments are never expressions in
Python).

Some of the operators are overloaded for other types as well. For
example, + also works as string concatenation when both operands are
strings.

For "if", "while", and other contexts where a boolean value is required,
the values False, 0, 0.0, "", and None, plus a few other "empty" values,
are treated as false, everything else as true.
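
To make the rule concrete, here is a small sketch; the helper function
"truth" is invented for this example and simply reports which branch of
an "if" a value would take.

```python
def truth(value):
    # Report how "if" would treat the value.
    if value:
        return "true"
    else:
        return "false"

falsy_results = []
for value in [False, 0, 0.0, "", None, [], {}]:
    falsy_results.append(truth(value))

truthy_results = []
for value in [True, 1, -1, "0", " ", [0]]:
    truthy_results.append(truth(value))
```

Note that the string "0" and a list containing a zero are not empty, so
they count as true.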
    
    
* "for"
-------

More greetings.

---- 8< ----
# Run this as: python hello5.py name1 name2 name3 ...
import sys
print "greetings,",
if len(sys.argv) == 1:
    print "my nameless friend"
elif len(sys.argv) == 2:
    print sys.argv[1]
else:
    for name in sys.argv[1:-1]:
        print name + ",",
    print "and", sys.argv[-1]
---- 8< ----

The new thing here is the "for" loop, which iterates over a sequence of
values, such as a list. "name" is assigned the value of each command
line argument in turn, and then the block inside the "for" is executed.
This tends to be more convenient than doing explicit indexing with
"while".

The other fun thing is the use of slices. Slices are a way of creating a
new list out of elements from another, a subsection of another list.
Given a list "foo", "foo[i]" is the element at index i, "foo[a:b]" is a new
list with all elements from index a up to, but not including index b.
For extra fun, i, a, and b can all be negative, in which case they index
from the end of the list, so "foo[-1]" is the last element. Thus,
"sys.argv[1:-1]" is all the command line arguments from the first one
after the program name up until, but not including the last one.

The a and b index may also be missing; in that case, the corresponding
end of the list is used. "foo[a:]" is everything from index a to the end
of the list. "foo[:b]" is everything from the beginning of the list up to,
but not including index b. "foo[:]" is a copy of the entire list.
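
A few slices on a made-up argument list, as a quick sketch of the rules
above:

```python
args = ["prog", "ann", "bea", "cat", "dee"]

first = args[1]          # "ann"
last = args[-1]          # "dee"
middle = args[1:-1]      # ["ann", "bea", "cat"]
rest = args[1:]          # ["ann", "bea", "cat", "dee"]
head = args[:2]          # ["prog", "ann"]
copy = args[:]           # a new list with the same elements
```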


* functions
-----------

The greetings never end, do they?

---- 8< ----
# Run this as: python hello6.py name1 name2 name3 ...
import sys

def greet(greeting, names):
    print greeting + ",",
    if not names:
        print "my nameless friend"
    elif len(names) == 1:
        print names[0]
    else:
        for name in names[:-1]:
            print name + ",",
        print "and", names[-1]

greet("hi there", sys.argv[1:])
---- 8< ----

Here we see how a function is defined. Note that argument names (if any)
are declared, but not their types, and neither is the return type. All
typing in Python is dynamic.

Also note that the function gets a list of names to be greeted, and
sys.argv starts with the name of the program, so the main program strips
it out with a slice when calling the function.
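
Because neither argument types nor return types are declared, one
function can serve several types. The function "twice" below is a
made-up example, not part of the tutorial programs:

```python
def twice(value):
    # No types are declared: this works for any value that supports "+".
    return value + value

number = twice(21)        # integer addition
text = twice("ab")        # string concatenation
items = twice([1, 2])     # list concatenation
```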

        
* hashbanging
-------------

My bag of helloworld programs is infinite!

---- 8< ----
#!/usr/bin/python

import sys

def greet(greeting, names):
    print greeting + ",",
    if not names:
        print "my nameless friend"
    elif len(names) == 1:
        print names[0]
    else:
        for name in names[:-1]:
            print name + ",",
        print "and", names[-1]

def main():
    greet("hi there", sys.argv[1:])

main()
---- 8< ----

This is how one would make a Python script that can be run as any
command, without prefixing the command with "python". Just save this
into a file "hello7", chmod +x it, and then run it with "./hello7".

The main program of a Python program is customarily put into a function
(often called "main"). That function is then called either directly
or, better, like this:

---- 8< ----
if __name__ == "__main__":
    main()
---- 8< ----

"__name__" is a special Python variable that has the value "__main__"
if the Python file is run directly. This allows the file to be used
as a Python module (i.e., with "import") without invoking its main
program. This can also be used to invoke unit testing.


* I/O
-----

I think we've been polite enough now.

---- 8< ----
#!/usr/bin/python

import sys

line_count = 0
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line_count += 1
sys.stdout.write("%d lines\n" % line_count)
---- 8< ----

This program counts the number of lines in the standard input.

"sys.stdin", "sys.stdout", and "sys.stderr" are file objects that correspond
to the standard input, output, and error streams. File objects have a method
".readline()" that reads and returns the next line, including the newline,
or the empty string if they hit EOF. Similarly, ".write()" is a file object
method that writes a string to the file; it does not add a newline.
      
"if not line" tests whether the variable line is false or not; it's
false, if it is the empty string (since it has a string value). Thus,
the condition is true at the end of the file. "break" then jumps out of
the innermost loop.

The "while True: data = f.read(); if not data: break" pattern is a
common way of doing input in a loop.
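
The same loop can be tried without typing into standard input. This
sketch assumes the "io" module, which is only available in Python
versions newer than the 2.3 mentioned earlier; "count_lines" and
"sample" are names made up for the example.

```python
import io

def count_lines(f):
    # readline() returns "" only at end of file; even an empty line
    # comes back as "\n", which counts as true.
    line_count = 0
    while True:
        line = f.readline()
        if not line:
            break
        line_count += 1
    return line_count

# An in-memory file object with three lines, the middle one empty.
sample = io.StringIO(u"first\n\nthird\n")
sample_count = count_lines(sample)
```

Calling count_lines(sys.stdin) would count the lines of standard input
in the same way.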

When the first operand of the % operator is a string, it works similarly
to sprintf in C. The first operand acts as the format string, and "%s"
in it gets replaced by a string value, "%d" with an integer value, etc.
The values are taken from the second operand, which can be a single value,
if there is only one %something in the format string, or a sequence of
values inside parentheses if there are several.
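
A few formatting expressions, as a sketch; the values are made up, and
the width and precision modifiers in the last line work as they do in C:

```python
one = "%d lines" % 3
several = "%s has %d lines" % ("hello.py", 2)
padded = "%5d|%-5s|%.2f" % (42, "ab", 3.14159)
```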


* String manipulation
---------------------

We're not going back to hello, world.

---- 8< ----
#!/usr/bin/python

import sys

def count_words(str):
    word_count = 0
    i = 0
    in_word = False
    while i < len(str):
        c = str[i]
        is_word_char = (c >= "a" and c <= "z") or (c >= "A" and c <= "Z")
        if in_word:
            if not is_word_char:
                in_word = False
        else:
            if is_word_char:
                in_word = True
                word_count += 1
        i += 1
    return word_count

def main():
    line_count = 0
    word_count = 0
    byte_count = 0

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        byte_count += len(line)
        line_count += 1
        word_count += count_words(line)

    sys.stdout.write("%d words, %d lines, %d bytes\n" %
                     (word_count, line_count, byte_count))

main()
---- 8< ----

This program counts words (defined as sequences of ASCII letters),
lines, and bytes.
It is the biggest example yet, and it is also very, very ugly. We'll
make it prettier next, though.

Strings can be used (partly) like lists: "len(str)" is the length of
a string, "str[i]" is the character at index i, "str[a:b]" also works
as expected. There is no separate character type; single-character
strings are used instead.

Strings can be compared with <, <=, and so on; the comparison is based
on the values of the bytes (since strings are strings of bytes; we'll
come to Unicode later).
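
A short sketch of these string operations, using made-up values:

```python
word = "Python"
length = len(word)           # 6
first = word[0]              # "P", itself a string of length 1
tail = word[2:]              # "thon"

# Comparison goes by character value, so all uppercase ASCII
# letters sort before all lowercase ones.
checks = ["a" < "b", "Z" < "a", "abc" < "abd", "ab" < "abc"]
```

Every comparison in "checks" is true.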

The last line of main() shows one way of extending Python statements to
multiple lines: if a parenthesized expression is too long, just break it
to the next line, and it will all work automatically. The other way is
to use a backslash at the end of a line.

The ugly parts of this code are that it is very much specific to ASCII,
when it should be locale sensitive, and there is also no point in using
"while" to loop over characters in a string, since "for" also works.

        
* Unicode strings
-----------------

Disclaimer: I am not very good at Unicode handling, either in general or
in Python.

Unicode characters are bigger than 8 bits (and you don't need to care
exactly how big they are, when using Python). Python has a separate
string type for Unicode strings. They work pretty much identically to
normal strings (which are strings of bytes), but for I/O you need
to convert them from and to byte strings, using some kind of encoding.
The encoding depends on various factors, but often it is OK to use
an encoding based on the current locale.

Note that a Python Unicode string is not a UTF-8 string. UTF-8 is
one of the encodings used for I/O (and storage).

In source code, 'u"Copyright \u00A9 2006 Lars Wirzenius"' is a Unicode
string containing the copyright character. You can't write non-ASCII
characters into Python source code unless you tell the Python
interpreter what the encoding and character set are (and I don't know
how).
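
One way to do this, as far as the Python documentation (PEP 263)
describes it, is a special comment on the first or second line of the
source file. The sketch below assumes the file itself is saved as UTF-8:

```python
# -*- coding: utf-8 -*-
# The line above must be the first or second line of the source file.
text = u"Copyright © 2006"
same = u"Copyright \u00A9 2006"
```

Both strings have the same value; the declaration only changes how the
source file is read.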

"sys.stdin.readline" returns a normal string, which we will call "s"
here. "s.decode(enc)" decodes s into a Unicode string ("u") using some
encoding. "u.encode(enc)" encodes in the other direction, from Unicode
to normal string. "enc" can be "utf-8", for example.
"locale.getpreferredencoding()" returns the preferred encoding for the
current locale.
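
A sketch of a round trip between the two string types. The encoding
names are fixed here so that the result does not depend on the locale;
in a real program locale.getpreferredencoding() would usually supply
them.

```python
text = u"caf\u00e9"                  # Unicode string, 4 characters
data = text.encode("utf-8")          # byte string, 5 bytes
back = data.decode("utf-8")          # Unicode string again
latin1 = text.encode("iso-8859-1")   # same text, different bytes
```

The length of the encoded form differs between encodings, which is why
lengths should be counted on the Unicode string.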


* wordcounting revisited
------------------------

Let's apply what we learned to word counting.

---- 8< ----
#!/usr/bin/python

import locale
import sys

def count_words(str):
    word_count = 0
    in_word = False
    for c in str:
        if in_word and not c.isalnum():
            in_word = False
        elif not in_word and c.isalnum():
            in_word = True
            word_count += 1
    return word_count

def main():
    locale.setlocale(locale.LC_ALL, "")

    line_count = 0
    word_count = 0
    char_count = 0

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.decode(locale.getpreferredencoding())
        char_count += len(line)
        line_count += 1
        word_count += count_words(line)

    sys.stdout.write("%d words, %d lines, %d chars\n" %
                     (word_count, line_count, char_count))

main()
---- 8< ----

In addition to the above discussion about Unicode, the line
'locale.setlocale(locale.LC_ALL, "")' is necessary to activate the locale
settings.
        

* more word play: print out all words
-------------------------------------

Let's write words out.

---- 8< ----
#!/usr/bin/python

import locale
import sys

def split_words(str):
    words = []
    word = None
    for c in str + " ":
        if word:
            if c.isalnum():
                word += c
            else:
                words.append(word)
                word = None
        else:
            if c.isalnum():
                word = c
    return words

def main():
    locale.setlocale(locale.LC_ALL, "")
    encoding = locale.getpreferredencoding()

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.decode(encoding)
        for word in split_words(line):
            sys.stdout.write("%s\n" % word.encode(encoding))

main()
---- 8< ----

An empty list is written as "[]". A non-empty list would be
written like this: '[1, 2, 3, "hello"]'. "list.append(item)" modifies
the list in place and adds a new item to the end. Lists can be
concatenated: "[1,2] + [3,4]" gives "[1,2,3,4]".
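
The list operations above, in a small sketch with made-up values:

```python
words = []
words.append("hello")
returned = words.append("world")   # append changes the list, returns None
mixed = [1, 2] + [3, "four"]       # concatenation builds a new list
```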
    
The split_words function creates new lists (and new strings)
indiscriminately, and they are not freed anywhere in the program; Python
does garbage collection, which is a very nice thing to have.

The 'str + " "' thing in split_words guarantees that the input ends with
a non-isalnum character, so that a word at the very end of the line
(no newline at the end) is still counted correctly.
      
        
* word frequencies: dictionaries!
---------------------------------

Let's count word frequencies.

---- 8< ----
#!/usr/bin/python

import locale
import sys

def split_words(str):
    words = []
    word = None
    for c in str + " ":
        if word:
            if c.isalnum():
                word += c
            else:
                words.append(word)
                word = None
        else:
            if c.isalnum():
                word = c
    return words

def main():
    locale.setlocale(locale.LC_ALL, "")
    encoding = locale.getpreferredencoding()

    counts = {}

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.decode(encoding)
        for word in split_words(line):
            word = word.lower()
            if counts.has_key(word):
                counts[word] += 1
            else:
                counts[word] = 1

    words = counts.keys()
    words.sort()
    for word in words:
        sys.stdout.write("%d %s\n" % (counts[word], word.encode(encoding)))

main()
---- 8< ----

The changes are to the main program. Python's hash tables (or hash
maps) are called dictionaries. An empty dictionary: "{}". A
non-empty one: '{ "foo": 0, "bar": 1 }'. "dict[key]" is the value
stored at a given key. Keys can be numbers, strings, or various other
types for which Python knows how to compute a hash value.

"dict.has_key(key)" is True if "dict[key]" exists (has been assigned
to). Alternatively, you can write "key in dict".

"dict.keys()" is an unsorted list of all keys.

The string method ".lower()" converts it to lower case, returning the
new string (the original is not modified; strings cannot be modified
in Python). Similarly, ".upper()" to convert to upper case.
       
"list.sort()" sorts in place (original list is changed, does not return
sorted list, or any other value).
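
The dictionary and sorting operations can be tried on their own, without
the I/O. In this sketch "word_counts" is a made-up helper; it uses
"key in dict" rather than "has_key", and wraps "keys()" in "list()" so
that the same code also runs on newer Python versions.

```python
def word_counts(words):
    # Count case-insensitively, as the main program above does.
    counts = {}
    for word in words:
        word = word.lower()
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

counts = word_counts(["The", "cat", "the", "Hat", "THE"])
keys = list(counts.keys())
keys.sort()                  # sorts in place and returns nothing useful
```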


* let's have some class
-----------------------

Let's see how classes and objects are used in Python.

---- 8< ----
#!/usr/bin/python

import locale
import sys

class WordFreqCounter:

    def __init__(self):
        self.counts = {}
        
    def count_word(self, word):
        word = word.lower()
        if self.counts.has_key(word):
            self.counts[word] += 1
        else:
            self.counts[word] = 1
            
    def print_counts(self, file):
        encoding = locale.getpreferredencoding()
        words = self.counts.keys()
        words.sort()
        for word in words:
            file.write("%d %s\n" % 
                       (self.counts[word], word.encode(encoding)))
            
def split_words(str):
    words = []
    word = None
    for c in str + " ":
        if word:
            if c.isalnum():
                word += c
            else:
                words.append(word)
                word = None
        else:
            if c.isalnum():
                word = c
    return words

def main():
    locale.setlocale(locale.LC_ALL, "")
    encoding = locale.getpreferredencoding()

    counter = WordFreqCounter()

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.decode(encoding)
        for word in split_words(line):
            counter.count_word(word)

    counter.print_counts(sys.stdout)

main()
---- 8< ----

In this example, we put the dictionary inside a class. It doesn't really
matter in a program this small, whether we have a custom class or a plain
dictionary, but we do it for demonstration purposes.

"class" starts a class definition. A class is instantiated by saying
"ClassName()" (with arguments, if any, to the constructor inside the
parentheses).

Methods are defined as functions inside the class, i.e., they must be
indented relative to the "class" line. Methods are straightforward,
except for their first argument, customarily called "self", which is
a reference to the class instance (object) they're being called for.
Thus, when you call "counter.count_word(word)", the method's first
argument ("self") is bound to "counter", and its second argument
("word") is bound to "word" (in the caller's context).

There is no implicit way to refer to other methods or attributes of
the object or class, you must always go via "self".

The special method name "__init__" indicates the constructor. It is
called when the object is created.

Simplifying slightly, there are no access controls on Python object
attributes and methods. Everything is "public" in the C++ terminology.
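
A smaller class, as a sketch of the points above: a constructor with an
argument, access through "self", and attributes that are readable from
outside. The class "Greeter" is invented for this example.

```python
class Greeter:

    def __init__(self, greeting):
        # Constructor arguments follow "self".
        self.greeting = greeting
        self.greeted = 0

    def greet(self, name):
        # Attributes of the object are always reached through "self".
        self.greeted += 1
        return "%s, %s" % (self.greeting, name)

greeter = Greeter("hi there")
message = greeter.greet("darling")
# Attributes are public, so the caller can read (or change) them.
count = greeter.greeted
```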


* modules
---------

First the file wordstuff.py:

---- 8< ----
import locale

class WordFreqCounter:

    def __init__(self):
        self.counts = {}
        
    def count_word(self, word):
        word = word.lower()
        if self.counts.has_key(word):
            self.counts[word] += 1
        else:
            self.counts[word] = 1
            
    def print_counts(self, file):
        encoding = locale.getpreferredencoding()
        words = self.counts.keys()
        words.sort()
        for word in words:
            file.write("%d %s\n" % 
                       (self.counts[word], word.encode(encoding)))

def split_words(str):
    words = []
    word = None
    for c in str + " ":
        if word:
            if c.isalnum():
                word += c
            else:
                words.append(word)
                word = None
        else:
            if c.isalnum():
                word = c
    return words
---- 8< ----

And then the file freq3.py:

---- 8< ----
#!/usr/bin/python

import locale
import sys

from wordstuff import WordFreqCounter, split_words

def main():
    locale.setlocale(locale.LC_ALL, "")
    encoding = locale.getpreferredencoding()

    counter = WordFreqCounter()

    while True:
        line = sys.stdin.readline()
        if not line:
            break
        line = line.decode(encoding)
        for word in split_words(line):
            counter.count_word(word)

    counter.print_counts(sys.stdout)

if __name__ == "__main__":
    main()
---- 8< ----

freq3.py is the main program and uses wordstuff.py as a module,
and imports only certain names from it. These names can then
be referred to without prefixing them with the module name.

Potentially every Python file is a module that can be imported to
another file. Modules are searched for in $PYTHONPATH; see the
documentation for more details. Usually you don't need to worry about
setting $PYTHONPATH if things are installed in the canonical way.
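
The two import styles can be compared with a module from the standard
library. This sketch uses "os.path", which is not otherwise part of the
tutorial, purely as an example; "sys.path" is the list of directories
that Python actually searches, and the entries of $PYTHONPATH are added
to it at start-up.

```python
import os.path                      # names must be prefixed: os.path.basename
from os.path import basename        # the name can be used directly

long_form = os.path.basename("/usr/bin/python")
short_form = basename("/usr/bin/python")

import sys
search_path = sys.path              # directories searched for modules
```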


* what next
-----------

Read the tutorial on python.org.
    
Skim through the library reference and play with any interesting stuff
you find there.
    
Write programs, read programs.
