
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<meta name="generator" content="HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />

	<title>Introductory Programming in Python: Files</title>
	<link rel='stylesheet' type='text/css' href='style.css' />
	<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
<script src="animation.js" type="text/javascript">
</script>
</head>

<body onload="animate_loop()">
	<div class="page">

		<h1>Introductory Programming in Python: Lesson 18<br />

		Files for Input and Output</h1>

		<div class="centered">
			[<a href="random.html">Prev: Random Numbers</a>]&nbsp;[<a href="index.html">Course Outline</a>]&nbsp;[<a href="regexp.html">Next: Regular Expressions</a>]
		</div>

		<h2>Filesystems, Directories, and Files</h2>

		<p>The concept of a file is fairly intuitive, but as with all things
		programming, intuition is not enough. Let us briefly explore exactly
		what a file is. Microsoft has served, as always, to confuse and
		obfuscate the concept of a file. A file is interchangeably called a
		document, a file, a song, a movie, a spreadsheet, etc. It is
		important to understand that the multiple names used in common parlance
		for a file are in fact references to the <em>use</em> to which that
		file is put, rather than the idea of what a file actually is. So what
		is a file?  Simply put, a file is an ordered collection of data
		associated with a name and location by a filesystem on a particular
		device. The device might be a hard drive, a flash disk, a CD-ROM, or
		even your cellphone. The same file may be used for many different
		things, e.g. I can read my .mp3 file as text. It makes very little
		sense, but it is possible. Alternatively, one could attempt to play the
		contents of a large spreadsheet through one's sound card. Again, it
		won't make much sense, in fact it'll just sound like a burst of static,
		but it is possible.</p>

		<div class="centered">
			<strong>A file is a name associated with an ordered collection of data on some storage medium</strong>
		</div>

		<p>Another concept that the GUIs of the 1990s onwards have eroded
		significantly is the idea of the filesystem. Viewed on a physical
		storage device, we encounter a number of problems dealing directly with
		files. Firstly, a single file need not be stored in a single contiguous
		area; it can be <em>fragmented</em>. Secondly, there's no apparent
		structure or order to where files are stored relative to one another.
		Your word processing documents might be stored right next to, or even
		intermeshed with, files belonging to the operating system, your music
		collection, or your applications. Clearly we need a way of imposing a
		logical structure onto a collection of files. So we introduce a method
		of grouping arbitrary files together, namely the
		<strong>directory</strong>, which you may know of as a folder.
		Generally, any file can be put into any directory, but cannot exist in
		multiple directories at once. Thus we have given files a
		<strong>location</strong>. However, directories can also contain other
		directories, introducing a hierarchical structure of files within
		directories, within other directories, within ... oh hell.  Where does
		it end? It does end, or rather it <em>starts</em> somewhere!</p>
		
		<div class="centered">
			<strong>A directory is an arbitrary unordered grouping of files</strong>
		</div>

		<p>Every filesystem has a starting point called the
		<strong>root</strong>. In MS-DOS, Windows, Symbian and some other
		cellphone OSs, each filesystem is assigned its own root using a
		letter from the alphabet, as in <code>C:\</code>. Note the backslash!
		In Linux, Unix and any other POSIX-compliant OS, the root is simply
		called <code>/</code>, and other filesystems can be placed inside the
		root, much like directories can be placed inside one another. So now we
		have a way to specify a particular file exactly: even if two files
		have the same name, no two files can have exactly the same
		location. The location of a file is always specified from the root down
		through each directory to the file, including the name of the file,
		e.g.</p>

<pre class='listing'>C:\Documents and Settings\James\Desktop\todo.txt
/home/james/Desktop/todo.txt</pre>

		<p>Note how in both cases we start with the root, and name each
		directory successively, zeroing in on the directory containing the file
		of interest. We separate the directory names with <code>\</code> in the
		case of Windows/MS-DOS, or <code>/</code> in the case of Linux/POSIX.
		The explicit sequence from root to name is known as the <strong>full
		path</strong> of the file, as we have followed the full path from root
		through each directory to the file.</p>
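		<p>The same idea can be sketched in Python. The standard library
		modules <code>posixpath</code> and <code>ntpath</code> implement the
		Linux/POSIX and Windows path rules respectively, so this sketch
		behaves the same on either kind of system. The variable names are
		only illustrative, and no files are touched.</p>

```python
import ntpath      # path rules for Windows/MS-DOS style paths
import posixpath   # path rules for Linux/POSIX style paths

# Build each full path from the root down to the file name.
posix_full = posixpath.join("/", "home", "james", "Desktop", "todo.txt")
windows_full = ntpath.join("C:\\", "Documents and Settings", "James",
                           "Desktop", "todo.txt")
print(posix_full)
print(windows_full)

# A full path starts at a root, so both count as absolute paths.
posix_is_full = posixpath.isabs(posix_full)
windows_is_full = ntpath.isabs(windows_full)
print(posix_is_full)
print(windows_is_full)

# Split a full path into the containing directory and the file name.
directory, name = posixpath.split(posix_full)
print(directory)
print(name)
```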

		<h2>Relative Paths and the Working Directory</h2>

		<p>Of course specifying the full path of every file every time we wish
		to use it is inconvenient. Many operating systems thus include the idea
		of a <strong>working directory</strong>. Working directories are tied
		to login sessions, and are not readily apparent in GUIs. When a user
		logs in to a system, their working directory for that login session is
		usually set to their home directory. Various OS commands can change or
		print out the working directory. <code>cd</code> is used to change the
		working directory to another one. Whenever a file or directory is
		specified, and the specification is not a full path, the file is
		considered relative to the current working directory. For example,
		specifying only a name for a file implies that the file we are looking
		for is in the working directory. When a program runs, it inherits the
		working directory from the login session from which it is run. In the example
		below, the full path to the working directory is displayed in the
		prompt.</p>

<pre class='listing'>/home/james $ ls
todo.txt
/home/james $ cd /home
/home $ ls
james
/home $ cat james/todo.txt
I have nothing to do!
/home $ </pre>

		<p>Note how the final command specifies an incomplete path
		'james/todo.txt', and not the full path. Because a full path was not
		specified, the working directory ('/home' at the time) is prepended to
		the name, yielding '/home/james/todo.txt'. Thus we are able to specify
		files and directories lower down in the directory hierarchy in a
		relative manner.</p>
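		<p>A Python program has a working directory too, and the standard
		<code>os</code> module lets us inspect and change it with
		<code>os.getcwd()</code> and <code>os.chdir()</code>. The sketch
		below mimics the shell session above inside a temporary directory.
		The helper name <code>read_relative</code> and the directory layout
		are assumptions made for illustration, and the file methods it uses
		are explained later in this lesson.</p>

```python
import os
import tempfile

def read_relative(directory, relative_path):
    """Make `directory` the working directory, then read a relative path."""
    previous = os.getcwd()          # remember where we started
    os.chdir(directory)             # like the shell's cd command
    try:
        f = open(relative_path, "r")
        text = f.read()
        f.close()
        return text
    finally:
        os.chdir(previous)          # always restore the working directory

# Stand-in for /home: a temporary directory containing james/todo.txt.
home = tempfile.mkdtemp()
os.mkdir(os.path.join(home, "james"))
f = open(os.path.join(home, "james", "todo.txt"), "w")
f.write("I have nothing to do!\n")
f.close()

text = read_relative(home, os.path.join("james", "todo.txt"))
print(text)
```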

		<p>Of course if a file is in a directory somewhere above or rather
		outside the working directory, it would seem we must still use the full
		path to specify it, but there are a few shortcuts we can take.</p>

		<ul>
			<li><code>./</code> indicates the working directory, not much help to us really...</li>

			<li><code>../</code> indicates the directory above the directory specified in the path so far.</li>
		</ul>

		<p>For example, if the working directory were '/home/james/IntroPython'</p>

		<ul>
			<li><code>index.html</code> would have the full path <code>/home/james/IntroPython/index.html</code></li>
			
			<li><code>./index.html</code> would have the full path <code>/home/james/IntroPython/index.html</code></li>

			<li><code>data/testinput.txt</code> would have the full path <code>/home/james/IntroPython/data/testinput.txt</code></li>

			<li><code>../todo.txt</code> would have the full path <code>/home/james/todo.txt</code></li>

			<li><code>../amusement/phdcomics.tar.gz</code> would have the full path <code>/home/james/amusement/phdcomics.tar.gz</code></li>
			
			<li><code>../../../bin/ls</code> would have the full path <code>/bin/ls</code></li>
			
			<li><code>/home/james/../../bin/ls</code> would have the full path <code>/bin/ls</code></li>

		</ul>
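		<p>The resolutions listed above can be checked with a short sketch.
		It assumes POSIX path rules (via the standard
		<code>posixpath</code> module) and a hypothetical helper called
		<code>resolve</code>. Note that <code>normpath</code> works purely
		on the text of the path; it does not look at the disk, so it takes
		no account of symbolic links.</p>

```python
import posixpath

def resolve(working_directory, path):
    """Return the full path that `path` refers to, using POSIX rules."""
    # join() keeps `path` unchanged if it already starts at the root.
    combined = posixpath.join(working_directory, path)
    # normpath() drops "./" components and collapses "name/../" pairs.
    return posixpath.normpath(combined)

working_directory = "/home/james/IntroPython"
examples = [
    "index.html",
    "./index.html",
    "data/testinput.txt",
    "../todo.txt",
    "../amusement/phdcomics.tar.gz",
    "../../../bin/ls",
    "/home/james/../../bin/ls",
]
for example in examples:
    print(example + " is " + resolve(working_directory, example))
```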

		<h2>Files for Input and Output</h2>

		<p>And all this has been leading up to the idea that programs need not
		limit themselves to keystrokes and the screen as input and output
		methods respectively. Files can be used as both input and output. Files
		allow us to conveniently input large quantities of data, and similarly
		output large quantities of data for later review. When we want to work
		with files, we must clearly be able to specify the file we wish to work
		with, using its location and name, i.e. a valid full or relative
		path.</p>

		<h2>Opening Files</h2>

		<p>Opening a file in Python is simple. We use the <a class="doclink"
		href='http://docs.python.org/lib/built-in-funcs.html'>open
		function</a>. The open function returns a <a class="doclink"
		href='http://docs.python.org/lib/bltin-file-objects.html'>file
		object</a>, given a path to a file in the form of a string, and a string
		specifying whether to open the file for reading or writing.</p>

		<p>By default, Python treats files as text files, meaning they are
		broken up into units called lines. However, there is also a concept
		called a <strong>file pointer</strong>. A file pointer is like a cursor
		in a word processor, which sits between characters in a file. When we
		read from a file, we read from the file pointer onwards to the right.
		Similarly all writing done to the file will be from the file pointer
		onwards, overwriting if the file already contained data to the right of
		the file pointer, otherwise extending the size of the file as
		necessary. Both reading from and writing to a file reposition the file
		pointer at the end of the sequence read or written.</p>

<pre class='listing'>Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55) [MSC v.1310 32 bit (Intel)] on win32 
[GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; f = open("input.txt","r")
&gt;&gt;&gt; f
&lt;open file 'input.txt', mode 'r' at 0xb7dacentered&gt;
&gt;&gt;&gt;
</pre>

		<p>Here we see the open function in action. It is most commonly used as
		the expression of an assignment statement, but obviously, being a
		function, can be used anywhere an expression is valid. The open
		function takes two parameters, firstly, the path of the file to open,
		and secondly, a string specifying the mode (read, write, or append) in
		which to open the file. Valid strings for the mode are</p>

		<ul>

			<li>"r" opens the file in read only mode. The file must exist prior
			to opening. A plus suffix ("r+") means the file is opened for both
			reading and writing. In either case the file is opened with the
			file pointer at the beginning of the file.</li>

			<li>"w" opens the file in write only mode. The file will be
			overwritten if it already exists, otherwise it will be created. A
			plus suffix ("w+") means the file is opened for both reading and
			writing, but the file is still truncated to zero bytes if it exists
			already. Obviously this means the file pointer starts at the
			beginning of the file. Writing beyond the end of a file simply
			enlarges the file to accommodate whatever is written.</li>

			<li>"a" opens the file in append mode. The file can only be written
			to, and the file pointer starts at the <strong>end</strong> of the
			file, meaning all data subsequently written will be added to the
			end of the file. If the file doesn't already exist, it will be
			created. A plus suffix ("a+") means the file is opened for both
			reading and appending. The position at which reading starts
			depends on the platform and Python version, so it is safest to
			seek explicitly (see the seek method below) before reading. On
			most systems every write still goes to the <strong>end</strong>
			of the file, wherever the file pointer happens to be. Unlike "r+",
			"a+" creates the file if it does not exist, so the two modes are
			not equivalent.</li>

		</ul>
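		<p>The difference between the three basic modes can be seen in a
		small sketch. The scratch file name <code>modes.txt</code> and the
		temporary directory are assumptions for illustration. The read,
		write and close methods used here are described in the sections that
		follow.</p>

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# "w" creates the file, or truncates it if it already exists.
f = open(path, "w")
f.write("first\n")
f.close()

# "a" starts at the end of the file, so the existing data is kept.
f = open(path, "a")
f.write("second\n")
f.close()

# "r" starts at the beginning and requires the file to exist.
f = open(path, "r")
after_append = f.read()
f.close()

# Opening with "w" again discards everything written so far.
f = open(path, "w")
f.write("third\n")
f.close()

f = open(path, "r")
after_rewrite = f.read()
f.close()

print(repr(after_append))
print(repr(after_rewrite))
```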

		<h2>Reading From Files</h2>

		<p>Once we have an open file object, we can use its methods to both
		read from and write to the file it represents. When a file is opened
		for reading, the file pointer is positioned at the beginning of the
		file (position 0) which is just before the first character in the file.
		File objects provide a variety of methods to read from files...</p>

		<ul>

			<li><code>&lt;file object&gt;.read(&lt;count&gt;)</code> reads
			'count' characters from the file object starting at the file
			pointer, and returns the read characters as a string. If fewer than
			'count' characters remain to be read, then a string is still
			returned, but it will be shorter than count characters. An empty
			string is returned if the file pointer is at the end of the
			file.</li>

			<li><code>&lt;file object&gt;.readline()</code> reads from the file
			pointer onwards up to and including the first newline ('\n')
			character, and returns the read characters as a string. An empty
			string is returned if the file pointer is at the end of the
			file.</li>

			<li><code>&lt;file object&gt;.readlines()</code> reads from the
			file pointer until the end of file, returning a list of lines, each
			containing the trailing newline characters, as strings. An empty
			list is returned if the file pointer is already at the end of the
			file.</li>
		
		</ul>

		<p>Using the following file as an example</p>

<pre class='listing'>This is a simple file
Containing only three lines
of text</pre>

		<p>We can demonstrate the use of the various file reading methods.</p>

<pre class='listing'>&gt;&gt;&gt; f.read(3)
'Thi'
&gt;&gt;&gt; f.readline()
's is a simple file\n'
&gt;&gt;&gt; f.readlines()
['Containing only three lines\n', 'of text\n']
&gt;&gt;&gt;</pre>

		<p>Note how using the simple 'read' method, we get only three
		characters (being the count we specified), and the 'readline' method
		continues from where 'read' finished. This illustrates the file pointer
		in action. The file pointer is positioned between the 'i' and 's' of
		'This' on the first line of the file after the 'read' method is
		executed. After the 'readline' it is between the end of line one and
		the first character of line two ('C'). Thus, 'readlines' has two
		complete lines left to read when it is called.</p>

		<p>Often a more useful way to read the lines of a file in sequence is
		to use a for loop over the file. When used in a for loop, a file
		object acts as a sequence of lines, as in</p>

<pre class='listing'>&gt;&gt;&gt; f = open("input.txt","r")
&gt;&gt;&gt; for line in f:
...     print line.rstrip()
... 
This is a simple file
Containing only three lines
of text
&gt;&gt;&gt;</pre>
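		<p>Looping over a file is handy whenever we want to process it one
		line at a time without holding the whole file in memory. As a rough
		sketch, the script below counts the lines and words in a copy of the
		example file. It writes that copy to a temporary directory first so
		that it can be run on its own, and the counter names are just
		illustrative.</p>

```python
import os
import tempfile

# Create a scratch copy of the three-line example file.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
f = open(path, "w")
f.write("This is a simple file\nContaining only three lines\nof text\n")
f.close()

line_count = 0
word_count = 0
f = open(path, "r")
for line in f:
    # Each `line` still ends with its newline character.
    line_count += 1
    # split() with no arguments splits on any run of whitespace.
    word_count += len(line.split())
f.close()

print(line_count)
print(word_count)
```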

		<h2>Writing To Files</h2>

		<p>Writing files uses methods very similar to those used to read from
		files, except that writing is often buffered in memory, meaning the
		file on disk is typically only updated when the buffer fills up, the
		buffer is explicitly flushed, or the file object is closed.</p>

		<ul>

			<li><code>&lt;file object&gt;.write(&lt;string&gt;)</code> writes
			the contents of 'string' to the file, starting at the file pointer
			and overwriting data in the file, or enlarging the file as
			necessary. Note that no newline is added, so multiple successive
			write calls without any newlines in their respective strings,
			produce only one line of text in the file.</li>

			<li><code>&lt;file object&gt;.writelines(&lt;list&gt;)</code>
			writes the elements of the list, which must all be strings, in
			order to the file. Newlines are not added, so if the strings have
			no newlines, only one line of output will be written.</li>

		</ul>
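		<p>Because written data may sit in a buffer for a while, file
		objects also have a <code>flush()</code> method that pushes
		buffered data out to the operating system without closing the file.
		The following is a minimal sketch of its effect, using a
		hypothetical scratch file. What a second reader would see
		<em>before</em> the flush depends on the buffer size, so the sketch
		only looks afterwards.</p>

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffered.txt")

w = open(path, "w")
w.write("not necessarily on disk yet")

# flush() hands any buffered data over to the operating system.
w.flush()

# A second, independent file object can now see the flushed text.
r = open(path, "r")
seen_after_flush = r.read()
r.close()

w.close()
print(seen_after_flush)
```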

		<p>As an example, let's create a new file, and write some text out to
		it.</p>

<pre class='listing'>&gt;&gt;&gt; w = open("newfile.txt","w")
&gt;&gt;&gt; w.write("This is the first line of text ")
&gt;&gt;&gt; w.write("This is still on the first line\n")
&gt;&gt;&gt; w.writelines(["The second line\n", "The Salmon Mousse"])
&gt;&gt;&gt; w.close()
&gt;&gt;&gt; </pre>

		<p>Looking at 'newfile.txt', we see</p>

<pre class='listing'>This is the first line of text This is still on the first line
The second line
The Salmon Mousse</pre>

		<p>We see from the first two calls to 'write' that we have to supply
		our own newline characters to force line breaks in the output. We also
		see that closing our files is a good idea when they are opened for
		writing. Technically, Python's garbage collector usually takes care of
		this for us, closing file objects before they are collected, but it is
		considered good practice to explicitly close files.</p>

		<ul>

			<li><code>&lt;file object&gt;.close()</code> closes the file,
			disallowing further read or write operations on the file. Any
			pending written data in the file buffer is flushed to disk.
			Technically, close is called automatically when a file object
			variable is garbage collected, but it is good practice to
			explicitly close the files your program opens, as the operating
			system has a limit to the number of files that may be open at one
			time.</li>

		</ul>
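		<p>There are two common ways to make sure close is always called.
		The sketch below shows both: an explicit <code>try</code> /
		<code>finally</code>, and the <code>with</code> statement, which is
		not otherwise covered in this lesson and is built into Python from
		version 2.6 onwards. The scratch file name is an assumption for
		illustration.</p>

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newfile.txt")

# Explicit style: close() sits in a finally clause so it runs even
# if the write fails.
w = open(path, "w")
try:
    w.write("The second line\n")
finally:
    w.close()
print(w.closed)

# A closed file object refuses further operations.
try:
    w.write("too late")
except ValueError:
    print("cannot write to a closed file")

# The with statement calls close() for us when the block ends.
with open(path, "r") as r:
    contents = r.read()
print(r.closed)
print(contents)
```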

		<h2>Moving the File Pointer Manually</h2>

		<p>Occasionally we may want to move the file pointer manually, for
		example when reading a file of a format that allows us to skip or
		ignore large sections to get to the part of the file we want. Python
		provides two methods of file objects to do this, the first to tell us
		where the pointer is currently, and the second to move it. Both treat
		the file as a one dimensional stream of characters, much like a string.
		The file pointer is an integer index into this 'string' specifying the
		first character that would be read or written in the next
		operation.</p>

		<ul>

			<li><code>&lt;file object&gt;.tell()</code> returns the index of
			the file pointer for the file.</li>

			<li>

				<code>&lt;file object&gt;.seek(&lt;position&gt;[,
				&lt;whence&gt;])</code> moves the file pointer to position
				relative to a position indicated by whence. Whence can take on
				one of three values, defaulting to 0, with the following meanings:

				<ol start="0">

					<li>Set position relative to the beginning of the
					file.</li>

					<li>Set position relative to the current position, i.e.
					negative positions will be before the current position, and
					positive positions will be after it.</li>

					<li>Set position relative to the end of the file, i.e. only
					negative positions will give us a position inside the
					file.</li>
				
				</ol>

			</li>

		</ul>
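		<p>Here is a small sketch of tell and seek at work on a copy of the
		three-line example file. It opens the file in binary mode ("rb"),
		which is an assumption made so that positions are plain byte counts
		on every platform; newer versions of Python restrict relative seeks
		on files opened in text mode.</p>

```python
import os
import tempfile

# Hypothetical scratch copy of the three-line example file.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
f = open(path, "wb")
f.write(b"This is a simple file\nContaining only three lines\nof text\n")
f.close()

f = open(path, "rb")
first = f.read(4)
position_after_read = f.tell()   # reading moved the pointer forward

f.seek(10)                       # whence defaults to 0: count from the start
from_start = f.read(6)

f.seek(-6, 1)                    # whence 1: step back from the current position
from_current = f.read(6)

f.seek(-5, 2)                    # whence 2: count back from the end of the file
from_end = f.read()
end_position = f.tell()          # the pointer is now at the end of the file
f.close()

print(position_after_read)
print(repr(from_start))
print(repr(from_current))
print(repr(from_end))
print(end_position)
```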

		<h2>Truncating a File</h2>

		<p>Now that we can move the file pointer manually, we can do some funky
		things. We could for example seek to somewhere in the middle of a file,
		and overwrite the data there, without affecting the rest of the file.
		We might even want to remove information from a file, where we seek to
		the start of the information, and overwrite it with... well, spaces. Oh
		dear, we have a problem here! We can't actually <em>remove</em>
		information from a file, or can we? The truth is that we can only
		shorten a file, which is more formally known as truncating the file.
		What this means is that to delete information contained in the file
		starting at some position (x) up to some position (y), we must read
		everything from y to the end of the file, and then seek back to x, and
		write what we read. Finally we truncate the file.</p>

		<ul>

			<li><code>&lt;file object&gt;.truncate([size])</code> truncates the
			file to be a maximum of size bytes in length. If size is not given,
			the file is truncated at the current file pointer position.</li>

		</ul>
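		<p>The delete procedure described above can be sketched as a small
		function. The name <code>delete_range</code> is hypothetical, the
		file is opened in binary read/write mode ("r+b") so that x and y are
		byte positions, and the whole tail of the file is read into memory,
		which is fine for small files but would need more care for very
		large ones.</p>

```python
import os
import tempfile

def delete_range(path, x, y):
    """Remove the bytes from position x up to, but not including, y."""
    f = open(path, "r+b")
    f.seek(y)
    tail = f.read()      # everything from y to the end of the file
    f.seek(x)
    f.write(tail)        # overwrite the unwanted data, starting at x
    f.truncate()         # cut the file off at the current file pointer
    f.close()

# Try it on a hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
f = open(path, "wb")
f.write(b"This is a simple file\n")
f.close()

delete_range(path, 8, 17)    # removes the nine bytes "a simple "

f = open(path, "rb")
result = f.read()
f.close()
print(repr(result))
```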

		<div class="centered">
			[<a href="random.html">Prev: Random Numbers</a>]&nbsp;[<a href="index.html">Course Outline</a>]&nbsp;[<a href="regexp.html">Next: Regular Expressions</a>]
		</div>
	</div>

	<div class="pagefooter">
		Copyright &copy; James Dominy 2007-2008; Released under the <a href="http://www.gnu.org/copyleft/fdl.html">GNU Free Documentation License</a><br />
		<a href="intropython.tar.gz">Download the tarball</a>
	</div>
</body>
</html>