A text file is simply a sequence of characters stored on disk, typically encoded using ASCII or a Unicode encoding such as UTF-8, with no formatting information. This distinguishes it from many other files, such as a file saved by a word processor, which also stores formatting information.
A text file might represent an e-mail message or a novel, but it is also often used for representing tabular data. For example, we might want to represent a table of county populations.
county     state   pop.
Arkansas   AR      18777
Ashley     AR      21283
Baxter     AR      40957
⋮          ⋮       ⋮
One common way to store a table like this in a text file is tab-separated values, where rows of the table are separated by line breaks (typically ASCII character code 10), and where columns within each row are separated by tab characters (ASCII character code 9). So the first three rows of the above table might be represented as the following character sequence.
Arkansas   AR      18777
Ashley     AR      21283
Baxter     AR      40957
Or more literally (though this will display well only if your browser supports the character “⇥” to represent a tab and “↵” to represent a line break):
Arkansas⇥AR⇥18777↵Ashley⇥AR⇥21283↵Baxter⇥AR⇥40957↵
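The same character sequence can be written as a Python string, where \t stands for the tab character and \n for the line break. The following is only a small sketch to confirm the character codes mentioned above; the variable name text is my own choice.

```python
# The three rows as one string: \t is a tab, \n is a line break.
text = 'Arkansas\tAR\t18777\nAshley\tAR\t21283\nBaxter\tAR\t40957\n'

print(ord('\t'))          # character code of a tab: 9
print(ord('\n'))          # character code of a line break: 10
print(text.count('\t'))   # two tabs in each of the three rows
print(text.count('\n'))   # one line break at the end of each row
```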
To access a text file in Python, you should use the built-in open function, which takes a string representing a file's name and returns a “file object” that can be used to access the file's contents. In the example below, open creates a file object corresponding to the text in the file named data.txt.
infile = open('data.txt')
(Of course, the word infile is just a variable name I chose. You could choose a different variable name to reference the file object returned by open.)
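A file object should eventually be closed. One common Python idiom for this, which the examples in this section do not use, is the with statement: it closes the file automatically when its block ends. The sketch below first writes a small sample file so that it can run on its own; the file name sample_counties.txt and the temporary directory are assumptions made only for illustration.

```python
import os
import tempfile

# Write a small tab-separated file to read back (hypothetical sample location).
path = os.path.join(tempfile.mkdtemp(), 'sample_counties.txt')
with open(path, 'w') as outfile:
    outfile.write('Arkansas\tAR\t18777\nAshley\tAR\t21283\n')

# open returns a file object; the with statement closes it when the block ends.
with open(path) as infile:
    contents = infile.read()    # read the whole file as one string

print(repr(contents))
print(infile.closed)            # the file has been closed by this point
```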
Once you have an object referencing the file, you can iterate through the file by using a regular for loop:
for x in infile:
    # code to process line x in the file
The body of the loop will be processed for each line of the file. For example, if the file had the three lines mentioned above, we would go through the loop three times, with x being each of the following.
Arkansas⇥AR⇥18777↵
Ashley⇥AR⇥21283↵
Baxter⇥AR⇥40957↵
In practice, the body of the loop will almost always want to take off the newline character at the end, so you would want to use the rstrip method; and then to divide the line into its component parts, you'd want to use the split method, passing it the tab character as a parameter.
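To see what these two steps do to a single line, here is a minimal sketch using the first row of the table; the variable names stripped and data are my own.

```python
line = 'Arkansas\tAR\t18777\n'   # one line as it would be read from the file

stripped = line.rstrip()         # removes the trailing line break
data = stripped.split('\t')      # divides the line at each tab character

print(repr(stripped))            # 'Arkansas\tAR\t18777'
print(data)                      # ['Arkansas', 'AR', '18777']
```

Note that rstrip with no parameter removes all trailing whitespace, including tabs. If the last column of a row could be empty, line.rstrip('\n') would be the safer choice, since it removes only the line break.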
Here is a complete program, which goes through the file and displays all counties where the population is 100,000 or above.
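infile = open('data.txt')
for line in infile:
    # remove the line break, then divide the line at each tab
    data = line.rstrip().split('\t')
    pop = int(data[2])
    # display the county and state if the population is 100,000 or above
    if pop >= 100000:
        print('{0}, {1}'.format(data[0], data[1]))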
Notice how the first part of each iteration is to use rstrip and then split to take a line and divide it into its component parts, assuming that we're working with a tab-separated line.
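Because a for loop works the same way over a file object and over a list of strings, the filtering logic can be tried out without any file at all. The following sketch wraps it in a function; the name large_counties and its minimum parameter are hypothetical, introduced only for this illustration, and the sample rows are the three from the table above.

```python
def large_counties(lines, minimum=100000):
    """Return 'county, state' strings for rows with population >= minimum."""
    result = []
    for line in lines:
        data = line.rstrip().split('\t')
        if int(data[2]) >= minimum:
            result.append('{0}, {1}'.format(data[0], data[1]))
    return result

# A list of strings stands in for the lines of the file.
sample = ['Arkansas\tAR\t18777\n', 'Ashley\tAR\t21283\n', 'Baxter\tAR\t40957\n']

print(large_counties(sample, minimum=20000))   # ['Ashley, AR', 'Baxter, AR']
print(large_counties(sample))                  # [] since none reach 100,000
```

To use it on a real file, you could pass the file object itself, as in large_counties(open('data.txt')).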
As another, more complex example, suppose we want to compute the total population of each state, computed as the total of its counties' populations. In this case, we want a dictionary that maps state abbreviations to the total population found so far for that state. Only after reading the entire file would we go through the dictionary and display the total for each state.
infile = open('data.txt')
pops = {}    # maps each state abbreviation to its total so far
for line in infile:
    data = line.rstrip().split('\t')
    state = data[1]
    pop = int(data[2])
    if state in pops:
        pops[state] += pop
    else:
        pops[state] = pop

# after reading the whole file, display the total for each state
for state in pops:
    print('{0:32s}{1:9d}'.format(state, pops[state]))
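The if/else test on the dictionary can also be written with the dictionary's get method, which returns a default value when the key is missing. This is a variation rather than the program above: the function name state_totals is hypothetical, the last sample row is invented purely to show a second state, and the output is sorted by abbreviation as an added convenience.

```python
def state_totals(lines):
    """Return a dictionary mapping each state abbreviation to its total population."""
    pops = {}
    for line in lines:
        data = line.rstrip().split('\t')
        state = data[1]
        # get returns 0 the first time a state is seen
        pops[state] = pops.get(state, 0) + int(data[2])
    return pops

sample = [
    'Arkansas\tAR\t18777\n',
    'Ashley\tAR\t21283\n',
    'Baxter\tAR\t40957\n',
    'Placeholder\tZZ\t5\n',   # invented row, not real data
]

totals = state_totals(sample)
for state in sorted(totals):
    print('{0:32s}{1:9d}'.format(state, totals[state]))
```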