Session 34: Binary files

Binary vs text files
The DataInputStream class (Section 14.5)
The DataOutputStream class (Section 14.5)
Writing objects to files (Section 14.7)

Binary vs text files

A text file is a file that is properly understood as a sequence of character data (represented using ASCII, Unicode, or some other standard), separated into lines. Typically, when a text file is displayed as a sequence of characters, it is easily human-readable (though perhaps not understandable - for example, the Breakout level file was a text file, but it certainly wouldn't be understandable without some sort of decoder telling which numbers represented what).

A binary file is anything else. A binary file will include some data that is not written using a character-encoding standard - typically, some number would be represented using binary within the file, instead of using the character representation of its various digits (in some base).
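To make the distinction concrete, here is a minimal sketch that produces the bytes for the same number both ways. The class name TextVsBinary and its helper methods are made up for this illustration, and the binary form assumes a 4-byte int stored most significant byte first.

```java
import java.nio.charset.StandardCharsets;

public class TextVsBinary {
    // Text form: one byte per decimal digit character (ASCII).
    static byte[] asText(int n) {
        return Integer.toString(n).getBytes(StandardCharsets.US_ASCII);
    }

    // Binary form: always 4 bytes for an int, most significant byte first.
    static byte[] asBinary(int n) {
        return new byte[] {
            (byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n
        };
    }

    // Formats bytes as two-digit hexadecimal values separated by spaces.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 1234567890;
        System.out.println("text:   " + hex(asText(n)));   // 31 32 33 34 35 36 37 38 39 30
        System.out.println("binary: " + hex(asBinary(n))); // 49 96 02 d2
    }
}
```

The text form takes ten bytes here and can be read in any text editor; the binary form takes four bytes but looks like garbage when displayed as characters.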

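Whether you use a binary file or a text file for storing information depends on basically two questions: how much the file's size and the speed of reading it matter, and how much portability (how easily other people and programs can work with the file format) matters.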

For the Breakout lab, file size/speed was unimportant - it was a small file anyway, and you only had to read it once. But portability was a major concern, since each student had to implement the file format themselves. Thus we used a text file.

The DataInputStream class

Textbook: Section 14.5

The DataInputStream class is layered on top of the FileInputStream class (actually, it's layered on top of the InputStream class, of which FileInputStream is a subclass).

DataInputStream(InputStream in)
Creates a DataInputStream that uses in whenever it wants to read more bytes.

It provides a variety of methods for reading data from the file.
int read(byte[] b)
Attempts to fill b with bytes, returning the number of bytes read - this is the same thing as what FileInputStream provided.

int readInt()
Reads 4 bytes from the file and returns the int that these 4 bytes represent.

double readDouble()
Reads 8 bytes from the file and returns the double that these 8 bytes represent.

String readUTF()
Reads a string stored in the UTF format (described below) and returns it.

When reading a number, the computer reads the binary representation from the file, with the most significant byte first. For example, if the file contained the following four bytes (representing each byte as two hexadecimal digits)
00 00 02 04
the DataInputStream would read this as the binary number 1000000100(2), which is 516(10).
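Here is a small sketch of that example. To keep it self-contained, it feeds the four bytes to the DataInputStream from a ByteArrayInputStream in memory rather than from a FileInputStream; the class name ReadIntDemo and the method readFirstInt are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ReadIntDemo {
    // Reads the first 4 bytes of data as an int, most significant byte first.
    static int readFirstInt(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        } catch (IOException e) {
            // Thrown, for example, if fewer than 4 bytes are available.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContents = { 0x00, 0x00, 0x02, 0x04 };
        System.out.println(readFirstInt(fileContents)); // 516

        // The same computation by hand: each byte is worth 256 times the next.
        int byHand = (0x00 << 24) | (0x00 << 16) | (0x02 << 8) | 0x04;
        System.out.println(byHand); // 516, that is, 2 * 256 + 4
    }
}
```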

The UTF format is a standard for representing strings in a file, engineered to make it easy to recover a string from a file, regardless of what computer you're using. It begins with two bytes saying in binary how many bytes long the remainder of the string representation is, followed by the characters of the string represented in a slightly modified UTF-8 format (which basically represents ASCII characters as is, with non-ASCII Unicode characters handled in a more complex way). For example, the string "CAB" would be represented by the following five bytes (represented here in hexadecimal).

00 03 43 41 42
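You can check these bytes yourself with a sketch like the following, which writes a string with writeUTF into an in-memory ByteArrayOutputStream (standing in for a file) and prints what came out. The class name WriteUtfDemo and its helper methods are made up for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfDemo {
    // Returns the bytes that writeUTF produces for the string s.
    static byte[] utfBytes(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(s);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as two-digit hexadecimal values separated by spaces.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(utfBytes("CAB")));    // 00 03 43 41 42
        // A non-ASCII character takes more than one byte, so the length
        // prefix counts bytes, not characters.
        System.out.println(hex(utfBytes("\u00e9"))); // 00 02 c3 a9
    }
}
```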

The DataOutputStream class

Textbook: Section 14.5

DataOutputStream works analogously to DataInputStream. It's layered on top of the FileOutputStream class (technically, the OutputStream class, which FileOutputStream extends).

DataOutputStream(OutputStream out)
Creates a DataOutputStream that uses out whenever it wants to write more bytes.

It provides a variety of methods for writing data to the file.
void write(byte[] b)
Attempts to write all the bytes of b into the file - this is the same thing that FileOutputStream provided.

void writeInt(int i)
Writes 4 bytes representing the integer value i.

void writeDouble(double d)
Writes 8 bytes representing the double value d.

void writeUTF(String s)
Writes the string s using the UTF standard.
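Putting the two classes together, here is a rough sketch of a round trip: write an int, a double, and a string into a file, then read them back in the same order. The class name RoundTripDemo, the method roundTrip, and the temporary file name are all made up for the illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RoundTripDemo {
    // Writes the three values to a temporary binary file, reads them back,
    // and returns them joined by spaces.
    static String roundTrip(int i, double d, String s) {
        try {
            Path file = Files.createTempFile("session34", ".bin");
            try {
                try (DataOutputStream out =
                         new DataOutputStream(new FileOutputStream(file.toFile()))) {
                    out.writeInt(i);     // 4 bytes
                    out.writeDouble(d);  // 8 bytes
                    out.writeUTF(s);     // 2-byte length, then the characters
                }
                try (DataInputStream in =
                         new DataInputStream(new FileInputStream(file.toFile()))) {
                    // The file has no labels, so the reads must come in
                    // the same order as the writes.
                    int i2 = in.readInt();
                    double d2 = in.readDouble();
                    String s2 = in.readUTF();
                    return i2 + " " + d2 + " " + s2;
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(516, 2.5, "CAB")); // 516 2.5 CAB
    }
}
```

Note that nothing in the file says which bytes are the int and which are the double. The reader simply has to know the order in which the writer wrote things.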

Writing objects to files

Textbook: Section 14.7

You'll notice that I didn't include any methods for writing entire objects into a binary file. There are two other classes that provide this capability: ObjectOutputStream and ObjectInputStream. They're analogous to DataOutputStream and DataInputStream - even providing the same methods. But each adds a new method: ObjectOutputStream provides the method

void writeObject(Object obj)
Writes a binary representation of the object obj.

and ObjectInputStream provides the method
Object readObject()
Reads a binary representation of an object and returns it.

In order to store an object in a file, Java will store the name of a class, followed by a list of the binary values of any instance variables for the object. Then, to read the object, it reads the name of a class, allocates memory to accommodate it, and then initializes its instance variables according to the binary values that follow.

It's already pretty complicated - this isn't the sort of thing you'd want to implement (or even that you could implement, without using some additional features of Java that we haven't discussed). But it gets more complex: The problem is that some instance variables could be pointing to objects. Simply writing out the memory address of these objects into the file is no good, since the memory address won't be the same when we later read the object in a separate process.

Java has a nifty (and complex) technique for handling this called serialization. Serialization allows you to specify for an object that, when it is written into a file, any objects to which its instance variables point should be written into the file too. (Moreover, some of these objects may in turn require writing still more objects into the file.)

But it gets even more complex: Often you have multiple pointers to the same object, and you need to make sure that when written, the multiple pointers still point to the same object. (As an example of when this would occur, suppose our Very Small Video program wanted to save its entire database into a file. You could write the Store object into a file. If a Customer has a video checked out, that Customer would point to a Video, but it's important that the Store also point to the same Video. Otherwise, when the Customer checks the video back in, the Video pointed to by the Store would still show the video as unavailable.) Serialization handles this issue, too, but handling it gets pretty complex.
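For the curious, here is a rough sketch of the shared-pointer situation. The Store, Customer, and Video classes below are stripped-down stand-ins invented for this illustration, not the real Very Small Video classes, and the sketch writes into an in-memory byte array rather than a file to stay self-contained. Each class is marked with the Serializable interface, which is how you tell Java that its objects may be written this way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class SharedReferenceDemo {
    // Hypothetical stand-ins for the Very Small Video classes.
    static class Video implements Serializable {
        String title;
        Video(String title) { this.title = title; }
    }

    static class Customer implements Serializable {
        Video rented; // the video this customer has checked out, if any
    }

    static class Store implements Serializable {
        List<Video> videos = new ArrayList<>();
        List<Customer> customers = new ArrayList<>();
    }

    // Builds a store where a customer and the store point to the same Video.
    static Store sampleStore() {
        Store store = new Store();
        Video video = new Video("Casablanca");
        Customer customer = new Customer();
        customer.rented = video;
        store.videos.add(video);
        store.customers.add(customer);
        return store;
    }

    // Writes the store with writeObject, then reads a fresh copy back.
    static Store copyViaSerialization(Store store) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(store); // also writes the videos and customers
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Store) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Store original = sampleStore();
        Store copy = copyViaSerialization(original);
        // The copy holds new objects, not the originals...
        System.out.println(copy.videos.get(0) == original.videos.get(0));   // false
        // ...but within the copy, both pointers still share one Video.
        System.out.println(copy.customers.get(0).rented == copy.videos.get(0)); // true
    }
}
```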

I'm only mentioning serialization in case some day in the remote future the issue pops up, and perhaps you might remember me saying something about serialization, and then you can look up how to do it. It's not something we'll be covering in class beyond the rough idea of what it is.