Fingerprinting the FAT File System
Tom Hennen
FAT Background
The MS-DOS FAT file system has two main sections the FAT area and the data area. There can be multiple copies of the FAT on the partition, this is to provide data redundancy should some FAT sectors become corrupt. A block’s physical location on disk corresponds directly with the physical location of the FAT entry for that block. Blocks (‘clusters’ in FAT terminology) are allocated to a file as a linked list. Thus a FAT entry will contain the location of the next FAT entry allocated (while that entries location within the FAT defines its location on disk).
Directory entries contain all of the information about a file, including the starting block of that files data (the other blocks must be looked up via the FAT).
Directory data is stored the same as any other file data, in the data section. With each directory entry taking up 32 bytes.
Long filenames are implemented via a hack. Directory entries with invalid attributes are allocated with UNICODE characters for the filename stored within the directory entry. The entries are themselves a linked list, with one pointing to the next until the filename ends.
There are three flavors of FAT, 12 bit, 16 bit and 32 bit. This code should work with both 16 bit and 32 bit. 12 bit would be very hard to support since FAT entries would not be byte aligned. This code assumes all FAT entries are byte aligned.
You can find more information in these documents:
http://www.microsoft.com/hwdev/download/hardware/FATGEN103.doc
http://home.teleport.com/~brainy/lfn.htm
Setup
Redhat Linux 7.3
Kernel 2.4.18-3 Uniprocessor
IDE Disk 100MB partition
The partition was formatted before the driver was loaded. I could not get formatting to work while the partition was mounted on the device.
Communication
All out of band communication between client and the device
driver is done via ioctl calls with argument values user defined (that is we
didn’t use pre-existing calls).
The Process
The process is divided into three phases
I. Determine Sector Size
II. Find information about the FAT itself: FAT Sectors, Number of Data blocks stored per FAT, FAT free block values, FAT last block values.
III. Identify directory data and fields.
Notes:
During the initial coding of this program I thought that the sectors I found were actually blocks, thus many of the variables have the word block in them, when sector is more appropriate.
All the issues listed are unresolved.
Phase I – Determine Sector Size
The client will first append a pattern of known size (8 bytes). The driver should see the traffic from the FAT, directory blocks, and data blocks. The driver will all add the blocks it sees to the all_blocks_ list. At this point the client will ask the driver if we are done. If we are done, go to the next section, otherwise continue appending. At this point we should see new data blocks, and directory blocks. We probably won’t see any more FAT traffic since the sector will be found before another block needs to be allocated. Every time the driver sees data with the pattern in it will check to see if it is in the all_blocks_ list (this list actually contains the sector numbers that have been seen). If it is in the list then it was the data sector we saw when the file was first created, so the driver will increment sector_size_ by the pattern size. The first time the driver sees data whose sector is not in the all_blocks_ list it will know the sector size. This is because the original sector will be filled before another sector is written to the disk. Thus, if we see a sector that contains the pattern, but is not in the list we must have filled the first sector. The driver will then let the client know that it can stop appending data by supplying the client with the sector size. This allows us to write entire sectors before syncing rather than 8 bytes at a time.
Issues:
This does not discover the number of bytes per BLOCK. The linux FAT driver writes data in a sector by sector basis, unlike the ext2 driver which writes it on a block by block basis. The number of sectors per block is determined in Phase II.
Phase II – Find information about the FAT itself
Section 1
The client
will simply create a new file to be used in the next sections. The driver will ignore all traffic during
this section.
Section 2 – Find the directory block for this file
The client will rename the file created in section 1. This should cause only the directory sector to be modified as the file is not growing in size and does not need space allocated, and the data itself is not changing. The driver will see this sector and it to the dir_blocks_ list. Since the FAT file system treats directories as regular data this sector may not always be a directory block. However, if the disk is empty, as it was in this case, the directory will probably be created in the same sector each time, all other things being equal. This property is assumed by some other parts of this code, but it is not strictly nessecary, as the directory block can be discovered with other means.
Section 3 – Find all of the FAT blocks
After erasing the disk of the data from the previous section the client will create a file. This will produce FAT, directory, and data block traffic. If the sector being written is in the dir_blocks_ list, or contains the known pattern, it will be ignored. This leaves only the FAT sectors. The driver code will add these sectors to the fat_blocks_ list. The client will then proceed to append to the file it created until the disk is full. This will allow the driver to see all of the FAT sectors (since it can ignore the directory and data sect0rs) and add them to the fat_blocks_ list.
Issues: This section is the slowest of all (~5 minutes on 100 MB partition) since the entire disk must be filled. If the number of data sectors stored per fat sector could be discovered and communicated to the client before the start of this section the client could write larger segments of data between each write. Similarly, if lseek() were used to simply allocate data to the file without writing to it write time may be reduced. One issue this code does not address is what happens when a file size limitation is reached before the disk is actually full. Presumably if this happened, not all of the FAT sectors would have been identified. One solution here would be to create additional files after the first error while writing to the disk. The client would then only stop once it encountered an error creating a file.
Section 4 – Discover the number of FATs
Motivation: The FAT filesystem keeps a number of FAT copies (most commonly two). In section 3 all of the sectors in all of the FATs were identified, however, we cannot yet distinguish between the different FATs.
The client, once it has removed the file from the previous section will once again create a file. This will of course produce directory and data traffic, which can be screened, but it will also cause all of the FATs to be written to. The driver, upon encountering a FAT block, will increment the num_fats_ variable; once this secion is finished this variable will contain the number of FATs. When the client indicates it is leaving this section, the fat_blocks_ list and num_fats_ can be used to create a primary_fat_blocks_ list. This list will be composed of all of the sectors of only one of the FATs on the disk. It is created by searching fat_blocks_ for the lowest sector. When it is found the highest sector of the fat can be computed by taking the total number of sectors in fat_blocks_ dividing it by num_fats_ and then adding the lowest sector (num_fat_sectors_ / num_fats_ + low_fat_sector_). The list is then assembled by going through fat_blocks_ and adding each entry which is greater than or equal to low_fat_sector_ and less than or equal to high_fat_sector_ to primary_fat_blocks_.
Section 5 – Figure out the number of blocks per fat block.
The client once again starts by creating a file, which creates FAT, directory and data traffic. It then proceeds to append to the file until the driver tells it to stop. During this time the driver will ignore any directory sectors. The driver will note the first FAT sector from the primary list written to disk. After this is noted each data sector written increments num_sectors_per_block_ until the driver sees a the FAT sector it saw previously. The idea is that the filesystem will only write to a sector if it needs to allocate another block. Thus any data sectors seen between fat sector writes belong to one ‘BLOCK’.
The driver
will then figure out the number of data sectors per fat sector. If it sees a data block it will increment the
number of sectors per fat sector (it will only do this once we have moved to a
FAT sector which has not yet seen any activity and as a result should be
empty). Once we have seen a FAT sector
that is not the first fat sector, nor the second FAT sector, we will have the
number of sectors per fat sector, and we should be able to compute the number
of blocks per FAT sector by dividing the number of sectors per fat block by the
number of sectors per block. We can also
determine the size of a FAT entry in bits by dividing sector_size_ by the number
of FAT entries per sector.
Issues:
The code currently assumes that the number of sectors per block will be found before the number of sectors per fat sector. This seems reasonable since the first can be found while the first fat sector is being written, and the number of sectors per fat sector won’t be calculated until the second fat sector is written. Also, in this section a new file is actually created after the disk has been wiped. It is assumed the file created will be placed in the same directory sector as when the directory entry was identified (this does not seem like an unreasonable assumption). However, it may be fairly trivial to modify the code so that the disk is not cleared and the same file can be used. The only thing that must be ensured is that there are no adverse affects from starting with a file which already has some data in it (from section 4 and 3). That or the file could be changed to a size of zero, the disk synced and then have section 5 continue.
Section 6 – Find the value of free and last blocks in
the FAT table
The client will create 10 files, each file create will cause FAT, directory, and data traffic. The 10 files are created so that we can make sure that we find a different entry, though this may not be strictly nessecary. The client will let the driver know each time it switches from one file to the next. The driver only cares about FAT blocks, so it will ignore any blocks which are not on the primary_fat_blocks_ list. If the block is a FAT block it will put a copy of its data in cur_fat_sector_ and move any existing data to old_fat_sector_. If there is data in both of these locations it will compare the two. They should only differ by the new FAT entry for the newly created file. The value in old_fat_sector_ should be the ‘free’ value for FAT entries, and the value in old_fat_sector_ should be the ‘last’ value for FAT entries. The reason the value is the last entry is that the file is small enough so as to only require one block, thus the one block it requires is the last block in the FAT chain.
Issues:
The first issue is that the entry values may not be as large as the fat entry size we computed in section 5. For example, FAT 32 only uses the last 28 bits to store the chain data, the upper 4 bits must be preserved by the driver, but do not have anything to do with whether the block is ‘free’ or ‘last’. For example, both the values 0xF0000000 and 0x00000000 mean the particular block is free.
The second issue is that FAT is uses a different byte endianness than is normally expected. If you have the following in an entry, FF00, 00 is actually the high byte, and FF is the low byte, thus the value is 255 and not 65280. My code does not currently take this into account, in fact it assumes the opposite. This will need to be corrected. There should probably also be a way to determine if a given file system uses high byte endianness or low byte endianness.
The code to
determine the value of the ‘free’ and ‘last’ entries fails with FAT 32. Of course, since it assumes the wrong byte
endianness in the first place it will have to be rewritten regardless.
Phase III – Directory Data
In this phase a ‘field’ is the offset in bytes from the start of the filename in a directory entry.
Section 1 – Find the number of directory entries per sector
The client will create files with a known pattern of filenames letting the driver know each time it switches from one file to the next. The driver will ignore any sector which contains the data pattern, or which is in the fat_blocks_ list. If the sector we see is in dir_list_ we should ignore it, since we would still be in the first directory block, which might have directory entries we don’t know about in it. Once we see a block which is not a FAT block or a data block, and it isn’t in dir_list_, then it is a new directory sector. Once we get a new directory block, note it and set the counter of directory entries per sector to 0. We then increment the counter each time the client tells us it has switched to a new file. When get a block which is not in dir_list_ we should then tell the client to stop creating new files since we just switched to a new (the third) directory sector, and so we have the number of directory entries per sector. At this point we can also try to find the filename within the directory sectors since we know what the filename will be at any given step.
Issues:
This entire section assumes that directory entries are of fixed size. Which, in FAT, they are. However, a more modern FAT like file system may use variable sized directory entries.
Another problem is that of how filenames are stored. The directory entries always store the filename in uppercase. This must be kept in mind when looking for the filename within the data. Also, long filename, and case information is stored in supplemental directory entries. Due to backwards compatibility issues the letters are not stored contiguously, they are also stored in UNICODE. This will make it rather difficult to infer how the long filenames are stored in the filesystem, as each character will have to be searched for one at a time. One implication of this is that these extra directory entries are not stored if the client uses filenames which are less than 8 characters and uppercase. If it is less than 8 characters and lowercase, then one extra directory entry will be stored, making it appear there are only 8 entries per sector. However, in my tests with uppercase filenames I was unable to identify the expected 16 entries per sector. I instead found 14 entries per sector, I am not sure what happened to the other 2 entires. As a result I don’t believe the data from this section should not be relied upon for other sections.
Section 2 – Find the directory entry fields where size data is stored
The client will create a file, all the directory entries will be modified. The driver will record all the data from the directory entry in this step. The driver will also record the sector of the directory entry, this will allow us to focus only on that sector in the next steps. The client, after sleeping for a few seconds, will then modify the file without changing its size this should cause only time related fields to change. The driver will record this new data and compare it to the data from the previous step. Any bytes that differ will be time bytes. The offset from the start of the filename (here called the ‘field’) will be added to the time list. The client will then modify the file again, for good measure, and the driver will once again do the same, looking for any differences between the two sectors and adding them to the time list. This particular step is probably just redundant. The client will then append a small amount of data to the file. This will cause the time and size fields in the directory entry to be modified. Since the time fields are already known, any fields not in that list will be added to the size list. The client will then append data to the file until the disk is full, this will once again cause the time and size fields to be modified, however, in this step all bytes of the size field should be modified, as opposed to the previous step where only one byte would be modified. The driver will once again ignore any time fields and add any different bytes to the size field list.
Issues:
If we knew the size of each directory entry we could more easily calculate the field as an offset within an entry. Instead we must use the filename as an anchor so that we have a way to find the fields regardless of the directory entry location within a sector.
There is case where not all of the size fields are found. That is if while appending the file a sync occurs when the files size in hex is 8, and another sync does not occur until its size is 80000008. In this case the first and last bytes of the field will be found, but not the inner two. One way around this would be to append certain sizes (or use lseek for that matter) and syncing to guarantee every byte is identified.
The time field identification is not ideal. This method easily identifies the low byte of the field since a second or so passes between each write. However, finding higher bytes such as minutes, hours, days, and years is more troublesome. The FAT file system has separate entries for time and date, thus they may be located at opposite ends of the entry and need not be contiguous. We cannot even assume a size for these fields. Somehow a way must be found to have every byte modified. Perhaps the system clock could be altered between writes so as to force a time and date change.
Another issue with time is that linux does not seem to store the creation time in the creation time field (at least in FAT16, I’m not sure about FAT32). Instead it stores the change time. This is probably acceptable since those fields are not defined for FAT16 while they are defined for FAT32.
This section does not seem to work properly with FAT32 as many fields which are not in the directory entry are added to the size field list. I am not sure why this is happening.
Section 3 – Find the starting cluster field in the directory entry and the bits which indicate if a entry is a directory or not.
The client will create a directory, this will cause the directory block to be modified, and the new directory block to be initialized. The driver will copy all of the data not a fat or data block into an array (dir_sector_array_). Also store the sector the data was from. This should contain all of the directory data traffic from the directory creation. The client will then create another directory in the root, with a different name. The driver will once again copy all of the data not a FAT or data block into an array (dir2_sector_array_). It will contain all the data from the second directory creation. The client will then create a file in the first directory. This will cause the first directory sector to be modified so that the new file can be added. The driver will look for the name of the file the client created in the sectors it sees. When it finds it the driver will compare the sectors stored in dir2_sector_array_ and dir_sector_array_ that correspond to the sector the new file’s directory entry was placed in. The driver will add any differences it finds to the dir_diff_sector_list_ list so that we can latter tell if the differences between sectors are due to the files or to simply to the directories the files are in being different. At this point the driver will also make a copy of the data (which should contain the file’s directory entry). The client will then create another file with the same name in the second directory (this is done so that the differences between files are minimized). The driver will once again look for the name of the file created by the client. It will then compare the data being written in this step (the second file in the second directory) to that of the previous step (the first file in the first directory). If there are any differences it will check to see if the offset is on dir_diff_sector_list_. If it is that indicates that the difference was present before the files were created, and it can be ignored. It will also check to see if the difference is in a known field, and if it is will ignore it. However, if the difference is not in a known field it will be added to the cluster_field_list_ as the only difference between the two files at this point should be the starting cluster number.
A similar process will be used to find the directory bits. The client will create a file in the first directory. The driver will look for the filename in the data. If it finds the name it will make a copy of the data since that sector should be the sector which contains the directory entry. The client will then create a directory in the second directory with the same name as the file it created in the first directory. Only the directory bits should differ between the two files. The driver will look for the directory name in the data. If it finds it the driver will compare the data with the data from the previous step. Any offsets that are in dir_diff_sector_list_ and any fields which are known will be ignored. The result should be one byte which differs between the directory and the file. The bytes will then be compared bit by bit, with any bits that differ being added to the dir_bit_list_. Once it has done this the bits which distinguish between a file and a directory will be known.
Issues:
Currently the sector field that is found may only be one of the bytes of the field. To make sure we have all the fields the first file created should probably be large so as to force a large difference between its first sector field and the first sector field of the second file.