Finding Which File to Extract from a Tar Backup

A primary reason I have so few free kilobytes of internal memory on my Zaurus is that I want easy access to all the files and email I have written or downloaded, no matter which SD card I happen to be using, and whether or not I am using my CF modem at the moment. I do not want to putz around hunting through backups and then extracting files; I want them quickly and easily.

But I have gotten very tired of the lack of space, and started wondering if I could find a way to quickly and easily locate old files I need and extract them. I have not found a perfect solution yet, one good enough to make me feel comfortable erasing a bunch of files I might need access to on a moment's notice, but I thought I would share what I have done so far, since I have found some techniques that are quite helpful when I need to examine the contents of a backup file.

Yes, theoretically I should be able to put these files on my main SD card, but that is also fairly full. Since the card is FAT16, space is allocated in 16k increments (a 1k file uses 16k bytes on the card, a 14k file also uses 16k, and a 17k file uses 32k, for example), so loose files would waste a lot of space I would rather use other ways, unless I put everything in big tar files, and there we are again, looking at backups. The card is full of other things; I have already been moving my old backups off it onto a CF card so I can use the space for research data I am gathering.
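The cluster arithmetic behind those numbers can be sketched like this (assuming a 16k allocation unit, as on my card; on-disk usage is the file size rounded up to the next whole cluster):

```shell
# Round file sizes up to the next 16k cluster (the FAT16 allocation
# unit assumed here): on-disk usage = ceil(size / 16384) * 16384.
cluster=16384
for size in 1024 14336 17408; do      # a 1k, a 14k, and a 17k file
    used=$(( ((size + cluster - 1) / cluster) * cluster ))
    echo "$size bytes occupies $used bytes on the card"
done
```

This prints 16384 for both the 1k and 14k files, and 32768 for the 17k file, matching the figures above.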

So, I asked a Linux pal what he would do, and he suggested extracting all the files on the backup to "stdout" and grepping those resultant virtual files, but I want to try identifying the file first with "grep," and then extract only the file or files I definitely need. Using "stdout" might work if space were not an issue; I can think of a slick way to do it, but I do not have that kind of extra space. If I did, I would not be writing this page!

Okay, so here is what I am doing: I am looking at the backup file with "grep," having it extract and number all the lines that contain the filename of the file being examined, and all the lines that contain my keyword or phrase. Then I pipe those results through "grep" again, having it select just the lines pertaining to files containing my keyword(s), along with the lines containing the keyword(s) themselves.

I have used "-wn" to tell grep to match only whole words (which should make it run more quickly) and to number the lines. I am using "ustar" as a keyword because it appears in the tar header of every file in my backups, so it is an easy way to make grep find the lines containing the file names. The three backslashes in a row, just before the pipe, are what the shell needs in order to pass "\|" through to grep; in grep's basic regular expressions, "\|" means "match either pattern," so grep finds all lines containing either my keyword or the header. The number of backslashes required might be different if you put this code into a script, but this is what works for me on the command line.

bash-2.05# grep -wn "Got your email"\\\|ustar mc/2005-12-18-21-39.backup| grep -B1 "Got your email"

60562:00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000home/root/Applications/qtmail/outbox.txt0000000000000000000000000000000000000000000000000000000000000100755?0000000?0000000?00001477576?10351417176?017531? 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ustar ?root0000000000000000000000000000root000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000?newmailid = 1008
60886:subject = Got your email
--
119007:00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000home/root/Applications/qtmail/ooutbox1205bad.txt00000000000000000000000000000000000000000000000000000100755?0000000?0000000?00001000607?10345076665?020647? 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ustar ?root0000000000000000000000000000root000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000?newmailid = 1008
119331:subject = Got your email
--
132595:0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000?home/root/Applications/qtmail/CPoutbox1214.txt0000000000000000000000000000000000000000000000000000000100755?0000000?0000000?00001500561?10350046524?020233? 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ustar ?root0000000000000000000000000000root000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000?newmailid = 1008
132919:subject = Got your email
--
160843:0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000home/root/Applications/qtmail/Copy of outbox.txt00000000000000000000000000000000000000000000000000000100755?0000000?0000000?00001501070?10351237505?020756? 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ustar ?root0000000000000000000000000000root000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000?newmailid = 1008
161167:subject = Got your email
bash-2.05#
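Incidentally, the triple-backslash dance is only needed because the pattern mixes double quotes with a bare pipe; inside single quotes you can write the alternation directly as "\|". Here is a throwaway demo of the same header-plus-keyword trick (made-up file names under /tmp, and grep's "-a" flag, where available, to make it treat the binary archive as text):

```shell
# Build a tiny throwaway archive standing in for a real backup.
mkdir -p /tmp/tardemo/home
printf 'newmailid = 1\nsubject = Got your email\ndate = today\n' \
    > /tmp/tardemo/home/outbox.txt
printf 'nothing relevant here\n' > /tmp/tardemo/home/other.txt
tar -cf /tmp/tardemo/demo.backup -C /tmp/tardemo home

# Single quotes let the basic-regex alternation \| pass through as-is;
# the second grep's -B1 pulls in the "ustar" header naming the file.
grep -an 'Got your email\|ustar' /tmp/tardemo/demo.backup |
    grep -aB1 'Got your email'
```

The output shows the header line containing "outbox.txt" immediately above the matching "subject" line, just as in the real session above.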

I also tried using grep to examine more than one line of each file before deciding which one to extract, but that did not work. When I added extra lines of context to the first grep with the "-A" option, the second grep no longer showed me the file names. The following did not give me the file names:

bash-2.05# grep -wn -A1 "Got your email"\\\|ustar mc/2005-12-18-21-39.backup| grep -B1 -A1 "Got your email"
60886:subject = Got your email
60887-date = Sun Nov 13 2005
--
--
119331:subject = Got your email
119332-date = Sun Nov 13 2005
--
--
132919:subject = Got your email
132920-date = Sun Nov 13 2005
--
--
161167:subject = Got your email
161168-date = Sun Nov 13 2005
bash-2.05#

But that makes sense. I know from the earlier example that the header for the file of interest is on line 60562 of the backup, and both examples show that the line I am interested in is line 60886. Simple math with the "expr" command tells me the line number of my keyword or phrase within the file of interest. Yes, I could have used the calculator on my Zaurus, but it was much faster and easier to just paste the numbers into a command:

bash-2.05# expr 60886 - 60562
324
bash-2.05#

This tells me that adding "-A1" to the first grep did not work because my keyword phrase is the 324th line of the file of interest, far beyond a one-line context window. So, if you want to examine more than one line of a file before deciding whether to extract it, I suggest piping the results of the first grep through "less" or "more" to read them.
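Another option I did not use here: once "grep -n" has reported the header and keyword line numbers, "sed -n 'M,Np'" can print just that slice of the backup for reading. A throwaway sketch (the demo file and line numbers are made up; with a real backup you would use the numbers grep reported, such as 60562 and 60886 above):

```shell
# Stand-in for a backup file: 100 numbered lines.
seq 1 100 > /tmp/slicedemo.txt

# Print only lines 42 through 45, the way you would print the span
# between a tar header line and a keyword match.
sed -n '42,45p' /tmp/slicedemo.txt
```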

It would take some fairly complex code to completely mechanize this type of search for a file in a backup, and I have not decided how I want to approach the problem. I think I would have to use a tool like sed to select the lines I want, and I do not think it will be easy to get the filename, along with the content, without a lot more work. So, for now, I will not be erasing very many files, since I have not yet come up with an easy, automated method of retrieving exactly the files I want.
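That said, a rough sketch of the automation idea is possible with awk instead of sed: strip the NUL padding so each tar header lands on one text line, remember the most recent "ustar" header (which carries the member's path), and print it next to every keyword hit. This is only a sketch against a throwaway archive with made-up paths, not a polished tool:

```shell
# Throwaway demo archive standing in for a real backup.
mkdir -p /tmp/awkdemo/home
printf 'newmailid = 1\nsubject = Got your email\n' \
    > /tmp/awkdemo/home/outbox.txt
printf 'nothing relevant here\n' > /tmp/awkdemo/home/other.txt
tar -cf /tmp/awkdemo/demo.backup -C /tmp/awkdemo home

# tr deletes the NUL padding so each header joins one text line; awk
# remembers the latest header and prints it for each keyword match.
# (Caveat: a keyword on the same line as a header would be skipped
# by the "next".)
tr -d '\0' < /tmp/awkdemo/demo.backup |
awk '/ustar/          { header = $0; next }
     /Got your email/ { print "header: " header
                        print "match:  " $0 }'
```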

In the meantime, I do know that whenever I erase files from internal memory, it will be easy to retrieve any of them that I later decide I want from the tar backup files. At that point, I can use the method described in detail in my post about how to extract a single file from a tar backup.
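The extraction step itself boils down to naming the exact member path on the tar command line. A throwaway sketch with made-up paths (with a real backup, you would pass the member path exactly as "tar -tf" lists it):

```shell
# Throwaway archive standing in for a real backup.
mkdir -p /tmp/xdemo/home/root /tmp/xdemo/out
echo 'keep me' > /tmp/xdemo/home/root/wanted.txt
echo 'skip me' > /tmp/xdemo/home/root/ignored.txt
tar -cf /tmp/xdemo/backup.tar -C /tmp/xdemo home

# Extract only the one member, into /tmp/xdemo/out; everything else
# in the archive is left alone.
tar -xf /tmp/xdemo/backup.tar -C /tmp/xdemo/out home/root/wanted.txt
```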

By the way, if you just want to find a few lines, and do not care which file they are in, you can do something like the following, where I have tar feed grep the contents of all the files in my tkcMail "cur" subdirectories. I limited the output to the first 30 lines with the "head" command. (NOTE: I have modified the server numbers shown in the example below for security reasons.)

bash-2.05# tar -xOf /mnt/cf/2008-04-27-20-39.backup home/root/tkcMail/*/cur/* | grep -B1 -A1 server| head -n30
> > bash-2.05# less /etc/resolv.conf
> > nameserver 69.16.159.116
> > nameserver 66.71.1.254
> > bash-2.05# ping 69.16.159.116
--
> pages and then upload them to my
> server. On the homepage I have
> several links to different pages. No
--
> up dead stuff, so to speak. Now,
> with the code that the server adds, I
> get a gazillion validation errors.
--
Received: from mx2.internal (mx2.internal [10.202.2.201])
by server1.messagingengine.com (Cyrus v2.3-alpha) with LMTPA;
Tue, 07 Feb 2006 01:42:36 -0500
--
Received: from mx2.internal (mx2.internal [10.202.2.201])
by server1.messagingengine.com (Cyrus v2.3-alpha) with LMTPA;
Wed, 08 Feb 2006 14:21:05 -0500
--
> Browsing the web, I run out of memory? Same thing with retreiving my
> email -- even with attachments sitting on the server.

--
use it
with my IMAP server but it wants to pull down headers from *every* e-mail I
have stored on the server and thus crashing with out-of-memory errors. That
bash-2.05#

If you want to examine more lines before the match, change "-B1" to "-B2" or whatever number you want, and if you want to see more lines after it, change "-A1" to "-A2" or whatever number you want.

If you want to view all the results, you can use the "more" or "less" commands. Here is the syntax for "more":

tar -xOf exact-path-to-tarfile exact-path-list-of-files-to-examine | grep -B1 -A1 your-keyword-phrase | more

But if there are many lines or files, and you need to examine more than that one line, I recommend using "less," along with asking grep for additional lines of context before and after the line containing the keyword or phrase, as in:

tar -xOf exact-path-to-tarfile exact-path-list-of-files-to-examine | grep -B1 -A1 your-keyword-phrase | less

When you pipe the results through "less," you can search back and forth through them, instead of painfully paging forward with the space key and scrolling backwards with your stylus, reading every line. You can press "g" to go to the first line, "G" to go to the last line, "w" to go back one window, or type a slash followed by a keyword ("/keyword") to search for it in the output. Searches in "less" do not wrap around, so be sure to start from the beginning of the output (press "g" to get back there).

The "less" command is not built in to Sharp ROMs, but you can find out more about it on my "less" command page which is here.