THEX/TTH, queue cleaning and file management

Archived discussion about features (predating the use of Bugzilla as a bug and feature tracker)

WollyHood
Posts: 1
Joined: 2004-10-01 22:01
Location: Urth

Post by WollyHood » 2004-10-03 21:44

I wrote this post up and then realized there was a Bugzilla for feature requests. So what I am going to do is post this first and wait a day to see if anyone has useful comments or insights that I can use to moderate my request submissions. Tomorrow night I will submit each request separately with whatever modifications are needed to have it make sense. Since some of the requests may be functionally similar to others I now see there, I will use my best judgement before submitting.

Oh, and I love TTH. Best thing since sliced bread.

Also, I thought up some other functions that would be nice. I will post those later and separately, in the interest of brevity.

Wow, all that writing and I still haven't even started my actual post. How do I turn off verbose output on myself?

I have been trying to clean up my queue recently. Over time my queue has grown rather large, which I am certain is a symptom unique to me among DC++ users (cough cough). My first thought was to search through the queue and remove duplicate or otherwise unwanted files. That's when I first realized that I was going to have some trouble with my self-appointed task.

The short of it is this: I would humbly request file management features. The ideal solution would be the ability to map the queue as a virtual drive and then manage it with standard file managers (think WebDAV/Davenport or mounting a virtual device in Linux). But ideals are not often practical solutions, so I will describe some informal use case examples and sketch out some (simple?/easy?) feature requests to address them.

1. Searching the Queue
My first idea was to search the queue to find files to remove. I open up the Download Queue tab and ... no search function. Small problem, that.

Since there was no search available I went to my handy dandy text editor and found I could search the queue.xml easily enough. Moving between the editor and DC++ I could make some simple modifications, but I found it very time-consuming. So I decided to modify the queue.xml file directly, but found additional issues there.

XML parsers generally treat whitespace between elements as insignificant, so each record can be written across multiple lines to format the data structures for easy reading by human eyes. Or something. While that is a great feature of XML, it made a search and replace/delete of whole XML records difficult to accomplish in the text editor without creating a malformed XML file. Not insurmountable, but worth mentioning as a difficulty. The real problem was with what happened after editing.

The queue.xml file appears to be read by DC++ only at startup and is subsequently written to as the queue is modified. Any changes I make to the queue.xml while DC++ is running would be overwritten by the blind writes. Making changes to queue.xml would have to occur after I had exited from DC++ and all final commits had been written by it to the file. Once that happened I could modify the file and start DC++ to load changes. (I won't go into using a copy of queue.xml while DC++ is running as the problems inherent in that are obvious.) Stopping and starting DC++ whenever I wanted to edit the Download Queue seemed a bit inelegant a solution so, except for major revisions, I think I will avoid that as a fix.

Any way I go about this searching seems a bit of a kludge.

So the requests here are:
A: A search function under the Download Queue tab. The search should have the same elements as are found in the dialog box for adding ADL Searches.
B: A filter function similar to ADL Search for the Download Queue, if it could retain the directory structure.
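
Until such a search exists, the offline editing described above (with DC++ shut down) can be done with an XML parser instead of a text editor, which removes whole records and cannot leave the file malformed. The following is only a minimal sketch: the `<Download>` element name and its `Target` attribute are assumptions about the queue.xml layout that this thread does not confirm, and `search_queue` and `remove_matching` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout (not confirmed in this thread): one <Download> element per
# queued file, with the destination path in a Target attribute.
SAMPLE = """<Downloads>
  <Download Target="music/song.mp3" Size="4000" TTH="AAAA"/>
  <Download Target="docs/notes.txt" Size="120" TTH="BBBB"/>
</Downloads>"""


def search_queue(root, pattern, tag="Download", attr="Target"):
    """Return queue entries whose target path matches a regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [e for e in root.iter(tag) if rx.search(e.get(attr, ""))]


def remove_matching(root, pattern, tag="Download", attr="Target"):
    """Remove whole matching records and return how many were removed."""
    rx = re.compile(pattern, re.IGNORECASE)
    removed = 0
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag and rx.search(child.get(attr, "")):
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
hits = [e.get("Target") for e in search_queue(root, r"\.mp3$")]
removed = remove_matching(root, r"\.mp3$")
remaining = [e.get("Target") for e in root.iter("Download")]
cleaned = ET.tostring(root, encoding="unicode")  # still well-formed XML
```

Against a real file one would use `ET.parse(path)` and `tree.write(path)` in place of the in-memory sample, and only while DC++ is not running, for the reasons given above.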

2. Looking for Duplicates within the Download Queue tab
Since searching and modifying the queue externally is impractical I turned my attention to hunting via the browsable interface. The easiest thing to do was go into directories containing many files and sort according to a variety of attributes (name, size, TTH) to find exact or probable duplicate files. This helped and was an easy quick fix, but the limitations were apparent.

As I could only filter within a single directory I could not find duplicates when they existed outside of the directory of the original file. Unfortunately this appears to be the majority case for duplicate files.

While the TTH sort made it very easy to spot exact duplicates, even small changes to a file make it impossible to use TTH. By small changes I mean changes which do not affect the function of a file but do change its size or signature values. Examples would be edited ID3 or IPTC metadata, or information files added to compressed archives.

The requests here are:
A: An automated duplicate file finder to sort and compare files by multiple attributes such as filename, file size and (especially) TTH. It should handle duplicates discovery automatically and display them for the user to make a final decision on disposition.
B: As in example 1, request B: a filter function similar to ADL Search for the Download Queue. However, it should display the filtered files as if they were in a single directory so that sorting can be done by hand to find similar but not duplicate files (i.e. where ID3 or other metadata is changed but the data load is the same).
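
As a rough illustration of request A, a duplicate finder could group queue entries across all directories: entries sharing a TTH are exact duplicates, while entries sharing a file name with different TTHs and nearly equal sizes are only probable duplicates left for the user to judge. This is a sketch under assumptions: the `Download`, `Target`, `Size` and `TTH` names are guesses at the queue.xml layout, the helper names are hypothetical, and the 2% size tolerance is an arbitrary illustrative choice.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed queue.xml layout; the element and attribute names are guesses.
SAMPLE = """<Downloads>
  <Download Target="a/song.mp3" Size="4000" TTH="AAAA"/>
  <Download Target="b/copy.mp3" Size="4000" TTH="AAAA"/>
  <Download Target="c/Song.mp3" Size="4040" TTH="CCCC"/>
  <Download Target="d/other.zip" Size="999" TTH="DDDD"/>
</Downloads>"""


def load_entries(xml_text, tag="Download"):
    """Return (target, size, tth) tuples for every queued file."""
    root = ET.fromstring(xml_text)
    return [(e.get("Target", ""), int(e.get("Size", "0")), e.get("TTH", ""))
            for e in root.iter(tag)]


def exact_duplicates(entries):
    """Group targets by TTH; any group larger than one is an exact duplicate."""
    by_tth = defaultdict(list)
    for target, _size, tth in entries:
        if tth:
            by_tth[tth].append(target)
    return {t: sorted(p) for t, p in by_tth.items() if len(p) > 1}


def probable_duplicates(entries, tolerance=0.02):
    """Same file name (case-insensitive), different TTH, sizes within tolerance."""
    by_name = defaultdict(list)
    for target, size, tth in entries:
        name = target.replace("\\", "/").rsplit("/", 1)[-1].lower()
        by_name[name].append((target, size, tth))
    result = {}
    for name, group in by_name.items():
        sizes = [size for _t, size, _h in group]
        distinct_hashes = {h for _t, _s, h in group}
        if len(distinct_hashes) > 1 and max(sizes) - min(sizes) <= tolerance * max(sizes):
            result[name] = sorted(t for t, _s, _h in group)
    return result


entries = load_entries(SAMPLE)
exact = exact_duplicates(entries)
probable = probable_duplicates(entries)
```

Keeping the two groupings separate matches the request: exact matches could be removed almost mechanically, while the probable list is only a shortlist for a final decision by hand.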

3. Looking for Duplicates from Outside the Download Queue tab
Then it dawned on me that I would miss a whole class of duplicate files if I focused only on those found in the Download Queue. The class I thought of was those files which I currently had or had already discarded. Two sources for files to be removed from the Download Queue sprang immediately to mind: those files that I had downloaded and those files which were on my system.

Since I knew the search and filter functions available weren't going to meet my needs I decided to do a quick script to match TTH values from the two sources I noted against queue.xml (after exiting DC++). Two things became apparent after a little searching. First, there are no TTH values in the Download.log file. Second, there are no easily discovered ways to gather TTH values for local files. I did find a way around the second problem, kinda.

I looked around for integrity checkers that I could use to create file hashes, and none seemed to come up with a good test hash. Looking into it I realized that I didn't need a simple Tiger hash, but one that did a THEX (apparently a Tiger/1024 hash would do just as well). I asked Gojomo and found out that Bitzi would return the needed hash value, but for this use should have submission to the Bitzi DB suppressed (bitcollider -p). Yay! This is still a problem, as bitcollider is suboptimal for scanning large numbers of files the way integrity checkers do (fsum, md5sum, tripwire, etc.).
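
The reason a plain Tiger hash does not match is the tree construction. As far as the THEX layout goes, the file is cut into 1024-byte segments, each leaf is the hash of a 0x00 byte followed by the segment, each internal node is the hash of a 0x01 byte followed by its two children, and an unpaired node is promoted unchanged to the next level. The sketch below shows only that structure. Python's standard library is not known to provide Tiger, so SHA-1 is used here purely as a stand-in and the resulting values are NOT TTH values; a real Tiger function would have to be supplied as `hash_fn`. The function names are hypothetical.

```python
import base64
import hashlib

SEGMENT = 1024  # leaf segment size in bytes


def tree_root(data, hash_fn):
    """Root of a THEX-style hash tree over `data` using the given hash function."""
    # An empty input still has one (empty) leaf.
    segments = [data[i:i + SEGMENT] for i in range(0, len(data), SEGMENT)] or [b""]
    # Leaves are prefixed with 0x00 before hashing.
    level = [hash_fn(b"\x00" + s) for s in segments]
    while len(level) > 1:
        # Internal nodes hash 0x01 + left child + right child.
        nxt = [hash_fn(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired node is promoted as-is
        level = nxt
    return level[0]


def to_base32(digest):
    """Base32 text form without padding (a 24-byte digest gives 39 characters)."""
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def sha1(data):
    """Stand-in hash only; substitute a Tiger implementation for real TTH values."""
    return hashlib.sha1(data).digest()


root_text = to_base32(tree_root(b"x" * 3000, sha1))
```

Because the hash function is a parameter, the same tree code would work unchanged once a Tiger implementation is available, which is the part a command line tool for scanning many files would need.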

Also, to deal with querying and editing the XML I found a tool called XMLStarlet that looks ideal, so far.

As I prepare to put some time into this I thought I would write up these requests. I think most of the function I am requesting already exists in DC++ and could be adapted to the particular tasks I mentioned.

The requests are:
A: First and foremost, please add a TTH argument to the download.log configuration format. This one little change would make a world of difference when clearing detritus from the Download Queue, either by additional functions in future DC++ versions or with external modification of the queue.xml. Another option would be to have an alternative download log file that could be written as an XML file for easier parsing.
B: Allow creation of TTH hashes of any files on a computer, not just those shared. It would seem easiest to add a File Management tab with the same range of functions as the Sharing area in configuration. If you implement it this way please let the hashes be written to an XML file accessible by external applications. (But an external implementation of a command line THEX/TTH tool would be a powerful tool for many uses, even outside of DC++.)
C: Add the ability to query TTH values from the Download Queue against TTH values in the Download.log file and those files stored locally on the computer/listed in the File Management area (or from a user maintained XML file of file names, sizes and TTH values).
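
To make requests A and C concrete, here is a rough sketch of the matching script described above, assuming the download log did contain TTH values. A TTH in Base32 is 39 characters drawn from A-Z and 2-7, so the values can be pulled out of a log line without knowing the exact log format. The log line shown, the queue.xml element and attribute names, and the function names are all illustrative assumptions, and the TTH values are fake.

```python
import re
import xml.etree.ElementTree as ET

# A Base32 TTH: exactly 39 characters from A-Z and 2-7, not part of a longer run.
TTH_RE = re.compile(r"(?<![A-Z2-7])[A-Z2-7]{39}(?![A-Z2-7])")

T1 = "A" * 39  # fake TTH values, for illustration only
T2 = "B" * 39

# Hypothetical log line; the real format is whatever the user configures.
LOG = "2004-10-04 16:20: a/song.mp3 downloaded, 4000 bytes, TTH " + T1 + "\n"

# Assumed queue.xml layout; the element and attribute names are guesses.
QUEUE = (
    "<Downloads>"
    '<Download Target="a/song.mp3" Size="4000" TTH="' + T1 + '"/>'
    '<Download Target="b/new.zip" Size="999" TTH="' + T2 + '"/>'
    "</Downloads>"
)


def tths_in_log(log_text):
    """Collect every TTH-shaped token found in the log text."""
    return set(TTH_RE.findall(log_text))


def already_have(queue_xml, known, tag="Download"):
    """Return queued targets whose TTH is in the set of known TTH values."""
    root = ET.fromstring(queue_xml)
    return sorted(e.get("Target", "") for e in root.iter(tag)
                  if e.get("TTH") in known)


known = tths_in_log(LOG)
stale = already_have(QUEUE, known)
```

The same `known` set could be extended with TTH values for local files from a user-maintained list, which is the other source mentioned in request C.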

Would asking for the inclusion of a complete file manager/explorer type interface be too much? It would be nice. :-) And for Christmas I would like ...

If you have any questions about any of these requests or want to chat about design philosophy, or whatever, please feel free to post, email or IM me. When I finish doing whatever it is I end up doing I will try to get it into a final version for others to use and make it available.

Dan

.s
/s
.damnit!

TheParanoidOne
Forum Moderator
Posts: 1420
Joined: 2003-04-22 14:37

Post by TheParanoidOne » 2004-10-04 00:33

Your coherent thoughts and effort are a breath of fresh air. Kudos to you. :)

There is only one of your requests that already exists in DC++ and that is 2b. In the lower left corner of the download queue you will find a checkbox. That will toggle the queue between tree view and flat file view, allowing you to do sorts on any criteria you like.

For the other requests, I agree that most of what you describe is not available, but I would like to point out the following:

DC++ has the ability (as you are no doubt aware) to open other people's open file lists. You can use this same feature to open up your own file list. This gives you access to the ADL search filters. These are not dynamic though, and would require re-opening the list each time you change a filter. This will also give you access to the Find feature. This will only let you search on file name though, not TTH.

Bugzilla Bug 97: filter for the download queue - As you can see, this has already been suggested. The request however isn't the most clearly written thing in the world and I would be happy to mark it as a duplicate if you were to submit a clearer request.

Bugzilla Bug 29: blacklist for download sources - You can add your comments here, and vote for it.

Bugzilla Bug 69: Don't queue a file if a one with a matching TTH is already shared - Ditto.

WollyHood wrote:Would asking for the inclusion of a complete file manager/explorer type interface be too much? It would be nice.

As far as I am aware, something similar is on a ToDo list somewhere.

That's the bulk of my comments. I have other minor comments but they can wait till later.
The world is coming to an end. Please log off.

Sedulus
Forum Moderator
Posts: 687
Joined: 2003-01-04 09:32

Post by Sedulus » 2004-10-04 02:30

http://dc.selwerd.nl/hublist.xml.bz2
http://www.b.ali.btinternet.co.uk/DCPlusPlus/index.html (TheParanoidOne's DC++ Guide)
http://www.dslreports.com/faq/dc (BSOD2600's Direct Connect FAQ)

PseudonympH
Forum Moderator
Posts: 366
Joined: 2004-03-06 02:46

Post by PseudonympH » 2004-10-04 15:24

3A: (This should work, but I don't have MSVC++ so I can't compile/test it.)

diff -uNr dcplusplus.orig/client/DownloadManager.cpp dcplusplus/client/DownloadManager.cpp
--- dcplusplus.orig/client/DownloadManager.cpp  2004-10-04 16:13:33.046875000 -0400
+++ dcplusplus/client/DownloadManager.cpp       2004-10-04 16:17:52.812500000 -0400
@@ -722,6 +722,7 @@
                params["speed"] = Util::formatBytes(d->getAverageSpeed()) + "/s";
                params["time"] = Util::formatSeconds((GET_TICK() - d->getStart()) / 1000);
                params["sfv"] = Util::toString(d->isSet(Download::FLAG_CRC32_OK) ? 1 : 0);
+               params["tth"] = d->getTTH()->toBase32();
                LOG(DOWNLOAD_AREA, Util::formatParams(SETTING(LOG_FORMAT_POST_DOWNLOAD), params));
        }

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2004-10-06 11:16

PseudonympH wrote:3A: (This should work, but I don't have MSVC++ so I can't compile/test it.)

I submitted this, along with an upload equivalent and documentation changes, to arne.
