Skip to content

MattHicks.com

Programming on the Edge

Menu
Menu

Multithreaded Unzip?

Posted on January 14, 2008 by mhicks

A co-worker and I were discussing the possibilities of performance gain doing extraction of ZIP files using multiple threads rather than the typically single-threading extraction that the majority (if not all) mainstream archive extraction utilities use and given Java’s great ZIP file support built-in it seemed rather trivial to give this a shot, thus, UberZip was born.

UberZip is my simple little sample Java command-line application to extract files where you can specify the number of threads to utilize during extraction. My #1 biggest headache is extracting Eclipse from the ZIP file you get, so I decided that would be an ideal test of my little program. I downloaded the Eclipse J2EE bundle (3.1.1), which is a happy 132 meg with thousands of files.

I need to test further with a machine that actually can better utilize multiple threads, but here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded, so I actually get a very slight benefit) running on Windows XP:

14.7 seconds with 1 thread

13.4 seconds with 5 threads

11.6 seconds with 30 threads

Anything above 30 threads seems to actually create more overhead than gain.

Now, to be fair I matched this up against the fastest unzipper I am aware of, 7zip (http://www.7-zip.org). It took right at about 14 seconds to unzip the file inside 7zip, which seems to fall nearly perfectly in-line with the single-threaded execution of UberZip.

For those of you that would be interested in taking a look at this very simple example, I have committed the source to my public repository:

svn://captiveimagination.com/public/uberzip/trunk

Further, if you’d simply rather download the JAR or EXE I have uploaded copies of it for those of you that would like to unzip files “uber” fast. 🙂

uberzip.exe

uberzip.jar

After taking this to try out on my work machine (Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows) I got the following stats:

7zip – 25 seconds
UberZip (1 thread) – 22.17 seconds
UberZip (5 threads) – 19.5 seconds
UberZip (30 threads) – 21.6 seconds

Oddly it would seem about 5 threads is the “sweet-spot” for this machine. Though some of the results are a bit strange and it would seem on this machine the hard drives don’t perform quite as well as my home machine it is obvious that at some level of configuration multithreading you can gain some good performance on the single-threaded applications out there.

Perhaps someone else will find this information useful and turn UberZip into a product people can use. 😉

Update (2017.05.31)

I’ve completely re-written this functionality in Scala and posted it on GitHub: https://github.com/outr/uberzip.  It’s faster than ever and more cleanly written.  Tested on the latest Eclipse ZIP (320 meg) it can unzip 2,981 files in 0.73 seconds. Doing the same test on the same machine with Linux unzip took 1.7 seconds.

15 thoughts on “Multithreaded Unzip?”

  1. David Rudder says:
    October 29, 2008 at 1:52 pm

    The SVN seems to be down. I get a 404. I’d love to comment on the code. Can you email me a zip (haha) of the code to [email protected]? At your convenience, of course!

    Have you benchmarked the code when the zip file is hosted on a shared drive? Off a local disk, the bottleneck will largely be disk I/O speed, which really doesn’t benefit much from threading. But, in theory, thread-aware network file sharing, like NFS and SMB, should perform *much* better when threaded. In theory, of course.

    Reply
  2. Jacob says:
    December 13, 2012 at 7:55 am

    I just wanted to let your know that UberZip is still impressive almost 5 years later. I just tried it on an i7-2720QM with a Samsung 830 SSD (using PGP full-disk encryption).

    7zip: ~25 seconds
    Windows 7: ~57 seconds
    UberZip(1 thread): 22.4 seconds
    UberZip(8 threads): 6.7 seconds
    UberZip(16 threads): 5.9 seconds

    Source file was a 400 MB zip of RTC client.

    Reply
  3. Petteri says:
    January 24, 2013 at 6:54 am

    > here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded […])
    I have no idea what were you on when you wrote this.

    >Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows […]
    > Oddly it would seem about 5 threads is the "sweet-spot" for this machine
    It's not odd. It's natural. Server boards have slower memory but as you mentioned it's probably your hard drive that's the bottle neck there.

    The logic behind n+1 threads (number of cores+1) is to better utilize the CPU by always having something for it to do. Depending on the CPU it might also work better with just n or n+2.

    Reply
  4. Matt Hicks says:
    January 24, 2013 at 2:38 pm

    Jacob, good to know that even though this is code that's a few years old it still performs well. 🙂

    Petteri, thanks for the feedback regarding CPUs.

    Reply
  5. drinehart says:
    July 3, 2013 at 9:28 pm

    Does uberzip also support password-protected zip files?

    Duane

    Reply
  6. Matt Hicks says:
    July 3, 2013 at 10:41 pm

    Duane, no, unfortunately not. UberZip was meant as a prototype and I never expected it to be as fast as it is.

    Reply
  7. Anonymous says:
    November 27, 2013 at 6:31 am

    I'm playing around with UberZip on an c3.8xlarge Windows 2008R2 instance on Amazon Web Services (32 vCPU's Intel Xeon E5-2680).

    Unzipping a single file containing 136,012 unique files (20.3Gb uncompressed; 1.7Gb zipped).

    7Zip: 10 mins 8.2 secs.
    UberZip (16 threads): 2 mins 22.5 secs.
    UberZip (32 threads): 2 mins 7.9 secs.
    UberZip (64 threads): 2 mins 19.9 secs.

    I also verified the unzipped content to check for corruption. It was perfect!

    Well done.

    Reply
  8. Matt Hicks says:
    November 27, 2013 at 1:47 pm

    Nick, That's some pretty impressive results. I've been thinking a lot recently of resurrecting this project. I wrote this so long ago I'm sure there are some significant optimizations I could make to make it run even faster. I just recently created a project for multi-threaded backup (https://github.com/outr/outrbackup). It is significantly faster than any other backup solution I've seen to-date if you're interested in such a thing.

    Reply
  9. Anonymous says:
    November 27, 2013 at 11:43 pm

    Hi Matt,

    I'm sure there is a real opportunity for a small command-line high-performance unzip utility. It would be good to remove the requirement to have a JRE installed too.

    -Nick

    Reply
  10. D L says:
    May 18, 2016 at 12:43 am

    Thanks — much faster than 7-Zip for a second drive on my machine. Do you still have the svn posted? I'd like to add messages to the output so I can see the progress, and the time steps take.

    Reply
  11. Matt Hicks says:
    May 18, 2016 at 3:00 am

    D L, just for you I dusted off the archives and found this project, converted the source code to Scala (since Java is just so darn ugly), made a couple of minor optimizations, and created a GitHub project: https://github.com/outr/uberzip

    It should be a little faster than the pre-built version referenced in this blog post. It wouldn't take much effort to optimize it further. I originally considered using Scala's Parallel Collections support, but figured that was premature optimization and honestly wasn't sure if it would help or hurt.

    Let me know if you have any trouble with this. I've often thought about resurrecting this project into a graphical zip / unzip tool, I just haven't had the time.

    Reply
  12. D L says:
    May 23, 2016 at 10:40 pm

    Thank you, Mr Hicks sir, I greatly appreciate this. Before finding this on your site, I tried to write a multi-thread extract myself and couldn't get it to work, I think it always had concurrency issues. This is great — and only 69 lines !! I wonder why more tools don't use the multi-threaded approach, if they find if doesn't work for any reason. But for me, this tool has cut my zip extraction time in half. Thank you again !!

    Reply
  13. Unknown says:
    May 31, 2017 at 11:54 am

    UberZip tool code uploaded on github has a bug!!! The main thread never exists as executor doesn't decrements the counter in its thread after finishing the task.
    Possible fix:
    override def run(): Unit = {
    UberZip.unzip(zip, entry, directory)
    counter.decrementAndGet()
    }

    Note: In java context, counter will need to be marked final with this change, however, I believe Scala won't require that

    Reply
  14. Luke Hutchison says:
    July 9, 2018 at 10:55 pm

    I added a bug to the GitHub repo, but UberZip is currently a lot less efficient than it could be. You're currently creating only one ZipFile instance, and sharing that with all worker tasks. There's a lock around all ZipFile methods (check out the source code) that prevents more than one thread from doing work at any given time. This is why in your blog post you weren't getting the performance you expected.

    If you want multiple threads to read from the same zipfile in a scalable way, you must open one ZipFile instance per thread. That way, the per-thread lock in the ZipFile methods does not block all but one thread from reading from the zipfile at one time. (It also means that when each thread closes the ZipFile after they're done reading, they close their own instance, not the shared instance, so you don't get an exception on the second and subsequent close.)

    Reply
  15. Luke Hutchison says:
    July 10, 2018 at 7:40 am

    This post inspired me to write my own parallel unzipper, which uses one ZipFile instance per thread (this is the only way to actually unzip in parallel using the standard ZipFile API, as I described in my previous comment — any other scalability wins you're seeing is from copying from an InputStream's buffer into a file in parallel).

    I released a version of my parallel unzip code here for comparison. Feel free to benchmark it on different zipfiles and different machines. It's twice as fast as InfoZip on my machine when unzipping the Eclipse distribution zipfile.

    I didn't benchmark against your Scala version, but this should be faster, because of using one ZipFile instance per thread.

    https://github.com/lukehutch/quickunzip

    Reply

Leave a Reply to Unknown Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • Courio: E-Mail 2.0
  • Scribe 2.0: Fastest JVM Logger in the World!
  • Logging Performance
  • Publicity in Open-Source
  • Play Framework for Scala: An Evaluation

Recent Comments

  1. Luke Hutchison on Multithreaded Unzip?
  2. Luke Hutchison on Multithreaded Unzip?
  3. Unknown on Multithreaded Unzip?
  4. D L on Multithreaded Unzip?
  5. Matt Hicks on Multithreaded Unzip?

Archives

  • March 2019
  • February 2018
  • January 2017
  • June 2016
  • July 2015
  • February 2015
  • December 2014
  • September 2013
  • March 2013
  • February 2013
  • January 2013
  • November 2011
  • December 2010
  • October 2010
  • July 2010
  • June 2010
  • May 2010
  • October 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • March 2009
  • February 2009
  • September 2008
  • July 2008
  • May 2008
  • March 2008
  • January 2008
  • December 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007

Categories

  • adobe
  • android
  • apple
  • beans
  • benchmark
  • challenge
  • chat
  • comparison
  • courio
  • eclipse
  • example
  • flex
  • framework
  • games
  • hacks
  • helios
  • html
  • hyperscala
  • ios
  • java
  • javafx
  • javascript
  • jcommon
  • jgn
  • jmc
  • jseamless
  • jug
  • kickstarter
  • learning
  • linux
  • log4j
  • logging
  • mac
  • media
  • mediacenter
  • methodology
  • mobile
  • mythtv
  • nabo
  • open-source
  • opengl
  • opinion
  • personal
  • playframework
  • pojo
  • programming
  • publicity
  • rant
  • raspberrypi
  • review
  • scala
  • scribe
  • script
  • sgine
  • social
  • sudoku
  • sun
  • swing
  • tutorial
  • Uncategorized
  • webframework
  • website
  • windows
  • xjava
© 2025 MattHicks.com | Powered by Minimalist Blog WordPress Theme