Pushing the Limits of Amazon S3 Upload Performance

Recently I've been working on a project where I've got millions of relatively small objects, sized between 5KB and 500KB, and they all need to be uploaded to S3. Naturally, doing a synchronous upload of each object, one by one, just doesn't cut it. We need to upload the objects in parallel to achieve acceptable performance. But what are the optimal parameters when it comes to the number of simultaneous upload threads? Does it depend on the object size? How much of a difference does HTTPS over HTTP make? Let me share what I discovered during my testing.

Note that some of these graphs are much larger than what I can show in-page. All can be opened in full size by clicking them.

Test code

To reduce variance in the results, I've run all test cases four times and reported the average runtime. Each test case tries to upload 512 objects of a given size. In total, 2048 objects are uploaded across the four repetitions, before the average runtime is reported back. Even though I ran four repetitions, I still saw some fluctuations in the results that I'll have to ascribe to variance.

I started out by using the thread pool and the asynchronous Begin/EndPutObject methods. However, even when setting the thread pool max/min thread/IO settings explicitly, I found the thread pool usage caused too much variance. Instead I went with manual thread control.

One major actor is ServicePointManager.DefaultConnectionLimit – this limits the number of active connections to any given host at the same time. By default, it has a low value of 2 and thus limits you to just two concurrent uploads to S3, before others are queued at the network level. If this limit is set below the number of active threads, you will invariably have threads waiting to open network connections. To avoid this, I set the connection limit equal to the number of threads I was running.
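As a minimal illustration (the numThreads variable here is mine, not part of the test code below, which does the same thing), the fix is a one-liner executed before any upload threads are started:

    // Raise the per-host connection limit so it matches the number of upload threads.
    // With the default of 2, only two PutObject calls would ever be on the wire at once.
    int numThreads = 64;
    System.Net.ServicePointManager.DefaultConnectionLimit = numThreads;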

I tried running the tests both with and without MD5 checksum generation & verification, but I saw no measurable difference in the results.

At no point, in any of the test environments, were the CPUs stressed to the point where they were close to becoming bottlenecks. As the test object is all in-memory and no disk is involved, I've ruled out disks as a bottleneck factor as well. Thus, the number one piece of hardware affecting the results is the network interface card (NIC).

Before starting the four repetitions of the test, I fire off a single PutObject request to warm up the stack. The test code is relatively simple; it runs in an infinite loop, checking whether we need to upload more objects, or whether we're done. If done, it breaks the loop and ends the thread. When launching, I start up X threads and immediately afterwards join with them to wait for them all to complete. The runtime includes the amount of time required to instantiate the threads, though it should have no measurable impact on the result. The runtime calculation is done using integer math for output simplicity, but the impact should be minimal in the big picture.

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Diagnostics;
using System.Net;
using System.Threading;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

namespace S3Optimization
{
	class Program
	{
		private const string bucketName = "improve.dk";
		private const string serviceUrl = "s3-eu-west-1.amazonaws.com";

		static void Main(string[] args)
		{
			int repetitions = 4;
			int uploadCount = 512;
			int objSize = Convert.ToInt32(args[1]) * 1024;
			int numThreads = Convert.ToInt32(args[0]);
			string dir = "Optimization/" + Guid.NewGuid();
			var config = new AmazonS3Config().WithServiceURL(serviceUrl).WithCommunicationProtocol(Protocol.HTTPS);
			var sw = new Stopwatch();
			object locker = new object();
			string obj = randomString(objSize);

			// Allow as many concurrent connections to S3 as we have upload threads
			ServicePointManager.DefaultConnectionLimit = numThreads;

			// Each thread keeps uploading objects until the shared counter runs out
			var work = new ThreadStart(() =>
				{
					using (var s3Client = AWSClientFactory.CreateAmazonS3Client(ConfigurationManager.AppSettings["AccessKeyID"], ConfigurationManager.AppSettings["SecretAccessKey"], config))
					{
						while (true)
						{
							lock (locker)
							{
								if (uploadCount <= 0)
									break;
								uploadCount--;
							}

							var request = new PutObjectRequest()
								.WithBucketName(bucketName)
								.WithKey(dir + "/" + Guid.NewGuid())
								.WithContentBody(obj);
							s3Client.PutObject(request);
						}
					}
				});

			// Warm up the stack with a single PutObject request
			using (var s3Client = AWSClientFactory.CreateAmazonS3Client(ConfigurationManager.AppSettings["AccessKeyID"], ConfigurationManager.AppSettings["SecretAccessKey"], config))
			{
				var request = new PutObjectRequest()
					.WithBucketName(bucketName)
					.WithKey(dir + "/" + Guid.NewGuid())
					.WithContentBody(obj);
				s3Client.PutObject(request);
			}

			// Run the actual test repetitions
			sw.Start();
			for (int i = 0; i < repetitions; i++)
			{
				int originalUploadCount = uploadCount;

				var threads = new List<Thread>();
				for (int j = 0; j < numThreads; j++)
					threads.Add(new Thread(work));

				threads.ForEach(x => x.Start());
				threads.ForEach(x => x.Join());

				uploadCount = originalUploadCount;
			}
			sw.Stop();

			Console.WriteLine(sw.ElapsedMilliseconds / repetitions);
		}

		static string randomString(int size)
		{
			var rnd = new Random();
			var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

			char[] buffer = new char[size];
			for (int i = 0; i < size; i++)
				buffer[i] = chars[rnd.Next(chars.Length)];

			return new string(buffer);
		}
	}
}

For running the tests, I'm using the following test runner application, testing all combinations of thread count and object size between 1 and 256/2048 respectively (in powers of 2):

using System;
using System.Diagnostics;
using System.IO;

var psi = new ProcessStartInfo("S3Optimization.exe")
	{
		UseShellExecute = false,
		RedirectStandardOutput = true
	};

var connectionLimits = new[] { 1, 2, 4, 8, 16, 32, 64, 128, 256 };
var objSizes = new[] { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 };

foreach (int connectionLimit in connectionLimits)
{
	foreach (int objSize in objSizes)
	{
		// Pass the thread count and the object size (in KB) to the test program
		psi.Arguments = connectionLimit + " " + objSize;

		var p = Process.Start(psi);
		string output = connectionLimit + "\t" + objSize + "\t" + p.StandardOutput.ReadToEnd();
		p.WaitForExit();

		Console.Write(output);
		File.AppendAllText("Output.txt", output);
	}
}

Initial results

The first test is done using a colocated (colo) Dell PowerEdge 1950 placed at a Danish ISP in Aarhus, Denmark. The server is running Windows Server 2003 x64 and has a single gigabit NIC with gigabit support throughout the network stack. Note that I won't be mentioning CPU, memory or disk for any of the machines. None of those were ever close to being the bottleneck and are thus irrelevant. Suffice to say – they all had plenty of CPU, memory and disk capability. A tracert from the server to the S3 EU endpoint in Dublin looks like this:

Tracing route to s3-eu-west-1.amazonaws.com [178.236.6.31]
over a maximum of 30 hops:

  1     1 ms     4 ms    13 ms  89.221.162.249
  2     1 ms     1 ms     3 ms  10.1.2.1
  3     3 ms     2 ms     2 ms  89.221.161.105
  4    <1 ms    <1 ms    <1 ms  92-62-199-177.customer.fuzion.dk [92.62.199.177]
  5    <1 ms    <1 ms    <1 ms  87.116.1.157
  6     5 ms     4 ms     4 ms  93.176.94.175
  7     4 ms     4 ms     4 ms  xe-3-3-0.cph10.ip4.tinet.net [77.67.74.37]
  8    47 ms    47 ms    47 ms  xe-2-2-0.dub40.ip4.tinet.net [89.149.181.37]
  9     *        *        *     Request timed out.

The following graph has the number of threads (that is, simultaneous uploads) on the X-axis and the MiB/s on the Y-axis. The MiB/s was calculated using the formula (UploadCount x ObjectSizeInKB / 1024 / AvgTimePerRepetitionInSeconds). Each color bar represents a given object size in KB as noted in the legend on the right. Note also that these results were made using the standard HTTPS protocol. You might ask yourself why I'm measuring MiB/s and not requests/s. Thing is – they're exactly the same. MiB/s and requests/s are simply calculations based on the time it took to run a fixed number of requests. The absolute values are less interesting than they are in relation to each other. If you want to take a look at the requests/sec, you can download my raw data at the end of the post.
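To make the formula concrete, here it is as a small helper method (my own illustration with made-up numbers, not part of the test code): 512 uploads of 64KB objects averaging 4 seconds per repetition works out to 8 MiB/s.

    // Mirrors the formula above: UploadCount x ObjectSizeInKB / 1024 / AvgTimePerRepetitionInSeconds
    static double MiBPerSecond(int uploadCount, int objectSizeInKB, double avgTimePerRepetitionInSeconds)
    {
        return uploadCount * objectSizeInKB / 1024.0 / avgTimePerRepetitionInSeconds;
    }

    // Example (hypothetical numbers): 512 uploads of 64KB objects, 4 seconds per repetition => 8 MiB/s
    // Console.WriteLine(MiBPerSecond(512, 64, 4.0));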

There is an overwhelming amount of information in this graph alone. We can see how the general throughput seems to increase relatively linearly with the number of threads, though it seems to reach its maximum potential at 128 threads.

Small object characteristics

Let me zoom in on the 1KB object size:

For the 1KB object size we see clear improvements all the way up to 64 threads, after which it seems to stabilize. The 1KB object size is the one that incurs the most overhead due to S3 not utilizing persistent connections. Each request we make needs to create a new TCP connection and perform an SSL handshake. Compared to a 2MB object, we spend a lot more time and resources on overhead relative to actually transferring data. What if we disabled SSL and used unencrypted HTTP instead?
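Switching to unencrypted HTTP is a one-line change to the config used in the test code above (this assumes the same, older AWS SDK for .NET and its Protocol enum, as used throughout the post):

    // Same config object as in the test code, just with the communication protocol set to HTTP instead of HTTPS.
    var config = new AmazonS3Config()
        .WithServiceURL(serviceUrl)
        .WithCommunicationProtocol(Protocol.HTTP);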

Now we get increased performance all the way up to 128 threads – and we actually end up pushing 200% as much data as we did using HTTPS! For small objects, HTTPS has an extremely high cost – you should avoid it if you can.

Number of threads – finding the sweet spot

Take a look at this graph, showing the results for object sizes 1KB – 128KB:

Ignoring small deviations, all of the objects seem to peak at 64 connections. Any more than that either causes a significant drop-off, or simply modest variance. For objects less than 128KB, 64 threads definitely seems to be the cut-off point. Compare it with the following graph, showing object sizes 256KB – 2048KB:

For these objects, we clearly see that going up to 128 connections actually provides a boost in throughput, leading me to conclude that for objects of size 256KB+, you can use somewhat more threads successfully.

For all object sizes, using HTTP over HTTPS seems to increase the maximum throughput thread limit – raising it from 64 to 128 for smaller objects and from 128 to 256 threads for larger objects. If you're uploading objects of varying sizes, this means you'll have to do some testing with your specific workload to find the optimal number of threads.
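If I had to boil these particular results down to a starting point for tuning, it would look something like the sketch below - a rough heuristic derived from my graphs, not a general recommendation; your workload may well peak elsewhere.

    // Rough starting point based on the results above; measure with your own workload before trusting it.
    static int SuggestThreadCount(int objectSizeInKB, bool useHttps)
    {
        if (objectSizeInKB < 256)
            return useHttps ? 64 : 128;   // smaller objects peaked earlier

        return useHttps ? 128 : 256;      // larger objects kept scaling a bit further
    }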

Object size vs. HTTPS performance impact

In the following graph I've calculated the average gain HTTP had over HTTPS for each object size, across all thread counts. As there is quite some variance, the trend line is the most interesting part of the graph. It clearly shows that as object size grows, the HTTP over HTTPS advantage decreases.

Server 2008 R2 vs. Server 2003

You've probably heard about Server 2008 bringing forth a bunch of updates to the TCP/IP stack. I thought it would be interesting to run the same tests on an identical server, but running Windows Server 2008 R2 x64. Luckily, I have just that. A server with identical hardware, on the same subnet at the same ISP, just running Server 2008 R2 x64 instead. Question is, how big of a difference does the OS alone make?

For this graph, I calculated the maximum attainable transfer speed, using HTTPS, for a given object, across any number of threads. I've then mapped those values into the graph for both Server 2003 and Server 2008 R2 (note the log(2) scale!).

It clearly shows that Server 2008 R2 consistently wins out over 2003 - and this is using the exact same hardware, same switches, connection, etc. - only the OS is different. What about HTTP?

Ignoring some minor variance, 2008 R2 is still clearly the winner.

On average, I found Server 2008 R2 to be 16.8% faster than Server 2003 when testing HTTPS, and 18.7% faster when using HTTP. That is a major gain just by changing the OS!

The impact of locality – EC2 meets S3

At this point I've demonstrated that you get rather crappy performance if you perform your uploads single threaded. By just scaling out the number of threads, we can actually end up saturating a gigabit NIC, provided the object size is large enough. However, we do spend a large amount of time waiting for network latency. What difference would it make if we were to run this in the cloud… Say in EC2 for example?

I spawned an m1.xlarge instance in the EU EC2 region, providing me with a stable instance with plenty of CPU and memory. A tracert confirms that we are indeed very close to the S3 servers:

Tracing route to s3-eu-west-1.amazonaws.com [178.236.5.31]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  ip-10-48-248-2.eu-west-1.compute.internal [10.48.248.2]
  2    <1 ms    <1 ms    <1 ms  ec2-79-125-0-242.eu-west-1.compute.amazonaws.com [79.125.0.242]
  3    <1 ms    <1 ms    <1 ms  ip-10-1-44-253.eu-west-1.compute.internal [10.1.44.253]
  4     1 ms     2 ms     1 ms  ip-10-1-0-5.eu-west-1.compute.internal [10.1.0.5]
  5    <1 ms    <1 ms    <1 ms  ec2-79-125-1-97.eu-west-1.compute.amazonaws.com [79.125.1.97]
  6     2 ms     2 ms     2 ms  178.236.0.138
  7     2 ms    20 ms     2 ms  178.236.0.123
  8     2 ms     2 ms     2 ms  178.236.0.155
  9     2 ms     2 ms     2 ms  178.236.5.31

HTTP still wins out over HTTPS

Just to make sure, I compared the average performance of HTTP over HTTPS again. For now, I'm hiding the actual units, and instead I'm just showing the percentage difference. Note that the blue HTTPS line is a baseline performance of 100%.

Ignoring variation, we see an average performance improvement of about 150% compared to HTTPS. From this we can conclude that locality doesn't change the performance characteristics of HTTP vs. HTTPS – HTTP still wins any day. As a result, numbers from now on will be based on HTTP tests, unless explicitly stated otherwise.

Now we're talking throughput!

Let's look at a quick graph detailing the maximum attainable transfer speeds for any given object, comparing my colo Server 2008 R2 server with the m1.xlarge instance run in the AWS EC2 cloud (note the log(10) scale):

Wow. I redid this test several times as I just couldn't believe the results. Where my 2008 R2 instance pushed almost 1 meg/sec, I was getting 5.2 megs/sec through the EC2 instance. Okay, I guess that's reasonable since the smaller objects are punished so severely by connection overhead - and that's the primary advantage of having close locality to the S3 servers, right?

However - once we get to object size 32, we're now pushing 120 megs/sec from EC2 - well past the 100Mbps NIC that the server reports. But it doesn't stop there - oh no. I ended up hitting the ceiling at a stable transfer speed of 460 megs/sec, pushing 1024KB objects using 64 threads. But how in the world am I able to push 3,680Mbps through a 100Mbps NIC?

The thing is, these are all just virtual machines sharing physical hardware. The server itself reports 100Mbps, but Amazon will scale your NIC depending on the instance type - typically telling you to expect a worst case of 250Mbps on a large instance. My guess is that these machines are running 10GigE internally, and you'll get whatever is available, though QoS'ed so you'll get your 250Mbps at a minimum. If that is the case, I can easily pull 3,680Mbps of the theoretically available 10,000Mbps, the rest being used by other VMs on the same physical network. To all my neighbors during these tests, sorry!

This begs the question though - what if I had that 10GigE connection all to myself? What if I didn't have to share it?

Pushing the limits using compute clusters

If we take a look at the Cluster Compute Quadruple Extra Large Instance (let's just call it CC from now on) specs, we're told to expect 10GigE network connectivity:

Aha! Just what we need. Unfortunately the CC instances are only available in the US regions, so I had to set up a new bucket in the US, and change my test code to connect to said bucket, from the US. As such, it shouldn't change anything, though it should be noted that the tests so far have been run in the Dublin DC, whereas this test is run in the North Virginia DC.

Let's start out by looking at object sizes 1-16KB, comparing the m1.xlarge instance with the cc1.4xlarge instance:

Huh, that's kinda disappointing. It seems that the CC instance consistently performs worse than the m1.xlarge instance. Let's try and take a look at object sizes 32-2048KB:

Now we're talking! As soon as we cross 256KB in object size, we start to saturate the available bandwidth of the m1.xlarge instance - the CC instance, on the other hand, just keeps going up! In this test I reached a max speed of 1091.7 megs/sec using 128 threads pushing objects of 2048KB. That's 8,733.6Mbps out of a theoretical max of 10,000Mbps - on a single virtualized instance, mind you.

To infinity and beyond

Still not satisfied? Well, neither was I. I tried to tweak the settings a bit more to see if I could push it even further. Given that an object size of 2048KB seemed to improve the result over 1024KB, would an even larger object help? How about more than 128 threads?

It's evident that more than 256 threads does not yield any benefit, quite the contrary. However, using 256 threads and an object size of 4096KB, I was able to push 1117.9 megs/sec to S3. I am extremely satisfied with that result. I honestly did not expect to even get 25% of that from a single machine, whether physical or virtual. That's 8,943.2Mbps of pure data - that is, not including the inevitable network overhead.

Expanding on the results

You can download an Excel sheet of all my raw data, including the various graphs and calculations that I've made. Note that all raw numbers are repeated, first sorted by the number of threads, then sorted by the object size. Note also that there is some extra data here and there where I had to do some ad hoc tests.

If you want me to create some graphs of a specific scenario, compare two different result sets, environments, etc. - just let me know in the comments. I've probably overlooked something interesting as there is just so much data to pull out. Optimally I'd want to run each of these tests for 100 repetitions at different times of the day, just to weed out all of the variance completely. Unfortunately, that would cost me way too much, and it would take ages. I may do some high-rep tests for specific scenarios like the HTTP vs. HTTPS tests, as I feel there were too many fluctuations there.

Download: S3Optimization.xlsx

Conclusions

There are a lot of factors to consider when optimizing for upload speed. However, there are just a few rules that you need to follow to reach near-optimal speed with limited effort.

Parallelization

The easiest way to scale is to simply parallelize your workload. Each S3 connection doesn't get that much bandwidth through, but as long as you run many of them at the same time, the aggregate throughput is excellent. Most workloads showed 64, 128 or 256 to be the optimal number of threads.

Locality & bandwidth

Being close to the S3 bucket servers is of utmost importance. As can be seen from the graph, I nearly exhausted my gigabit NIC on my colo servers, but I doubt I'd be able to exhaust a 10GigE connection (anyone got a server I can borrow for testing?). The graph is slightly misleading though, as the EC2 servers had anywhere from 4 to 10GigE of connectivity, so it's not all just latency - bandwidth certainly matters too, especially once you reach high thread counts with large object sizes.

Operating system

Now, you shouldn't go out and format all of your Server 2003 servers just to get 2008 R2. However, 2008 R2 does consistently perform better than 2003. Though I haven't tested it, I expect 2008 and 2008 R2 to be about the same. Generally you'll get about 15% better performance on a 2008 R2 server over a 2003 server.

Saturating the S3 service

Not going to happen, simple as that. I'm utterly amazed at the throughput I managed to get from just a single machine. At the peak, I was pushing more than one gigabyte of data to S3 every second - 1117.9 megs/sec to be precise. That is an awful lot of data, all coming from a single machine. Now imagine you scale this out over multiple machines, and you have the network infrastructure to support it - you can really send a lot of data.

Variance

As can be seen in some of my results, you can't avoid running into variance when using a cloud service. However, it's important to look at the baseline numbers - what is the worst case performance you can expect? Not only did my best-case numbers blow my mind, so did the worst-case numbers! Even though performance does fluctuate, the median performance is what matters, and it's nothing short of impressive.

Optimizing the network stack

I'm sure I've left out a few percentage points by not looking at the NIC drivers and settings. However, generally that'll be your very last resort, and it'll only help you get those last few percent. In most cases there's no reason to mess around with the remaining 1%; I'll easily settle for the 99%.

Mark S. Rasmussen

I'm the CTO at iPaper where I cuddle with databases, mold code and maintain the overall technical & team responsibility. I'm an avid speaker at user groups & conferences. I love life, motorcycles, photography and all things technical. Say hi on Twitter, write me an email or look me up on LinkedIn.


Source: https://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
