02 Oct

How to deal with Black Friday

Black Friday, the day after Thanksgiving (US), is one of the biggest high street shopping days of the year.  Makes sense – a lot of Americans are off work that day, and for many it’s close to the last payday before Christmas.

In recent years, most of the action has moved into digital retail, which is good news for shoppers who want to avoid an actual fistfight over that last HD television. There’s even a .blackfriday Top Level Domain to capitalise on the increased consumer engagement.

The global rise of Black Friday for online retailers has certainly been interesting, but not without challenges. Even some of the most respected retailers can struggle to deal with the increased traffic. It’s usually more than just one day, though. Last year we saw plenty of “pre-Black Friday” sales, and many of the deals will be on until Christmas. Argos took a whole Black Friday week, and Amazon led into its own sales with Prime Day and prolonged campaigns.

One thing I found most interesting: many retailers see traffic or conversions increase even when not advertising specific sales or deals. People just seem to have a “buy now” mindset regardless, and some of our customers have been caught out by that.

Here are the things you should be doing to prepare your online store for Black Friday, or any other big sale event.

See also: another article I wrote about common pitfalls with extreme traffic. 

Performance testing

First of all, you need to know where your website actually stands in terms of traffic capacity, and you need to find that out properly. If you have them, use analytics from last year’s Black Friday and Cyber Monday as a baseline.

The key to success here is to use accurate user journeys and conversion rates in your load tests. For example, you might convert at 5% all year round, but as much as 50% on Black Friday if you’ve got a great deal on and your customers all pile in to the checkout. I’ve seen this happen, where a site was overwhelmed with a much higher conversion rate than was tested. A good problem to have, but it needed rapid action. Luckily we were already using Cloud, so it didn’t take long to scale out further. Additionally, our DBA team found and fixed a database locking issue (see below), which could have been picked up earlier with the right performance testing.

Human think times also play a big part in this kind of testing, which, again, can have an order-of-magnitude effect on the concurrency figures produced.

Don’t: just rely on tools like siege, or services like loader.io and blitz.io. They can be extremely useful, of course, but only if you are able to interpret the results properly. Unless you have a deep understanding of HTTP headers, cookies, and unique session IDs, these tools might not give you the insight you need.

Do: get it done professionally. You need a test that can mimic actual human user journeys, and repeat them on a massive scale. jMeter is good for that, if you have the resources/bandwidth, but doing it right can be complicated and very time-consuming. I would argue this is best left to professional performance testers; it’s specialist work and a necessary investment in your website’s future. Your hosting provider might offer professional load testing services, or refer you to someone who can. Soasta are excellent.
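If you do go the DIY route with jMeter, at least run it in non-GUI mode from a test machine (or several) with plenty of headroom. A minimal sketch – the plan name and report paths are placeholders, and it assumes a JMeter 3+ test plan already scripted with realistic journeys and think times:

# Build the plan in the GUI, but never run the real test there
jmeter -n -t blackfriday-journeys.jmx -l results.jtl -e -o ./report
# -n = non-GUI, -t = test plan, -l = raw results, -e -o = HTML report (JMeter 3+)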

Do: do this in plenty of time.  If you come out with a laundry list of changes, you need to be able to act on them before the shopping season starts.

 

Be ready to scale infrastructure

Black Friday is one of the best examples of how Cloud resources can be used to boost capacity when needed. Or, you might just need to increase your physical server resources with a CPU upgrade. More likely, it’s both. Talk to your hosting provider about Cloud Bursting or Hybrid hosting models, which can work really well for seasonal workloads.

Quick note about Autoscaling, if you use it: use time-based events, or scale manually ahead of time, rather than waiting for Autoscale algorithms to detect the load and kick in. It can be quite a few minutes before the new servers/containers are in place and ready, and you could be losing sales in that time. Bear in mind that everyone else in your public Cloud might be doing the same, so allow extra time in case the provisioning APIs are a little slower than usual.
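For example, on AWS you can schedule the scale-up the night before rather than trusting reactive policies; most other Clouds have an equivalent. A sketch only – the group name, times and sizes below are placeholders:

# Pre-warm the fleet ahead of the morning rush
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name web-asg \
    --scheduled-action-name pre-black-friday \
    --start-time 2017-11-24T06:00:00Z \
    --min-size 4 --max-size 12 --desired-capacity 8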

Go big or go home: Cloud Resources, even with high CPU/memory, are great value when only online for a few days. Don’t be afraid to spin up those large instances. A couple of hundred dollars/pounds/euros is definitely worth the investment for the one weekend where you can’t afford downtime.

 

Have analytics in place

…such as NewRelic, AppDynamics, or similar. You need to know if your customers are getting a good experience. If they’re not, these tools will help you pinpoint areas for improvement at the application level.

 

Investigate database locks

Review your code. Log slow queries. Database-level locks can be a key limiting factor in concurrent transactions, and no amount of servers, CPU or memory will fix that. It is usually a case of changing table engines or removing/reworking locking queries. Investing in professional DBA time will be money well spent, especially if you can line it up with a performance/load test.
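As a starting point, something like this during the load test window can surface both problems at once (a sketch assuming MySQL 5.5–5.7; the one-second threshold is illustrative):

# Log anything slower than one second for the duration of the test
mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"

# While the test runs, look for transactions stuck waiting on row locks
mysql -e "SELECT * FROM information_schema.INNODB_LOCK_WAITS\G"
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A20 TRANSACTIONS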

 

Cache all the things

It goes without saying that many applications include some kind of caching layer. If yours doesn’t, there are dedicated HTTP caching tools and techniques. For Magento, I’ve written a couple of articles on Full Page Cache, and on the Mirasvit FPC extension, which I really like.

However, don’t just assume because the “Full Page Cache” box is ticked, it’s all rosy. Test thoroughly to make sure caches are actually working as expected. It’s not uncommon for a Magento extension to be blocking the Full Page Cache from being used properly. Be wary of any unique URL parameters in use, especially from mailshots (more on that later), which can sometimes completely bypass your caches.

If you can time curl -I your homepage, product and category pages, and they return within about 200 milliseconds, you’re probably OK. Take the time to understand the HTTP response headers; they usually give you a good indication of whether or not you are hitting a cache somewhere in the application stack.
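Something along these lines is all it takes (the hostname is a placeholder, and the exact header names depend on your stack):

# Under ~200ms plus an Age/X-Cache hit means the cache is earning its keep
time curl -sI https://www.example.com/ | grep -Ei '^(HTTP|Age|X-Cache|X-Varnish|Cache-Control)'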

https://tools.pingdom.com/ is a good way to measure that Time To First Byte (TTFB) and other factors in page rendering speed. You might need to run the test two or three times to get a feel for what’s being cached, and where you can make improvements. Combine these with NewRelic APM data, and perhaps varnishstat if appropriate; this should give you a solid grasp of how effective your cache setup really is.

 

Use a CDN

At the very least, you should be using a CDN for static assets (images, CSS, etc), which can often be the bulk of the traffic – especially if you do business overseas, where geographic latency comes into the mix. This reduces load at the network level, and keeps your server resources free to concentrate on dynamic content like shopping carts and checkout. In Magento, it’s trivial to change the URLs for your Media and Skin elements so they’re served via a CDN.

Running your entire domain behind a CDN can also offer a host of security benefits, like WAF and DDoS protection. If you’re not already using the likes of CloudFlare, Incapsula, Akamai, Fastly, et al, then talk to your dev or hosting provider about implementation.

Consider full-page or aggressive caching at the CDN level where possible. Caution: this has its challenges, and is perhaps one for another article. But done right, it can help you serve traffic at a significantly bigger scale.

 

What not to do:

Don’t send customers straight to dynamic pages from mailshots or social campaigns.

If 20,000 Facebook users all hit a page which has to be uniquely generated on your servers, you will definitely have performance/scaling issues sooner or later. 

URL parameters. Critically, links from mailshots or ad campaigns often carry unique or tracking identifiers. Same content, but each request is unique enough to bypass your various caches. To find out if that’s happening, your web server access logs should tell the full story of any URL parameters making their way through. If you can’t remove the URL parameters at source – you might not be able to change the way your mailshot provider works – then make sure your application (or cache) ignores or ‘normalizes’ the parameters. You could achieve this with Varnish VCL (if you’re already using Varnish), or perhaps work around it with a mod_rewrite rule to strip the parameters.
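A quick-and-dirty survey of which parameters are actually arriving might look like this (assuming an nginx/Apache combined log format; adjust the log path to suit):

# Count the query-string parameter names seen in today's traffic
awk -F'"' '{print $2}' /var/log/nginx/access.log \
  | grep -o '?[^ ]*' \
  | tr '?&' '\n\n' | sed '/^$/d' \
  | cut -d= -f1 \
  | sort | uniq -c | sort -rn | head -20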

Don’t overdo the AJAX. If your pages are static or cached (good), but have a ton of dynamic JavaScript making requests, then you might still be causing undue load on the application layer.

Try to avoid that: don’t start unnecessary sessions; skip the POST requests where no session is present; perhaps even remove some of the dynamic blocks from your templates. Examples:

  • Do you really need the whizzy cart contents popping up onMouseOver(), causing server load on every page? Or, would a simple link to the shopping cart do the job?
  • Recently Viewed Products: often a unique block which can be removed to reduce traffic.
  • Search autocomplete: depending on your implementation, this might be making an AJAX call on every keystroke.

Obviously this is a functionality vs. performance trade-off, but it might be the one thing keeping your website online during the big sale event.

Instead, use as much static content as possible, because this will always be faster and create the least load on your infrastructure. If you don’t have the time for a code change, in an emergency, you could use web server rewrite rules (or similar) to return a 503 for those requests.

Landing pages should be flat HTML if possible, perhaps then offering links through to the real dynamic content. Having this first step can reduce server load by an order of magnitude.  If it absolutely has to be a dynamic page, then make sure it’s fully cacheable (see: Cache all the things, CDN), does not have unique URL parameters, and the page doesn’t have too many dynamic elements.

 

Performance testing, again 

Ideally you want to measure the difference after any changes to your infrastructure, application stack, code, or database.

This can be an iterative process – rinse and repeat until you’re really happy with the performance and capacity.  

 

01 Jun

6 reasons your Magento site went down

This article is about extreme traffic overwhelming your site.

There’s plenty of marketing you can do to drive traffic, but TV appearances are by far the most hard-hitting. Your target audience is sitting on the sofa with a laptop or tablet at the ready, and they’re all going to hit your site within the same 10 seconds or so. Email campaigns can have a similar effect, but usually you can stagger delivery to limit the impact. Read on especially if you’re a startup planning on launching with a big bang.

Specifically, we’re talking about sudden spikes where most of your visitors arrive within seconds of each other, rather than gradual traffic growth.

Before we start, I’m assuming your site is well hosted and generally performs well under normal conditions (Magento category page Time To First Byte < 1.0 second without FPC). If it doesn’t, stop reading and fix that first – everything below will be a band-aid rather than a solution.

Here are six things you can do to prepare for huge traffic spikes.
 
 

1. Plan for failure

Despite everything else in this article, it can be hard to gauge just how much of a load spike you’ll get.  Get some error pages in place, at every level possible. Your developers or digital agency should be able to knock something up in no time.

Do: Think about including a discount code on error pages, encouraging your customers to come back later. I’m told it really works.

Do: Make it look nice, on brand, including contact details. Some are light-hearted and fun – I’ve even seen Pac-Man embedded to play while waiting – but the main message you need is to encourage a repeat visit. Default or generic error pages do not inspire confidence in your brand.

Don’t: Include any images, CSS, etc, on this error page that are served from your own web servers – if visitors are seeing the error page, those servers may well not be responding. Host the assets elsewhere; a CDN would be perfect.

 
 

2. Full Page Cache

A good Full Page Cache is essential to absorb the majority of your traffic, and the majority of server load.

It can be quite a complicated thing to get right, though, so I wrote a separate article with my thoughts on Magento Full Page Cache.

If you’re doing it right, your pages and categories should be coming back in under 100 milliseconds.
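A simple way to keep an eye on that (the URLs are placeholders – point it at your own key pages):

# Spot-check Time To First Byte on the pages that matter
for page in / /some-category.html /some-product.html; do
    curl -o /dev/null -s -w "%{time_starttransfer}s  ${page}\n" "https://www.example.com${page}"
done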

 

3. Database

First, use persistent connections. I’ve seen a sudden influx of DB connections overwhelm the TCP stack on the DB host. Make sure MySQL is configured to accommodate them, though: every single PHP-FPM or Apache child process could be hanging onto a connection. We can do some quick maths here: if you have ten web servers with pm.max_children=250, then your MySQL max_connections needs to be 2500. Add a few more for any monitoring or diagnostics.

Configure it in local.xml:

<connection>
    <host><![CDATA[magento_db_host]]></host>
    <username><![CDATA[magento_db_user]]></username>
    <password><![CDATA[magento_db_pass]]></password>
    <dbname><![CDATA[magento]]></dbname>
    <active>1</active>
    <persistent>1</persistent>
</connection>
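On the MySQL side, sanity-check the connection headroom against that maths (the numbers below follow the example above; tune them to your own fleet):

# What are we configured for, and how close have we come?
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
mysql -e "SHOW STATUS LIKE 'Max_used_connections';"

# 10 web servers x 250 PHP-FPM children = 2500, plus a margin for monitoring
mysql -e "SET GLOBAL max_connections = 2600;"
# (mirror the change in my.cnf so it survives a restart)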

Replication?

No.
Three points about this:

  1. Database is not normally the bottleneck. CPU for PHP execution is. You will want a reasonably powerful machine with good I/O, but so long as your FPC is effective, it’s very unlikely that you’ll need to scale out.
  2. Broken shopping carts, and other weird behaviour. Replication takes time – a second or two – which can be long enough to cause a problem. For the most part, Magento’s read/write separation does account for replication delay, but third-party extensions might not, and that matters most when they touch shopping cart or checkout functionality.
  3. Replication is not resilience. I want to mention this because I think a lot of people ask for replication out of a misunderstanding that it’ll make the site more resilient or Highly Available. A master-slave setup still has single points of failure, and Magento will error out if it can’t connect to the master or ANY of your slaves. A multi-master implementation could work if you have a floating IP, but in my experience the complexity far outweighs the benefit. At Rackspace, our go-to HA solution is to run MySQL (usually Percona) under the Red Hat Cluster Suite, and that works brilliantly. Magento gets one database connection, via a floating IP that moves between resilient nodes. The backend servers are often less powerful than the web servers; see point 1.

That’s it for the database. I’m not going into general DB optimisation here, but Major Hayden’s mysqltuner.pl is a good start.

 
 

4. Backend Cache Scaling

Most of the time, you’ll be sharing your Redis cache between web nodes. This is important for management of the cache via the Magento admin. At massive scale, though, you can overwhelm the physical network or the TCP stack on the Redis host, and run into performance problems because Redis is single-threaded.

Too many servers on one Redis instance

The simple solution is to install a local Redis instance on each web server, and connect on localhost or a UNIX socket. That cuts out the network load completely, and scales out to the Nth degree.

One Redis instance on each Web server

The major disadvantage of doing this, though, is that management operations – like clearing the cache, or general invalidation when you make changes – will not happen across the board. Here’s a quick-and-dirty proof-of-concept bash script for clearing out all the caches at once, assuming you also configure each instance to listen on your local or isolated network:

#!/usr/bin/env bash
# Flush the Redis cache on every web node in one pass.
# redis-cli closes the connection cleanly, where some nc variants hang waiting for EOF.
REDIS_SERVERS="192.168.100.5 192.168.100.6 192.168.100.7 192.168.100.8 192.168.100.9"
for server in $REDIS_SERVERS; do
    redis-cli -h "$server" -p 6379 FLUSHALL
done

NB: This kind of cache setup is only for extreme cases; 99% of the time a single Redis instance is fine for the Magento cache. I like to use a second one for Enterprise full_page_cache. You could look into Redis sharding for ultimate performance, but that’s a little more complicated than we need here. This is for a one-off event, and when it’s done you can scale back to the single Redis instance for easier cache management.

NB: Cache storage must not be confused with session storage. It often is, when the same technology is involved. Despite the above, I would still advise keeping all your sessions in one place, mainly because I don’t like to rely on load balancers’ session persistence. Session traffic is very unlikely to saturate the network the way cache traffic can. I prefer Memcached over Redis for sessions; it’s simple and multi-threaded. On that note, ensure MAXCONN and CACHESIZE are suitably configured.
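On RHEL/CentOS those live in /etc/sysconfig/memcached; the sizes below are illustrative rather than a recommendation:

# /etc/sysconfig/memcached
PORT="11211"
USER="memcached"
MAXCONN="4096"     # at least web servers x pm.max_children, same maths as MySQL
CACHESIZE="512"    # MB; sessions are small, but evictions mid-checkout are painful
OPTIONS="-l 192.168.100.10"   # bind to the private network only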

 
 

5. CDN

Content Delivery Networks are not a magic solution, and usually have no effect on those initial page loads, nor the PHP load on your web servers. While some CDNs do have full page caching features, I haven’t seen anyone successfully integrate them into an application as complex as Magento.

What a CDN will do, however, is speed up the delivery of extra content for the overall page load. Especially if you’ve got an ocean between your customers and your server(s). If all your customers are in the same country as your server, though, it probably won’t be that much faster and might not be worth the effort.

The biggest advantage for me is to reduce the network load on your infrastructure. On most Magento stores (most websites in general), the bulk of actual data content is product imagery. Offloading that to a CDN will definitely help to avoid network saturation, and load on net devices like firewalls and load balancers.

You need to be using a CDN which pulls from origin; the days of trying to upload with ImageCDN are long gone. And for faster page loads, you can use separate URLs for your skin, media, and JavaScript elements, leveraging parallel downloads. Once those are set up, it’s pretty trivial in Magento to configure the URLs under System > Configuration > Web. It might be a little more work for SSL, but if most of your window shopping is done over plain HTTP, then start with the unsecure base URLs for the quickest win.
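Once it’s configured, verify the assets really are coming from the CDN (the hostnames and paths here are made up – substitute your own base URLs):

# A cache HIT and a CDN Server header mean the offload is working
for url in \
    "http://media.example.com/media/catalog/product/sample.jpg" \
    "http://skin.example.com/skin/frontend/default/default/css/styles.css"; do
  curl -sI "$url" | grep -Ei '^(HTTP|Server|X-Cache|Age)'
done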

 
 

6. Load testing

You need to know where your website actually stands in terms of traffic, and you need to do it properly.

Don’t: rely on tools like siege, or services like loader.io and blitz.io. They can be extremely useful, of course, but only if you are able to interpret the results properly. Unless you have a deep understanding of HTTP headers, cookies, and unique session IDs, these tools probably won’t help you all that much.
 
Do: get it done professionally. You need a test that can mimic actual human user journeys, and repeat them on a massive scale. jMeter is good, but doing it right can be complicated and very time consuming. I would argue this is best left to professionals who do just that. Not cheap, but a necessary investment for your website’s future. Your hosting provider might offer professional load testing services, or refer you to someone who can. Soasta are excellent.