02 Oct

How to deal with Black Friday

Black Friday, the day after Thanksgiving (US), is one of the biggest high street shopping days of the year.  Makes sense – a lot of Americans are off work that day, and for many it’s close to the last payday before Christmas.

In recent years, most of the action has moved into digital retail, which is good news for shoppers who want to avoid an actual fistfight over that last HD television. There’s even a .blackfriday Top Level Domain to capitalise on the increased consumer engagement.

The global rise of Black Friday for online retailers has certainly been interesting, but not without challenges. Even some of the most respected retailers can struggle to deal with the increased traffic. It’s usually more than just one day, though. Last year we saw plenty of “pre-Black Friday” sales, and many of the deals will stay on until Christmas. Argos took a whole Black Friday week. Amazon led their own sales with Prime Day and prolonged campaigns.

One thing I found most interesting: many retailers see traffic or conversions increase even when not advertising specific sales or deals. Some of our customers have been caught out by that.   People just seem to have a “buy now” mindset regardless.

Here are the things you should be doing to prepare your online store for Black Friday, or any other big sale event.

See also: another article I wrote about common pitfalls with extreme traffic. 

Performance testing

First of all, you need to know where your website actually stands in terms of traffic, and you need to do it properly.  If you have them, use analytics from Black Friday and Cyber Monday last year.

The key to success here is to use accurate user journeys and conversion rates in your load tests. For example, you might convert at 5% all year round, but as much as 50% on Black Friday if you’ve got a great deal on and your customers all pile in to the checkout. I’ve seen this happen, where a site was overwhelmed with a much higher conversion rate than was tested. A good problem to have, but it needed rapid action. Luckily we were already using Cloud, so it didn’t take long to scale out further. Additionally, our DBA team found and fixed a database locking issue (see below), which could have been picked up earlier with the right performance testing.

Human think times also play a big part in this kind of testing, which, again, can have an order-of-magnitude effect on the concurrency figures produced.
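As a back-of-the-envelope sanity check (the numbers here are made up for illustration): if a campaign drives 50 new visitors per second, and an average journey is 5 pages at roughly 1 second of server time plus 20 seconds of think time per page, then Little’s Law gives about 50 × 5 × 21 ≈ 5,250 concurrent users on the site, but only around 50 × 5 × 1 ≈ 250 requests actually being processed by PHP at any instant. Leave the think time out of your test plan and you’ll size your platform against the wrong one of those two numbers.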

Don’t: just rely on tools like siege, or services like loader.io and blitz.io. They can be extremely useful of course, but only if you are able to interpret the results properly. Unless you have a deep understanding of HTTP protocol headers, cookies, unique session IDs, these tools might not give you the insight you need.

Do: get it done professionally. You need a test that can mimic actual human user journeys, and repeat them on a massive scale. jMeter is good for that, if you have the resources/bandwidth, but doing it right can be complicated and very time consuming. I would argue this is best left to professional performance testers; this is specialist work and a necessary investment for your website’s future. Your hosting provider might offer professional load testing services, or refer you to someone who can. Soasta are excellent.

Do: do this in plenty of time.  If you come out with a laundry list of changes, you need to be able to act on them before the shopping season starts.

 

Be ready to scale infrastructure

Black Friday is one of the best examples of how Cloud resources can be used to boost capacity when needed. Or, you might just need to increase your physical server resources with a CPU upgrade. More likely, it’s both. Talk to your hosting provider about Cloud Bursting or Hybrid hosting models, which can work really well for seasonal workloads.

Quick note about Autoscaling, if you use it. Use time-based events, or scale manually ahead of time, rather than waiting for Autoscale algorithms to detect the load, and kick in. It might be quite a few minutes before the new servers/containers are in place and ready, and you might be losing sales in that time. Bear in mind that everyone else in your public Cloud might be doing the same, so allow extra time in case the APIs are a little slower than usual.
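For example, if you happen to be on AWS, a scheduled scaling action is one way to do the time-based approach. This is only a sketch – the group name, sizes and UTC start time are placeholders, and other clouds have equivalent features:

aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name web-asg \
    --scheduled-action-name black-friday-scale-up \
    --start-time "2015-11-27T05:00:00Z" \
    --min-size 8 --max-size 16 --desired-capacity 8

Scale back down with a second action a few days later, once the traffic has died down.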

Go big or go home: Cloud Resources, even with high CPU/memory, are great value when only online for a few days. Don’t be afraid to spin up those large instances. A couple of hundred dollars/pounds/euros is definitely worth the investment for the one weekend where you can’t afford downtime.

 

Have analytics in place

…such as NewRelic, AppDynamics, or similar. You need to know if your customers are getting a good experience. If they’re not, these tools will help you pinpoint areas for improvement at the application level.

 

Investigate database locks

Review your code. Log slow queries. Database-level locks can be a key limiting factor in concurrent transactions, and no amount of servers, CPU or memory will fix that. It is usually a case of changing table engines or removing/reworking locking queries. Investing in professional DBA time will be money well spent, especially if you can line that up with a performance/load test.
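If you’re not sure where to start with the slow query log, a minimal my.cnf sketch looks like this (the threshold and path are just sensible defaults – adjust to taste, or set the equivalents at runtime with SET GLOBAL):

[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1    # seconds; anything slower gets logged

Review the output during a load test and hand the worst offenders to your DBA or developers.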

 

Cache all the things

Goes without saying – many applications will include some kind of caching layer. If not, there are specific HTTP caching tools and techniques. For Magento, I’ve written a couple of articles on Full Page Cache, and the Mirasvit FPC extension which I really like.

However, don’t just assume because the “Full Page Cache” box is ticked, it’s all rosy. Test thoroughly to make sure caches are actually working as expected. It’s not uncommon for a Magento extension to be blocking the Full Page Cache from being used properly. Be wary of any unique URL parameters in use, especially from mailshots (more on that later), which can sometimes completely bypass your caches.

If you can time curl -I your homepage, product and category pages, and they return within about 200 milliseconds, you’re probably OK.  Take the time to understand the  HTTP response headers; they usually give you a good indication of whether or not you are hitting a cache somewhere in the application stack.
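A quick sketch of that check from the command line – run it two or three times, because the first request may well be a cache miss, and bear in mind the exact header names depend on your stack (Varnish, CDN, FPC module, etc.):

time curl -sI https://www.example.com/some-category.html \
    | grep -Ei '^(HTTP|X-Cache|Age|X-Varnish|X-Magento)'

An Age header or an obvious “HIT” marker on repeat requests is a good sign; a response that takes a second or more every time suggests you’re not hitting a cache at all.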

https://tools.pingdom.com/ is a good way to measure that Time To First Byte (TTFB) and other factors in page rendering speed. You might need to run the test two or three times to get a feel for what’s being cached, and where you can make improvements. Combine these with NewRelic APM data, for example, and perhaps varnishstat if appropriate; this should give you a solid grasp of how effective your cache setup really is.

 

Use a CDN

At the very least, you should be using a CDN for static assets (images, CSS, etc) which can often be the bulk of the traffic. Especially if you do business overseas, where geographic latency comes into the mix. This is going to reduce load at the network level, and keep your server resources free to concentrate on the dynamic content like shopping carts and checkout.  In Magento, it’s trivial to change the URLs for your Media and Skin elements, to serve them via a CDN. 

Running your entire domain behind a CDN can also offer a host of security benefits, like WAF and DDoS protection. If you’re not already using the likes of CloudFlare, Incapsula, Akamai, Fastly, et al, then talk to your dev or hosting provider about implementation.

Consider full-page or aggressive caching at the CDN level where possible. Caution: This has its challenges, and perhaps one for another article. But done right, it can help you serve traffic on a significantly bigger scale. 

 

What not to do:

Don’t send customers straight to dynamic pages – e.g. from mailshots or social campaigns.

If 20,000 Facebook users all hit a page which has to be uniquely generated on your servers, you will definitely have performance/scaling issues sooner or later. 

URL parameters. Critically, links from mailshots or ad campaigns often have unique or tracking identifiers. Same content, but each request is unique enough to bypass your various caches. To find out if that’s happening, your web server access logs should tell the full story of any URL parameters making their way through.  If you can’t remove the URL parameters at source – you might not be able to change the way your mailshot provider works –  then make sure your application (or cache) is going to ignore or ‘normalize’ the parameters. You could achieve this with Varnish VCL (only if you are already using Varnish) or perhaps work around it with a mod_rewrite rule to strip the parameters.  
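As a rough illustration of the mod_rewrite approach (Apache assumed; utm_ is just the most common prefix – match whatever your campaigns actually append): redirect any request carrying tracking parameters back to the same path with no query string, so every visitor lands on one cacheable URL. Note this drops the entire query string, which is usually fine for mailshot landing pages but not for things like layered navigation:

RewriteEngine On
# if the query string contains a utm_* parameter...
RewriteCond %{QUERY_STRING} (^|&)utm_ [NC]
# ...redirect to the same path with the query string removed (the trailing ? clears it)
RewriteRule ^(.*)$ /$1? [R=302,L]

The Varnish equivalent is a regsuball() on req.url in vcl_recv, if you’re already running Varnish.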

Don’t overdo the AJAX. If your pages are static or cached (good), but have a ton of dynamic JavaScript making requests, then you might still be causing undue load on the application layer.

Try to avoid that: don’t start unnecessary sessions; skip the POST requests where no session is present; perhaps even remove some of the dynamic blocks from your templates. Examples:

  • Do you really need the whizzy cart contents popping up onMouseOver(), causing server load on every page? Or, would a simple link to the shopping cart do the job?
  • Recently Viewed Products: often a unique block which can be removed to reduce traffic.
  • Search autocomplete: depending on your implementation, this might be making an AJAX call on every keystroke.

Obviously this is a functionality vs. performance trade-off, but it might be the one thing keeping your website online during the big sale event.

Instead, use as much static content as possible, because this will always be faster and create the least load on your infrastructure. If you don’t have the time for a code change, in an emergency, you could use web server rewrite rules (or similar) to return a 503 for those requests.
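A minimal emergency-brake sketch, Apache 2.4 flavour – the path here is purely hypothetical, so substitute whatever endpoint your templates are actually hammering, and make sure the 503 page itself is static:

RewriteEngine On
# short-circuit a noisy AJAX endpoint with a 503 during the sale (path is a placeholder)
RewriteRule ^ajax/recently_viewed/ - [R=503,L]
ErrorDocument 503 /maintenance.html

It’s crude, but it keeps PHP out of the request entirely.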

Landing pages should be flat HTML if possible, perhaps then offering links through to the real dynamic content. Having this first step can reduce server load by an order of magnitude.  If it absolutely has to be a dynamic page, then make sure it’s fully cacheable (see: Cache all the things, CDN), does not have unique URL parameters, and the page doesn’t have too many dynamic elements.

 

Performance testing, again 

Ideally you want to measure the difference after any changes to your infrastructure, application stack, code, or database.

This can be an iterative process – rinse and repeat until you’re really happy with the performance and capacity.  

 

31 May

Can you run Magento on a Plesk server?

TL;DR – Yes, you can. But don’t. There are a lot of reasons why you shouldn’t.

If you must… scroll down for some tips.

I get asked this question a lot, by dev agencies or shared hosting resellers – often already using a control panel like Plesk.  They’ll pick up a new customer who happens to be using Magento, upload to their shared server like every other site, and then call us when it doesn’t work.

This applies loosely to any shared hosting environment – Magento on Plesk, Magento on cPanel, ServerAdmin, or any DIY solution with multiple websites on one server or VPS.

This article should be useful for Plesk server administrators, and Ecommerce CTOs who are looking at Magento hosting.

The right way

In most cases, a small dedicated VM,  VPS or Cloud Server can be used to host a small Magento site in a more isolated way. LAMP Config can be honed to perfection for that site, and there’s no need to tip-toe around other websites on that same server.

A VPS, or Cloud Server with 4G memory is about where you need to start.

If you’re talking about a busier website with dedicated resources anyway, a control panel like Plesk will probably just hinder your ability to configure things at a low level. In that case, get a decent sys admin instead.

Security

Let me start with some doom-mongering. Shared servers get compromised all the time. Ecommerce is an area where a compromise can do serious damage to business reputation.

On a shared server, you have no control over how many other sites are there, how long ago they updated WordPress (for example) or whether there are compromised sites running rogue on that server that the owner hasn’t even noticed.

It’s a numbers game – the chance of any one site being compromised might be relatively slim, but there might be 200 sites on that server. This is why shared environments (of any kind) usually don’t meet PCI compliance requirements. On the upside, panels like Plesk do go to great lengths to try to separate websites: users, permissions, config like PHP open_basedir, etc. Application-level compromises may not affect your site, but if it’s a root-level compromise then you’ve had it.

A more everyday problem might be that you’re on a compromised server which is sending a lot of spam.  Your crucial transactional emails could get lost in the server’s million-strong mail queue, or filtered as spam because of the sending server’s IP reputation.

Remember: If your Magento site is running on a shared server, then the security of your business is in the hands of the person running that server. Quite often – especially if they’re relying on control panels – those people are not very knowledgeable about security topics, let alone the underlying LAMP config. I hate to say it, but it’s true more often than not. Disclosure: My first job in tech was with an IT services company, reselling shared hosting on Plesk servers. I definitely didn’t know much about security. Ignorance was bliss.

Performance

For Magento to work well, you need to tweak the LAMP stack a fair bit. Plesk and other control panels can limit how easy/feasible this is, either server-wide or per website.

High level examples:

  • Varnish was popular for Magento 1 and is now a crucial part of a Magento 2 stack. Varnish config is very website-specific, so implementing Varnish on a shared server is very difficult. Possible, but will undoubtedly cause problems for other websites, and the VCL will be very messy. It’s just not practical.
  • PHP version: You might even be stuck with an older PHP version, to cater for a legacy website on that same server.

Recent Plesk versions do give a lot of granularity for PHP config, even offering different PHP versions per domain, so that’s good. But those extra PHP versions might be outside of your server’s main package management. Are they getting the latest updates? See: security.

One special mention here is a PHP setting called open_basedir. Used per website, it restricts PHP to only a certain few directories – exactly the sort of thing you want on a shared environment. Plesk uses it by default as a sensible security measure. But – and it’s a big but – open_basedir effectively disables the PHP realpath cache, an internal PHP cache which massively speeds up PHP file includes by caching filesystem paths. That makes sense, because the realpath cache is global within PHP, so having access to that cache could break the open_basedir restriction. The downside is a big performance hit: PHP has to query the filesystem for every single include(). In Magento, we all know that’s going to be a lot. The impact can be severe.
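If you do end up switching open_basedir off for the Magento vhost (see the list further down), it’s also worth making sure the realpath cache is sized sensibly. A minimal php.ini sketch – these values are common starting points rather than gospel:

; only takes effect when open_basedir is NOT set for the vhost
realpath_cache_size = 256k
realpath_cache_ttl  = 300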

Resources

Shared servers can sometimes be packed to the rafters with small sites. And that’s fine – many low traffic sites use barely any resource. But even a low traffic Magento site can consume a fair amount of memory and CPU. Remember that low human traffic doesn’t mean there aren’t a dozen search engine bots constantly hitting the Magento site. Layered navigation on a large catalogue can lead to a lot of crawling;  you need to factor that in.

I’ve seen this several times, where a new or growing Magento site can engulf the CPU or memory on a shared server, causing downtime for many other sites. For example, Magento recommends a 512M memory limit – and it’s not uncommon to set a memory_limit of 2GB to allow for some large product import. What if you have 12G total, and 11G used by other sites? That might only be two visitors at once.

Server resources and website performance go hand in hand, but do think about the impact that the different websites will have on each other. In terms of resources (not necessarily security) it’s probably OK to run a busy Magento site and a few other, smaller sites (like a blog). But if you’re trying to run 20 Magento sites on one Plesk server, they’re going to trip over each other sooner or later. See below for how to mitigate that.

SEO makes no difference

A little off topic, but I just wanted to mention this as a non-argument. In case anyone mentions Search Engine Optimisation as a reason not to use a shared server, or shared IP: it absolutely doesn’t matter. Matt Cutts said so, and he is pretty senior at Google. Sharing our diminishing IPv4 space with tools like SNI is very necessary, and it will not harm your SEO rankings. Or just use IPv6 already.

 

If you must…

If you’ve read this far, you probably have commercial or technical reasons you can’t avoid using a shared server. Here is how to run Magento on a Plesk server (or any shared hosting environment).

  1. Use a Full Page Cache Magento extension, which will massively reduce the server load impact as well as speed up page loads for your customers.  I recommend Mirasvit FPC, instead of spending weeks with Varnish config.
  2. Set open_basedir=none – if you are happy with the security implications. You can do that in Plesk under the advanced scripting options per domain – here are some instructions. My advice is to only do that for the Magento website(s), but leave the other sites restricted.
  3. Ask your administrator for general PHP optimisations to the global PHP config; the most important of which is to use an Opcode cache like Zend OPcache.
  4. Stick to the main PHP version on the server where possible, so it’ll get security updates. Anything above PHP 5.4 should be OK, but watch for EOL package sources.
  5. Use Redis for Magento cache, as long as you have the memory (1G should be plenty). Plesk doesn’t need to know or care that Redis is there, but it should be configured with requirepass to prevent other sites accessing the data (see the sketch after this list). If you are running multiple Magento sites, you should use a separate Redis instance for each.
  6. For Magento sessions – just use <session_save>files</session_save> . Performance impact is minimal and it’s one less thing to worry about. Otherwise – if available – use memcached.
  7. MySQL –  set max_user_connections globally. I usually set it to ~80% of the max_connections, so as not to be too limiting but prevent any one site using all connections and effectively bringing down all database-driven sites on that server.
  8. Set limits on the resource usage, where possible. Plesk can use Apache mod_bw to do that – see this guide. I wouldn’t limit by Kb/s – just use the overall connections. Again, the value here is difficult to judge and will be different for each site/server. Start with about the number of CPU cores you’re happy for this site to consume.
  9. Use a CDN, like CloudFlare. It’s free, and gives an immediate boost to page load times, especially for overseas customers. Headers for GeoIP information are also really useful if you’re an international store. It helps to reduce server/network load, and tools like the WAF (~$20/mo) can help with security.
  10. Understand the resource limits – I’m expecting even a bottom end dedicated Plesk server will have at least 4 CPU cores and 12 or 24G RAM.  If you’re trying all this on a 1G VPS, you’re doing it wrong.
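To expand on item 5, here’s a minimal sketch assuming a local Redis instance on the default port – the password is a placeholder, so generate a long random one of your own:

# /etc/redis.conf
maxmemory 1gb
requirepass SomeLongRandomString

# app/etc/local.xml, inside <global>
<cache>
  <backend>Cm_Cache_Backend_Redis</backend>
  <backend_options>
    <server>127.0.0.1</server>
    <port>6379</port>
    <password>SomeLongRandomString</password>
    <database>0</database>
  </backend_options>
</cache>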

 

 

 

29 Oct

Redis session locking

Here is an excerpt from a NewRelic trace.
[Screenshot: NewRelic transaction trace showing time spent in Cm_RedisSession]
If that looks familiar, keep reading.

Cause

Hard to nail it down, but likely something to do with the way the excellent Cm_RedisSession module uses client-side locking. If you have NewRelic Pro, you might see this crop up now and again in the full transaction traces as above.

My hunch is that it only happens when end-users open a bunch of pages, in tabs, in quick succession (I know I do that!), and you get those locks on their session while the pages are generated. Any kind of Full Page Cache might be masking this problem to some extent, making it harder to replicate reliably. If they end up blocking each other, it would explain the traces I’ve seen with 30+ seconds under Session::read. That’s certainly enough to trigger a timeout on most reverse proxies or load balancers, so this is a possible cause of those elusive 503 or 504 errors you’ve been getting.

 

Solution, part 1

Add <disable_locking> to your Magento local.xml :

<session_save><![CDATA[db]]></session_save>
<redis_session>
   <host>12.34.56.78</host>
   <port>6379</port> 
   <password></password>
   <timeout>2.5</timeout>
   <persistent></persistent>
   <db>1</db>
   <compression_threshold>2048</compression_threshold>
   <compression_lib>gzip</compression_lib>
   <log_level>1</log_level>
   <max_concurrency>6</max_concurrency>
   <break_after_frontend>5</break_after_frontend>
   <break_after_adminhtml>30</break_after_adminhtml>
   <bot_lifetime>7200</bot_lifetime>
   <disable_locking>1</disable_locking>
</redis_session>

Solution, part 2

NB: Older versions of Cm_RedisSession do not have this feature.
To find out, open up app/code/community/Cm/RedisSession/Model/Session.php and look for “disable_locking”.

# grep disable_locking app/code/community/Cm/RedisSession -r
app/code/community/Cm/RedisSession/Model/Session.php:                    : ! (strlen("{$config->descend('disable_locking')}") ? (bool)"{$config->descend('disable_locking')}" : self::DEFAULT_DISABLE_LOCKING); 
  • If you see the above, it will work with no further action. This updated code is bundled as of Magento CE 1.9 or EE 1.14.
  • If you don’t see that, then the module needs to be updated.

The code is here: https://github.com/colinmollenhour/Cm_RedisSession

Thanks, Colin.

Do make sure that any changes are integrated with your version control and, of course, you’ll need to clear all Magento and PHP opcode caches for the changes to take effect.

If you don’t have the capacity to implement that, I have a couple of other options…

Fallback 1: Use Memcached for sessions.

If it’s not already available, ask your hosting provider to set it up.

 <session_save><![CDATA[memcache]]></session_save>
<session_save_path><![CDATA[tcp://127.0.0.1:11211?persistent=0&weight=2&timeout=10&retry_interval=10]]></session_save_path>

For sessions, there’s nothing really wrong with Memcached – performance is great – it’s just harder to find HA implementations comparable to ElastiCache or ObjectRocket for Redis. So with a multi-server setup, you might be introducing a single point of failure.

 

Fallback 2: Just use the default ‘files’ handler.

<session_save><![CDATA[files]]></session_save>

If you are on a single web server, it’s really not that bad for performance. Obviously this is not a good idea for multi-web-servers, because you’d have to rely on your load balancer’s session persistence and that’s another story.

To be clear, we are only talking about sessions. The  <cache> and/or <full_page_cache> should definitely stay in Redis.

28 Jul

Mirasvit Full Page Cache

I’ve talked about Full Page Cache before, and how a fast site is important for your customers (not to mention GoogleBot), and ultimately for better sales conversions.

From a SysAdmin point of view, sites with some kind of FPC can handle much more traffic, with fewer server resources (read: cheaper for you, IT Manager) and are usually much better at handling sudden traffic spikes.

Varnish is as fast as it gets. But Varnish requires a lot of skill to implement well and work around any niggles (and there are always issues).  Now, I love Varnish, but for many it’s just too complex, or time consuming, especially if you’re working to a tight deadline. Instead, there are plenty of code-based solutions which aim to implement Enterprise-like FPC but for a fraction of the cost.

I’ve seen a lot of Community Edition customers using this extension successfully, and I wanted to see what it was all about.

https://mirasvit.com/magento-extensions/full-page-cache.html

Special mention here to the folks at Mirasvit, who were kind enough to send us a copy for evaluation at Rackspace. Their turnaround was quick, so I’m confident you’ll get a responsive support experience. For us, that’s really important.

 

Your Mileage May Vary

I was testing with stock Magento Community 1.9.1.0 and the sample data.

The settings I’ll discuss here should work fine for most, but your mileage may vary if your Magento store is heavily customised. Always test new modules in a staging environment before implementing on your live website.

Installation

I pretty much followed the bundled instructions – no need for me to detail it here but it was very straightforward. See also the Mirasvit FPC user manual.

 

Configuration

Let’s dive into the config, in your Magento Dashboard (System > Configuration > MIRASVIT EXTENSIONS/Full Page Cache )

General Settings

[Screenshot: Mirasvit FPC general settings]

  • Enabled: Yes (obviously)
  • Cache Lifetime (sec): I’ve gone for two days here, you could use more. If your site gets indexed by a search engine once a day, the first hit will warm up, and the page won’t have expired by the next day’s index. If your site traffic is quite low, and it could be a few days between page views of any one particular product, then you should keep this value high, like a week (604800 seconds).
  • Flush Cache Expr: Leave it empty to disable the auto-flushing. I tested that saving a product will automatically expire the relevant pages, so you are not likely to see out-of-date content. My general rule is that you shouldn’t have to specifically flush caches (development aside); the more you flush them, the less effective they are.
  • Max. Cache Size (Mb): 128 is probably OK for most, but you might need more if you have a lot of products/categories. You should understand where your cache is, though, before increasing this (see the quick check after this list). For example, if you’re using a 512M Redis instance from ObjectRocket, then setting this higher than 500 would start to cause problems if it gets full. For a local Redis instance, your maxmemory directive in /etc/redis.conf will be relevant here.
  • Max. Number of Cache Files: 20000 seems ample; you might need to increase this if you have a lot of SKUs, categories, etc.
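A quick way to see where you stand before raising the Max. Cache Size, assuming a local Redis instance on the default port:

redis-cli info memory | grep used_memory_human
redis-cli config get maxmemory

If used memory is already close to maxmemory, raise maxmemory (and the server’s RAM) before raising the extension’s limit.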

Crawler Settings

The crawler seems to work really well. What I like is that it only crawls the pages your customers are actually hitting, rather than just spidering the whole site needlessly.

[Screenshot: Mirasvit FPC crawler settings]

  • Enabled: Yes. If your site is pretty busy, and your expiry times are high, then you might find your customers do a great job of warming up the cache for you. For quieter sites though, or to ensure that most people hit cache most of the time, definitely enable it.
  • Number of Threads: 1.  First of all, you should find out how many CPU cores are available on your server. My test site is running on a small Cloud Server, with only one vCPU core, and my load testing experience tells me the default of ‘2’ might slow things down for my 1 vCPU core. lscpu is a command you can run to find out quickly.  Half that number, as a rule of thumb, should safely avoid impacting performance for real users.
  • Thread Delay: I’ve put half a second in there to further reduce load impact.
  • Limit of Crawled URLs per Run / Schedule: A higher limit here will warm up the cache more quickly, in conjunction with the Schedule, but the idea here is to prevent the crawler from running away with itself and endlessly hammering your server. The default setup is going to crawl up to 10 URLs every 15 minutes, which is fairly conservative – only 40 pages per hour. Something like 20 URLs every 10 minutes should be fine. If you wanted to get even more granular, you could run every 10 minutes but avoid peak hours (let’s say they are 12-2pm and 6-10pm), with something like:
    • */10  0-11,14-17,22-23  *  *  *
  • Sort Crawler urls by: Popularity. Sounds sensible; I didn’t bother setting up the custom order.
  • Run crawler as apache user: No. I didn’t need to do this, as my PHP runs under FPM as its own user, and that same user runs the Magento cron job.

Cache Rules

This is really the nuts and bolts of what gets cached and what doesn’t.

[Screenshot: Mirasvit FPC cache rules]

  • Max. Allowed Page Depth: 10. I’ve seen sites where heavily layered or even cyclical navigation leads to endless unique URLs, and it’s not practical to cache them all. This is there to prevent over-caching of those pages, and 10 seems like a decent default value.
  • Cacheable Actions: The defaults here are the home page, product pages, and category pages. That’s probably fine for most; you might need to add bits if you have heavy CMS pages, or if your store is heavily customised.
  • Allowed/Ignored Pages: What it says on the tin. Maybe you have a special CMS page which includes a live Twitter feed, and you don’t want to cache it.
  • User Agent Segmentation: If you have a responsive theme, you won’t need this. If any part of your code relies on device detection, like a different tablet layout, or a popup about your iPhone app, then it’s likely you need to use this. Above is my example which should take care of most popular devices right now (2015); you might need to work on your own expressions depending on which devices/browsers your site cares about. One thing you should not do is separate GoogleBot or other engines/bots/crawlers – if you do that then they’re less likely to get the page from your main cache. Faster for them is good for your rankings, and hitting the cache is good for your server load.

Debug

The debug options are pretty self-explanatory, and should usually be disabled in production. The Time Stats are really handy to compare uncached vs. cached performance, and I like that you can show these only for your IP address(es). The code is using $_SERVER['REMOTE_ADDR'] though, so it won’t work behind reverse proxies or load balancers.

[Screenshot: Mirasvit FPC debug hints]

 

Cache Management

The obvious thing here is that you get another option for Full Page Cache under System > Cache Management. You’ll need to enable that, then flush all cache, for the FPC to start working.

Everyone loves a nice graph – ask NewRelic – and understanding how your cache is performing will help you drive a faster experience for your users.  Here’s what Mirasvit FPC adds to your Cache Management page:

[Screenshot: Mirasvit FPC statistics graph on the Cache Management page]

You can zoom the graph to a smaller time period, or get an overview for much longer.

Source: http://fpc.demo.mirasvit.com/admin/?demo=fpc (because my test store didn’t have enough data for an interesting graph).

More screenshots on the Mirasvit website or Magento Connect.

One Caveat

When testing with two browsers side by side, I did at first get some crossover where one browser would see the page with cart contents showing from the other session. I found that this was down to the way Magento includes the Session ID in the URL by default, combined with the default of not doing any session validation.  After disabling that, everything worked as expected.

  • System > Configuration > Web > Session Validation:  “Use SID on Frontend” =  “No”.
  •  Clear all cache to apply.

What I liked

  • Easy setup. I just plonked the files in place, and pressed “go”.  You may want to tweak the default settings as above, but it pretty much works out of the box
  • No extra local.xml config. It just uses whatever <cache><backend> you already have configured, which is great. I was already using Redis, and Mirasvit lapped it up.
  • Good support: That’s the main theme in the comments on Magento Connect, and the team did respond to my email within a day. For an extra $50 USD, Mirasvit will even install the plugin for you – great if you don’t have the skills or don’t have a developer on hand.
  • Cache Rules are really nice to configure, and extras like User Agent separation mean that it’s very flexible.
  • Built in Crawler seemed to work really well, and it won’t smash your server to pieces.

I didn’t test:

  • Dynamic blocks.  Blocks and layouts are going to be unique for each store, so working on the default probably won’t help. Mirasvit provide full documentation and offer to help with this as part of the installation service. You may not need to configure this – in my experience, hole-punching for dynamic blocks generally creates more complexity and extra work in the long run for your frontend developer. Simply using the cache as-is will still cut out 90% of server load while keeping your deployment simple.
  • Debug stats behind a reverse proxy.  A lot of the customers I work with have their main web server(s) behind a load balancer, or maybe a CDN like CloudFlare. It’d be nice to see this implemented from the Magento client IP, which can be configured in local.xml to get the real IP from X-Forwarded-For or any other HTTP headers (a sketch follows this list).
    • UPDATE from Mirasvit: “We will use similar approach in our next releases.”
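For reference, the standard Magento 1 way to tell the application about the real client IP is the remote_addr_headers node in local.xml, under <global> – whether any given extension honours it is another matter, which is exactly what the update above is about:

<remote_addr_headers>
    <header1>HTTP_X_FORWARDED_FOR</header1>
    <header2>HTTP_X_REAL_IP</header2>
</remote_addr_headers>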

Final thoughts

Quick to get going, feature rich, and not overly complicated, it’s a great alternative to a complex Varnish configuration. My cached page loads (coming from Redis) were showing around 37-70 ms, which is on par with the Enterprise FPC.  With great support too, and all for a one-off $149, it’s probably the best $149 you could spend for your Community Edition Magento store.

 

18 Jun

Magento deployment checklist

Just deployed a new Magento site? Or migrated to a new host?

Here are some things you should do before you launch…

Secure it

  • Magento base code may not include the latest security patches!  Check the download page for the latest and apply them. For example, Community 1.9.1.0 is still vulnerable to Shoplift out of the box.
  • Change your Admin path – anything other than the default /admin, to make your Magento backend harder to find or brute-force.  This is usually configured in your app/etc/local.xml.
    ...
      <admin>
        <routers>
          <adminhtml>
            <args>
              <frontName><![CDATA[something_unique]]></frontName>
            </args>
          </adminhtml>
        </routers>
      </admin>
    </config>
  • Admin password: if you were using “admin123” while in development, now is a good time to change it.
  • File permissions: these tend to get overlooked in development. Here’s a handy guide on what they should be, and a minimal sketch follows this list.
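A baseline sketch for Magento 1 file permissions – this assumes the files are owned by (or share a group with) the user your PHP runs as, so adjust to your own setup rather than copying blindly:

cd /path/to/docroot
find . -type d -exec chmod 755 {} \;
find . -type f -exec chmod 644 {} \;
# the web/PHP user needs to write to these
chmod -R 775 var media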

Enable caches

Magento caches are often disabled during development, but in production it’s essential that they’re all ON.  In the Dashboard, go to System > Cache Management and enable them. Related article: Magento Full Page Cache

Log cleaning

This relates to the log_* tables in the database. They’re a bit like access logs which are not rotated by default – this is bad because it bloats your database, wastes your InnoDB buffer pool, and makes database backups more cumbersome.

Go to System > Configuration > System > Log Cleaning and enable  it. The default 30-day retention should be fine.
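The same cleaning can be done from the command line if you prefer; Magento 1 ships a shell/log.php helper for it (run from the Magento root – the --days value here is just an example):

php -f shell/log.php -- status           # show how big the log tables are
php -f shell/log.php -- clean --days 30  # trim them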

Cron job

The Magento cron job should be run every five minutes. Top tip: run cron.sh instead of cron.php. The shell script first checks it’s not already running, then runs the PHP, preventing overlaps.

*/5 * * * *  /bin/sh /path/to/docroot/cron.sh

As of Magento 1.9.1, the cron job is responsible for sending customer email so it’s more important than ever.

Indexing

If you are using Community Edition, indexing may not be a problem to start with, but one day it’s going to cause issues. Here’s my advice on how to configure Magento indexing.

Error pages

Death, taxes, and Magento 503s. Even a well tuned Magento application infrastructure can be complex and one day something will break. Or maybe we’re talking planned maintenance; installing a new extension for example.  Here’s a great tutorial on customising your Magento error pages.  It’s best to get this done early on, so you’re ready for the unexpected.

Top tip: Other reverse proxies like Load Balancers and Varnish (if using) will probably show their own 503 page when something is broken. Talk to your hosting provider about modifying these – default error pages don’t inspire confidence in your brand.

The goal here is a nice customer experience even when something is broken. Make sure there’s a phone number or email address at the very least. Including a discount code will encourage customers to come back and buy later.

Load test

If time allows, the last thing to do before launch is a load or performance test to get insight into what your new solution can really do.  It’s best to do that before you have real world traffic, otherwise load testing is basically DOS’ing yourself. Load testing is also likely to show up areas for config optimisation/performance, which is always a good thing.

Don’t: rely on tools like siege, or services like loader.io and blitz.io. They can be extremely useful of course, but only if you are able to interpret the results properly. Unless you have a deep understanding of HTTP protocol headers, cookies, unique session IDs, these tools probably won’t help you all that much.

Do: get it done professionally. You need a test that can mimic actual human user journeys, and repeat them on a massive scale. jMeter is great, but doing it right can be complicated and very time consuming. I would argue this is best left to professionals who do just that. Not cheap, but a necessary investment for your website’s future. Your hosting provider might offer professional load testing services, or refer you to someone who can. Soasta are excellent.

01 Jun

6 reasons your Magento site went down

This article is about extreme traffic overwhelming your site.

There’s plenty of marketing you can do to drive traffic, but TV appearances are by far the most hard-hitting. Your target audience is sitting on the sofa with a laptop or tablet at the ready, and they’re all going to hit your site within the same 10 seconds or so. Email campaigns can have a similar effect, but usually you can stagger delivery to limit the impact. Read on especially if you’re a startup planning on launching with a big bang.

Specifically, we’re talking about sudden, short-lived spikes: TV appearances, mass email campaigns, and big-bang launches.

Before we start, I’m assuming your site is well hosted and generally performs well under normal conditions (Magento category page Time To First Byte < 1.0 second without FPC). If it doesn’t, then stop reading. These will be band-aids rather than solutions.

Here are six things you can do to prepare for huge traffic spikes.
 
 

1. Plan for failure

Despite everything else in this article, it can be hard to gauge just how much of a load spike you’ll get.  Get some error pages in place, at every level possible. Your developers or digital agency should be able to knock something up in no time.

Do: Think about including a discount code on error pages, encouraging your customers to come back later. I’m told it really works.

Do: Make it look nice, on brand, including contact details. Some are light-hearted and fun – I’ve even seen Pac-Man embedded to play while waiting – but the main message you need is to encourage a repeat visit. Default or generic error pages do not inspire confidence in your brand.

Don’t: Include any images, CSS, etc, on this error page from your web servers, in case they’re not responding. Host these assets elsewhere; a CDN would be perfect.

 
 

2. Full Page Cache

A good Full Page Cache is essential to absorb the majority of your traffic, and the majority of server load.

Can be quite a complicated thing to get right though, so I wrote a separate article with my thoughts on Magento Full Page Cache.

If you’re doing it right, your pages and categories should be coming back in around less than 100 milliseconds.

 

3. Database

First, use persistent connections. I’ve seen the sudden influx of DB connections overwhelm the TCP stack on the DB host. Make sure MySQL is configured to accommodate that, though. Every single PHP-FPM or Apache child process could be hanging onto a connection. We can do some quick maths here: if you have ten web servers with pm.max_children=250, then your MySQL max_connections needs to be at least 2500. Add a few more for any monitoring or diagnostics.

Configure it in local.xml:

<connection>
    <host><![CDATA[magento_db_host]]></host>
    <username><![CDATA[magento_db_user]]></username>
    <password><![CDATA[magento_db_pass]]></password>
    <dbname><![CDATA[magento]]></dbname>
    <active>1</active>
    <persistent>1</persistent>
</connection>
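And the matching my.cnf side of that arithmetic, as a sketch (ten web nodes × 250 PHP children, plus a little headroom for monitoring):

[mysqld]
max_connections = 2600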

Replication?

No.
Three points about this:

  1. Database is not normally the bottleneck. CPU for PHP execution is. You will want a reasonably powerful machine with good I/O, but so long as your FPC is effective, it’s very unlikely that you’ll need to scale out.
  2. Broken shopping carts, and other weird behaviour. Replication takes time, like a second or two, which can be long enough to cause a problem. For the most part, the Magento read/write separation does account for replication delay but third party extensions might not. It’s especially important when they are related to shopping carts or checkout functionality.
  3. Replication is not resilience.  I want to mention this, because I think a lot of people ask for replication out of a misunderstanding that it’ll make the site more resilient or Highly Available. A master-slave setup still has single points of failure, and Magento will error out if it can’t connect to the master or ANY of your slaves. A multi-master implementation could work if you have a floating IP, but in my experience the complexity far outweighs the benefit. At Rackspace, our go-to HA solution is to run MySQL (usually Percona) under the Red Hat Cluster Suite, and that works brilliantly. Magento gets one database connection, which is a floating IP between resilient nodes. The backend servers are often less powerful than the web servers; see point #1.

That’s it for the database. I’m not going  into general DB optimisation here, but Major Hayden’s mysqltuner.pl is a good start.

 
 

4. Backend Cache Scaling

Most of the time, you’ll be sharing your Redis cache between web nodes. This is important for management of the cache via Magento admin. At a massive scale though, you can overwhelm the physical network, TCP stack on the Redis host, and run into performance problems because Redis is single-threaded.

[Diagram: too many web servers sharing one Redis instance]

The simple solution is to install a local Redis instance on each web server, and connect on localhost or a UNIX socket. Cuts out the network load completely, and scales out to the Nth degree.

[Diagram: one Redis instance on each web server]

The major disadvantage of doing this though is that management operations, like clearing the cache, or general invalidation when you make changes, will not happen across the board.  Here’s a quick-and-dirty proof of concept bash script for clearing out all your caches at once, assuming you are also configuring each to listen on your local or isolated network:

REDIS_SERVERS="192.168.100.5 192.168.100.6 192.168.100.7 192.168.100.8 192.168.100.9"
for server in $REDIS_SERVERS; do
    echo -e "FLUSHALL" | nc $server 6379
done

NB: This kind of cache setup is only for extreme cases; 99% of the time a single Redis instance is OK for Magento cache. I like to use a second one for Enterprise full_page_cache. You could look into Redis sharding for ultimate performance, but that’s a little more complicated than we need here. This is for a one-off event, and when it’s done you can scale back to a single Redis instance for easier cache management.

NB: Cache storage must not be confused with Session storage. It often is, when the same technology is involved. Despite the above, I would still advise keeping all your sessions in one place, mainly because I don’t like to rely on load balancers’ session persistence. Session traffic is very unlikely to saturate the network the way cache traffic can. I prefer Memcached over Redis for sessions; it’s simple and multi-threaded. On that note, ensure MAXCONN and CACHESIZE are suitably configured.
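On a RHEL/CentOS-style box those two settings live in /etc/sysconfig/memcached; a sketch with starting-point values (the bind address is a placeholder for your internal network):

PORT="11211"
USER="memcached"
MAXCONN="4096"
CACHESIZE="512"
OPTIONS="-l 192.168.100.10"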

 
 

5. CDN

Content Delivery Networks are not a magic solution, and usually have no effect on those initial page loads, nor the PHP load on your web servers. While some CDNs do have full page caching features, I haven’t seen anyone successfully integrate them into an application as complex as Magento.

What a CDN will do, however, is speed up the delivery of extra content for the overall page load. Especially if you’ve got an ocean between your customers and your server(s). If all your customers are in the same country as your server, though, it probably won’t be that much faster and might not be worth the effort.

The biggest advantage for me is to reduce the network load on your infrastructure. On most Magento stores (most websites in general), the bulk of actual data content is product imagery. Offloading that to a CDN will definitely help to avoid network saturation, and load on net devices like firewalls and load balancers.

You need to be using a CDN which pulls from origin; the days of trying to upload with ImageCDN are long gone. And for faster pageloads, you can use separate URLs for your skin, media, and javascript elements, leveraging parallel downloads. Once those are set up, it’s pretty trivial in Magento to configure the URLs under System > Configuration > Web. It might be a little more work for SSL, but if most of your window shopping is done over plain HTTP then start with the unsecure base URLs for the quickest win.

 
 

6. Load testing

You need to know where your website actually stands in terms of traffic, and you need to do it properly.

Don’t: rely on tools like siege, or services like loader.io and blitz.io. They can be extremely useful of course, but only if you are able to interpret the results properly. Unless you have a deep understanding of HTTP protocol headers, cookies, unique session IDs, these tools probably won’t help you all that much.
 
Do: get it done professionally. You need a test that can mimic actual human user journeys, and repeat them on a massive scale. jMeter is good, but doing it right can be complicated and very time consuming. I would argue this is best left to professionals who do just that. Not cheap, but a necessary investment for your website’s future. Your hosting provider might offer professional load testing services, or refer you to someone who can. Soasta are excellent.

27 May

Magento Full Page Cache

I find myself talking about this a lot, so here are my musings on Magento and Full Page Cache, written down.

To run at scale, and keep 90% of the load away from your CPUs, you just have to have a Full Page Cache that works. A quick way to test is to measure the Time To First Byte. If FPC is working, your TTFB should be around or under 100ms (not including network latency) after about 2-3 requests.

From an SEO and customer perspective: faster pages keep GoogleBot crawling happily and keep shoppers from drifting off elsewhere, which ultimately means better conversions.

Important: Before thinking about Full Page Cache, you do still need reasonable performance without it. Otherwise you’re really just masking other problems and your customers are going be less than satisfied when they hit a page that isn’t cached.  If your TTFB is much more than 1 second, you first need to talk to your developer and/or hosting provider about optimisation and config.  Additionally, this absolutely does not negate the need for a good cache backend, like Redis.

Full Page Cache is the icing on what should already be a tasty cake.

 

Magento 2.x

Magento 2.x has Varnish 3/4 support out of the box, and it’s recommended over Redis as a Full Page Cache.

  1. Install Varnish, and have it listen on port 80 in front of Apache/nginx.
  2. Configure Magento to use Varnish, and export a VCL: Stores > Settings > Configuration > Advanced > System > Full Page Cache.
  3. This should create a Varnish VCL under <Docroot>/var. Configure Varnish to use that.
  4. If you have multiple web servers, you need to:
    • Ensure your local network range is included in acl_purge{} in the Varnish VCL.
    • Define an array for http_cache_hosts in the app/etc/env.php configuration (a sketch follows this list).
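A sketch of that env.php entry – the IPs are hypothetical web nodes and 6081 is just a common Varnish listen port, so use your own values:

'http_cache_hosts' => [
    ['host' => '192.168.100.5', 'port' => '6081'],
    ['host' => '192.168.100.6', 'port' => '6081'],
],

This lets Magento send its purge requests to every Varnish instance, not just the one in front of the node handling the admin request.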

Enterprise Edition 1.x

In Magento Enterprise, you just turn on Full Page Cache, under System > Cache Management. Simple as that! If your developers advise that it needs to remain off for X functionality to work, then get better developers.  If the TTFB is still slow after 3 or 4 requests, despite FPC being turned on, then it’s likely that a third party extension is preventing content being cached (or immediately invalidating the FPC entries). Engage your developers about this.

As far as the config goes, you can use the local.xml to configure <full_page_cache> independently from <cache> . Use a second Redis instance if you get plenty of traffic; it’s nicer for sys admin management and gives you an extra thread.
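A minimal local.xml sketch for that second instance, assuming Redis on the same box listening on a second port (adjust host, port and database to your setup):

<full_page_cache>
  <backend>Cm_Cache_Backend_Redis</backend>
  <backend_options>
    <server>127.0.0.1</server>
    <port>6380</port>
    <database>0</database>
    <compress_data>0</compress_data> <!-- FPC entries are often already compressed; avoid doing it twice -->
  </backend_options>
</full_page_cache>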

Community Edition 1.x

Third party code

As with most Enterprise features, there are plenty of Community extensions to replicate Full Page Cache. Usually though, you need a skilled Magento developer for good integration. In the case of FPC, it’s likely you’ll need to work on your templates for dynamic block placeholders.

Gordon Lesti’s free module is popular; Extendware, Mirasvit and Brim have good solutions for a small fee; there are countless others to choose from. If these extensions work for you, they’re often a great balance between performance and complexity.

UPDATE: Full article on how to configure Mirasvit FPC. It’s one of the best.

Varnish

Instead of a code-based FPC, it’s possible to use Varnish. Varnish can be insanely fast, but you need a Magento extension to integrate and manage it properly. Amongst their weaponry are such diverse elements as setting the right cache-control headers, TTLs, purging expired content, and giving you the all-important Varnish VCL.

Turpentine is the most feature-rich of the Varnish integrations, and works great under most conditions. But my one issue with it is that the necessary ESI requests for form_keys can happen many times per page (depending on your templates), and these really add up. I have seen those form_key requests alone overwhelm even a very large web infrastructure during high-traffic events like TV promotions and load testing.

My personal favourite is to use the Phoenix PageCache implementation for Varnish, with a few of my own VCL tweaks to sort out User-Agent normalisation, SSL termination, and one or two other bits. With PageCache, the form_keys are generated by an embedded C function, then stored in a header for later re-use. It’s a much more efficient way of dealing with Magento’s form_keys. The free PageCache module for Community Edition doesn’t handle hole-punching for logged-in users as Turpentine can, so everyone with a session bypasses Varnish, but that also makes it simpler to manage; you don’t need to mess with your templates. In my experience it does the best job of handling load spikes and gives you the fewest headaches.

 

Varnish is not a silver bullet

If configured badly, it can cause you no end of problems. Broken sessions, add-to-cart not working, seeing someone else’s shopping cart, and generic 503 errors are very common. On the other hand, a stray Vary: header here, or an unnecessary session_start() there, and the site will seem to work but Varnish probably won’t be caching much.   Varnish can make your site blisteringly fast, but you need some solid experience for a successful implementation.

Do: talk to your dev agency and/or hosting provider for advice and expertise.

Don’t: follow a beginners’ how-to and hope for the best.

20 Apr

Foolproof Magento Indexing

This is for Community Edition and Enterprise Editions before 1.13.

Once you have established whether Magento indexing is breaking your site, here is the simple 1-2-3 solution.

Generally, reindexing in the daytime on a busy site can cause problems, and by default Magento will fully reindex after any product/catalogue changes. The gist of this is that you probably don’t want that to happen in peak business hours.

1. Manual indexes.

Two of the indexes are more likely to cause you problems than any of the others – the URL rewrites and the Fulltext search. Set them to manual – the others should be OK.

[Screenshot: Index Management with Catalog URL Rewrites and Catalog Search Index set to Manual Update]

System > Index Management

Alternatively you can set this directly in the database:

mysql> UPDATE index_process SET mode="manual" WHERE indexer_code="catalog_url";
mysql> UPDATE index_process SET mode="manual" WHERE indexer_code="catalogsearch_fulltext";

2. Configure a cron job to do that manual reindex, every day.

crontab -e -u username

username is the user which runs your PHP-FPM, or just apache for mod_php. I try to avoid having root run these jobs; it creates lock files under the Magento var/ directory which the application user will not be able to work with.

Your added cron job should look something like this:

@daily /usr/bin/php /path/to/magento/documentroot/shell/indexer.php reindexall >/dev/null 2>&1

I’ve used @daily as a cron shortcut, which is usually midnight (server timezone). You could be more specific if you like, for example if you need to avoid other jobs like database backups. This is in addition to the normal Magento cron running every 5 minutes.
Obviously you need to replace /path/to/magento/documentroot with whatever’s relevant in your hosting environment.

If you don’t have access or confidence to do this via SSH, your hosting provider should be able to help.

3. Ignore the banner.

[Screenshot: the Magento admin reindex notification banner]
Might seem like a silly thing to mention, but I’ve often seen cases where a diligent member of staff was following the advice and doing the reindex, unaware that it was causing problems and that it would be done by the cron job anyway. If you have a large team of admin staff, just be sure to let them all know.

 

 

Business critical updates?

When I suggest this, I’m often greeted with something like, “..but it’s absolutely essential that new products are searchable and available via their URLs IMMEDIATELY!”

You have a few choices here:

  1. Think about your business requirements vs. impact vs. cost. Do you really need that? All the time? If it’s just occasionally, then continue as above and deal with the occasional manual reindex in the daytime.

  2. Third party code. This extension claims to do the job. There are probably others, too. I can’t vouch for it because I’m not really a developer. As with all third party extensions, the fewer the better and, of course, YMMV.

  3. Buy the Enterprise Edition. Or upgrade if you’re on EE < 1.13. There are plenty of other reasons for this, but index management is a major factor. If you have enough products for indexing to be an issue, and it really is business critical that your indexes are up-to-the-minute fresh, and you need vendor escalation with your software, then it’s a no-brainer. Talk to your finance director, bite the bullet, and invest in software that does the job out of the box.
27 Mar

Is Magento indexing breaking my site?

Magento indexing in Community Edition 1.x (and older EE ≤ 1.12) is an absolute train wreck.

Sooner or later it’s going to take your site down, or key parts of it like searching and checkout.

This article is about how to tell whether or not this is a problem for your site.

The worst thing is that the default behaviour is to reindex every time you save a product. Chances are this will be business hours, when your content editors are working on products. Not good.

[Screenshot: System > Index Management]

 

The two that usually take the longest are:

  1. “Catalog URL Rewrites”: This holds all references to every product, including old products if you’re keeping links for SEO.
  2. “Catalog Search Index”: By default this is a MyISAM table with a FULLTEXT index, so the whole table gets locked during reindexing. In MySQL 5.6 variants we can switch this table to InnoDB for row-level locking (see the one-liner after this list), but there is usually still disruption.
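The one-liner for point 2, if you’re on MySQL 5.6+ (or a Percona/MariaDB build with InnoDB FULLTEXT support) – test it on staging first and take a backup, as with any engine change:

mysql> ALTER TABLE catalogsearch_fulltext ENGINE=InnoDB;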

 

How long does it take?

Longer for larger catalogs, but you can find out from the database:

mysql> SELECT * FROM index_process;

Or, a quicker way:

mysql>  SELECT indexer_code, TIMESTAMPDIFF(SECOND, started_at, ended_at) as duration from index_process ORDER BY duration DESC;
+---------------------------+----------+
| indexer_code              | duration |
+---------------------------+----------+
| catalog_url               |       16 |
| catalog_product_attribute |        2 |
| catalog_category_product  |        2 |
| catalogsearch_fulltext    |        2 |
| catalog_product_price     |        1 |
| catalog_product_flat      |        0 |
| catalog_category_flat     |        0 |
| cataloginventory_stock    |        0 |
| tag_summary               |        0 |
+---------------------------+----------+
9 rows in set (0.00 sec)

This was just using the stock sample data, and took 16 seconds for the catalog_url index. Most real-world shops will take several minutes; half an hour is pretty normal. I’ve seen it take hours, where there are tens of thousands of SKUs.

Tell-tale signs

When indexing is causing a problem, you’re most likely to see Magento errors like this in your /var/report/ directory:

SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction

Anyone on StackOverflow might suggest increasing the innodb_lock_wait_timeout, but I don’t think that helps. Here’s why:
The default timeout is 50 seconds, but a lot of browsers, reverse proxies, etc. tend to time out after 30 seconds, so you’re likely to see a 503, “service unavailable”, “bad gateway”, or something similar. Regardless, your customers are very unlikely to wait 50 seconds or longer for a page load. Increasing this value might make the SQL errors go away, but it doesn’t address the root cause.

More subtly, if you’re using Nginx, especially behind a reverse proxy, you might see 499 errors in your access log. Although not an official RFC-defined error code, it means Nginx gave up because the connection was terminated at the client end. I wanted to mention that because sometimes this is the only error you’ll see, even if Magento/PHP/MySQL aren’t throwing errors lower down the stack.

Timing is key

Two ways to see exactly when it’s happening.

mysql> SELECT * FROM index_process;

The timestamps in this table will show you the most recent runs, and also remind you which ones are set to update on save (mode=”real_time”).

  • # grep [Rr]eindex access_log

    This will show up exactly when the Admin Dashboard was used to do indexing.

    # grep [Rr]eindex access_log
    127.0.0.1 - - [19/Mar/2015:13:43:34 +0000] "GET /index.php/admin/process/reindexProcess/process/1/key/3587af3c674a88e5304db11774e36326/ HTTP/1.1" 302 - "https://www.domain.com/index.php/admin/process/list/key/1129d37eabefa571c1956a19f45f632b/" "User Agent String"
    127.0.0.1 - - [19/Mar/2015:13:43:34 +0000] "GET /index.php/admin/process/reindexProcess/process/1/key/3587af3c674a88e5304db11774e36326/ HTTP/1.1" 302 - "https://www.domain.com/index.php/admin/process/list/key/1129d37eabefa571c1956a19f45f632b/" "User Agent String"
    127.0.0.1 - - [19/Mar/2015:13:44:04 +0000] "POST /index.php/admin/process/massReindex/key/0d75d56b5e90243c0175922deecb4e43/ HTTP/1.1" 302 - "https://www.domain.com/index.php/admin/process/list/key/1129d37eabefa571c1956a19f45f632b/" "User Agent String"
    
  • “massReindex” shows up when using the bulk reindex option
  • “reindexProcess/process/N” is a single index refresh, where N corresponds to the ID in the index_process table.

So, take the timestamp in the log here, let’s say 13:44:04. I know from earlier that it takes about 23 seconds to get through all my indexes, and given that log entries are generally written when a request finishes, I can count backwards to work out that we could have had website disruption between 13:43:41 and 13:44:04. Not much of an issue for my test site, but in the real world we’re usually looking at several minutes – usually enough for your customers to get bored and shop somewhere else.

I’ve found this to be really useful when supporting customers wondering why their website broke at X time on X day.

The solutions?

… another post for another day. UPDATE: here’s that post.