25 years after X terminals pioneered the use of graphical remote display technologies, the industry still lacks a standard way of assessing the performance of remote display protocols and their associated endpoints. Standard benchmarking tools are readily available for all mainstream computing technologies: processors, GPUs, storage, networking, databases, Java, transaction processing systems, etc. There is even a benchmarking tool to assess storage technologies for their ability to support VDI-specific workloads (VDI-IOmark). Scapa Technologies and Login Consultants both offer excellent tools for performance testing Presentation Virtualization and VDI platforms. Scapa’s Test and Performance Platform (Scapa TPP) was designed from the outset to drive load and monitor application performance from the endpoint, providing a realistic indication of end-to-end system performance. In the same vein, Login Consultants has recently updated its Login VSI benchmarking tool to introduce client-side performance testing features for the first time, including the ability to assess image quality and loading times. However, neither solution really gets to the heart of the primary difference between local and remote execution – the changes in user experience that are a direct result of the use of remote display protocols and clients.
There was a time when this didn’t really matter that much; it was easy to remember: ICA, good; everything else, bad. But that’s no longer the case. When Citrix first introduced WinFrame there were, if not hundreds of companies manufacturing server hardware, certainly many tens, but just two companies making thin clients – Wyse and Fujitsu. Now the reverse is true; the server market has been whittled down to just 4 big names and several niche players, but the remote display technology ecosystem is in ferment. Not only are there many times more vendors, but they are building on more operating systems and on more hardware platforms than ever before. At the same time the underlying technology has never been more complex. Setting aside the X Window System’s own protocol and RFB, which is used by VNC and its many variants, I currently count at least 11 remote display protocols used by Presentation Virtualization and VDI products. All of them offer good performance under many circumstances, but few offer good performance under all. When WinFrame was launched, remote display requirements were limited to delivering Windows across LAN, WAN and dial-up links. Now we need to think not just about Windows, but about the individual Windows graphics APIs – DirectX, GDI, WMV, WPF, OpenGL (both hardware and software), QuickTime, Silverlight, Flash, etc. We also have to take into account countless variations of fixed and mobile networking technologies and consider a far broader range of device types and operating systems with many client variants at different stages of maturity. This year will be even more demanding. RemoteFX is beginning to make its presence felt and improvements in PCoIP performance are bearing fruit.
Dell has just bought Wyse, nComputing is expected to make a big push into the US enterprise market, Cisco’s new UC optimized thin clients are shipping, thin clients based on the HDX system-on-a-chip reference architecture are about to ship, HP is experimenting with some innovative thin-client system designs, and Via is looking to . More than that, VDI sales are increasing, the DaaS market is growing, mobile devices of every flavor are flooding the market, VMware AppBlaster is coming, and I would not be surprised if Citrix has something new to share as well.
With this level of complexity and this degree of change, the opportunity to make expensive mistakes has never been bigger. Measuring application performance is no longer enough; we need to be able to measure and report on user experience, and that’s not easy. A couple of years ago, Citrix’s Ken Staples commissioned a report that showed how ICA performed in comparison to PCoIP. Ken’s report used a testing methodology that borrowed heavily from a broadcast industry benchmarking standard – ITU-R BT.500 – Methodology for the subjective assessment of the quality of television pictures. This is a comparative blind testing methodology where test subjects are asked to compare two different systems and offer a subjective assessment as to which one they liked best. This approach has a lot to recommend it: instead of quoting facts and figures that are easy to manipulate or use out of context, this test went straight to the heart of the matter by asking the test subjects the single most important user experience question there is: “which do you like best?” This is not to say that the report was beyond reproach. There were certain aspects of it that were open to question, and it has to be said that blind testing with live human beings is an expensive proposition, making it difficult to justify as a regular activity. Still, it was by far the best attempt to date at measuring remote display technology user experience, and above all else it shows that a fresh approach can deliver results.
Taking this work as a starting point, and spurred on by the recent revitalization of the market, I’ve looked at what can be done to improve on current testing methods, and as it turns out the answer is a lot. After a couple of months of trials, I now believe that it is possible to develop a user experience benchmark that overcomes the shortcomings of ITU-R BT.500, directly addressing the problems of high cost, low consistency and lack of objectivity that this testing methodology is subject to. Furthermore, I believe it is possible to do so in a way that is completely technology neutral, enabling a common benchmark to be used for all remote display technologies with equal confidence.
The big challenge with remote display technologies is not to capture what happens on a virtual desktop, or within a remotely hosted application; that is relatively easy. The challenge is to capture what the user sees, then analyze and quantify it so that a straightforward numerical scoring system can be applied. In this regard, the ITU-R BT.500 assessment process takes the easy way out: it sits people down in front of a screen and asks them to compare different systems. The problem with this approach is that it is subjective, slow, inconsistent and expensive – not good attributes for a testing program. A better approach is to find a way to replace the test subjects with something more consistent and preferably cheaper. With that in mind, I’ve been experimenting with advanced image analysis software that can analyse not what the VDI platform renders in the data center but what a user sees, or more accurately what the endpoint (i.e. PC, thin client, tablet etc.) actually displays. By eliminating the human element from the benchmark, we eliminate subjectivity from the testing methodology; we also make the testing program repeatable, consistent and independently verifiable; and, equally importantly if widespread adoption is to be achieved, we directly address the high cost of manual assessment.
Using the test environment shown in Figure 1 and running a standard test workflow it’s possible to capture output direct from the server and on the endpoint. By comparing any variations in timing and image quality it is possible to precisely measure the impact that the remote display infrastructure has on the output quality. This information can be represented graphically as a variable over time or averaged to produce a single figure for any given workload. By repeating the same test sequence using multiple configuration options (remote display protocol, remote display protocol configuration parameters, thin client model or specification etc.) it is possible to quantify the relative performance of multiple system configurations.
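To make the idea of comparing server output with endpoint output concrete, here is a minimal sketch in Python. It is not the image analysis software used in these trials. It assumes frames have already been captured at both ends as equal-sized lists of greyscale pixel values, and that frames are paired by index so a late frame on the endpoint simply shows up as a mismatch. The function names and the mean absolute difference metric are illustrative assumptions only.

```python
from statistics import mean


def frame_deviation(reference, captured, max_value=255):
    """Mean absolute pixel difference between two frames, scaled to 0..1.

    Assumption: both frames are flat lists of greyscale values (0..max_value).
    0.0 means the endpoint frame is identical to the server frame.
    """
    if not reference or len(reference) != len(captured):
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(r - c) for r, c in zip(reference, captured))
    return total / (len(reference) * max_value)


def deviation_series(server_frames, endpoint_frames):
    """Per-frame deviation over time, pairing frames by capture index."""
    return [frame_deviation(s, e) for s, e in zip(server_frames, endpoint_frames)]


def workload_score(series):
    """Collapse the series into a single average figure for the workload."""
    return mean(series)


# Tiny hypothetical capture: three frames of four pixels each.
server = [[0, 0, 0, 0], [255, 255, 255, 255], [100, 100, 100, 100]]
endpoint = [[0, 0, 0, 0], [255, 255, 0, 0], [100, 100, 100, 100]]

series = deviation_series(server, endpoint)
print(series)  # [0.0, 0.5, 0.0]
print(workload_score(series))
```

The series can be plotted as a variable over time, while the average gives one number per configuration. A real benchmark would need a perceptually meaningful metric and explicit handling of timing skew, which this sketch deliberately leaves out.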
The graph in Figure 2 provides a frame by frame comparison of the performance of two different thin clients running a graphically intensive test workload for 30 seconds. If a thin client delivered output that was identical to that displayed in the data center, the graph would show a smooth horizontal line running along the X axis. In practice no thin client is likely to achieve this. Even with a fully lossless remote display protocol running on a close to zero latency network connection, some frames will not match perfectly due to timing differences. Cursory examination of the graph shows that both thin clients deviate from the baseline captured in the data center. Thin client 2 (red line) offered consistent performance for the duration of the test, with only minor deviations from the baseline, indicating that it maintained a close match to the display generated by the server. Thin client 1 did not do anything like as well; the high peaks indicate that it was struggling to keep up with the baseline for almost all of the test. However, more detailed examination shows that while the line graphing thin client 1’s performance has higher peaks, it also has lower troughs, indicating that it offered slightly better image quality when it was able to catch up. There’s a lot more to the testing methodology than this, but I think that this shows the direction I am heading well enough.
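The kind of reading applied to the graph (mean level, peaks, troughs) can also be expressed as a few summary figures. The sketch below is illustrative only: the two series are invented placeholder values chosen to mimic the shapes described above, not the measurements behind Figure 2, and the 0.1 threshold and the `summarize` helper are assumptions.

```python
from statistics import mean


def summarize(series, threshold=0.1):
    """Summarize a per-frame deviation series (0.0 = identical to baseline)."""
    return {
        "mean": mean(series),      # overall closeness to the baseline
        "peak": max(series),       # worst single frame
        "trough": min(series),     # best single frame
        # Fraction of frames that deviate noticeably (threshold is arbitrary).
        "share_over_threshold": sum(1 for d in series if d > threshold) / len(series),
    }


# Placeholder values, NOT real measurements.
client_1 = [0.02, 0.40, 0.01, 0.35, 0.01, 0.45]  # high peaks, low troughs
client_2 = [0.05, 0.06, 0.05, 0.07, 0.05, 0.06]  # consistently close

for name, series in (("thin client 1", client_1), ("thin client 2", client_2)):
    print(name, summarize(series))
```

With these placeholder numbers, thin client 1 has the higher mean and peak but the lower trough, which is the same pattern the graph shows: worse at keeping up, slightly better when it catches up.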
By comparing different configurations this way, it is possible to identify which offers the best user experience far more readily than is possible using any other technique. Building from this starting point, questions such as those shown below, which are today very difficult to answer, become far easier to address:
- Does a specific remote display protocol work well in a specific environment?
- Will enabling 256-bit encryption result in any perceptible performance loss?
- Will users notice a change in image quality if lossy compression is used?
- Will implementing QoS improve user experience?
- How will adopting Unified Communications impact thin-client performance?
- Which WAN accelerator delivers the best performance for a given remote display protocol?
- Which offers the best performance – a repurposed legacy PC or a new thin-client?
- Which of several thin-clients is best suited to a given workload?
- Do different user types need different hardware?
- Can I get the user experience I need on an iOS or Android tablet?
- Will spending an extra $50, $100, $200 per device result in appreciably better performance?
Above all else, this type of testing eliminates any guesswork as to the appropriateness of any one technology over another. Questions that might otherwise be too difficult or time consuming to answer with confidence can be addressed simply, quickly and with facts not opinion. And so, with this starting point I’m looking to take what I have learned and work towards releasing it as an open benchmark standard.
To clarify what I mean by an open benchmark standard, I want this work to be available to anyone who wishes to take advantage of it without payment of any licensing fee. This way, anyone with the necessary technical skills will be able to perform the testing themselves free of charge, provided they have access to the necessary test equipment. The image analysis software that I have been using in my preliminary investigations is rather costly; however, I am committed to working with the developers to see what can be done to repackage it in a way that makes it more affordable for this type of activity. At the same time, I believe that it is important to permit commercial use of the benchmark so that it may be incorporated into existing testing tools if the market exists.
My goals then for the project are as follows:
- Deliver a draft benchmark standard in time for the Citrix and VMware conferences in Barcelona this year, with a 1.0 release before year end.
- Make the benchmark available for use free of charge for both individual and commercial use.
- Ensure a minimum cost of entry. As I said, everything developed for the benchmark will be available free of charge, but it is not going to be possible to run the benchmark without using test and analysis tools. That said, I’m setting an initial goal that the benchmark should deliver sufficient value that someone deploying a solution supporting 500 or more users should be able to justify investing in the equipment needed to run the benchmark. Ideally I’d like to get this number down to 250, but 500 is good enough for now.
- Establish broad industry participation – I’d like to see as many industry stakeholders participate in the benchmark development as possible; that means software and hardware vendors, service providers, enterprise customers, etc. Obviously the greater the participation the better the chance that the benchmark will be adopted by the industry as a whole. I’ve already given informal briefings to some of the key industry stakeholders and based on initial feedback I think there is every chance of meeting this objective.
- Create a publicly accessible results repository.
- Continue the project through into 2013 with a 2.0 release offering support for a larger number of test scenarios and more advanced analysis capabilities.
However, having taken things this far, I’ve reached the point where I can’t move this any further forward without additional resources. So with that in mind, my next step will be to open the door to organizations who are willing to sponsor the project and participate in developing the benchmark. If you are interested in learning more about the project or willing to consider sponsoring its continued development, please get in touch.


Hi Simon,
I agree we need a (de facto) standard to qualify remote display technology. Not only would this make the discussions more objective, I’m confident it will improve the development of the protocols (and associated technology).
But in order to be useful, the result of the benchmark should make sense. Quantifying an outcome is one thing, but what does that number mean? Are you developing a benchmark that produces an overall result or multiple categorized results?
As Andrew wrote, I’ve done some research on the effect of response times in applications (caused by remoting protocols, virtualization, latency, etc.) on how an average user would grade them – http://www.papershare.com/app/paper.aspx?id=1069&o=817. This “Perceived Performance Index” produces a quantified number and a human-readable result like ‘excellent’, ‘good’ etc. This is “just” an indication of the user experience, which consists of many factors.
How do you expect to answer the question “Will spending an extra $50, $100, $200 per device result in appreciably better performance?”
Anyhow, two thumbs up for the initiative!
I’m looking forward to reading the draft as soon as it comes out so I can investigate if we can incorporate it in the Denamik LoadGen.
Best regards,
Ingmar Verheij
I’ve been giving this some thought and I’d like to suggest that delivery of a draft benchmark standard in time for the Citrix and VMware conferences in Barcelona this year (which is October) could be achievable, but you’re unusually vague on what happens between then and now. Between now and October there are a number of conferences and events where this project can be discussed and critiqued, to alpha and beta ideas into a scope for October.
Citrix Synergy in SF, happens early May
Briforum in London, Mid May
E2E in Vienna, Mid May
Tech Ed Amsterdam in June
..off the top of my head.
Are you offering yourself as a curator of this project Simon? I’d recommend you if it helps. I think there should be an engagement with these events to publicise the concept.
Given the current level of interest, I expect that I’ll be able to confirm that I have obtained both the sponsorship needed to proceed and sufficient SME participation to ensure that the benchmark will have teeth before the end of this week. Assuming I’m right, I don’t intend to walk this around any more conferences until there is a draft standard ready to review. I think that the development work can be done more quickly using remote collaboration tools than face to face, and it will also allow greater participation than anything centered on the conference circuit. Once we have something worth sharing I’ll look for opportunities to present at appropriate conferences – ideally Barcelona in October.
What happens between now and then is fully dependent on the overall level of support offered; the more support there is, the faster we can work to deliver something and the greater the benefit.
Looks like a great approach Simon. 2 areas I would consider adding. Sure crappy video sucks, but users get as upset with great video with broken or out of sync audio. Audio-Video sync should be part of the testing. Also, mouse lag is another issue that I don’t see addressed. Let me know if you want to discuss – we have done extensive testing across all protocols in our lab especially for WAN.
Dave
Desktone
Thanks Dave
A-V sync testing is something that I have looked at already and should not present a problem. I’ve not looked at capturing mouse pointer feedback performance yet, but I think I can see a way to do that without too much trouble.
We should discuss this week
Simon
This is potentially a really elegant solution for a difficult problem and I can see you’ve spent more than a little time putting it together.
It ought to rapidly become THE benchmark for comparisons of hardware (thin clients, WAN accelerators), but unfortunately things are apt to get a bit more interesting when we look at protocols.
Multimedia redirection (Flash, WMP), queue and toss (discard similar frames), variable frame rates (adaptive network optimization), error rates (retransmission), congestion are all going to produce differences between frame content and frame rates or create timeline skew (eg length of time to download multimedia content before displaying etc) between the data center- and WAN-connected devices. Some of the effects would be quite acceptable from a user assessment perspective despite looking bad from a frame analysis viewpoint.
That might leave us where we are with the existing tools, turning things off (eg bitmap caching) to get consistent results. It’s not an insurmountable problem, but it does make your analysis job a lot more complex.
But it’s most definitely worth doing. Nice Work.
Rick
Thanks Rick, coming from you that means a lot.
There’s no hiding that this is difficult, but I’m not afraid of standing on the shoulders of giants.
If I can be the catalyst behind a new way of looking at remote display technology performance, that will make it easier both to sell and to buy this technology, then I’ll be happy.
Of course, persuading Quest to formally back the project wouldn’t hurt either.
Regards
Simon
Hi Simon,
I find the image analysis approach interesting, yet one thing struck me. As far as I can see you’re looking at downstream display data only, right? Would it perhaps make sense to incorporate what’s going on upstream too? In other words, measure from the moment a key is struck on the endpoint device to the moment where a glyph is stamped into the display memory of the client. If measured on a near-zero latency connection (or as good as it can get if you stick the client back into the datacenter next to the host/server), I reckon you would be taking as much of the network performance out of the equation as possible – wouldn’t that yield some interesting measurements as far as device performance goes? I don’t know – perhaps someone is doing that already, just my $0.02
Thanks,
Max Ranzau
Eldergeek @ RES Software
Thanks Max
This is one of many questions that I would be looking to the broader community to help answer. Clearly generating workload on the endpoint is more accurate, but I don’t know yet how much value is derived from doing this. Given the ratio of upstream to downstream traffic, my initial assumption is that any disparity caused by lack of upstream traffic will be drowned out by the sheer volume of downstream traffic. Assuming that this is correct, it may be acceptable to consider downstream traffic only.
Most importantly, this is precisely why I want to engage a broad spectrum of subject matter experts to help define this benchmark. I know I don’t have all the answers, and I’m equally confident I don’t even know all the questions.
Simon
I believe interaction and measurement has to happen from the end point – and I believe that based on the studies published on and around user perception of performance, and just plain quality of results when doing tests between testing focused on the server and testing from end-points. As you’ve stated yourself Simon, to not do that is testing the delivery stack and I agree you want to minimise that as much as possible otherwise your comparison gets skewed.
As I’ve mentioned – get a hold of Ingmar’s whitepaper on measuring performance for interaction, and importantly the info it references around measuring user perception of ‘better’.
Doesn’t this mirror/complement the studies that Shawn Bass and Benny Tritsch have already done on protocol comparisons? (quoted here too http://blogs.citrix.com/2010/11/23/smile-if-youre-using-citrix-hdx/)
“the challenge is to capture what the user sees, analyze and quantify it so that a straightforward numerical scoring system can be applied” – this is indeed true: there’ve been some interesting studies and work around this – I know Ingmar Verheij from Pepperbyte has published a whitepaper on this topic (http://www.papershare.com/paper/UK-Citrix-User-Group-How-well-does-your-virtual-application-perform)
I can see the standard environment you’ve got – but is everything going to stay the same and only the broker going to change? How are you going to maintain a consistency if a remote protocol only supports a particular environment?
I’m unsure about each of your questions except “Will spending an extra $50, $100, $200 per device result in appreciably better performance?” – that would be very useful to consider, especially with a set of test scenarios – as long as everyone is clued into the concept of ymmv.
Still – it’d be a great comparison to have a benchmark to compare performance of end devices to and a standard methodology in considering results.
Andrew
It is very important to distinguish between those elements of an end to end environment that I’m looking to test and those that I consider out of scope. I don’t see any value in assessing the performance of XenDesktop with respect to View with respect to vWorkspace. That work has already been done. However, it is important for me to be able to measure the overall user experience of RemoteFX compared to PCoIP, ICA/HDX, SPICE, etc. So from that point of view, with RemoteFX only available as a component of Hyper-V, and PCoIP only available in conjunction with View, which is only available on ESXi, I will have to swap out the hypervisor and broker.
This means that care will have to be taken to minimize the impact of any changes in infrastructure components. I can think of a couple of ways to address this, but as I am looking to establish a collaborative approach to developing the benchmark, I’m not going to share those until an appropriate technical membership has been defined and then will endeavor not to let my proposals dominate any discussion. Having said that, given that the preliminary work I have done establishes a performance differential between endpoint and data center rather than between endpoint A and endpoint B, I believe that the benchmark should be immune to any problems brought about by changing the underlying infrastructure.
Regarding your second point, I think it is better to say not that your mileage may vary, but that your mileage WILL vary, and to ensure that the benchmark takes this into account so that end-user organizations have the capability to customize benchmarks according to specific needs, while at the same time requiring technology vendors to work with a standard benchmark that withstands independent scrutiny.
I think I like the sound of this, but I don’t understand. Are you going to give this away?
What’s to stop someone just taking all the work and copying it?
Ikon
The whole point is to make the information available free of charge to as many people as possible.
I won’t stop anyone from copying the work. I want them to. I want to see people adopt this work as a ‘standard’, build on it, share it and profit from it. The goal is to create high quality peer reviewed data that can be used to help people reach decisions about the appropriateness of remote display technologies for specific business needs.
If someone takes this work and builds a testing or validation business on the back of it, then I consider that a success. If Scapa, Login VSI, or anyone else incorporates this work into their testing products, then that too is a success. If Wyse, IGEL, nComputing, or Pano Logic publishes performance data based on this work, again that is a success.
Above all else, if more projects succeed because of this work, whether they implement VDI or not, then the work is justified, and if I get paid to help the occasional client select the right technology for their specific needs, well, that would be good too.
Simon