Developing a Remote Display Technology User Experience Benchmark

Twenty-five years after X terminals pioneered the use of graphical remote display technologies, the industry still lacks a standard way of assessing the performance of remote display protocols and their associated endpoints. Standard benchmarking tools are readily available for all mainstream computing technologies: processors, GPUs, storage, networking, databases, Java, transaction processing systems, etc. There is even a benchmarking tool to assess storage technologies for their ability to support VDI-specific workloads (VDI-IOmark). Scapa Technologies and Login Consultants both offer excellent tools for performance testing Presentation Virtualization and VDI platforms. Scapa’s Test and Performance Platform (Scapa TPP) was designed from the outset to drive load and monitor application performance from the endpoint, providing a realistic indication of end-to-end system performance. In the same vein, Login Consultants has recently updated its Login VSI benchmarking tool to introduce client-side performance testing features for the first time, including the ability to assess image quality and loading times. However, neither solution really gets to the heart of the primary difference between local and remote execution – the changes in user experience that are a direct result of the use of remote display protocols and clients.

There was a time when this didn’t really matter that much; it was easy to remember: ICA, good; everything else, bad. But that’s no longer the case. When Citrix first introduced WinFrame there were, if not hundreds of companies manufacturing server hardware, certainly many tens, but just two companies making thin clients – Wyse and Fujitsu. Now the reverse is true; the server market has been whittled down to just four big names and several niche players, but the remote display technology ecosystem is in ferment. Not only are there many times more vendors, but they are building on more operating systems and on more hardware platforms than ever before. At the same time the underlying technology has never been more complex. Setting aside X Windows and RFB, which is used by VNC and its many variants, I currently count at least 11 remote display protocols used by Presentation Virtualization and VDI products, all of which offer good performance under many circumstances, but few of which offer good performance under all. When WinFrame was launched, remote display requirements were limited to delivering Windows across LAN, WAN and dial-up links. Now we need to think not just about Windows, but about the individual Windows graphics APIs and media technologies – DirectX, GDI, WMV, WPF, OpenGL (both hardware and software), QuickTime, Silverlight, Flash, etc. We also have to take into account countless variations of fixed and mobile networking technologies and consider a far broader range of device types and operating systems, with many client variants at different stages of maturity. This year will be even more demanding. RemoteFX is beginning to make its presence felt and improvements in PCoIP performance are bearing fruit.
Dell has just bought Wyse, nComputing is expected to make a big push into the US enterprise market, Cisco’s new UC-optimized thin clients are shipping, thin clients based on the HDX system-on-a-chip reference architecture are about to ship, HP is experimenting with some innovative thin-client system designs, and Via is looking to . More than that, VDI sales are increasing, the DaaS market is growing, mobile devices of every flavor are flooding the market, VMware AppBlaster is coming, and I would not be surprised if Citrix has something new to share as well.

With this level of complexity and this degree of change, the opportunity to make expensive mistakes has never been bigger. Measuring application performance is no longer enough; we need to be able to measure and report on user experience, and that’s not easy. A couple of years ago, Citrix’s Ken Staples commissioned a report that showed how ICA performed in comparison to PCoIP. Ken’s report used a testing methodology that borrowed heavily from a broadcast industry benchmarking standard – ITU-R BT.500, “Methodology for the subjective assessment of the quality of television pictures”. This is a comparative blind testing methodology in which test subjects are asked to compare two different systems and offer a subjective assessment as to which one they liked best. This approach has a lot to recommend it. Instead of quoting facts and figures that are easy to manipulate or use out of context, this test went straight to the heart of the matter by asking the test subjects the single most important user experience question there is: “which do you like best?” This is not to say that the report was beyond reproach. There were certain aspects of it that were open to question, and it has to be said that blind testing with live human beings is an expensive proposition, making it difficult to justify as a regular activity. Still, it was by far the best attempt to date at measuring remote display technology user experience, and above all else it shows that a fresh approach can deliver results.

Taking this work as a starting point, and spurred on by the recent revitalization of the market, I’ve looked at what can be done to improve on current testing methods, and as it turns out the answer is a lot. After a couple of months of trials, I now believe that it is possible to develop a user experience benchmark that overcomes the shortcomings of ITU-R BT.500, directly addressing the problems of high cost, low consistency and lack of objectivity that this testing methodology is subject to. Furthermore, I believe it is possible to do so in a way that is completely technology neutral, enabling a common benchmark to be used for all remote display technologies with equal confidence.

The big challenge with remote display technologies is not to capture what happens on a virtual desktop or within a remotely hosted application; that is relatively easy. The challenge is to capture what the user sees, then analyze and quantify it so that a straightforward numerical scoring system can be applied. In this regard, the ITU-R BT.500 assessment process takes the easy way out: it sits people down in front of a screen and asks them to compare different systems. The problem with this approach is that it is subjective, slow, inconsistent and expensive – not good attributes for a testing program. A better approach is to find a way to replace the test subjects with something more consistent and preferably cheaper. With that in mind, I’ve been experimenting with advanced image analysis software that can analyze not what the VDI platform renders in the data center but what a user sees or, more accurately, what the endpoint (i.e. PC, thin client, tablet etc.) actually displays. By eliminating the human element from the benchmark, we eliminate any subjectivity from the testing methodology. We also make the testing program repeatable, consistent and independently verifiable and, equally importantly if widespread adoption is to be achieved, we directly address the high cost of manual assessment.

Figure 1

Using the test environment shown in Figure 1 and running a standard test workflow it’s possible to capture output direct from the server and on the endpoint. By comparing any variations in timing and image quality it is possible to precisely measure the impact that the remote display infrastructure has on the output quality. This information can be represented graphically as a variable over time or averaged to produce a single figure for any given workload. By repeating the same test sequence using multiple configuration options (remote display protocol, remote display protocol configuration parameters, thin client model or specification etc.) it is possible to quantify the relative performance of multiple system configurations.
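To make the comparison step concrete, here is a minimal sketch in Python. It is not the image analysis software used in these trials. It assumes that server and endpoint captures have already been aligned frame for frame, that each frame is a flat sequence of 8-bit grayscale pixel values, and that mean absolute pixel difference is an acceptable stand-in for the real image quality measure. The names `frame_deviation`, `deviation_series` and `workload_score` are hypothetical.

```python
from statistics import mean
from typing import List, Sequence

# Assumption: a frame is a flat sequence of 8-bit grayscale pixels (0..255).
Frame = Sequence[int]


def frame_deviation(reference: Frame, captured: Frame) -> float:
    """Mean absolute pixel difference, scaled so 0.0 means identical frames
    and 1.0 means every pixel differs by the full 0..255 range."""
    if len(reference) != len(captured):
        raise ValueError("frames must have the same number of pixels")
    if not reference:
        raise ValueError("frames must not be empty")
    total = sum(abs(r - c) for r, c in zip(reference, captured))
    return total / (255 * len(reference))


def deviation_series(server: Sequence[Frame], endpoint: Sequence[Frame]) -> List[float]:
    """Per-frame deviation of the endpoint capture from the server capture."""
    if len(server) != len(endpoint):
        raise ValueError("captures must contain the same number of frames")
    return [frame_deviation(s, e) for s, e in zip(server, endpoint)]


def workload_score(series: Sequence[float]) -> float:
    """Collapse a deviation series into a single average figure."""
    return mean(series)


# Made-up 4-pixel frames, purely for illustration.
server_frames = [[0, 0, 0, 0], [255, 255, 255, 255], [100, 100, 100, 100]]
endpoint_frames = [[0, 0, 0, 0], [255, 255, 0, 0], [110, 90, 100, 100]]

series = deviation_series(server_frames, endpoint_frames)
score = workload_score(series)
```

Plotting `series` against frame number gives the "variable over time" view, and `score` gives the single figure for the workload. A real implementation would also need to handle timing offsets between the two captures, which this sketch deliberately ignores.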

Figure 2

The graph in Figure 2 provides a frame-by-frame comparison of the performance of two different thin clients running a graphically intensive test workload for 30 seconds. If a thin client delivered output that was identical to that displayed in the data center, the graph would show a smooth horizontal line running along the X axis. In practice no thin client is likely to achieve this. Even with a fully lossless remote display protocol running on a close to zero latency network connection, some frames will not match perfectly due to timing differences. Cursory examination of the graph shows that both thin clients deviate from the baseline captured in the data center. Thin client 2 (red line) offered consistent performance for the duration of the test, with only minor deviations from the baseline, indicating that it maintained a close match to the display generated by the server. Thin client 1 did not do anything like as well; the high peaks indicate that it struggled to keep up with the baseline for almost all of the test. However, more detailed examination shows that while the line graphing thin client 1’s performance has higher peaks, it also has lower troughs, indicating that it offered slightly better image quality when it was able to catch up. There’s a lot more to the testing methodology than this, but I think that this shows the direction I am heading well enough.
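The way the graph is read above (average level, peaks, troughs) can be expressed as a small summary routine. The sketch below is illustrative only: the deviation values are invented rather than taken from Figure 2, the 0.2 threshold is an arbitrary assumption, and `SeriesSummary` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class SeriesSummary:
    average: float         # overall closeness to the server baseline
    peak: float            # worst single frame
    trough: float          # best single frame
    share_over_threshold: float  # fraction of frames above the threshold


def summarize(series: Sequence[float], threshold: float = 0.2) -> SeriesSummary:
    """Summarize a per-frame deviation series (0.0 = matches the baseline)."""
    if not series:
        raise ValueError("series must not be empty")
    over = sum(1 for d in series if d > threshold)
    return SeriesSummary(
        average=mean(series),
        peak=max(series),
        trough=min(series),
        share_over_threshold=over / len(series),
    )


# Invented values shaped like the behavior described in the text:
# client 1 swings between falling behind and catching up, client 2 stays steady.
thin_client_1 = [0.05, 0.60, 0.02, 0.70, 0.55, 0.01]
thin_client_2 = [0.10, 0.12, 0.11, 0.09, 0.10, 0.12]

summary_1 = summarize(thin_client_1)
summary_2 = summarize(thin_client_2)
```

With these numbers, client 1 has the higher average and peak but the lower trough, which is the same pattern described for the two lines in the graph. Which of these figures should carry the most weight in a final score is a design decision this sketch does not attempt to settle.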

By comparing different configurations this way, it is possible to identify which offers the best user experience far more readily than is possible using any other technique. Building from this starting point, questions such as those shown below, which are very difficult to answer today, become far easier to address:

  • Does a specific remote display protocol work well in a specific environment?
  • Will enabling 256-bit encryption result in any perceptual performance loss?
  • Will users notice a change in image quality if lossy compression is used?
  • Will implementing QoS improve user experience?
  • How will adopting Unified Communications impact thin-client performance?
  • Which WAN accelerator delivers the best performance for a given remote display protocol?
  • Which offers the best performance – a repurposed legacy PC or a new thin-client?
  • Which of several thin-clients is best suited to a given workload?
  • Do different user types need different hardware?
  • Can I get the user experience I need on an iOS or Android tablet?
  • Will spending an extra $50, $100, $200 per device result in appreciably better performance?

Above all else, this type of testing eliminates any guesswork as to the appropriateness of any one technology over another. Questions that might otherwise be too difficult or time consuming to answer with confidence can be addressed simply, quickly and with facts, not opinion. And so, with this starting point, I’m looking to take what I have learned and work towards releasing it as an open benchmark standard.

To clarify what I mean by an open benchmark standard, I want this work to be available to anyone who wishes to take advantage of it without payment of any licensing fee. This way, anyone with the necessary technical skills will be able to perform the testing themselves free of charge, provided they have access to the necessary test equipment. The image analysis software that I have been using in my preliminary investigations is rather costly; however, I am committed to working with the developers to see what can be done to repackage it in a way that makes it more affordable for this type of activity. At the same time, I believe that it is important to permit commercial use of the benchmark so that it may be incorporated into existing testing tools if the market exists.

My goals then for the project are as follows:

  • Deliver a draft benchmark standard in time for the Citrix and VMware conferences in Barcelona this year, with a 1.0 release before year end.
  • Make the benchmark available for use free of charge for both individual and commercial use.
  • Ensure a minimum cost of entry. As I said, everything developed for the benchmark will be available free of charge, but it is not going to be possible to run the benchmark without using test and analysis tools. That said, I’m setting an initial goal that the benchmark should deliver sufficient value that someone deploying a solution supporting 500 or more users should be able to justify investing in the equipment needed to run the benchmark. Ideally I’d like to get this number down to 250, but 500 is good enough for now.
  • Establish broad industry participation – I’d like to see as many industry stakeholders participate in the benchmark development as possible; that means software and hardware vendors, service providers, enterprise customers, etc. Obviously the greater the participation the better the chance that the benchmark will be adopted by the industry as a whole. I’ve already given informal briefings to some of the key industry stakeholders and based on initial feedback I think there is every chance of meeting this objective.
  • Create a publicly accessible results repository.
  • Continue the project through into 2013 with a 2.0 release offering support for a larger number of test scenarios and more advanced analysis capabilities.

However, having taken things this far, I’ve reached the point where I can’t move this any further forward without additional resources. So with that in mind, my next step will be to open the door to organizations that are willing to sponsor the project and participate in developing the benchmark. If you are interested in learning more about the project or willing to consider sponsoring its continued development, please get in touch.
