SamKnows Website Performance Test: Part 2
At SamKnows, we have been hard at work developing a new website performance test, called the SamKnows Website Performance Test (we’re pretty “out there” with the name!).
In our previous blog post, we covered the overall concept. In this blog post we look at the technology behind it.
- Neither the ISP router (with our Router SDK integrated) nor the Whitebox has enough resources to run a full web browser.
- Different web content may be served based on geographical location. We need to initiate connections from the home for the correct content.
- There may be network address translation (NAT) in place, so the connections must originate from the client side.
- The system must be able to perform a large number of tests simultaneously.
- The test must be efficient and the duration of each test must be as short as possible.
- Some networks have severe bandwidth limitations, mostly in the uplink. Given this and the need for short test durations, we may need a shared cache.
All communication between the client and the server is based on the Website Performance Test Control Protocol (WTCP). A single TCP connection is used to send and receive messages and to multiplex multiple socket connections (relaying the browser’s socket on the Offload Server to the real Web Server via the router). To reduce unnecessary bandwidth consumption, the protocol supports compression (Deflate).
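To make the multiplexing idea concrete, here is a minimal sketch of how several relayed socket streams could share one connection. The frame layout (channel id, flags, length), the names `encode_frame` and `decode_frames`, and the use of zlib-wrapped Deflate are assumptions made for illustration; the actual WTCP wire format is not described in this post.

```python
import struct
import zlib

# Hypothetical frame header: channel id (2 bytes), flags (1 byte), payload length (4 bytes).
HEADER = struct.Struct("!HBI")
FLAG_DEFLATE = 0x01


def encode_frame(channel: int, payload: bytes) -> bytes:
    """Wrap one chunk of a relayed socket stream into a frame."""
    compressed = zlib.compress(payload)
    # Only send the compressed form when it actually saves bytes.
    if len(compressed) < len(payload):
        return HEADER.pack(channel, FLAG_DEFLATE, len(compressed)) + compressed
    return HEADER.pack(channel, 0, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Split received bytes into (channel, payload) pairs plus any leftover bytes."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        channel, flags, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete frame, wait for more data
        payload = buffer[offset + HEADER.size:end]
        if flags & FLAG_DEFLATE:
            payload = zlib.decompress(payload)
        frames.append((channel, payload))
        offset = end
    return frames, buffer[offset:]
```

Each browser socket would get its own channel id, so the receiver can route payloads back to the right wrapped connection.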
There are 7 phases to the test sequence.
Relay phase
The first phase of the test is to determine which objects the test client needs to measure. This relay phase involves no measurement itself - it simply works out what objects the client needs to fetch and measure later on.
When the browser sets up a connection to a particular web server it must first look up the DNS name. This functionality is wrapped and ends up asking the client side to look up the IP (relayed DNS request). DNS lookup time is stored on the client side until it is needed later in the test.
At this point it is possible to establish a socket connection to the web server. This is also wrapped: instead of passing the stream data to the kernel (where it would normally pass through the TCP stack and leave as IP packets), the stream data is relayed to the client side. The client then establishes a real TCP connection to the web server and sends the stream data to the server.
When the server responds, the client takes the incoming stream data and sends it back to the offload server, which in turn feeds the data back to the browser's wrapped socket connection.
Typically the initial stream data to/from the web server is the SSL negotiation. Later an HTTP request header is sent and the server replies with an HTTP response header and body. The bi-directional TCP stream relay forwards any data that is needed over the TCP connection. The same is done for all parallel relay connections that the browser chooses to establish.
As mentioned above, the relay phase involves loading the entire web page in the browser. While objects are being loaded, the complete HTTP request/response header/body pairs are stored on the server, as they are needed later. This is done by wrapping the HTTP layer and storing everything in a data cache located in the main offload server process.
Due to the increased latency (especially for connections with slow uplink bandwidth), the relay phase may take a considerable amount of time. This puts a limit on how many tests the server can handle. For this reason, the offload server makes use of the caching fields in the HTTP response header. If data from a previous test can be re-used, the object does not need to be fetched over the tunnel. Typically the majority of objects are cacheable. What remains for each test is to fetch non-cacheable objects and to relay DNS requests (not needed for the relay itself, but the measurements are needed in later phases). Note that even if objects are taken from the cache while producing the list of needed objects, they will always be fully measured on the client side later.
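As a rough illustration of the cache decision, the sketch below checks the `Cache-Control` response header to decide whether a stored object may be re-used. The function name `is_reusable` and the deliberately conservative rules (only `max-age` is honoured, and anything marked `no-store`, `no-cache` or `private` is always fetched again) are assumptions; the real offload server may apply the HTTP caching rules more completely.

```python
def is_reusable(response_headers: dict, stored_at: float, now: float) -> bool:
    """Decide whether a stored response may be re-used instead of relayed again.

    response_headers maps lower-case header names to values;
    stored_at and now are timestamps in seconds.
    """
    cache_control = response_headers.get("cache-control", "").lower()
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name] = value.strip('"')
    # A shared cache must not re-use these responses without going back to the origin.
    if any(d in directives for d in ("no-store", "no-cache", "private")):
        return False
    max_age = directives.get("max-age", "")
    if not max_age.isdigit():
        return False  # no explicit freshness lifetime: be conservative and fetch again
    return (now - stored_at) < int(max_age)
```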
Transfer measurement phase
When the web page is fully loaded, a complete list of URLs is available. The list of URLs, together with the saved HTTP request headers, is sent to the client side so that the transfer measurements can be carried out. The client imitates a browser by carrying out the transfers in parallel. Detailed measurements of TCP connection time, SSL negotiation time, waiting time and transfer time are saved on a per-object basis.
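The parallel, per-object nature of this phase can be sketched as follows. This is only an outline: `measure_transfers`, the `fetch` callable and the limit of six parallel transfers are hypothetical, and only a total time is recorded here, whereas the real client records connection, SSL, waiting and transfer times separately.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_transfers(urls, fetch, max_parallel=6):
    """Fetch every URL in parallel and record simple per-object statistics.

    fetch is any callable taking a URL and returning the body as bytes.
    """
    def timed(url):
        start = time.perf_counter()
        body = fetch(url)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return url, {"bytes": len(body), "total_ms": elapsed_ms}

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(timed, urls))
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library.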
Replay phase
When the client has finished recording its measurements, it asks the offload server to replay the web site using the DNS lookup timing and transfer measurements. The server starts a new web browser session. This time, while in replay mode, it does not fetch anything from the internet or make any external communications. The HTTP layer is wrapped and for every object that is requested the data is taken from the data cache.
Due to the multi-process design of modern browsers, the data should not be fetched on the fly from the main offload server process, as this would introduce lookup latencies. Instead, when the top-level object is requested, the previously generated URL list is used to fetch all needed objects and put them in a process-local data cache (a hash table), which has minimal lookup overhead. Once all of the objects have been fetched and stored in the hash table, the page load measurement begins. Each object is given to the browser based on its individual measured latency from the user's network. The same goes for the DNS lookup time the first time an FQDN is accessed.
Whilst the page is being loaded in the browser during the replay phase, screenshots of the browser window are taken every 200ms and stored in memory. When the page is fully loaded, the screenshots are encoded in PNG format and stored on the file system.
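The timing rule described above (each object is delayed by its measured latency, plus the DNS lookup time the first time a host name is seen) could be expressed roughly like this. The class name `ReplaySchedule` and the millisecond dictionaries are hypothetical, chosen only to illustrate the idea.

```python
from urllib.parse import urlsplit


class ReplaySchedule:
    """Work out how long to hold back each object during replay."""

    def __init__(self, object_latency_ms, dns_lookup_ms):
        self.object_latency_ms = object_latency_ms  # URL -> latency measured by the client
        self.dns_lookup_ms = dns_lookup_ms          # host name -> measured DNS lookup time
        self.resolved_hosts = set()

    def delay_for(self, url):
        host = urlsplit(url).hostname
        delay = self.object_latency_ms[url]
        # The DNS cost is only paid the first time a host name is accessed.
        if host not in self.resolved_hosts:
            self.resolved_hosts.add(host)
            delay += self.dns_lookup_ms.get(host, 0.0)
        return delay
```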
As a final step, a replay report is generated for the entire session as well as replay timing measurements for each object.
Image and video generation phase
Next is the image analysis phase. Screenshots are analysed to determine page load progress. The elapsed times for 'first visual', 'visually useful', and 'visually complete' are identified, as well as the overall page load progress expressed as a percentage.
With the available information, a video is generated where screenshots are mapped to a high frame-rate video timeline. An on-screen display of a ticking timestamp, page load progress, and measurement timestamps is also included. The video gives a visual feel for how fast the page loads.
HAR file generation phase
By combining all of the available measurements, an HTTP Archive (HAR) file is generated. The network characteristics from the user’s network, the HTTP request and response header/body pairs from the data cache, the replay report, and the conclusions from the image analysis are all used to produce the HAR file. It allows detailed inspection of the timing and complete content of the page load operation using third-party tools.
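For readers unfamiliar with the format, the sketch below shows how per-object timings might map onto a HAR 1.2 entry. The helper names `har_entry` and `har_document`, the creator name, and the placeholder values are assumptions. It also assumes TCP connection time and SSL negotiation time are measured separately, since HAR 1.2 expects the `connect` timing to include the `ssl` time.

```python
import json


def har_entry(url, started, dns_ms, connect_ms, ssl_ms, wait_ms, receive_ms,
              status=200, body_size=0, mime_type="text/html"):
    """Build one HAR 1.2 entry from per-object measurements (times in ms)."""
    timings = {
        "blocked": -1,                   # -1 means "not applicable" in HAR
        "dns": dns_ms,
        "connect": connect_ms + ssl_ms,  # HAR 1.2 counts SSL time inside connect
        "send": 0,                       # assumption: send time is not measured separately
        "wait": wait_ms,
        "receive": receive_ms,
        "ssl": ssl_ms,
    }
    total = sum(timings[k] for k in ("dns", "connect", "send", "wait", "receive"))
    return {
        "startedDateTime": started,
        "time": total,
        "request": {"method": "GET", "url": url, "httpVersion": "HTTP/1.1",
                    "cookies": [], "headers": [], "queryString": [],
                    "headersSize": -1, "bodySize": 0},
        "response": {"status": status, "statusText": "", "httpVersion": "HTTP/1.1",
                     "cookies": [], "headers": [],
                     "content": {"size": body_size, "mimeType": mime_type},
                     "redirectURL": "", "headersSize": -1, "bodySize": body_size},
        "cache": {},
        "timings": timings,
    }


def har_document(entries):
    """Wrap entries in the top-level HAR log object."""
    return {"log": {"version": "1.2",
                    "creator": {"name": "example-sketch", "version": "0.0"},
                    "entries": entries}}


entry = har_entry("https://example.com/", "2020-01-01T00:00:00.000Z",
                  dns_ms=25, connect_ms=30, ssl_ms=45, wait_ms=80, receive_ms=120)
har_text = json.dumps(har_document([entry]), indent=2)
```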
File upload phase
The video file and HAR file are too large to be stored in a database. They are also likely to become less relevant over time and may be purged after a period of time. The files are uploaded to a file storage repository using a unique test session prefix. Each resource can be accessed via a public URL.
CSV result generation phase
The final output of the test is the CSV data on the client side. It consists of overall test results as well as per-object statistics, which are later stored in a columnar database for data analysis via SamKnows One.
Website Performance Test implementation
WebKit is the browser engine that we use for the Website Performance Test. It was historically used in Chrome and Android and is currently part of modern iPhones, tablets, and desktops. The source code is available and, in particular, the underlying dependencies are also built as part of the build system (rather than merely assumed to be installed on the system). This makes wrapping requests and responses easier, which is an essential part of the project. For HTTP monitoring and replay, libsoup is wrapped to enable communication with the client. For DNS lookup and TCP stream wrapping, glib is wrapped.
Various system libraries have been wrapped, allowing us to manipulate the data and response times delivered to WebKit without having to modify WebKit itself. Ideally, to reduce complexity, wrapping should be done in the lowest level of WebKit before calling external libraries, although the implementation would be similar in terms of the message passing needed between the client and the Offload Server.
State data and storage of HTTP data
State data is stored in hash tables to improve performance. Some tables need multiple keys (e.g. one key to identify the client connection, another to identify a socket wrapper process). For this reason a multi-key hash table has been implemented, which essentially is one hash table per key pointing to a shared record with a life cycle as long as the last remaining key. The data structures have been implemented specifically to reduce dependencies on the embedded client side.
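The "one hash table per key, pointing at a shared record" idea can be sketched in a few lines. This Python version is an illustration only: the class `MultiKeyTable` and its method names are hypothetical, and a native implementation would need explicit reference counting where Python relies on its garbage collector.

```python
class MultiKeyTable:
    """One index per key type, all pointing at the same shared record."""

    def __init__(self, key_names):
        self.indexes = {name: {} for name in key_names}

    def insert(self, keys, value):
        # The record remembers which keys still refer to it.
        record = {"value": value, "keys": dict(keys)}
        for name, key in keys.items():
            self.indexes[name][key] = record
        return record

    def lookup(self, name, key):
        record = self.indexes[name].get(key)
        return None if record is None else record["value"]

    def remove_key(self, name, key):
        """Drop one key; return True when it was the last key of its record."""
        record = self.indexes[name].pop(key, None)
        if record is None:
            return False
        del record["keys"][name]
        return not record["keys"]
```

A session could then be found either by its client connection or by its socket wrapper process, and it disappears only once both keys are gone.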
Each test cycle is described as a test session and all required state data is stored in a hash table. HTTP related data is stored in the data cache, which allows partial re-use of previous test runs when allowed by the HTTP caching rules. The data is used during both relay and replay. The tables separate HTTP request headers, HTTP response headers, and HTTP response bodies.
Some websites append random values or timestamps to URLs on the client side to avoid caches. This causes some table lookups to fail, since the exact URL is not present in the table. To solve this problem, we use fuzzy matching as a fallback when an exact lookup fails. If the URL is close enough, a representative object can be fetched from the cache. Typically the only difference is the timestamp in the URL. This is implemented by computing the Levenshtein distance, an edit-distance measure of string similarity.
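A minimal sketch of this fallback is shown below. The Levenshtein routine is the standard dynamic-programming formulation; the `fuzzy_lookup` helper, the dictionary-based cache and the `max_distance` threshold of 16 are assumptions made for the example, not the values used in production.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def fuzzy_lookup(cache: dict, url: str, max_distance: int = 16):
    """Try an exact lookup first, then fall back to the closest cached URL."""
    if url in cache:
        return cache[url]
    best_url, best_distance = None, max_distance + 1
    for candidate in cache:
        distance = levenshtein(url, candidate)
        if distance < best_distance:
            best_url, best_distance = candidate, distance
    return cache[best_url] if best_url is not None else None
```

A linear scan over all cached URLs is the simplest approach; a larger cache would probably want to narrow the candidates first, for example by host name.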
Image analysis and video generation
Periodic screenshots are recorded during the replay phase. They are later analysed to compute various metrics (such as the time to first visual). The PNG files need to be compressed in order to reduce the amount of storage space they use, which also makes it possible to identify duplicates. The images are first filtered for uniqueness, which is decided based on their size (a single byte of difference represents a unique screenshot). Duplicated files are ignored, which is a common situation when the web engine doesn’t render anything new after repeated 200ms iterations.
The images that remain are analysed more deeply. This is done using a combination of pixel-accurate delta calculations and structural dissimilarity (DSSIM).
DSSIM is a metric widely used in the broadcast, cable and satellite industries. The quality of the picture is measured against a perfect uncompressed and distortion-free reference. The measurement focuses on the overall structure of the image, which is similar to how a human detects objects in a picture. In this project the reference picture is another screenshot, so the result is a measurement of the difference in load progress between the two.
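The size-based filter is simple enough to sketch directly. The function name `unique_screenshots` is hypothetical, and comparing each screenshot only with the previously kept one is an assumption about how duplicates are detected.

```python
def unique_screenshots(shots):
    """Keep a screenshot only when its encoded size differs from the last kept one.

    shots is a list of (timestamp_ms, png_bytes) tuples in capture order.
    """
    unique = []
    last_size = None
    for timestamp_ms, png in shots:
        if len(png) != last_size:
            unique.append((timestamp_ms, png))
            last_size = len(png)
    return unique
```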
The choice of DSSIM implementation has changed in this project. Currently OpenCV is used, which is a highly efficient library commonly used for computer vision projects.
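To give a feel for the metric, here is a simplified, pure-Python DSSIM computed over a whole grayscale image at once. This is not the OpenCV-based implementation mentioned above: practical implementations usually compute SSIM over small sliding windows and average the results. The function name `dssim` and the flat pixel-list input are assumptions for illustration.

```python
def dssim(x, y, dynamic_range=255.0):
    """Global structural dissimilarity of two equal-length grayscale pixel lists.

    Uses the usual SSIM definition and DSSIM = (1 - SSIM) / 2,
    so 0.0 means the images are structurally identical.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    ssim = ((2 * mean_x * mean_y + c1) * (2 * covariance + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))
    return (1.0 - ssim) / 2.0
```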
Also, the GDK library is used for image file access and manipulation. Image delta has been implemented specifically for this project, which also supports using a pixel threshold for ignoring minor pixel differences. The delta is represented with bounding rectangles.
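A pixel delta with a threshold and a bounding rectangle might look like the sketch below. It works on plain lists of grayscale rows rather than GDK pixbufs and returns a single rectangle, both of which are simplifications; the name `delta_bounding_box` is hypothetical.

```python
def delta_bounding_box(image_a, image_b, threshold=0):
    """Return (left, top, right, bottom) of the changed area, or None if unchanged.

    Images are equal-sized lists of rows of grayscale values. Differences
    no larger than threshold are ignored. Coordinates are inclusive.
    """
    left = top = right = bottom = None
    for y, (row_a, row_b) in enumerate(zip(image_a, image_b)):
        for x, (pixel_a, pixel_b) in enumerate(zip(row_a, row_b)):
            if abs(pixel_a - pixel_b) > threshold:
                left = x if left is None else min(left, x)
                right = x if right is None else max(right, x)
                top = y if top is None else top  # rows are visited top to bottom
                bottom = y
    if left is None:
        return None
    return (left, top, right, bottom)
```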
During the browser's page rendering process, the engine may shift web content down graphically as it receives more data. Progress calculations compare against the final image. If shifting has occurred, the DSSIM calculation will detect a large structural difference, even though a human would see almost no visual difference.
This is corrected by performing image shift compensation. The shift is first detected by repeatedly comparing scanlines between both images. For each scanline, a forward scan is performed to determine the shift length. By building a histogram of the shift lengths, it is possible to determine whether the shift is significant and which vertical range it covers. Then a new image is generated in which the affected area is shift compensated, so that the image structure is aligned with the reference screenshot when the two are compared. The DSSIM calculation is now representative of how a human perceives the image difference.
The output of the image analysis is the time to 'first visual', the time to 'visually done', and all intermediate visual progress steps between the screenshots. These results are then used as input for the video generation.
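The detection step can be sketched as follows, under some simplifying assumptions: images are lists of scanlines that can be compared for equality, only downward shifts are considered, and the hypothetical parameters `max_shift` and `min_rows` decide how far to scan and when a shift counts as significant.

```python
from collections import Counter


def detect_vertical_shift(before, after, max_shift=64, min_rows=3):
    """Find the dominant downward shift between two images.

    Returns (shift, first_row, last_row) in before coordinates, or None
    when no significant shift is found.
    """
    shifts = {}  # row index in before -> detected shift length
    for y, row in enumerate(before):
        if y < len(after) and after[y] == row:
            continue  # scanline unchanged, nothing to compensate
        # Forward scan: look for the same scanline further down in after.
        for shift in range(1, max_shift + 1):
            if y + shift >= len(after):
                break
            if after[y + shift] == row:
                shifts[y] = shift
                break
    histogram = Counter(shifts.values())
    if not histogram:
        return None
    shift, count = histogram.most_common(1)[0]
    if count < min_rows:
        return None
    rows = [y for y, s in shifts.items() if s == shift]
    return shift, min(rows), max(rows)
```

Real pages contain many identical scanlines (blank backgrounds, for instance), so a production version would need to be more careful about false matches than this sketch is.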
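Given a list of progress values per screenshot, extracting the milestones is straightforward. In this sketch, the function name `visual_milestones` and the rule that 'first visual' is the first screenshot with non-zero progress are assumptions.

```python
def visual_milestones(progress):
    """Return (first_visual_ms, visually_done_ms) from time-ordered progress samples.

    progress is a list of (time_ms, percent) tuples sorted by time.
    """
    first_visual = next((t for t, percent in progress if percent > 0), None)
    visually_done = next((t for t, percent in progress if percent >= 100), None)
    return first_visual, visually_done
```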
The screenshot frequency is much lower than the video playback rate, so the screenshots are mapped to the video timeline. The implementation is flexible and can produce playback at any frame rate, but currently 30fps is used. The reason for such a high frame rate is the smoothness of the ticking timestamp, which has millisecond resolution. A very small amount of randomness is added to the timestamp so that it looks more representative; otherwise every frame would advance by exactly the same number of milliseconds.
PNG image generation for the video frames is performed with the help of libCairo. A vector graphics header is included in each frame, where timestamps and load progress are presented. Then, as time elapses, screenshots are added or updated to represent how the user would have seen the web page being loaded. Finally, FFmpeg is used to encode an x264 video file based on the generated image frames.
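Mapping sparse screenshots onto a dense frame timeline amounts to picking, for every frame, the most recent screenshot taken at or before that frame's timestamp. The sketch below shows one way to do that; the function name `frames_for_timeline` is hypothetical and the timestamp jitter is left out.

```python
from bisect import bisect_right


def frames_for_timeline(screenshot_times_ms, duration_ms, fps=30):
    """For every video frame, return the index of the screenshot to display.

    screenshot_times_ms must be sorted. Frames before the first screenshot
    map to None.
    """
    frame_count = int(duration_ms * fps / 1000) + 1
    mapping = []
    for frame in range(frame_count):
        frame_time_ms = frame * 1000.0 / fps
        # Latest screenshot whose timestamp is not after this frame.
        index = bisect_right(screenshot_times_ms, frame_time_ms) - 1
        mapping.append(index if index >= 0 else None)
    return mapping
```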
If you would like to talk to a member of our team about this performance test in more detail, please contact us here. The final blog post will be published next week.