Christopher D. Clack

Other Research | Publications
Consultancy | Financial Computing | Start-up IT Ventures
Teaching | Functional Programming | Databases | Student Projects
subglobal8 link | subglobal8 link | subglobal8 link | subglobal8 link | subglobal8 link | subglobal8 link | subglobal8 link

MemWise - Smart Memory for Smart Phones

Overview

Smart Phones are small, hand-held devices that integrate the functionality of a mobile phone with the functionality of a Personal Digital Assistant ("PDA"). These devices typically have additional hardware such as single-shot or video cameras, and MP3 players. Increasingly, Smart Phones are adopting an "open box" approach, where the applications are written by third-party software houses. Currently these applications are typically written in C++, but Java is fast becoming the favoured programming language in order to reduce the time to develop and deploy applications in the market. There is increasing competition pressure for more, more flexible, and better (larger) applications. Simultaneously, there is an increasing pressure for these applications to have better real-time response.

 

Problem

Applications developers rely on dynamic (C++) or automatic (Java) memory allocation to reduce errors and reduce development time. Unfortunately, these foundational memory allocation algorithms have a fundamental flaw: they can be fast, or they can be small, but not both !

Typically, a Smart Phone Operating System will incorporate a good general-purpose memory allocator that will be perfectly acceptable for applications that are efficient and fast most of the time. However, the phrase "most of the time" is a killer if there are real-time constraints. By fast we mean "have guaranteed fast response times in the worst case". Current algorithms can guarantee real-time response where there is plenty of available memory (say, twice as much as the program needs), but Smart Phones are by their nature small and therefore have constrained memory.

Solution

Following on from nearly 20 years research into systems software such as run-time systems for sequential and parallel architectures, and especially memory management algorithms for both sequential and parallel architectures, we have developed a new memory allocator that offers: (i) fast worst-case response times, and
(ii) small memory usage, by minimising fragmentation.

Results

We have run performance tests on a benchmark suite of 'generic' smartphone applications (for each of which we had access to the source code) - for example, an Agenda, Alarm, Address book and a Communications program. For each application, run with a repeatable interaction sequence, a trace file was generated that recorded the details of every memory allocation event (a request for more memory, a request to free memory, etc.). Our data included the memory operations of all user interface library routines as well as for the application code itself.

WinCE results

We first chose to determine the behaviour of a standard smartphone memory management system - the Microsoft Windows CE malloc subsystem. We used a cycle-accurate simulator to generate results for all the programs in the benchmark suite. The following sample graph gives a typical histogram of the observed response times (for all types of memory operation):

The above graph is typical of the consistent results obtained. Notice that the horizontal axis scales automatically and that, despite there being a reasonably good response time for most operations (with the best response times being 43 cycles), we consistently find "rogue" memory operations that take more than 1,000 times longer than the best response time - e.g. 47,227 cycles. These "rogue" operations tend to be malloc() and realloc() operations, with free() operations having better worst-case behaviour. The extremely slow response times in the worst case can manifest themselves to the smartphone user as irritating and unpredictable pauses in operation, which degrade user experience of the product.

47,000 cycles in itself is not a very long time - representing only 0.2 msecs for a 300MHz processor. However, these slow responses may occur in bursts, may interact poorly with the cache, may cause real-time deadlines to be missed thereby leading to poor behaviour, may interact poorly with the scheduler, and may cause problems during video playback. The unpredictability and burstiness of the slow operations is illustrated in the following graph:

The space (fragmentation) behaviour of the WinCE allocator was measured for each program in the benchmark suite. Typical behaviour is illustrated in the following sample graph which shows the difference between the memory requested by the program (red) and the memory actually used by the allocator (blue) as the program proceeds (the horizontal axis is measured in memory operations rather than seconds):

Our practical interest in the above graph is to compare the maximum amount of memory requested by a program at any time during its run with the maximum amount of memory actually used by the allocator at any time during its run. The difference is "wasted" memory that could be used, for example, to run other applications concurrently. The above sample indicates memory "waste" of roughly 75,000 bytes and is typical of the behaviour observed in all programs.

New Allocator Results.

Given the above results for an industry-standard allocator, our initial aim was to improve the worst-case response time considerably (e.g. a reduction in worst-case response times from 47,000 cycles to say 24,000 cycles would give a big improvement in end-user experience) whilst hopefully not losing too much in terms of best (or normal) response times and not increasing memory wastage.

The following graph illustrates the typical time performance of our new allocator on the same benchmark applications used for the WinCE tests.

Our results indicate that best-case response times have slowed slightly to 53 cycles (WinCE's best-case was 43 cycles), but worst-case response times have reduced massively to only 2,496 operations (WinCE's worst-case was 47,227 cycles). Thus, we have achieved a worst-case response time that is nearly twenty times faster than WinCE (!)

It is rare in computer science to achieve a massive benefit in the time-performance of mature technology without paying for it with a reduction in space-performance - so what has happened to the memory wastage? The following graph is typical of the observed behaviour:

The above graphs illustrate that memory wastage is only about 63% of that of WinCE (reduced from about 75,000 bytes to about 47,000 bytes).

Finally, has this massive improvement in worst-case response time been bought at the expense of massively increased code size for the memory management routines themselves? The answer is no - the compiled code for the WinCE malloc library was 228,176 bytes and the compiled code for our new malloc library was 223,908 bytes (in both cases, this includes a main() routines to read from the trace files, and additional code to capture memory usage statistics).

Summary:

On the tested applications and user-interface library routines, the new allocator provides outstanding improvement in worst-case response times for malloc() and realloc() (nearly twenty times better), with only a slight slowing of best-case response times. It also provides a reduction in memory wastage (reduced fragmentation).

Continuing Research:

  • improvements to the current algorithms to accommodate hard-real-time (i.e. extremely short) response times;
  • optimisations to the current algorithms to support allocations of extremely small blocks;
  • a theoretical model that can be used to reason about the fragmentation benefits of dynamic memory management;
  • the application of game theory to memory management, to provide an insight into the nature of fragmentation and how it can be improved (or made worse!) by a dynamic allocator.
   
   
About | Site Map | Contact | ©2005 Christopher D. Clack