The three ARM Cortex-A (ARMv7) SoC generations and some GCC optimization options for

I started writing this post in autumn 2013. Some things have changed since then; those updates are in bold in the text.

Some references:
* ARM Compiler toolchain Assembler Reference, on the ARM documentation site.
* GCC ARM Options, also available in the gcc man page.

Translation not finished, work in progress

The three series of the ARM architecture

Let us start with the ARMv7 architecture, the last of the 32-bit architectures from the British company ARM. This architecture exists in three versions:
* Cortex-A (for application): general-purpose computer processors (smartphones, tablets, personal computers, servers).
* Cortex-M (for microcontroller): microcontrollers for embedded systems (home automation, electronics…).
* Cortex-R (for real-time): processors for the real-time world (robotics, transportation, etc.).

The next ARM architecture, ARMv8 (or AArch64 in the Linux world), is a 64-bit architecture. The counterpart of the Cortex-A series in ARMv8 is the Cortex-A50 series.

The three Cortex-A generations

The Cortex-A series, which this post focuses on, is divided into three generations. Each generation adds new features, improves energy efficiency, and raises the performance of its most powerful processor.

On the energy-efficient versions, some computing power is sacrificed, for example by shortening pipelines or reducing the total number of registers. In all cases, full compatibility is maintained between processors of the same generation, but software optimized for one of them will probably be less optimal on another.

* The first generation was limited to the Cortex-A8, with a single CPU core. It upgrades the ARMv6 SIMD to NEON (also called Advanced SIMD) and the vector floating-point unit to VFPv3, though in a light version (ten times slower than in later generations). It also adds Thumb-EE and improves Thumb-2, which lets it use 16-bit instructions to make code more compact and therefore more efficient in terms of cache usage and bandwidth.
* The second generation adds multiprocessor (multicore) support, though with only one type of core at a time. It still uses VFPv3, but in the full version. This generation includes the Cortex-A9, the most powerful one (called Cortex-A9 MPCore), and the Cortex-A5, the low-power version.
* The third generation adds hardware virtualization and LPAE, which extends the addressing range to 40 bits (up to 1 TB), because 32-bit addressing reaches only 4 GB (2^32 = 2^2 × 2^10 × 2^10 × 2^10 = 4 × 1024 × 1024 × 1024 = 4,294,967,296 bytes). There are three versions: the Cortex-A15, the most powerful; the Cortex-A7, the most energy efficient; and the upcoming Cortex-A12, intermediate in both computing power and energy consumption. (The Cortex-A12 has since been dropped in favor of the Cortex-A17, which offers more computing power for far less energy. It is used, for example, in the top-of-the-range Rockchip RK3288, which has 4 Cortex-A17 cores.) Finally, the big.LITTLE architecture is added, which puts processors of different power classes (such as the A7 and A15) on the same chip, to improve energy efficiency when little power is needed and computing power when it is. The floating-point unit is updated to VFPv4.
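The addressing figures in the last bullet can be checked with a few lines of Python. This is only a minimal sketch: the helper name `addressable_bytes` is made up for illustration, and GB and TB are taken in their binary sense (1024-based), as in the calculation above.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with `address_bits` bits."""
    return 2 ** address_bits

GIB = 1024 ** 3  # binary gigabyte
TIB = 1024 ** 4  # binary terabyte

plain_32bit = addressable_bytes(32)  # classic 32-bit addressing
lpae_40bit = addressable_bytes(40)   # 40-bit addressing with LPAE

print(plain_32bit, "bytes =", plain_32bit // GIB, "GB")
print(lpae_40bit, "bytes =", lpae_40bit // TIB, "TB")
print("LPAE range is", lpae_40bit // plain_32bit, "times larger")
```

The eight extra address bits multiply the reachable range by 2^8 = 256.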

Which one to choose?

Differences between implementations can give one model the advantage in some respects and another model the advantage in others.

For example, in terms of raw computing power, the Cortex-A8 is at about the same level as the Cortex-A7. The Cortex-A7 is far more energy efficient and better at floating point (VFPv4 vs. VFPv3 lite), but a bit less efficient for integer computation and general-purpose execution. The Cortex-A7 largely makes up for its lower performance in these areas with a lower power draw and, when needed, better performance through multicore.
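Since software tuned for one core may be less optimal on another, it helps to pass the target core to GCC explicitly. The sketch below builds the flag list for a few of the cores discussed here. The `-mcpu` and `-mfpu` values are ones GCC accepts for 32-bit ARM, but the mapping itself is my assumption: it supposes the SoC actually includes the NEON unit and that the system uses the hard-float ABI. The names `CORE_FLAGS` and `gcc_flags` are made up for illustration.

```python
# Assumed mapping: core name -> (-mcpu value, -mfpu value) for 32-bit ARM GCC.
# Supposes the NEON unit is present on the SoC.
CORE_FLAGS = {
    "cortex-a8": ("cortex-a8", "neon"),          # NEON with VFPv3
    "cortex-a7": ("cortex-a7", "neon-vfpv4"),    # NEON with VFPv4
    "cortex-a15": ("cortex-a15", "neon-vfpv4"),  # NEON with VFPv4
}

def gcc_flags(core):
    """Return GCC tuning flags for `core` (hard-float ABI assumed)."""
    mcpu, mfpu = CORE_FLAGS[core]
    return ["-mcpu=" + mcpu, "-mfpu=" + mfpu, "-mfloat-abi=hard"]

for core in CORE_FLAGS:
    print(core, "->", " ".join(gcc_flags(core)))
```

On a system using the soft-float calling convention, `-mfloat-abi=softfp` would be the value to use instead.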

The rest of this post has been updated to reflect the huge changes between October 2013 and August 2014.

big.LITTLE, a balance between very low power usage and good computing power

big.LITTLE architectures have a reputation for low efficiency. This is mainly because the first models released, in Samsung smartphones, could use either the 4 big cores or the 4 LITTLE cores, but not both at the same time. This was a software limitation, not a hardware one. At the time, the first big.LITTLE manager developed in the Linux kernel, called IKS (“In Kernel Switching”, also known as “CPU migration”), was still rudimentary and only allowed this kind of switch between cores. Meanwhile, the mode used today was being developed in the Linux kernel. It uses all the big.LITTLE cores and is called GTS (“Global Task Scheduling”). This mode allows one, a few, or all of the processor cores to be used simultaneously. Read this to go further. During this time, GCC's optimization for these architectures has also improved, via the -mcpu=cortex-a15.cortex-a7 option. Its counterpart for the 64-bit ARM architecture (ARMv8 or AArch64), which is now coming to market, is -mcpu=cortex-a57.cortex-a53.
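The two options above follow the same pattern: the big core's name, a dot, then the LITTLE core's name. As a small illustration, the sketch below builds the option string and rejects any pair other than the two mentioned in this post. The names `BIG_LITTLE_PAIRS` and `big_little_mcpu` are made up for the example, and the list of pairs is deliberately limited to those two rather than everything GCC might accept.

```python
# Only the two big.LITTLE pairs named in this post.
BIG_LITTLE_PAIRS = {
    ("cortex-a15", "cortex-a7"),   # 32-bit ARMv7
    ("cortex-a57", "cortex-a53"),  # 64-bit ARMv8
}

def big_little_mcpu(big, little):
    """Return the GCC -mcpu option for a known big.LITTLE pair."""
    if (big, little) not in BIG_LITTLE_PAIRS:
        raise ValueError("unknown big.LITTLE pair: %s.%s" % (big, little))
    return "-mcpu=%s.%s" % (big, little)

print(big_little_mcpu("cortex-a15", "cortex-a7"))
print(big_little_mcpu("cortex-a57", "cortex-a53"))
```

Refusing unknown pairs keeps a typo from silently producing an option the compiler would reject later.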
