My experiences with the Luminary Micro LM3S6965 Evaluation Board

Features

Obtaining the EK-LM3S6965

I ordered the unit from Mouser at 7:30pm and received it the next morning. Yay Mouser! They're awesome. I registered the kit on 10 September 2007 via the Luminary Micro web site. I noticed that it said my registration was "pending", as was my registration for the EK-LM3S811 that I got from the Circuit Cellar Design Stellaris 2006 contest. I wonder when they're going to make up their minds about my registrations. Embedded Artists AB also requires a product registration to access product documentation. Update: on 13 September 2007, an email to the support department at Luminary got all my products promoted from "Pending" to "Activated", according to both Wendell Smith, Director of Marketing, and the web site.

Goals

Toolchain

Writing to the flash memory

The bundled software includes the LMI Peripheral Driver Library, or PDL. There are also several example applications, including the "Quick Start" game/demo. The source code for all these programs is included, but with Very Scary Legal Warnings. I want to be able to generate my own software to be released to the public domain. The first step would normally be to establish a complete toolchain, but since I am already provided with executable code and a mechanism to write to the device, I thought I would master that first. The most obvious choice is to reprogram the unit with the game/demo program that is shipped with the product. Of course, the question becomes, "If nothing changes, did I really program anything?", also known as "The Fastest Gun in the West" problem. The solution to that is to also establish a method for erasing the device completely. Currently on power up, the unit plays a short song and displays both an LMI and a CodeSourcery splash screen, followed by a scrolling maze, with the assigned IP address on the bottom of the OLED screen. If erased, the unit should no longer do that. If successfully reprogrammed, it should revert to the previous behavior.

The first mechanism that I want to try is the lmiflash utility that is available from the LMI web site. The version that I downloaded on 10 September 2007 was "Version 1.2.440, 06/04/2007". The README.txt file that was included in the ZIP archive provided installation and usage notes, as well as a change log. Since lmiflash is a command line tool, the usual PATH bother arises. LMI suggests three potential methods (execute from the install directory, explicitly specify the path or modify the PATH environment variable), while I prefer to add lmiflash.exe and its attendant FTCJTAG.dll to a directory already in the PATH. In this case I copied these two files to the directory where the CodeSourcery G++ Lite toolchain is installed, as it kindly added itself to the PATH upon installation. Using the ever-handy "Open Command Window Here" XP PowerToy (available for free from Microsoft), I opened a command window in the directory where the Quick Start application binary file, qs_ek-lm3s6965.bin, resides. I typed "lmiflash help" at the command prompt and got the program version and copyright information as well as a list of 'flags' and usage examples. This verifies that it is installed and accessible from the command line anywhere.

I typed in the following command line:

lmiflash -f qs_ek-lm3s6965.bin -v -r

The "-f" flag specifies the binary file (.bin extension) to send. The "-v" flag requests a verify operation after programming and the "-r" flag requests a device reset. It first said it was "Mass erasing flash..." which took a few seconds to complete. Then it proceeded to the next line, saying which page number it was currently programming. Since the demo/game program is 154K long, it was no surprise that it counted up to 153 (starting at 0, of course). Each page took just under a second to program, so the total programming time was approximately two minutes. It then went into the verification phase, which also counted the pages for me. This process is quicker but still takes several seconds to complete.
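The page count is simple arithmetic once the page size is known. Assuming 1 KB flash pages (which matches the 256 pages that a mass erase of the 256 KB device reports), a 154K image needs 154 pages, numbered 0 through 153, and even a single byte past a page boundary costs a whole extra page. The helper below is a hypothetical sketch of that rounding, not something taken from lmiflash:

```c
#include <stddef.h>

/* Assumed flash page size for the LM3S6965: 1 KB
   (256 KB of flash divided into 256 pages). */
#define FLASH_PAGE_SIZE 1024u

/* Number of flash pages an image of `size` bytes occupies. A partial
   page still has to be programmed, so the division rounds up. */
static size_t pages_needed(size_t size)
{
    return (size + FLASH_PAGE_SIZE - 1u) / FLASH_PAGE_SIZE;
}
```

For example, pages_needed(154u * 1024u) gives 154, so the last page programmed is number 153.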

After completing these actions, the demo unit started playing its little anthem on its little speaker, as well as displaying the correct screen images. Next I simply erased the device using the "mass erase" option:

lmiflash -m -v

This not only erased the flash in a few seconds, but also verified every page (all 256 of them), which again took quite a while. Repeatedly pressing the RESET button produced no anthems or visuals, so I'm confident that the device is erased.

Just to be thorough, I reprogrammed the quick start program and verified that it was singing songs and drawing pictures again. So now I know how to erase, write and verify the flash memory using the lmiflash command line utility. This is the most likely candidate for continued usage, as I can easily integrate it into my standard makefiles.

I am a bit confused about what Luminary considers to be the "default" crystal frequency. On their previous devices with which I have experience, such as the LM3S811, the "default" crystal frequency is 6MHz. The notes for the lmiflash utility indicate that this is now 8MHz, to accommodate the new "Fury" class of devices, as well as the "Sandstorm" parts. I don't know which code names go with which parts, and I am puzzled because my shiny, new evaluation board has a 6MHz crystal on it. It is labelled "FS6.000P" and registers as a solid 6MHz on my kick-ass HP 16500B 1 GSa/s scope. I will consult the Luminary web site and try to understand this conflict. Failing that, I will post a question on their forum and wait patiently for an answer. They have been very prompt at answering questions in the past.

According to a forum reply from "LMI Eric" on or about "2007/06/06 13:45", the Fury class are the LM3S2xxx and LM3S6xxx families. I couldn't find anything about the Sandstorm class, so I posted a question on the forum. In that post, I also asked for clarification on the XTAL frequency. The part I have is 6MHz, but the lmiflash usage notes and change log indicate that 8MHz is the new favorite. Perhaps all these burning questions will be answered shortly.

Now that I can burn other people's bits into the part, it's time to make up some bits of my own. That is a project for tomorrow.

Time passes...

Right on schedule, "LMI Eric", the Luminary Micro forum moderator, responds to my post and explains that the "Sandstorm" class consists of the first-generation, 28- and 48-pin devices. He also gives some reasons for changing from 6MHz to 8MHz, but that still doesn't explain why my eval board has a 6MHz crystal. I explained this in a subsequent post and I'm sure he will set me straight any minute now.

Something else he mentioned suggested that tweaking the "-x" parameter of the lmiflash utility would speed up the writing process. I will now try that and see whether there is any improvement over last night's experimental results.

OK, it's hard to tell, but it does seem a little faster. I would have to go and get all scientifically rigorous and actually time the two variations with a stopwatch or something to get any actual data. That might happen at some point, but for now I'm more concerned with getting back up to speed with the assembler.

Update: LMI Eric has immediately responded to my follow-up question and explained that there are not one but two crystals installed on the board. So it's got both a 6MHz crystal for the USB interface chip (FTDI) and an 8MHz crystal for the LM3S6965 part. On a side note, I see that I have been promoted from "Junior Boarder" to "Senior Boarder" on the LMI forum. Yay for me! I'm somebody.

Assembling a Minimalist Program

As of today, 11 September 2007, I will be using the GNU assembler (as), version 2.17 (arm-none-eabi) as provided by CodeSourcery. I had some example ARM7 code for both the ARM/Keil tools and the GNU assembler, but they were lost in the Catastrophic Data Loss of '07. Luckily, I posted one example for the LM3S811 evaluation board on the LMI forum (in ARM/Keil assembler format: link), which was subsequently paralleled by another user (hilarycheng) in GNU assembler format (link).

I'm going to break this down into teensy, weensy steps so as not to get too far ahead of myself. The first step is to go back and review the assembler documentation. This will remind me of little details (e.g., assembler files have an extension of ".s" and the single line comment character is '@'). Once I review this material I should be able to scare up a short source file that will at least not make the assembler burp, even if it does nothing interesting. The second step, following on the success of the first, will be to make the STATUS LED blink, or at least turn on or off under program control. This will involve initializing the GPIO port (after reviewing the schematic to find out which pin the STATUS LED is connected to), determining its default state, programming the part to set the line to its opposite state, making a loop that toggles the bit, and finally slowing that down so as to be human-perceptible. All this to re-create the "hello, world" of embedded programming, the blinking LED.

An absurdly small example that "works". I created an empty file (called 'empty.s') and assembled it. This works. It produces a file called 'a.out' that is 529 bytes long. Here is the command line that I used:

arm-none-eabi-as empty.s

Just to be thorough in my uselessly minimalistic exercise, I translated the object file into a binary file, using the objcopy command:

arm-none-eabi-objcopy a.out empty.bin -O binary

That's a capital letter 'O' for the output format option, not a zero. This command produces a (drumroll, please...) empty file called 'empty.bin'. If you omit the output filename, the objcopy command replaces the contents of the input file, effectively destroying the original.

Now I will try the same thing with only comments in the file, and see if the resulting binary file is still zero bytes in length. The traditional C language comments /* comment */ work as expected. The single line comment character is '@'. This seems to be unique in the as family and only applies to the ARM architecture. Semicolons and double slashes don't work as comment markers; on ARM, the semicolon actually separates multiple statements on one line. An interesting thing I learned during this experiment was that the last line of the source file has to be terminated by a newline (in effect, a blank line at the end of the file). Also, a file containing nothing but a blank line is perfectly acceptable.

I sleep. I forget many things. Best to write the important ones down.

There's a new version of the lmiflash utility available from the LMI web site. The previous documentation pertained to the 1.2.440 version. I will now grab the 1.2.459 version, as it addresses a potential mass erase problem.

What I should have said was that I would try to grab the newest version... it's not available on the LMI web site yet. I left a note on the forum and expect it to pop up presently. This is not a critical update for me. I only found out about it this morning as I was browsing the LMI forum. It's a great way to stay up to date with what's going on with these products. LMI is doing a great job here. Now back to re-reading the as documentation and learning more about sections.

Concerning Sections

The assembler, as, will always generate at least three sections for any given source file, named text, data and bss. Any or all of them can be empty.

The text section contains the executable code of the program. Why is it named text? It's not text. Perhaps this is a throwback to the days of interpreted languages, like BASIC. Who knows these things? Perhaps I'm being anthropocentric here. Text is something that can be read. Just because I can't read it doesn't mean someone or something else can't. In this case, the processor should have no trouble reading the 'text', in the sense that it fetches and executes instructions from memory. It still seems counterintuitive to me. In any case, text sections are usually considered to be constant, in the sense that they are not modified during the normal course of events.

The data section, on the other hand, is quite pliant. It is generally the Land of Variables and the part of the memory map where the processor would exert its will over the contents; i.e., the RAM. This brings up a semi-mind-bending concept that is encountered with small-scale embedded development. The data section might be situated in RAM, but remember that RAM is a form of volatile memory. When you turn the power switch off, the contents go bye-bye. Consider the example of the initialized variable, and specifically one that is initialized to a value other than zero. Where does this information persist? What limbo does it inhabit during the Times of Forgetfulness? The only permanent (i.e., nonvolatile) section we've discussed so far is the text section. That means that initialized variables must pack their informational baggage amongst the "Thou Shalts" and "Thou Shalt Nots" of the executable code. Upon release from the arms of Morpheus, the values thus interred must be resurrected and arranged in the proper order. This is one of many tasks for a very special and important piece of code called the 'startup' code. At this point in the process of re-creating the "hello, world" application, this specific function is not needed. We will not be using any initialized variables. One day soon we will, and the startup code will address this need.

The bss section holds the remainder of the variables: those that do not have a predefined value or are otherwise not in need of initialization. The contents of the bss section are expected, by long convention (and, for C programs, by the language standard), to be filled with zeros. This zero-filling is another of the duties of the startup code. Our quaint "hello, world" does not have any particular requirements for the bss section.
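The two data-related startup duties, copying initial values out of flash and zero-filling bss, fit in a few lines of C. This is a minimal, host-testable sketch: on a real target the five pointers would come from symbols exported by the linker script (the symbol names differ between toolchains, so none are assumed here), and init_data_and_bss is a hypothetical name, not part of any library.

```c
#include <stdint.h>

/* Sketch of the startup code's data duties. On hardware, the pointers
   would be linker-script symbols; here they are plain parameters. */
static void init_data_and_bss(const uint32_t *data_load, /* initial values, kept in flash */
                              uint32_t *data_start,      /* start of .data in RAM */
                              uint32_t *data_end,        /* one past the end of .data */
                              uint32_t *bss_start,       /* start of .bss in RAM */
                              uint32_t *bss_end)         /* one past the end of .bss */
{
    /* Resurrect the initialized variables: copy their values from flash to RAM. */
    while (data_start < data_end)
        *data_start++ = *data_load++;

    /* Zero-fill everything that has no explicit initial value. */
    while (bss_start < bss_end)
        *bss_start++ = 0u;
}
```

In the test harness an ordinary array stands in for RAM and another for the copy of the initial values held in flash.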

While the various sections can be defined in the source, it is the linker/loader application that actually arranges them in the order in which they appear in the final product. The GNU linker/loader is called ld and has its own arcane methods for shuffling the puzzle pieces. One of these is the linker script, which is unavoidable for any but the most trivial of applications. A linker script defines the working areas of memory that are available as well as the mapping of the different sections into them. Linker scripts conventionally have the extension '.ld'.

Take a Look

Let's take a look at what's in the a.out file. Using the objdump command (arm-none-eabi-objdump, in this toolchain), we can see the organization of the file and what is in each section. Type arm-none-eabi-objdump on a line by itself to see all the options available. You have to pick at least one option so that it knows what you want it to do, but it will assume that you're interested in the file named 'a.out', which happens to be the case right now. I used this command to have a look at all the headers in the file (one for the file itself and one for each section):

arm-none-eabi-objdump -x

This resulted in the following output from objdump:

a.out:     file format elf32-littlearm
a.out
architecture: arm, flags 0x00000010:
HAS_SYMS
start address 0x00000000
private flags = 4000000: [Version4 EABI]

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000000  00000000  00000000  00000034  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000034  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000034  2**0
                  ALLOC
  3 .ARM.attributes 00000010  00000000  00000000  00000034  2**0
                  CONTENTS, READONLY
SYMBOL TABLE:
00000000 l    d  .text  00000000 .text
00000000 l    d  .data  00000000 .data
00000000 l    d  .bss   00000000 .bss
00000000 l    d  .ARM.attributes        00000000 .ARM.attributes

That's a lot of information to describe an empty file! Now that we know how to do all this fun stuff with nothing, let's add some stuff to the file and see what happens.

Here's a really short but complete ARM assembly language program for the Cortex-M3:

.cpu cortex-m3
.thumb

.word	0x20010000
.word	reset
.word	reset
.word	reset

reset:	b .

It still doesn't do anything, but it does it properly. Let's take it apart, line by line.

Note the complete absence of comments within the code. Normally I'd be hanging my head with shame, as I am a big advocate of liberal program comments, but this is intended to be a meek & mild introduction. I don't want to scare anybody off this soon.

The first line, ".cpu cortex-m3", tells the assembler what processor is being used. This is important because the GNU assembler speaks many, many languages, and in the case of the ARM family, many subdialects for a given architecture. We're using a Cortex-M3 part, so we make that plain up front. Note that any line that starts with a "." is an assembler directive, which is sometimes called a pseudo-op. These are instructions to the assembler and affect the way that the assembly of the source code happens.

The next line, ".thumb", tells the assembler that of the two possible instruction sets that are supported by the ARM family (ARM and Thumb), we will be using the Thumb set. This is somewhat odd to me, as the Cortex-M3 architecture is Thumb-only; it doesn't support ARM instructions at all. Specifically, it supports the new Thumb-2 instruction set. It just seems redundant to me that after we specify a processor that only supports a particular instruction set, we then have to explicitly tell it what instruction set to use. Since the Cortex-M3 is a relatively new part of the ARM family, I suspect that this might change in the future: specifying ".cpu cortex-m3" would imply ".thumb". There is no ".thumb-2" directive (yet); the closest thing is ".syntax unified", which selects the unified assembler syntax used for Thumb-2 code.

The next four lines are assembler directives that define the minimum required vector table for the Cortex-M3. Each line causes a "word" of program memory (32 bits or 4 bytes) to be defined.

The first one is the value to be loaded into the stack pointer on reset. The hexadecimal number 0x2001 0000 is the address just past the end of the built-in RAM on the chip (64 KB starting at 0x2000 0000). [I'm not sure if this needs to be 0x2000 FFFF instead; I'll find out.] We don't use the stack in this example and probably won't need to in the final, blinking version, as no subroutine calls will be made and no interrupts will be handled. The stack pointer setting is part of the minimum vector table, so we set it to a reasonable value anyway: if an exception were ever taken, the processor would push its registers onto the stack, and pushing into non-existent memory would itself cause a fault.
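The arithmetic hints at the answer to that bracketed question. Assuming the Cortex-M3 uses a full-descending stack (the stack pointer is decremented before each word is stored) and that RAM is 64 KB starting at 0x2000 0000, an initial value of 0x2001 0000 means the very first push lands at 0x2000 FFFC, the last word of RAM. If I read the ARMv7-M reset description correctly, the low two bits of the loaded value are ignored anyway, so 0x2000 FFFF would become 0x2000 FFFC and waste a word. The helper names below are hypothetical:

```c
#include <stdint.h>

/* Assumed LM3S6965 SRAM: 64 KB starting at 0x20000000. */
#define SRAM_BASE 0x20000000u
#define SRAM_SIZE 0x00010000u

/* Model of one word push on a full-descending stack:
   decrement SP first, then the word is stored at the new SP. */
static uint32_t push_address(uint32_t *sp)
{
    *sp -= 4u;
    return *sp;
}

/* Nonzero if `addr` falls inside on-chip SRAM. */
static int in_sram(uint32_t addr)
{
    return addr >= SRAM_BASE && addr < SRAM_BASE + SRAM_SIZE;
}
```

Starting from SP = 0x2001 0000, which is itself just outside RAM, push_address returns 0x2000 FFFC, which is inside it.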

The next vector is the address of the code to be executed on reset. Here we just point to a label that is defined (and explained) later.

Next, in a real program, you'd find the vectors for the NMI (non-maskable interrupt) handler and then the "hard fault" handler. Since this is far from being a real program yet, these vectors just point to the reset routine. We're not going to be using interrupts (maskable or non-maskable) and we have no clue what to do if a "hard fault" occurs (or really exactly what that is at this point), so it effectively doesn't matter. As a point of interest, the normal (maskable) interrupts are disabled coming out of reset because the NVIC (Nested Vectored Interrupt Controller) has yet to be initialized. This means we need not worry about pesky interrupts of that stripe bothering us at this time. The non-maskable interrupt (NMI), on the other hand, is said to be "not disabled" coming out of reset, meaning it's armed and dangerous. It could go off at any time (depending on what you've got connected to the NMI input). The same applies to the hard fault vector. What all this means is that while we're not using these last two vectors in this application, we must still define them, even if they point to code that does nothing particularly clever. As a low-level AVR assembly programmer, I'm always trying to save a byte here and there, and omitting unused vectors from the end of the AVR vector table is a good way to do that (or omitting the vector table altogether, in trivial applications). The Cortex-M3 minimal vector table (these four entries) cannot be omitted, but the remainder can be.

This is a good time to point out the difference between this vector table and the traditional vector table, such as used by many microprocessors, like the Z80, the AVR and the ARM7TDMI family (not to be confused with the ARMv7 architecture - two different things). The traditional vector table is a list of instructions at known locations that are executed when different things happen to the processor during its adventures. On the Cortex-M3, this is a list of pointers, usually the addresses of subroutines that should be executed when Things Happen (the exception being the first vector, which holds the initial value for the stack pointer). It's a subtle but important distinction. For one thing, the Cortex-M3 responds to a reset by basically popping the top two values off the vector table as if it were the program stack. This initializes the real stack pointer (SP) and the program counter (PC) before a single instruction has executed. Normally an ARM7TDMI vector table would be populated with branch (B) instructions that diverge to the various needful places. Another important aspect of this difference is that it allows a Cortex-M3 program to be written entirely in C. All ARM7TDMI programs (C or assembly) must implement their startup code in assembler - no bones about it.
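To make the "list of pointers" idea concrete, here is a small host-side model of what the core does with the first two table entries. It is not real startup code; take_reset and struct core_regs are invented for illustration, and the masking of the low bits follows my reading of the ARMv7-M reset pseudocode, in which those bits are not used as address bits.

```c
#include <stdint.h>

/* The two registers that the reset sequence loads straight from memory. */
struct core_regs {
    uint32_t sp;
    uint32_t pc;
};

/* Hypothetical model of a Cortex-M3 reset: no instruction runs; the core
   simply reads words 0 and 1 of the vector table at address 0. */
static struct core_regs take_reset(const uint32_t *vector_table)
{
    struct core_regs r;
    r.sp = vector_table[0] & ~3u; /* word 0: initial stack pointer, word aligned */
    r.pc = vector_table[1] & ~1u; /* word 1: reset handler; bit 0 selects the
                                     instruction set, not part of the address */
    return r;
}
```

On an ARM7TDMI, by contrast, the word at the reset location would itself be executed as an instruction.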

Now on to the real part of the program. Here is the actual executable code for which all the previous stuff was mere preamble:

reset:	b .

The "reset:" part is a label. We know this because it's the first thing on the line and because it ends with a colon. The name of the label is "reset", but I just picked that because it seemed descriptive. Any name would do. Labels are used by us lazy humans so that we don't have to keep up with actual addresses. When I was a boy and I had a Z80 but no assembler, I had to keep track of memory addresses. Now I'm all growed up and I have a pretty cool assembler and it handles all this kind of stuff for me. At other places in the program I can refer to this place in the code by referencing the label only. So for example if I wanted to jump back here I could write:

b reset

...and execution would continue from that point. The "b", as you might recall from earlier, stands for "branch (unconditional)". You might remember him as "Jump" or "GOTO"; same thing. It basically replaces the program counter (PC) with some other value; in this case the address of the reset routine.

By now you've noticed that I did not write "b reset", but instead the cryptically short alternative "b .". Here "." is the assembler's location counter: the address of the instruction currently being assembled, which in this case is the branch itself. This effectively means "I jump to myself". We also know this as the famous "infinite loop".
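Because "." is the address of the branch instruction itself, what the assembler does with "b ." comes down to a small calculation. The sketch below is a hypothetical helper based on my reading of the 16-bit Thumb unconditional branch encoding: the PC reads as the instruction's address plus 4, and the offset is stored in halfwords in the low 11 bits. For "b ." the offset is -4 bytes, or -2 halfwords, which gives 0xE7FE.

```c
#include <stdint.h>

/* Encode a 16-bit Thumb unconditional branch ("B", 16-bit form) placed at
   `addr` that jumps to `target`. Returns 0 if the target is misaligned or
   out of range (0x0000 is not a B instruction, so it serves as an error). */
static uint16_t encode_thumb_b(uint32_t addr, uint32_t target)
{
    /* The PC reads 4 bytes ahead of the current instruction in Thumb state. */
    int32_t offset = (int32_t)(target - (addr + 4u));

    /* The 11-bit signed halfword offset covers -2048..+2046 bytes. */
    if (offset < -2048 || offset > 2046 || offset % 2 != 0)
        return 0;

    return (uint16_t)(0xE000u | ((uint32_t)(offset / 2) & 0x7FFu));
}
```

With the branch at 0x10 branching to itself, encode_thumb_b(0x10u, 0x10u) yields 0xE7FE.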

That completes the listing of a functional, operational Cortex-M3 application program. Although you could argue that it doesn't do anything useful, it is doing something (with all the fuss & muss needed to get to this point), and if we can tell it to do this successfully, we can tell it to do other, more complex tasks. Like blink a freakin' LED already.

Make it Work

To assemble this magnum opus, issue this command:

arm-none-eabi-as hello.s

This translates our source program hello.s into an object file called a.out. This is not yet in a form that can be understood by the microcontroller. Because this example is trivial, all our code is in a single, contiguous chunk (or section). We can skip the traditional linking/loading stage and just copy the data and instruction (singular) to a binary file. This is done with the objcopy command like so:

arm-none-eabi-objcopy a.out hello.bin -O binary

This creates a wee file of eighteen (18) bytes called hello.bin. This file is the binary image of our complete if trivial application and corresponds to the pattern of ones and zeros that the processor craves. Next we send the happy bits to go live in the flash memory of the processor using the lmiflash utility:
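The eighteen bytes are easy to account for: four 32-bit vector table words plus one 16-bit Thumb instruction. The sketch below rebuilds the image by hand, assuming the label "reset" lands at offset 0x10 right after the four table words, that the vectors hold that plain label address (as the listing stands), and that "b ." assembles to 0xE7FE. The LM3S6965 is little-endian, so each value is stored least significant byte first. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the low `nbytes` bytes of `value` at `out`, little-endian,
   and return the number of bytes written. */
static size_t put_le(uint8_t *out, uint32_t value, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        out[i] = (uint8_t)(value >> (8u * i));
    return nbytes;
}

/* Rebuild the hello.bin image: four vector table words, then "b .".
   Returns the image size in bytes. */
static size_t build_hello_image(uint8_t out[18])
{
    const uint32_t reset = 4u * 4u;       /* label follows the 4 words: 0x10 */
    size_t n = 0;
    n += put_le(out + n, 0x20010000u, 4); /* initial stack pointer */
    n += put_le(out + n, reset, 4);       /* reset vector */
    n += put_le(out + n, reset, 4);       /* NMI vector */
    n += put_le(out + n, reset, 4);       /* hard fault vector */
    n += put_le(out + n, 0xE7FEu, 2);     /* reset: b . */
    return n;
}
```

Comparing this byte pattern with a hex dump of the real hello.bin is a quick way to check the assumptions.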

lmiflash -f hello.bin -v -r

If all goes well, the part is mass erased, the file gets copied over and verified and the part is reset, which should start the wheels in motion. I've tried it and it seems to be working. At least it's no longer playing the corporate hymn every time I turn on my PC.

Preparing to Blink

Here's where we get to some actual programming. We need to configure the general purpose I/O (GPIO) line that is connected to the STATUS LED to be an output capable of driving an LED. To do that, several things need to happen. The first is that we need to take a look at the schematic and see how the LED is connected to the processor.

In the User's Manual, we find the schematic for the EK-LM3S6965 beginning on page 18. The STATUS LED is connected via JP15 to pin 47, also known as PF0/PWM0. JP1-15 are a row of tiny pads with a weensy trace connecting across them. Each jumper has two pads with one trace between them. These are in series with all of the external peripherals on the evaluation board. This allows you to disconnect (by physically cutting the trace) any of the on-board gizmos if you need to use that particular I/O line for something else. If you later change your mind, you can solder on a wee surface mount resistor of the 0 ohm variety. It's not as hard as it sounds. These jumpers are located directly below the OLED display.

The STATUS LED circuit is a simple one. Pin 47 (PF0) of the LM3S6965 (U1) is connected to JP15, which then goes to R16, a 330 ohm resistor. The other end of the resistor is connected to the anode of LED2, the green STATUS LED. The cathode of LED2 is connected directly to ground. This makes things simple from a programming standpoint. If the GPIO pin is configured as an output, setting the logical value of that pin to a "1" will cause the voltage there to go up to or very near the I/O supply voltage, which is 3.3V for this part. This will cause current to flow through the jumper, the resistor and ultimately the LED, which will then begin to emit light. The resistor limits the amount of current that can flow in the circuit. Setting the logical value to "0" will cause the voltage on the GPIO pin to drop to ground, which effectively removes any voltage differential from the circuit, causing the current to stop flowing. The LED goes off.
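For a feel for the numbers, here is a rough estimate of the LED current. It assumes a forward voltage of about 2 V for the green LED, which is a typical figure and not something taken from the board's schematic or the LED's data sheet, and it ignores the small voltage drop inside the GPIO output driver:

```latex
I_{\mathrm{LED}} = \frac{V_{DD} - V_F}{R_{16}}
  \approx \frac{3.3\,\mathrm{V} - 2.0\,\mathrm{V}}{330\,\Omega}
  \approx 3.9\,\mathrm{mA}
```

So, roughly 4 mA; the exact figure depends on the actual LED.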

Note that the name of the GPIO pin in question is "PF0/PWM0". When a pin name has two parts separated by a slash, the pin can perform either of two functions. The only exceptions are the JTAG pins (which can be GPIO, JTAG or JTAG-SWD, depending on configuration). The nice thing is that on this part no pin carries more than one peripheral function. Some high-complexity/low-pin-count parts make you choose between equally essential hardware functions, such as PWM or UART. That's fine until you reach a conflict where you need both functions but they're mapped to a single pin. One of the features of this part touted by LMI is "No pin sharing!" Technically, many of the built-in peripherals do share pins with the GPIO ports. Luckily, the GPIO ports are non-differentiated, meaning they're all interchangeable as far as features and performance go. Thankfully, the four analog-to-digital converter (ADC) inputs are not multiplexed with anything.

The "PF0" part of "PF0/PWM0" is the first aspect of this pin name that we will want to explore. It's the shortened designation of "Port F, bit 0". The LM3S6965 has seven GPIO ports, cleverly named A through G. Each port has from two to eight bits, and each bit corresponds to an individual GPIO pin. Here is a list of how many GPIO pins are available on each port:

Port      GPIO pins
Port A    8
Port B    8
Port C    8
Port D    8
Port E    4
Port F    4
Port G    2
Total     42

Port F has 4 available bits, numbered 0-3. Of the other three pins on Port F, one of them (PF1) is connected to the SELECT button (SW2, immediately to the left of the STATUS LED) and the other two (PF2, PF3) drive the two LEDs incorporated into the Ethernet connector, P4.

The other function of this pin is "PWM0" (Pulse-Width Modulation output #0) which we will explore later when we want to get all fancy and fade the LED's brightness up and down instead of just turning it on and off.

Referring again to the LM3S6965 data sheet (something we will be doing a lot of before we're finished), we see that each of the GPIO "blocks", as they call them, is quite sophisticated and full of functional goodness. I will list some of the major features that apply to each and every one of the available GPIO pins. Each pin can be an input or an output. In fact, the direction can be changed under program control, so that the pin is an input one moment and an output the next, depending on application requirements. As an input, each pin can have a weak pull-up or pull-down resistor assigned to it. Inputs can also act as maskable interrupt sources, triggering on rising, falling or both edges, as well as on high or low levels. As inputs or outputs, the pins are 5V tolerant, and they can be configured as open drain. As outputs they can have 2mA, 4mA or 8mA drive levels. The 8mA drive outputs can have slew rate limiting applied. Individual bits can be accessed by using address bits as a mask. This allows updates to be made to particular bits without affecting adjacent pins in the same port, and without resorting to the traditional read-modify-write cycle for doing so. There's more, but that's all my soggy brain can handle at the moment.
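The address-masking trick deserves a closer look. As I read the data sheet's description of the GPIO data register, address bits [9:2] of an access act as a mask selecting which of the port's eight bits take part, so each port's data register appears at 256 different addresses. The hypothetical helpers below compute that alias address for a given mask (using the Port F base address of 0x4002 5000) and model the effect of a write through it:

```c
#include <stdint.h>

#define GPIO_PORTF_BASE 0x40025000u /* Port F base address */

/* Address of the data register alias that exposes only the bits in
   `mask`: the mask occupies address bits [9:2]. */
static uint32_t gpio_masked_data_addr(uint32_t port_base, uint8_t mask)
{
    return port_base + ((uint32_t)mask << 2);
}

/* Host-side model of a write through that alias: only the bits selected
   by the mask change, with no read-modify-write in software. */
static uint8_t gpio_masked_write(uint8_t current, uint8_t mask, uint8_t value)
{
    return (uint8_t)((current & (uint8_t)~mask) | (value & mask));
}
```

So, once PF0 is configured as an output, storing 0x01 at 0x4002 5004 should light the STATUS LED and storing 0x00 there should turn it off, leaving PF1 through PF3 alone either way.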

When the LM3S6965 wakes up in the morning, all GPIO lines are configured as tri-stated, undriven inputs, except for the JTAG lines. To configure the GPIO blocks for operation, we must first enable the appropriate peripheral clock(s) in the RCGC2 register. That register's full name is, get this, "Run Mode Clock Gating Control Register 2". That's a mouthful! It lives at offset 0x0108 in the System Control register area, whose base address is 0x400F E000. That makes its actual address 0x400F E108.

The RCGC2 register has individual bits in it for spinning up the seven GPIO Ports as well as the built-in Ethernet MAC and PHY peripherals. Reading or writing to any of the registers belonging to those peripherals while not clocked will result in a "bus fault". That sounds bad, so we won't do it. We're a long, long way from needing to play with the network interface at this point, so I'll just concentrate on getting Port F up to speed.

Off By One: A Long Delay

After spending two weeks trying to get this LED to blink, I finally meet with success. Along the way, I learn a thing or two about the tools that are available, and I stumble onto a dark mystery.

I still don't know exactly why (although I have a vague-ish idea), but we fierce and rugged assembly language programmers can't just drop the reset vector in the table and be done with it. There's a trick to it. The trick is to add one to the address in the vector. I have posted a query on the Luminary forum and eagerly await an answer. My suspicion is that the Cortex-M3 performs the equivalent of a "BX" (branch and exchange) instruction when it reads the reset vector during device startup. If it were a simple "B" (branch) instruction, the actual address would suffice. If (and that's a big "if" at this time) it is acting more like a "BX", then the last bit of the vector, which would normally be a "0" because all instruction fetches are from halfword-aligned memory locations (i.e., even-numbered), is used to indicate what instruction set should be used in the branched-to code. Since the Cortex-M3 only supports Thumb-2 mode, that bit would have to be a "1". Obvious, no? I haven't seen it in any of the documentation that I've read so far. I can only assume at this point that the Powers That Be don't expect anyone to be coding in assembler for this part.
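For what it's worth, the suspicion matches my reading of the ARMv7-M architecture manual: at reset the core loads the PC from the reset vector with bit 0 cleared, and copies bit 0 into the Thumb (T) bit of its execution status register. Since the Cortex-M3 cannot execute ARM code, a vector with bit 0 clear leads to a fault rather than a running program. The hypothetical helpers below model that rule and the "add one" fix. GNU as can also do this for you: placing the ".thumb_func" directive before a label marks it as a Thumb function, and references to it such as ".word reset" then get bit 0 set automatically.

```c
#include <stdint.h>

/* A Cortex-M3 vector is usable only if bit 0 (the Thumb flag) is set;
   with bit 0 clear the core faults instead of running the handler. */
static int vector_is_valid(uint32_t vector)
{
    return (vector & 1u) != 0;
}

/* What "add one to the address" amounts to. Handler addresses are
   halfword aligned, so bit 0 is always free to carry the Thumb flag. */
static uint32_t make_vector(uint32_t handler_address)
{
    return handler_address | 1u;
}
```

For a handler at 0x10, the vector should therefore read 0x11, not 0x10.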

We Now Return...

Before we can blink the LED on and off, we have to be able to just turn it on. To do this, several things must happen in sequence. The first has already been discussed: enabling the peripheral clock to GPIO Port F. This is done by setting the appropriate bit ("GPIOF", bit 5) in that cleverly named system control register, RCGC2. To do this we execute an instruction to load ("LDR" = LoaD Register) the address of the system control register into one of the general purpose registers. I used the first register, known as "r0", in this example. Then we load the data to be written into another register (r1 comes to mind) using another instruction of the same sort as before. Then we can execute a "store" instruction to write the data in the one register into the address pointed to by the other register. Sound complicated? It kinda is. It gets more complicated when you want to do even fancier tricks with addressing modes, but I'll save that for a little later. Here's what the code to do those three steps looks like:

	ldr	r0, =0x400FE108
	ldr	r1, =1<<5
	str	r1, [r0]

Remember that the RCGC2 register has an offset of 0x108 in the System Control space, whose base address is 0x400F E000. That explains that tasty run of hexadecimal gibberish on the first line. The "=" prefix tells the assembler that this is a constant value to be loaded into the register, rather than an address to load from.

The "1<<5" is just my shorthand for a logical "1" in bit position 5. It's a syntax borrowed from the C programming language that means "1 shifted left 5 times". I could have written it as "0x20" or "32" and it would have worked identically; in fact, after assembly, the code would be identical. I try to write what I mean and mean what I write, so when I'm talking about bits (and not numbers) I use the shift notation. It helps me to remember later what I was trying to do at the time.
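The address and bit arithmetic from the last two paragraphs can be double-checked with a few lines of host-side C. This is only a sketch: the macro names are my own, chosen to echo the register names, and the base and offset values are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the text: System Control base and RCGC2 offset. */
#define SYSCTL_BASE   0x400FE000u
#define RCGC2_OFFSET  0x108u
#define RCGC2_GPIOF   (1u << 5)   /* bit 5 gates the GPIO Port F clock */

/* Absolute address of a register, given its block base and offset. */
static uint32_t reg_addr(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```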

The store instruction ("STR" = StoRe Register) stores the first parameter (r1) into the location pointed to by the second parameter. That's why it's written in brackets ("[r0]" instead of simply "r0"). This is what actually writes the value into the register, flipping bit 5 of RCGC2 from its power-on default of "0" to a much more interesting "1". After this instruction executes, GPIO Port F is on the air.

If you stop now and look at the output that is generated by this code, you might be a little bit surprised. Or you might not be. You're pretty clever. You might have already been asking yourself, "How can a 16-bit instruction encode a 32-bit immediate value?" I would answer, "Good question! It can't possibly!" The way it performs this trickery is to store the whole 32 bits somewhere later in the program, and refer to it by its offset from the program counter (PC). The place where it stores these numbers is called a "literal pool". We will have four values in our literal pool before we're done turning on the LED. The assembler dutifully keeps track of all these tedious details for us.
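For the curious, here is a rough host-side model of how a 16-bit Thumb "LDR Rt, [PC, #offset]" reaches into the literal pool. It assumes the T1 encoding described in the ARMv7-M Architecture Reference Manual: the bit pattern 01001, a 3-bit register number, and an 8-bit word offset measured forward from the instruction address plus 4, rounded down to a word boundary. That is why the pool has to sit within about 1 KB after the instruction. The addresses used in the checks are made up, and the function is a sketch, not assembler source.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 16-bit Thumb LDR (literal) encoding: 01001 ttt iiiiiiii.
 * The byte offset is imm8 * 4, counted forward from Align(PC, 4), where
 * PC reads as the instruction address + 4.
 * Returns the encoded halfword, or -1 if the literal cannot be reached. */
static int32_t encode_ldr_literal(uint32_t insn_addr, uint32_t literal_addr,
                                  unsigned rt)
{
    uint32_t base = (insn_addr + 4u) & ~3u;       /* Align(PC, 4) */

    if (rt > 7u || literal_addr < base)           /* low regs, forward only */
        return -1;

    uint32_t offset = literal_addr - base;
    if ((offset & 3u) != 0u || offset > 1020u)    /* word aligned, imm8 * 4 */
        return -1;

    return (int32_t)(0x4800u | (rt << 8) | (offset >> 2));
}
```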

It's downhill from here for the most part. The next step is to tell GPIO Port F that we want its bit 0 to be an output. Remember, the LED is connected to "PF0". It could have been (and still could be) an input, or it could have had a completely different job altogether. As an output it will have the best chance of actually affecting the brightness of the LED, and that is still our primary goal. There is a register in each of the GPIO ports that controls the direction of each of its bits. This register is called the data direction register. For Port F, it is located at Port F's base address (0x4002 5000) with an offset of 0x0400. All of the Port F stuff (with the exception of the peripheral clock controls in the RCGC2 register) is a short distance from the base address. We write a logical "1" to bit position 0 of the data direction register to make it an output. A zero (the default state) would configure it as an input. We repeat the magic incantation and come up with the following code snippet to configure the bit as an output:

	ldr	r0, =0x40025400
	ldr	r1, =1<<0
	str	r1, [r0]

There's one more step that's required before we can actually flip on the bit that lights up the LED. The GPIO pin has to be enabled for digital duty. That's because some of the pins can have other-than-digital duties, specifically analog ones. It just doesn't do to have the two mixing. The Digital Enable register is at offset 0x051C and we need to write another "1" to bit position 0. We could just assume that r1 still contains what we left in it, and that would work until we came back and craftily added something in between and forgot about that dependency. I choose to write explicitly what I want done and have no questions about it. Here is the code, although you can probably see it coming:
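The snippets above write the whole GPIODIR and GPIODEN registers, which is harmless here because PF0 is the only Port F pin we are using. If other Port F pins were already configured, a read-modify-write would leave them alone. A host-side sketch, with a plain variable standing in for the register (on the board the pointer would be the register address, e.g. 0x40025400); the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PF0 (1u << 0)

/* Set the given bits, leaving every other bit of the register as it was. */
static void reg_set_bits(volatile uint32_t *reg, uint32_t bits)
{
    *reg = *reg | bits;
}

/* Clear the given bits, leaving every other bit of the register as it was. */
static void reg_clear_bits(volatile uint32_t *reg, uint32_t bits)
{
    *reg = *reg & ~bits;
}
```

Note that this is a read followed by a separate write, so it is not atomic; that only matters once interrupts start touching the same register.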

	ldr	r0, =0x4002551C
	ldr	r1, =1<<0
	str	r1, [r0]

Now we have an output pin all configured and ready for data. The data register for the GPIO port is at offset 0x0000. This means that it's the same as the base address itself. Unfortunately, we can't just write the data out to the port. Well, we could, but nothing would happen. In order to be able to write to an individual output pin without obliterating all the other pins, a bit masking scheme was implemented. This is a good thing and doesn't end up costing us any extra effort except to tell you about it. Bit positions 2 through 9 of the data port address are used for the bit mask. Why they're shifted over two bits is a mystery to me, but they are. So to write to bit 0 of Port F, we really have to write to address 0x4002 5004. The last digit "4" is bit position 0 shifted left two places. Writing a logical "1" to this location will finally turn on our LED:
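The masked addressing scheme can be simulated on a desktop machine. The sketch below assumes only what is described above: address bits 9 through 2 carry an 8-bit pin mask, and a write through that address changes only the pins whose mask bits are set. The helper names are hypothetical; on the real part the GPIO hardware does the masking.

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_PORTF_BASE 0x40025000u

/* Data-register address whose bits 9:2 carry the 8-bit pin mask. */
static uint32_t gpio_data_addr(uint32_t base, uint8_t pin_mask)
{
    return base + ((uint32_t)pin_mask << 2);
}

/* Simulated effect of writing `value` through that address: only pins
 * selected by the mask change; every other pin keeps its old level. */
static uint8_t gpio_masked_write(uint8_t pins, uint8_t pin_mask, uint8_t value)
{
    return (uint8_t)((pins & (uint8_t)~pin_mask) | (value & pin_mask));
}
```

A mask of zero (plain offset 0x0000) changes nothing, which is why writing to the base address alone does not light the LED.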

	ldr	r0, =0x40025004
	ldr	r1, =1<<0
	str	r1, [r0]

Here's the complete program (with pitifully few comments):

.cpu cortex-m3

.word	0x20010000
.word	reset + 1
.word	endless_loop + 1
.word	endless_loop + 1

.thumb

reset:

@ spin up GPIO Port F

	ldr		r0, =0x400FE108
	ldr		r1, =1<<5
	str		r1, [r0]

@ configure Port F, bit 0 as output

	ldr		r0, =0x40025400
	ldr		r1, =1<<0
	str		r1, [r0]

@ enable digital output for Port F, bit 0

	ldr		r0, =0x4002551C
	ldr		r1, =1<<0
	str		r1, [r0]

@ turn on LED

	ldr		r0, =0x40025004
	ldr		r1, =1<<0
	str		r1, [r0]
	
endless_loop:

	b		.

.end

I added the "endless_loop" to effectively halt the processor at the end. I also used it as a dummy destination for the NMI and hard fault vectors.

Assemble the program (call it led_on.s) with the following command:

arm-none-eabi-as led_on.s -o led_on.o

Convert the object file (led_on.o) into binary format for downloading:

arm-none-eabi-objcopy led_on.o led_on.bin -O binary

Download the binary image to the evaluation board:

lmiflash -f led_on.bin -v -r

Gaze in rapturous awe and wonderment at the glory that is the illuminated LED.

A Day Later and So Much Wiser

So I guessed right. The extra "1" in the reset vector was confirmed to be an instruction set hint by the wise folk of the Luminary Forums. Odd that this is not documented anywhere as it pertains to the vector table. I also discovered that the ".thumb_func" directive not only replaces the ".thumb" declaration in the source code (it implies ".thumb"), but also marks the label that follows it as a Thumb function, so that the linker adds the appropriate "1" all automatic-like wherever that label's address is stored. I was omitting a separate "link + load" step to avoid the evils of linker scripts (hint: they're not that bad). The trivial examples I have been playing with only contain a single, contiguous section and need no special linking (or so I thought). The linker is the one that makes all right again.
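What the linker does for a label marked with ".thumb_func" can be summed up by the R_ARM_ABS32 relocation rule from the ELF for the ARM Architecture specification: result = (S + A) | T, where S is the symbol's address, A the addend, and T is 1 only for a Thumb function symbol. A one-function host-side sketch of that rule (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* R_ARM_ABS32 as specified for ARM ELF: (S + A) | T.
 * T is 1 only when the target symbol is a Thumb function,
 * e.g. a label preceded by .thumb_func. */
static uint32_t reloc_abs32(uint32_t sym_addr, int32_t addend, int is_thumb_func)
{
    return (sym_addr + (uint32_t)addend) | (is_thumb_func ? 1u : 0u);
}
```

With reset at 0x10, both the linker route, reloc_abs32(0x10, 0, 1), and the hand-written "reset + 1" of the earlier listing, reloc_abs32(0x10, 1, 0), come out to 0x11.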

So now we can turn on the LED. Yip! We've also paved the way to being able to turn it back off again. All it will take is writing a logical "0" to the GPIO data port. The code would normally look something like this:

	ldr	r0, =0x40025004
	ldr	r1, =0<<0
	str	r1, [r0]

You might think that specifying "zero shifted left zero places" is overkill, but it is consistent with my desire to 'write what I mean & mean what I write'. It's a bit, not a number. It's also a bit nestled amongst other, identical bits. Recall that we'd already loaded a pointer to the GPIO data register in the previous step. We used r0 just like in this example. What I will do differently this time is to load both possible values (0 and 1) in two different registers ahead of time and then write them out over and over. You didn't think we were just going to turn on the LED then turn it off again and be done with it, did you? No, we are going to tell the mighty Cortex-M3 to do this over and over again until its pins fall off. The basic code to do this would look like this (and replaces both of the previous examples of turning the LED on and off):

	ldr	r0, =0x40025004
	ldr	r1, =1<<0
	ldr	r2, =0<<0
	
loop:	str	r1, [r0] @ turn LED on
	str	r2, [r0] @ turn LED off
	b	loop

The only immediate problem with this example is that it will execute so incredibly swiftly as to be undetectable by the human eye. To get a nicely modulated output that is both visible and pleasantly paced, we need to add a couple of short delays in between the turning on and turning off of the LED. These delays can be achieved by having the processor do some "busy work". We can tell the processor to count down from 1,000,000. You might think that excessive and bordering on abuse, but remember that the Cortex-M3 is a swift part and at this point really has nothing better to do.

The code to count down from 1,000,000 is crazy short and takes less than half a second to execute:

	ldr	r3, =1000000
d1:	sub	r3, #1
	bne	d1

The first instruction loads r3 with the big number, 1,000,000. Since all the general purpose registers of the Cortex-M3 are 32 bits wide, they can hold unsigned integers from 0 to 4,294,967,295. If we treat the top bit (bit 31) as a sign bit (two's complement), we can store signed (positive or negative) numbers in registers from -2,147,483,648 to 2,147,483,647. So now you can see that a paltry million is no big deal for this computing powerhouse.

The next line starts with a label ("d1:") because we're going to want to come back here - several times, in fact. The instruction sub r3, #1 tells the processor to subtract the immediate value of one (#1) from the register r3. This is in fact a shorthand notation for the full instruction syntax, since the source and destination registers in this case are the same (r3). We could have, for example, subtracted the immediate value "1" from register r3 and stored it in any other register. Since we're just counting down, we can state it as shown.

Many arithmetic instructions note certain aspects of their results in a special set of flags. (With the assembler's default syntax, "sub r3, #1" assembles to the 16-bit Thumb form, which sets the flags; in the 32-bit encodings only the flag-setting "S" variants, such as "subs", do.) One example is the "zero flag". If an operation produces a zero, this flag is set. If the result is anything except zero, the flag is cleared. The state of the various flags can be used to modify the behavior of other instructions. This means that some instructions can be considered "conditional" and used to branch in different program directions. While our rather large number remains greater than zero, we want the processor to jump back to the sub instruction and repeat as needed. When the register finally decreases from one to zero, the zero flag will be set, so the conditional branch instruction bne d1 ("branch if not equal", i.e., branch if the zero flag is clear) will not be taken, allowing the program to continue onwards instead of looping back to the beginning of the loop. So it counts down from a million in less than a second. Can you? Bear in mind that the processor is still running at the fundamental crystal frequency, which for this evaluation board is 8 MHz. We haven't even fired up the phase-locked loop that can increase the operating frequency up to 50 MHz.
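Here is a host-side model of the d1 loop together with a back-of-the-envelope timing estimate. The cycles-per-pass figure is an assumption on my part: a SUB takes one cycle and a taken branch on the Cortex-M3 typically takes two to four, so a pass costs roughly three to five cycles; the example uses three. The 8 MHz clock is the figure given above, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of:   ldr r3, =count
 *         d1: sub r3, #1
 *             bne d1
 * Returns how many times the SUB executes before the Z flag is set.
 * Note: count = 0 wraps to 0xFFFFFFFF, just as it would on the target. */
static uint32_t countdown_iterations(uint32_t count)
{
    uint32_t r3 = count, iterations = 0;
    int z;

    do {
        r3 -= 1u;           /* sub r3, #1 (flag-setting 16-bit form) */
        z = (r3 == 0u);     /* Z flag: the result was zero */
        iterations++;
    } while (!z);           /* bne d1: branch back while Z is clear */

    return iterations;
}

/* Estimated loop time in microseconds, for an assumed number of
 * cycles per pass and a given core clock. */
static uint64_t delay_us(uint32_t count, uint32_t cycles_per_pass, uint32_t clock_hz)
{
    return (uint64_t)count * cycles_per_pass * 1000000u / clock_hz;
}
```

With those assumptions, delay_us(1000000, 3, 8000000) comes to 375,000 microseconds, about 0.375 s, which is consistent with the "less than half a second" above; at four cycles per pass it would be exactly half a second.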

I named the label in the delay loop "d1" because I knew I would need another one. Guess what I named that one? Correct! I called it "d2". One delay occurs after we turn the LED on, to give us time to bask in its luminance, and the other is after the LED is turned off, so that we can develop an appreciation for it now that it's gone.

Here's what the whole program looks like, this time with proper comments and white space:

@ hello.s

.cpu cortex-m3

@ minimum vector table

.word	0x20010000				@ stack pointer
.word	reset					@ reset vector
.word	endless_loop				@ NMI handler
.word	endless_loop				@ hard fault handler

.thumb_func 

reset:						@ reset vector

@ spin up GPIO Port F

SYSCTL_RCGC2	= 0x400FE108			@ run mode clock gating control register 2
RCGC2_GPIOF	= 1<<5				@ bit position for GPIO Port F clock gating control

	ldr	r0, =SYSCTL_RCGC2		@ load system control (RCGC2) register address in r0
	ldr	r1, =RCGC2_GPIOF		@ load data to write in r1
	str	r1, [r0]			@ 0x400FE108 <= 0x20

GPIO_PORTF	= 0x40025000			@ GPIO Port F base address
GPIODATA	= 0x0000			@ GPIO data register offset
GPIODIR		= 0x0400			@ GPIO data direction register offset
GPIODEN		= 0x051c			@ GPIO digital enable register offset

PF0		= 1<<0				@ bit position for PF0
GPIO_MASK	= PF0<<2			@ bit mask

@ configure Port F, bit 0 as output

	ldr	r0, =GPIO_PORTF + GPIODIR	@ load GPIODIR for Port F register address in r0
	ldr	r1, =PF0			@ load data to write in r1
	str	r1, [r0]			@ store data in r1 to address pointed to by r0

@ enable digital output for Port F, bit 0

	ldr	r0, =GPIO_PORTF + GPIODEN	@ load GPIODEN for Port F register address in r0
	ldr	r1, =PF0			@ load data to write in r1
	str	r1, [r0]

@ prepare for loop:
@	load r0 with GPIO data register address (plus mask bits) for Port F
@	load r1 with bit pattern to turn LED on
@	load r2 with bit pattern to turn LED off

	ldr	r0, =GPIO_PORTF + GPIODATA + GPIO_MASK	@ load GPIODATA for Port F register address in r0, including mask bits
	ldr	r1, =PF0			@ to turn LED on
	ldr	r2, =0x00			@ to turn LED off
	
loop:

	str	r1, [r0]			@ turn STATUS LED on

@ a short delay

	ldr	r3, =1000000
d1:	sub	r3, #1
	bne	d1
	
	str	r2, [r0]			@ turn STATUS LED off

@ another short delay

	ldr	r3, =1000000
d2:	sub	r3, #1
	bne	d2
	
	b	loop				@ do it again & again
	
.thumb_func						@ lets the linker set bit 0 in the vectors above
endless_loop:

	b	.				@ loop forever & ever

.end						@ [end-of-file]

It's 80 lines long, including the obligatory blank line at the end of the source file (and ironically just beyond the [end-of-file] comment). Here is the makefile that will assemble, link and download the program to the device:

# makefile - LM3S6965 hello application - blink the LED

all: hello.bin

hello.bin: hello.o
	arm-none-eabi-objcopy hello.o hello.bin -O binary

hello.o: hello.s
	arm-none-eabi-gcc hello.s -nostdlib -nostartfiles -T hello.ld -o hello.o

prog: hello.bin
	lmiflash -f hello.bin -v -r

clean:
	rm -f hello.bin
	rm -f hello.o

# [end-of-file]

Note that instead of invoking the assembler and the linker in separate lines, I combined them in a single call to the arm-none-eabi-gcc compiler. The compiler is smart enough to know that assembler files (ending with ".s") should be sent to the assembler and then linked. I pass the name of the linker script ("hello.ld") to the linker via the gcc command line with the "-T" option. I suppress the normal, operating-system-centric tasks that the C compiler would normally perform on any respectable executable by adding the "-nostdlib" (do not link with the "standard" library, libc) and "-nostartfiles" (do not include the "standard" startup code, crt0.S). The "-o" option just tells the compiler what I want it to name the output file, in this case "hello.o".

Here is the linker script, hello.ld, just to be complete. I'm not going to go into any detailed explanation of that right now:

/* hello.ld - LM3S6965 linker script */

MEMORY {
	flash (rx) : ORIGIN = 0x00000000, LENGTH = 256k
	ram (rwx) : ORIGIN = 0x20000000, LENGTH = 64k
}

SECTIONS {
	.text : { *(.text*)	} > flash
}

/* [end-of-file] */

This file describes the physical memory map of the device and clues the linker in to where the various program sections should go. We only have one section, .text, and it goes into the flash memory at the very beginning. We'll need to be more explicit when we write a C program that has initialized variables and other storage requirements. I find linker script syntax bizarre.

At this point, we could make the LED flash faster by reducing the delay count. Conversely, we could slow it down by making it spend more time counting down. What I propose to do instead is to 'step up' to the GNU GCC compiler and let it do all the work. For trivial examples such as the blinking LED application, this is overkill and usually not worth the trouble. Once we get beyond trivial, however, it saves precious development time and lets us concentrate on other aspects of the project, like delivery.

Compiling a Minimalist Program

The C language was not designed to develop firmware for embedded systems. It was designed as an application programming language for general purpose computers running an operating system. There are only two major hurdles to leap to blink the LED. The first is to provide the necessary vector table in the final binary image file. The second is to access the registers in I/O address space. I will address each of these tasks in order.

There are a number of techniques that could be used to define the vector table. The first one I will demonstrate is the old favorite, the 'startup' file. This is a separate source file, almost always in assembly code, that handles the low-level initialization of the hardware and provides any necessary data structures, such as our vector table. This technique is used for ARM7TDMI programming as well as PC programming. Here is what a startup file for the Cortex-M3 must have at a minimum:

.word	0x20010000
.word	main
.word	endless_loop
.word	endless_loop

.thumb_func				@ Thumb code; the linker sets bit 0 in the vectors
endless_loop:	b .

Once assembled, this file specifies the vector table as well as a fake handler for the non-maskable interrupt (NMI) and hard fault vectors. The first word in the vector table defines the initial stack pointer. The second entry is the pointer to the main() function in our C program, which looks a lot like this:

#define SYSTEM_CONTROL_RCGC2	(*((volatile unsigned int *) 0x400FE108))
#define RCGC2_GPIOF		(1<<5)
#define GPIO_PORTF_DIR		(*((volatile unsigned int *) 0x40025400))
#define GPIO_PORTF_DEN		(*((volatile unsigned int *) 0x4002551C))
#define GPIO_PORTF_DATA		(*((volatile unsigned int *) 0x40025004))
#define PF0			(1<<0)

void main(void) {

	int i;
	
	SYSTEM_CONTROL_RCGC2 = RCGC2_GPIOF; // spin up GPIO Port F
	GPIO_PORTF_DIR = PF0; // configure GPIO Port F, bit 0 as output
	GPIO_PORTF_DEN = PF0; // enable GPIO Port F, bit 0 for digital duties

	while(1) {

		GPIO_PORTF_DATA = PF0; // LED on
		for(i = 0; i < 100000; i++); // short delay
		GPIO_PORTF_DATA = !PF0; // LED off
		for(i = 0; i < 100000; i++); // another short delay
	}
}

The six #defines and subsequent gibberish at the beginning of the file are the method I like to use to address the I/O registers in a C program. Yes, it's ugly and almost incomprehensible (to us human-types) but it has a distinct advantage later on. Most likely you would hide all this un-beauty in a header file. What is defined here are the addresses of the registers that we need to write to and the bit positions of interest within those registers. The utility of this method is that register reads and writes can be performed as simple assignments in the C program. This method works with the tools at hand but is not guaranteed to be portable, as integer bit lengths and other implementation-specific details can and do vary in foreign lands. Bear with me and stipulate for the moment that the first six lines are necessary and correct and let's proceed. Also, note that no other #include files are needed at this stage.
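The same cast-an-address-to-a-volatile-pointer idiom can be exercised on a desktop machine by pointing it at an ordinary variable instead of a hardware address. The REG() macro, the fake_rcgc2 variable and the enable_gpiof_clock() helper below are hypothetical, for illustration only; on the board the address would be the datasheet one, 0x400FE108. Using uint32_t from <stdint.h> also pins the register width to 32 bits, which sidesteps the integer-size portability worry mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Generic form of the #defines above: treat an address as a 32-bit
 * volatile register, so reads and writes become plain assignments. */
#define REG(addr) (*((volatile uint32_t *)(addr)))

#define RCGC2_GPIOF (1u << 5)

/* Stand-in for the hardware register, so the idiom can run on a host. */
static uint32_t fake_rcgc2;

/* Same shape as SYSTEM_CONTROL_RCGC2 = RCGC2_GPIOF in the listing. */
static void enable_gpiof_clock(uintptr_t rcgc2_addr)
{
    REG(rcgc2_addr) = RCGC2_GPIOF;
}
```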

With our ugly-proof blinders on, we can start to look at the rest of the code. Let's take an especially close look at the main() function, and its various decorations in particular. You'll note that it is written as void main(void) which tells us and the compiler that the function takes no parameters and that it returns no value once it completes. The lack of parameters is understandable in our context. From whence would they come? We have no operating system that is invoking this function. It is a standalone application and relies on no prerequisites. While a 'normal' C program would have its main() function declared as "int main(void)" or possibly "int main(int argc, char *argv[])" or variations thereof, we will not be returning any meaningful data, or returning at all for that matter.

Next the ever-popular variable i is declared as an integer. We will be using i as a counter in our delay loop.

The required hardware initialization happens in the next three statements. See how a simple assignment operation (=) accomplishes the desired write to the register? This, to me, is much more obvious in its meaning than other methods I have seen used, such as a generic function call to a routine that incorporates in-line assembly instructions. REGISTER = value does the trick and comes much closer to the 'mean what I say & say what I mean' philosophy that I aspire to. My use of ALL CAPS for register names is based on the fact that they are constant values, previously defined. This convention dates from Dayes of Olde.

We've managed to get the hardware set up in the three previous statements, in the correct order: enable GPIO clock for PORT F, define PF0 as an output and then enable it for digital functions. This takes up much less space in the source file than the assembly language version previously explicated. Of course, we could have written a macro definition that would have effectively typed out the three line snippets that loaded the registers with the target address and data, then the write itself, and that is precisely what the C compiler ends up doing for us.

Since we want the LED to blink, on and off, continuously and forever, it makes sense to bracket it in an endless loop. My preference in C is to use the while(1) construct. Others prefer the for(;;) construct but I prefer the legibility of using a keyword that better represents what I'm trying to express. They are technically similar and probably result in identical generated code.

Inside each iteration of the while() loop, we first find an assignment that turns the LED on (by writing a logical '1' to the output pin). Note that because of the masking bits already incorporated into the pointer to the GPIO data port, PF0 and only PF0 is affected. This is a nice feature of the Stellaris GPIO ports and shows one of many embedded optimizations that have been implemented. Next is an obtuse delay scheme to produce a human-perceptible delay: counting to an arbitrarily large quantity. As our simple example has nothing better to do at this point, this is an acceptable waste of resources. We then turn the LED off with the simple expedient of writing the opposite of PF0 (expressed as "!PF0", which evaluates to 0) to the GPIO data port, again leveraging the bit-isolation afforded by the bit masking address bits. Another delay is performed to preserve symmetry and the infinite loop repeats.
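Because the masked address confines the write to PF0, it makes no difference whether the "off" value is written as !PF0 (logical NOT, which is 0) or ~PF0 (bitwise complement, 0xFFFFFFFE): bit 0 is 0 either way, and the other pins are ignored. A host-side check, using a hypothetical masked_write() helper that simulates what the GPIO hardware does:

```c
#include <assert.h>
#include <stdint.h>

#define PF0 (1u << 0)

/* Simulated write through the PF0-masked data address:
 * only the bits selected by mask can change. */
static uint32_t masked_write(uint32_t pins, uint32_t mask, uint32_t value)
{
    return (pins & ~mask) | (value & mask);
}
```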

The relative elegance of the C code is purchased at the price of a slightly more complex command line to compile and link the project. In fact, apart from the new file names, only the dependency line and the compiler invocation change; the rest of the makefile is unchanged:

# makefile for LED blinkage in C

all: blink.bin

blink.bin: blink.o
	arm-none-eabi-objcopy blink.o blink.bin -O binary

blink.o: startup.s blink.c
	arm-none-eabi-gcc -mthumb -nostdlib -nostartfiles -ffreestanding startup.s blink.c -T LM3S6965-rom.ld -o blink.o

prog: blink.bin
	lmiflash -f blink.bin -v -r

clean:
	rm -f blink.bin
	rm -f blink.o

The target dependency line now contains the additional source file, startup.s. On the compiler invocation line, we've added two important options: -mthumb & -ffreestanding. We have to tell the compiler that we need Thumb instructions, because the ARM instruction set is the default (even when the CPU is specified as 'Cortex-M3', oddly enough). The -ffreestanding option relieves us of declaring main() disingenuously as returning an integer value, as well as reducing some of the other assumptions made about available system resources, such as standard libraries.

We add the startup.s source file before the C source file to force the desired arrangement: vector table first, code second. The linker script remains the same, except that I've decided to name it after the device it describes, instead of changing it for every different project.

This method works just fine and produces a binary image that is 136 bytes long. Moreover, it blinks the LED, and is easier on the eyes, allowing for more experimentation and learning. Try to add code and change the timing to produce different flashing patterns; extra points for Morse coding "Cortex-M3 r0xx0rs".

Yet I am not satisfied. One of the bold promises of the Cortex-M3 was that embedded code could be crafted entirely in C. To do this, we will have to shoehorn the vector table into our C source file somehow. The upside is that we can then drop the startup.s as if it were thermally enhanced.

Add the following lines to the C source file, just before the register address definitions:

void main(void); // prototype

unsigned int * vectors[4] __attribute__ ((section("vector"))) = {
	(unsigned int *) 0x20010000,	// stack pointer
	(unsigned int *) main,		// code entry point
	(unsigned int *) main,		// NMI handler (not really)
	(unsigned int *) main		// hard fault handler (let's hope not)
};

We were forced to add a prototype for the main() function because we need to refer to its address before we've defined it. Next we spell out a simple four word table (our vector table), despite the grisly embellishments. The __attribute__ ((section("vector"))) puts this table in its own program section. This is a GNU GCC extension. We do this so that we can later tell the linker to put this valuable information exactly where we want it: at the very beginning of the binary image. This will involve altering our otherwise uncluttered linker script, which now looks like:

/* LM3S6965-rom.ld - LM3S6965 linker script */

MEMORY {
	flash (rx) : ORIGIN = 0x00000000, LENGTH = 256k
	ram (rwx) : ORIGIN = 0x20000000, LENGTH = 64k
}

SECTIONS {
	vector : { *(vector*) } > flash
	.text : { *(.text*) } > flash
}

All that's changed is to add the line vector : { *(vector*) } > flash just before the .text section. There's probably a 'more elegant' way to do this, but using the term 'elegant' in the context of linker scripts is a stretch. You might have previously noticed that I declare a RAM memory area even though we've yet to take advantage of it. We'll eventually do so when we start to use RAM-based variables, as well as using a symbolic value for the stack pointer initial value, instead of hard-coding it as 0x2001 0000.

So now we have an embedded application written entirely in C, with only a minimal makefile and mysterious linker script needed to produce a working executable image. The only downside to this approach is that none of the other internal workings of the Cortex-M3, or the peripherals of the LM3S6965 in particular, are accessible, at least not until we laboriously transliterate the required addresses and bit patterns from the datasheet to our source code. Some sort of modularity would be nice to have at this point to avoid ridiculously long source files. This will bear some thinking and I will report my conclusions after I have had time to ponder some possible designs.

Two things popped into my wee brain while I was away and ostensibly pondering. One was that the arbitrarily large number I was counting to in the C version was 1/10th the value of the assembly language constant. The second was that I had used a named variable without making any provision for where that variable should reside in system memory. These mysteries cried out for investigation. The two issues turned out to be somewhat related. In a program this small, like a town, almost everything is related in one way or another.

The first is not too mysterious, after all. I used the value of one million and noticed that the blink rate of the C version was too slow, so I lopped off a zero with no further contemplation. An eager, bouncy LED blink was my reward so I moved on, without thinking of looking any closer at the details. Looking at the disassembled code produced by the compiler, I start to see what is going on. To see the disassembled code, use the objdump utility with the 'disassemble' option, '-d':

arm-none-eabi-objdump -d blink.o

The compiler adjusts the stack pointer down by twelve, thus creating a data space for its use. It then uses r7 as a pointer to this area. This is where my trusty iterator, i, lives. This accounts for some of the extra delay, as the value must be loaded, manipulated and stored every time it is needed.

I also see that I have made the classic blunder and used an up counter instead of a down counter, even though I know better. Let's fix that first, as it is the easiest thing to do at this point. Replacing the previous delay with for(i = 1000000; i > 0; i--); results in a shorter but not perceptibly faster program. I suspect that convincing the compiler to keep i in a register will provide better performance for this application.

The GNU GCC compiler offers two types of register variables: global and local. Neither one seems to make much difference to the compiler. While it will keep the value of i in the register as directed, it still copies it to another register (usually r3) to manipulate and test it. I played with various levels of compiler optimizations (some of which optimized the delay loop out completely) but could not come up with anything as fast as the hand-coded assembly version. To be fair, it's a trivial example and not representative of a typical C program.
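One common way to stop an optimizer from deleting an empty delay loop, offered here as an alternative rather than what I did above, is to declare the counter volatile so that every decrement must actually be performed. A host-side sketch (the function name is hypothetical; it returns the pass count only so the behavior can be checked):

```c
#include <assert.h>
#include <stdint.h>

/* Busy-wait that survives optimization: every access to a volatile
 * object must be performed, so the loop cannot be removed.
 * Returns the number of passes, purely so the behavior can be checked. */
static uint32_t delay_loop(uint32_t count)
{
    volatile uint32_t i;
    uint32_t passes = 0;

    for (i = count; i > 0u; i--)   /* down-counter, as in the text */
        passes++;

    return passes;
}
```

The catch is that a volatile counter lives in memory, so each pass is the same load, modify and store that showed up in the disassembly earlier; it buys predictability, not speed.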

Just how fast is the loop, anyway? I took out the delays and measured the period of the blinking. The positive portion lasts 625 ns (5 cycles) and the negative portion lasts 750 ns (6 cycles), for a periodic rate of ~727 kHz. That's just under half the speed of the tweaked assembler version (1.6 MHz). The main difference is that the compiler reloads both the register address and the data to be written each time through the loop. With every optimization option selected (option "-O3"), the compiler only loads the register address once but still loads the data each time for the required "1" and "0".
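The arithmetic behind those figures, assuming one cycle is 125 ns (the 8 MHz clock stated earlier): the period is the sum of the high and low times, and the toggle rate is the clock divided by the cycles per period. By the same rule, the 1.6 MHz assembler figure corresponds to a five-cycle period. A couple of hypothetical helpers for redoing the sums with other loop shapes:

```c
#include <assert.h>
#include <stdint.h>

/* Toggle frequency in Hz (rounded down) of a loop that keeps the pin
 * high for high_cycles and low for low_cycles. */
static uint32_t toggle_hz(uint32_t high_cycles, uint32_t low_cycles, uint32_t clock_hz)
{
    return clock_hz / (high_cycles + low_cycles);
}

/* Duration of n clock cycles in nanoseconds. */
static uint32_t cycles_to_ns(uint32_t n, uint32_t clock_hz)
{
    return (uint32_t)((uint64_t)n * 1000000000u / clock_hz);
}
```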

While it may look like I'm picking on the compiler's lack of efficiency, I'm not. What I'm trying to illustrate is that there will be very specific instances where it will be possible to realize over 100% improvements in speed by using assembly language. One day you might need this. This state of affairs may very well change in the near future when the unique capabilities of the Cortex-M3 architecture are leveraged by the compiler. This will, no doubt, happen in the proprietary (i.e., for sale) compiler products before the public ones. The best thing about the compiler is that it will always be diligently looking for all possible optimizations, whereas I will at most be able to juggle one or two trivial ones at a time. Just remember that there are alternatives to compiler-generated code and I will be happy(er).

Using the PWM Hardware

Blinking an LED is awesome and all that but fading an LED is what all the cool kids are doing. Since the LED is wired as a digital circuit (either on or off), we can't control the brightness of the LED by adjusting the voltage or current beyond the two extremes (again, on or off). What we can adjust is the apparent brightness of the LED using a technique known as pulse-width modulation (PWM). This involves rapidly turning the LED on and off with a designated duty cycle. If the LED is on half the time and off the other half it will appear to be shining at half the normal brightness, assuming that the switching happens faster than the human eye can see. This magic frequency is anything over about thirty Hertz (30 Hz).

This trick is easy to do in software. I wrote an assembly language program that not only dimmed the LED effectively but also periodically cycled through all the brightness values. It was at that point that I decided to concentrate on writing the example programs in C, as the source code was becoming unwieldy and not as effective as a learning tool. What I'm trying to get across are the concepts involved, and not necessarily the precise implementations, when I consider the idea to be more important than its realization.

I also made a slight change to the organization of the C source file. Since the arrangement of the two program sections ("vector" and ".text") is now defined in the linker script, I could in theory place them in any order in the source file. I could (and eventually will) place them in separate files. For now I will place the vector table after the main() function. Why? Because then I can get rid of the prototype for main() at the beginning of the code. If (when) I split the file in twain, I will have to put the prototype declaration back in, but for now I'd rather not look at it.

We'll need yet another integer iterator for my next trick, so let me invite the beautiful & talented j to join us on stage:

int i, j;

The changes to the remaining code are minimal but interesting. Replace the previously defined while(1) loop with this code:

while(1) {

	for(i = 1000; i > 0; i--) {
		GPIO_PORTF_DATA = PF0; // LED on
		// without volatile, an optimizing compiler may delete these empty delay loops
		for(j = i; j > 0; j--); // "on" period delay
		GPIO_PORTF_DATA = 0; // LED off
		for(j = 1000 - i; j > 0; j--); // "off" period delay
	}
}

This example uses a nested loop, i.e., a loop within another loop. The outer loop is the for() loop that counts i down from 1,000. The "on" time and "off" time are controlled by two consecutive inner loops. First the LED is turned on. The first inner loop times the "on" period of the cycle by having j count down from i's present value to zero. Then the LED is turned off. The second inner loop times the "off" period by counting down the remainder of the fixed period, which in this case is 1,000 loop iterations, leaving 1,000 - i iterations for the "off" time. Because the two delays always add up to the same total, the PWM frequency stays constant while the duty cycle, i/1,000, shrinks with each pass.
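To check the loop arithmetic on a desktop machine, here is a host-side model of one pass of the outer loop. It is a sketch, not board code: fake_portf_data and FAKE_PF0 are hypothetical stand-ins for the real GPIO register and bit mask, and each inner-loop iteration is counted as one tick of delay instead of actually burning time.

```c
/* Hypothetical stand-ins for the real bit mask and GPIO data register. */
#define FAKE_PF0 0x01u
#define PERIOD   1000

static unsigned fake_portf_data;

/* One pass of the outer for() loop for a given i: reports how many delay
   iterations the LED spent on and how many it spent off. */
static void fade_step(int i, int *on_ticks, int *off_ticks)
{
    int j;

    *on_ticks = 0;
    *off_ticks = 0;

    fake_portf_data = FAKE_PF0;          /* LED on */
    for (j = i; j > 0; j--) {
        (*on_ticks)++;                   /* "on" period delay */
    }

    fake_portf_data = 0;                 /* LED off */
    for (j = PERIOD - i; j > 0; j--) {
        (*off_ticks)++;                  /* "off" period delay */
    }
}
```

Calling fade_step() with i = 250, for example, reports 250 "on" ticks and 750 "off" ticks, i.e. a 25% duty cycle.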

Technically, both of these loops are interior to the outermost while(1) loop. You will find that most embedded programs are contained within one big loop of one flavor or another.

This version of the ever-evolving example program turns the LED on at full brightness, then slowly fades it down to darkness, then repeats the cycle ad infinitum.

As long as this is all you want your shiny new Cortex-M3 to do, this example is the perfect solution. What if, for some reason, you wanted to do something else at the same time? You'd have to cleverly interweave whatever that task was into the existing code, taking care not to indulge in anything too time consuming lest you corrupt the illusion of dimness. This can be a very difficult thing to do. Don't despair! You are in luck. The LM3S6965 (like a lot of the other LMI parts) has dedicated PWM hardware that will be happy to handle this mundane if important task for you, automatically! We are doubly lucky, it turns out, as the LED on the EK-LM3S6965 just happens to be connected to one of the six (count 'em: six) PWM outputs. Let's take a look at what it would take to leverage this resource for our amusement & edification.

The LM3S6965 has an overall PWM control block and three individual PWM blocks (PWM0, PWM1, PWM2). Each PWM block has a single counter and two comparators (A & B) and can produce two PWM signals, for a total of six PWM outputs (confusingly called PWM0, PWM1, PWM2, PWM3, PWM4 and PWM5). This may lead to some bewilderment, so try to be careful to differentiate between PWM blocks (PWM0-2) and PWM outputs (PWM0-5).
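One way to keep the two numbering schemes straight is to treat the mapping as arithmetic. This sketch assumes the pairing described in this article (output PWM0 comes from block PWM0's comparator A, output PWM1 from its comparator B, and so on up the line); the type and function names are hypothetical.

```c
/* Which PWM block and comparator drive a given PWM output (0..5). */
typedef struct {
    int block;        /* PWM block 0..2 */
    char comparator;  /* 'A' or 'B' */
} pwm_source;

/* Even outputs come from comparator A, odd outputs from comparator B.
   Returns 0 on success, -1 if the output number is out of range. */
static int pwm_output_source(int output, pwm_source *src)
{
    if (output < 0 || output > 5) {
        return -1;
    }
    src->block = output / 2;
    src->comparator = (output % 2 == 0) ? 'A' : 'B';
    return 0;
}
```

Under this scheme output PWM0 (the LED) maps to block 0, comparator A, and output PWM1 maps to block 0, comparator B.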

The two channels from each block can be used independently or in tandem. The tandem modes are for generating synchronized pairs of complementary signals that could be used, for example, to drive motors via half-H bridges. Since we're driving a single LED and not motors at this time, we will begin by looking at the individual uses of the PWM signal and how to make it do what we want, which for now is to fade the LED up & down.

I would also like to point out that all the PWM capabilities of this part are in addition to the four general-purpose timer/counters and the dedicated SysTick timer. You don't have to choose between timer functionality and PWM unless you need more than four separate timers and six PWM channels. By contrast, many of the 8-bit Atmel megaAVR parts have six PWM outputs available, but each pair of PWM outputs is tied to one of the three available timer/counter channels, and each channel differs in design and capability.

The LED on the EK-LM3S6965, as we have discussed previously, is attached to PF0/PWM0. This means that it can be driven directly by bit 0 of GPIO Port F (which is what we have been doing so far) or by output A of PWM block PWM0. PWM0's other channel is called "B" (surprise!) and happens to be connected to the speaker on the evaluation board. We would do well to keep that in mind while setting up the PWM0 output for its LED duties.

Each PWM block (and therefore each pair of associated PWM outputs) has a single PWM timer, which is a 16-bit counter. The counter for block PWM0 is called PWM0COUNT, for some reason. This register performs the same function as our old friend i, the ever-faithful iterator from our software-only PWM example. The nice thing about dedicated hardware PWM is that for the most part we never have to do anything with or to this register. We tell it (indirectly) what frequency to use and it just magically counts all by itself, while other dedicated hardware compares the count against our target values and carries out our directives pertaining thereunto.

Each of the three PWM block counters can be configured as a down-counter or as an up-and-down-counter. The difference is a subtle but important one. You will not be able to see the difference between the two modes in our LED project, unless you attach an oscilloscope to the output and that's just cheating. In down-counter mode, the resulting waveforms are generally left- or right-aligned with the "zero" or reload event. In up-and-down mode, the waveforms are centered on the zero or reload event, depending on how you look at it. Our first example will use the down-only mode for simplicity's sake, as it is slightly easier to keep track of what is going on. When we add some speaker functionality using the other channel, we will want to shift to the up-and-down mode as it will preserve the phase of the audible signal. A phase-shifted sound is an interesting effect when you want it but loses its novelty quickly when you don't.
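The alignment difference is easier to see on paper than to describe. The following is a simplified, tick-by-tick model, not a cycle-exact description of the hardware. It assumes one possible set of output actions (in down mode, high on the load event and low on the comparator match; in up/down mode, high whenever the count is below the comparator value), and all names are hypothetical.

```c
#include <stdint.h>

/* Down mode: counter runs load, load-1, ..., 0 (load + 1 ticks). The output
   goes high on the load event and low on the comparator match, i.e. it is
   high while count > cmp. The pulse hugs the start of the period. */
static int waveform_down(uint16_t load, uint16_t cmp, uint8_t *out)
{
    int n, period = load + 1;
    for (n = 0; n < period; n++) {
        int count = load - n;
        out[n] = (count > cmp);
    }
    return period;
}

/* Up/down mode: counter runs 0, 1, ..., load, load-1, ..., 1 (2 * load
   ticks). The output is high while count < cmp, so the pulse straddles the
   zero point and is symmetric about it (center-aligned). */
static int waveform_updown(uint16_t load, uint16_t cmp, uint8_t *out)
{
    int n, period = 2 * load;
    for (n = 0; n < period; n++) {
        int count = (n <= load) ? n : period - n;
        out[n] = (count < cmp);
    }
    return period;
}

/* Number of ticks the output spends high in one period. */
static int high_ticks(const uint8_t *w, int period)
{
    int n, total = 0;
    for (n = 0; n < period; n++) {
        total += w[n];
    }
    return total;
}
```

With load = 8 and cmp = 3 in up/down mode, for example, the output is high for counts 0, 1 and 2 on either side of zero, so the pulse sits symmetrically around the zero point no matter how wide it gets. In down mode the pulse always starts at the reload event and only its trailing edge moves.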

In down-counter mode, the PWM block counter counts down to zero, then reloads itself automatically with a 16-bit value of our choosing and starts counting down again. In up-and-down mode, it first counts up to this designated value, then switches direction and counts back down to zero, where it starts counting up again. We write this value to a register called PWM0LOAD for block PWM0. You can probably figure out the other PWM block register names.
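Going the other way, from a desired frequency to a PWM0LOAD value, takes a little arithmetic. This sketch assumes the period is load + 1 counter ticks in down mode and 2 * load ticks in up/down mode (check these against the LM3S6965 data sheet before relying on them), and assumes you already know the frequency of the clock feeding the PWM counter, which depends on how the system clock and any PWM divider are set up. The function name is hypothetical.

```c
#include <stdint.h>

/* Compute the 16-bit reload value for a desired PWM frequency.
   Assumed periods: down mode = load + 1 ticks, up/down mode = 2 * load ticks.
   Returns 0 on success, -1 if the result doesn't fit the 16-bit counter. */
static int pwm_load_for_freq(uint32_t pwm_clock_hz, uint32_t freq_hz,
                             int up_down, uint16_t *load)
{
    uint32_t period_ticks, value;

    if (freq_hz == 0 || freq_hz > pwm_clock_hz) {
        return -1;
    }
    period_ticks = pwm_clock_hz / freq_hz;   /* counter ticks per PWM cycle */

    if (up_down) {
        value = period_ticks / 2;            /* counts up to load and back down */
    } else {
        value = period_ticks - 1;            /* counts load..0 inclusive */
    }
    if (value == 0 || value > 0xFFFFu) {
        return -1;                           /* too fast or too slow for 16 bits */
    }
    *load = (uint16_t)value;
    return 0;
}
```

With a hypothetical 8 MHz PWM clock, 1 kHz needs a load of 7,999 in down mode or 4,000 in up/down mode, while 100 Hz only fits the 16-bit counter in up/down mode (40,000), since down mode would need 79,999.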

Our "target" value, as mentioned before, is the counter value at which we want the PWM output to change states. By varying this value, we change the output duty cycle of the PWM signal, without changing the overall output signal frequency. Each PWM block has two comparator registers (A and B) that hold the respective target values. When the PWM block counter matches one of the comparator values, a "match" event occurs. Other events that are important to the life and times of our PWM signal are "zero" and "load" (or "reload", to be more accurate). If we were using the up-and-down mode, there would be a separate "match up" and "match down" event for each channel.
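Picking a comparator value for a given duty cycle is the last piece of arithmetic. This sketch assumes down mode with the output driven high on the load event and low on the comparator match, which is just one possible configuration; under that assumption the output is high for load - cmp of the load + 1 ticks in each period, so a larger comparator value means a dimmer LED. The function name and the per-mille duty units are hypothetical.

```c
#include <stdint.h>

/* Comparator value for a duty cycle in tenths of a percent (0..1000),
   assuming down mode, high on load, low on comparator match. */
static uint16_t pwm_cmp_for_duty(uint16_t load, uint32_t duty_permille)
{
    uint32_t width;

    if (duty_permille > 1000u) {
        duty_permille = 1000u;
    }
    width = ((uint32_t)load + 1u) * duty_permille / 1000u;  /* high ticks wanted */
    if (width > load) {
        width = load;   /* in this configuration cmp = 0 is the brightest: load of load + 1 ticks */
    }
    return (uint16_t)(load - width);
}
```

To fade the LED you would step duty_permille up or down and write each result to comparator A of block PWM0; the load value, and therefore the frequency, never has to change.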