Extending CHIPOS with more CHIP-8 instructions

16 minute read

In which I learn 6800 assembly in order to hack CHIPOS.

This post is part of a series on writing an emulator for the DREAM 6800 computer. Read the DREAM 6800 posts and look at the emulator’s repository.

The source code and binary for the modified CHIPOS (“CHIPOSLO”) can be downloaded from the CHIPOSLO GitHub repository.

OK, this is a long post. There’s a table of contents to the right. I’ll get into the nitty-gritty of writing short 6800 assembly, and even though I’m an amateur on that subject, I’m apparently better at it than writing short posts.

CHIPOS logo

Quick recap: The DREAM 6800 was created in 1978 by Michael J. Bauer, and it came with a small, 1024 byte operating system called CHIPOS. CHIPOS incorporates an interpreter for a bytecode VM called CHIP-8, which originally appeared on RCA’s CDP1802-based COSMAC VIP computer in 1977. You can read more about CHIP-8 on Wikipedia.

I’ve written a couple of CHIP-8 games myself, so I obviously wanted to see them run in my brand new emulator. However, I targeted those games for the original COSMAC VIP implementation, and that posed a problem for running them on the DREAM. This is my game Mini Lights Out in my DREAM emulator:

Mini Lights Out, broken

Undocumented CHIP-8 instructions

CHIP-8’s arithmetic and logic instructions are two bytes (like all instructions) and start with 8, followed by two nibbles that refer to two of the 16 “V” variables/registers, and ending with one nibble which denotes the operation. They look like this, where X and Y can be any hexadecimal number:

  • 8XY0: VX = VY
  • 8XY1: VX = VX OR VY
  • 8XY2: VX = VX AND VY
  • 8XY4: VX = VX + VY (VF is set to 1 if there’s a carry, and 0 if there’s not)
  • 8XY5: VX = VX - VY (VF is set to 0 if there’s a borrow, and 1 if there’s not)

Notice the inverted use of the carry flag for the subtraction. That’s how the 1802’s carry flag worked, but it’s the opposite of the way it works on the 6800, necessitating some extra work later.

For the instructions that don’t specify what happens to VF, the resulting value in VF is undefined. 1802’s ALU would mangle it.

Due to the way the original CHIP-8 interpreter dispatches these instructions to the 1802’s ALU (details in this great blog post), it actually had four undocumented instructions. They are as follows:

  • 8XY3: VX = VX XOR VY
  • 8XY6: VX = VY - VX (VF is set to 0 if there’s a borrow, and 1 if there’s not)
  • 8XY7: VX = VY >> 1 (VF is set to the bit that’s shifted out)
  • 8XYE: VX = VY << 1 (VF is set to the bit that’s shifted out)

Note that starting with the CHIP-48 and SUPER-CHIP interpreters in 1990, 8XY7 and 8XYE were changed to do VX >>= 1 and VX <<= 1 respectively, ignoring the Y operand. For that reason, modern games should use the same V register as both operands to be cross-compatible.

The remaining instructions, 8XY88XYD and 8XYF, do not elicit useful functionality.

These four undocumented instructions were discovered pretty early on, in 1978, and described in the COSMAC VIP’s newsletter VIPER (issue #2), but Bauer apparently wasn’t aware of them at the time he made the DREAM 6800. So his CHIP-8 interpreter in CHIPOS does not support them. My games, however, make heavy use of the shift instructions in particular.

This meant I couldn’t play my games in my DREAM emulator. I decided to take a little detour on my emulator development to see if I could cram them into CHIPOS. All I know about Motorola 6800 assembly is what I’ve learned from making this emulator, so this would be an interesting exercise.

The DREAM 6800’s CHIP-8 interpreter and monitor program, CHIPOS, fits in 1K of memory. Naturally, I wanted to squeeze in these extra instructions in that space. Was it doable?

CHIPOS code

You can download the CHIPOS code and ROM from Michael J. Bauer’s website.

Here’s a nice opcode table for 6800 assembly which I used as a reference during development.

The original CHIPOS interpreter uses the first nibble in the CHIP-8 instruction to index a jump table, and it jumps here when it encounters an opcode that starts with 8 (labels are Michael J. Bauer’s, comments are mine):

LETVV: LDAA VX    ; A will be used as result
       LDAB PIR+1 ; B = final byte of instruction
       ANDB #$0F  ; Pseudo-"opcode" in final nibble
       BNE  LETV1 ; 8XY0
       LDAA VY    ; VX = VY
LETV1: DECB
       BNE  LETV2 ; 8XY1
       ORAA VY    ; VX |= VY (OR)
LETV2: DECB
       BNE  LETV4 ; 8XY2
       ANDA VY    ; VX &= VY (AND)
LETV4: DECB
       DECB
       BNE  LETV5 ; 8XY4
       CLR  VF    ; VF = 0
       ADDA VY    ; VX += VY
       BCC  LETV5 ; Check carry
       INC  VF    ; VF = 1 (overflow)
LETV5: DECB
       BNE  PUTVX ; 8XY5
       CLR  VF    ; VF = 0
       SUBA VY    ; VX -= VY
       BCS  PUTVX ; VX < VY? (unsigned)
       INC  VF    ; VF = 1
PUTVX: LDX  VXLOC
       STAA 0,X   ; Replace VX with result
       RTS        ; Return

This is basically a switch/case statement with fall through. First it sets up a couple of things: It loads the argument VX into accumulator A, and the last nibble of the CHIP-8 instruction into accumulator B. Then, if B is equal to zero (instruction 8XY0), it sets A equal to VY. Otherwise, it decrements B, and if that makes it equal to zero, it will execute 8XY1 and load A with the result of OR-ing A with VY. This goes on all the way to the bottom, where it finally loads VX with A and returns. Notice how it decrements B twice after checking 8XY2, explicitly skipping the undocumented instruction 8XY3.

So, how could I squeeze in checks for 8XY3, 8XY6, 8XY7 and 8XYE here, and the code needed to execute those instructions, while not using a single byte? I tried a couple of approaches.

First try: Shaving off bytes

In the code block above, LDAB PIR+1 sets the B accumulator to the second byte of the CHIP-8 instruction (which in this case is used as a counter to execute the correct instruction based on the last nibble). I noticed quickly that many of the destinations from the major jump table did this, since a lot of instructions have operands in that last nibble. None of them expected B to hold any particular value first. So I moved the LDAB PIR+1 to occur before the jump table, and removed it from all the destinations.

This actually turned out to save me a lot of bytes. It also let me simplify that initial jump table a little; some of the jumps were to small routines, like the one for VX = KK (where KK is a one-byte constant) only loaded A with that last nibble, and then immediately branched to the PUTVX routine, for example. Now that could be a simple jump to PUTVX.

I changed some other things too. To give one (perhaps idiomatic?) example, I changed this:

EXCALL: LDAB 	PIR         ; GET INSTR FIRST BYTE
        BNE     CALL        ; IF != 0, CALL IT
        LDAA    PIR+1       ; GET INSTR LAST BYTE
        CMPA  	#$E0        ; IF 00E0 (CLEAR SCREEN)
        BEQ     ERASE
        CMPA  	#$EE        ; IF 00EE (RETURN)
        BEQ     RETDO
        RTS                 ; NOP, BACK TO FETCH
ERASE:  CLRA                ; WRITE ZEROS TO SCREEN
        LDX 	#DISBUF     ; POTNT TO DISPLAY BUFF
FILL:   STAA 	0,X         ; FILL SCREEN WITH ACC-A
        INX
        CPX  	#ENDBUF     ; DONE?
        BNE     FILL
        RTS

To this:

EXCALL: LDAB 	PIR         ; GET INSTR FIRST BYTE
        BNE     CALL        ; IF != 0, CALL IT
        CMPA  	#$EE        ; IF 00EE (RETURN)
        BEQ     RETDO
        CMPA  	#$E0        ; IF 00E0 (CLEAR SCREEN)
	BNE     RET         ; NOP, BACK TO FETCH
ERASE:  CLRA                ; WRITE ZEROS TO SCREEN
        LDX     #DISBUF     ; POTNT TO DISPLAY BUFF
FILL:   STAA    0,X         ; FILL SCREEN WITH ACC-A
        INX
        CPX     #ENDBUF     ; DONE?
        BNE     FILL
RET:    RTS

Notice how I replace two branches and a return with two branches, where one of them is just a branch to an existing return. Seems silly, but you can’t conditionally return on the MC6800, so this saves one whole byte (plus the aforementioned LDAA PIR+1).

When all was said and done, it actually looked like I had room to insert all the new cases in the original switch. It ended up looking like this:

LETVV: TAB        ; B = final byte of instruction
       LDAA VY    ; A will be used as result (preloaded with VY now)
       ANDB #$0F  ; Pseudo-"opcode" in final nibble
       BEQ  PUTVX ; 8XY0
       CLR  VF    ; VF = 0
LETV1: DECB
       BNE  LETV2 ; 8XY1
       ORAA VX    ; VY |= VX (OR)
LETV2: DECB
       BNE  LETV3 ; 8XY2
       ANDA VX    ; VY &= VX (AND)
LETV3: DECB
       BNE  LETV4 ; 8XY3
       EORA VX    ; VY ^= VX
LETV4: DECB
       BNE  LETV5 ; 8XY4
       ADDA VX    ; VY += VX
       BCS  INCVF ; VF = 1 (overflow)
LETV5: DECB
       BNE  LETV6 ; 8XY5
       LDAA VX
       SUBA VY    ; VX -= VY
       BCC  INCVF ; VF = 1 (no borrow)
LETV6: DECB
       BNE LETV7  ; 8XY6
       LSRA       ; VY >>= 1
       BCS INCVF  ; VF = 1 (overflow)
LETV7: DECB
       BNE LETVE  ; 8XY7
       SUBA VX    ; VY -= VX
       BCC INCVF  ; VF = 1 (no borrow)
LETVE: SUBB #7    ; skip 8XY8 through 8XYD
       BNE PUTVX  ; 8XYE
       ASLA       ; VY <<= 1
       BCC PUTVX  ; VF = 1 (overflow)
INCVF: INC VF
PUTVX: LDX VXLOC
       STAA 0,X   ; Replace VX with result
       RTS        ; Return

The beginning is slightly different. The A accumulator now starts off with the last nibble of the instruction, but I need A for the result that will eventually be put in VX, so I transfer it to B with a TAB. I could have changed PUTVX to use B instead of A, but a lot of routines call PUTVX, including some routines that put their results in A.

Speaking of the result that will eventually end up in VX: Many of the instructions are symmetrical concerning the operands (VX &= VY is equivalent to VX = VY & VX, for example), and as for the ones which are not, with the new shift instructions there are actually more instructions that operate on VY (VX = VY - VX, VX = VY << 1 and VX = VY >> 1) than VX (only VX -= VY), so I now do LDAA VY up top instead.

I also shortened the VX = VY instruction, and do CLR VF on the top as well – VF was changed by all the arithmetic/logic instructions in the original COSMAC VIP interpreter, even when the result was undefined, so I do it here too (hoping no CHIP-8 games for DREAM rely on the opposite, and probably being wrong). Likewise, setting VF to 1 is moved to the bottom.

So there you have it: I’d successfully shaved off as many bytes as I could in CHIPOS, and inserted the missing instructions in the handler. Then the only thing that remains is to assemble it and – oh.

1025 bytes. One byte over budget.

Take two: Dynamic programming

Try as I might, I wasn’t able to find another byte to optimize away. And not all bytes are equal; if possible, I wanted to stay compatible with CHIPOS’s subroutines and calling sequences (CHIPOS provides several of its subroutines for use by DREAM programs and games). There probably is one somewhere, but I can’t find it. If you do, please leave a comment, because the above code is much cleaner and elegant than the monstrosity I’m about to unveil.

I noticed that many of the CHIP-8 instructions here map cleanly to a single MC6800 instruction each. However, some need an operand (VY before, VX in my revised routine), so that’s two bytes, and some don’t need an operand. Hardcoded jump tables can get pretty big, since each case needs a return. I tried to make a jump table, but that approach was much larger than the 1025 byte switch/case above. (It was 1040 bytes when I abandoned it.)

Instead of a jump table, I thought: What if I had a table of the 6800 instructions that CHIP-8 maps to… And then instead of putting the operand and everything else that’s needed in there, since it will be almost the same for each case, I build a jump target in RAM? I have no idea if this wass common to do back in the day, but it ended up pretty nicely. Here’s the table:

JUMP8:  .byte $96 ;LDAA 8XY0
        .byte $9A ;ORAA 8XY1
        .byte $94 ;ANDA 8XY2
        .byte $98 ;EORA 8XY3
        .byte $9B ;ADDA 8XY4
        .byte $90 ;SUBA 8XY5
        .byte $48 ;ASLA 8XY6
        .byte $90 ;SUBA 8XY7
        .byte $44 ;LSRA 8XYE

This is just a lookup table, where entry 0 holds the byte code for LDAA (direct addressing) which is used by 8XY0, entry 1 is 8XY1 with ORAA, etc. Then I could just use accumulator B (the final nibble of the CHIP-8 instruction) to look up the correct opcode in the table and stick it somewhere.

Of course, it’s not quite that easy. As above, entries 0 through 5 take VX as an operand, 7 takes VY, and 6 and E don’t take an operand at all. I needed to put the opcode into RAM, followed by one or no operand, followed by an RTS to get back.

So I thought, what if I move VY to the memory location (in zero page, so an 8-bit address) that is equal to the byte code for RTS, and put two RTS in a row? RTS is $39. Unfortunately, CHIPOS uses that location to store all the V variables, and I didn’t want to move them in case some other programs might rely on them. I couldn’t move VY to the memory location that has the same binary value as a NOP either, since that’s $01 which is used by the interrupt vector. I don’t use the overflow flag for anything, though, and the instruction CLV (clear overflow flag) is opcode $0A. That address is used by CHIPOS as temporary storage for digit patterns, so I could move that easily. I put it at $0080, which is available scratch space.

Now I could put the opcode from the table above at address $0040 (semi-randomly selected), followed by an $0A that will either be VY’s memory location or a “NOP”-like CLV instruction, followed by a $39 for RTS. Boom!

Not so fast. What do I do about the hole in the table between entry 6 and entry E? I could just fill them with 8 bytes of garbage (or maybe a useful subroutine), but I needed to fill them with something, otherwise the lookup wouldn’t work. Unless… What’s that right below PUTVX?

RANDOM: LDAA #$C0
        STAA RNDX
        INC  RNDX+1

A routine for generating a new random number. Let’s count bytes; RNDX+1 appears right at the location of entry $E if I remove it from the table. LSRA has opcode $44, so I moved RNDX from location $002C to $0043. Tada! Of course, this means that the CHIP-8 instructions 8XY8 through 8XYD plus 8XYF (invalid instructions) would all make interpreter crash and burn utterly, but I figured that was OK.

Now we need to set VF, CHIP-8’s flag register, based on the 6800 carry flag. In the switch/case approach, I just cleared VF at the start, and set it in the cases that needed it. That cost a lot of bytes! Now that we do a single jump, we can actually do this instead, and not mangle VF when we don’t need to:

        ROR  VF    ; save VF in carry
        JSR  $0040 ; dispatch to RAM routine
        ROL  VF    ; load carry into VF

After executing the RAM routine, the carry flag will either be the one I saved before, or it’ll be the new one in those cases that CHIP-8 cares about it, since there’s a one-to-one correspondence between 6800 instructions that touch carry and CHIP-8 instructions that touch VF.

Well, almost. MC6800’s SUB instruction sets carry on overflow, while CHIP-8 clears it. We’re not out of the woods yet! This turned out to be the most costly thing to take care of, because I didn’t find a better way than to simply check for each case. Luckily I could kill two birds with one stone, and use this check to also do another thing: Switch the operand for 8XY7, which does a VX = VY - VX.

The final blob ended up looking like this:

LETVV: TAB         ; B = final byte of instruction
       LDAA VY     ; A = result (preloaded with VY)
       ANDB #$0F   ; Pseudo-"opcode" in final nibble
       BEQ  PUTVX  ; 8XY0
       LDX  #$0A39 ; Load X with the sequence VX RTS
LETV5: CMPB #$05   ; 8XY5
       BNE  LETV7
       LDAA VX     ; Swap operand to VX
       LDX  #$2F7E ; Load X with the sequence VY JMP
LETV7: CMPB #$07   ; 8XY7
       BNE  done
       LDX  #$0A7E ; Load X with the sequence VX JMP
LETV:  STX  $0041  ; Store X at address $0041
       LDX  #INVC  ; Load X with address to invert carry
       STX  $0043  ; Store X at address $0043
FINDV: INX         ; Loop to find byte in lookup table
       DECB
       BNE  loop
       LDAB 3,X    ; B = address in X, plus an offset of 3
       STAB $0040  ; Store X's operation byte at $0040
       ROR  VF     ; Store VF in carry flag
       JSR  $0040  ; Dispatch to the RAM routine we built
       ROL  VF     ; Load VF from carry flag
PUTVX: LDX  VXLOC
       STAA 0,X    ; Replace VX with result
       RTS         ; Return from this whole thing
INVC:  ROLB        ; Invert carry if needed
       INCB
       RORB
       RTS
JUMP8: .byte $9A   ; ORAA 8XY1
       .byte $94   ; ANDA 8XY2
       .byte $98   ; EORA 8XY3
       .byte $9B   ; ADDA 8XY4
       .byte $90   ; SUBA 8XY5
       .byte $44   ; LSRA 8XY6
       .byte $90   ; SUBA 8XY7
      ;.byte $48   ; ASLA 8XYE (byte found in code below)

Whoo boy. Okay. So we build a subroutine in zero-page RAM that consists of the opcode for the instruction we want at $0040, followed by $0A which is the address of VX for those instructions that require that or it’s CLV for those that don’t or it’s $2F for VY since one instruction requires that, followed by either a $39 to return or a $7E to jump to INVF. Then we call that thing.

Doesn’t look like much, but that’s not important. The moment of truth: Time to assemble this baby…

Success

1024 bytes! I’d done it!

Building instructions in RAM like a crazy person only saved me one byte compared to the switch/case, but it was one crucial byte.

Let’s fire up my Mini Lights Out game and see if all the work paid off. CHIPOS runs, and lo and behold:

Mini Lights Out, working

Beautiful. As a bonus for reading all of this you get to see the clunky UI of my emulator so far too. Time to get back to working on that!

I decided to call my version of CHIPOS “CHIPOSLO”, named after my home town Oslo, Norway and also short for “CHIPOS with Logical Operators”. Michael J. Bauer graciously released CHIPOS into the public domain, so here’s CHIPOSLO’s GitHub repository.

You can view the full diff between Michael’s CHIPOS and my modifications there (I didn’t cover all the details in this blog post).

Comments