The other obvious thing missing is a subroutine system.
My first thoughts along this line were rather borrrring: basically implementing a latch and copying the information to and from the stack on call and return.
However, in keeping with the easy extensibility and TTL feasibility idea, I decided to pinch an idea from the RCA 1802 microprocessor.
Rather than have a stack, I may go for multiple Program Counters (and Page Registers). Well, two.
This makes it very simple to plug in. Duplicating the program counter and paging hardware is fairly simple - you can copy most of the inputs directly, gate just the clock and preset lines, and multiplex the output lines (the 5 or 9 bit program counter). This way you have two 9-bit Program Counter/Page Register pairs, with a JK flip-flop to toggle between them.
This models - sort of - the 1802 subroutine system, which has 16 index registers, any one of which you can pick to be the program counter. Subroutines are done by loading another index register with the routine address and making that the program counter. To return, make the original one the program counter again.
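As a rough behavioural sketch of that arrangement (not the actual circuit), the two pairs and the selector can be modelled in Python. The class and method names are made up for illustration, and the split into a 4-bit page and a 5-bit counter that wraps within its page is an assumption drawn from the 5/9 bit figures above.

```python
class DualCounter:
    """Two (page, pc) pairs with a one-bit selector; a hypothetical model."""

    def __init__(self):
        self.page = [0, 0]   # page registers (assumed 4 bits each)
        self.pc = [0, 0]     # program counters (assumed 5 bits each)
        self.select = 0      # state of the JK flip-flop choosing the pair

    def address(self):
        # Output multiplexer: only the selected pair drives the 9-bit address.
        return (self.page[self.select] << 5) | self.pc[self.select]

    def clock(self):
        # Gated clock: only the selected counter advances.
        # Wrapping at 5 bits is an assumption, not stated in the text.
        self.pc[self.select] = (self.pc[self.select] + 1) & 0x1F

    def toggle(self):
        # J = K = 1 on a JK flip-flop flips the selector on the clock edge.
        self.select ^= 1


counters = DualCounter()
counters.clock()
counters.clock()
first = counters.address()    # pair 0 has advanced twice
counters.toggle()
second = counters.address()   # pair 1 has not been clocked at all
```

The point of the sketch is that the unselected pair simply sits there holding its value, which is what lets it act as a return address later.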
The simplest way of doing this switch is to make it part of the STA 0x instruction (which now becomes an extended jump remembering the return).
With two sets of program counter/page register, an STA 0x instruction would switch to the other set before doing the page jump (in T1), so that the new page is written and the PC reset in the other set. The original set is left holding the return address.
The remaining issue is that of the return, which would reset the flip flop selecting the program counter/page register. This could be done using the STA 00 instruction, easily detected using a NOR gate.
Normally you would not do a long jump to page 0 because that would be effectively rerunning the program.
So you could disable the long jump for STA 00 and have it only do the actual page write / program counter clear for STA 01-STA 0F. STA 00 would still toggle the pair selector flip flop back again, but wouldn't change the page register or the program counter, thus resuming code from where the routine called the subroutine.
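The call and return behaviour described above can be sketched in Python as follows. This is only a model under assumptions: the names are hypothetical, the program counter is assumed to have already stepped past the STA instruction by the time it executes, and STA 00 is modelled as a plain toggle of the selector.

```python
class Machine:
    """Hypothetical model of the two PC/page pairs and the STA 0x behaviour."""

    def __init__(self):
        self.page = [0, 0]   # assumed 4-bit page registers
        self.pc = [0, 0]     # assumed 5-bit program counters
        self.select = 0      # pair selector flip-flop

    def address(self):
        return (self.page[self.select] << 5) | self.pc[self.select]

    def fetch(self):
        # Assumption: the PC increments during fetch, before execution.
        addr = self.address()
        self.pc[self.select] = (self.pc[self.select] + 1) & 0x1F
        return addr

    def sta_0x(self, x):
        """Execute STA 0x for x in 0..15."""
        self.select ^= 1                  # switch to the other pair first
        if x != 0:                        # the NOR of the four bits is 1 only for x == 0
            self.page[self.select] = x & 0x0F
            self.pc[self.select] = 0      # long jump to the start of page x
        # For x == 0 nothing is written, so the pair resumes where it left off.


m = Machine()
m.pc[0] = 7
call_site = m.fetch()   # fetch the STA at address 7; pair 0 now points at 8
m.sta_0x(3)             # call: pair 1 starts at page 3, offset 0
in_sub = m.address()
m.fetch()
m.fetch()               # run a couple of subroutine instructions
m.sta_0x(0)             # return: toggle back, no page write
resumed = m.address()
```

Under these assumptions the subroutine starts at address 96 (page 3, offset 0) and execution resumes at address 8, the instruction after the call.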
This is obviously only a single level of subroutine call. But it's only 512 bytes anyway.
And we still have perfect backwards compatibility.