Teacup possible patch

ItsDrone

Teacup possible patch
January 01, 2013 09:57PM

Registered: 11 years ago
Posts: 11

I have been looking though teacup firmware and I believe I found a possible resolution to the speed calculation overflow issue.

There is also a minor change to reduce a few cpu cycles per G# line.

Diff list

The background in these ideas:

Calculation starts with the upper 16 bits and then process the lower 16 bits.

This doesn't return the exact same result as the original would have if it was 64 bit value, which would be a lot more processing.

at 65,536 the correct value would be 104,755,299, but this returns 104,752,103.
65,700; 105,280,243; 104,752,767
262,652; 1,682,587,148; 419,014,805

So as you can see, the returned value isn't completely accurate as the value goes up but it will not overflow. The 65535+ value is pre-calculated value.

I believe when it comes to those values for speed accelerations the variable would not be noticeable. ( does anyone have a setup that would clock that much per meter? )

As for the speed change.
By changing the function to a define it will force it to be inline and remove a CALL command, which consumes 4-5 clock cycles per G# instruction ( 6 instances of enqueue() just in gcode_process.c ).

Note: I have not been able to test out the speed effect as my current build ( plasma cnc table running completely stripped down TeaCup ) cannot run over 1500 without having quality issues. It is also, 2d so very little detail to bog the mpu down.

Setup:
Linux Ubuntu 10.04 32-bit
Pronterface
Teacup ( stripped, x/y only, added plasma trigger, no heater, no temperature, no z/e )
Gen7 1.4.1
ATMega644 @ 20Mhz

Let me know what ya think. I'd really like to see the sdcard/lcd setup and might work on it sometime too.

Reply Quote

Traumflug

Re: Teacup possible patch
January 02, 2013 07:12AM

Registered: 13 years ago
Posts: 7,616

Thanks a lot for the help, ItzDrone.

Regarding #define SPEED: that's not neccessary, as we always want speed smiling smiley

Regarding the stuff wrapped with this: I've put in an equivalent: [github.com] (please test yourself; I'm currently in progress of preparing Teacup for ARM, so I currently have a pretty mess)

Regarding fixing this overflow: First, I'm glad somebody pointed me out where the problem is. I heard rumors long ago, but nobody actually pointing me where to search for the problem.

Now, the unfortunate thing is, this calculation still works only if STEPS_PER_M_X = STEPS_PER_M_Y = STEPS_PER_M_Z. Because if an axis different than the X axis is the fast axis, the difference in STEPS_PER_M isn't taken into account. So we have to resort to the safe value ( = do ramp calculations all the time).

Thanks again.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

Traumflug

Re: Teacup possible patch
January 02, 2013 09:28AM

Registered: 13 years ago
Posts: 7,616

P.S.: Once I could compile again, I saw my mistake. In case you updated your sorces the last two hours, please do so again.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

ItsDrone

Re: Teacup possible patch
January 02, 2013 05:44PM

Registered: 11 years ago
Posts: 11

That works with the same result.

The main reason I had it set as a user define was due to it using more flash space and give the user the option to chose.

9 instances of enqueue in the complete code.

Done the original way it consumes 8 bytes per reference and 10 bytes for the function.

With the SPEED/inline method 12 bytes per reference.

original: 82 bytes
inline: 108 bytes

I guess it isn't as bad as I imagined it but when trying to squeeze it into an ATMega168 it might be needed.

As for the calculation, I'd suggest looking into it once the axis are redesigned to be an array to allow for proper expansion for a 6/7 axis or more extruders. The initial idea was to get it a bit more accurate than just 10,000 for all setups. I believe with the array setup then allow the user to decide the letters associated with each axis. ( I for one would like to make it X,Y,Z,U,V,W if i were to make a setup that had that many axis but allow for E, F[?] for extruder related setups )

Reply Quote

ItsDrone

Re: Teacup possible patch
January 02, 2013 11:12PM

Registered: 11 years ago
Posts: 11

Traumflug,

I did a little more digging and a little confirmation on how a snippet was ordered and came up with this change for a little more speed.

Instead of testing for every possible letter and then processing the digits, I've changed it around so that it processes the digits first and the letters second. This makes it not have to go through 110 byte's of tests, roughly 55 cpu cycles before it reaches the digit test.

Here are some test files I used to validate why this will help speed up the code.

File source: MendelMax

./Top_Vertex_X_logo_optional.gcode A-Z 81662   0-9 368578   * 141505  : 62.29 %
./Y_Rod_Clasp_4_off.gcode ........ A-Z 25929   0-9 114685   * 44585   : 61.93 %
./Lower_Vertex_Upper.gcode ....... A-Z 287576  0-9 1321574  * 500702  : 62.64 %
./Z_Motor_Mount_2_off.gcode ...... A-Z 256141  0-9 1179896  * 445627  : 62.70 %
./LowerVertex_cupcake.gcode ...... A-Z 58261   0-9 264303   * 101076  : 62.39 %
./Y_Idler_Mount.gcode ............ A-Z 65405   0-9 291736   * 113460  : 61.99 %
./Y_Rod_Setting_Jig.gcode ........ A-Z 29689   0-9 132666   * 51093   : 62.15 %
./Top_Vertex_X_4_off.gcode ....... A-Z 42949   0-9 194872   * 74292   : 62.44 %
./Top_Rod_Jig_70mm.gcode ......... A-Z 67342   0-9 308888   * 116759  : 62.66 %
./Lower_Vertex_Middle_4_off.gcode  A-Z 86394   0-9 397807   * 149739  : 62.75 %
./Z_rod_clasp_4_off.gcode ........ A-Z 35614   0-9 158416   * 61427   : 62.01 %
./Y_Rod_Mount_2_off.gcode ........ A-Z 115330  0-9 525199   * 200108  : 62.48 %
./Top_Vertex_Y_2_off.gcode ....... A-Z 43333   0-9 196092   * 74945   : 62.38 %
./LowerVertex_4_off.gcode ........ A-Z 50195   0-9 226696   * 86943   : 62.31 %
./Z_Rod_Lower_Mount_2_off.gcode .. A-Z 243787  0-9 1097486  * 424358  : 62.16 %
./Lower_Vertex_Lower.gcode ....... A-Z 412148  0-9 1905730  * 718436  : 62.76 %

character counts; 0-9 vs everything else.

Not quite 2/3rds of each file is a digit.

Teacup Diff - gcode_parse.c

Again, I have no means to test and validate this apart from comparing the decompile list. If someone is willing to test this out on a print that they know has issues with slowing down let us know.

It looks like testing for the other characters before letters might be a good idea too. There is more of those than A-Z but nothing compared to digits.

Side note: the DEBUG if statements are still tested even when #define DEBUG is not in the config. This reduces the number of tests per character received.

Reply Quote

Traumflug

Re: Teacup possible patch
January 03, 2013 05:28AM

Registered: 13 years ago
Posts: 7,616

Quote

the DEBUG if statements are still tested even when #define DEBUG is not in the config.

These are optimized out. The code size difference for this part of the patch should be zero.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

Traumflug

Re: Teacup possible patch
January 03, 2013 09:42AM

Registered: 13 years ago
Posts: 7,616

Quote

Done the original way it consumes 8 bytes per reference and 10 bytes for the function.

With the SPEED/inline method 12 bytes per reference.

Code size should be the same for a #define vs. an inline function. The advantage of the inlining is its type checking, which doesn't happen with a #define. Yes, inlining can make things a bit bigger, but it's faster regardless.

I've commited the patch part with early parsing of digits. Let's cross fingers it works smiling smiley

Also committed a pair of patches wrapping EEPROM configuration. Disabling EECONFIG in config.h reduces code size by some 600 bytes.

Last not least for today, the build process happens now in its own folder.

All on the Gen7 branch.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

ItsDrone

Re: Teacup possible patch
January 03, 2013 03:26PM

Registered: 11 years ago
Posts: 11

I would agree with you but the compiled code doesn't match up to optimizing it out.

The code snippet in question:

				case 'G':
					next_target.G = read_digit.mantissa;
					#ifdef DEBUG
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
						serwrite_uint8(next_target.G);
					#endif
					break;
				case 'M':
					next_target.M = read_digit.mantissa;
					#ifdef DEBUG
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
						serwrite_uint8(next_target.M);
					#endif
					break;

That section with DEBUG not defined:

				case 'G':
					next_target.G = read_digit.mantissa;
     758:	80 91 66 01 	lds	r24, 0x0166
     75c:	80 93 40 01 	sts	0x0140, r24
     760:	c4 c0       	rjmp	.+392    	; 0x8ea 
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
						serwrite_uint8(next_target.G);
					#endif
					break;
				case 'M':
					next_target.M = read_digit.mantissa;
     762:	80 91 66 01 	lds	r24, 0x0166
     766:	80 93 41 01 	sts	0x0141, r24
     76a:	bf c0       	rjmp	.+382    	; 0x8ea 
...
     8e6:	60 93 64 01 	sts	0x0164, r22
						serwrite_uint8(next_target.checksum_read);
					#endif
					break;
			}
			// reset for next field
			last_field = 0;
     8ea:	10 92 3c 01 	sts	0x013C, r1
			read_digit.sign = read_digit.mantissa = read_digit.exponent = 0;
     8ee:	10 92 66 01 	sts	0x0166, r1
     8f2:	10 92 67 01 	sts	0x0167, r1
     8f6:	10 92 68 01 	sts	0x0168, r1
     8fa:	10 92 69 01 	sts	0x0169, r1
     8fe:	10 92 6a 01 	sts	0x016A, r1
		}
	}

G/M jumps to 0x8ea, which is after the switch statement.

That same code with DEBUG defined:

 				case 'G':
					next_target.G = read_digit.mantissa;
     8c4:	60 91 66 01 	lds	r22, 0x0166
     8c8:	60 93 40 01 	sts	0x0140, r22
     8cc:	1a c1       	rjmp	.+564    	; 0xb02 
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
						serwrite_uint8(next_target.G);
					#endif
					break;
				case 'M':
					next_target.M = read_digit.mantissa;
     8ce:	60 91 66 01 	lds	r22, 0x0166
     8d2:	60 93 41 01 	sts	0x0141, r22
     8d6:	15 c1       	rjmp	.+554    	; 0xb02 
...
     afe:	60 93 64 01 	sts	0x0164, r22
					#ifdef DEBUG
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
     b02:	80 91 e7 03 	lds	r24, 0x03E7
     b06:	87 ff       	sbrs	r24, 7
     b08:	05 c0       	rjmp	.+10     	; 0xb14 
						serwrite_uint8(next_target.checksum_read);
     b0a:	70 e0       	ldi	r23, 0x00	; 0
     b0c:	80 e0       	ldi	r24, 0x00	; 0
     b0e:	90 e0       	ldi	r25, 0x00	; 0
     b10:	0e 94 d7 1e 	call	0x3dae	; 0x3dae 
					#endif
					break;
			}
			// reset for next field
			last_field = 0;
     b14:	10 92 3c 01 	sts	0x013C, r1
			read_digit.sign = read_digit.mantissa = read_digit.exponent = 0;
     b18:	10 92 66 01 	sts	0x0166, r1
     b1c:	10 92 67 01 	sts	0x0167, r1
     b20:	10 92 68 01 	sts	0x0168, r1
     b24:	10 92 69 01 	sts	0x0169, r1
     b28:	10 92 6a 01 	sts	0x016A, r1
		}
	}

G/M jumps to 0xb02, which is a consolidated "if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))" test, followed by another jump 0xb14, after the switch. at the very least it is two extra cycles and a jump, 4 clock cycles maybe.

And the same code section with DEBUG not defined and the #ifdef's removed:

				case 'G':
					next_target.G = read_digit.mantissa;
     8c4:	60 91 66 01 	lds	r22, 0x0166
     8c8:	60 93 40 01 	sts	0x0140, r22
     8cc:	1a c1       	rjmp	.+564    	; 0xb02 
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
						serwrite_uint8(next_target.G);
					break;
				case 'M':
					next_target.M = read_digit.mantissa;
     8ce:	60 91 66 01 	lds	r22, 0x0166
     8d2:	60 93 41 01 	sts	0x0141, r22
     8d6:	15 c1       	rjmp	.+554    	; 0xb02 
...
     afe:	60 93 64 01 	sts	0x0164, r22
					#ifdef DEBUG
					if (DEBUG_ECHO && (debug_flags & DEBUG_ECHO))
     b02:	80 91 e7 03 	lds	r24, 0x03E7
     b06:	87 ff       	sbrs	r24, 7
     b08:	05 c0       	rjmp	.+10     	; 0xb14 
						serwrite_uint8(next_target.checksum_read);
     b0a:	70 e0       	ldi	r23, 0x00	; 0
     b0c:	80 e0       	ldi	r24, 0x00	; 0
     b0e:	90 e0       	ldi	r25, 0x00	; 0
     b10:	0e 94 d7 1e 	call	0x3dae	; 0x3dae 
					#endif
					break;
			}
			// reset for next field
			last_field = 0;
     b14:	10 92 3c 01 	sts	0x013C, r1
			read_digit.sign = read_digit.mantissa = read_digit.exponent = 0;
     b18:	10 92 66 01 	sts	0x0166, r1
     b1c:	10 92 67 01 	sts	0x0167, r1
     b20:	10 92 68 01 	sts	0x0168, r1
     b24:	10 92 69 01 	sts	0x0169, r1
     b28:	10 92 6a 01 	sts	0x016A, r1
		}
	}

As the code shows, it still does the test even though DEBUG is not defined.

Reply Quote

Traumflug

Re: Teacup possible patch
January 04, 2013 06:58AM

Registered: 13 years ago
Posts: 7,616

#defines are considered in the preprocessor, the optimizer kicks in later (after unoptimized compilation). Are you compiling with optimization turned off, by chance? This would bloat Teacup a lot. Are you perhaps looking at unoptimized assembly? This would explain your findings, too.

To do such tests, I simply compare binary sizes. Same binary size = same code in the executable. At least for such small changes.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

ItsDrone

Re: Teacup possible patch
January 04, 2013 10:54AM

Registered: 11 years ago
Posts: 11

Here is the method I followed for this:

new pull:
git clone [github.com] teacup_master
cd teacup_master
git pull origin master

ln config.gen7-v1.7.h config.h
ln ThermisterTable.double.h ThermisterTable.h

adjust per previous post.
make
gedit mendel.lst

mendel.lst is being processed from mendel.elf. I sure hope by then it would be considered optimized.

A side note: I cleared out all patches and started back at the master branch and added a small M254 command to track how many times the movebuffer becomes completely empty and for how long ( pin change / scope ).

In a 1,260 line segment the movebuffer ran out 18 times. The code sample used is the first 1260 lines of MendelMax's LowerVertex_4_off.stl passed through slic3r with the default configuration. (home axis, set temp, wait for temp all removed to test without physical setup)

I have not tracked down what is causing this but changing the movebuffer size doesn't effect it. 4, 8, 16 and it does the exact same number of buffer overruns.

As for how long it is waiting for a new command to be ready I have it set the pin as soon as it detects the buffer is empty ( end of void next_move() ) and stays high until a command reaches enqueue_home().

I noticed there is a large wait period at the start of a print and another large wait period at the end running roughly 15-20ms long. Throughout the print there are seemingly repeatable spots were the buffer runs out for 800uS to 1mS. That's 16,000 to 20,000 commands of waiting at 20Mhz.

For my setup I have to force a BAUD of 38400 as 115200 doesn't work with my improvised usb2serial adapter. That would consider 3.5mS to send a 15 byte command.

Even with adding the few speed related changes didn't effect the outcome of this.

I'm thinking there might be a different issue at hand.

Reply Quote

ItsDrone

Re: Teacup possible patch
January 04, 2013 02:40PM

Registered: 11 years ago
Posts: 11

It looks like I was jumping a head of myself. The spikes for when the queue completely emptied out was due to the rehoming commands waiting for the queue to empty out before it would continue.

I'm not sure if that is a reason for flaws or not major enough to cause an issue.

If someone has a gcode file that has caused issues with slowing down while running Teacup please let me know. I'd like to see the issue first hand so I can attempt to track it down.

Reply Quote

Traumflug

Re: Teacup possible patch
January 05, 2013 08:37AM

Registered: 13 years ago
Posts: 7,616

Quote

I [...] added a small M254 command to track how many times the movebuffer becomes completely empty and for how long ( pin change / scope ).

Actually measuring performance is always an excellent idea! Can you make a patch which adds this as another DEBUG option?

A bigger buffer doesn't make G-code parsing or the serial line faster, it just helps bridging bigger "gaps" where serial line / G-code parsing can't keep up with demand. I experience such gaps when feeding G-code by a shell script (sender-mac.sh in the repo).

To push the envelope, you have to raise printing speed. Like changing F2000 to F3000. This will raise the demand on serial line and G-code parsing speed. You'll get maximum serial line performance when using the XON/XOFF protocol (see config.h) instead of Pronterface' waiting for "ok". In case you exhaust step pulse generation capabilities, which is somewhere at 12'000 steps/second on 16 MHz electronics, you can get back into business by lowering the steps/mm values.

A feedrate override shouldn't be too difficult to implement, either. The G-code wiki page appears to recommend M220 for this: [reprap.org]

Before I forget it: I put new stuff, like your patches, always on the Gen7 branch. Simply consider the Gen7 branch to be the development/experimental branch.

Generation 7 Electronics	Teacup Firmware	RepRap DIY

Reply Quote

Phizinza

Re: Teacup possible patch
January 17, 2013 01:39AM

Registered: 12 years ago
Posts: 160

Traumflug Wrote:
-------------------------------------------------------
> A feedrate override shouldn't be too difficult to
> implement, either. The G-code wiki page appears to
> recommend M220 for this:
> [reprap.org]
> r_override_percentage

I have to admit, I don't understand much of this thread. But I just wanted to say that this M220 feature would be highly appreciated by me!
Will be keeping my eye out on Teacup hoping it is included soon!

Thank you!

Reply Quote