Advanced Speed Optimization Techniques for QB

by Toshi Horie (updated March 26, 2001)

- OPTIMIZE YOUR ALGORITHM FIRST *before* you try these optimizations!
  That's usually where you get the biggest speed increases.
  
  e.g. BEFORE:
	        ' a very slowly converging formula for pi
	        ' This takes a minute or so on a fast Pentium III.
			r# = 0
			s% = 1
			FOR i& = 1 TO 90000000 STEP 2
			    r# = r# + s% * 1# / i&
			    s% = -s%
			NEXT
			pi! = CSNG(r# * 4)
			PRINT pi!
         
       AFTER:
			' a fast! constant time formula for pi
			' This is faster than you can blink.
			PI!=ATN(1)*4
			PRINT pi!

Once you've optimized the algorithm all that you can, you can start 
looking at algebraic and loop optimizations.

- The classic one is to use DEFINT A-Z.  This forces you to use 
  as many integer variables as possible.
- Use integer variables to index FOR loops.  This may require
 substitution and algebraic simplification.
 e.g.   BEFORE:
           FOR i!=0 to 0.3 STEP 0.01
              p!=i!*3
           NEXT
        AFTER:
           FOR i%=0 to 30
              p!=i%*0.03
           NEXT

          
- if your code has a lot of floating point calculations that need 
high accuracy, compile with QB 4.0.
 e.g. a raytracer

- if your code has a lot of floating point calculations that don't 
need more than 8 bits of accuracy, then definitely convert it to fixed 
point.  Even if it needs up to 16 bits of accuracy, it might be worth 
converting to fixed point, if it is being used in the main loop.
 e.g. a rotozoomer or voxel terrain.

- don't use IFs (conditional branches).  Some comparison results 
  can be directly be used in a calculation.  Note that in QB,
  a TRUE boolean expression equals -1, and a FALSE one equals 0.
  e.g.  BEFORE:
           IF a>4 THEN
              b=5
           ELSE
              b=0
           ENDIF
       AFTER:
           b=-5*(a>4)
  actually, the above example is too simple for the AFTER: version 
  to be faster.  But for more complicated expressions involving 
  multiplication and division, it can make a difference.

- buffer your reads from a file.  This is especially useful in a 
 non-disk-cached environment like DOS 4.0.
	BEFORE:  
		  DIM c AS STRING*1
		  OPEN "file.bin" FOR BINARY AS #1
		  FOR i=0 TO 10*256
		     GET #1,,c
		  NEXT
		  CLOSE #1
	AFTER:
		  DIM buffer(0) AS STRING*256
		  OPEN "file.bin" FOR BINARY AS #1
		  FOR i=0 TO 10
		     GET #1,,buffer(0)
		  NEXT
		  CLOSE #1

- use an assembler keyboard handler or INP(&H60) plus keyboard buffer 
  clearing routines instead of INKEY$.
  e.g. For a user controlled floormapper routine, this made a huge 
  difference in rending fps.
  
- for a straight QB multikey handler, don't bother to clear the 
  keyboard buffer every vertical retrace.  Instead, slow down 
  the keyboard repeat rate, and check every few frames.
  e.g. This made a huge difference in QBMKEY.BAS

- use integer division for integers
   BEFORE:
       x%=x%/y%
   AFTER: 
       x%=x%\y%

- make an integer division lookup table if there is a division
  slowing down the inner loop.

- store the results of complicated expressions in look-up tables.
  e.g. BEFORE:
            pi=ATN(1)*4
	        DO
	        FOR i=0 to 360
	           x!=100+COS(i*pi/180!)
	           y!=100+SIN(i*pi/180!)
	           PSET(x!,y!),c
	        NEXT i
	        LOOP until LEN(INKEY$)
	    AFTER
	       pi=ATN(1)*4

- if several complicated expressions in a loop has common subexpressions,
  move the common subexpressions out of the loop.       
    BEFORE:
     FOR x=1 to 32767
        FOR y=1 to 10
         c=sin(x)*30+sqr(x)+y
        NEXT y
     NEXT x
    AFTER:
     FOR x=1 to 32767
       xc=sin(x)*30+sqr(x)
       FOR y=1 to 10
         c=xc+y
       NEXT
     NEXT

- make constants CONST.  Unfortunately, you can't use transcendental 
  functions like ATN on the right side anymore.
     BEFORE:
          pi=ATN(1)*4
          piover2=pi/2
     AFTER:
          CONST pi=3.14159265358979#
          CONST piover2=pi/2
          
- unroll short loops
     BEFORE:
          FOR a=1 to 8
            POKE(a,0),a
          NEXT
     AFTER:
          POKE 1,1
          POKE 2,2
          POKE 3,3
          POKE 4,4
          POKE 5,5
          POKE 6,6
          POKE 7,7
          POKE 8,8

- partially unroll long loops

     BEFORE:
          
          FOR x=0 TO 319
              POKE x,a
          NEXT
     AFTER:
          ' this is a silly example, you should be using 
          ' MMX filling or REP STOSB at least.
          FOR x=0 TO 319 STEP 4
              POKE x,a
              POKE x+1,a
              POKE x+2,a
              POKE x+3,a
          NEXT x
         
- move junk outside of the inner loops (code movement)
    BEFORE:
       for y=0 to 199
         for x=0 to 319
            a=x*4+cos(t)
            b=y*3+sin(t)
         next
       next
    AFTER:
       for y=0 to 199
         b=y*3+sin(t)
         for x=0 to 319
            a=x*4+cos(t)
         next
       next
                       
         
- use cache sensitive programming.  This means, try to 
 access your arrays in a sequential manner if possible. 
 If not, access them in small blocks that are adjacent to 
 each other. For example, QB arrays are usually stored in 
 a column major order, so dimension your arrays as 
 vscreen(xmax,ymax) if you are doing scanline-based algorithms, 
 and only change move in the x (scanline) direction in the 
 inner loop.
   BEFORE:
	     '$DYNAMIC
	     xmax=319:ymax=199
	     DIM buf(xmax,ymax)
	     FOR x=0 to xmax
	     	FOR y=0 to ymax
	            buf(x,y)=INT(RND*256)
	         NEXT
	     NEXT
	     DEF SEG
   AFTER:
	     '$DYNAMIC
	     xmax=319:ymax=199
	     DIM buf(xmax,ymax)
	     FOR y=0 to ymax
		     FOR x=0 to xmax
	            buf(x,y)=INT(RND*256)
	         NEXT
	     NEXT
	     DEF SEG
        

- use a precalculated (canned) pseudo-random number sequence
   BEFORE:
      'main loop
      FOR i=1 TO 1000
           x=INT(RND*256)
           y=INT(RND*256)
           c=INT(RND*256)
           PSET(x,y),c
      NEXT i
   AFTER:
      'precalculation
      DIM rand(8191)
      FOR i=0 TO 8191
         rand(i)=INT(RND*256)
      NEXT i
      
      count=0
      FOR i=1 TO 1000
      	x=rand(count)
      	y=rand(count+1)
      	c=rand(count+2)
      	PSET(x,y),c
      	count=(count+3) ' AND 8192 (needed in general)
      NEXT i


- prefer array indexing over user defined TYPEs. (1)
  Warning: This makes code unreadable unless it is well commented.

- cache often-used array elements in scalar variables. (2)

- cache intermediate values into temporary variables (3)

example of both optimizations being used.
  
  BEFORE:
  
	  TYPE PtType 
	    x as INTEGER
	    y as INTEGER
	    z as INTEGER
	  END TYPE
	  TYPE TriType
	    pt1 AS INTEGER 'index of first point in points array 
	    pt2 as INTEGER 'index of second point in points array 
	    pt3 AS INTEGER 'index of third point in points array 
	  END TYPE
	  DIM points(numpoints, 1 TO 3) as PtType
	  DIM tri(numtriangles) as TriType
	  CONST screendist=200
	  CONST lightx=1,lighty=0,lightz=0
	  CALL loadobject(filename$,points())
	  
	  FOR i=1 TO numtriangles
	      V1x=points(tri(i).pt2).x - points(tri(i).pt1).x
	      V2x=points(tri(i).pt3).x - points(tri(i).pt1).x
	      V1y=points(tri(i).pt2).y - points(tri(i).pt1).y
	      V2y=points(tri(i).pt3).y - points(tri(i).pt1).y
	      V1z=points(tri(i).pt2).z - points(tri(i).pt1).z
	      V2z=points(tri(i).pt3).z - points(tri(i).pt1).z
	      
	      length1=sqr(V1x*V1x+V1y*V1y+V1z+V1z)
	      length2=sqr(V2x*V2x+V2y*V2y+V2z+V2z)
	      
		  vx = V1y * V2z - V2y * V1z
	      vy = V2x * V1z - V1x * V2z
	      vz = V1x * V2y - V2x * V1y
	      CALL normalize(vx,vy,vz)
	      brightness=vx*lightx + vy*lighty + vz*lightz
	      
	      xp1 = screendist*x1/z1
	      yp1 = screendist*y1/z1
	      xp2 = screendist*y1/z1
	      yp2 = screendist*y2/z2
	      
	      '... and so on...      
	   NEXT is  
  
  
  AFTER:

	  ' index 1 = x coordinate of point
	  ' index 2 = y coordinate of point
	  ' index 3 = z coordinate of point
	  DIM points(numpoints, 1 TO 3)
	  DIM tri(numtriangles, 1 TO 3)
	  CONST screendist=200
	  CONST lightx=1,lighty=0,lightz=0
	  CALL loadobject(filename$,points())
	  
	  FOR i=1 TO numtriangles
	      x1=points(tri(i,1),1) ' example of optimization 1
	      y1=points(tri(i,1),2) ' and optimization 2
	      z1=points(tri(i,1),3)
	      x2=points(tri(i,2),1)
	      y2=points(tri(i,2),2)
	      z2=points(tri(i,2),3)
	      x3=points(tri(i,2),1)
	      y3=points(tri(i,2),2)
	      z3=points(tri(i,2),3)
	      V1x=(x2-x1):V2x=(x3-x1)
	      V1y=(y2-y1):V2y=(y3-y1)
	      V1z=(z2-z1):V2z=(z3-z1)
	      
	      length1=sqr(V1x*V1x+V1y*V1y+V1z+V1z)
	      length2=sqr(V2x*V2x+V2y*V2y+V2z+V2z)
	      
		  vx = V1y * V2z - V2y * V1z
	      vy = V2x * V1z - V1x * V2z
	      vz = V1x * V2y - V2x * V1y
	      CALL normalize(vx,vy,vz)
	      brightness=vx*lightx + vy*lighty + vz*lightz
	      
	      xp1 = screendist*x1/z1
	      yp1 = screendist*y1/z1
	      xp2 = screendist*y1/z1
	      yp2 = screendist*y2/z2
	      
	      '... and so on...      
	  NEXT i
      
      
- use REDIM to clear a large array instead of using 
  a FOR loop to set each element to zero.
e.g.

	BEFORE:
		DIM x(32000)
		DO
			FOR i=0 TO 32000
				x(i)=0 'clear array slowly
			NEXT i 
			x(RND*32000) = 50
			x(RND*32000) = 93
		LOOP UNTIL LEN(INKEY$)     

	AFTER:
		DIM x(32000)
		DO
			REDIM x(32000) 'clear array faster
			x(RND*32000) = 50
			x(RND*32000) = 93			
		LOOP UNTIL LEN(INKEY$)     
  

- avoid multidimensional arrays
  Use array head lookup tables like in the POKE vs. PSET example for 
  faster access of single dimension arrays as multidimensional ones.
    BEFORE:
        DIM x(63,63)
    AFTER:
        DIM x(4095)
        
- don't waste an extra element.  Unlike C arrays, the declaration of QB arrays 
  specify the first and last element indicies rather than the size of the array.
  This matters when you want to make a 64KB array without using '$DYNAMIC.
     BEFORE:
         DIM x(256,256) 'allocate 66049 elements
     AFTER:
         DIM x(0 TO 255,0 TO 255) 'allocate 66536 elements
       or
         'OPTION BASE 0
         DIM x(255,255)

- Use incremental calculation instead of evaluating the entire equation 
  every loop.  This usually means multiplies will be replaced by addition.
  It's very important that you do this in any linear interpolation 
  function you use for Gouraud Shading, Texture Mapping, etc.  Most 
  line DDAs (digital difference analyzers) use this method.
e.g.  BEFORE:
          slope!=0.1
          FOR x=0 TO max
             y!=slope!*x
          NEXT x
      AFTER:
          slope!=0.1
          y!=0
          FOR x=0 to max
             y!=y!+slope!          
          NEXT x
 
- Use POKE instead of PSET
 This is a simple way to get 2x performance in graphics intensive apps.
 e.g.
   BEFORE:
     CONST xmax=319,ymax=199,scansize&=320
     FOR i=0 TO 255
       PSET(i,0),i
     NEXT i
     FOR i=0 TO 255
       PSET(i,10),i
     NEXT i
     
   AFTER:
     CONST xmax=319,ymax=199,scansize&=320
     DIM ytab&(ymax)
     FOR y=0 to ymax
       ytab&(y)=y*scansize&
     NEXT y
     
     DEF SEG=&HA000
     FOR i=0 to 255
       POKE i, i
     NEXT i
     FOR i=0 to 255
       POKE ytab&(10)+i, i
     NEXT i
     DEF SEG       
 
 
- Use INTEGER variables instead of LONGs for unsigned integers in 
  the range 0 to 65535.  This will only work when the program is compiled.
  

- PEEKing from video memory is slower than PEEKing from system 
memory.  Therefore, use double buffering when you need to do feedback effects.

- Use DEF SEG sparingly.  You don't need to DEF SEG back to the default 
  segment when you are accessing arrays in the default segment.  DEF SEG 
  only applies to PEEK and POKE and SETMEM.

- Don't use '$DYNAMIC
QB arrays in the default segment are accessed at blazing speed, because 
there is no segment switching.  However, '$DYNAMIC  puts them in different 
segments, which need extra instructions to accessed, slowing them down.  
This makes a big difference in programs that use large lookup tables 
in their inner loop.  It seems that huge arrays (allowed using the QB/AH 
command) are the slowest to access.
e.g. BEFORE:
       '$DYNAMIC
       DIM hugetable(319,199)
       FOR y=0 TO 199
       FOR x=0 TO 319
       	  xo=(x-160)\2
       	  yo=(y-100)\2
          hugetable(x,y)=xo*xo+yo*yo
       NEXT:NEXT
          
    AFTER:
       '$STATIC
       DIM hugetable1(319,99)
       DIM hugetable2(319,99)
       
       FOR y=0 TO 99
       FOR x=0 TO 319
       	  xo=(x-160)\2
       	  yo=(y-100)\2
          hugetable(x,y)=xo*xo+yo*yo
       NEXT:NEXT
       FOR y=100 TO 199
       FOR x=0 TO 319
       	  xo=(x-160)\2
       	  yo=(y-100)\2
          hugetable(x,y-100)=xo*xo+yo*yo
       NEXT:NEXT


- Use SELECT CASE instead of a bunch of ELSEIFs.
  The only exception is when one case executes much 
  more often than the others.
        BEFORE:
e.g.      IF i=1 THEN
               CALL DrawSprite
          ELSEIF i=6 THEN 
               CALL PlaySound
          ELSEIF i>9 AND i<16 THEN
               CALL Calculate(i)
          ELSE
               PRINT "."
          ENDIF
       AFTER:
          SELECT CASE i
             CASE 1
                 CALL DrawSprite
             CASE 6
                 CALL PlaySound
             CASE 10 TO 15
                 CALL Calculate(i)
             CASE ELSE
                 PRINT "."
          END SELECT

- use AND instead of MOD for MODing by a power of 2.
    BEFORE:
        a=b MOD 64
    AFTER:
        a=b AND 63

- simplify compares against zero
    BEFORE 
        IF a%<>0 THEN
        	b%=b%-1
        END IF
    AFTER:
        IF a% THEN 'note <>0 is gone 
        	b%=b%-1
        END IF

- use -x to find the negative of a number instead of -1*x.  
  This is an obvious optimization if you know that the CPU 
  has a NEG instruction, which is faster than IMUL.
  
- don't put the main loop in the main code-- put it in a SUB.
  This makes a difference in the IDE, probably because the 
  p-code interpreter has less variables to wade through when 
  you are in a SUB.


- use static storage for non-recursive SUB parameters.
  This makes very little improvement in speed, unless there are 
  tons of variables passed to a SUB.

	BEFORE:
		SUB drawcircle(x%,y%,r%)
		   'routine to draw a circle
		END SUB
	AFTER:
		SUB drawcircle(x%,y%,r%) STATIC
		   'routine to draw a circle
		END SUB
        
- pass dummy parameters to functions that take an odd number of 
 arguments in order to improve data alignment. 
 The dummy parameter is not used by the function, but is there 
 to encourage burst memory writes. This only makes a minimal 
 difference in speed.
e.g.
	BEFORE:
    	CALL drawcircle(x%,y%,r%)
    AFTER:
    	CALL drawcircle(x%,y%,r%,dummy%)
    	
- don't initialize QB array elements to zero.  Warning: this is 
a dangerous habit to get into, if you plan to use C or C++ later on.
This is because C does not initialize variables by default.
e.g.
   BEFORE:
        DIM div320(32767)
        FOR i=0 to 32767
           div320(i)=i\320
        NEXT
   AFTER:
        DIM div320(32767)
        FOR i=320 to 32767
           div320(i)=i\320
        NEXT

- for floating point, multiply by the reciprocal of a number instead of 
  dividing by a number.
    BEFORE:
      SUB normalize(x!,y!,z!)
        norm!=sqrt(x!*x! + y!*y! + z!*z!)
        x!=x!/norm!
        y!=y!/norm!
        z!=z!/norm!
    AFTER:
      SUB normalize(x!,y!,z!)
        recipnorm!=1/sqrt(x!*x! + y!*y! + z!*z!)
        x!=x!*recipnorm!
        y!=y!*recipnorm!
        z!=z!*recipnorm!
      END SUB

- Simplify comparisons using simpler monotonic functions.  Monotonic 
functions are functions that always grow upwards or always grow downwards.
For example, x^2 is a monotonic function of x, so is 2*x.  In the example, 
an expensive square root was removed by squaring both sides, since squaring 
is a monotonic function.
   BEFORE:
      dist=sqr( x*x+y*y)
      IF dist < radius THEN
          'inside circle
      ENDIF
   AFTER:
      r2=radius*radius
      distsquared=x*x+y*y
      IF distsquared < r2 THEN
          'inside circle
      ENDIF

Advanced optimizations for C++
Intel Optimization Best Practices - 1) system level tuning (disk I/O, network, memory bus overload) 2) application level tuning (data structures, APIs), 3) microarchitecture-level tuning

Thanks for Qasir, entropy, Pasco, and Eclipzer for their critiques and suggestions.