mfaktc: a CUDA program for Mersenne prefactoring

TheJudger · 2009-11-25, 16:49

[edit by LaurV (2012-Dec-28)]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created:
https://www.mersenneforum.org/mfaktc/
Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla.
Select the suitable one for you, download it, clear some exponents!
Thanks!
[end of edit by LaurV]

[edit by James Heinrich (2022-Nov-06)]
The mersenneforum mirror hasn't been updated in several years, current mfaktc builds are available here:
https://download.mersenne.ca/mfaktc/
[end of edit by James Heinrich]

Hi,

maybe a bit offtopic...

as a proof-of-concept I can do TF on my GTX 275 :)
A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct)
I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :)

Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;)

The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D

Most time is spent on the calculation if the reminder (using a simple basecase division).
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(

TheJudger

jasonp · 2009-11-25, 19:34

Quote:

Originally Posted by TheJudger

jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(

It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.

TheJudger · 2009-11-26, 17:09

Hi jasonp!

Quote:

Originally Posted by jasonp

It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.

Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?

jasonp · 2009-11-27, 02:43

Quote:

Originally Posted by TheJudger

Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?

No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.

TheJudger · 2009-11-27, 11:33

jasonp: thank you for your hints.

I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far.

Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :)

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode)
- presieving of factor candidates
- reduce the amount of data transfered from/to device
- interleave host/device code

TheJudger · 2009-12-03, 16:10

Hi,

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE!
- presieving of factor candidates: 50%
- reduce the amount of data transfered from/to device: DONE!
- interleave host/device code: DONE!

Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent.

presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse.

So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :)

There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned.
It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :(

TheJudger · 2009-12-05, 19:41

Hi,

found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8.

Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s
Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code)

I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually).

henryzz · 2009-12-06, 09:09

Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?

TheJudger · 2009-12-06, 14:24

henryzz: good idea!

Maybe a mod can move all posts related to tf on CUDA to a separate thread?

TheJudger · 2009-12-08, 09:48

Hi,

improved siever (sieving the first 600 primes):
Sieving M66xxxxxx to 2^60: 17 seconds
Sieving M66xxxxxx to 2^64: 270 seconds
Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds

I think I'm close to the first "public beta". :)
I have some ideas in my mind that I want to test/implement before that.

Regards,
TheJudger

TheJudger · 2009-12-15, 09:54

Hi,

my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :(

The good news: the 2nd attempt works fine an took only ~10 lines of code :)

sieving the first 2000 primes:
TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)

sieving the first 4000 primes:
TF M66xxxxxx to 2^64: 220 seconds :)

2009-11-25, 16:49	#1
TheJudger "Oliver" Mar 2005 Germany 2·13·43 Posts	mfaktc: a CUDA program for Mersenne prefactoring [edit by LaurV (2012-Dec-28)] As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: https://www.mersenneforum.org/mfaktc/ Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! Thanks! [end of edit by LaurV] [edit by James Heinrich (2022-Nov-06)] The mersenneforum mirror hasn't been updated in several years, current mfaktc builds are available here: https://download.mersenne.ca/mfaktc/ [end of edit by James Heinrich] Hi, maybe a bit offtopic... as a proof-of-concept I can do TF on my GTX 275 :) A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;) The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D Most time is spent on the calculation if the reminder (using a simple basecase division). jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( TheJudger Last fiddled with by James Heinrich on 2022-11-06 at 16:21

2009-12-03, 16:10	#6
TheJudger "Oliver" Mar 2005 Germany 2×13×43 Posts	Hi, Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE! - presieving of factor candidates: 50% - reduce the amount of data transfered from/to device: DONE! - interleave host/device code: DONE! Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent. presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse. So with presieving factor candidates with primes <= 11 lol I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :) There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned. It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :( Last fiddled with by TheJudger on 2009-12-03 at 16:11

2009-12-05, 19:41	#7
TheJudger "Oliver" Mar 2005 Germany 45E₁₆ Posts	Hi, found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8. Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code) I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually). Last fiddled with by TheJudger on 2009-12-05 at 19:53

2009-12-06, 14:24	#9
TheJudger "Oliver" Mar 2005 Germany 2·13·43 Posts	henryzz: good idea! Maybe a mod can move all posts related to tf on CUDA to a separate thread? Last fiddled with by akruppa on 2009-12-07 at 12:41 Reason: Postings moved

2009-12-15, 09:54	#11
TheJudger "Oliver" Mar 2005 Germany 2×13×43 Posts	Hi, my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine an took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :) Last fiddled with by TheJudger on 2009-12-15 at 09:55

2009-11-27, 11:33	#5
TheJudger "Oliver" Mar 2005 Germany 2×13×43 Posts	jasonp: thank you for your hints. I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far. Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :) Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode) - presieving of factor candidates - reduce the amount of data transfered from/to device - interleave host/device code

2009-12-06, 09:09	#8
henryzz Just call me Henry "David" Sep 2007 Liverpool (GMT/BST) 2³×3×263 Posts	Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?

2009-12-08, 09:48	#10
TheJudger "Oliver" Mar 2005 Germany 2×13×43 Posts	Hi, improved siever (sieving the first 600 primes): Sieving M66xxxxxx to 2^60: 17 seconds Sieving M66xxxxxx to 2^64: 270 seconds Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that. Regards, TheJudger

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
gr-mfaktc: a CUDA program for generalized repunits prefactoring	MrRepunit	GPU Computing	92	2024-06-02 00:10
mfakto: an OpenCL program for Mersenne prefactoring	Bdot	GPU Computing	1748	2024-05-28 17:28
The P-1 factoring CUDA program	firejuggler	GPU Computing	753	2020-12-12 18:07
mfaktc 0.21 - CUDA runtime wrong	keisentraut	Software	2	2020-08-18 07:03
World's second-dumbest CUDA program	fivemack	Programming	112	2015-02-12 22:51