The high load test problem - thoughts?

Discussion in 'Development' started by Fuserleer, May 26, 2017.

  1. Fuserleer

    Fuserleer Radix Founder Staff Member

    Calling for thoughts on this "problem" and how to solve it.

    I'm sure you're all aware of the scalability possibilities that we have, and if you weren't already, you've no doubt seen the claims I've made on Twitter the past few days.

    On the back of those claims we now have to prove it, which is fine; I want to prove it, and do so in a public-facing test net (not some closed-off lab).

    But there is a problem.

    Due to the way that Radix scales, no single machine will be able to store all the transactions at high load. We'll have multiple machines serving different partitions, so while collectively they may be processing 100k TPS, each of them may only be reporting 2k TPS.

    I've recently done load testing during development, exceeding 70k TPS across 8 of my own machines, but none of them, not even my fastest, would have any chance of storing all of those transactions and keeping up.

    Even tricks like turning off signature validation (since we know all nodes are honest) still wouldn't get a single machine anywhere near.

    So I'm calling for ideas on how to run a high load test and report it unquestionably, over a number of machines, which as a network are processing say 100k TPS but individually are only processing a fraction of that.

    100k TPS is also too many for a single machine to produce (and broadcast over a single connection) so transaction production also needs to be over multiple nodes.

    Ideas?
     
    CryptoScalper likes this.
  2. bidji29

    bidji29 Beta Testers

    What about an "overwatch" for the partitions?

    It would not do any transaction work, just watch all the machines/partitions and check their TPS.
    It would check the TPS of all machines on the same partition and report only the highest one.
    By doing that for all the partitions and adding them up, you get the magic number.


    Now, how the overwatch verifies that a machine is reporting its real TPS and not spoofing a number, I don't know.
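    A minimal sketch of that aggregation rule, assuming honest nodes and hypothetical `(partition, node, tps)` report tuples (none of these names come from Radix itself):

```python
from collections import defaultdict

def aggregate_tps(reports):
    """Turn per-node TPS reports into a network-wide figure.

    reports: iterable of (partition_id, node_id, tps) tuples.
    Nodes on the same partition see the same transactions, so we
    keep only the highest report per partition, then sum across
    partitions to get the network total.
    """
    best = defaultdict(float)
    for partition_id, node_id, tps in reports:
        best[partition_id] = max(best[partition_id], tps)
    return sum(best.values())

# Two nodes watching partition p0, one watching p1:
samples = [("p0", "n1", 1900.0), ("p0", "n2", 2100.0), ("p1", "n3", 2000.0)]
print(aggregate_tps(samples))  # 4100.0
```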
     
  3. Fuserleer

    Fuserleer Radix Founder Staff Member

    Might work, though the main problem there is making sure that whatever machine is the "watcher" has connections to all nodes so it can report the most accurate TPS.

    Not too hard if we only have 20 or so, but with 100s or more, that becomes an issue.

    I don't think we have to worry about that too much for a test net. We should be able to assume that all nodes are honest.
     
  4. hamiltino

    hamiltino Beta Testers

  5. Jazzer

    Jazzer The Dutch connection Staff Member

    We can do some setup off-line, right? And it doesn't have to be super long, just sustain it for an hour should be plenty of evidence. So I would approach this as a classic testcase:
    • Construct a rich node/transaction set of 1h x 100k tps = 360 M tx offline.
    • Look for a sufficiently large set of service nodes that wish to join the test.
    • Look for a sufficiently large set of simulated enduser nodes (SEN).
    • Distribute the 'orders' to the SEN. Thus each SEN gets a work packet of tx requests to issue, with timestamps for when to issue them.
    • Bring service node network online, setup correct initial conditions (ie distribution of funds matches start of test).
    • Connect all the SEN
    • Check everyone is online, synced on time and ready to start.
    • Let her rip.
    • After 1h, verify that the network's final state passes the test.
    This would be a test like you'd do in the lab, but executed in the real world. Thus, anyone could join.

    Because you're working with real nodes, you're going to see SENs dropping out, which the test needs to be robust against. Thus, I would suggest defining the work packets loosely. I.e., one component could be something like "at time 0 + 10 min, check balance on acct X. Spread it out to address list Z in 10000 tx over 10 min."
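    A loosely defined work packet along those lines could be sketched as follows (a sketch only; every field name here is hypothetical, not part of any Radix API):

```python
from dataclasses import dataclass

@dataclass
class WorkPacket:
    """One SEN's share of the load. Defined loosely, so a SEN that
    drops out only costs the network its own packet."""
    start_offset_s: int   # seconds after global test start
    source_account: str   # account to check/spend from
    targets: list         # address list to spread funds across
    tx_count: int         # transactions to issue in total
    duration_s: int       # window to spread them over

    def rate(self):
        """Sustained tx/s this SEN must produce."""
        return self.tx_count / self.duration_s

# "At time 0 + 10 min, spread out to list Z in 10000 tx over 10 min":
packet = WorkPacket(start_offset_s=600, source_account="X",
                    targets=["Z1", "Z2"], tx_count=10000, duration_s=600)
print(packet.rate())  # about 16.7 tx/s per SEN
```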

    We could start small and scale up the test once the test infrastructure is verified. Maybe we don't need a full hour either, this stuff either works or it doesn't. If this is expensive to run we could cut the time down a bit.

    This test will be a milestone that people are going to make movies about later. The key thing you've done is implement a working network partitioning scheme. That's exactly what everyone else is struggling so hard with due to the inherent limitations of block chain and DAG. FABRIC is a monster, can't wait to see it gobble up those tx..
     
    skywave and Fuserleer like this.
  6. tesslerc

    tesslerc Beta Testers

    Best option as I see it:
    1. Create a client that just spams at the maximal rate.
    2. Run the test for X hours and then stop everything.
    3. Allow your node to sync up to that point.

    Once you have synced (I assume you know the timestamps of the transactions), you can select a large enough time frame in which there was high load. Check the average TPS and the peak TPS (peak being over a small time frame of several minutes, for instance).
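    Computing those two figures from the synced timestamps could look roughly like this (a sketch, assuming Unix timestamps in seconds; the window size is just an example):

```python
def tps_stats(timestamps, window_s=300):
    """Average TPS over the whole span, and peak TPS over a
    sliding window of window_s seconds."""
    ts = sorted(timestamps)
    span = ts[-1] - ts[0] or 1          # avoid division by zero
    average = len(ts) / span
    peak, lo = 0.0, 0
    for hi in range(len(ts)):
        # shrink window until it covers at most window_s seconds
        while ts[hi] - ts[lo] > window_s:
            lo += 1
        peak = max(peak, (hi - lo + 1) / window_s)
    return average, peak

# 11 transactions, one per second over 10 seconds:
avg, peak = tps_stats(range(11), window_s=5)
print(avg, peak)  # average is 1.1 tx/s, windowed peak 1.2 tx/s
```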

    *** But for this kind of test, we need clients that can connect properly without port forwarding etc. (like in the previous versions); otherwise many won't be able to participate.
     
  7. jonas452

    jonas452 Beta Testers

    How about just running a big-ass test net for a fixed period of time, then sending all verified transactions to a centralised database after the test (not during), and then calculating how many transactions were made across the entire network?

    To me this seems the easiest option to prove we can handle the load. Downside: we can't prove it live...
    Just my thoughts, or am I completely wrong in my thinking?
     
  8. bidji29

    bidji29 Beta Testers


    You could just hardcode the "watcher" IP/address in the public test client. If the test clients only report their TPS once a second or so, the watcher could keep up.

    I think we should really go big on this test and get hundreds or thousands of machines.

    A normal AWS account can only launch around 20 instances at the same time, but some people don't have those restrictions and can literally get thousands of machines up in an hour.

    I personally have an account with a 200-instance limit, and I don't mind throwing hundreds of dollars at it if it allows us to get insane TPS.
     
    Last edited: May 26, 2017
  9. danisapfirov

    danisapfirov Beta Testers

    It doesn't seem possible to show tx speed as it happens, so maybe it is a good idea to report the network load per hour:
    0:00 - 01:00; 01:00 - 02:00... 23:00 - 00:00.

    1. Synchronize the time of the reports, as the machines are scattered across different time zones. Make sure the report is based on GMT.
    2. After every challenge or tx, have the node update the number of transactions in the current time frame; the final number gets reported at the top of the hour.
    3. Somehow the node should store this info in a local database and send it over the network from time to time.
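    The hourly bookkeeping in step 2 could be sketched like this, assuming each node keeps the Unix timestamps of the transactions it processed (bucket labels are GMT hours):

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_load(epoch_timestamps):
    """Count transactions per GMT hour, labelled 'HH:00-HH:00'."""
    buckets = Counter()
    for t in epoch_timestamps:
        hour = datetime.fromtimestamp(t, tz=timezone.utc).hour
        buckets["%02d:00-%02d:00" % (hour, (hour + 1) % 24)] += 1
    return dict(buckets)

# Two tx in the first GMT hour, one in the second:
print(hourly_load([0, 10, 3600]))  # {'00:00-01:00': 2, '01:00-02:00': 1}
```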
     
  10. ulfarsgco

    ulfarsgco New Member

    I don't really know anything about network load testing, but as others have already said, you could forge a sufficient amount of transactions offline and, when ready, send them all online (from only one or two devices)?

    For the nodes validating the transactions, if needed I can set my internet box to bridge mode, thus disabling the port forwarding. I'm not sure my ISP will let me have several IP addresses, but I can connect at least one device. Let me know if it is useful.
     
  11. jonas452

    jonas452 Beta Testers

    Do you have a solution for your problem yet? Any idea when we will go for the magic 500K?
     
    fyyt likes this.
  12. Fuserleer

    Fuserleer Radix Founder Staff Member

    Yeah, I've got a solution sorted... it's kind of a mix of a few of the suggestions in here :)

    Going to wrap up what I'm working on now then get back to partition stuff so we can run this test.
     
    Mario, Collett, Lloyd and 2 others like this.
  13. Sharky

    Sharky Founders Staff Member

    Why is it a problem in the first place? We have been challenged by IOTA to prove it, but no IOTA node has the slightest idea of the current network TX/s. They get the results not live, but only when creating a checkpoint.
     
  14. Mario

    Mario Beta Testers

    Why shouldn't Radix be better at measuring?
     
    jonas452 likes this.
  15. Sharky

    Sharky Founders Staff Member

    Because measuring a partitioned network is only possible through centralization (a reporting server) or database parsing after the fact.
     
  16. tesslerc

    tesslerc Beta Testers

    For measuring's sake, it is more than fine to do this -- it obviously won't be in the final version.
    Even if not to show the other cryptos what Radix is worth, at least to see for ourselves what it is capable of.
     
  17. Snail2

    Snail2 Beta Testers

    Where is the bottleneck (while processing the transactions or when storing all that stuff)?
     
  18. Lloyd

    Lloyd Founders Staff Member

    When you have 100 nodes, each with separate or full sets of partitions, how would even the full nodes know how fast the nodes with specific partitions are processing? The only real way seems to be after the fact, over the period you're checking the tx/s for.
     
  19. Shannon B

    Shannon B New Member

    Just wondering what transactions per second you are up to? And will you get to the 100k+ transactions per second quoted on the website before public launch?
     
