
Node failing to solidify milestones correctly #1655

Open
DyrellC opened this issue Nov 8, 2019 · 4 comments


@DyrellC
Contributor

DyrellC commented Nov 8, 2019

Bug description

When trying to solidify over wide gaps, the node may fail to solidify properly. This manifests in one of two ways. Either the node hangs on the same milestone, producing output such as:

Solidifying milestone #100 [20 / 397]
Solidifying milestone #100 [20 / 398]
Solidifying milestone #100 [20 / 399]
Solidifying milestone #100 [20 / 399]

or the solidifier fails to print a message at all, and the latestSolidMilestone remains the same while the latestMilestone continues to grow. In the latter case there is sometimes only one milestone in the unsolidMilestonesPool, so no output is printed. This reduces to the former scenario: the same milestone is requested over and over, and no further milestones are added to the unsolidMilestonesPool.

When investigating the milestones that were failing to solidify, it appeared that some of the transactions were present in the db but were not marked as milestones, because the other milestone transaction in the bundle was not solid. In other instances the milestone was never found through the transactionValidator.checkSolidity call. This may be caused by a milestone being left behind and effectively orphaned during large spam events or splitting; in these cases the milestone would not be found when solidifying backwards through the tangle.
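As a quick way to tell these two failure modes apart, a diagnostic along these lines can be run against the db (a minimal sketch: TransactionViewModel.fromHash, getType, PREFILLED_SLOT, isSolid and isMilestone are IRI 1.8.x identifiers to the best of my knowledge; the wrapper method, output and elided imports are illustrative only):

// Hypothetical diagnostic: report why a given milestone hash is stuck.
void reportMilestoneState(Tangle tangle, Hash milestoneHash) throws Exception {
    TransactionViewModel tvm = TransactionViewModel.fromHash(tangle, milestoneHash);
    if (tvm.getType() == TransactionViewModel.PREFILLED_SLOT) {
        // never persisted: backwards solidification will keep requesting it
        System.out.println(milestoneHash + " is not in the db");
    } else {
        // present but possibly never flagged as a milestone because its bundle is unsolid
        System.out.println(milestoneHash + " solid=" + tvm.isSolid()
                + " milestone=" + tvm.isMilestone());
    }
}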

IRI version

v1.8.2

Hardware Spec

Linux Mint, 8 GB RAM, 4 CPUs, 2 × 160 GB SSD

Steps To Reproduce

Testnet

  1. Start one node with https://s3.eu-central-1.amazonaws.com/iotaledger-dbfiles/dev/SyncTestDB.tar and another with https://s3.eu-central-1.amazonaws.com/iotaledger-dbfiles/dev/EmptyDB.tar. Make sure the nodes are already neighboured. For faster syncing, add extra solid nodes to the mix.
  2. Start nodes with the following configuration
    java -jar iri-1.* -p 14265 -t 15600 --zmq-enabled true --zmq-port 5556 --testnet true --testnet-coordinator EFPNKGPCBXXXLIBYFGIGYBYTFFPIOQVNNVVWTTIYZO9NFREQGVGDQQHUUQ9CLWAEMXVDFSSMOTGAHVIBH --testnet-no-coo-validation true --milestone-start 0 --mwm 1 --remote true --remote-limit-api "" --snapshot ./snapshot.txt -n 'your.neighbours.here'
  3. Issue a milestone using python milestone.py -i 1001 from https://github.com/DyrellC/iri-regression-tests/tree/add-sync-tests/Nightly-Tests/Sync-Tests to kick-start the solidification
  4. Wait for the node to finish "syncing"

Mainnet (Doesn't always happen)

  1. Start up a node from a couple hundred milestones behind
  2. Let node try to sync
  3. Watch it spin out (Sometimes)

Expected behaviour

Nodes should synchronise properly.

Actual behaviour

Nodes hang on solidifying specific milestones.

@DyrellC
Contributor Author

DyrellC commented Nov 8, 2019

@karimodm and I discussed another issue with solidification within the LatestMilestoneTrackerImpl, which could be the culprit for failed milestone issuance from the coordinator in devnet. As is, the collectMilestoneCandidates call pulls all transactions with the coordinator address and scans through them. However, the ordering of the pulled set is randomised, and the maximum number of transactions to analyse before stopping the scan is currently set to 5000. With ~1.3 million milestones and the randomised ordering of the candidates, each time the collectMilestoneCandidates call is made there is only a 0.38% chance (5000 / 1,300,000) that the newest milestone is present in the first 5000 analysed transactions; see the sketch below.

You could improve the probability of the new milestone being found by increasing the maximum number of analysed transactions, but this isn't a solid solution. I proposed to @achabill that a possible fix would be to order the set of transactions by the attachment timestamp associated with each hash, so that the milestone is very likely to appear within the first X transactions analysed. This would, however, increase the processing requirements for the scan, because each hash's transaction has to be pulled from the db to sort by timestamp. The increase in processing per scan should be offset by the reduced time needed to find new milestones.
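To make the numbers concrete, here is a minimal sketch of the scan described above (the loading call and the per-candidate handler are assumed to mirror IRI 1.8.x; this is not the tracker's exact code, and imports are elided):

// The candidate hashes come back in effectively random order, so cutting the
// scan off after MAX_CANDIDATES_TO_ANALYZE entries only catches the newest
// milestone with probability ~5000 / 1,300,000 ≈ 0.38% per pass.
static final int MAX_CANDIDATES_TO_ANALYZE = 5000;

void scanCandidates(Tangle tangle, Hash coordinatorAddress) throws Exception {
    Set<Hash> candidates = AddressViewModel.load(tangle, coordinatorAddress).getHashes();
    int analyzed = 0;
    for (Hash hash : candidates) {
        if (analyzed++ >= MAX_CANDIDATES_TO_ANALYZE) {
            break; // the newest milestone usually falls past this cut-off
        }
        processMilestoneCandidate(hash); // per-candidate handler, name assumed
    }
}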

@achabill
Contributor

achabill commented Nov 11, 2019

Considering the sorting proposal, we could sort the hashes in AddressViewModel and use the list of TransactionViewModel in collectNewMilestoneCandidates.

AddressViewModel

static List<TransactionViewModel> loadSorted(Tangle tangle, Hash address) throws Exception {
    Set<Hash> hashes = AddressViewModel.load(tangle, address).getHashes();
    List<TransactionViewModel> transactions = new ArrayList<>(hashes.size());
    for (Hash hash : hashes) {
        transactions.add(TransactionViewModel.fromHash(tangle, hash));
    }
    // newest first, so the latest milestone is among the first candidates scanned
    transactions.sort(Comparator.comparingLong(TransactionViewModel::getAttachmentTimestamp).reversed());
    return transactions;
}

We would also have to change milestoneCandidatesToAnalyze and related variables to hold TransactionViewModel instead of Hash, and then call processMilestoneCandidate(TransactionViewModel tvm) directly.
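Under the same assumptions, the consuming side in the tracker might then look roughly like this (names follow this thread rather than IRI's exact code):

// Refill the analysis queue from the timestamp-sorted list so the newest
// candidates are scanned first (names per this thread, assumed).
Deque<TransactionViewModel> milestoneCandidatesToAnalyze =
        new ArrayDeque<>(AddressViewModel.loadSorted(tangle, coordinatorAddress));

int analyzed = 0;
while (analyzed++ < MAX_CANDIDATES_TO_ANALYZE && !milestoneCandidatesToAnalyze.isEmpty()) {
    processMilestoneCandidate(milestoneCandidatesToAnalyze.pollFirst());
}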

achabill self-assigned this Nov 11, 2019
@GalRogozinski
Contributor

If the above solution is quick and works, we can do it.
Otherwise, we can consider the following:
#1447 (comment)

@GalRogozinski
Contributor

Even though we merged a solution, I will be closing this once #1674 is done.
We may revert the changes of #1660 as a result.
