Reliability issues at competition - Any overall system architecture resources?

  • #1

    Our team just ended our season at the state competition. We did much better this year than last, making it to the semi-finals.

    However, we are now trying to understand what happened to our team, as well as loads of other teams, at competition. We had a lot of issues earlier in the season with the REV 2M distance sensors. They would randomly just not respond with valid data. Once we updated to SDK 5.4, the problem went away.

    Or so we thought....

    We had developed a 43-point autonomous, 48 points with an alliance partner that simply parked on the line. Our robot used Vuforia to look at 2 stones; if one of them was the skystone, it noted its relative position and ran the auto from that point assuming those positions. We used the distance sensors to find the foundation, to assess and trim robot position in the lane at both ends, to trim the position when finding the second stone, and to trim again into the lane. Our robot used a claw, so we used the IMU for the 90 degree turns, and also to drive and strafe along a heading.
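
    For driving and strafing along a heading, the IMU heading-hold amounts to something like the sketch below. This is a simplified sketch, not our exact code; the motor names, sign conventions, and the 0.02 gain are placeholders, and it assumes a BNO055 configured as "imu" plus a mecanum drivetrain.

    Code:
    // Simplified sketch of IMU heading-hold, not our exact code. Assumes a
    // BNO055 configured as "imu" and four mecanum drive motors already mapped
    // as fields in the OpMode; the gain and motor signs are placeholders.
    double headingDeg() {
        return imu.getAngularOrientation(
                AxesReference.INTRINSIC, AxesOrder.ZYX, AngleUnit.DEGREES).firstAngle;
    }

    void strafeAlongHeading(double strafePower, double targetDeg) {
        double error = AngleUnit.normalizeDegrees(targetDeg - headingDeg());
        double correction = 0.02 * error;              // proportional heading hold
        frontLeft.setPower(  strafePower + correction);
        frontRight.setPower(-strafePower - correction);
        backLeft.setPower(  -strafePower + correction);
        backRight.setPower(  strafePower - correction);
    }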

    With red and blue, and 3 possible locations for the randomization, it meant 6 permutations. During tweaking right before State competition, the team asked "how much testing should we do?" I told them to set a goal of at least 5 runs for each, or 30 total runs with no crashes or errors. The team did it. It was working great.

    However, during competition at States, our autonomous only worked 3 out of 6 times. We went from 30 out of 30 to 3 out of 6, not good. Here is a summary:

    - First auto that failed: Vuforia did not select the right skystone. Our robot's auto moves forward to get the 2 end stones into the field of view. Through a lot of testing under different lighting, we found the system would recognize the skystone (if it was one of the two) within 500 milliseconds. So our program had an elapsed-time timeout of 750 ms: if it didn't see a skystone within 750 ms, it knew the 3rd stone was the skystone (a rough sketch of this timeout logic appears after this list). In the failed auto, stone #1 was the skystone, but the program didn't recognize it, so it grabbed a plain stone at the #3 position. Then once auto was over and teleop started, our robot was about useless: jerky movement, lag in responding. The drivers reported after the match that the ping was jumping between 500 and 1000 ms!

    - Second failed auto was due to one of the REV sensors not responding. Due to the unreliability we had seen previously, I had the team put in safety stops: if the sensor reports an out-of-bounds number or nothing, substitute a value that at least keeps the robot from running over to the competing alliance and drawing penalty points (a sketch of that clamp also appears after the list). Our robot crashed into the bridge, but no damage was done; we just didn't get the points.

    - Third auto failure was the dreaded "IMU Not Found", after the auto started. Since we use the IMU for our turns, on the first turn our robot simply went in a circle for 30 seconds.
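
    For anyone curious, the 750 ms timeout logic in the first failure was conceptually like this sketch (not our literal code; "stoneTarget" stands in for the SkyStone "Stone Target" trackable, already loaded and activated, and the position handling is simplified):

    Code:
    // Sketch of the 750 ms skystone detection timeout (simplified, not our
    // literal code). Assumes the SkyStone Vuforia trackables are loaded and
    // activated, with "stoneTarget" being the Stone Target trackable.
    ElapsedTime timer = new ElapsedTime();
    int skystonePosition = 3;   // nothing seen in time -> assume the unseen 3rd stone

    while (opModeIsActive() && timer.milliseconds() < 750) {
        VuforiaTrackableDefaultListener listener =
                (VuforiaTrackableDefaultListener) stoneTarget.getListener();
        if (listener.isVisible()) {
            // In practice you would use the reported translation to decide
            // whether stone #1 or #2 is the skystone; "1" is just a placeholder.
            skystonePosition = 1;
            break;
        }
    }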

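    And the distance-sensor safety stop from the second failure amounted to something like this (again a sketch, not our exact code; the sensor name, the 200 cm bound, and the 30 cm fallback are placeholder values):

    Code:
    // Sketch of the distance-sensor safety stop (placeholder name and limits).
    // Assumes: DistanceSensor wallSensor = hardwareMap.get(DistanceSensor.class, "wallSensor");
    double readWallDistanceCm() {
        double d = wallSensor.getDistance(DistanceUnit.CM);
        // The sensor returns NaN or a huge out-of-range number when it has no
        // valid reading; substitute a conservative fallback so the robot stops
        // short instead of driving toward the other alliance on bad data.
        if (Double.isNaN(d) || d <= 0 || d > 200) {
            return 30.0;   // safe fallback distance in cm
        }
        return d;
    }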

    While our team was dejected, there was lots of company: most teams had some sort of robot drama. The obvious culprit is the huge number of wifi devices at the event. While we were sitting idle in the pits, I often took note of the ping times and saw wildly jumping numbers, typically between 25 and 75 ms at the low end, sometimes jumping into the hundreds of ms. (While sitting quietly in our school today, I noted the typical ping is 2-3 ms, jumping to 15 or so about every 5 seconds, but back to 2-3 again...)

    I would like to understand what happened so I can better plan for the future. Are there any resources that clearly define the overall system architecture and the interaction of all the devices in the chain? How are the APIs constructed, and what are the interdependencies of the elements?

    Obviously the long latencies between the driver and robot phones over wifi are screwing up our auto, but that shouldn't be possible if they architected the system correctly (obviously it isn't designed properly). For teleop, naturally you are using the driver station to control the robot, but the only valid purpose the driver station has during auto is to start it, and to stop it in the event of an emergency. In other words, the driver/robot link doesn't need to be "in the loop" during auto; you could revert to a periodic poll approach to ask if the driver requests a stop. Or am I not understanding the system correctly? And I have never seen a rogue auto "killed" by the refs, so what is the point?

  • #2
    First auto that failed: Vuforia did not select the right skystone. Our robot's auto moves forward to get the 2 end stones into the field of view. Through a lot of testing under different lighting, we found the system would recognize the skystone (if it was one of the two) within 500 milliseconds. So our program had an elapsed-time timeout of 750 ms: if it didn't see a skystone within 750 ms, it knew the 3rd stone was the skystone. In the failed auto, stone #1 was the skystone, but the program didn't recognize it, so it grabbed a plain stone at the #3 position.
    A couple things here. First, the image target on the SkyStone was a very poor choice. It has basically no contrast at all, which is very bad for feature matching, which is how Vuforia works. Secondly, I think the only reason FIRST is pushing Vuforia is that it's developed by PTC, one of the program sponsors. You'll notice that in FRC, image targets are made of retroreflective tape, which glows with strong contrast when light is directed at it, similar to how a stop sign glows brightly in your headlights at night.

    Both this season and last season I have used custom OpenCV vision processing to detect the gold mineral / SkyStone and have so far had 100% detection accuracy even across the varying lighting conditions from home to the Cobo Center in Detroit. What's more, I run the detection while waiting for start, and when start is pressed I simply take the last analysis from memory. This means that I spend literally zero time waiting on detection after start. What I do this year is convert the camera image to YCrCb color space, then extract just the Cb channel. YCrCb is a 3-plane color space in which Y is the luma plane and Cr and Cb are chroma planes, with Cr being difference from red and Cb being difference from blue.

    Here's what the quarry looks like in only the Cb plane:

    [Screenshot: the quarry viewed in only the Cb plane]

    I then take three 5x5 pixel boxes from that image, at hard-coded XY positions (one positioned over each stone), and compute the average Cb value of each box. Then I simply compare those averages; the largest (lightest) average is the SkyStone.
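
    In code, that analysis boils down to something like the following condensed sketch (not my full pipeline; the sample-box coordinates are placeholders, and it assumes an RGB input frame, so convert first if your camera hands you RGBA):

    Code:
    // Condensed sketch of the Cb-channel sampling described above. Box
    // coordinates are placeholders; input is assumed to be an RGB Mat.
    int findSkystone(Mat inputRgb) {
        Mat ycrcb = new Mat();
        Mat cb = new Mat();
        Imgproc.cvtColor(inputRgb, ycrcb, Imgproc.COLOR_RGB2YCrCb);
        Core.extractChannel(ycrcb, cb, 2);                 // channel 2 = Cb

        // One hard-coded 5x5 box over each stone (placeholder positions).
        Rect[] boxes = {
                new Rect(40, 120, 5, 5),
                new Rect(150, 120, 5, 5),
                new Rect(260, 120, 5, 5)
        };

        int skystoneIndex = 0;
        double bestAvg = -1;
        for (int i = 0; i < boxes.length; i++) {
            double avg = Core.mean(cb.submat(boxes[i])).val[0];   // average Cb in box i
            if (avg > bestAvg) {                                  // lightest box wins
                bestAvg = avg;
                skystoneIndex = i;
            }
        }
        ycrcb.release();
        cb.release();
        return skystoneIndex;   // 0, 1, or 2: which stone is the SkyStone
    }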

    Then once auto was over and teleop started, our robot was about useless: jerky movement, lag in responding. The drivers reported after the match that the ping was jumping between 500 and 1000 ms!
    Sometimes the FTAs mess up when assigning channels and certain channels become overloaded. This happened at East Super Regionals in Relic Recovery, for instance. The FTAs put ALL the teams from BOTH divisions on a single channel, resulting in insane lag during the practice inspection match. When I showed this to one of the FTAs with a WiFi analyzer app, he told me I could switch channels at my own discretion, which I did, and I did not experience issues for the rest of the event. Additionally, the phones with the newer radios seem to throttle the radio much more aggressively. For instance, in a clean WiFi environment I have seen Moto G5 Pluses with pings going all over the place, and even one disconnect, while sitting right next to each other, even on the patched SDK that sends keep-alives more aggressively to combat this! I personally use Nexus 5s and have been very pleased with their WiFi performance.

    Second failed auto was due to one of the REV sensors not responding. Due to the unreliability we had seen previously, I had the team put in safety stops: if the sensor reports an out-of-bounds number or nothing, substitute a value that at least keeps the robot from running over to the competing alliance and drawing penalty points. Our robot crashed into the bridge, but no damage was done; we just didn't get the points.
    I am not a fan of the ToF sensors, as the laser wavelength is such that it shoots right through the lexan wall. I have also had one randomly die on me. Besides, the range just isn't great. I have had great success with the MaxBotix MB1242 narrow-beam sonar rangefinders.

    - Third auto failure was the dreaded "IMU Not Found", after the auto started. Since we use the IMU for our turns, on the first turn our robot simply went in a circle for 30 seconds.
    Does this mean you're initializing your IMU after start as opposed to during init? Don't do that. If you initialize during init, you can catch that error and power cycle / restart whatever is necessary to resolve it before the match. This error seems(?) to happen much more often when using the internal BNO055 in the Expansion Hub as opposed to when using an external BNO055 from Adafruit. I'm not entirely sure why.
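
    A bare-bones version of "initialize during init and surface the failure" looks roughly like this (a sketch; the retry count is arbitrary and the config name "imu" is assumed):

    Code:
    // Sketch: initialize the BNO055 during init so a failure is visible on the
    // Driver Station before the match starts. Retry count is arbitrary.
    BNO055IMU imu = hardwareMap.get(BNO055IMU.class, "imu");
    BNO055IMU.Parameters params = new BNO055IMU.Parameters();
    params.angleUnit = BNO055IMU.AngleUnit.DEGREES;

    boolean imuOk = false;
    for (int attempt = 1; attempt <= 3 && !imuOk; attempt++) {
        imuOk = imu.initialize(params);   // returns false if initialization failed
    }
    telemetry.addData("IMU", imuOk ? "initialized" : "FAILED - power cycle before the match");
    telemetry.update();

    waitForStart();   // only start once the IMU reports healthy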

    Now, if you have I2C comms errors during an auto run, it's possible that ESD could be the culprit. The I2C ports seem to be particularly sensitive. In Relic Recovery at ESR I know 8644 killed the I2C ports on like 4 Expansion Hubs from ESD.

    Obviously the long latencies between the driver and robot phones over wifi are screwing up our auto, but that shouldn't be possible if they architected the system correctly (obviously it isn't designed properly). For teleop, naturally you are using the driver station to control the robot, but the only valid purpose the driver station has during auto is to start it, and to stop it in the event of an emergency. In other words, the driver/robot link doesn't need to be "in the loop" during auto; you could revert to a periodic poll approach to ask if the driver requests a stop. Or am I not understanding the system correctly?
    Incorrect. Latency between the DS and RC should have absolutely zero impact on autonomous. The RC and DS are never polling each other. If stop is pressed on the DS, it sends a command to the RC, repeating the command until the RC sends an ack. Furthermore, all network I/O happens on a separate thread from your OpMode. During TeleOp, the DS sends new gamepad data to the RC at 40 ms intervals. The RC does not have to ask for this data; the DS just sends it.
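
    As a minimal illustration, an autonomous LinearOpMode never reads the gamepads at all; when the stop command arrives, the SDK's network thread simply makes opModeIsActive() return false:

    Code:
    // Minimal illustration: an autonomous OpMode never touches gamepad data.
    // The stop request is handled on the SDK's network thread, and
    // opModeIsActive() simply returns false once stop is received.
    import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
    import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;

    @Autonomous(name = "AutoSketch")
    public class AutoSketch extends LinearOpMode {
        @Override
        public void runOpMode() {
            // hardware mapping, vision, and IMU init would go here
            waitForStart();                // blocks until the DS sends start

            while (opModeIsActive()) {     // goes false as soon as stop arrives
                // run the autonomous routine; no gamepad reads anywhere
            }
        }
    }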



    • #3
      Thanks for the good info!

      Some comments:

      Vuforia: Yes, we are done with Vuforia. We wanted to switch to OpenCV this year but didn't make the leap. Next season for sure! However, Vuforia was working 100% before the State event. It only failed in the auto that was also associated with the wifi issues, hence some of my concern, which you cleared up.

      WiFi channels: Yes, this is what happened. 60 teams, and the FTAs put half on Channel 10 and half on Channel 11. They also put out pleas to the audience to shut off their active phone hotspots, but half the crowd likely had no clue how to do that. Something must change at these big events; it ruined lots of people's day.

      Distance Sensors: Thanks for the tip on the MaxBotix sensors. We are going to avoid all REV products in the future, as they can't seem to get it right. (We have had two defective hubs right out of the box: one last year, one this year.)

      IMU: No, we initialize the IMU properly in init. The IMU died once the program started. Like above, thanks for the external IMU tip. Again, we will avoid REV stuff where possible. Regarding ESD, I don't believe this was the issue. We are in FL; we don't have the static issues that the poor guys up North have. I have not had a REV hub apart, but I suspect they didn't do the proper protection for I/O on all ports. I suspect they didn't protect the USB port either, or any of the power for that matter (such protection is standard automotive practice). I design products for vehicle applications; I am fond of the Bourns CDSOT23-T24CAN and use it on CAN, RS-485, I2C, and RS-232, even though it was made for CAN. It works beautifully for controller applications. I use the NUF2221W1T2G on USB. It works great in a noisy automotive environment.

      DS to RC: Thanks for enlightening me. Where did you learn how this all works? Is there any documentation anywhere? I have asked REV, but they don't respond.



      • #4
        Originally posted by 11343_Mentor:
        DS to RC: Thanks for enlightening me. Where did you learn how this all works? Is there any documentation anywhere? I have asked REV, but they don't respond.
        FYI, none of the SDK was written by REV. (Oh by the way, REV didn't write the Expansion Hub firmware either - DEKA did). The original SDK was written by Qualcomm, who then gave FIRST a "code dump" and the (mostly volunteer) Tech Team took it from there. No, there isn't any official documentation on how the low levels of the SDK work. I know as much about the SDK as I do because I've been digging through the source code for years now. If you want to take a look for yourself, my ExtractedRC repository makes this extremely easy. And yes, I've experienced the same thing with REV - any highly technical questions are just ignored.

