Announcement

Collapse
No announcement yet.

USB Bus latency, sleep(), and I2C traffic

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • USB Bus latency, sleep(), and I2C traffic

    The latest mystery...

    Our team wrote a method that commands a robot to move to a certain distance from an object based on a value of an I2C distance sensor (Rev2m). This routine varies the motors power proportionally to the distance and then finally stops the motors by setting their powers to 0. This routine is then called from runOpMode on a LinearOpMode object followed by a call to LinearOpMode.sleep(1000). Here is some pseudo code:

    void moveToDistance() {
    while (sensor.getDistance() > target) {
    setMotorPower(max(sensor.getDistance()-target)*p, MIN_MOTOR_SPEED)
    RobotLog.d (details...)
    }
    setMotorPower(0)
    RobotLog.d ("done...")
    }

    void runOpMode () {
    <initialize...>
    moveToDistance();
    sleep(1000);
    <do some other stuff>

    Mostly, this works fine. BUT, sometimes, the motors stay on for a significant period of time AFTER they have been commanded to 0.

    In a log from a documented failure (below), we can see that the elapsed time from the initial call to moveToDistance() was around 300 msecs. In the log we can also see falling readings from the distance sensor and the log entry after the motors have been commanded to zero. We can also see the next operation starting after the 1 second pause.

    And yet, on the video we captured, the robot trundles forward for what appears to be about another second before it moves on to its next movement.

    So what is happening here? Is the call to sleep() somehow interfering with the USB communication queue? Are the high frequency calls to the I2C distance sensor overwhelming the communication manager in the hub (notice the 3 identical readings in a row in the log...not something we typically see, even at these very fast cycle times)? Something else?

    Thanks in advance!


    2020-01-16 18:08:57.881 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: Moving to Distance: 6.0Velocity: 1500.0
    2020-01-16 18:08:57.901 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: Initial Distance Reading: 7.04724409448819
    2020-01-16 18:08:57.902 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: preDriftTarget: 6.5slowThreshold: 11.0decelThreshold: 21.0slope: 115.0
    2020-01-16 18:08:57.922 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:7.04724409448819 velocity: 350.0
    2020-01-16 18:08:57.965 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:6.929133858267717 velocity: 350.0
    2020-01-16 18:08:58.010 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:6.929133858267717 velocity: 350.0
    2020-01-16 18:08:58.063 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:6.929133858267717 velocity: 350.0
    2020-01-16 18:08:58.107 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:6.850393700787402 velocity: 350.0
    2020-01-16 18:08:58.153 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: CRAWLING: Distance:6.5748031496063 velocity: 350.0
    2020-01-16 18:08:58.170 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newMoveToDistance: Finished Move to Distance

    2020-01-16 18:08:59.170 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRotateTo: Starting to rotate to 270.0

  • #2
    Welcome to Java!

    We see this wacky stuff all the time, you just have to live with it. I am surprised that the overall system works as well as it does, quite frankly. Java works great for databases and UI design, but it is terrible for embedded control applications. It just is not reliable, especially running on a phone platform.

    The whole architecture should have been built with more of the control in the hub and relegate the phone to UI-only. It would have been fantastic to have C-based embedded controller with the ability to dial in parameters at the base code level.

    That said, linear op mode seems to be more unreliable than iterative op mode with these wacky burps. You might try converting your code to do a state machine approach.

    Comment


    • #3
      Oh, and to answer your question about the three identical log files, it is likely due to the app being paused by the phone as it is doing garbage cleanup. The hub controller is still running so things get out of whack, and it cannot recover.

      Comment


      • #4
        Originally posted by 11343_Mentor View Post
        Oh, and to answer your question about the three identical log files, it is likely due to the app being paused by the phone as it is doing garbage cleanup. The hub controller is still running so things get out of whack, and it cannot recover.
        If garbage collection is causing the RC app to pause, then perhaps the call to sleep() is triggering the garbage collector to wake up and do its thing.

        Comment


        • #5
          Originally posted by Alec View Post

          If garbage collection is causing the RC app to pause, then perhaps the call to sleep() is triggering the garbage collector to wake up and do its thing.
          That's a good thought. We've had sleep() seem to cause unexpected issues, so now always try to use ElapsedTimers instead.

          Comment


          • #6
            Thanks for the thoughtful responses. It did cross my mind that the GC might be involved so we scanned the logs. We see entries like these :

            2020-01-16 18:08:44.129 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: newRampMotors - Finished
            2020-01-16 18:08:44.522 24490-24501/com.qualcomm.ftcrobotcontroller I/art: Background sticky concurrent mark sweep GC freed 553880(15MB) AllocSpace objects, 0(0B) LOS objects, 27% free, 41MB/57MB, paused 1.305ms total 156.529ms
            2020-01-16 18:08:45.440 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: New Ramp Motors: Start:2200.0 End:250.0 heading:268.0


            In this case, the flanking log entries are separated by very little code, so it seems logical to conclude the GC was responsible for nearly all of the 1.3 seconds of elapsed time. However, that wasn't there in the failure mode described above. So that raises these questions:
            • We are presuming the Java Runtime for the FTC RC app is private to that app and thus all GC events will be in the Robot Log?
            • Is there a way to tune/control the GC? There are plenty of places (especially in the autonomous program) where we would be happy to cede cycles to the GC. Can we direct the process? Would that be wise or would it interfere with the other threads in the SDK?
            • Is there a way to know if the Android OS has "paused" the entire FTC RC app?

            Regardless, we still suspect something else in this specific case. Removing the call to sleep() seems to eliminate the problem describe above, which is hard to explain. Is anyone familiar with the thread architecture of the SDK? s it possible the thread RunOpMode is called inside is related to the thread that is handling USB communication in some way that the call to sleep() within RunOpMode could effect the other thread?

            Finally, we haven't yet built tests to confirm this, but I'm wondering if the rate at which we ping that sensor over the I2C bus might have something to do with it. There is a place in the program where the sensor tends to fail ("problem with xyzsensor" - haven't been able to isolate the associated log entry, so no idea what the SDK thinks the failure is). That failure tends to happen in a spot where that "moveToDistance()" method is called once, and then again, in short order. In practice, it always seems to be on the second call.

            Would introducing a maximum rate (perhaps 5-10 times per second) for calls to getDistance() on the sensor perhaps help?

            Thanks for any continued thoughts on all this!

            Comment


            • #7
              Just another thought, are you guys using vision? I know some teams are using OpenCV. Sometimes if vision code is not written properly, for example, allocating a Mat for every frame captured by the camera, that will cause excessive memory allocation resulted in very active garbage collection. So it may be beneficial to scan through your code to check memory allocation in very tight loops. They don't have to be vision but vision could be an obvious culprit.

              Comment


              • #8
                @mikets has a very good point. I have seen several OpenCV FTC code examples where several large mats were allocated every time through the loop.
                There are other non-OpenCV cases where teams allocate memory - either directly - via calls with "new", or unknowingly through calls that return allocated objects.
                There is not a lot you can do about memory allocation within the sdk, but you can try to reduce the problem by controlling scope of allocated memory and trying to reuse the memory you allocate. This can significantly reduce GC. For OpenCV as an example, there are calls that will return a newly allocated mat, and there are often "equivalent" calls that will fill an already allocated mat/memory. There are other places to fairly easily reduce memory allocation - use StringBuilder and/or StringBuffer - especially if you are doing a lot of logging. You might be amazed at the number of allocations that get performed when doing simple String operations in java.

                That said, unless the GC print in your log snippet occurred during the middle (as opposed to beginning or end) of a GC activity, it doesn't seem likely that GC used nearly all of the 1.3s gap, and 1.3s seems like a long time. Most of the time, you don't want/need to manually trigger GC, but perhaps it wouldn't be a bad idea to do so at init of an opmode (maybe at start), and between auton and teleop. Unless a lot of allocation is going on, major GC really shouldn't need to run very frequently.

                Comment


                • #9
                  Thank you for the additional ideas. The team is using Vuforia/Tensor flow but that is shut down (and presumably its garbage collected) many seconds before the failure mode described above.

                  We think we have ruled out the notion of overloading the I2C bus and/or USB comms. We ran some tests this AM where the robot was stationary and we pinged the sensors as rapidly as possible. Consistently we see that we can get about 32 readings per second from one of these Rev2m sensors and about 11 readings/second from 5 of them (yes the robot has 5, ex FLL team was pretty excited to go beyond 4 sensor limit in FLL...). Running for extending periods did not provoke any visible or logged issues. We DID see the GC kick in from time to time, but with no negative consequences. In fact, we can't find noticeable additional latency between two log entries anywhere near the GC log entries...so all seems well in that department. Nevertheless, the advice about memory consumption is well received. As someone who predates byte code languages and common use of GCs, I do still remember the first time I saw what string concatenation in a loop did to a heap...that was quite shocking!

                  So now we are back to looking at more pedestrian causes...ESD events? USB disconnects? In the meantime, I'm now very wary of using sleep(). I was taught to care and give back (in all things, including processor cycles), but I think we will hog those cycles in a timing loop instead. Haven't seen the particular failure mode described at the top of this thread since we made that change...but I still find it bothersome that I can't figure out a logical rationale for why a call to sleep on the opmode thread would interfere with the SDK communicating (and the hub executing) a motor power command that had been issued before the call to sleep()...maybe it has something to do with the SDK synchronization mechanisms?

                  Comment


                  • #10
                    Originally posted by dbreise2017 View Post
                    Thanks for the thoughtful responses. It did cross my mind that the GC might be involved so we scanned the logs. We see entries like these :

                    2020-01-16 18:08:44.129 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: newRampMotors - Finished
                    2020-01-16 18:08:44.522 24490-24501/com.qualcomm.ftcrobotcontroller I/art: Background sticky concurrent mark sweep GC freed 553880(15MB) AllocSpace objects, 0(0B) LOS objects, 27% free, 41MB/57MB, paused 1.305ms total 156.529ms
                    2020-01-16 18:08:45.440 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: New Ramp Motors: Start:2200.0 End:250.0 heading:268.0


                    In this case, the flanking log entries are separated by very little code, so it seems logical to conclude the GC was responsible for nearly all of the 1.3 seconds of elapsed time. However, that wasn't there in the failure mode described above. So that raises these questions:
                    • We are presuming the Java Runtime for the FTC RC app is private to that app and thus all GC events will be in the Robot Log?
                    • Is there a way to tune/control the GC? There are plenty of places (especially in the autonomous program) where we would be happy to cede cycles to the GC. Can we direct the process? Would that be wise or would it interfere with the other threads in the SDK?
                    • Is there a way to know if the Android OS has "paused" the entire FTC RC app?

                    Regardless, we still suspect something else in this specific case. Removing the call to sleep() seems to eliminate the problem describe above, which is hard to explain. Is anyone familiar with the thread architecture of the SDK? s it possible the thread RunOpMode is called inside is related to the thread that is handling USB communication in some way that the call to sleep() within RunOpMode could effect the other thread?

                    Finally, we haven't yet built tests to confirm this, but I'm wondering if the rate at which we ping that sensor over the I2C bus might have something to do with it. There is a place in the program where the sensor tends to fail ("problem with xyzsensor" - haven't been able to isolate the associated log entry, so no idea what the SDK thinks the failure is). That failure tends to happen in a spot where that "moveToDistance()" method is called once, and then again, in short order. In practice, it always seems to be on the second call.

                    Would introducing a maximum rate (perhaps 5-10 times per second) for calls to getDistance() on the sensor perhaps help?

                    Thanks for any continued thoughts on all this!
                    I have not looked into this in depth, but I don't think GC activity is logged. Keep in mind that GC, as it affects our program, isn't necessarily just related to our specific program, it also is called for numerous other programs concurrently running on the phone. (the phone is doing GC and other sorts of priority tasks unrelated to the FTC program) The philosophy of the phone manufacturers is that since Android is non-deterministic, then there is no reason to give any latitude to developers to try to make more deterministic program needs run more reliably, such as scheduling GC. As the others have said, there are potentially things you can do in your code to make sure you are not allocating new memory, but not really anything you can do with the program or phone to change GC and other tasks.

                    Comment


                    • #11
                      Originally posted by 11343_Mentor View Post
                      ...The philosophy of the phone manufacturers is that since Android is non-deterministic, then there is no reason to give any latitude to developers to try to make more deterministic program needs run more reliably, such as scheduling GC.
                      Agreed. Stock Android (and underlying Linux distro) are non-deterministic; non-realtime. This is the crux of the matter.

                      emteria.OS (formerly "RT Android") is a realtime variant of the Android/Linux stack. Among other realtime enhancements, it has a concurrent non-blocking realtime garbage collector which can be preempted by applications. So emteria.OS could be a solution.

                      FIRST should allow teams to use a Raspberry Pi 3 B+ (B Plus) running emteria.OS (instead of a phone) for the RC, as a pilot. If the pilot is a success, then work towards getting emteria.OS running on the Control Hub.

                      Comment


                      • #12
                        Originally posted by dbreise2017 View Post
                        ...but I still find it bothersome that I can't figure out a logical rationale for why a call to sleep on the opmode thread would interfere with the SDK communicating (and the hub executing) a motor power command that had been issued before the call to sleep()...maybe it has something to do with the SDK synchronization mechanisms?
                        I know you ruled out overloading the USB and I2C buses, but you might want to take a look at the next version of the SDK, 5.4, and in particular its bulk read methods. See the README here, under the heading, "Adds bulk read functionality for REV Hubs". It seems to have been designed for use cases such as yours that have a lot of sensor usage. Obviously, it's very new, so I wouldn't recommend converting all your code to use it, but it might be worth trying out.
                        John McDonnell
                        Volunteer, former mentor

                        Comment


                        • #13
                          Originally posted by JohnMcDonnell View Post

                          I know you ruled out overloading the USB and I2C buses, but you might want to take a look at the next version of the SDK, 5.4, and in particular its bulk read methods. See the README here, under the heading, "Adds bulk read functionality for REV Hubs". It seems to have been designed for use cases such as yours that have a lot of sensor usage. Obviously, it's very new, so I wouldn't recommend converting all your code to use it, but it might be worth trying out.
                          Thanks for pointing out the new BulkCaching support in FTC SDK 5.4. Our team has been struggling with USB performance for a long time. That prompted us to add a lot of support in our library to cache all sensor/encoder data and make sure all sensor/encoder reads are going through this cache. But still we ultimately had to call FTC SDK to do the sensor reads and USB bulk read will greatly help minimizing the time spent in reading all the sensors even just once in a control loop.
                          I just added support to enable USB bulk caching in our library. We will test the performance improvement when we meeting tonight.

                          Comment


                          • #14
                            Forgot to mention that currently we have multiple threads reading different sensors, so I just enabled BulkCacheMode AUTO instead of MANUAL. If the performance improvement looks good, we will move all sensor reads back to a single thread so we can use MANUAL mode. If reading from the bulk cache takes no time at all, we may even move all the reads into the robot main thread!

                            Comment


                            • #15
                              Originally posted by mikets View Post
                              Forgot to mention that currently we have multiple threads reading different sensors
                              FYI, multiple threads is not going to help anything.

                              The Lynx part of the SDK (i.e. the portions related to the REV Hub) are completely thread-safe and are protected by synchronized blocks and a low-level ReentrantLock for controlling access to the USB bus. This means that not only is it guaranteed that only one thread can have a command in flight to a particular REV Hub at any one point in time, but also that only one thread can be talking to ANY REV Hub at any one point in time. This is done because the RS-485 bus does not have any collision prevention. Any messages on the bus would be garbled if two modules were trying to talk at the same time. So thus the host is responsible for ensuring this doesn't happen. Additionally, for I2C read/write requests which actually take more than one command to accomplish, there is a lock in place to ensure that I2C commands are properly synchronized such that, say, an I2C write command cannot be issued while the response to a previous I2C read command is being awaited.

                              HOWEVER - it is NOT more efficient to have, say, one thread talking to each REV Hub, because the two threads will be fighting for bus time (Or, even worse, one thread could starve the other). Not only that, but also if you start another thread in your OpMode, YOU are responsible for stopping that other thread. If you forget to stop it, it will keep running in the background, introducing possibly very unhappy consequences. Worse, while if your main OpMode thread crashes, you can simply recover by restarting robot, if your secondary thread crashes, the entire Robot Controller application will crash irrecoverably - not something you want to have happen during competition.

                              Comment

                              Working...
                              X