Announcement

Collapse
No announcement yet.

USB Bus latency, sleep(), and I2C traffic

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • 4634 Programmer
    replied
    Originally posted by dbreise2017 View Post

    In that scenario, what would you expect to see? Slower performance on each thread? One thread getting the vast majority of the bus time? Timed out responses with zero'd out data? In other words, what should we be on the look out for?

    Thanks!
    For SDK versions prior to v5.3, I might expect either slower performance on each thread or one thread getting the vast majority of the bus time. For 5.3 and higher, I'd probably just expect slower performance on each thread. (This change is due to a new "fair" lock being added in 5.3 on top of the "unfair" lock that was there previously)

    Let me try to explain why a state machine can make this situation better. Let's say you have task A, which is your drive loop and is polling your IMU, reading an encoder, and setting motor powers. You also have task B running, which is polling, say, your lift encoder, waiting for it to come all the way down before turning off the lift motor. In this case, task A takes about 22ms, but task B takes only 3ms. This means that task B will try to run many, many times while task A hasn't even gotten through one iteration yet! But if both tasks are run using a cooperative multitasking through a state machine, then each task will get run in series, and both A and B will get an iteration done every 25ms or so.

    Leave a comment:


  • dbreise2017
    replied
    Thank you, the above is consistent with my understanding. However this:
    Using another thread that say, issues a command, waits 5 seconds, then issues another command, is fine. When it becomes a problem is if you have another thread constantly polling an encoder, touch sensor, etc., because it could then steal a significant amount of bus time from your main thread.
    was quite interesting. In some cases, I suspect the kids are doing just that. They might have a "long running" operation in a separate thread that is polling an encoder value in a tight loop (waiting for a motor driven arm to get in a certain position) while in the "main" thread they are adjusting power values to some drive motors in another tight loop based on sensor values and/or encoders on those motors.

    In that scenario, what would you expect to see? Slower performance on each thread? One thread getting the vast majority of the bus time? Timed out responses with zero'd out data? In other words, what should we be on the look out for?

    Thanks!

    Leave a comment:


  • 4634 Programmer
    replied
    Originally posted by dbreise2017 View Post

    I'd like to know more about this (Hopefully without pulling this thread into the ongoing debate on this forum about state machines versus more "linear" design patterns.) As a compromise between the two worlds, I have encouraged my younger students to adopt a design pattern that boils down to this:
    1. Create methods (on various classes) for all "long running" operations that will not return until the operation is complete. In practice, these might be things that sequence movement of a motor with movement of a servo, wait for a motor to finish movement, move the robot to a certain distance from an object, etc. Anything that takes more than a handful of milliseconds.
    2. Ensure those methods will exit early if a "stop" control signal is sent. All loops and subsequent operations within one of these long running operations include a conditional to check the op mode status and will exit if a stop is requested.
    3. For any of these long running methods that involve operations that we may want to do in parallel with something else, create a "No Wait" version of the method that spawns a new thread which runs the method in question, and immediately returns. This also allows them to execute these operations in teleop without losing control of the robot.
    4. Use state variables on the relevant classes to allow for simple inter-thread communication so that one thread can wait for another to finish when needed.
    The presumptions are:
    • Each of these threads will be quickly recycled when they finish
    • having multiple threads talking to the FTC SDK "at the same time" will not pose problems
    • putting some of these threads to "sleep()" while they wait for certain operations to complete is not only OK, but a good practice to help the Java Run Time and Linux instance manage resources effectively.
    But now I'm wondering...perhaps one of our "sleeping" threads has hijacked the bus somehow? Any thoughts?

    Using another thread that say, issues a command, waits 5 seconds, then issues another command, is fine. When it becomes a problem is if you have another thread constantly polling an encoder, touch sensor, etc., because it could then steal a significant amount of bus time from your main thread.

    My point about efficiency is simply that using multiple threads to be able to perform more RX/TX commands "faster" will actually have the opposite effect, since only one thread can be talking to ANY of the hubs at a time. I.e., you cannot have one thread talking to the master hub and another talking to the slave hub. They both have to fight for bus time. The only way this could work is if either the "USB portal" lock is removed when using <=2 Expansion Hubs (since 485 bus contention is only an issue with >=2 slaves on the 485 bus) or if both modules are connected to the RC over USB.

    A sleeping thread cannot hijack the bus. After a hardware call returns (which, btw, could be as much as 250ms if retransmission is required) the bus lock is released.

    Unlike Modern Robotics, there is no "hardware cycle" with REV. Every single command is synchronous. When performing any hardware call, the bus lock will be acquired, the command will be sent, up to 250ms will be allotted for either an ACK or response to come in (with retransmissions scheduled every 100ms if needed) and then the bus lock will be released. If the command was a query operation, real data will be returned if a response was received, otherwise if it timed out fake (all 0) data will be returned.

    Leave a comment:


  • dbreise2017
    replied
    Here is a short article on Android and GC: https://dzone.com/articles/understan...ndroid-gc-logs. I note that things have changed substantially in this area in the last few years as Android moved from DALVIK to ART.

    From this we can see that we should expect to see GC events in the log. In our log, they do occur, but not too often and not for too long. For example, the one mentioned above is the following:
    2020-01-16 18:08:44.522 24490-24501/com.qualcomm.ftcrobotcontroller I/art: Background sticky concurrent mark sweep GC freed 553880(15MB) AllocSpace objects, 0(0B) LOS objects, 27% free, 41MB/57MB, paused 1.305ms total 156.529ms

    The "total" time on this GC event was ~150ms. This is considerably more than we would expect and seems to point at heap abuse by the app, however, its not long enough or frequent enough to explain any of the weird behavior described above.

    Thus, while I agree that GC is an issue worthy of exploration, I suspect it is not the culprit here...

    Leave a comment:


  • Alec
    replied
    Originally posted by dbreise2017 View Post
    A few things in some recent posts in this thread I would love to dig into:

    As I understand it (barely), each Android app gets its own ART (Android Runtime) instance. Which includes a private Linux process and a private heap mostly managed by the Java Runtime instance in that process. So while yes, our FTC RC App is in contention for system resources with other apps on the phone, that contention is managed by the underlying Linux instance. Presumably, on the Control Hub, there is nothing running that is not directly relevant to the FTC RC mission, so things will be better. Does this match other's understanding? ...
    My understanding is that Andoird's GC pauses all JVMs whenever it decides to run. In essence, your OpMode is in a race condition with the GC. This needs to be addressed before we dig down deeper.

    Leave a comment:


  • dbreise2017
    replied
    A few things in some recent posts in this thread I would love to dig into:

    I have not looked into this in depth, but I don't think GC activity is logged. Keep in mind that GC, as it affects our program, isn't necessarily just related to our specific program, it also is called for numerous other programs concurrently running on the phone. (the phone is doing GC and other sorts of priority tasks unrelated to the FTC program) The philosophy of the phone manufacturers is that since Android is non-deterministic, then there is no reason to give any latitude to developers to try to make more deterministic program needs run more reliably, such as scheduling GC. As the others have said, there are potentially things you can do in your code to make sure you are not allocating new memory, but not really anything you can do with the program or phone to change GC and other tasks.
    As I understand it (barely), each Android app gets its own ART (Android Runtime) instance. Which includes a private Linux process and a private heap mostly managed by the Java Runtime instance in that process. So while yes, our FTC RC App is in contention for system resources with other apps on the phone, that contention is managed by the underlying Linux instance. Presumably, on the Control Hub, there is nothing running that is not directly relevant to the FTC RC mission, so things will be better. Does this match other's understanding? Anything to correct or add?

    HOWEVER - it is NOT more efficient to have, say, one thread talking to each REV Hub, because the two threads will be fighting for bus time (Or, even worse, one thread could starve the other)
    I'd like to know more about this (Hopefully without pulling this thread into the ongoing debate on this forum about state machines versus more "linear" design patterns.) As a compromise between the two worlds, I have encouraged my younger students to adopt a design pattern that boils down to this:
    1. Create methods (on various classes) for all "long running" operations that will not return until the operation is complete. In practice, these might be things that sequence movement of a motor with movement of a servo, wait for a motor to finish movement, move the robot to a certain distance from an object, etc. Anything that takes more than a handful of milliseconds.
    2. Ensure those methods will exit early if a "stop" control signal is sent. All loops and subsequent operations within one of these long running operations include a conditional to check the op mode status and will exit if a stop is requested.
    3. For any of these long running methods that involve operations that we may want to do in parallel with something else, create a "No Wait" version of the method that spawns a new thread which runs the method in question, and immediately returns. This also allows them to execute these operations in teleop without losing control of the robot.
    4. Use state variables on the relevant classes to allow for simple inter-thread communication so that one thread can wait for another to finish when needed.
    The presumptions are:
    • Each of these threads will be quickly recycled when they finish
    • having multiple threads talking to the FTC SDK "at the same time" will not pose problems
    • putting some of these threads to "sleep()" while they wait for certain operations to complete is not only OK, but a good practice to help the Java Run Time and Linux instance manage resources effectively.
    But now I'm wondering...perhaps one of our "sleeping" threads has hijacked the bus somehow? Any thoughts?

    our team is running six REV 2m distance sensors, without issues.
    That's great to hear! Care to share any best practices with these notoriously "fiddly" sensors? Polling rates? Dealing with noise? Calibration and Initialization? other?


    Leave a comment:


  • 4634 Programmer
    replied
    There's no way to "overload" the I2C/USB bus. It's simply a matter of the more hardware calls you make (3ms minimum) and I2C calls (7ms minimum) the longer your loop time will be.

    Leave a comment:


  • RoyM
    replied
    As for the possibility of overloading the I2C bus, our team is running six REV 2m distance sensors, without issues. On the other hand, we never call sleep(). Instead we run a state machine and set and check a timer if we need to wait before we do a subsequent step.

    Leave a comment:


  • 4634 Programmer
    replied
    Originally posted by mikets View Post
    Forgot to mention that currently we have multiple threads reading different sensors
    FYI, multiple threads is not going to help anything.

    The Lynx part of the SDK (i.e. the portions related to the REV Hub) are completely thread-safe and are protected by synchronized blocks and a low-level ReentrantLock for controlling access to the USB bus. This means that not only is it guaranteed that only one thread can have a command in flight to a particular REV Hub at any one point in time, but also that only one thread can be talking to ANY REV Hub at any one point in time. This is done because the RS-485 bus does not have any collision prevention. Any messages on the bus would be garbled if two modules were trying to talk at the same time. So thus the host is responsible for ensuring this doesn't happen. Additionally, for I2C read/write requests which actually take more than one command to accomplish, there is a lock in place to ensure that I2C commands are properly synchronized such that, say, an I2C write command cannot be issued while the response to a previous I2C read command is being awaited.

    HOWEVER - it is NOT more efficient to have, say, one thread talking to each REV Hub, because the two threads will be fighting for bus time (Or, even worse, one thread could starve the other). Not only that, but also if you start another thread in your OpMode, YOU are responsible for stopping that other thread. If you forget to stop it, it will keep running in the background, introducing possibly very unhappy consequences. Worse, while if your main OpMode thread crashes, you can simply recover by restarting robot, if your secondary thread crashes, the entire Robot Controller application will crash irrecoverably - not something you want to have happen during competition.

    Leave a comment:


  • mikets
    replied
    Forgot to mention that currently we have multiple threads reading different sensors, so I just enabled BulkCacheMode AUTO instead of MANUAL. If the performance improvement looks good, we will move all sensor reads back to a single thread so we can use MANUAL mode. If reading from the bulk cache takes no time at all, we may even move all the reads into the robot main thread!

    Leave a comment:


  • mikets
    replied
    Originally posted by JohnMcDonnell View Post

    I know you ruled out overloading the USB and I2C buses, but you might want to take a look at the next version of the SDK, 5.4, and in particular its bulk read methods. See the README here, under the heading, "Adds bulk read functionality for REV Hubs". It seems to have been designed for use cases such as yours that have a lot of sensor usage. Obviously, it's very new, so I wouldn't recommend converting all your code to use it, but it might be worth trying out.
    Thanks for pointing out the new BulkCaching support in FTC SDK 5.4. Our team has been struggling with USB performance for a long time. That prompted us to add a lot of support in our library to cache all sensor/encoder data and make sure all sensor/encoder reads are going through this cache. But still we ultimately had to call FTC SDK to do the sensor reads and USB bulk read will greatly help minimizing the time spent in reading all the sensors even just once in a control loop.
    I just added support to enable USB bulk caching in our library. We will test the performance improvement when we meeting tonight.

    Leave a comment:


  • JohnMcDonnell
    replied
    Originally posted by dbreise2017 View Post
    ...but I still find it bothersome that I can't figure out a logical rationale for why a call to sleep on the opmode thread would interfere with the SDK communicating (and the hub executing) a motor power command that had been issued before the call to sleep()...maybe it has something to do with the SDK synchronization mechanisms?
    I know you ruled out overloading the USB and I2C buses, but you might want to take a look at the next version of the SDK, 5.4, and in particular its bulk read methods. See the README here, under the heading, "Adds bulk read functionality for REV Hubs". It seems to have been designed for use cases such as yours that have a lot of sensor usage. Obviously, it's very new, so I wouldn't recommend converting all your code to use it, but it might be worth trying out.

    Leave a comment:


  • Alec
    replied
    Originally posted by 11343_Mentor View Post
    ...The philosophy of the phone manufacturers is that since Android is non-deterministic, then there is no reason to give any latitude to developers to try to make more deterministic program needs run more reliably, such as scheduling GC.
    Agreed. Stock Android (and underlying Linux distro) are non-deterministic; non-realtime. This is the crux of the matter.

    emteria.OS (formerly "RT Android") is a realtime variant of the Android/Linux stack. Among other realtime enhancements, it has a concurrent non-blocking realtime garbage collector which can be preempted by applications. So emteria.OS could be a solution.

    FIRST should allow teams to use a Raspberry Pi 3 B+ (B Plus) running emteria.OS (instead of a phone) for the RC, as a pilot. If the pilot is a success, then work towards getting emteria.OS running on the Control Hub.

    Leave a comment:


  • 11343_Mentor
    replied
    Originally posted by dbreise2017 View Post
    Thanks for the thoughtful responses. It did cross my mind that the GC might be involved so we scanned the logs. We see entries like these :

    2020-01-16 18:08:44.129 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: newRampMotors - Finished
    2020-01-16 18:08:44.522 24490-24501/com.qualcomm.ftcrobotcontroller I/art: Background sticky concurrent mark sweep GC freed 553880(15MB) AllocSpace objects, 0(0B) LOS objects, 27% free, 41MB/57MB, paused 1.305ms total 156.529ms
    2020-01-16 18:08:45.440 24490-24860/com.qualcomm.ftcrobotcontroller D/RobotCore: 14140LOG:newRampMotors: New Ramp Motors: Start:2200.0 End:250.0 heading:268.0


    In this case, the flanking log entries are separated by very little code, so it seems logical to conclude the GC was responsible for nearly all of the 1.3 seconds of elapsed time. However, that wasn't there in the failure mode described above. So that raises these questions:
    • We are presuming the Java Runtime for the FTC RC app is private to that app and thus all GC events will be in the Robot Log?
    • Is there a way to tune/control the GC? There are plenty of places (especially in the autonomous program) where we would be happy to cede cycles to the GC. Can we direct the process? Would that be wise or would it interfere with the other threads in the SDK?
    • Is there a way to know if the Android OS has "paused" the entire FTC RC app?

    Regardless, we still suspect something else in this specific case. Removing the call to sleep() seems to eliminate the problem describe above, which is hard to explain. Is anyone familiar with the thread architecture of the SDK? s it possible the thread RunOpMode is called inside is related to the thread that is handling USB communication in some way that the call to sleep() within RunOpMode could effect the other thread?

    Finally, we haven't yet built tests to confirm this, but I'm wondering if the rate at which we ping that sensor over the I2C bus might have something to do with it. There is a place in the program where the sensor tends to fail ("problem with xyzsensor" - haven't been able to isolate the associated log entry, so no idea what the SDK thinks the failure is). That failure tends to happen in a spot where that "moveToDistance()" method is called once, and then again, in short order. In practice, it always seems to be on the second call.

    Would introducing a maximum rate (perhaps 5-10 times per second) for calls to getDistance() on the sensor perhaps help?

    Thanks for any continued thoughts on all this!
    I have not looked into this in depth, but I don't think GC activity is logged. Keep in mind that GC, as it affects our program, isn't necessarily just related to our specific program, it also is called for numerous other programs concurrently running on the phone. (the phone is doing GC and other sorts of priority tasks unrelated to the FTC program) The philosophy of the phone manufacturers is that since Android is non-deterministic, then there is no reason to give any latitude to developers to try to make more deterministic program needs run more reliably, such as scheduling GC. As the others have said, there are potentially things you can do in your code to make sure you are not allocating new memory, but not really anything you can do with the program or phone to change GC and other tasks.

    Leave a comment:


  • dbreise2017
    replied
    Thank you for the additional ideas. The team is using Vuforia/Tensor flow but that is shut down (and presumably its garbage collected) many seconds before the failure mode described above.

    We think we have ruled out the notion of overloading the I2C bus and/or USB comms. We ran some tests this AM where the robot was stationary and we pinged the sensors as rapidly as possible. Consistently we see that we can get about 32 readings per second from one of these Rev2m sensors and about 11 readings/second from 5 of them (yes the robot has 5, ex FLL team was pretty excited to go beyond 4 sensor limit in FLL...). Running for extending periods did not provoke any visible or logged issues. We DID see the GC kick in from time to time, but with no negative consequences. In fact, we can't find noticeable additional latency between two log entries anywhere near the GC log entries...so all seems well in that department. Nevertheless, the advice about memory consumption is well received. As someone who predates byte code languages and common use of GCs, I do still remember the first time I saw what string concatenation in a loop did to a heap...that was quite shocking!

    So now we are back to looking at more pedestrian causes...ESD events? USB disconnects? In the meantime, I'm now very wary of using sleep(). I was taught to care and give back (in all things, including processor cycles), but I think we will hog those cycles in a timing loop instead. Haven't seen the particular failure mode described at the top of this thread since we made that change...but I still find it bothersome that I can't figure out a logical rationale for why a call to sleep on the opmode thread would interfere with the SDK communicating (and the hub executing) a motor power command that had been issued before the call to sleep()...maybe it has something to do with the SDK synchronization mechanisms?

    Leave a comment:

Working...
X