Fixed activity model updates logging at ~3000 messages/second in a tight loop, producing excessively large log files (BIG-305)
Fixed map not updating zoom and context correctly when navigating between timeline days with background views still loading (BIG-324)
Fixed dark mode backgrounds showing incorrect colours on the activity tab, Place Edit view, Workout view, and timeline day list (BIG-264)
Fixed sample export filenames using calendar year instead of ISO week-numbering year near year boundaries, causing week files (e.g., 2015-W01.json.gz) to contain samples from the wrong year
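For the curious, the gotcha behind that last one: "yyyy" formats the calendar year while "YYYY" formats the ISO week-numbering year, and the two diverge near year boundaries. A minimal sketch of the difference (not the actual export code):

```swift
import Foundation

// Sketch of the gotcha: "yyyy" is the calendar year, "YYYY" is the ISO
// week-numbering year, and they disagree near year boundaries.
// 2014-12-29 falls in ISO week 2015-W01.
var calendar = Calendar(identifier: .iso8601)
calendar.timeZone = TimeZone(identifier: "UTC")!
let date = calendar.date(from: DateComponents(year: 2014, month: 12, day: 29))!

let formatter = DateFormatter()
formatter.locale = Locale(identifier: "en_US_POSIX")
formatter.calendar = calendar
formatter.timeZone = calendar.timeZone

formatter.dateFormat = "yyyy-'W'ww"  // buggy: calendar year
print(formatter.string(from: date))  // 2014-W01 (wrong year for this week)

formatter.dateFormat = "YYYY-'W'ww"  // fixed: week-numbering year
print(formatter.string(from: date))  // 2015-W01
```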
Improvements
New installs and fresh migrations now include a bundled base activity classification model, providing reasonable activity type detection before location-specific models have been trained (BIG-42)
Locked to portrait orientation (BIG-291)
Added loading spinner to photo viewer while fetching images from Photos (BIG-330)
Standardised loading spinners across the app to a consistent larger size (BIG-326)
Moved delete button to the right side of the confirm button on visit edit toolbar (BIG-329)
A grab bag of little things with the new build; mostly hangs.
If you’d prefer that I make separate threads for stuff like this, instead of lumping it into the release topic, please let me know.
One hang was a hard UI freeze in the main Timeline view, as I was reviewing today’s events. It wouldn’t hang instantly on starting the app, but within a few seconds. It was hung hung. I’d have to kill the UI and restart.
After a few such attempts, I moved on and figured I’d debug it more later when I had time.
…of course, now it’s that later time, and I can’t reproduce it. Lots of those events were still “processing” back then, which have now completed, so maybe that’s related? I’m just speculating wildly.
I tapped on a segment that was incorrectly classified, planning to change it to another type. When I tap on it, I just get this spinner:
I killed the UI and restarted, and that fixed things briefly. Once I’d opened and changed a couple of activity types, now it’s back in this spinning-wheel state for any one I open.
A similar issue, but this one is a Place. I tap on the Place, then on the pencil to select/name it, and it never populates:
All good with dumping them in here! I think if we get something going off on enough of a tangent we’ll know. For these bug reports, this feels like the right place.
Hm. This one’s going to be the most troublesome I think. Because it’s one I haven’t seen myself, meaning it’s not common or not easy to reproduce. And because it hangs-not-crashes it’s not gonna hit crash report logging. Even if you left it to eventually get terminated by iOS, it’d be a watchdog report, not a crash report.
I think the best we can hope for is that it happens to me while connected to the debugger in Xcode, and I can pause the app and have that “ah hah! there it is!” moment.
This one I think I know. In some earlier work Claude and I did to reduce the risk of this, we identified a list of potential pain points and non-ideal code. We fixed the headliners but didn’t do the full list. I wanted to feel out whether the problem would be resolved by that (in which case there’d be no point bothering with the rest) or whether it could still happen. Turns out yes, it can still happen.
Anyway I’m fairly confident I know which of the remaining bits in the list will be the causes of this. So I should be able to jump back into that stuff and get more fixed up.
Yep, I’m pretty sure that’s it. Well, not the skipping, but the fact that there’s a classify task still sitting there, doing something, but that should’ve already finished long ago. And that then jams up everything else that wants to get busy after it. It’s not a serial queue, but … enough people asking for things across actor boundaries and not getting answers, and you end up with a quasi (or complete) deadlock.
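As a rough sketch of the shape of the problem (hypothetical names, not the real Arc types): one actor gets wedged by a long synchronous call, and everything awaiting it, directly or through other actors, stalls too.

```swift
import Foundation

// Hypothetical names, not Arc's real types. One actor gets wedged by a
// long synchronous call; everything awaiting it (directly or via other
// actors) stalls too, giving the quasi-deadlock described above.
actor ClassifierActor {
    func classify() async -> String {
        hungFrameworkCall()  // stand-in for e.g. a CoreML call that never returns
        return "walking"
    }
    private func hungFrameworkCall() {
        Thread.sleep(forTimeInterval: .infinity)  // occupies this actor forever
    }
}

actor TimelineActor {
    private let classifier = ClassifierActor()

    func fetchItems() async -> [String] {
        // This await never completes, so anything awaiting fetchItems()
        // never gets an answer either; stack up enough of these cross-actor
        // waits and the whole system stalls.
        let type = await classifier.classify()
        return [type]
    }
}
```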
I suspect this one is a victim of the previous one. The places system doesn’t have anything heavyweight that could backlog or deadlock or anything else. But it does in the view model / UI chains have to ask for various things across actor boundaries. Probably what’s happened is the aforementioned jamming is getting in the way of the places system finishing up what it’s trying to do.
So hopefully fixing one will fix both. If I’m lucky.
This one’s new to me! Never seen it before. I’ll search my own logs to see if that’s in there.
Usually the HealthKit errors I see are ones about trying to access the HealthKit db from the background / when phone is locked. Which is entirely harmless. In those cases a HealthKit request is started when the app is in the foreground, but then HealthKit takes too long to pay attention, the app goes into background, the phone is locked, HealthKit db gets locked, and the error is thrown, “you’re not allowed to ask for this stuff when the phone is locked!” Sigh.
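Roughly the shape of it, as a hedged sketch (hypothetical and simplified, not Arc's actual query code):

```swift
import HealthKit

// The benign failure mode described above: a query is started while the
// app is foregrounded, but by the time HealthKit services it the phone
// is locked and its db is inaccessible, so the results handler just
// gets an authorisation-style error.
func fetchRecentSteps(store: HKHealthStore) {
    guard let stepType = HKObjectType.quantityType(forIdentifier: .stepCount) else { return }
    let query = HKSampleQuery(sampleType: stepType, predicate: nil,
                              limit: HKObjectQueryNoLimit,
                              sortDescriptors: nil) { _, samples, error in
        if let error {
            // The harmless case: device locked mid-request.
            print("HealthKit query failed (likely locked db): \(error)")
            return
        }
        print("Fetched \(samples?.count ?? 0) samples")
    }
    store.execute(query)
}
```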
But yeah, that one’s harmless. While this nilError is … curious. I’ll see what I can find.
Thanks for all the reports btw! Super helpful stuff! We’re getting close to the final App Store v1.0 release hopefully before the end of the month. So we won’t want any of these weird behaviours hanging around by then.
Turns out I left it in that state overnight, and it’s still in that state, so something is definitely wanged. Does this list of stuck active operations help at all?
After restarting it, those 18 ActivityType model pending updates are back in the queue, of course. But as long as I don’t view yesterday’s timeline, that seems to avoid activating them.
It happens with today’s timeline, too – and unlike yesterday, is so far 100% reproducible. It hangs about 10 seconds after loading.
The bad news is that of course I don’t have your code or debugger. And since the hang is triggered by viewing today’s timeline – i.e. the default view – it means the app has about a 10-second lifespan. Fortunately it still seems to be recording in the background, and hopefully tomorrow will give it a new lease on life.
It’s been doing this most of today, but there have occasionally been times where it doesn’t trigger the hang immediately. You can see in the screenshot below that I managed to get into the debug info screen, and those timers were more than 30 seconds old. Whereas if I open it right now, I can’t even get to 10 seconds.
I feel like those times when it didn’t hang immediately were times when I’d moved recently, and the default view wasn’t showing very many events (because there were a bunch of segments that it hadn’t combined yet). But it’s only a vague feeling.
Even when it didn’t hang immediately, if I went back out to the timeline view and scrolled the page up and down a bit, it would hang. I think there may be a correlation between being fully zoomed out (i.e. at the top of the timeline) and triggering the hang, but this too might be a red herring.
In other news, today’s timeline has also really jammed up the operations queue, similar to yesterday’s. None of these ever make any progress, even without the hang; the queue is just growing and growing:
Note that one of them says ActivityClassifier.results(for:timeout:), which seemed noteworthy. And even TimelineSegment.fetchItems() – which sounds, to this naïve observer, like something that should complete very quickly – will hang indefinitely.
Finally, I feel like there might be a small correlation between these issues happening when I’m outside the US (i.e. in locales with less-than-stellar place + geofencing coverage), but that might be yet another distraction.
Hope that helps! I’ll be curious to see tomorrow, when I roll the timeline view back to today, if it still triggers the hang.
It does! Mostly in the sense that it looks very similar to ones I’m seeing. But also the specific top ones that’re hung the longest are strong clues. Or at least where to start looking first.
I’ve got a new fix that landed yesterday, which improves part of this significantly. But I’m still seeing some bad behaviour today, post-fix, so I’m going to do more investigation today and send out a build 40 this afternoon, rather than rushing it out this morning. I’m not convinced yesterday’s fix is substantial enough.
And curiously, I think yesterday’s fix has unleashed other pre-existing problems that were previously being held back from living their true selves due to other blockages. So by unlocking one blockage I think I’ve opened the door for other offenders to run wild and cause the problems they’ve been eager to cause all along.
Fun and games in the world of “too much parallelism, but it’s also all kind of necessary”. Sometimes I wish Arc were a simple CRUD app!
Eek. So this is a full UI hang? App frozen? This one feels like something unrelated. Though unfortunately I’m in the same “it’s only a vague feeling” zone as you. There’s enough complexity in there that it’s hard to quickly point at any single thing and say “ok that’s gotta be it”.
Yeah, both noteworthy. Yesterday’s fix was in the classifier chain, fixing some potential (and sometimes common) blockages there. And the fetchItems() one is most likely a problem of chatter across Swift actor boundaries getting blocked, due to the actor on the other side being blocked (e.g. the ActivityTypesActor being blocked by a CoreML hang, then TimelineActor stuck waiting on something from ActivityTypesActor while trying to do the fetchItems).
I’ll dig into that specific chain today.
It can definitely be a factor, though not so much data quality / US-vs-outside. It’s more familiar vs unfamiliar. Your phone’s Location Services builds up better accuracy when dealing with the familiar. It maps out wifi hotspot triangulation in more detail, and… does other mysterious things that only Apple insiders know.
That means that when recording location data at places you’re frequently at (and recently at) the location data is better. And then when you go somewhere new the phone has a minor freak out, and spits out pretty crap location data for the first day or two.
That can sometimes make a monumental mess of the timelines in Arc. Then the timeline processing engine has to work overtime, there’s more things moving around, and many more chances for things to jam up and get into these ugly stalled / backlogged states.
Anyway, fingers crossed Claude and I find some good clues today and get another fix in! But even if not, we’ll still ship a build 40 this afternoon, with yesterday’s fix. That should alleviate some of the pain, though still won’t be the full root cause fix.
As I expected, it happened 100% of the time for the rest of the day yesterday.
Right now, navigating to yesterday still triggers the hang – but less reliably. Maybe 75%? Sometimes all I need to do is navigate to yesterday and wait 10 seconds. But sometimes it takes longer, and sometimes I can’t make it happen at all.
Interestingly, even if I can’t make it hang, I have noticed this pattern: I go to yesterday and fiddle around long enough to convince me that the UI hasn’t hung, then put the app in the background. When I reopen it a minute later, the app has restarted. Which makes me wonder if something is hung enough for iOS to kill the app.
That means that when recording location data at places you’re frequently at (and recently at) the location data is better. And then when you go somewhere new the phone has a minor freak out, and spits out pretty crap location data for the first day or two.
Ah, I see. My life is an endless series of going to new places and not staying there very long, so this sounds like about the worst case.
Anyway, fingers crossed Claude and I find some good clues today and get another fix in! But even if not, we’ll still ship a build 40 this afternoon, with yesterday’s fix. That should alleviate some of the pain, though still won’t be the full root cause fix.
Thank you for that! Unfortunately (though it sounds like unsurprisingly), this has not fully resolved the queue blocking issue.
One thing that seems interesting to me, but may be totally irrelevant: when I first restart the app, the queue has something like 130 tasks in it. When it gets stuck, there will be 9-10 active, and zero in the queue. But when I restart, it’s back to 130. So what happened to the other 120 in that middle phase? If they were completed, then why did they reappear? If they didn’t complete, then why weren’t they in the queue? (Don’t feel compelled to answer these questions; they’re rhetorical.)
Everything in this reply was done with build 40. Do you want me to continue discussing these issues here, or take this conversation to the build 40 thread (or a whole separate thread)?
Thanks for your hard work! I spent 30 years building software, so I get the struggle of trying to debug something remotely through an unreliable narrator.
Yeah that’s concerning. In the past any hangs or crashes have been UI triggered, so going into the background is the safe escape - recording continues fine. But this sounds like… well it could still be UI triggered, but then on a longer fuse, that doesn’t get extinguished if you go into the background. Hmm.
I’ll check Sentry and Apple’s crash reports again. I’m pretty sure there’ll be nothing of use - hangs that end in watchdog termination don’t produce usable reports most of the time. But… yeah, gotta find some more clues.
Same for me! Which is great in that I get plenty of troublesome test data. But annoying in that it’s the situation in which the system works least well, both on the iOS side and Arc side. Machine Learning loves the familiar, so when you’re constantly giving it new and unfamiliar, results are not so great.
In build 40, what’s the “Active Operations” looking like? So far I haven’t been able to get it backlogged again, after yesterday’s fixes. I’m not convinced we’ve finally caught all the root causes, but… so far it’s looking far more promising than previous attempts.
I’m thinking the permanent hang → termination might be something else unrelated. Like, something that won’t show up as an operations backlog/deadlock.
Oh wait, what?! That’s crazy. There should be 0-3 in there almost always! Uh… Ok, I take back what I just said.
In this case how are you distinguishing active versus in queue?
Hm. These are in Active Operations, yeah? Or do you mean Task Queues? Is that the active vs queue distinction? Actually, on reflection, that matches what you’re saying well. I was thinking only narrowly in Active Operations terms.
Ok so, I was going to say the two are unrelated, but they’re not quite. The queued tasks (Task Queues) are ones that should mostly only run in the background, fired off by iOS’s background tasks system. But they can sometimes fire off in the foreground, if Arc decides the task hasn’t completed recently enough.
The next section, Scheduled Tasks, is the tasks that work through the Task Queues. I’m now realising the naming of these sections is quite arbitrary and ambiguous. Heh.
Ok so yeah, let’s say the activityTypeModelUpdates scheduled task is overdue by 2 days. And Task Queues shows a non-zero count of ActivityType models pending update. At that point Arc might decide to run that task in the foreground, to catch up on the backlog there, because iOS has been neglecting to run it in the background.
If that happens you’ll see the top of timeline view’s little info/button bar say “Updating activity models”, and in Active Operations you’ll see some entries about updateModel() or some such.
If that’s happening, that’s definitely useful info. Because it could mean that catchup running in the foreground is creating unexpected contention and blockage.
Though if at the time of the hangs the Active Operations list is empty (though I guess you can’t see, because hung), then… that’s the case where I’m guessing the hang is unrelated to the recent fixes. Because the recent fixes are all about making sure Active Operations doesn’t get backlogged/deadlocked.
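For reference, a rough mental model of those three debug sections, with hypothetical types (not the real ones):

```swift
import Foundation

// Hypothetical types, just to pin down the three debug sections:
// queued work items, the scheduled tasks that drain those queues,
// and the operations actively running right now.
struct TaskQueues {
    var activityTypeModelsPendingUpdate: [String]  // model IDs
    var placesPendingUpdate: [String]              // place IDs
}

struct ScheduledTask {
    let id: String            // e.g. "activityTypeModelUpdates"
    var lastCompleted: Date
    var interval: TimeInterval
    var isOverdue: Bool {
        Date().timeIntervalSince(lastCompleted) > interval
    }
}

struct ActiveOperation {
    let name: String          // e.g. "updateModel()"
    let startedAt: Date       // long-lived entries here are the red flag
}
```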
Here seems fine I think. The threads aren’t high volume at the moment, so seems harmless to continue wherever.
Hah! Thanks! You’re more reliable than most! There’s quite a skill to playing detective on this kind of mystery-solving debugging task. I can tell you’ve done it plenty of times before.
But something over the past 24 hours has managed to catch up without hanging, because now I can navigate to the previous problematic days without triggering either the UI hang or the endless Active Operations. Right now I just have 60 pending in the Task Queues (21 ActivityType updates, 39 Places), which I guess will run when they feel like it. [Oop, I partially retract this; see more below.]
In this case how are you distinguishing active versus in queue?
I’m referring to the sections labeled “Active Operations” and “Task Queues” in the Debug Info. It’s all I have to go on.
For example, in these build 39 screenshots, first showing 9 active, 0 queued:
Ok so, I was going to say the two are unrelated, but they’re not quite. The queued tasks (Task Queues) are ones that should mostly only run in the background, fired off by iOS’s background tasks system. But they can sometimes fire off in the foreground, if Arc decides the task hasn’t completed recently enough.
As far as I can tell, these were getting triggered every time I reopened the app after either a hang, or a manual restart to kill the backlogged actions that were never finishing.
If that’s happening, that’s definitely useful info. Because it could mean that catchup running in the foreground is creating unexpected contention and blockage.
That all sounds like it matches pretty accurately what I was seeing.
And I think that’s a perfectly fine system, when those Activity model updates (or whatever) are performing normally. But when they get stuck forever…
If you’re not confident that you can absolutely and deterministically prevent these from getting stuck, due to the nature of the task, then it reminds me of a similar situation in a past life. I ended up adding my own internal watchdog that would trigger when a running action was hung for an unreasonably long time, gather a bunch of debug info, send us a report, and kill whatever thing was running too long.
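Something like this, roughly (hypothetical names, and it only helps when the hung work actually responds to cancellation):

```swift
import Foundation

// A minimal sketch of that kind of internal watchdog: race the operation
// against a deadline; if the deadline wins, report and bail out.
enum WatchdogError: Error { case timedOut(operation: String) }

func withWatchdog<T: Sendable>(
    _ name: String,
    seconds: Double,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            // A real watchdog would gather debug info and send a report here.
            throw WatchdogError.timedOut(operation: name)
        }
        let result = try await group.next()!  // first to finish wins
        group.cancelAll()                     // kill the loser
        return result
    }
}
```

(Whether the hung work can actually be killed is another matter, of course.)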
Having just restarted with build 41, it decided to try running some of those backlogged tasks, and now I have the updateModels that don’t want to finish:
…and of course those days’ timelines aren’t visible. Just spinners.
Interestingly, after I went to retrieve that screenshot and airdrop it to my laptop, when I reopened Arc, it had been killed! So even though it didn’t feel like a full UI hang, someone was unhappy enough to restart the app.
This time it hasn’t restarted any of those queued tasks, and I haven’t found any way to make it start doing so.
Heh. There are already two systems somewhat like that, to protect from Apple frameworks forever-hanging.
The first is for the Photos framework, which sometimes goes off with a simple request and never comes back. That thread is lost for good. So Arc has to do those requests off the actor queues, in detached tasks, and after a timeout give up on them entirely. (iOS might eventually clean up the thread later, once Photos framework stops being a dopey pest about it.)
I added another of the same kind just the other day for CoreML too. In some cases we’ve seen CoreML hang in a similar way. And both Photos and CoreML frameworks are … older ones that don’t respond to cancellation, etc. So yeah, once they’ve hung, that thread is burnt. Fun.
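The siloing pattern is roughly this (a sketch with hypothetical names, not the actual Arc code):

```swift
import UIKit

// Hypothetical stand-in for a Photos framework request that can hang forever.
func legacyPhotosFetch() async -> UIImage? { nil }

// Race the risky call against a deadline, off any actor's queue. If the
// call is hung we just stop waiting; the thread it's on is abandoned.
func photosFetchWithTimeout(seconds: Double) async -> UIImage? {
    await withTaskGroup(of: UIImage?.self) { group in
        group.addTask { await legacyPhotosFetch() }  // the risky request
        group.addTask {                              // the deadline
            try? await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            return nil
        }
        let first = await group.next() ?? nil  // whichever finishes first
        group.cancelAll()  // these older frameworks often ignore cancellation
        return first
    }
}
```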
I suspect though what you’re seeing there is… actually, I’d better read the rest of your reply. Build 41 has a significant fix for database contention, which is actually where I suspect the CoreML model updates are getting hung up (often the heaviest part of a model update is fetching the tens or hundreds of thousands of training samples).
Ok so it’s still happening in build 41 then. Hmm. Perhaps my db contention theory for the model updates wasn’t right, and it’s something else.
Also these are CD2 models. Neighbourhood scale. CD1 are the country/state level models, and CD0 is the global model. So CD0 is the chonkiest, CD1 can be pretty bulky for anywhere you’ve lived for any reasonable period of time, and CD2 is only ever chonky if it’s a neighbourhood you’ve lived many months/years in.
I’m not sure if those CD2s in your screenshots correlate with neighbourhoods you’ve spent a lot of time in or not. But given there’s a whole bunch of them, I’m suspecting they’re not all big boi models that pull in hundreds of thousands of samples. More likely on the smaller side. Which again makes the db contention theory not so convincing…
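In code terms the hierarchy is roughly (illustrative names only):

```swift
// Rough sketch of the model depth levels as described.
enum ModelDepth: Int {
    case cd0 = 0  // global: one model for everywhere, the chonkiest
    case cd1 = 1  // country/state scale: bulky anywhere lived-in for long
    case cd2 = 2  // neighbourhood scale: only big after months/years in one spot
}
```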
There’s also the question of why those models are updating in the foreground. If a model is too small, then it’s fired off for update immediately, as soon as you make any activity type correction/confirmation in its bounding box. But you’re saying that these ones are firing up on app startup, so it’s more likely that they’re being updated because the scheduled task is >2 days overdue, not because they’re small enough to warrant updating immediately on confirm/correct.
That’s curious. Maybe they’re all done now? If the model updates task is >2 days overdue then it’ll keep trying to run ~10 seconds after foregrounding. Oh wait, I should check if that’s actually 10 seconds, because I think that matches some of your earlier reports? Yep, it’s a 10 second wait on foregrounding before running overdue scheduled tasks. Curious.
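So the catch-up rule is roughly this (a sketch, not the real implementation):

```swift
import Foundation

// ~10 seconds after foregrounding, run any scheduled task that's more
// than 2 days overdue, rather than keep waiting for iOS to run it in
// the background.
func runOverdueTasksAfterForegrounding(
    lastCompleted: [String: Date],
    runners: [String: @Sendable () async -> Void]
) {
    Task {
        try? await Task.sleep(nanoseconds: 10 * 1_000_000_000)  // the 10s wait
        let twoDays: TimeInterval = 2 * 24 * 60 * 60
        for (name, run) in runners {
            let last = lastCompleted[name] ?? .distantPast
            if Date().timeIntervalSince(last) > twoDays {
                await run()  // e.g. "activityTypeModelUpdates" catching up
            }
        }
    }
}
```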
Anyway, seems like the remaining problem now is these model updates running in the foreground (which in itself shouldn’t be a problem), but being hefty enough tasks that they’re saturating the db reader pool (though I’m not sure how - the pool size is … oh wait, it’s 12, I thought it was over 20. Hmm).
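For illustration, with a GRDB-style pool the setup looks something like this (a sketch, not Arc’s actual configuration; the table name is hypothetical):

```swift
import GRDB

// Reader pool: up to 12 concurrent readers before further reads queue.
func makePool(at path: String) throws -> DatabasePool {
    var config = Configuration()
    config.maximumReaderCount = 12
    return try DatabasePool(path: path, configuration: config)
}

// A heavy fetch like this can hold a reader for a long time; with all 12
// readers held by model updates, every other read in the app waits in line.
func trainingSampleCount(in pool: DatabasePool) throws -> Int {
    try pool.read { db in
        try Int.fetchOne(db, sql: "SELECT COUNT(*) FROM Sample") ?? 0
    }
}
```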
Aight, time to get to hunting! If I find anything convincing today, maybe there’ll be a build 42 soon.
Very interesting details! I’ve done very little iOS development. Disappointing that such widely-used libraries, into which such an enormous pile of money has been poured, still have such fundamental defects…
Anyway!
When I opened the app yesterday morning, it had somehow cleared the backlog overnight. I don’t really know how or why or what exactly changed, but once it pushed through all that, I was able to use all portions of the app again as usual.
We’ll see how these new builds go as I hit the road again. Thanks for your help, as always!
I think a good part of it is just “old stuff”. Like, the Photos framework first arrived in iOS 8 in 2014. The code in there will be of the kind that no one willingly wants to update or maintain, I imagine. And it’ll all be patterns and techniques that they abandoned in newer frameworks many years ago.
CoreML is more curious. But on that one, it turns out the main pain points Arc was experiencing were actually on the Arc side! So CoreML is mostly off the hook there. Though our safety net of siloing the requests to detached tasks is still good to have.
Good to hear! I suspect what happened there is the ActivityTypes model update scheduled task finally ran to completion in the background overnight. iOS finally decided to both honour the run request and allow it to run long enough to finish.
Which means that the app is no longer trying to build or update those in the foreground, while also trying to serve UI, live processing, etc.
Anyway, fingers crossed that’s the end of the troubles!