Wandering Thoughts archives

2012-02-03

Understanding a subtle Twitter feature

One part of getting on Twitter has been following people, which led me to discover that when you follow someone, Twitter doesn't show you all of their public tweets. As far as I can tell, the rule is that Twitter excludes any conversations they're having that involve only people you don't also follow. Their tweets in such a conversation still appear in their own public timeline, but not in your view of their tweets.

(This may only apply to relatively new Twitter accounts, or even only to some of them. I've seen Twitter give two different interfaces to two new accounts.)

On the one hand, when I discovered this I was infuriated. If you really did want to see everything (for example, so you could find other people to follow based on who the people you initially followed were having interesting conversations with), this made having a Twitter account worse than just perusing the Twitter pages of interesting people.

On the other hand, once I thought about it more I've come to reluctantly admire Twitter's trick with this feature. What it is, from my perspective, is a clever way to reduce the volume impact of following someone and thus make doing so less risky. Without it, following someone would immediately expose you to both their general remarks and the full flow of whatever conversations they have. With Twitter's way, you are initially exposed only to people's general remarks; you ramp up your exposure to their conversations by following more people, and ramp it down by unfollowing them.

My feeling is that exposure to an overwhelming firehose of updates is the general problem of social networking. Social networks usually want you to be active and to follow lots of people. But if those people are themselves active, the more people you follow the more volume descends on you, and it's especially bad when you follow very socially active users, the ones having a lot of conversations. This creates a disincentive to follow people and pushes you to scale back. Twitter suffers from this particularly badly because it has no separate 'comment' mechanism (comments are important for reducing volume). Twitter's trick here is thus a clever way to reduce the firehose naturally, without requiring user intervention and tuning; you could see it as a way of recreating something like comments in a system that doesn't naturally have them.

Since I realized this, the feature has certainly been working the way that Twitter probably intended. When I'm considering whether or not to follow someone, I don't really look at the volume of their tweets in general; I mostly look at the volume of their non-conversation tweets, because those are the only ones I'm going to see. Often this makes me more willing to follow people (and thereby furthers Twitter's overall goal of getting me more engaged with their service).

tech/TwitterVolumeLimit written at 22:48:37

Understanding Resident Set Size and the RSS problem on modern Unixes

On a modern Unix system with all sorts of memory sharing between processes, Resident Set Size is a hard thing to explain; I resorted to a very technical description in my entry on Linux memory stats. To actually understand RSS, let's back up and imagine a hypothetical old system that has no memory sharing between processes at all; each page of RAM is either free or in use by exactly one process.

(We'll ignore the RAM the operating system itself uses. In old Unixes, this was an acceptable simplification; memory was statically divided between memory used by the OS and memory used by user programs.)

In this system, processes acquire new pages of RAM by trying to access them and then either having them allocated or having them paged (back) in from disk. Meanwhile, the kernel is running around trying to free up memory, generally using some approximation of finding the least recently used page of RAM. How aggressively the operating system tries to reclaim pages depends on how much free memory it has; the less free memory, the faster the OS tries to grab pages back. In this environment, the resident set size of a process is how many pages of RAM it has. If the system is not thrashing, i.e. if there's enough memory to go around, a process's RSS is how much RAM it actually needs in order to work at its current pace.

(All of this is standard material from an operating system course.)
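
To make that model concrete, here is a minimal sketch in C of such a reclaim scan. This is invented illustration code, not any real kernel's; every name in it (struct page, reclaim_pass, and the helper functions) is made up for this entry.

    /*
     * A toy model of the no-sharing world: every physical page has at
     * most one owner, and a clock-style scan frees pages that have not
     * been touched since the last pass.  All names here are invented.
     */
    #include <stdbool.h>
    #include <stddef.h>

    struct proc;                        /* opaque process */

    struct page {
        struct proc *owner;             /* exactly one owner, or NULL if free */
        bool referenced;                /* set by the hardware on access */
    };

    extern struct page physmem[];       /* all physical pages */
    extern size_t npages, nfree;

    /* hypothetical helpers */
    void unmap_page(struct proc *p, struct page *pg);
    void free_page(struct page *pg);

    void reclaim_pass(void)
    {
        /* crude pressure response: the less free memory, the more
           pages we try to take back in one pass */
        size_t target = (npages - nfree) / (nfree + 1);

        for (size_t i = 0; i < npages && target > 0; i++) {
            struct page *pg = &physmem[i];

            if (pg->owner == NULL)
                continue;               /* already free */
            if (pg->referenced) {
                pg->referenced = false; /* used recently; spare it this pass */
            } else {
                unmap_page(pg->owner, pg);  /* take it from its one owner */
                free_page(pg);
                target--;
            }
        }
    }

In this model, a process's RSS is simply the number of pages whose owner field points at it, and the referenced bit is how the scan approximates 'least recently used'.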

The problem of RSS on modern Unix systems is how to adapt this model to an environment where processes share significant amounts of memory with each other. In the face of a lot of sharing, what does it mean for a process to have a resident set size, and how do you find the right pages to free up?

There are at least two approaches the kernel can take to reclaiming pages, which we can call the 'physical' and 'process' approaches. In the physical approach the kernel continues to scan over physical RAM to identify candidate pages to be freed up; when it finds one, it takes it away from all of the processes using it at once (this is the 'global' removal of my earlier entry). In the process approach the kernel scans each process more or less independently, finding candidate pages and removing them only from the process (a 'local' removal); only once a candidate page has been removed from all processes using it is it actually freed up.

(Scanning each 'process' is a simplification. Really the kernel scans each separate set of page tables; there are situations where multiple processes share a single set of page tables.)
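
Here is a sketch of the difference between the two approaches, again with invented names; the reverse map from a page to all of the PTEs that reference it is simply assumed to exist.

    /* Each physical page now tracks how many page tables map it;
       this is a variant of the invented struct page from above. */
    struct page {
        int mapcount;                   /* number of PTEs mapping this page */
        /* ... */
    };

    struct proc;                        /* opaque process */
    struct pte;                         /* opaque page table entry */

    /* hypothetical helpers, including a reverse map from a page to
       all of the PTEs that reference it */
    struct pte *rmap_first(struct page *pg);
    struct pte *rmap_next(struct page *pg, struct pte *pte);
    struct pte *lookup_pte(struct proc *p, struct page *pg);
    void pte_clear(struct pte *pte);
    void free_page(struct page *pg);

    /* 'Physical' approach: a global removal.  The page is taken away
       from every process mapping it and freed immediately. */
    void physical_reclaim(struct page *pg)
    {
        struct pte *pte;

        for (pte = rmap_first(pg); pte != NULL; pte = rmap_next(pg, pte))
            pte_clear(pte);
        pg->mapcount = 0;
        free_page(pg);
    }

    /* 'Process' approach: a local removal.  The page only becomes
       free once the last process mapping it has dropped it. */
    void process_reclaim(struct proc *p, struct page *pg)
    {
        pte_clear(lookup_pte(p, pg));
        if (--pg->mapcount == 0)
            free_page(pg);
    }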

The problem with the process approach is that the kernel can spend a great deal of time removing pages from processes when the pages will never actually be reclaimed for real. Imagine two processes with a shared memory area; one process uses it actively and one process only uses it slowly. The kernel can spend all the time it wants removing pages of the shared area from the less active process without ever actually getting any RAM back, because the active process is keeping all of those pages in RAM anyways.
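
(In terms of the process_reclaim() sketch above: scanning the slow process takes a shared page's mapcount from 2 down to 1 and no further, so nothing is freed; when the slow process next touches the page, a cheap soft fault maps it back in and the mapcount returns to 2. The scanning was pure overhead.)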

So, why doesn't everyone use the physical approach? My understanding is that the physical approach is often not a good fit for how the hardware exposes virtual memory activity information. Per my earlier entry, every process mapping a shared page of RAM can have a different page table entry for it. To find out whether the page of RAM has been accessed recently, you may have to find and look at all of those PTEs (with locking), and do so for every page of physical RAM you look at.
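
A sketch of what that check costs, building on the hypothetical rmap helpers from the previous sketch; real kernels need something similar but more involved:

    #include <stdbool.h>

    /* more hypothetical helpers */
    void lock_rmap(struct page *pg);
    void unlock_rmap(struct page *pg);
    bool pte_test_and_clear_accessed(struct pte *pte);

    /* Deciding whether a shared page was recently used means visiting
       the accessed bit in every PTE that maps it, under a lock. */
    bool page_recently_used(struct page *pg)
    {
        bool used = false;
        struct pte *pte;

        lock_rmap(pg);                  /* serialize against map/unmap */
        for (pte = rmap_first(pg); pte != NULL; pte = rmap_next(pg, pte))
            if (pte_test_and_clear_accessed(pte))
                used = true;            /* some mapper touched it */
        unlock_rmap(pg);

        /* the physical scanner pays this walk, and this locking, for
           every page of RAM it examines */
        return used;
    }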

My impression is that most current Unixes normally use per-process scanning, perhaps falling back on physical scanning if memory pressure gets sufficiently severe.

(I suspect and hope that virtual memory management in the face of shared pages has been studied academically, just as the older and simpler model of virtual memory has been, but I'm out of contact with OS research.)

unix/UnderstandingRSS written at 02:11:13

