On Tue, Jan 17, 2017 at 10:49 PM, lancedolan <[hidden email]> wrote:
> ...I've got almost every dev in the office all
> excited about this now haha....
This needs to make the New York Times front page: "almost everyone in
a developer's office excited about the same thing, which is not a
In reply to this post by chetan mehrotra
Couldn't this be simplified by just having the sticky session cookie last for only x seconds?
I like this idea, but I'm not sure this is really a Sling solution rather than an API-management or proxy solution. When you take an instance out of the pool, you would need to mark it as unavailable for new requests, but still honor it, for x amount of time, for requests carrying the sticky session cookie that says they should go there.
From: Chetan Mehrotra [mailto:[hidden email]]
Sent: Wednesday, January 18, 2017 6:49 AM
To: [hidden email]
Subject: Re: Not-sticky sessions with Sling?
> Each time we remove an
> instance, those users will go to a new Sling instance, and experience
> the inconsistency. Each time we add an instance, we will invalidate
> all stickiness and users will get re-assigned to a new Sling instance,
> and experience the inconsistency.
I can understand the issue around when an existing Sling server is removed from the pool. However, adding a new instance should not cause existing users to be reassigned.
Now to your queries:
> 1) When a brand new Sling instance discovers an existing JCR (Mongo), does it automatically and immediately go to the latest head revision?
It sees the latest head revision.
> Increasing load increases the number of seconds before a "sync," however it's always near-exactly a second interval.
Yes, there is an "asyncDelay" setting in DocumentNodeStore which defaults to 1 second. Currently it's not possible to modify it via OSGi config, though.
>- What event is causing it to "miss the window" and wait until the next 1 second synch interval?
This periodic read also involves some other work, like local cache invalidation and computing the external changes for observation, which causes this time to increase. The more changes are done, the more time is spent on that kind of work.
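To make the effect concrete, here is a minimal sketch of the point above. This is a hypothetical cost model, not Oak's actual code: it just assumes each background read pays a fixed cost plus a per-change cost for cache invalidation and observation diffing, so the effective interval between head updates stretches beyond the configured asyncDelay as write load grows.

```java
// Hypothetical model (not Oak's implementation) of why the effective "sync"
// interval can exceed the configured asyncDelay under write-heavy load.
public class BackgroundReadModel {
    static final long ASYNC_DELAY_MS = 1000; // DocumentNodeStore default, per the thread

    // Assumed cost model: fixed read cost plus per-change overhead for
    // local cache invalidation and computing external changes.
    static long cycleTimeMs(int changesSinceLastRead) {
        long readMs = 50;      // fetch the new head revision
        long perChangeMs = 2;  // invalidate caches, diff for observation
        return readMs + (long) changesSinceLastRead * perChangeMs;
    }

    // Effective interval between two consecutive head updates.
    static long effectiveIntervalMs(int changesSinceLastRead) {
        return ASYNC_DELAY_MS + cycleTimeMs(changesSinceLastRead);
    }

    public static void main(String[] args) {
        System.out.println(effectiveIntervalMs(0));    // lightly loaded: 1050
        System.out.println(effectiveIntervalMs(5000)); // write-heavy: 11050
    }
}
```

The numbers are made up; the point is only the shape: the delay is asyncDelay plus work proportional to the changes accumulated since the last read.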
Stickiness and Eventual Consistency
There are multiple levels of eventual consistency. If we go for sticky sessions then we are aiming for "session consistency". However, what we require in most cases is read-your-writes consistency.
We can discuss ways to do that efficiently with the current Oak architecture. Something like this is best discussed on oak-dev, though.
One possible approach would be to use a temporarily issued sticky cookie. Under this model:
1. The Sling cluster maintains a cluster-wide service which records the current head revision of each cluster node and computes the minimum revision across them.
2. A Sling client (web browser) is free to connect to any server until it performs a state-changing operation like a POST or PUT.
3. If it performs a state-changing operation, then the server which performs that operation issues a cookie marked as sticky, i.e. the load balancer is configured to treat it as the cookie used to determine stickiness. From then on, all requests from this browser go to the same server. Say this cookie records the current head revision.
4. In addition, each Sling server is constantly notified of the minimum revision visible cluster-wide. Once that minimum revision catches up to the revision recorded in #3, the server removes the cookie on the next response sent to that browser.
This state can also be used to determine whether a server is safe to take out of the cluster.
This is just a rough thought experiment which may or may not work, and it would require broader discussion!
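The four steps above can be sketched in code. Everything here is a hypothetical illustration, not an existing Sling API: a cluster-wide tracker records each node's head revision, a cookie issued on a write records the write's revision, and the cookie is dropped once the cluster-wide minimum revision has caught up (revisions are simplified to comparable longs).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the sticky-cookie lifecycle described in steps 1-4 above.
// All names are hypothetical; revisions are simplified to longs.
public class StickyCookieModel {

    // Step 1: cluster-wide service tracking each node's head revision.
    static class ClusterRevisionTracker {
        private final Map<String, Long> headByNode = new HashMap<>();

        void updateHead(String nodeId, long revision) {
            headByNode.put(nodeId, revision);
        }

        long minimumVisibleRevision() {
            return headByNode.values().stream()
                    .mapToLong(Long::longValue).min().orElse(0L);
        }
    }

    // Step 3: a state-changing request yields a sticky cookie recording
    // the head revision of the node that performed the write.
    static Optional<Long> afterWrite(long writeRevision) {
        return Optional.of(writeRevision);
    }

    // Step 4: drop the cookie once every node has seen that revision.
    static Optional<Long> onResponse(Optional<Long> cookie, ClusterRevisionTracker tracker) {
        if (cookie.isPresent() && tracker.minimumVisibleRevision() >= cookie.get()) {
            return Optional.empty(); // stickiness no longer needed
        }
        return cookie;
    }

    public static void main(String[] args) {
        ClusterRevisionTracker tracker = new ClusterRevisionTracker();
        tracker.updateHead("node-a", 12);
        tracker.updateHead("node-b", 10);

        Optional<Long> cookie = afterWrite(12);  // write landed at revision 12 on node-a
        cookie = onResponse(cookie, tracker);
        System.out.println(cookie.isPresent());  // true: node-b is still at 10

        tracker.updateHead("node-b", 12);        // background read catches node-b up
        cookie = onResponse(cookie, tracker);
        System.out.println(cookie.isPresent());  // false: cookie removed
    }
}
```

The nice property of this shape is that stickiness is bounded by actual replication lag rather than a fixed timeout.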
In reply to this post by Bertrand Delacretaz
Chetan is making things crystal clear for us.
Our next steps are:
1) Learn what the MAXIMUM "inconsistency window" could be.
Is it possible for the delay to exceed 5 seconds? 10 seconds? 60? What determines this? Only server load? I'll ask on the JCR forum and also experiment.
2) Design and test a solution almost exactly as Bertrand described.
Sling responds to POST/PUT/DELETE with a JCR revision, and Sling behaves differently when the request contains a JCR revision more recent than its current one. I have no idea what I'm getting into or how hard this will be.
Bertrand, I'd feel selfish taking you up on your offer to build this for me. Yet I'd be a fool not to at least partner with you to get it done. Should we correspond outside this mailing list?
Perhaps you could point me to the files you would edit to get this done and I could try to do it myself? I imagine a solution where you can configure, through OSGi, whether Sling will do one of the following:
A) Ignore JCR revision in Request, and function as it does today (Default setting)
B) Block until it has caught up to JCR revision in Request
C) Call some other custom handler? This way we can do custom things like sending a redirect to improve the user experience during a block. In a product like ours, 5- or 10-second blocks aren't acceptable without user feedback.
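The three modes could look something like the sketch below. This is purely illustrative, not existing Sling code: the mode would come from an OSGi configuration, the current revision comes from some supplier, and revisions are simplified to comparable longs.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch of the three configurable behaviors (A/B/C) above.
public class RevisionGateSketch {

    enum Mode { IGNORE, BLOCK, CUSTOM }

    /**
     * Returns true if the request may proceed on this instance, false if a
     * custom handler (e.g. a redirect with user feedback) should take over.
     */
    static boolean gate(Mode mode, long requestRevision, LongSupplier currentRevision,
                        long timeoutMs) throws InterruptedException {
        switch (mode) {
            case IGNORE: // A) behave as today: ignore the revision in the request
                return true;
            case BLOCK:  // B) poll until this instance has caught up, or time out
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
                while (currentRevision.getAsLong() < requestRevision) {
                    if (System.nanoTime() > deadline) {
                        return false;
                    }
                    Thread.sleep(50);
                }
                return true;
            case CUSTOM: // C) delegate: caller decides what to do when behind
            default:
                return currentRevision.getAsLong() >= requestRevision;
        }
    }
}
```

A real implementation would hang off Sling's request processing (e.g. a servlet filter) rather than a static method, but the decision logic would be roughly this.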
I also don't know how to determine the current Sling instance's Revision, or how to compute whether one revision is "more recent" than another.
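On the "more recent" question, here is a sketch worth checking against Oak's `org.apache.jackrabbit.oak.plugins.document.Revision`. The assumption here is that an Oak revision string has the form `r<timestamp-hex>-<counter-hex>-<clusterId-hex>`, and that "more recent" means a later timestamp with the counter breaking ties. Note that comparing revisions from *different* cluster nodes by timestamp alone ignores clock skew, so treat this as an approximation, not a definitive ordering.

```java
// Sketch of comparing two revision strings of the assumed form
// "r<timestamp-hex>-<counter-hex>-<clusterId-hex>". Verify the actual
// format and comparison semantics against Oak's Revision class.
public class RevisionCompareSketch {

    static long[] parse(String rev) {
        if (!rev.startsWith("r")) {
            throw new IllegalArgumentException("not a revision: " + rev);
        }
        String[] parts = rev.substring(1).split("-");
        return new long[] {
            Long.parseLong(parts[0], 16), // timestamp (ms)
            Long.parseLong(parts[1], 16), // counter within the same ms
            Long.parseLong(parts[2], 16)  // clusterId
        };
    }

    /** Returns true if a is more recent than b (timestamp, then counter). */
    static boolean isMoreRecent(String a, String b) {
        long[] ra = parse(a), rb = parse(b);
        if (ra[0] != rb[0]) {
            return ra[0] > rb[0];
        }
        return ra[1] > rb[1];
    }
}
```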
Responding to a couple other minor points:
Thank you Felix :) I've actually done this work recently and it's working great! We have "stateless" authentication now, but are now dealing with the unacceptable inconsistency that Chetan warned about.
That's the question on the table: in a write-heavy application, how do we provide a read-your-writes consistent experience on an eventually consistent system (a Sling cluster), when traditional sticky sessions are an invalid solution because the user base is large enough to demand server scaling several times throughout the day?
When adding an instance, we purposely invalidate all sticky sessions and users will get re-assigned to a new Sling instance, so that the new server actually improves performance.
Imagine a farm of 4 app servers that has been SLAMMED and isn't performing well. Adding 1 or 100 new servers to that farm won't improve performance if every user is "stuck" to the previous 4 servers.
If we don't do this invalidation and re-assignment on scaling up, it can potentially take hours for a scale-up to positively impact an overloaded cluster.
Thank you for pointing me to the code, Bertrand :) Given the new information from Chetan, I'm losing interest in changing that value. Perhaps setting asyncDelay to 0 or some small number would make Sling perform slower but be more consistent...
However, my tentative assessment is that the interval would just be "checked" more often, but it would also get skipped more often, due to "local cache invalidation, computing the external changes for observation", as Chetan put it.
I would love to be wrong about this and I'll ask on the JCR forum.
In reply to this post by Jason Bailey
Bertrand, probably hold the phone on everything else I suggested in my last post. This solution is insanely simple, embarrassingly obvious in hindsight, and the architects on our side can see no problem with it.
We actually had no idea that there is an expiration-in-seconds setting in the AWS Elastic Load Balancer. We just checked the interface and found the setting. Obviously, in the good old days of F5 we could do whatever we wanted, but we're married to AWS now and had no idea we could do this.
Thank you Jason, you might have just saved me an unsavory development task while helping me Keep It Simple, Stupid.
In reply to this post by lancedolan
On Wed, Jan 18, 2017 at 11:21 PM, lancedolan <[hidden email]> wrote:
> ...Bertrand, I'd feel selfish taking you up on your offer to build this for me.
> Yet I'd be a fool to not at least partner with you to get it done. Should we
> correspond outside this mail list?...
I understand you're probably looking at a different solution now but
just wanted to clarify this: the Sling dev list would be the place to
discuss such things, no need for off-list communications.