Not-sticky sessions with Sling?

Not-sticky sessions with Sling?

lancedolan
The only example code I can find for authenticating to Sling uses the JEE servlet container's "j_security_check", which then stores the authenticated session in app server memory. A load balancer without sticky sessions enabled will cause an unstable experience for users, in which they are suddenly unauthenticated.

- Does Sling already offer a mechanism for authenticating without storing that JCR session in the servlet container session?
- Do any of you avoid sticky sessions without writing custom code?

I'm thinking that this problem *must* be solved already. Either there's an AuthenticationHandler in Sling that I haven't found yet, or there's an open-source example that somebody could share with me :)

If I must write this myself, is this the best place to start?
https://sling.apache.org/documentation/the-sling-engine/authentication/authentication-authenticationhandler.html
https://sling.apache.org/apidocs/sling8/org/apache/sling/auth/core/spi/AuthenticationHandler.html
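
For reference, here is roughly the shape I have in mind based on that SPI (a minimal sketch only; the "X-Auth-Token" header and the token validation are hypothetical placeholders, not anything Sling ships):

import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.auth.core.spi.AuthenticationHandler;
import org.apache.sling.auth.core.spi.AuthenticationInfo;
import org.osgi.service.component.annotations.Component;

// Stateless handler sketch: credentials travel with every request, so nothing
// is kept in the servlet container session and any cluster node can serve it.
@Component(
    service = AuthenticationHandler.class,
    property = { AuthenticationHandler.PATH_PROPERTY + "=/" })
public class TokenAuthenticationHandler implements AuthenticationHandler {

    @Override
    public AuthenticationInfo extractCredentials(HttpServletRequest request,
            HttpServletResponse response) {
        String token = request.getHeader("X-Auth-Token"); // hypothetical header
        if (token == null) {
            return null; // not handled here; let other handlers or anonymous access apply
        }
        String userId = validateAndResolveUser(token);
        return userId == null ? AuthenticationInfo.FAIL_AUTH
                              : new AuthenticationInfo("TOKEN", userId);
    }

    @Override
    public boolean requestCredentials(HttpServletRequest request,
            HttpServletResponse response) throws IOException {
        response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        return true;
    }

    @Override
    public void dropCredentials(HttpServletRequest request, HttpServletResponse response) {
        // nothing is stored server-side; the client simply discards its token
    }

    private String validateAndResolveUser(String token) {
        // placeholder for real validation of a signed, self-contained token
        return null;
    }
}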

... as usual, thanks guys. I realize I'm really dominating the mail list lately. I've got a lot to solve :)

Re: Not-sticky sessions with Sling?

chetan mehrotra
If you are running a cluster with Sling on Oak/Mongo then sticky
sessions are required due to the eventually consistent nature of the
repository. Changes made on one cluster node are not immediately
visible on the other cluster nodes. Hence, to provide a consistent
user experience, sticky sessions are required.
Chetan Mehrotra



Re: Not-sticky sessions with Sling?

lancedolan
Chetan,

I'd like to confirm to what degree that is true for our proposed architecture. It seems the only "state" stored on the Sling instances themselves is OSGi configurations and OSGi bundles, and only those would be "eventually consistent." Everything else is in the JCR, which Mongo can provide as strongly consistent (I believe). Consider this example and correct me where I'm wrong. I'd hate to shoot myself in the foot with bad assumptions.

Imagine 3 Sling instances all talking to 1 Mongo instance. In this case, it seems to me that all repo state is captured in a single Mongo instance, which is consistent by default; eventual consistency only arises if you read from secondary members of a Mongo replica set. In an architecture with only one Mongo instance, the moment one instance writes to the JCR, another instance will read the same data and agree consistently. It seems to me that the JCR state is strongly consistent.

However, OSGi configurations seem to propagate between instances through the JCR only eventually... Additionally, when we deploy a new OSGi bundle to the JCR (in an install directory or whatever), those seem to only eventually propagate to all Sling instances. I'm not totally sure these are "eventual," but they seem like the only state that would be "eventual" in this architecture.

So, as long as we're cool with OSGi configurations and bundle installations being eventual, everything else, stored in the JCR, should be strongly consistent, right?

And then, I believe we can even scale the Mongo instances into a replica set for better availability and we'll still be strongly consistent so long as all Sling instances only read from the primary member of the replica set: [1].

Thanks for your time and thoughts dude!

[1] https://www.mongodb.com/faq#consistency

Re: Not-sticky sessions with Sling?

chetan mehrotra
On Fri, Jan 13, 2017 at 12:20 AM, lancedolan <[hidden email]> wrote:
> In an architecture with
> only one Mongo instance, the moment one instance writes to the JCR, another
> instance will read the same data and agree consistently. It seems to me that
> the JCR state is strongly consistent.

No. The DocumentNodeStore on each Sling node in the cluster
periodically polls the backend for the root node state revision. If any
change is detected, it updates its head revision to match the last seen
root node revision from Mongo and then generates external observation
events. So any change done on cluster node N1 becomes _visible sometime
later_ on cluster node N2.

So if you create a node on N1 and immediately try to read it on N2,
that read may fail because the change might not be "visible" on the
other cluster node yet. Any new session opened on N2 has its base
revision set to the current head revision of that cluster node, which
may be older than the current head revision in Mongo.

However, the writes are still consistent. So if you modify the same
property concurrently from different cluster nodes, one of the writes
would succeed and the other would fail with a conflict.

Some details are provided at [1]

Chetan Mehrotra
[1] https://jackrabbit.apache.org/oak/docs/architecture/transactional-model.html
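
To make that window concrete, a minimal sketch, assuming session1 and session2 are JCR sessions opened on two different cluster nodes that share the same Mongo:

import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class StaleReadDemo {

    // session1 belongs to cluster node N1, session2 to cluster node N2.
    static void demo(Session session1, Session session2) throws RepositoryException {
        session1.getRootNode().addNode("demo").setProperty("prop", "foo");
        session1.save(); // durable in Mongo right away

        session2.refresh(false); // re-bases session2 on N2's current head revision
        boolean visible = session2.nodeExists("/demo");
        // 'visible' may be false here; it only becomes true once N2's background
        // read has advanced its head revision past the change made on N1.
        System.out.println("visible on N2 immediately after save on N1: " + visible);
    }
}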

Re: Not-sticky sessions with Sling?

lancedolan
Alright, this is a deal breaker for our business (if Sling absolutely requires sticky sessions). I hope you're not offended that I'm not 100% convinced yet. I understand you do development on the Sling project and are well qualified on the topic. To be honest, however, I don't fully understand what you said in your last post, and I also know that AEM 6.1, which is really just Sling+Oak, can do what I'd like. If they can do it, I don't understand why we can't.

ref: https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

I'd hate to throw away all the awesome progress we've made with Sling so far when I know that AEM, which is just sling + jackrabbit, can accomplish app-server-agnostic authentication, and thus avoid sticky sessions.

Although I don't understand this "head revision" that you've described, and that's inexperience on my part, I am confident you're telling me that when there is only one Mongo instance in existence, and all Sling instances get data from it, then directly after "sling-instance-1" writes "myProperty=myValue" to the JCR, "sling-instance-2" could get the value of "myProperty" from somewhere else - some old value. This only seems possible to me if one of the following is true:

A) the Sling instances are caching values from Mongo (perhaps Sling or Oak is doing that?)
B) There are separate versions of that property stored in Mongo (perhaps this is what you meant by the word revision) and it's possible for a sling-instance to be reading an old version of a property from Mongo.
C) Mongo isn't consistent.

We know from the Mongo documentation that C isn't true - Mongo is consistent when reading from the primary of a replica set. So it must be that A or B is going on? And if so, what is your guess about how AEM 6, which is Sling+Oak, avoids this pitfall when it very clearly supports the stateless architecture (i.e. not sticky) that I'm planning?

Re: Not-sticky sessions with Sling?

chetan mehrotra
On Sat, Jan 14, 2017 at 2:08 AM, lancedolan <[hidden email]> wrote:
> To be honest, however, I don't understand fully
> what you said in your last post and I also know that AEM 6.1 can do what I'd
> like, which is really just Sling+Oak. If they can do it, I don't understand
> why we can't.
>
> ref:
> https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

That link talks about scaling of publish instances, which are in most
cases based on a Segment/Tar setup and hence do not form a "homogeneous"
cluster. Each cluster node has a separate segment store and at most
shares the DataStore.

> B) There are separate versions of that property stored in Mongo (perhaps
> this is what you meant by the word revision) and it's possible for a
> sling-instance to be reading an old version of a property from Mongo.

That's a bit closer to what's happening. [1] talks about the data model
used for persistence in Mongo/RDB. For example, if there is a
property 'prop' on the root node, i.e. /@prop, then it is stored in
roughly the following form in Mongo:

{
  "_id" : "0:/",
  "prop" : {
    "r13fcda91720-0-1" : "\"foo\"",
    "r13fcda919eb-0-1" : "\"bar\""
  }
}

The value of this property is a function of the revision at which the
read operation is performed. So 'prop' has value 'foo' at rev r1 and
'bar' at rev r2. These revisions are based on timestamps. Now each
cluster node also has a "head" revision, and any read call on that
cluster node only sees values whose revision is <= the "head" revision.
This head revision is updated periodically via a background read. Due to
this snapshot isolation model you see the write skew [2]

Chetan Mehrotra
[1] https://jackrabbit.apache.org/oak/docs/nodestore/documentmk.html
[2] https://jackrabbit.apache.org/oak/docs/architecture/transactional-model.html

Re: Not-sticky sessions with Sling?

lancedolan
This is really disappointing for us. Through this revisioning, Oak has turned a datastore that is consistent by default into one that is not :p It's ironic that the cluster which involves multiple datastores (Tar), and thus should have a harder time being consistent, is the one that can accomplish consistency... and the cluster that involves a single shared source of truth (Mongo/RDBMS), and should have the easiest time being consistent, is not. Hehe. Ahh, this probably shoots down our entire Sling proof-of-concept project.

Our next step is to measure the consequences of moving forward with Sling+Oak+Mongo and not-sticky sessions. I'm going to try to test this, and get an empirical answer, by deploying to some AWS instances. I'll develop a custom AuthenticationHandler so that authentication is stateless and then we'll try to see how bad the "delay" might be. However, I would love a theoretical answer as well, if you've got one :)

chetan mehrotra wrote
... sticky sessions are required due to the eventually consistent nature of
the repository.
Okay, but if we disable sticky sessions ANYHOW (because in our environment we must), how much time delay are we talking, do you think, in realistic practice? We might be able to solve this by giving user feedback that covers up for the sync delay. When a user clicks save, they might just go to a different screen, providing enough time for things to sync up. It might be a race condition, but that might be acceptable if we can choose that architecture on good information. I think that, in theory, the answer to "worst case scenario" for eventual consistency is always "forever," but really... how long could a Sling instance take to get to the latest revision? More importantly, is it a function of repo size, or repo activity? If the repo grows in size (number of nodes) and grows in use (number of writes/sec), does this impact how frequently Sling cluster instances grab the most recent revision?

Less importantly... my colleagues and I are really curious as to why Jackrabbit is implemented this way. Is there a performance benefit to being eventually consistent, when the shared datastore is actually consistent? What's the reasoning for not always hitting the latest data? Also... is there any way to force all reads to read the most recent revision, perhaps through some configuration? A performance cost for this might be tolerable.

Re: Not-sticky sessions with Sling?

chetan mehrotra
On Tue, Jan 17, 2017 at 1:46 AM, lancedolan <[hidden email]> wrote:
> It's ironic that the cluster which involves multiple datastores (tar), and
> thus should have a harder time being consistent, is the one that can
> accomplish consistency..

That's not how it is. A cluster which involves multiple datastores (Tar)
is also eventually consistent. Changes are either "pushed" to each Tar
instance via some replication, or changes done on one of the cluster
nodes surface on the others via reverse replication. In either case the
change is not immediately visible on the other cluster nodes.

> More importantly, is it a function of Repo size, or repo activity?
> If the repo grows in size (number of nodes) and grows in use (number of
> writes/sec) does this impact how frequently Sling Cluster instances grab the
> most recent revision?

It's somewhat related to the number of writes and is not dependent on repo size.

> Less importantly... Myself and colleagues are really curious as to why
> jackrabbit is implemented this way. Is there a performance benefit to being
> eventually, when the shared datastore is actually consistent? What's the
> reasoning for not always hitting the latest data?  Also... Is there any way
> to force all reads to read the most recent revision, perhaps through some
> configuration?

That's a question best suited for discussion on the oak-dev mailing
list ([hidden email])

Chetan Mehrotra

Re: Not-sticky sessions with Sling?

Bertrand Delacretaz
In reply to this post by lancedolan
Hi,

On Mon, Jan 16, 2017 at 9:16 PM, lancedolan <[hidden email]> wrote:
> ...this probably shoots down our entire Sling
> proof of concept project...

That would be a pity, as I suppose you're starting to like Sling now ;-)

> ...Is there any way
> to force all reads to read the most recent revision, perhaps through some
> configuration?...

As Chetan says, that's a question for the Oak dev list, but from a Sling
point of view having that option would be useful IMO.

If the clustered Sling instances can get consensus on what the most
recent revision is (*), having the option for Oak to block until it
sees that revision sounds useful in some cases. That should probably
happen either on opening a JCR Session or when Session.refresh() is
called.

-Bertrand

(*) which might require an additional consensus mechanism, maybe via
Mongo if that's what you're using?

Re: Not-sticky sessions with Sling?

lancedolan
In reply to this post by chetan mehrotra
Ok, first of all - I GENUINELY appreciate the heck out of your time and patience!!

... and THIS is really interesting:

If THIS is true:

chetan mehrotra wrote
If you are running a cluster with Sling on Oak/Mongo then sticky
sessions are required due to the eventually consistent nature of the
repository.
and THIS is true:

chetan mehrotra wrote
A cluster which involves multiple datastores (Tar)
is also eventually consistent.
Then why is Adobe recommending that its multi-million-dollar projects go stateless with the encapsulated token here, if those architectures are *also* eventually consistent:
https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

If "being eventual" is the reason we can't go stateless, then how is adobe getting away with it if we know their architecture is also eventual?? What am I missing? I understand that the documentation I linked is a distributed segment store architecture and mine is a share documentstore datastore, but what is the REASON for them allowing a stateless (not sticky) architecture, if the REASON is not eventual consistency ? Both architectures are eventual.

Again, thanks for your patience and sticking with me on this one... whoa pun!

Re: Not-sticky sessions with Sling?

Jörg Hoh
Hi Lance,

2017-01-17 19:19 GMT+01:00 lancedolan <[hidden email]>:

> ...
>
> If "being eventual" is the reason we can't go stateless, then how is adobe
> getting away with it if we know their architecture is also eventual?? What
> am I missing? I understand that the documentation I linked is a distributed
> segment store architecture and mine is a share documentstore datastore, but
> what is the REASON for them allowing a stateless (not sticky) architecture,
> if the REASON is not eventual consistency ? Both architectures are
> eventual.
>
>
It depends a lot on your use case. For example, Facebook is also eventually
consistent (I sometimes think that the timeline is different on every
reload). Also, the CAP theorem says that you can choose only 2 of
"consistency, atomicity and partition-tolerance".

In the case of independent segment stores (in Adobe speak: publish
instances, stateless load balancing) you have a lot of individual requests
from multiple users. So you as an individual cannot decide whether another
user gets the very same content as you. And as long as this eventual
consistency is not causing annoyances and friction on the end-user side
(e.g. you hit an intra-site link which returns a 404), I would not consider
it a problem. And these problems occur so rarely that many (including me and
many other users of AEM) ignore it for daily work. But this is only valid
for a read-only use case!

The situation is different on the clustered DocumentNodeStore (in Adobe
speak: authoring, sticky connections). Due to write skew, write operations
will be visible with a small delay on all cluster nodes. But there it
matters that a user sees the changes he just made. And to overcome this
limitation with the write skew, the recommendation is to use
sticky sessions.



Jörg


--
Cheers,
Jörg Hoh,

http://cqdump.wordpress.com
Twitter: @joerghoh

Re: Not-sticky sessions with Sling?

Jörg Hoh
My bad:
CAP = consistency, availability and partition-tolerance.

Jörg


Re: Not-sticky sessions with Sling?

lancedolan
Bertrand Delacretaz wrote
That would be a pity, as I suppose you're starting to like Sling now ;-)
Mannnn you have no idea haha! I've got almost every dev in the office all excited about this now haha. However, it seems our hands are tied.

I wrote local consistency test scripts which POST and immediately GET a property, checking for consistency.
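
Roughly, each probe does the equivalent of the following (a minimal sketch, not the actual scripts; the two hosts, the /content/consistency-test path and the admin credentials are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ConsistencyProbe {

    static final String WRITE_HOST = "http://localhost:8080"; // Sling instance 1
    static final String READ_HOST  = "http://localhost:8081"; // Sling instance 2
    static final String AUTH = "Basic "
            + Base64.getEncoder().encodeToString("admin:admin".getBytes());

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String value = "v-" + System.currentTimeMillis();

        // Write a property via the Sling POST servlet on instance 1
        client.send(HttpRequest.newBuilder(URI.create(WRITE_HOST + "/content/consistency-test"))
                .header("Authorization", AUTH)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("prop=" + value))
                .build(), HttpResponse.BodyHandlers.discarding());

        Thread.sleep(50); // the delay under test: 50ms, 1s, 2s, 3s ...

        // Read it back as JSON from instance 2 and check whether the new value is visible
        String body = client.send(HttpRequest.newBuilder(URI.create(READ_HOST + "/content/consistency-test.json"))
                .header("Authorization", AUTH)
                .GET()
                .build(), HttpResponse.BodyHandlers.ofString()).body();

        System.out.println(body.contains(value) ? "consistent" : "stale");
    }
}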

Results on a 2-member Sling cluster and localhost mongodb:

-0% consistency with 50ms delay between POST and GET
-35% to 50% consistency with 1 second delay between POST and GET
-90% consistency with 2 second delay
-98% to 100% consistency after 3 seconds delay.

So yes, you are all correct.

True, we could use sticky sessions to avoid inconsistency... but only until we scale our server farm up or down, which we do daily... so sticky sessions don't really solve anything for us.

If you already understand how scaling nullifies the benefit of sticky sessions, you can skip past this paragraph and move on to the next:
Each time we scale, users will lose their "stickiness." We have thousands of write users ("authors"), hundreds concurrent. Compare that to typical AEM projects, which have fewer than 10 authors and rarely more than 1 concurrently (I've got several global-scale AEM implementations under my belt). For us, it's a requirement that we add or remove app servers multiple times per day, optimizing between AWS costs and performance. Each time we remove an instance, those users will go to a new Sling instance and experience the inconsistency. Each time we add an instance, we will invalidate all stickiness and users will get re-assigned to a new Sling instance, and experience the inconsistency. If we don't do this invalidation and re-assignment on scaling up, it can potentially take hours for a scale-up to positively impact an overloaded cluster where all users are permanently stuck to their current app server instance.

As you can see, we need to deal with the inconsistency problem, regardless of whether we use sticky sessions.

I have some ideas, but none are appealing, and I would benefit greatly from your knowledge:

1) Race condition
If this delay to "catch up" to the latest revision is mostly predictable (it doesn't grow as the repo grows in size and doesn't change due to other variables), we can measure it and then account for it reliably with user feedback (a loading screen or whatever). This *might* be a race condition we can live with.

My results above show as much as 3 or 4 seconds to "catch up." I must know what determines the duration of this revision catch-up time. Is it a function of repo size? Does the delay grow as the repo size grows? Does the delay grow as usage increases? Does the delay grow as the number of Sling instances in the cluster grows? Does the delay grow as network latency grows (I'm testing all on the same machine, with practically no latency compared to a distributed production deployment)? Is there any Sling dev, familiar with the algorithm that Sling uses to select a "newer" revision, who could answer this for me? ... perhaps it's just polling on a predictable time period! :)

2) Browser knows what revision it's on.
The browser could know what JCR revision it's on, learning that revision after every POST or PUT, perhaps in some response header. When its future requests are sent to a Sling instance on an older revision, it could wait until that instance "catches up." This sounds like a horrible example of client code operating on knowledge of underlying implementation details, and we're not at all excited about the chaos of implementing it. That being said, can we programmatically check the revision that the current Sling instance is reading from?

3) "Pause" during scale-up or scale-down.
Each time we add or remove a Sling instance, all users experience a "pause" screen while their new Sling instance "catches up." This is essentially the same as the race condition in #1, except we'd constrain users to only experience it when we scale up or down. However, we are *extremely* unhappy to impact our users just because we're scaling up or down, especially when we must do so frequently.

Anybody have any other ideas?

Other questions:

1) When a brand new Sling instance discovers an existing JCR (Mongo), does it automatically and immediately go to the latest head revision? Or is there some progression through the revisions, and it takes time for the Sling instance to catch up to the latest?

2) Is there any reason, BESIDES JCR CONSISTENCY, why a Sling cluster must be deployed with sticky-sessions? What other problems would we introduce by not having sticky sessions?

I seem to have used this email to track my own thoughts more than anything; my sincere thanks if you've taken the time to read the whole thing.

RE: Not-sticky sessions with Sling?

Stefan Seifert
not sure if this is of any help for your use case - but do you need the full JCR feature set and complexity underneath sling, or only a sling cluster + storage in mongodb?

if you need only basic resource read and write features via the Sling API you might bypass JCR completely and directly use a NoSQL resource provider for MongoDB, see [1] and [2].

but please be aware that:
1. the code might not be production-ready for heavy usage yet (not sure how much it is used)
2. it does not add any support for cluster synchronization etc. - if multiple of your nodes write to the same path you have to take care of concurrency yourself
3. the code is not yet migrated to the latest ResourceProvider SPI from sling 9-SNAPSHOT, but should still run with it
4. it has no built-in support for ACLs etc. - you have to take care of this yourself

this resource provider is only a thin layer above the MongoDB java client, so it should be possible to have full control over which mongodb features are used in which way.

stefan

[1] http://sling.apache.org/documentation/bundles/nosql-resource-providers.html
[2] https://github.com/apache/sling/tree/trunk/contrib/nosql



Re: Not-sticky sessions with Sling?

lancedolan
In reply to this post by lancedolan
lancedolan wrote
I must know what determines the duration of this revision catch-up time ...
While I don't know where to look in the source code to answer this, I did run a very revealing experiment.

It pretty much always takes exactly 1 second for a Sling instance to get the latest revision, and thus the latest data. When not 1 second, it takes exactly 2 seconds. If you increase load on the server, the likelihood of taking 2 seconds increases, and you also begin to see it take exactly 3 seconds in some rare cases. Increasing load increases the number of seconds before a "sync," but it's always a near-exact one-second interval.

It seems impossible for this to be a natural coincidence - I smell a setting somewhere (or perhaps a hardcoded value) which is telling Sling to check for the latest JCR revision on 1-second intervals. When that window can't be hit, it checks on the next second interval, and so on.

Is there a Sling dev who can tell me whether this is configurable? I have a load of questions about this discovery:

- Am I wrong? (I'll be shocked)
- Perhaps we can speed it up?
- What event is causing it to "miss the window" and wait until the next 1 second synch interval?
- If we do decrease the interval, will that just increase the likelihood of taking more intervals anyhow?
- Is there a maximum number of 1 second intervals before the thing just gets the latest??

progress.

RE: Not-sticky sessions with Sling?

lancedolan
In reply to this post by Stefan Seifert
Thissss is tempting, but I know in my dev instinct that we won't have the time to solve all the unsolved problems in that effort. Thank you for suggesting it though :)

Re: Not-sticky sessions with Sling?

Felix Meschberger-3
In reply to this post by lancedolan
Hi Lance

Ok, so things being as they are (an eventually consistent repo replicating the Oak login token, and no ability to use sticky sessions), I suggest you go with something else which does *not* need the repository for persistence.

This means you might want to investigate your own authentication handler or look at other options here at Sling - for example the old form-based login (not sure what its state is, though), or good ol' HTTP Basic (at some other cost, like no support for "logout").

Regards
Felix


Re: Not-sticky sessions with Sling?

Bertrand Delacretaz
In reply to this post by lancedolan
Hi Lance,

On Wed, Jan 18, 2017 at 2:43 AM, lancedolan <[hidden email]> wrote:
> ...It pretty much always takes 1 second exactly for a Sling instance to get the
> latest revision, and thus the latest data. When not 1 second, it takes 2
> seconds exactly....

I don't know enough about Oak internals to give you a precise answer
here, but this 1 second increment vaguely rings a bell, based on
discussions with Chetan when working on our adaptTo demo [1].

Chetan is one of the few Sling committers who's deep into Oak as well;
hopefully he can comment on this, but otherwise it would be best to ask
on the Oak dev list about that specific issue, as I think this delay is
entirely Oak-dependent.

Apart from that, handling such things at the client level could be
valid - as you say, if you had a way to send the current revision
number to the client (probably in an opaque way), it could add a header
to its next request saying that it wants to see that revision, and
Sling/Oak could block that request until that revision is available. I
suppose a one or two second delay that happens only rarely is
acceptable if it makes your system easier to scale, and hopefully that
1-second cycle can be configured to be shorter. I'm willing to help
make this functionality available if you don't find a better way, as I
think it can be generally useful.

-Bertrand

[1] https://github.com/bdelacretaz/sling-adaptto-2016

Re: Not-sticky sessions with Sling?

chetan mehrotra
> Each time we remove an
> instance, those users will go to a new Sling instance, and experience the
> inconsistency. Each time we add an instance, we will invalidate all
> stickiness and users will get re-assigned to a new Sling instance, and
> experience the inconsistency.

I can understand the issue around an existing Sling server being removed
from the pool. However, adding a new instance should not cause existing
users to be reassigned.

Now to your queries
---------------------------

> 1) When a brand new Sling instance discovers an existing JCR (Mongo), does it automatically and immediately go to the latest head revision?

It sees the latest head revision

>  Increasing load increases the number of seconds before a "sync," however it's always near-exactly a second interval.

Yes, there is an "asyncDelay" setting in DocumentNodeStore which
defaults to 1 sec. Currently it's not possible to modify it via OSGi
config though.
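
For anyone embedding Oak directly (this does not help a stock Sling launchpad, where DocumentNodeStoreService wires up the store), the builder exposes it; a rough sketch using Oak 1.x era method names, which may differ between Oak versions:

import javax.jcr.Repository;

import com.mongodb.DB;
import com.mongodb.MongoClient;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;

public class EmbeddedOakExample {

    public static void main(String[] args) {
        DB db = new MongoClient("localhost", 27017).getDB("oak");

        // asyncDelay is the interval (ms) of the background read that picks up
        // changes from other cluster nodes; it defaults to 1000.
        DocumentNodeStore store = new DocumentMK.Builder()
                .setMongoDB(db)
                .setAsyncDelay(250) // experiment value; shorter means more polling of Mongo
                .getNodeStore();

        Repository repository = new Jcr(new Oak(store)).createRepository();
        // ... use the repository, then:
        store.dispose();
    }
}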

>- What event is causing it to "miss the window" and wait until the next 1 second synch interval?

this periodic read also involves some other work, like local cache
invalidation and computing the external changes for observation, which
causes this time to increase. The more changes are made, the more time
is spent on that kind of work.

Stickyness and Eventual Consistency
-------------------------------------------------

There are multiple levels of eventual consistency [1]. If we go for
sticky sessions then we are aiming for "session consistency". However,
what we require in most cases is read-your-writes consistency.

We can discuss ways to do that efficiently with the current Oak
architecture - something like this is best discussed on oak-dev though.
One possible approach could be to use a temporarily issued sticky
cookie. Under this model:

1. The Sling cluster maintains a cluster-wide service which records the
current head revision of each cluster node and computes the minimum
revision across them.

2. A Sling client (web browser) is free to connect to any server
until it performs a state-changing operation like POST or PUT.

3. If it performs a state-changing operation, then the server which
performs that operation issues a cookie which is set to be sticky, i.e.
the load balancer is configured to use that cookie to determine
stickiness. So from now on all requests from this browser go to the
same server. This cookie, let's say, records the current head revision.

4. In addition, the Sling server constantly gets notified of the
minimum revision which is visible cluster-wide. Once the revision
recorded in #3 becomes older than that minimum, it removes the cookie
on the next response sent to that browser.

This state can also be used to determine whether a server is safe to
be taken out of the cluster or not.

This is just a rough thought experiment which may or may not work and
would require broader discussion!
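
As a very rough illustration of steps 3 and 4, a filter along these lines could sit in front of Sling; the ClusterRevisionTracker is a hypothetical placeholder for the cluster-wide service from step 1, not an existing Sling or Oak API:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class StickyAfterWriteFilter implements Filter {

    /** Hypothetical cluster-wide service from step 1. */
    interface ClusterRevisionTracker {
        String currentHeadRevision();
        boolean isVisibleClusterWide(String revision);
    }

    private final ClusterRevisionTracker tracker;

    public StickyAfterWriteFilter(ClusterRevisionTracker tracker) {
        this.tracker = tracker;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String method = request.getMethod();
        if ("POST".equals(method) || "PUT".equals(method) || "DELETE".equals(method)) {
            // Step 3: pin the client to this node. The cookie is set before the request
            // is handled so the response is not yet committed; a real implementation
            // would record the post-write head revision rather than the current one.
            Cookie sticky = new Cookie("SLING_STICKY", tracker.currentHeadRevision());
            sticky.setPath("/");
            response.addCookie(sticky);
        } else {
            Cookie[] cookies = request.getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    // Step 4: once the recorded revision is visible cluster-wide,
                    // expire the cookie so the load balancer may re-balance freely.
                    if ("SLING_STICKY".equals(c.getName())
                            && tracker.isVisibleClusterWide(c.getValue())) {
                        Cookie expired = new Cookie("SLING_STICKY", "");
                        expired.setPath("/");
                        expired.setMaxAge(0);
                        response.addCookie(expired);
                    }
                }
            }
        }

        chain.doFilter(req, res);
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void destroy() {
    }
}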


Chetan Mehrotra
[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

Re: Not-sticky sessions with Sling?

Bertrand Delacretaz
On Wed, Jan 18, 2017 at 12:48 PM, Chetan Mehrotra
<[hidden email]> wrote:
> ...there is a "asyncDelay" setting in DocumentNodeStore which
> defaults to 1 sec. Currently its not possible to modify it via OSGi
> config though....

But Lance could patch [1] to experiment with different values, right?
And then replace the oak-core bundle in Sling, starting from the right
version for patching, i.e. the one his Sling instance currently uses.

-Bertrand

[1] http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/DocumentNodeStore.java