RCS-e clients have problems with redundancy across multiple IMS sites

Sometime in spring I pushed hard to check how one of the RCS-e clients behaves in terms of redundancy across multiple IMS sites. The reason was clear: there are a lot of deployments where the operator has two (or even more) IMS sites, i.e. two or more P-CSCFs in the network, and nobody wants the failure of one site to affect the whole service… If you have such a network, you certainly also want a mechanism to control load-sharing among the sites. And you certainly also want a mechanism to redirect users to the other site if one of them is down.

For those who don't know it, there is a very elegant way to do load-sharing and traffic redirection in the CS / PS core using MSS / SGSN pooling: the clever NRI concept using Multipoint A / Multipoint Iu-CS / Multipoint Iu-PS (in RAN terminology generally called NNSF). Just briefly: the NRI is the part of the TMSI that the serving MSS assigns to the MS, and it identifies an individual MSS/SGSN out of the pool serving the user. The important fact is that the NRI keeps the UE bound to a particular MSS until the UE is purged from the VLR due to inactivity. With the NRI you can empty a VLR/MSS of users in a few hours and start the upgrade to SRVCC ;-)
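To make the NRI idea a bit more concrete, here is a minimal sketch of how a node could read the NRI out of a TMSI. It assumes the NRI starts at bit 23 of the 32-bit TMSI and is between 0 and 10 bits long (which is how I read 3GPP TS 23.236); the function name and the example values are made up for illustration.

```python
def extract_nri(tmsi: int, nri_length: int) -> int:
    """Return the NRI carried in a 32-bit TMSI.

    Assumption: the NRI occupies bits 23 down to (24 - nri_length),
    and nri_length is configured by the operator (0..10 bits).
    """
    if not 0 <= tmsi <= 0xFFFFFFFF:
        raise ValueError("TMSI must fit in 32 bits")
    if not 0 <= nri_length <= 10:
        raise ValueError("NRI length must be between 0 and 10 bits")
    if nri_length == 0:
        return 0  # no NRI configured, i.e. no pooling
    shift = 24 - nri_length            # lowest bit position of the NRI
    mask = (1 << nri_length) - 1       # nri_length ones
    return (tmsi >> shift) & mask


# Hypothetical example: a 6-bit NRI with value 45 placed in bits 23..18.
example_tmsi = (45 << 18) | 0x0001F
example_nri = extract_nri(example_tmsi, nri_length=6)
```

The RAN node only needs this value to pick the right MSS/SGSN from the pool, which is what keeps the UE on the same node.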

Do we have a similar mechanism in the IMS? The answer is no… The only document that spends some time on this problem is the GSMA RCS IOT RCS-e Implementation Guidelines v.3.2.

The doc itself is not bad: it provides basic info on how RCS-e clients shall resolve the P-CSCF FQDN via NAPTR, SRV and A RRs.
It suggests managing load-balancing via multiple SRV records using priority and weight. This is OK and the RCS-e client is compliant here, but focusing on load-balancing only is not enough for live networks, because you also need to think in terms of redundancy…
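For illustration, SRV-based load-sharing can be sketched roughly as below: targets are ordered by priority (lowest first), and within the same priority the order is drawn at random in proportion to weight, along the lines of RFC 2782. This is only a sketch, not the code of any real client; the host names, priorities and weights are invented.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records: List[SrvRecord],
              rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Order SRV records: lowest priority first, weighted-random within a priority."""
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    for prio in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == prio]
        group.sort(key=lambda r: r.weight != 0)  # weight-0 records go first
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)
            running = 0
            for i, rec in enumerate(group):
                running += rec.weight
                if running >= pick:
                    ordered.append(group.pop(i))
                    break
    return ordered


# Hypothetical SRV answer: two active sites sharing load 70/30, plus a backup.
records = [
    SrvRecord(10, 70, 5060, "pcscf1.ims.example.com"),
    SrvRecord(10, 30, 5060, "pcscf2.ims.example.com"),
    SrvRecord(20, 0, 5060, "pcscf-backup.ims.example.com"),
]
attempt_order = order_srv(records)
```

The point to notice is that the result is an ordered list of all targets, not a single answer, so a client has everything it needs to try the next site when the first one fails.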

The first problem appeared when one site was not responding (not down, just e.g. overloaded) and you would expect traffic to be automatically redirected to the other one. That doesn't happen: after the TCP timeout you only get 503 Service Unavailable, with no automatic retry, neither to the same site nor to the other one. A manually invoked initial registration points again to the same non-working P-CSCF because of the SRV answer…

The second problem appeared when the site where the UE was registered was shut down (e.g. a power outage). When P-CSCF1 is down, the client still sends the INVITE to that site (still OK). But the problem is that there is no way to get traffic to P-CSCF2: a failed INVITE dialog does not trigger re-registration to the other site. Again because of the SRV answer…
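The behaviour one would expect in both cases is roughly the failover described in RFC 3263: on a transport timeout or a 503, move on to the next target from the SRV answer. The sketch below shows just that loop. It is not how the tested client works (that is exactly the complaint), and `send_register`, `TransportTimeout` and the host names are hypothetical placeholders for a real SIP stack.

```python
from typing import Callable, Iterable, Optional, Tuple


class TransportTimeout(Exception):
    """Hypothetical error raised when a P-CSCF does not answer in time."""


def register_with_failover(
    targets: Iterable[str],
    send_register: Callable[[str], int],
) -> Optional[Tuple[str, int]]:
    """Try each P-CSCF in order; skip it on timeout or 503.

    Returns (target, status) for the first target that gave a usable
    answer, or None if every target failed.
    """
    for target in targets:
        try:
            status = send_register(target)
        except TransportTimeout:
            continue  # site is silent: try the next one
        if status == 503:
            continue  # site is overloaded: try the next one
        return target, status
    return None


def fake_send(target: str) -> int:
    """Stand-in for a SIP stack: site 1 is dead, site 2 answers 200."""
    if target == "pcscf1.ims.example.com":
        raise TransportTimeout(target)
    return 200


result = register_with_failover(
    ["pcscf1.ims.example.com", "pcscf2.ims.example.com"], fake_send
)
```

The same loop could be triggered by a failed INVITE, so that the UE re-registers via the other site instead of staying stuck on the dead one.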

The third problem is that you can't rely on the TTL being processed correctly by the RCS-e clients. The public DNS is queried after TTL expiry, but only during re-registration. Therefore, if you decide to change the IP in the DNS and your TTL is shorter than the expiry timer in the 200 OK, the change doesn't take effect the way you would expect…
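A small worked example of what this means in practice. Assuming the client re-registers once per registration period and re-queries the DNS only at a re-registration that falls after TTL expiry (real clients often re-register somewhat earlier than the full expiry, so treat the numbers as illustrative), the effective DNS refresh interval is the TTL rounded up to a whole number of re-registration periods.

```python
def effective_dns_refresh(ttl_s: int, rereg_interval_s: int) -> int:
    """Seconds between DNS queries if DNS is only re-checked at re-registration.

    Assumption: the cached record is refreshed at the first re-registration
    that happens at or after TTL expiry.
    """
    if ttl_s <= 0 or rereg_interval_s <= 0:
        raise ValueError("TTL and re-registration interval must be positive")
    periods = -(-ttl_s // rereg_interval_s)  # integer ceiling division
    return periods * rereg_interval_s


# Illustrative values: a 300 s TTL with a 3600 s registration expiry.
short_ttl_refresh = effective_dns_refresh(ttl_s=300, rereg_interval_s=3600)
```

With these values a five-minute TTL behaves like a one-hour TTL, so lowering the TTL alone does not speed up a DNS-based switch-over.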

The fourth problem is that the UE can “jump” between P-CSCFs after every re-registration. That is not desirable, for obvious reasons.

It is apparent that these important requirements have been completely forgotten by the GSMA guys, so we had to improvise and figure out a workaround. But as it is under NDA, we can't say more…

The good news is that these issues have already been raised with GSMA and fixes are planned for the RCS implementation guidelines v3.4.

More good news: for VoLTE you have no DNS on the access side but P-CSCF discovery instead. Let's talk about that later.

And that is the memo.

Jan, ClearNet Solutions