Monday, February 19, 2018

Service Fabric Application Gateway

Service Fabric applications locate other SF services using the Service Fabric Application Gateway, a reverse proxy provided as part of the SF environment. It's necessary if you plan to dynamically locate other SF services, especially as they move and/or scale in and out.

The SF App Gateway is created if you enable the reverse proxy during SF provisioning. In the portal, here's the UI:

Reverse Proxy Setting During Provisioning

If you forget to do that, the scale set VM cluster will be constructed without the necessary configuration and service listeners, and attempts to access the proxy from a program will yield this vague error message:
System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: 
Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:19081
This is a clue that FabricApplicationGateway.exe is not running on that particular node (VM). If you attempt to manually start it on the VM, it'll immediately crash unless properly configured.

To correct this, you'll have to manually edit a configuration file on each VM in the scale set cluster of your SF installation. Follow these instructions, valid for Service Fabric 6.1.456.9494:
  1. Log in to the VM via Remote Desktop, with the credentials you specified during Service Fabric VM configuration. Use your Service Fabric DNS name as the machine name. The first machine in the cluster will be at port 3389; subsequent VMs are mapped to incremental port numbers, e.g. 3390, 3391...
  2. Change to directory C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code.
  3. Edit file FabricHostSettings.xml. Notepad is fine for this.
  4. On or about line 61, set this value to true:
    <Section Name="HttpGateway">
      <Parameter Name="IsEnabled" Value="true" />
    </Section>
  5. On or about line 170, note the following settings and update them as shown here. The values will probably be blank initially. Port 19081 is the default used by Azure, but can be another value so long as your application knows it.

    <Parameter Name="HttpApplicationGatewayListenAddress" Value="19081" />
    <Parameter Name="HttpApplicationGatewayProtocol" Value="http" />
  6. Save the file and reboot the machine. Rebooting isn't strictly necessary, but it ensures the Service Fabric components fully restart with the new configuration. Note that rebooting may cause a service outage. (A quick verification sketch follows this list.)
  7. Remember to check that Reverse Proxy box during SF provisioning, next time!
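
Once a node is back up, it's worth verifying that the gateway actually came online before testing from your application. A minimal check, run directly on the VM (the process name comes from the FabricApplicationGateway.exe mentioned earlier):

# Confirm something is listening on the reverse proxy port, and that the gateway process is alive.
Test-NetConnection -ComputerName localhost -Port 19081
Get-Process -Name FabricApplicationGateway -ErrorAction SilentlyContinue
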
With the changes above, your application can now call upon the local proxy installed on each SF VM node, as shown below (sample code from Microsoft Voting sample):

Service Fabric Application Proxy Usage
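
If you don't have the sample handy, here's a quick way to exercise the proxy from PowerShell on a cluster node. The application and service names below assume the Voting sample's defaults; the general URI convention is http://localhost:19081/<ApplicationName>/<ServiceName>/<endpoint>:

# Probe a service through the local reverse proxy (names assume the Voting sample's defaults).
Invoke-RestMethod -Uri "http://localhost:19081/Voting/VotingData/api/VoteData" -Method Get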


Service Fabric Temp Disk - Gotcha!

During a recent project with Azure Service Fabric, I encountered errors after provisioning which prevented proper SF operation, and ultimately led to an interesting discovery.

Service Fabric is Really IaaS


Azure Service Fabric is essentially just an intelligently managed IaaS environment. Many parts of an IaaS environment are inherently static; that is, once a particular performance level is provisioned, it remains at that level.

During Service Fabric provisioning, a choice of VM size is offered, at which time it's very important to get the size right (it should be sized for anticipated peak load). Too big and you'll waste money; too small and the environment won't operate properly, or will yield poor performance. Changing VM size later can cause service interruptions and is generally not recommended. The VM size is applied to all nodes (VMs) of a scale set (or cluster) that's created by SF provisioning. The scale set is initially set to a fixed size that can be manually changed, and later be configured for automatic scaling.

Being Frugal


In my personal Azure test environment, I typically provision lower-end VM offerings to save money on small, short projects. This works fine for small apps with no load beyond my own experimentation. I do my project, then tear down the environment; if it's something I intend to keep, I switch it to a free tier.
VM Sizes Offered During Service Fabric Provisioning

When it came time to build a Service Fabric environment, I used the same economical approach. Rather than choose the suggested VM at a cost of $104/VM/month, I chose a smaller one at $44/VM/month. The portal's UI allowed this choice, and during the validation phase it passed successfully. But all was not well, as I was to see.

If You Build It, They Will [not] Come


After waiting about 30 minutes for the new SF environment to be created and brought online, I checked the results via the SF monitoring web page. This page is created automatically when an SF environment is created, and by default is available at http://<your URL>:19080/Explorer/index.html#/. Here's what that looks like with a local 5-node setup.

Service Fabric Monitor - Normal

But, for some reason, my brand-new Azure-based Service Fabric booted with warnings, and when I ran my SF app, the environment quickly went into failure. I wasn't too pleased - isn't Azure technology supposed to be better than that, and self-heal in this case?

Digging down in the SF Monitor, I noticed a single status line that gave insight into the failure - a VM disk was too full, despite having just been provisioned. As a side note, Service Fabric VM disks are now configured as Managed Disks, an improvement over legacy disks that required traditional monitoring and oversight; Microsoft takes care of that now.

Disk Full
A disk-full issue is normally fairly easy to resolve, but the bigger question is how a brand-new SF installation could yield such a condition, especially with Microsoft-managed disks. My app hadn't even started yet. Logging in to one of the VMs showed 100 GB of free space on the C: system drive, and several GB free on the temporary D: drive. The disks weren't full.

In short order, the disk-full warning caused a node failure, and that escalated into a partial cluster outage and lots of red on the screen. The error message "Partition is below target replica or instance count" is a verbose way of saying that a node is down, which implies the node failed due to its environment or startup code.

Node failure Caused by Full Disk


Oh Yah, RTFM


After some non-trivial research I discovered a clue. Microsoft's Service Fabric capacity planning recommendations state a requirement of a 14GB "local" disk. The word "local" isn't well defined; my little VM had a 128GB "local" system disk, and an 8GB "local" temp disk. Nonetheless, I made the inferential jump that my 8GB disk - despite having several GB of available space - was too small.

Too-Small Local Disk in Selected VM SKU
Even though my application wasn't going to touch the D: temp drive, I rightly guessed that Service Fabric had intentions for its use, hence the implied requirement for a larger size. I wish they'd just stated that explicitly.

Fixing It


To check my hunch, I up-sized the VM SKU, as seen below. I didn't care about a service interruption, but since this change takes down the cluster, it explains the critical guidance to pick the right (peak-load) VM SKU during provisioning:

Just-Right Local Disk of New VM SKU
Several minutes after the change completed, I logged into one of my SF cluster's VMs and noted the disk space consumed by the D: temp drive. Notice in the image below that Service Fabric has used about 10GB, even before my app started running. So there's the 14GB justification - and you can guess what happens if D: is only 8GB.


VM Local Storage Free Space
At last, my Service Fabric cluster was fully operational and without error!

Lessons learned

  1. Even if you know how to do something, double-check the documentation for tiny details that could derail your project.
  2. Even if Azure "validates" your configuration, it could still be invalid and non-functional.
  3. In certain parts of the Azure ecosystem, changing a resource configuration can take the environment down for a period. Not all Azure components - even those in pseudo fault-tolerant scenarios - continue operating during a reconfiguration.


Tuesday, February 6, 2018

Installing Service Fabric Locally

This'll be a very brief post with a problem and quick solution for installing Service Fabric on your local development computer. I hit a snag during installation.

To run Service Fabric apps locally, you need to first download and install Service Fabric. Installation requires reading the instructions and running a PowerShell script; it's not quite as easy as running a setup.exe and watching it go.

Unfortunately, if you try installing to a path containing a space, such as C:\Program Files, the provided PowerShell script CreateServiceFabricCluster.ps1 fails:

PowerShell Script Path Error

The solution was to place quotes around the file path string. This had been done for another variable in the code, but was apparently overlooked for this one.

Original code (in CreateServiceFabricCluster.ps1, line 58):
$DCExtractOutput = cmd.exe /c "$DCAutoExtractorPath $DCExtractArguments && exit 0 || exit 1"
New code (note the backtick-escaped double-quote characters around $DCAutoExtractorPath):
$DCExtractOutput = cmd.exe /c "`"$DCAutoExtractorPath`" $DCExtractArguments && exit 0 || exit 1"
After the modification, the path containing spaces is properly quoted and accepted for execution.
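
With the fix in place, invoking the script from a space-laden path works via the usual call-operator pattern. The path and parameters below are illustrative only; check the standalone-cluster documentation for your actual configuration file:

# Invoke the script from a path containing spaces using the call operator (path/parameters illustrative).
& "C:\Program Files\Microsoft Service Fabric Standalone\CreateServiceFabricCluster.ps1" `
    -ClusterConfigFilePath ".\ClusterConfig.Unsecure.DevCluster.json" -AcceptEULA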

Saturday, January 6, 2018

TiP'ing Over (Testing In Production)

Adding a new slot for TiP testing
TiP (Testing in Production) is an Azure feature allowing live, in-production testing of web apps. TiP routes a configurable portion of normal inbound production traffic to an app slot containing a new version of the app. You watch the app's performance and logs to see if all is well. I suppose this could be called Gamma testing, since it occurs after Beta testing but before taking a full production load. This approach to semi-production testing works well so long as data schemas are compatible between the two versions.

Does TiP work? Well, yes: it provides a useful and simple way to check a new software release prior to going fully live with it. That's business value. But there's a hitch at present that messes up your app's configuration, causing 404s whose rectification requires (for now, anyway) deleting and redeploying the app, and enduring a service outage. Oops.
Slot configuration after creation

TiP is available to any app that can have slots - Standard or Premium level App Service, etc. Upon setup, TiP creates a new slot and installs static traffic routing, using weighted-mode traffic distribution. You decide how much traffic goes to the test slot, from 0 to 100%. The traffic routing itself is not directly visible via the portal, but the slot is: it appears as a new Web App, per usual.

Once TiP warms up, a percentage of clients will get a cookie that, when submitted to Azure on a subsequent request, directs their traffic to the test slot. That's a nice approach, until it isn't.

Cookie: [x-ms-routing-name: your-test-slot-name]
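
You can present the routing cookie yourself to see the behavior outside a browser. A minimal PowerShell sketch, with a placeholder URL and the slot name "test" assumed:

# Attach the routing cookie to a web session (URL and slot name are placeholders).
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.Cookies.Add((New-Object System.Net.Cookie("x-ms-routing-name", "test", "/", "myapp.azurewebsites.net")))

# This request gets routed to the test slot...
Invoke-WebRequest -Uri "https://myapp.azurewebsites.net/" -WebSession $session

# ...while a request without the cookie goes to the production slot.
Invoke-WebRequest -Uri "https://myapp.azurewebsites.net/"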

Failure Mode 1 - TiP Cookie Persistence gives 404s


HTTP 404 with x-ms-routing-name cookie
If you've completed TiP testing and decide to tear-down the test environment - a perfectly normal thing to do in a cloud deployment - you're left with an issue. Clients with the slot cookie will continue to present it, but the traffic routing won't know about the slot anymore. An HTTP 404 Not Found error is the result. See the Postman screen shot.

The solution is for the client to drop the x-ms-routing-name cookie. That's no problem if you're testing locally, with full control of clients and cookies, but what about the real production clients, i.e. your customers? Ergo Oops.

After removing the cookie, here's the new result in Postman - HTTP 200 OK. Much better.
HTTP 200 OK after removing cookie

Failure Mode 2 - The Story of a Mucked-up Web App


There's no clear way to comprehensively remove TiP artifacts via the Azure portal, so when I completed testing, I just deleted the slot and the slot's Web App manually. That stopped those billing charges. As mentioned before, the implied static traffic routing leaves no visible artifact (such as a router, load balancer, etc.), so I couldn't take any action on it.
Failure creating new slot

A bit later, I decided to do another TiP session, but could no longer make new TiP slots using the original Web App as configuration source (this is an option during slot creation - see image above). Attempting to do so yielded this error message.

Figuring that there was possibly some "dirty" JSON configuration somewhere, I searched all the config files I could find, but couldn't locate "test" - the slot's name - in any of them.

Looking for a work-around and based on knowing a wee bit about how software works, I tried creating another slot using a different name ("anyone" - see image) and for Configuration Source, I chose Don't clone configuration from an existing web app. This worked, or so I thought. The slot got created and I could see its resources in the portal, just as before.

New slot configuration
I configured the new slot for traffic, fixed my Postman session as explained above, and fired off a new POST. Got a 404; what?! I confirmed that the slot was indeed up by using its direct URL; only the request to the primary URL (routed through the static traffic routing) got the 404. This didn't make any sense - all web apps were up and running, and the cookie was set properly.

Keep in mind that this broken configuration, and the 404, was in live production.

At this point, TiP testing is hosed for the Web App, so I reset test traffic to 100% for the original Web App, and 0% for the broken slot. Then I removed slot resources as done above. This restored normal production operation, except for the cookie issue mentioned above, which likely means that static traffic routing is still trying to function.

As a side note, the 404s seem to come from IIS, based on header information I saw, even when the app uses Kestrel. TiP apparently puts some infrastructure between the client and the web app, and that's likely what generates the cookies too. It's not entirely possible to discern this magic invisible infrastructure, but enough symptoms surface to yield a good guess.

How to fully fix the Web App


Desiring to restore the Web App completely, and rid the configuration of any bad settings, I found no other solution than to delete the app (but leave the Web App Plan). Then I recreated it, reconnected it to source control, and triggered a redeploy. This got everything back in a few minutes for this very small test. Having source-control based deployment is really nice for this case and it nearly automated the restore. Nonetheless, this approach did cause a brief service outage - in production.

I'm hoping that this entire behavior is a bug that'll one day be fixed, but for the meantime you might just want to consider testing the good old-fashioned way - not in production.


Tuesday, December 12, 2017

WAF Helps Keep Your Azure App Available. Mostly.

OWASP 3.0 Rules in Azure Portal
Azure Application Gateway has an optional feature called Web Application Firewall (WAF), which affords protection against numerous types of attacks against your Azure web app. The functionality and features of App Gateway and WAF are well documented online, but recently a colleague discovered a less obvious aspect that's worth sharing.

If the WAF feature is enabled in your App Gateway, a set of filtering rules is applied. WAF rules are visible and configurable in the Azure Portal, as seen in the adjacent image. You select either the OWASP 2.x or 3.x rule set. The list of rules is large and detailed, as shown in the adjacent image for OWASP 3.0.

The Problem


I always advocate working with technology to discover undocumented aspects, product limits, and the like. This is how you build domain expertise. Jorge Cotillo was doing just that, using WAF Prevention mode with OWASP 3.0, and he stumbled on an interesting issue with a customer.

In Prevention mode, WAF will return HTTP 403 to callers if a rule violation is detected. The customer has a legacy app that calls webresource.axd to access resources, and legitimate access attempts ran afoul of those rules. A very important business function was blocked.

The problem was traced by first noticing the broken application functionality, then by looking into the WAF logs to correlate blocked requests to access attempts by the app.

The Solution

Note: Exercise caution before disabling any WAF protection rules, as this can render your web app prone to certain attacks.
OWASP 3.0 Rule 920440
Access to the .AXD was restored by disabling rule 920440, "URL file extension is restricted by policy".

With .AXD access restored, Jorge then discovered that a WCF service was also blocked. After research, the following rules were disabled and the WCF service became accessible (a scripted sketch follows the list). The service was not using transport security.

  • 920300 - Request Missing an Accept Header
  • 920320 - Missing User Agent Header
  • 920420 - Request content type is not allowed by policy
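
For reference, the same rule changes can be scripted rather than clicked through in the portal. Here's a hedged sketch using the AzureRM.Network cmdlets current at the time of writing; the gateway and resource group names are placeholders:

# Fetch the gateway (placeholder names).
$gw = Get-AzureRmApplicationGateway -Name "MyAppGateway" -ResourceGroupName "MyResourceGroup"

# All four problem rules (920300, 920320, 920420, and the earlier 920440) live in
# the CRS 3.0 REQUEST-920-PROTOCOL-ENFORCEMENT rule group.
$disabled = New-AzureRmApplicationGatewayFirewallDisabledRuleGroup `
    -RuleGroupName "REQUEST-920-PROTOCOL-ENFORCEMENT" `
    -Rules 920300, 920320, 920420, 920440

# Re-apply the WAF configuration with the disabled rules, then save the gateway.
Set-AzureRmApplicationGatewayWebApplicationFirewallConfiguration -ApplicationGateway $gw `
    -Enabled $true -FirewallMode "Prevention" -RuleSetType "OWASP" -RuleSetVersion "3.0" `
    -DisabledRuleGroups $disabled
Set-AzureRmApplicationGateway -ApplicationGateway $gw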


Conclusion


Security Effectiveness vs. "Tightness"
Security is essential on the internet, and should be part of any internet-facing web app / API from the get-go. But remember that a primary function of security is to keep the business function running and usable by customers. Balance is required to ensure that goal is met.

This situation is one I've seen before, where implementing a security technology or policy to safeguard a computer-based business function actually prevented its use. Security is not a linear function; it is - as are so many aspects of technology - a bell curve (or an approximation of one). Crank up the "security" too far (say, to 11) and business protection actually goes down. The irony is that the business function is very secure, because it's inaccessible.

Wednesday, December 6, 2017

Certification - The Road to Azure Exams

I've recently completed a long and successful path to an Azure certification, and want to share my experience and anecdotes. Along my journey I found numerous helpful online articles and blogs from individuals, so I'd like to give back to the community to help those headed the same way.

You might also ask why I chose certification at all. Didn't I already know Azure pretty well? Wasn't that enough? I do know numerous Azure technologies and offerings quite well, and could build, deploy, and secure applications and databases. But Azure's a very broad umbrella covering many other technologies, and I felt it was very worthwhile to ensure I'd learned as much as possible and could prove it. Certification was an excellent way to do that, and it also adds a credential to my professional portfolio.


Setting Out


There are a number of exams to consider if you're targeting Azure certification, or any other for that matter. You combine several exams from a defined certification path to earn a particular title, like Azure MCSE. The first step is to see what's available in your chosen path. Second, by reviewing the focus and content of each exam, you get an idea of the level of effort and time it might take to pass each one. It makes sense to start with an exam whose content you know best and with which you are most comfortable.


Two Things


If I boiled down the arduous task of passing a certification test, it'd come to these two things:

  1. Know the technology well. Practice, experiment, watch videos, get familiar, build, tear down. This implies you'll need all the software & tools, and an Azure account if you'll be doing cloud work. I also recommend a VSTS account (Visual Studio Team Services) so you can work on CI/CD configuration and learn how to build & publish code to Azure web sites/services. This isn't strictly required, but it can help, particularly for developers. Spend a few hours every week, or better yet, work with Azure every day if possible.
  2. Buy the official Microsoft-sanctioned practice test, and work with it relentlessly. Beware of off-brand tests that make promises; they may be good, but quality can vary widely, and there's only one official practice test vendor at the time of this writing.


Know The Topic


Nothing beats actual experience.

Sounds obvious, and it is. Learn the basics, then advance into harder scenarios. I particularly enjoyed making a web site or REST API, doing a code check-in, and watching it build and deploy automatically to Azure in just a couple of minutes, without having to build or configure infrastructure. That's the promise of cloud computing, and it works really, really well with Microsoft tools and technologies. That's something that would've been impossible just a few years ago.

Another reason to do personal projects is to go beyond the marketing-speak, and find out what doesn't work well, or where the limits lie. Trade-show presentations (and the like) get the audience excited about demoware, but rarely tell you where the technology falls down; that'd be silly given that the demo is there to promote the technology and get people to use it. That's their agenda; yours is to find how applicable and capable it is, and where the boundaries lie. There might be some "gotchas" that create dead-ends for an architecture or intended business direction. You can't find all of these, or even a majority, but you'll find enough to start building your knowledge base and to realize that hey, this thing has aspects that can interfere with your intentions.


About The Practice Test


Oh boy. This is where things got "interesting", and by that I mean problematic.

I bought a test package that included temporary online access to the practice test, and 2 exam tries. The practice test is supposed to be very similar to the actual test, but after a little experience I was hoping that claim was untrue.

Even though I was quite familiar with Azure's features, my first practice test didn't go so well. Sure, there were a few detail knowledge gaps - an opportunity to shore up my knowledge in a particular area - but I found it very challenging to solve the impedance mismatch.

Impedance mismatch is a term from electrical engineering referring to two analog electric circuits whose signals / energy do not transfer well between them. In extreme cases it can cause damage. Impedance must be matched between the circuits for them to operate properly and safely, and achieve a goal. When used metaphorically, the term refers to how well one party can interact / communicate with another.

Some of the questions were confusing / unclear / ambiguous as to the desired answer. Upon test completion and review of the "correct" answers, I was bewildered. It was only then that I could reverse-infer what the question wanted. Let me give an [absurd] example that I've made up:

Question:
You take a train from New York to Chicago. The train averages 50mph, and the distance is 800 miles. Upon arriving, what do you do?
Choose one answer:
  1. Go to a restaurant
  2. Call your mother
  3. Check in to your hotel
The correct answer is (1), because after 800 miles you're famished, and you can call your mother from the restaurant. Then, after satisfying those needs, you can check in at your hotel.

Yes, I said it was an absurd example. But that's how some of the questions felt - no way to anticipate the answer they wanted, nor infer something reasonable. The components of the question - transport mode, distance, average speed - were mostly irrelevant in choosing an answer.

It's hard to "win" when that's the game.

After many hours of practice Q & A, and analyzing each provided answer, I began to "tune" the impedance mismatch between my logical, educated brain and the practice test. I learned to make alternate inferences when required information was missing in a scenario question.

When I got to the real certification test, this sort of problem did not materialize. Some inferential thinking was still required, but it was more modest and never absurd nor even a stretch. And that's why I'm glad that the practice test wasn't like the real one.

Thoughts from the Practice and Real Exams


In short form, here are a few notable aspects of the practice and real exams:

  • A few practice answers were just wrong, as I was able to confirm.
  • A few practice answers were essentially impossible to get correct.
  • The practice test induces self-doubt as to whether you'll ever be able to pass the real test. You can.
  • Select a modest number of practice test questions for each session. The default was 50 which took hours to complete. I preferred 10-20 to fit my schedule better.
  • Some practice & real exam questions require several correct sub-answers, sometimes in a specific sequence. So to get credit for 1 correct answer, I actually had to make 4-7 correct sub-answers. The worst case was 13 correct sub-answers within a practice test, with no partial credit given. Fortunately the real exam offered partial credit for correct sub-answers, even if the overall question was not answered correctly.
  • Some VM SKU questions required detailed knowledge of what was in each size, such as slots. The real world is open-book, with internet search engines providing just-in-time knowledge, so memorizing such arcane details is not practical in my opinion, but you still have to have some idea of them for the exam.
  • The real exam's case studies were much longer and more detailed than the practice test's. This is why people recommend reading the questions first, then scanning the case studies for necessary information. It's essential to manage your time this way, or the clock may run out on you (the real exam is time-limited).
  • To work through all the practice exam questions without them repeating, I had to make specific choices in the online web page's configuration such as number of questions, whether to allow prior correct/incorrect questions, etc. Without doing this, I found myself simply selecting correct answers from memory rather than learning and analyzing new topics.
  • I took the proctored real exam from home. This was convenient and easy to schedule, but be aware that the requirements are very stringent - you can't take a sip of water, the whole room must be close to "sterile", and nobody had better walk in on you. And you can't talk (or mumble?).
  • Some real exam questions had a little ambiguity, and the practice test helped me to prepare for that possibility and know how to deal with it.
  • Unlike the practice exam, the real exam doesn't show you correct answers at the end. You just get a pass/fail, and your numeric score (700 is passing, but don't assume that 1000 is a perfect score).

Certification Isn't Enough


The software industry requires constant learning, and this is especially true of online technologies. Passing a certification test is great, but Azure features change every week and may take a year or more to appear on a test; likewise, features deprecated or removed can remain on the test for the same period. To be an effective architect / developer / administrator of Azure, you need far more knowledge than can be garnered through the test, so keep your eye on the industry. Microsoft has blogs, email push, web sites, and other resources to keep you abreast of larger changes and feature introductions. But even they don't capture everything - I've seen portal page changes that were entirely undocumented.

Nothing beats actual experience.

Wednesday, August 30, 2017

Azure Naming Conventions – Think First, or You’ll Be Stuck


In the Azure cloud environment, many computing resources and asset types can be created to build-up an operational environment to support business needs. These include computers, networks, databases, queues, IP addresses, etc. All Azure resources have one thing in common – they must be given a name when being created. It’s a very good idea to develop a naming convention for resources prior to creating them.

Why Naming Conventions?


For anyone who’s worked with computers, and especially for those who write software, naming conventions are a well-known topic and best practice. A good naming convention – using abbreviations or upper/lower case for example – can improve maintainability, reduce human errors, and lower training complexity. The reason is rather simple, actually – the name assists in identifying the intended use of something, and possibly a categorization. For example, a SQL Server stored procedure might be called sp_Calculate_Salestax, where “sp” indicates that the object is a stored procedure (useful when seeing a long list of items; it’ll also cause like items to sort together in a display), and it performs a sales tax calculation. Without ever seeing the source code or having to reverse-analyze it, one can discern what the object is, and what it does. The underscores and initial capitalization make the name easier to read quickly.

A significant factor in Azure resource naming is that once a name is chosen at creation-time, it cannot be changed. The closest approximation of changing a name is to create an identical new resource with the preferred name, then delete the old one. That’s fine if you’ve just created the resource, but if you’ve got 100 of them and they’re interlinked, it’s essentially impossible. You’re stuck with the “wrong” name indefinitely. That’s why adopting a convention before creating resources is a really, really good idea.

Azure Silly Naming


Azure’s a bit funny when it comes to names, however. Some resource names can include punctuation, some can include mixed-case, and others cannot. There doesn’t seem to be any rhyme or reason – it’s really up to the specific resource being created. Therefore, it’s impossible to create one naming convention and use it consistently throughout Azure. Name length limits are also variable, dependent upon resource type, from 24 to 1024 characters. Microsoft’s documentation shows 23 different naming rules for the various Azure resources (see link, below), and you can expect this list to grow as new services are added.

One factor that influences allowable name syntax is whether the resource is internet-facing with a DNS name, or not. The internet’s naming rules have been around a long time and are very set, so it’s reasonable that internet-facing resources must comply with DNS naming requirements. Azure takes this a little further, disallowing some valid DNS syntaxes in certain contexts.

The following image shows a valid internet DNS name (jimmy-azure.core.windows.net) being rejected by Azure for a storage account.
DNS name rejection by Azure

But here’s a similar name, accepted for a Function App (jimmy-azure.azurewebsites.net).
DNS name acceptance by Azure

Automatic Names


There’s also a resource naming oddity worth mentioning in particular. When creating a Function App and specifying the Consumption Plan, Azure will make a special version of an App Plan on your behalf (with a unique icon) and set the name based upon the deployment region. In the East US datacenter, Consumption App Plans will all be named EastUSPlan, a duplicate name which you cannot change. You won’t be able to distinguish which plan goes with which Function App. Internally, the identifier for each App Plan is unique, so Azure won’t confuse them, but your administrator won’t easily be able to tell which is which, or its specific association to a Function App.

Here are 2 Consumption App Plans with duplicate names, created and named automatically during provisioning of a Function App.
Duplicate App Service Plan names


In another example, provisioning an Azure AD Domain Service also caused the creation of a virtual network with associated NICs, address, and load balancer. Note the last 4 items' names - something Azure itself assigned, and which cannot be changed.

Azure-assigned resource names

Tags


Resources can also have multiple tags associated with them. A tag is just a free-form short textual name/value pair. Tags let you locate resources according to your tagging strategy, regardless of resource type or name. While not exactly a “name”, tags are an alternative way to label and manage Azure resources, using whatever tag-naming convention you wish, applicable across all Azure resources.
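
Tags can be applied and queried from script as well as the portal. A small sketch using the AzureRM PowerShell module of this era (the resource ID variable is a placeholder; the tag matches the example below):

# Apply a tag to an existing resource ($resourceId is a placeholder).
Set-AzureRmResource -ResourceId $resourceId -Tag @{ applicationType = "web" } -Force

# Find every resource carrying that tag, regardless of type or name.
Find-AzureRmResource -TagName "applicationType" -TagValue "web" | Select-Object Name, ResourceType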

Here, a tag “applicationType : web” is used to locate resources, which are shown in the right blade.
Application tags

Naming Best-Practice Suggestions


OK, taking into account the above peculiarities, what’s an Azure architect or admin supposed to do when choosing a resource naming convention? Are there any suggestions?

Sort-of. Microsoft has published an article with good details on exactly this topic. They document the naming restrictions mentioned above, per resource type. Azure resource names should be composites, formed segment-by-segment from meaningful short strings.

Consider these name components when forming a name. Use the minimum number of components needed to clearly identify a resource within a larger Azure environment; avoid redundant components that don’t add identification value. Abbreviate name segments to 2 or 3 characters at most, to ensure the final name won’t be too long. When allowed, hyphenate the segments for easier reading. Consider excluding your company’s organization names, since re-orgs occur and companies merge or acquire other companies, so this class of names is not necessarily durable. (A toy composition sketch follows the component list.)
  • Company name (internet-facing only)
  • Division name
  • Resource location (East, West, etc.)
  • Azure resource type (Storage, SQL, Virtual Machine, etc.)
  • Technical function (Web server, DB server, domain controller, etc.)
  • Serial / sequence 
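
To make the composition concrete, here’s a toy PowerShell helper that joins chosen segments, with a switch for resource types (like storage accounts) that disallow hyphens and uppercase. It’s purely illustrative; the abbreviations match the samples below:

function New-AzureResourceName {
    param(
        [string[]] $Segments,           # ordered name components, e.g. company, type, function, sequence
        [switch]   $LowercaseNoHyphens  # for strict resource types such as storage accounts
    )
    if ($LowercaseNoHyphens) { return (-join $Segments).ToLowerInvariant() }
    return $Segments -join '-'
}

New-AzureResourceName -Segments 'JA','AS','Math','Sqrt'                          # JA-AS-Math-Sqrt
New-AzureResourceName -Segments 'JA','SA','East','Sql','01' -LowercaseNoHyphens  # jasaeastsql01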

Naming Samples


RG-Math: Resource group for math functions. Visible internally only; can contain resources in multiple regions.
JA-AS-Math-Sqrt: App Service (web app) that provides a square-root math function for the Jimmy Azure company. Region doesn’t matter. The name is visible on the internet as http://ja-as-math-sqrt.azurewebsites.net. Using a company prefix is a good idea here, since this name must be globally unique within azurewebsites.net.
ASP-Math: App Service Plan (PaaS VM scale unit) underlying the math app service. Must be in the same region as the app service, but it doesn’t matter which region that is. Visible internally only.
VN-East-Sql: Virtual network in the East region, intended to hold SQL Server. The network isn’t directly visible on the internet; that’s the job of a public IP address.
GW-East-Sql: Gateway attached to the virtual network for SQL in the East region.
JA-FN-Math-Sine: Function App for the Jimmy Azure company, for a math function that calculates sines. Visible on the internet as https://ja-fn-math-sine.azurewebsites.net.
jasaeastsql01: Jimmy Azure company storage account intended for SQL Server in the East region, number 01. Visible on the internet, with more restrictive character requirements. Accessible at jasaeastsql01.core.windows.net and must be unique within core.windows.net.
JA-Srch-Docs01: Jimmy Azure company search engine for documents, visible on the internet at https://ja-srch-docs01.search.windows.net. Region doesn’t matter.

References



Friday, August 25, 2017

Azure Site Recovery Setup - Failure Was An Option

The past couple days, I performed a "lab" experiment using Azure Site Recovery. This is the feature that, amongst other things, lets you replicate on-premise Hyper-V virtual machines to Azure.

ASR is a component of a disaster recovery / business continuity plan that enables running Hyper-V loads in Azure, after a brief fail-over period. Once ASR is initially set up, one click "fails-over" the network to use a newly-provisioned Azure virtual machine to continue operations. Data from on-premise VM disks gets replicated up to an Azure storage account frequently, minimizing any lost data. RPO and RTO objectives can be met for modest requirements, if not for transactional workloads.

You can also use this technique to migrate on-premise VMs to an Azure IaaS setup, though it's a funky approach and I'm not sure I'd really recommend it.

My home office computing environment is flexible and provides a platform for modeling and experimenting with a Windows Server 2012R2 domain, Hyper-V, networks, databases, and so on. I created a test VM to use for the experiment, which was expendable upon experiment completion. I figured that I'd spend a day setting it up, playing with it, and then tearing it all back down.

That was the plan, anyway. But it didn't work out that way.

Sure, I watched the videos and read the documentation. I checked the prerequisites. And the Azure management portal is generally pretty good at guiding administrators through the sequence of steps to set up most any resource. Using these three sources, and my good general knowledge of all the technology and architecture involved, I figured it wouldn't be too hard to do.

I started in the Azure portal, setting up resources using the Resource Group approach (i.e., not Classic). Setup steps include creating a backup vault, storage account, and virtual network. The storage account holds the replicated VM's disks, and the virtual network is where the Azure VM will connect after a fail-over. But you don't make a VM, as you'd pay quite a bit for that resource even if it's never used. Instead, Azure will automatically provision a VM upon fail-over; it'll select a best-match for the "hardware" configuration of your Hyper-V VM (this increases RTO by a few minutes, in exchange for strongly reduced costs).


Hurdle 1 - Not on a DC


The first step for preparing on-premise machines is to install an ASR agent on the Hyper-V host, obtained via the Azure portal. This is the computer that actually runs the VMs you'd like to protect.

This is when issue #1 appeared. It's not documented at present, but you cannot run the ASR agent on a domain controller. Yep, in my small office, I have a DC doing overtime by running several functions that aren't normally set up that way in a commercial environment. But in my low-load environment, it works fine. This is where my VMs were running, but thanks to this restriction, I was unable to proceed. Dead-end on that computer.

ASR error message

Fortunately I have a new laptop that functions as a backup DC and Hyper-V host, and is even more powerful than my older desktop/server machine, so I was able to demote it and install the agent there. Annoying, but not a deal-breaker. It just meant that I was running without a domain backup for a day - a minor risk that I was willing to take, to appease somebody in Redmond.


Hurdle 2 - Not during the summer


Once the agent was installed on a Hyper-V host, the next step was to download a configuration file from the Azure portal and supply it to the ASR setup utility. Trivial, right?

No.

Turns out that ASR setup checks the time-of-day accuracy of the host before proceeding. I could speculate why it does this, but that doesn't matter. With the host computer's timezone set to "(UTC-05:00) Eastern Time (US & Canada)", but Daylight Saving Time in effect (which gives an actual offset of UTC-4:00), ASR insisted that my time was wrong, and therefore it would not proceed. I keep my computers (and clocks, etc.) within 1 second of correct time, so I knew this was a bogus "error".

I've never seen such an error from any software before. The interweb is full of problem reports related to this issue, and even a couple years after those complaints, the issue still isn't fixed. Maybe they only test in Redmond during the winter, when actual time offsets match the configured timezone region?

Anyway, the work-around is easy enough: change the host's timezone to "(UTC-04:00) Atlantic Time (Canada)", then set the clock to the current local time where the machine sits. Problem resolved, and now I could continue with configuration. I also had to remember to reset the computer's timezone afterward.


Hurdle 3 - Not in a paired region


After getting the ASR Agent installed, configured, and running OK on the on-prem Hyper-V host, I continued setup within the Azure portal. In short order I successfully completed that, and clicked that wonderful button labeled "Enable replication".

Success! Or so I thought.

That's when the next hurdle in sequence appeared. If you imagine that I might be getting tired of getting blocked by setup problems, you'd be right.

The problem was the Azure region I had selected for the Azure VM replica. I normally place Azure assets in East US, so due to region pairing I chose West US for the replicated VM and disks. Well, an error appeared in the portal indicating that West US wasn't allowed (along with certain others).

Of course, I'd already configured a virtual network and storage account in the disallowed region, and now I had to tear them down and rebuild in another acceptable region (moving those resources to another region wasn't an available portal option). I chose West US 2, rebuilt assets, and retried the replication.

Finally, replication started working.

At this point I thought it'd just take some time, and the replicated disks and VM would be ready for fail-over testing in Azure. A nice accomplishment, though with much time wasted bumping into problems that weren't documented, or were bugs. Actual experimenting like this - a Proof Of Concept - is where you discover the difference between the lovely videos and marketing info, and how something really works.


Epilogue


Lest you think I was successful with ASR after all the above, there was one last issue that broke the whole deal.

When replication of my on-premise disks had reached 98%, it stopped with an error in the Azure portal UI. The message was too general to be actionable - something about a failure accessing Azure storage - and said to check the Hyper-V host's event logs for details. I did that, and found no information.

I also searched for how to restart replication as the error message further suggested, but alas could never find how to do it in the portal.

So after many hours, configuration changes, and trials, I decided to terminate the experiment. I reset the host computer's timezone, removed the ASR software, and re-promoted it to a DC. Chalk it up to learning, and move on.

After resetting the host computer, I later discovered that the replication restart for which I'd searched was in the host's Hyper-V management tool, via right-click of the appropriate VM. In other words, VM replication uses the mechanism within Windows Server 2012R2 Hyper-V, and not something in the Azure agent. Would've been nice if the error message in the Azure portal had mentioned that, since most other replication setup was in the portal, leading one to assume that's where the restart function lived.

And lastly, about that VM I was trying to replicate? I'd made sure to use a scrappable VM in case something happened, which it did. Even after the config tear-down, that VM was still configured to replicate to Azure. There was no way to disable it - the Hyper-V management tool said that replication had been enabled by another program, and that program must be used. So my VM was in an undesirable state (though it still apparently worked for the time being). The solution? Delete the VM.

Astute Azure Attainment

I've played with Microsoft Azure's cloud computing environment for a few years, and it's a lot of fun. I'm convinced that Azure, and competing cloud services, are the future of internet-based web sites, services, and SaaS (Software as a Service - essentially fully-featured online apps).

In this blog I'll be cataloging my experiences learning, using, building, and tearing-down Azure resources, and related items from the world of C# development, SQL Server, and the whole Microsoft stack. Stay tuned!