Published in CodeX

Analyzing Regular Text

In this article, I am going to cover the following points:

  • Brief introduction to “regular text”
  • Strategy used for analyzing regular text

Brief Introduction

This is part of a milestone project I completed for a healthcare business, in which I answered several questions from the client.

In this part, I explain step by step, using Python, how to analyze regular texts taken from a healthcare system, specifically emergency department data.

To begin, I would like to define regular text.

What do I mean by regular text in this data?

It is any request or reply entered by a user and then recorded by the system as text in a systematic format.

Let us take a look at a sample of the regular text.

As we saw above, this is the kind of regular text I mean. Note that the text contains two orders separated by a semicolon. Each order starts with the word Request, a colon, and a date and time, followed by the word Reply, a colon, and a date and time.

There are spaces between the date and the time, as well as between the time and the word Reply. This sample is index 6; the other indexes have the same structure with different dates and times.

Let me now ask the questions that may have come to mind after reading the paragraph above.

Who enters the Request and the Reply? What does this text refer to?

As I mentioned above, this is healthcare data, specifically emergency data. Each index refers to a single patient, and every patient has one or more orders. These orders refer to the consultations the patient received, and every order has the form “Request:Date Time Reply:Date Time”.

The consultation is requested by an emergency physician and answered by a medical consultant. When the physician needs a consultation, he or she requests it through the hospital system, and it is stored as a request with a date and time. The same happens for the medical consultant, whose answer is stored as a reply with a date and time. The request and the reply are then stored together as a single order in the patient file.

By now you should understand what regular text means in this data and how the healthcare system stores these data as text in a systematic way.

Strategy Used

As a data scientist, you may feel you do not need a strategy for a simple challenge like this, but it becomes necessary when you present your work to a client. Most clients are not interested in the way you extracted the outcomes; they are interested in precise outcomes that they can rely on in decision making. You should therefore prove that the outcomes are precise, and you do that by organizing your work around a strategy with detailed steps.

The figure above shows the strategy used, which contains all the steps I followed, from Business Understanding to Delivering Outcomes. Let us take an overview of this strategy.

When you work with text, for instance when extracting anything from it, you need to be accurate and sure of your results.

In this part, the client had multiple questions. One of them was about the duration between consultation requests and replies, which we can compute by extracting the dates and times of the requests and replies and then subtracting the request date and time from the reply date and time.

The results are time durations, and these durations should be checked against the dates and times in the original data before the outcomes are delivered to the client.

If the results do not match the original data, there were mistakes in extracting the dates and times or in computing the durations, so we should go back to the Extract Date Time step and extract and compute them again. If they match, we can rely on the outcomes and deliver them to the client. Then celebrate :)

Business Understanding

This is a critical step because all the following steps are based on it. Do not forget it in any project.

There were main questions I had to answer. They related to patients who came to an emergency department. Some of them received medical consultations, while others did not need them.

I met the client, we discussed some of the challenges, and we arrived at the following questions:

How many patients did or did not receive medical consultations?

How many consultations were there in total, given that some patients receive more than one consultation?

How long did a consultation take, that is, what was the duration between the request and the reply?

What was the highest number of consultations in the data?

Were there any consultations without a reply? If yes, how many?

Explore Data ( Text )

In this step, I imported the libraries (NumPy, pandas, Matplotlib and seaborn), loaded the data as a data frame, and then selected the consultations feature, which contains the regular texts.

After identifying the consultations feature, I printed the first ten rows. Each index refers to a single patient, and some patients did not receive a consultation.

First, let us find the shape of the data.

How many patients did or did not receive medical consultations?
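As a minimal sketch of this count, the snippet below builds a tiny made-up data frame instead of loading the real file. The column name CONSULTATIONS follows the series name used later in this article; the sample values and the dd/mm/yyyy hh:mm layout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row per patient, NaN means no consultation.
df = pd.DataFrame({
    "CONSULTATIONS": [
        "Request:01/02/2020 10:30 Reply:01/02/2020 11:15",
        np.nan,
        "Request:03/02/2020 08:00 Reply:03/02/2020 08:40;"
        "Request:03/02/2020 09:10 Reply:03/02/2020 10:05",
        np.nan,
    ]
})

consultations = df["CONSULTATIONS"]

n_patients = df.shape[0]                    # number of rows (patients)
n_without = int(consultations.isna().sum()) # patients with no consultation
n_with = int(consultations.notna().sum())   # patients with at least one

print(n_patients, n_with, n_without)
```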

Now let us explore the values of the consultations regular text.

As noted above, we have two consultations, each with a request and a reply, so each of them has a date and time:

  • Request date and time: the date and time when an emergency physician ordered a consultation.
  • Reply date and time: the date and time when a consulting doctor responded.

Extract Dates & Times from Regular Text

Now we want to extract the dates and times for each patient in order to compute the duration of the consultations.

As noted in the text above, single consultations are separated from one another by a semicolon ; . This lets us separate them with the split method, passing the semicolon ; as the argument, which converts the text (string) to a list.

There are two steps:

  1. Convert text (string) to list.
  2. Identify all dates and times indexes of request and reply.

Let us do it.

1. Convert text (string) to list

We can split the consultations in two ways:

  1. By using apply with a lambda function that calls the string split method.
  2. By using a dictionary comprehension that calls the string split method.

Let us try the first way: apply with a lambda function that calls the string split method.

Oh, why this error?

When we face an error, we should read its last line: AttributeError: 'Series' object has no attribute 'split' .

What does that mean?

We used apply on the data frame rather than on the series CONSULTATIONS, so x was a whole column of the data frame (a Series), not a single value of our series CONSULTATIONS .
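The difference can be reproduced on a small made-up data frame. This is only a sketch: the sample strings are invented, and the real data frame has more columns than the single CONSULTATIONS column assumed here.

```python
import pandas as pd

# Hypothetical one-column data frame with invented values.
df = pd.DataFrame({
    "CONSULTATIONS": [
        "Request:01/02/2020 10:30 Reply:01/02/2020 11:15",
        "Request:03/02/2020 08:00 Reply:03/02/2020 08:40;"
        "Request:03/02/2020 09:10 Reply:03/02/2020 10:05",
    ]
})

# apply on the data frame hands each column to the lambda as a Series,
# and a Series has no split method.
try:
    df.apply(lambda x: x.split(";"))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# apply on the series hands each value (a string) to the lambda.
split_consults = df["CONSULTATIONS"].apply(lambda x: x.split(";"))
print(split_consults.apply(len).tolist())
```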

Now, let us select the series CONSULTATIONS and then use apply with a lambda function and the split method.

Oh, another error. Let us read it and find out where the mistake was and why it happened.

AttributeError: 'float' object has no attribute 'split' . This appeared after we selected the series CONSULTATIONS and used apply with a lambda function and the split method. Some values are null, their data type is float, and a float object has no split attribute.

We can use the isna() method to find these null values and exclude them, then use apply with a lambda function and the split method to separate the consultations for every patient index.

First select the series CONSULTATIONS, then create a mask and put the not sign ~ before the mask to ignore all null values.
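A minimal sketch of the mask approach, on an invented series (the values and their date layout are assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical series: index 1 is a patient with no consultation (NaN).
consultations = pd.Series(
    [
        "Request:01/02/2020 10:30 Reply:01/02/2020 11:15",
        np.nan,
        "Request:03/02/2020 08:00 Reply:03/02/2020 08:40;"
        "Request:03/02/2020 09:10 Reply:03/02/2020 10:05",
    ],
    name="CONSULTATIONS",
)

# NaN is a float, so calling split on it fails.
try:
    consultations.apply(lambda x: x.split(";"))
except AttributeError as exc:
    print(exc)

# Build a mask of the null values and invert it with ~ to keep the rest.
mask = consultations.isna()
split_consults = consultations[~mask].apply(lambda x: x.split(";"))
print(split_consults)
```

The original index labels are kept, so each list can still be traced back to its patient.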

Yahooo, we got it.

  • As we saw above in cell 23, we have all consultations without null values.
  • Patients differ from one another in the date or time of the request and reply, or in the number of consultations.

Let us now extract the request date time and reply date time using the first way I mentioned.

As noted above, we converted each text string to a list, so each order “Request:Date Time Reply:Date Time” is now a separate list element, separated from the others by a comma , .

Now, let us try the second way I mentioned.

As we saw above, we used the items() method, which gives access to the keys and values of the series, looped over them, split each value (text) on the semicolon ; and converted it to a list. Now we should convert the dictionary to a series.
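The second way could look roughly like the sketch below. The sample series is invented, and dropping the null values with dropna() before the loop is my assumption about how the nulls are skipped here.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one null value at index 1.
consultations = pd.Series(
    [
        "Request:01/02/2020 10:30 Reply:01/02/2020 11:15",
        np.nan,
        "Request:03/02/2020 08:00 Reply:03/02/2020 08:40;"
        "Request:03/02/2020 09:10 Reply:03/02/2020 10:05",
    ],
    name="CONSULTATIONS",
)

# items() yields (index, value) pairs; split every text on the semicolon.
consult_dict = {
    idx: text.split(";")
    for idx, text in consultations.dropna().items()
}

# Convert the dictionary back to a series keyed by patient index.
split_consults = pd.Series(consult_dict)
print(split_consults)
```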

Let us take one patient as an example, index #6.

This patient received two consultations, each with a request date time and a reply date time.

Let us check whether the two consultations we got after splitting match the original data for this patient.

The two consultations of the patient with index #6 in the original data matched the values after splitting.

2. Identify all dates and times indexes of request and reply.

Let us identify the character positions of the request date and time, and likewise of the reply, as we did before. The request date and time starts at index #8 and ends at index #23, and the reply date and time starts at index #31 and runs to the end of the text.

Now that we know the indexes of the request date time and reply date time, we will use apply() with a lambda function and a list comprehension.
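A small sketch of this slicing, assuming each order looks like "Request:dd/mm/yyyy hh:mm Reply:dd/mm/yyyy hh:mm" so that the positions quoted above line up. The sample orders, and the order with an empty reply, are invented for illustration.

```python
import pandas as pd

# Hypothetical split data: index 6 has two orders, index 7 has no reply yet.
split_consults = pd.Series({
    6: [
        "Request:03/02/2020 08:00 Reply:03/02/2020 08:40",
        "Request:03/02/2020 09:10 Reply:03/02/2020 10:05",
    ],
    7: ["Request:04/02/2020 22:15 Reply:"],
})

# Characters 8..23 hold the request date time, so the slice is [8:24].
request_times = split_consults.apply(
    lambda orders: [order[8:24] for order in orders]
)

# The reply date time runs from character 31 to the end of the order.
reply_times = split_consults.apply(
    lambda orders: [order[31:] for order in orders]
)

print(request_times.loc[6])
print(reply_times.loc[6])
print(reply_times.loc[7])
```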

Compute Durations

One thing I would like to add here. The case above has both a request date time and a reply date time, but some orders have a request date time without a reply date time. This does not affect computing the durations if we pass errors='coerce' to the to_datetime() function, because a missing reply then becomes NaT. We should also set dayfirst=True because the day comes first in our dates.

After identifying all request and reply date and time indexes, we use apply with a lambda function and a list comprehension to reach each value for each patient index. Inside the list comprehension we subtract the request date time from the reply date time.

Duration = Reply - Request
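The subtraction can be sketched as follows. The date strings are invented, the helper name to_dt is hypothetical, and representing a missing reply as an empty string is an assumption about the data.

```python
import pandas as pd

# Hypothetical extracted strings per patient index; "" is a missing reply.
request_times = pd.Series({
    6: ["03/02/2020 08:00", "03/02/2020 09:10"],
    7: ["04/02/2020 22:15"],
})
reply_times = pd.Series({
    6: ["03/02/2020 08:40", "03/02/2020 10:05"],
    7: [""],
})


def to_dt(text):
    # errors="coerce" turns unparseable text into NaT instead of raising;
    # dayfirst=True reads 03/02/2020 as 3 February 2020.
    return pd.to_datetime(text, errors="coerce", dayfirst=True)


# Duration = Reply - Request, computed order by order for every patient.
durations = pd.Series({
    idx: [
        to_dt(reply) - to_dt(request)
        for request, reply in zip(request_times.loc[idx], reply_times.loc[idx])
    ]
    for idx in request_times.index
})
print(durations)
```

An order without a reply yields NaT rather than an error, so it can be counted separately later.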

Checking the outcomes

Now, let us check whether these durations were computed correctly.

We will randomly select five indexes.

Then we will look up each index in the list of durations and compare it with the original request date time and reply date time.

Amazing work. All five indexes in the list of durations matched the request date time and reply date time in the original data. Look at index #17243: it has no reply date time, so the duration should be NaT.

Now, we would like to know the maximum number of consultations.

What was the highest number of consultations in the data?

To find the number of consultations for every patient, we use apply with the len function to get the length of each consultation value. We then use value_counts with normalize=True to get the share of each consultation count, round the values to four digits with round(), and multiply by 100 to get percentages.
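A minimal sketch of this counting on made-up data; the placeholder order names stand in for real order strings, so the percentages here have nothing to do with the real results.

```python
import pandas as pd

# Hypothetical split data: each list holds one patient's orders.
split_consults = pd.Series({
    0: ["order A"],
    2: ["order A", "order B"],
    5: ["order A"],
    6: ["order A", "order B"],
    9: ["order A"],
})

# Number of consultations per patient.
n_consults = split_consults.apply(len)

# Share of each consultation count, as a percentage.
consult_share = n_consults.value_counts(normalize=True).round(4) * 100

max_consults = n_consults.max()
print(consult_share)
print(max_consults)
```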

The maximum number of consultations was ten, which happened for one patient. The most frequent number of consultations was one, at 85%.

Let us find the index of the patient with ten consultations.

Let us merge all consultation durations together to find the basic statistics, such as central tendency.

How long did a consultation take, that is, what was the duration between the request and the reply?

Let us look at the maximum consultation duration by finding its index and then checking whether the request and reply date and time match the original data.

Let us now explain each cell separately.

[69] We created a for loop over each patient by index, then a nested for loop over each duration. Inside it we check whether the duration equals the maximum duration and, if so, print the index of that duration and all durations of that index.

[70, 71, 72] Once we knew the index of the maximum duration, we looked up that index in the split data and checked whether it matched the original data.
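The nested loop described for cell [69] might look like the sketch below. The durations are invented, and flattening them into one series to get the maximum is my assumption about how the durations were merged.

```python
import pandas as pd

# Hypothetical durations per patient index; NaT marks a missing reply.
durations = pd.Series({
    6: [pd.Timedelta(minutes=40), pd.Timedelta(minutes=55)],
    7: [pd.NaT],
    12: [pd.Timedelta(hours=5)],
})

# Merge all durations into one flat series to get overall statistics.
flat = [d for per_patient in durations for d in per_patient]
all_durations = pd.Series(flat, dtype="timedelta64[ns]")
max_duration = all_durations.max()  # NaT values are skipped

# Outer loop over patients, inner loop over that patient's durations.
max_index = None
for idx, per_patient in durations.items():
    for duration in per_patient:
        if duration == max_duration:
            max_index = idx
            print(idx, per_patient)
```

With the index in hand, the matching row of the original data can be inspected by eye to confirm the request and reply date and time.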

As we saw above, the maximum duration was at index #27, and it matched the original data.

Relying Outcomes

After checking that all the code was correct and the outcomes were precise in the development environment, I applied a part of software engineering, OOP (Object Oriented Programming), to condense all the code from the development environment and merge it in a more professional and systematic way that is also easier to update and debug.

Let us start by importing the libraries.

Two classes were created:

  • SeparatingConsultations
  • ExtractingComputingConsultations

After creating the two classes, we pass the file (the path of the data) to the first class, SeparatingConsultations, and create a new object called consults_to_sep .

The second class inherits all variables and methods from the first class, and we create another object from it called consults_to_comp .
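The two classes are not reproduced here, so the following is only a rough sketch of how such a pair could be organized. The class, object and attribute names come from this article; the method bodies, the _duration helper, the PATIENT_ID column, the date layout and the in-memory CSV (used instead of a real file path) are all assumptions.

```python
import io
import pandas as pd


class SeparatingConsultations:
    """Load the data and split each patient's text into single orders."""

    def __init__(self, file, column="CONSULTATIONS"):
        # `file` may be a path or a file-like object (assumption).
        self.consultations = pd.read_csv(file)[column]
        mask = self.consultations.isna()
        self.series_consult_without_null_vals = (
            self.consultations[~mask].apply(lambda text: text.split(";"))
        )
        self.consult_counts = (
            self.series_consult_without_null_vals.apply(len).value_counts()
        )


class ExtractingComputingConsultations(SeparatingConsultations):
    """Extract request and reply date times and compute durations."""

    def __init__(self, file, column="CONSULTATIONS"):
        super().__init__(file, column)
        self.durations = self.series_consult_without_null_vals.apply(
            lambda orders: [self._duration(order.strip()) for order in orders]
        )
        flat = [d for per_patient in self.durations for d in per_patient]
        self.duration_stats = pd.Series(flat, dtype="timedelta64[ns]").describe()

    @staticmethod
    def _duration(order):
        # Hypothetical helper: positions assume "Request:dd/mm/yyyy hh:mm Reply:...".
        request = pd.to_datetime(order[8:24], errors="coerce", dayfirst=True)
        reply = pd.to_datetime(order[31:], errors="coerce", dayfirst=True)
        return reply - request


# Invented CSV content standing in for the real data file.
csv_text = (
    "PATIENT_ID,CONSULTATIONS\n"
    "1,Request:01/02/2020 10:30 Reply:01/02/2020 11:15\n"
    "2,\n"
    "3,Request:03/02/2020 08:00 Reply:03/02/2020 08:40;"
    "Request:03/02/2020 09:10 Reply:03/02/2020 10:05\n"
)

consults_to_sep = SeparatingConsultations(io.StringIO(csv_text))
consults_to_comp = ExtractingComputingConsultations(io.StringIO(csv_text))

print(consults_to_sep.consult_counts)
print(consults_to_comp.duration_stats)
```

Keeping the splitting logic in the parent class means the child class only adds the extraction and duration steps, which is one way to make later updates and debugging easier.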

Now, let us answer the five questions I mentioned before by using the appropriate object.

How many patients did or did not receive medical consultations?

[6] We selected the consultations variable and used the isna() and sum() methods, with the first index, to find the number of patients who did not receive consultations.

[7] We selected the series_consult_without_null_vals variable and used the shape attribute with its first index to find the number of rows, which is the number of patients who received consultations.

What was the highest number of consultations in the data?

[8] We renamed the first column of the data frame to number of consultations by selecting the consult_counts variable and using its index and name attributes.

[9] We selected the consult_counts variable to display every number of consultations with its frequency.

The next cells use the second object we created, consults_to_comp .

Were there any consultations without a reply? If yes, how many?

[10] We selected the unique_no variable to display the detailed number of consultations with and without a Reply.

How many consultations were there in total?

[11] We selected the unique_no variable and used the iloc indexer with the first row and all columns, together with the sum() method, to display the total count of consultations in our data.

How long did a consultation take, that is, what was the duration between the request and the reply?

[12] We selected the duration_stats variable to display the basic statistics of all consultations.

Finally, we can deliver the outcomes to the client in two files. The first file contains the development environment in which we did all the steps, with all the code kept separate so the outcomes can be checked step by step.

The second file contains all the code merged into two classes. Using the variables of these two classes, we can display the outcomes more easily.

I hope you enjoyed reading this article and learned how to create a strategy that divides your tasks into several parts, so that your work is clear to your clients when you present the outcomes.
