Here is an example of the code I’m using:
library(jsonlite)
library(curl)
#url
url = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=001000&lt=000000&ht=010000&pr=999999,10000001&mp=3779,37788&bd=0%2C&ba=0%2C&sf=,&lot=0%2C&yr=,1800&singlestory=0&hoa=0%2C&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=36m&ds=all&pmf=0&pf=0&sch=100111&zoom=6&rect=-91307373,29367814,-84759521,35554574&p=1&sort=globalrelevanceex&search=maplist&rid=4&rt=2&listright=true&isMapSearch=true&zoom=6"
#json
results_data_json = fromJSON(txt = url)
I used to be able to run similar code to this with no issue. Now I’m getting the following error:
Error in feed_push_parser(buf) :
lexical error: invalid char in json text.
<html><head><title>Zillow: Real
(right here) ------^
Any ideas around this?
asked Dec 6, 2016 at 16:22
This happened to me when reading JSON from a file. The code worked one day, and the next day I got this error. I was eventually able to circumvent it, although I do not understand why my solution works. I found a GitHub post that suggested adding the readLines() function, e.g.:
r_object <- fromJSON(readLines("file.json", warn = FALSE))
The "warn" argument of readLines() is set to FALSE to suppress the warning triggered by the lack of a final EOL in many JSON files.
answered Feb 22, 2018 at 20:06
ADF
This seems to be a problem related to your current working directory in R.
You can view your current working directory by entering "getwd()" in your RStudio console.
If this path does not point to the directory containing your JSON file, R cannot find the file.
You can change your current working directory with "setwd()". An example would be: setwd("/Users/me/Documents/jsonFiles/")
Alternatively, you can have your code point to the full path of each file. This makes your code robust to any working directory changes that might happen over time. However, it also means that using the code on someone else’s computer requires these paths to be edited. You can find the full path to a file by navigating in the terminal to the directory containing it and typing "pwd" (‘print working directory’). This works on Mac and Linux. On a Windows machine, run "cd" from the terminal.
answered Jun 23, 2022 at 17:45
I can’t replicate the error either.
class(results_data_json)
[1] "list"
My sessionInfo():
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
locale:
[1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 LC_MONETARY=Spanish_Colombia.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Colombia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] curl_2.4 jsonlite_1.1
loaded via a namespace (and not attached):
[1] tools_3.3.2
answered Apr 12, 2017 at 23:28
Posting this in case someone else encounters this problem…I had to set the open file as my working directory and it began to work fine again. Wildly simple. Just right click on the open json file.
answered Feb 28, 2022 at 16:59
On my mac, it was caused by iCloud. The json file had been stored in the cloud, and R couldn’t find it. Downloading the file fixed it. This may or may not have been what happened in your case.
Apologies that I’m nearly five years too late.
answered Sep 22, 2021 at 3:16
The same thing happened to me as to @ADF above. I had a whole project I was working on a few months ago, and when I revisited it, fromJSON suddenly didn’t work.
I found that even though the file was named "name.json" in my file browser, my R directory showed it as "name.json.txt". Double-check that your file name and what the directory shows match up!
answered Jan 25, 2022 at 18:11
If you doubt the JSON file is valid, test it in another way (there are online tools).
For me, the problem was that I got the file name wrong, and for some reason it gives the same error (instead of a missing-file error).
answered Feb 4 at 13:49
ornit
I got this error when parsing a json file:
Error: lexical error: invalid char in json text.
ow_bypassed":{"local_pkts":0,"l
(right here) ------^
The JSON file I was parsing was invalid because the computer crashed while the file was being written: the crash caused the next JSON blob to be started before the previous one had finished being written, which is why fromJSON() couldn’t parse it.
Here’s how I solved it:
- Open the JSON file with a text editor (I used Sublime Text, but any will do).
- Search for the part of the JSON that is causing the problem (in my case: ow_bypassed":{"local_pkts":0,"l, see the error message above). Every case will be different, so look in your specific error message for the part that is causing the error.
- Do something about it. In my case I was happy to delete that part of the JSON data (it meant losing that data, but that was tolerable since it made the JSON valid). Another option is to fix the invalid JSON, which could be trickier.
Then everything worked.
answered yesterday
stevec
Contents
- JSON parsing error: lexical error: invalid char in json text.\x0a #1964
- Comments
- Lexical Error
- Types of Lexical Error:
- Other lexical errors include
- Error Recovery Technique
- Error detection and Recovery in Compiler
- Classification of Errors
- Compile-time errors
- Lexical phase errors
- Syntactic phase errors:
- Semantic errors
- A C++ Expression Parser Tutorial
- Introduction
- Context Free Grammars: The Heart of the Solution
- Nonterminals, Terminals, and Tokens
- The Meaning of Context Free
- Describing Tokens
- Operator Precedence in CFGs
- The Expression Grammar: First Version
- Token Descriptions
- The Phases of Language Translation
- Grammars and Parsers
- The Expression Grammar: Second Version
- LL(1) Grammars and Recursive Descent Parser Routines
- The Coding Phases
- Parser Errors
- The Lexer
- The Token Type
- The Lexer Class
- The Parser
- Semantics
- Helper Functions
- The Symbol Table
- Parser()
- primary()
- unary_expr()
- pow_expr()
- mul_expr()
- add_expr()
- assign_expr()
- The Complete Parser Code
JSON parsing error: lexical error: invalid char in json text.\x0a #1964
"message": "Failed to parse request body.",
"details": {
"message": "Warning. Match of \"eq 0\" against \"REQBODY_ERROR\" required.",
"data": "JSON parsing error: lexical error: invalid char in json text.\x0a",
"file": "rules/REQUEST-920-PROTOCOL-ENFORCEMENT.conf",
"line": "143"
},
A POST request with a JSON payload was sent. The payload contains no special characters or empty strings, so this is a very strange issue to resolve.
Please find dummy JSON format which is similar to actual payload..
@victorhora
This is «BLOCKER» for us which triggered for many requests. Could you please help in suggesting FIX ASAP for same ?
What happens to be the ModSecurity version that you are currently running? Can you give more details on the request?
I figured out that this issue occurs when a string in the JSON payload contains an ENTER (the \n character, ASCII value \x0a). It is reproducible when the request is made through Burp, and it is logged as the violation that is blocking all requests.
I added this ENTER manually, posted the request via Burp, and found the same issue in the logs. What remains unclear is that our JSON does not contain any ENTER or \n in the middle of strings, as observed in the Wireshark pcap.
How should we mitigate that? Or is this a "false positive"?
This is the first time we have introduced the WAF with OWASP CRS 3.0 to our existing application; earlier we didn’t have the CRS rule engine.
@001appsec007 there are two different things: ModSecurity, which is your WAF engine, and the CRS, which is a set of rules that, once loaded into ModSecurity, attempts to mitigate generic attacks on your web application. Depending on where the ENTER is, it may lead to invalid JSON, which may be correctly reported by the combination of ModSecurity and the rule set.
In the JSON payload below, a line break entered after "1-a1b2c3d" results in the same error when the request is sent manually via CRS 3.0.
This is indeed invalid JSON:
Could you please parse this JSON format?
The above is valid JSON, but we are still getting "JSON parsing error: lexical error: invalid char in json text.\x0a".
We also verified the hex values \x0A (line feed character) and 5C in the PCAP, but we are still facing the same error. Could you please help with this?
@001appsec007 you could use any online JSON lint tool such as: https://jsonlint.com/
This JSON in particular shows as valid.
There’s a chance that your particular client/library that handles the JSON that gets sent to ModSecurity is screwing up something and breaking the JSON.
Please go over issue #1879 as a user was complaining about a similar problem with JSON parsing, when in the end it was found that the problem was the way that Python was handling the JSONs.
Source
Lexical Error
When no token pattern matches a prefix of the remaining input, the lexical analyzer gets stuck and has to recover from this state to analyze the remaining input. In simple words, a lexical error occurs when a sequence of characters does not match the pattern of any token. It is detected during lexical analysis, the first phase of compilation, not during program execution.
Types of Lexical Error:
Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.
Example:
This is a lexical error since a 32-bit signed integer must lie between −2,147,483,648 and 2,147,483,647.
2. Appearance of illegal characters
Example:
This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string
Example:
This is a lexical error since the ending of comment “*/” is not present but the beginning is present.
4. Spelling Error
5. Replacing a character with an incorrect character.
Other lexical errors include
6. Removal of the character that should be present.
7. Transposition of two characters.
Error Recovery Technique
A situation can arise in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is “panic mode” recovery: we delete successive characters from the remaining input until the lexical analyzer can identify a well-formed token at the beginning of what input is left.
Other possible error-recovery actions are:
- Transpose two adjacent characters.
- Insert a missing character into the remaining input.
- Replace a character with another character.
- Delete one character from the remaining input.
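As a rough illustration of panic-mode recovery, here is a minimal C++ sketch of a toy scanner. The names ScanResult and scan, and the tiny token set (identifiers, unsigned integers, a few punctuation characters), are assumptions made for this example and do not come from any particular compiler. When the scanner meets a character that cannot start any token, it deletes successive characters until one can, records a single error, and carries on.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch: the recognized tokens plus
// one message per lexical error.
struct ScanResult {
    std::vector<std::string> tokens;
    std::vector<std::string> errors;
};

// Toy scanner: identifiers, unsigned integers and the punctuation below
// are the only valid tokens. Anything else is a lexical error.
ScanResult scan(const std::string& src) {
    const std::string punct = "+-*/=;(){}";
    ScanResult result;
    std::size_t i = 0;

    auto is_ident_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    // True if c is whitespace or could begin a well-formed token.
    auto can_resume = [&](char c) {
        return std::isspace(static_cast<unsigned char>(c)) ||
               is_ident_char(c) || punct.find(c) != std::string::npos;
    };

    while (i < src.size()) {
        const char c = src[i];
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc)) {
            ++i;
        } else if (std::isdigit(uc)) {
            const std::size_t start = i;
            while (i < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            result.tokens.push_back(src.substr(start, i - start));
        } else if (is_ident_char(c)) {
            const std::size_t start = i;
            while (i < src.size() && is_ident_char(src[i])) ++i;
            result.tokens.push_back(src.substr(start, i - start));
        } else if (punct.find(c) != std::string::npos) {
            result.tokens.push_back(std::string(1, c));
            ++i;
        } else {
            // Panic mode: delete successive characters until the scanner
            // can identify the start of a well-formed token again.
            const std::size_t start = i;
            while (i < src.size() && !can_resume(src[i])) ++i;
            result.errors.push_back("illegal character(s) '" +
                                    src.substr(start, i - start) +
                                    "' at offset " + std::to_string(start));
        }
    }
    return result;
}
```

Calling scan("a = 45$;") in this sketch yields the tokens a, =, 45 and ; together with one error for the stray $, which mirrors the "illegal character" case listed earlier.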
Source
Error detection and Recovery in Compiler
In this phase of compilation, all possible errors made by the user are detected and reported to the user in form of error messages. This process of locating errors and reporting them to users is called the Error Handling process.
Functions of an Error handler.
- Detection
- Reporting
- Recovery
Classification of Errors
Compile-time errors
Compile-time errors are of three types:-
Lexical phase errors
These errors are detected during the lexical analysis phase. Typical lexical errors are:
- Exceeding length of identifier or numeric constants.
- The appearance of illegal characters
- Unmatched string
Error recovery for lexical errors:
Panic Mode Recovery
- In this method, successive characters from the input are removed one at a time until a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }
- The advantage is that it is easy to implement and guarantees not to go into an infinite loop
- The disadvantage is that a considerable amount of input is skipped without checking it for additional errors
Syntactic phase errors:
These errors are detected during the syntax analysis phase. Typical syntax errors are:
- Errors in structure
- Missing operator
- Misspelled keywords
- Unbalanced parenthesis
The keyword switch is incorrectly written as swich. Hence, an “Unidentified keyword/identifier” error occurs.
Error recovery for syntactic phase error:
1. Panic Mode Recovery
- In this method, successive characters from the input are removed one at a time until a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }
- The advantage is that it’s easy to implement and guarantees not to go into an infinite loop
- The disadvantage is that a considerable amount of input is skipped without checking it for additional errors
2. Statement Mode recovery
- In this method, when a parser encounters an error, it performs the necessary correction on the remaining input so that the rest of the input statement allows the parser to parse ahead.
- The correction can be deletion of extra semicolons, replacing the comma with semicolons, or inserting a missing semicolon.
- While performing the correction, utmost care should be taken not to enter an infinite loop.
- A disadvantage is that it has difficulty handling situations where the actual error occurred before the point of detection.
3. Error production
- If a user has knowledge of common errors that can be encountered then, these errors can be incorporated by augmenting the grammar with error productions that generate erroneous constructs.
- If this is used then, during parsing appropriate error messages can be generated and parsing can be continued.
- The disadvantage is that it’s difficult to maintain.
4. Global Correction
- The parser examines the whole program and tries to find out the closest match for it which is error-free.
- The closest match is the program that needs the fewest insertions, deletions, and changes of tokens to recover from the erroneous input.
- Due to high time and space complexity, this method is not implemented practically.
Semantic errors
These errors are detected during the semantic analysis phase. Typical semantic errors are
- Incompatible type of operands
- Undeclared variables
- Mismatch between actual arguments and formal parameters
It generates a semantic error because of an incompatible type of a and b.
Error recovery for Semantic errors
- If the error “Undeclared Identifier” is encountered then, to recover from this a symbol table entry for the corresponding identifier is made.
- If data types of two operands are incompatible then, automatic type conversion is done by the compiler.
Source
A C++ Expression Parser Tutorial
Introduction
This tutorial describes how to evaluate a string as a mathematical expression. Specifically, it describes the design and coding of a recursive descent parser. The parser accepts a string having valid syntax, such as 4.0 * atan(1.0) , and returns a double .
When I was young, my family had one of those computer terminals you hook up to your TV. To do anything, you had to program it in BASIC. To evaluate an expression, you had to put it on a program line. One day I tried putting an expression in a string variable. I was surprised that there was no language mechanism to evaluate that string. I was curious from that day on about how to solve the problem.
I learned later on that there is a whole theory behind the problem. It involves technical concepts such as context free grammars (CFGs) and using the grammar to parse source text. As information became more and more available, I learned how to solve the problem. I can’t help bragging a bit. I did end up writing a rudimentary expression parser in TI99/4A Extended BASIC. Believe me, it’s much easier to do in C++.
It is possible to provide expression evaluation as a language feature. However, in that case you’re usually stuck with the syntax someone else provided. For instance, you might want to write (2 * 5) ^ 3 to mean 2 times 5 all raised to the power of 3. However, if C++ provided an expression parser, it would probably mimic the use of the ^ operator in code; i.e., ^ would be a bitwise exclusive or operator. If you can write your own parser, you’re in control of the syntax (language structure) and semantics (language meaning).
Context Free Grammars: The Heart of the Solution
Context free grammars (CFGs) are a definitional mechanism. They give a formal definition of the structure of a language. We’re not going to define an entire programming language here. We’re just going to define an expression. But, the same techniques apply.
Consider describing the structure of an addition expression. We want to describe not only 2 + 3 , but 2 + 3 + 4 , and 2 + 3 + 4 + 5 , and in general the addition of any number of numbers. We want to devise a minimal set of rules that describes all these expressions. Let’s start with three symbols, add_expr , PLUS , and NUMBER . We want to say that an add_expr can be a subexpression PLUS a NUMBER . We will write

    add_expr: add_expr PLUS NUMBER

The above rule is called a production. add_expr: is called the left hand side (lhs), and add_expr PLUS NUMBER is called the right hand side (rhs). Grammar productions are applied when evaluating a language construct. For instance, 2 + 3 + 4 is supposed to be a valid construct. We start by saying 2 + 3 + 4 is an add_expr . This must fit the pattern add_expr PLUS NUMBER . If it does, then 2 + 3 must be an add_expr . It also must fit the pattern add_expr PLUS NUMBER . It does if we can say that 2 is an add_expr . For this to happen, we need another rule. We need to say that an add_expr can also be a NUMBER :

    add_expr: NUMBER

The result of applying our two rules recursively can be represented by the following tree:

    add_expr
        add_expr
            add_expr
                NUMBER (2)
            PLUS
            NUMBER (3)
        PLUS
        NUMBER (4)

We constructed this tree from the top down, but it is evaluated in a preorder sequence. So, the way our grammar is written, this expression will get evaluated from left to right.
So far, we have two grammar rules:

    add_expr: add_expr PLUS NUMBER
    add_expr: NUMBER

These rules have the same left hand side. By convention, we usually write the left hand side once, joining the right hand sides with the | symbol to mean "or". We end with a ; to show that we’re done. Using this convention, our grammar looks like:

    add_expr: add_expr PLUS NUMBER
            | NUMBER
            ;

The first rule in our grammar is called a left recursive rule. One property of left recursive rules is that they specify a left to right evaluation of an expression. If we wanted to specify that expressions should be evaluated right to left, we would use a right recursive rule:

    add_expr: NUMBER PLUS add_expr
Nonterminals, Terminals, and Tokens
When a syntax tree is generated from grammar productions, the symbols on the left hand side never appear as leaves in the tree. Hence, they are called nonterminals. The symbols that do appear as leaves of the tree are called terminals. The terminal symbols of a language are also called tokens. One property of a token is that it can appear directly in source text. For instance, you may see a 2 or a + in an expression, but there is no token representing an add_expr . You have to compose an add_expr out of the available tokens.
The Meaning of Context Free
Context free simply means that the lhs of every production in the grammar consists of one and only one nonterminal. Grammars that don’t have this property are called context sensitive, because the lhs then determines the context in which an rhs can replace an lhs.
Context sensitive grammars are more powerful, but efficient parsers for these grammars do not exist. On the other hand, efficient parsers for many classes of CFGs do exist.
Describing Tokens
Tokens can be described in a grammar. But, in practice, they are usually described separately. The reason for this is that parsing techniques are generally slower than the techniques used to recognize a token. When tokens are specified with a grammar, the grammar subset specifying a token turns out to belong to a class of grammars called regular grammars. There is an alternate way of describing a token called a regular expression. Each symbol used in such an expression corresponds directly to a regular grammar construct. If you are using parser generation tools, you will need to be familiar with regular expressions. However, describing regular expressions here would detract from our task of describing how to write a parser. It is sufficient to describe our tokens informally.
In describing NUMBER , we must decide what kind of number we want. For instance, we might want an integer, or we might want a floating point number. We will use floating point numbers here.
It’s easiest to describe our numbers if we break them up into two kinds: numbers starting with a digit, and numbers starting with a decimal point. Each of these kinds of numbers can optionally be followed by an exponent part.
A number is composed of:
- one or more digits
- followed by zero or one decimal points (i.e., an optional decimal point)
- followed by zero or more digits
- followed by zero or one exponent parts.
OR
- a decimal point
- followed by one or more digits
- followed by zero or one exponent parts.
An exponent part is composed of:
- an e or an E
- followed by zero or one sign symbols ( + or - )
- followed by one or more digits.
Token descriptions can get rather verbose in English. The advantage of regular expressions is that they are compact and concise. The disadvantage is that they can be cryptic until you get used to them. The regular expression that describes a number is:

    [0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?|\.[0-9]+([eE][+-]?[0-9]+)?
That’s pretty unwieldy. But it’s what we implement in our lexer; you can see the regular expression implemented in code. Here is a key to understanding the above expression:
- Some characters have special meaning. But they lose their special meaning when enclosed in square brackets.
- Square brackets denote a character class. [0-9] means any one of the digits 0 through 9. [+-] means one of the symbols + or -.
- The exception is that a ^ at the beginning of a class means anything except what is in the class. For example, [^+-*] means any one character except a +, -, or *
- Outside square brackets, we use + to mean one or more, * to mean zero or more, and ? to mean zero or one. A period (.) means any one character except newline. If you want a literal one of these characters outside square brackets, precede it with a backslash. Note the use of \. above to denote a decimal point.
- Regular expressions basically read as “this followed by that”. The first part of the expression above is one or more digits: [0-9]+. The next expression is in parentheses, and the ? tells us this part is optional (zero or one). So one or more digits, i.e., [0-9]+, is optionally followed by a decimal point which is followed by zero or more digits; this is the \.[0-9]* in parentheses.
- The | means “or” or alternate. So the pattern matches what we previously described, OR it matches a decimal point followed by one or more digits.
Hopefully by this point you can recognize the optional exponent part of the number.
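To make the number description concrete, here is a small C++ sketch that checks a string against it with std::regex. The pattern is reconstructed from the prose description above (digits with an optional fraction, or a leading decimal point, each with an optional exponent), and the helper name is_number is an assumption for this example, not part of the tutorial's lexer.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if the whole string is a number as described above:
//   digits, optional ".digits", optional exponent
//   OR "." followed by digits, optional exponent.
// The default std::regex grammar (ECMAScript) is used.
bool is_number(const std::string& text) {
    static const std::regex number_re(
        R"([0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?|\.[0-9]+([eE][+-]?[0-9]+)?)");
    // regex_match requires the entire string to match, not just a prefix.
    return std::regex_match(text, number_re);
}
```

With this sketch, strings such as 304, 46., .72 and 4.92e-3 are accepted, while a lone decimal point or a dangling exponent such as 1e is rejected. A hand-written lexer would typically scan the same pattern character by character instead of calling a regex engine.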
Operator Precedence in CFGs
Consider the following grammar: The expression 2 + 3 * 4 would produce the following derivation tree: Notice that when processed in a preorder sequence, the multiplication gets evaluated first. This is precisely what we want. The above grammar illustrates how to express operator precedence in a grammar. Our final grammar will contain a full set of operators and will handle things in addition to numbers such as identifiers and parenthesized expressions.
The Expression Grammar: First Version
Here is a grammar defining the syntax of our expressions. The reason that this is «version 1» is that, for reasons explained below, the type of parser we are going to use can’t deal with things such as left recursion. We will end up rewriting some rules in a different, but equivalent, way.
Token Descriptions
The Phases of Language Translation
An expression parser, or more generally, a compiler, usually operates in several phases. First, the characters in the source text must be properly grouped and turned into tokens. Next, the parser checks the sequence of tokens to see whether they satisfy the grammar. Next, the parser handles the semantics of what it recognizes; e.g., when it recognizes an addition expression, it collects the information it needs and prepares to do the addition. In our parser, the addition is done at the point that an addition subexpression is recognized. If we were writing a compiler, we would have our parser generate code.
Grammars and Parsers
Given a language construct, the rules in a grammar can be applied from the top down or from the bottom up. As long as the same derivation tree is generated, the rules can be applied in any fashion. This gives rise to different classes of grammars and different parser types. The bottom up approach isn’t very intuitive. However, this approach, along with the parsers which support it, ends up being more powerful than the top down approach. The bottom up approach makes parsing full featured languages easier. Programs that generate parsers from a grammar plus other information usually generate a bottom up parser.
We will be using a top down parser called a recursive descent parser. Recursive descent parsers are best at handling a class of grammars called LL(1). This arcane name refers to how a token stream is parsed. The first L means that the token stream is read left to right, the second L means that the parser produces a leftmost derivation, and the 1 means that the parser looks ahead at most one token. We won’t go into the differences between various grammar classes. The property of LL(1) grammars that we are interested in is that, given a token and the left hand side of a production, the token must predict one and only one right hand side. The grammar above doesn’t meet this requirement. Here is why. Take the rule set If the current token in our token stream is a number, which right hand side do we choose? We don’t know, without looking ahead further, whether we’re starting an addition expression or a multiplication expression. In general, no grammar containing left recursion can be LL(1). The reason I have used left recursion is that it’s more intuitive than the LL(1) equivalent.
An LL(1) equivalent of the above is The comment // Empty just refers to «no token». Now, if our current token is NUMBER and we’re at a point where we’re processing an add_expr , we only have the rule mul_expr add_expr_tail to choose from. If a NUMBER can’t start this rule, we would have an error. But, a NUMBER can start an add_expr , so we may proceed. Going down the parse tree, mul_expr eventually becomes a primary , which becomes a NUMBER . With that token matched, we advance to the next token. We are now processing the add_expr_tail nonterminal in add_expr: mul_expr add_expr_tail . If our current token is PLUS , the rule add_expr_tail: PLUS mul_expr add_expr_tail is used. Otherwise, the rule add_expr_tail: // EMPTY is used.
The first two rules in our grammar also exclude it from the LL(1) class: Can you see why? An add_expr becomes a primary which can become an ID . So an ID can begin either right hand side.
The easiest way to make our grammar LL(1) is to allow assignment to any kind of expression. We will delegate checking that assignment is being done to an identifier to the static semantic checking. Hence, we only need to transform our first two rules as follows:
The Expression Grammar: Second Version
Here is an LL(1) grammar that is equivalent to our original grammar.
LL(1) Grammars and Recursive Descent Parser Routines
One nice thing about LL(1) grammars and recursive descent parsers is that each set of productions having the same left hand side can be directly mapped into a parser function. A production with a single right hand side becomes a function that processes that right hand side in order. A production set with several right hand sides becomes a function that chooses among them based on the current token. Don’t get too clever by factoring out the commonality in this code. It will disappear when you add the semantics later.
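As a rough illustration of that mapping, here is a minimal sketch of add_expr and add_expr_tail written as one function per left hand side. The Token enum, the TokenStream stand-in for the lexer, the parses() wrapper, and the MINUS alternative are assumptions made for this sketch. The mul_expr here is cut down to a single NUMBER so the example stays short.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical token kinds, standing in for the tutorial's own token type.
enum class Token { Number, Plus, Minus, End };

// Stand-in for the lexer: a fixed token list plus a cursor.
struct TokenStream {
    std::vector<Token> tokens;
    std::size_t pos;
    Token current() const { return pos < tokens.size() ? tokens[pos] : Token::End; }
    void advance() { if (pos < tokens.size()) ++pos; }
};

// mul_expr is cut down to a single NUMBER to keep the sketch short.
void mul_expr(TokenStream& ts) {
    if (ts.current() != Token::Number)
        throw std::runtime_error("syntax error: number expected");
    ts.advance();
}

// add_expr_tail: PLUS mul_expr add_expr_tail
//              | MINUS mul_expr add_expr_tail   (assumed; the text only names PLUS)
//              | // Empty
void add_expr_tail(TokenStream& ts) {
    switch (ts.current()) {
    case Token::Plus:
        ts.advance();
        mul_expr(ts);
        add_expr_tail(ts);
        break;
    case Token::Minus:
        ts.advance();
        mul_expr(ts);
        add_expr_tail(ts);
        break;
    default:
        break;  // the Empty production: consume nothing
    }
}

// add_expr: mul_expr add_expr_tail
void add_expr(TokenStream& ts) {
    mul_expr(ts);
    add_expr_tail(ts);
}

// Returns true if the whole token list forms one add_expr.
bool parses(std::vector<Token> tokens) {
    TokenStream ts{std::move(tokens), 0};
    try {
        add_expr(ts);
    } catch (const std::runtime_error&) {
        return false;
    }
    return ts.current() == Token::End;
}
```

Each function body reads like the right hand sides it implements: a nonterminal becomes a call, a terminal becomes a check followed by advance(), and the current token selects the alternative.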
The recursion in the add_expr_tail production has the potential to be slow and to eat memory. When such X and X_tail productions occur, we will code them together in a single function, so that the tail no longer has to call itself. We apply a similar optimization to an X and X_suffix pair.
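One common way to do this, sketched here as an assumption rather than as the tutorial's exact code, is to fold X_tail into X and turn the tail's self-recursion into a loop. The Tokens struct is a hypothetical stand-in for the lexer in which every character is one token ('n' for NUMBER, '+' for PLUS, '-' for MINUS).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in token stream: every character is one token
// ('n' = NUMBER, '+' = PLUS, '-' = MINUS, '\0' = end of input).
struct Tokens {
    std::string text;
    std::size_t pos;
    char current() const { return pos < text.size() ? text[pos] : '\0'; }
    void advance() { if (pos < text.size()) ++pos; }
};

// mul_expr is cut down to a single NUMBER to keep the sketch short.
void mul_expr(Tokens& ts) {
    if (ts.current() != 'n')
        throw std::runtime_error("syntax error: number expected");
    ts.advance();
}

// add_expr and add_expr_tail folded into one function.
// The recursion of add_expr_tail becomes the loop below.
void add_expr(Tokens& ts) {
    mul_expr(ts);
    for (;;) {
        switch (ts.current()) {
        case '+':
            ts.advance();
            mul_expr(ts);
            break;
        case '-':
            ts.advance();
            mul_expr(ts);
            break;
        default:
            return;  // the Empty production: the expression ends here
        }
    }
}

// Returns true if the whole input forms one add_expr.
bool parses(const std::string& text) {
    Tokens ts{text, 0};
    try {
        add_expr(ts);
    } catch (const std::runtime_error&) {
        return false;
    }
    return ts.current() == '\0';
}
```

With the loop, a long chain of additions uses a constant amount of stack instead of one call frame per operator.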
The Coding Phases
Ordinarily, it’s a good idea to design a program before coding it. However, our program is small enough to allow us to jump right into the coding. It also simplifies the tutorial. We will first code an object called a Lexer. The job of the lexer is to turn the input character stream into expression tokens. Next, we will code the parser proper. Finally, we will add semantic processing to the parser. Once we do that, we will have a parser that evaluates the expressions it recognizes.
Parser Errors
Our parser will support simple error handling. We want to be able to report lexical errors, syntax errors, and runtime errors (errors in semantic processing).
The Lexer
The Token Type
The first step is to give each of our token symbols a representation in code. We will use an enum class to do this.
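A minimal sketch of such an enum class follows. Only Id and Number are names used elsewhere in this tutorial. Every other enumerator, and the token_name() helper, is an assumption made for illustration, and the function-name tokens are left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical token set. Only Id and Number are names taken from the text.
enum class Token {
    Id, Number,                       // identifiers and numeric literals
    Assign, Plus, Minus, Mul, Div, Mod, Pow,
    Lp, Rp,                           // left and right parenthesis
    End                               // end of input
};

// Returns a printable name for a token, handy when composing error messages.
std::string token_name(Token t) {
    switch (t) {
    case Token::Id:     return "Id";
    case Token::Number: return "Number";
    case Token::Assign: return "Assign";
    case Token::Plus:   return "Plus";
    case Token::Minus:  return "Minus";
    case Token::Mul:    return "Mul";
    case Token::Div:    return "Div";
    case Token::Mod:    return "Mod";
    case Token::Pow:    return "Pow";
    case Token::Lp:     return "Lp";
    case Token::Rp:     return "Rp";
    case Token::End:    return "End";
    }
    return "Unknown";
}
```

An enum class keeps the token names out of the enclosing scope and prevents accidental conversion to int, which is why the names are always written as Token::Plus and so on.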
The Lexer Class
Here is our Lexer class. The constructors simply initialize p_input and owns_input, and call init() to do a priming read on the stream.
get_token() is the workhorse of our lexer. It assembles the characters from the input stream into tokens.
Early on, it looks for an identifier or function name. Both have the same syntax. We handle this by looking for an identifier first. Then, we check what we have found against the function names.
Next, it assembles a number from characters. We want numbers such as 304, 46., .72, 3.14, and any of these patterns followed by an exponent part, e.g., 4.92e-3, 21.34e6, .5e+7.
The rest of our tokens are single character tokens. They are simple to handle. If we get through our single character tokens without returning, we’re left with an error.
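To make the number pattern concrete, here is a minimal sketch of a scanner for it, assuming the characters come from a std::istream. The names read_digits(), scan_number(), and scan() are hypothetical and are not members of the Lexer class described above.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Appends consecutive digits from `in` to `text`; returns how many were read.
static int read_digits(std::istream& in, std::string& text) {
    int count = 0;
    while (std::isdigit(in.peek())) {
        text += static_cast<char>(in.get());
        ++count;
    }
    return count;
}

// Reads a number such as 304, 46., .72, 3.14 or 4.92e-3 from `in`.
// Returns the matched text, or an empty string if no number starts here
// or the exponent part has no digits. Characters consumed before a
// failure is detected are not put back.
std::string scan_number(std::istream& in) {
    std::string text;
    int digits = read_digits(in, text);          // integer part (may be empty)
    if (in.peek() == '.') {
        text += static_cast<char>(in.get());
        digits += read_digits(in, text);         // fraction part (may be empty)
    }
    if (digits == 0) return "";                  // need a digit on one side of '.'
    if (in.peek() == 'e' || in.peek() == 'E') {
        text += static_cast<char>(in.get());
        if (in.peek() == '+' || in.peek() == '-')
            text += static_cast<char>(in.get());
        if (read_digits(in, text) == 0) return "";  // exponent needs digits
    }
    return text;
}

// Convenience wrapper for trying the scanner on a string.
std::string scan(const std::string& s) {
    std::istringstream in(s);
    return scan_number(in);
}
```

The scanner stops at the first character that cannot extend the number, so scanning "4.92e-3+1" yields "4.92e-3" and leaves "+1" in the stream for the next token.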
The Parser
We will code the parser as a class. There are other ways to do it, but I like having the parser initialize itself, and being able to pass the expression to the parser using the call operator.
Each nonterminal in the grammar is represented by a parser function. The exceptions are the _tail and _suffix nonterminals, for which we provide the optimization mentioned above. For now, it is convenient to declare the parser functions as returning void. Later, when we add our semantics and our functions return values, we’ll change that return type to double. The alternative is to have our functions return double from the beginning, and return a dummy value. But if we miss replacing that dummy value later on, we’ll have a buggy program.
Here is the Parser class. We could let the compiler provide a default constructor at this stage, but when we add our symbol table we will use the constructor to reserve a couple of identifiers as constants.
For now, the constructor is empty. The operator() function simply creates a Lexer, calls assign_expr() to start the parsing process, and then deletes the Lexer when it’s done.
Now we will code the parsing routines. Each function is named after a nonterminal. The function bodies conform closely to the right hand sides of all the productions for which the nonterminal is the left hand side. We choose which of the rules we are processing based on the current token in the stream. Whenever we encounter a nonterminal in a right hand side, we simply call the function with that name. When we encounter a terminal, we check whether it matches the current token in the stream. If it does, we advance the lexer. If it doesn’t, we throw an exception.
In our grammar, an assignment to any expression is valid. However, when that target is not an identifier, assignment doesn’t make sense. When we specified our grammar, we left that requirement to semantic checking. But it’s a check we can provide during translation. Semantics that can be checked during translation are called static semantics.
Here is add_expr(). Note that the two cases in its switch statement have common code. We don’t want to factor it out, because when we add semantics later on, the code segments will no longer be identical.
The mul_expr() function is very similar. The code for pow_expr() and unary_expr() is pretty straightforward. The primary() function is just a straightforward processing of the right hand side of each production. Instead of writing the code for function calls over and over, we’ll put it in a function called get_argument(). When we add semantic processing, get_argument() will return the value produced by add_expr(), i.e., the argument to the function.
We now write a driver for the parser. At this point, the program will only produce output if something is wrong. Don’t let that discourage you.
Semantics
We will now add the code that makes the parser do its computations. Ideally, the semantics should be specified along with the grammar. Because our parser is small, we can get away with describing them as we write them in code. Also, I think this approach is a good way to learn exactly what semantics are and what issues they present.
Helper Functions
The fundamental element in our expressions is the number. Our lexer already composes input characters into text representing a number. When our primary() function recognizes a number, it will need to convert the string in the token text into a double. We can provide a simple helper function to do that. It works by turning the given string into an input stream, and then reads the number from that stream. The opposite conversion, from number to string, is helpful when composing error messages.
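Here is a minimal sketch of the two conversions using string streams, as described. The names to_number() and to_text() are assumptions chosen for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Converts text such as "3.14" to a double by reading it from a string stream.
// Returns 0 if the text does not start with a number.
double to_number(const std::string& s) {
    std::istringstream ist(s);
    double x = 0.0;
    ist >> x;
    return x;
}

// Converts a double to text by writing it to a string stream.
// The stream's default formatting applies (six significant digits).
std::string to_text(double x) {
    std::ostringstream ost;
    ost << x;
    return ost.str();
}
```

Because the lexer has already checked the number's syntax, to_number() does not need to report errors of its own in this setting.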
The Symbol Table
We need to associate every identifier with a value. The mechanism used to do this is called a symbol table. In our expressions, the mention of a variable is sufficient to bring it into existence. We don’t need to store any information with it other than its value. Hence, a map is sufficient to use as a symbol table.
Parser()
We’re going to add two constants to our parser, pi and e. The constructor will put their values into the symbol table. It will be the job of assign_expr() to protect against assignments to these two constants.
primary()
When we recognize an Id, we look it up in the symbol table and return its associated value. If the Id hasn’t been used yet, it’s entered into the symbol table with a value of 0, and 0 is returned.
When we recognize a Number, we convert its string representation to a double and return that value.
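The lookup behavior described for an Id falls out of std::map almost for free. The following is a minimal sketch, assuming the symbol table is a std::map from name to double. The SymbolTable alias and the lookup() function are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed representation of the symbol table: identifier name -> value.
using SymbolTable = std::map<std::string, double>;

// Returns the value bound to `name`. std::map::operator[] inserts a
// value-initialized entry for a missing key, so an identifier that has not
// been used yet is entered into the table with the value 0.
double lookup(SymbolTable& table, const std::string& name) {
    return table[name];
}
```

Note that the table must be passed by non-const reference, because a lookup of an unseen name modifies it.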
When we recognize a subexpression, we return the value of the expression in the parentheses. We do this simply by returning what add_expr() returns.
When we process a function, we’re trying to achieve the effect of calling the function on the argument. We call our function get_argument() to get the argument value. When we need to, we check whether the argument is valid for the function. We then return the result of calling the function on the argument. Here is primary() with the semantics added. We also need to modify get_argument() so it returns the value of the argument.
unary_expr()
Adding semantics to this function is simple. We simply want to apply the sign, if there is one.
pow_expr()
pow_expr() uses the library function pow() to do its computation. We also want to make sure that we’re not trying to take the root of a negative number. We provide a function check_domain() that does this. I chose to put check_domain() in the parser class, but made it static because it doesn’t depend on any members of the class.
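The exact test inside check_domain() is not shown here, so the following is only a sketch under an assumption: a negative base combined with a non-integer exponent is treated as the invalid case. The signature, the exception type, and the checked_pow() wrapper are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Throws if x raised to y has no real result under our assumption:
// a negative base with a non-integer exponent (for example, a square root
// of a negative number written as x ^ 0.5).
void check_domain(double x, double y) {
    if (x < 0.0 && std::floor(y) != y)
        throw std::domain_error("runtime error: root of a negative number");
}

// Computes x raised to y after validating the operands.
double checked_pow(double x, double y) {
    check_domain(x, y);
    return std::pow(x, y);
}
```

Without the check, std::pow() would quietly return NaN for such operands, and the error would surface much later, if at all.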
mul_expr()
Providing semantics for multiplication and division is pretty straightforward. In the case of the division and modulo operations, we want to catch division by zero.
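As a minimal sketch of that check, the function below applies one multiplicative operator and rejects a zero divisor. The name apply_mul_op(), the use of a char to select the operator, the exception types, and the choice of std::fmod() for modulo on doubles are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Applies one multiplicative operator ('*', '/' or '%') to two operands.
// Division and modulo by zero are reported as runtime errors.
double apply_mul_op(char op, double lhs, double rhs) {
    switch (op) {
    case '*':
        return lhs * rhs;
    case '/':
        if (rhs == 0.0)
            throw std::runtime_error("runtime error: division by zero");
        return lhs / rhs;
    case '%':
        if (rhs == 0.0)
            throw std::runtime_error("runtime error: division by zero");
        return std::fmod(lhs, rhs);  // floating-point remainder
    default:
        throw std::invalid_argument("not a multiplicative operator");
    }
}
```

In the parser itself, the same checks would sit inside the corresponding cases of mul_expr(), right after the right operand has been evaluated.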
add_expr()
assign_expr()
In assign_expr(), we want to guard against assigning to our constants, and we want to update any value assigned to in our symbol table. We count assignment to a constant as a syntax error because it’s something that can be checked for during translation, and we’re grouping our syntax errors and static semantic errors together.
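The guard and the update can be sketched as follows, assuming the symbol table is a std::map from name to double. The SymbolTable alias, is_constant(), assign(), and the exception type are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Assumed representation of the symbol table: identifier name -> value.
using SymbolTable = std::map<std::string, double>;

// The two names the parser reserves as constants.
bool is_constant(const std::string& name) {
    return name == "pi" || name == "e";
}

// Stores `value` under `name` and returns it, refusing to overwrite a constant.
double assign(SymbolTable& table, const std::string& name, double value) {
    if (is_constant(name))
        throw std::runtime_error("syntax error: attempt to modify the constant " + name);
    table[name] = value;
    return value;
}
```

Returning the assigned value lets an assignment be used as the value of the whole expression, in the same way the other parser functions return their results.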
The Complete Parser Code
Below is the parser code in its entirety.
Copyright © 2017 by Chris French («Unclechromedome»)
#1 2011-06-05 17:36:47
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
lexical error: invalid char in json text.
A debate on where to post this, made it win out here. If it’s in the wrong area, please let me know.
Ever since the pacman update to 3.5 (wherein powerpill, etc. were removed), I’ve been getting an error when attempting to run yaourt:
$ yaourt lib32-glibc
lexical error: invalid char in json text.
<html> <head> <title>Tor is not
(right here) ------^
yaourt does nothing further from there. Yes, Tor is installed, but it does this even when Tor/Vidalia are not running.
Any clue as to what this might be? I’ve googled and searched here, but nothing matches this exact situation, and those that are similar don’t appear to have resolutions.
This is on both Gnome3 (fallback) and KDE, completely updated Arch system (other than what could be found in yaourt).
There have been other errors here and there too, but this one is the one that comes up the most. pacman itself doesn’t do this, IIRC.
Thanks.
Edit: I do know that lib32-glibc doesn’t exist any more. This is merely an example. It does it with any <packagename>.
Last edited by dedanna1029 (2011-06-05 17:40:17)
#2 2011-06-05 17:47:26
- thisoldman
- Member
- From: Pittsburgh
- Registered: 2009-04-25
- Posts: 1,172
Re: lexical error: invalid char in json text.
Have you tried reinstalling yaourt?
#3 2011-06-05 18:53:41
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
Think I’m going to go do that, and on the advice of someone at our forum, check out and see if there’s a line in pacman.conf that might be doing this.
I had seen a tip to rebuild package-query, but that would mean re-installing the whole shootin’ sh’bang of pacman, etc., which right now, well, let’s just say I’m lazy.
#4 2011-06-05 19:25:01
- falconindy
- Developer
- From: New York, USA
- Registered: 2009-10-22
- Posts: 4,111
- Website
Re: lexical error: invalid char in json text.
That error is being thrown by yajl, which package-query links to. Rebuilding package-query is likely your solution if it happens for all packages.
#5 2011-06-05 21:19:17
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
Okay, this is putting me into major error, major «sux» city. Tried reinstalling yaourt using the top manual method here, and that gave an error:
$ yaourt lib32-glibc
/usr/lib/yaourt/basicfunctions.sh: line 12: package-query: command not found
… when it had just been rebuilt and reinstalled using that method (I had to rename the yaourt and package-query folders in /home first to get a fresh start with it).
So, I renamed the folders in /home again, and used the second method of installing, by re-adding the repo and pacman -Sy. All righty then…
$ yaourt glibc
1 core/glibc 2.13-5 [7.15 M] (base) [installed]
GNU C Library
2 extra/kdesdk-kmtrace 4.6.3-2 [0.06 M] (kde kdesdk) [installed]
A KDE tool to assist with malloc debugging using glibc´s "mtrace"
functionality
3 extra/nss-mdns 0.10-3 [0.01 M]
glibc plugin providing host name resolution via mDNS
---------->lexical error: invalid char in json text.
<html> <head> <title>Tor is not
(right here) ------^<-----------
==> Enter n° of packages to be installed (ex: 1 2 3 or 1-3)
==> -------------------------------------------------------
*frustrated
Would someone please post their output of «yaourt glibc» for comparison?
Thanks.
Last edited by dedanna1029 (2011-06-05 21:22:40)
#6 2011-06-05 22:17:04
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
Thanks to someone who obviously knows how to search this forum better than I can, I’m going to try downgrading yajl: https://bbs.archlinux.org/viewtopic.php?id=117606
Except for one slight problem there…
# pacman -U yajl-1.0.11-3-i686.pkg.tar.xz
warning: downgrading package yajl (2.0.2-1 => 1.0.11-3)
resolving dependencies...
looking for inter-conflicts...
error: failed to prepare transaction (could not satisfy dependencies)
:: package-query: requires yajl>=2.0
Last edited by dedanna1029 (2011-06-05 22:25:32)
#7 2011-07-02 21:45:05
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
This error, even after json updates (and IIRC yajl’s been updated too since this started), etc., is still preventing me from being able to use yaourt. It’s a big vicious circle. The alternatives to use are all in yaourt (such as clyde, etc.), and I can’t use yaourt to get them. There are other packages I need from AUR as well; nothing is working. If I stop tor, then I get a curl error.
yaourt nautilus-dropbox
curl error: Couldn't connect to server
I think this must have to do with tor.
I’ve rebuilt package-query, the works, everything suggested. It r still broke here.
Thanks.
Last edited by dedanna1029 (2011-07-02 21:48:46)
#8 2011-07-02 22:22:23
- falconindy
- Developer
- From: New York, USA
- Registered: 2009-10-22
- Posts: 4,111
- Website
Re: lexical error: invalid char in json text.
You get a curl error when you stop tor because you’re still trying to hit the proxy defined by HTTP_PROXY or whatnot.
Last edited by falconindy (2011-07-02 22:22:45)
#9 2011-07-03 02:13:28
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
falconindy wrote:
You get a curl error when you stop tor because you’re still trying to hit the proxy defined by HTTP_PROXY or whatnot.
That’s kind of what I’m figuring, but when tor’s running, I get the error posted initially:
lexical error: invalid char in json text.
<html> <head> <title>Tor is not
(right here) ------^
yaourt does work with sudo correctly, but we’re not supposed to let packages build as root, so…
#10 2011-07-03 17:17:30
- falconindy
- Developer
- From: New York, USA
- Registered: 2009-10-22
- Posts: 4,111
- Website
Re: lexical error: invalid char in json text.
It works with sudo because root doesn’t have the http_proxy environment var set…. package-query’s curl implementation is lacking something to work properly with proxies.
#11 2011-07-03 21:09:29
- dedanna1029
- Member
- From: Cheyenne, WY, US
- Registered: 2010-10-01
- Posts: 98
Re: lexical error: invalid char in json text.
Right, I get that.
What I need is a solution to run it properly as regular user. When tor is running, I get one error. When it’s not, I get the other. The only way it does work is with sudo, and that we aren’t supposed to do for yaourt. I tried it; and got the warning against building packages as root.
Last edited by dedanna1029 (2011-07-03 21:10:26)

